BeautifulSoup Grab Visible Webpage Text

StackOverflow

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question that returns lots of <script> tags and html comments which I don"t want. I can"t figure out the arguments I need for the function findAll() in order to just get the visible texts on a webpage.

So, how should I find all visible text excluding scripts, comments, css etc.?

Answer rating: 263

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ["style", "script", "head", "title", "meta", "[document]"]:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, "html.parser")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen("http://www.nytimes.com/2009/12/21/us/21storm.html").read()
print(text_from_html(html))




Get Solution for free from DataCamp guru