I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article) and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments to pass to findAll() to get just the visible text on a webpage.
So, how should I find all visible text, excluding scripts, comments, CSS, etc.?
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request


    def tag_visible(element):
        # Skip text nodes whose parent tag is never rendered as page content.
        if element.parent.name in ["style", "script", "head", "title", "meta", "[document]"]:
            return False
        # Skip HTML comments.
        if isinstance(element, Comment):
            return False
        return True


    def text_from_html(body):
        soup = BeautifulSoup(body, "html.parser")
        # find_all(string=True) returns every text node in the document;
        # it is the current spelling of the older findAll(text=True).
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        return " ".join(t.strip() for t in visible_texts)


    html = urllib.request.urlopen("http://www.nytimes.com/2009/12/21/us/21storm.html").read()
    print(text_from_html(html))
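To check the filter without fetching a live page, it can be run against a small inline document. The following is a minimal sketch, not part of the original answer: `SAMPLE_HTML`, `HIDDEN_PARENTS`, and `visible_text` are hypothetical names made up for this example, and the sketch also drops whitespace-only nodes, which the code above does not do.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text nodes are not rendered as page content
# (same list as in the answer above).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

# Hypothetical sample page, invented for this check.
SAMPLE_HTML = """
<html>
  <head>
    <title>Storm report</title>
    <style>p { color: red; }</style>
  </head>
  <body>
    <!-- editor's note: hidden -->
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <script>var tracking = 1;</script>
  </body>
</html>
"""


def tag_visible(element):
    # Reject text inside non-rendered tags, then reject HTML comments.
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    texts = soup.find_all(string=True)
    # Strip each surviving node and drop those that were only whitespace
    # (an addition in this sketch, to avoid runs of spaces in the result).
    pieces = (t.strip() for t in texts if tag_visible(t))
    return " ".join(p for p in pieces if p)


result = visible_text(SAMPLE_HTML)
print(result)  # Headline First paragraph.
```

The title, the stylesheet, the comment, and the script body are all filtered out, leaving only the heading and paragraph text. Note that this approach only looks at tag names; text hidden through CSS (for example `display: none`) would still be returned.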