There are basically two ways to retrieve data from a website:
Use the website's API (if one exists). For example, Facebook has the Facebook Graph API, which allows you to retrieve data hosted on Facebook.
Access the HTML of the web page and extract useful information/data from it. This technique is called web scraping.
This article discusses the steps involved in web scraping in Python using Beautiful Soup.
Steps involved in web scraping:
Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the web page. For this task, we will use requests, a third-party HTTP library for Python.
After we have access to the HTML content, we are faced with the task of parsing the data. Since most HTML data is nested, we cannot extract it with simple string processing; we need a parser that can create a nested, tree-like structure from the HTML data. There are many HTML parser libraries available; here we will use html5lib, which parses pages the same way a web browser does.
Now all we have to do is navigate and search the parse tree we created, i.e. tree traversal. For this, we will use another third-party Python library, Beautiful Soup. It is a Python library for extracting data from HTML and XML files.
Step 1: Install the required third-party libraries
The easiest way to install external libraries in Python is to use pip. pip is a package management system used to install and manage software packages written in Python. All you have to do is run:
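For example, assuming pip is available on your system, the three libraries used in this article can be installed from the command line:

pip install requests
pip install html5lib
pip install bs4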
The great thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries such as html5lib, lxml and html.parser, so a BeautifulSoup object can be created and a parser library specified at the same time.
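As a minimal sketch (the URL below is only a placeholder; substitute the page you actually want to scrape), fetching a page and handing its content to BeautifulSoup might look like this:

import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/quotes"        # placeholder URL
r = requests.get(URL)                         # send the HTTP request and keep the response

soup = BeautifulSoup(r.content, 'html5lib')   # parse the raw HTML with the html5lib parser
print(soup.prettify())                        # visual representation of the parse tree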
In the above example, note this line:
soup = BeautifulSoup(r.content, 'html5lib')
We create a BeautifulSoup object by passing two arguments:
r.content: this is the raw HTML content.
html5lib: this specifies the HTML parser we want to use.
Now, when soup.prettify() is printed, it gives a visual representation of the parse tree generated from the raw HTML content.
Step 4: Search and Navigate the Parse Tree
Now we would like to extract some useful data from the HTML content. The soup object contains all the data in a nested structure that can be extracted programmatically. In our example, we are scraping a web page that contains multiple quotes. So we would like to create a program that saves these quotes (and all the relevant information about them).
# Python program to scrape a website
# and save quotes from the site
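# NOTE: this listing is only a sketch. The placeholder URL, the 'all_quotes'
# container id, and the img / lines / author extraction are assumptions about
# the page's markup (see the sample row further below); the theme and url
# lookups follow the snippets explained in the text.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/quotes"   # placeholder URL, replace with the real page

r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')   # parse the raw HTML

quotes = []   # a list to store all the quote dictionaries

# assumed container div holding all the quotes
table = soup.find('div', attrs={'id': 'all_quotes'})

# each quote sits in a div whose class is 'quote'
for row in table.findAll('div', attrs={'class': 'quote'}):
    quote = {}
    quote['theme'] = row.h5.text      # text inside the <h5> tag
    quote['url'] = row.a['href']      # value of the href attribute of the <a> tag
    # assumption: the image's src holds the picture and its alt holds the
    # quote text followed by ' #author'
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(' #')[0]
    quote['author'] = row.img['alt'].split(' #')[1]
    quotes.append(quote)

# save everything to a CSV file for future reference
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)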
The first argument is the HTML tag you want to search for, and the second argument is a dictionary-type argument for specifying additional attributes associated with that tag. The find() method returns the first matching element. You can try printing table.prettify() to get a sense of what this piece of code does.
Now, inside the table element, you can see that each quote is inside a div container whose class is quote. So we loop through each div whose class is quote. Here we use the findAll() method, which takes the same arguments as find() but returns a list of all matching elements. Each quote is then iterated over using a variable named row. Here is one example of such an HTML row for better understanding:
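<!-- Hypothetical markup, consistent with the sketch above; the attributes on the real page will differ -->
<div class="quote">
    <img class="margin-10px-bottom shadow" src="/photos/1234/be-the-change.jpg"
         alt="Be the change that you wish to see in the world. #Mahatma Gandhi">
    <h5 class="value_on_red">Change</h5>
    <a href="/inspirational-quotes/value/change/1234">Read more</a>
</div>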
We create a dictionary to store all the information about a quote. The nested structure can be accessed using dot notation. To access the text inside an HTML element, we use .text:
quote['theme'] = row.h5.text
We can add, remove, modify and access tag attributes. This is done by treating the tag like a dictionary:
quote['url'] = row.a['href']
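As a small, self-contained illustration of this dictionary-style access (the tag and attribute names here are arbitrary, not taken from the quotes page):

from bs4 import BeautifulSoup

tag = BeautifulSoup('<a id="link1" href="/quotes/1">A quote</a>', 'html5lib').a

print(tag['href'])          # access an attribute
tag['href'] = '/quotes/2'   # modify it
tag['data-seen'] = 'yes'    # add a new attribute
del tag['data-seen']        # remove it again
print(tag.attrs)            # all remaining attributes as a dict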
Finally, all quotes are added to a list called quotes.
As a last step, we would like to save all our data to a CSV file.
filename = 'inspirational_quotes.csv'
# open in text mode with newline='' for the csv module in Python 3
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
Here we create a CSV file named inspirational_quotes.csv and save all the quotes in it for future reference.
So, this was a simple example of how to create a web scraper in Python. From here, you can try scraping any other website of your choice. If you have any questions, please post them in the comments section below.
Note: Web scraping is considered illegal in many cases. It can also cause your IP to be blocked by the website.
This blog is contributed by Nikhil Kumar. If you like Python.Engineering and would like to contribute, you can also write an article using contribute.python.engineering or mail your article to contribute@python.engineering. See your article appearing on the Python.Engineering homepage and help other geeks.
Please post comments if you find anything wrong or if you would like to share more information on the topic discussed above.