Implementing Web Scraping in Python with BeautifulSoup



There are basically two ways to retrieve data from a website:

  • Use the site's API (if one exists). For example, Facebook provides the Graph API, which allows you to retrieve data hosted on Facebook.
  • Access the HTML of the web page and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.
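
For contrast, here is a minimal sketch of the first approach, assuming a hypothetical JSON endpoint (api.example.com is a placeholder, not a real service). When an API exists, the server returns structured data directly and no HTML parsing is needed:

import requests

# Hypothetical JSON API endpoint; replace with a real one.
API_URL = 'https://api.example.com/v1/posts'

response = requests.get(API_URL)
data = response.json()  # already-structured data, no HTML parsing required
print(data)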

This article discusses the steps involved in web scraping, using Python's Beautiful Soup library.

Steps involved in web scraping:

  1. Send an HTTP request to the URL of the web page you want to access. The server responds to the request by returning the HTML content of the web page. For this task, we will use a third-party HTTP library for Python called requests.
  2. Once we have access to the HTML content, we are faced with the task of parsing the data. Since most of the HTML data is nested, we cannot extract it simply through string processing. We need a parser that can create a nested/tree structure of the HTML data. 
    There are many HTML parser libraries available; a popular and very lenient one is html5lib, which parses pages the same way a web browser does.
  3. Now all we have to do is navigate and search the parse tree we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup. It is a Python library for extracting data from HTML and XML files. (An end-to-end sketch of these three steps follows this list.)
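
Put together, the three steps form a short pipeline. Here is a minimal end-to-end sketch (printing the page title assumes the page has a title tag, which almost every page does):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.python.engineering/data-structures/')  # Step 1: fetch the HTML
soup = BeautifulSoup(r.content, 'html5lib')  # Step 2: parse it into a tree
print(soup.title.text)  # Step 3: search/traverse the tree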

Step 1: Install the required third-party libraries

  • The easiest way to install external libraries in Python is to use pip. pip is a package management system used to install and manage software packages written in Python. 
    All you have to do is run:
 pip install requests
 pip install html5lib
 pip install bs4 
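
A quick way to confirm the installation worked is to import each package (the version attributes below exist in current releases of all three):

import requests
import html5lib
import bs4

print(requests.__version__, html5lib.__version__, bs4.__version__)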

Step 2: Access the HTML content of the web page

import requests

URL = 'https://www.python.engineering/data-structures/'
r = requests.get(URL)

print(r.content)

Let's try to understand this piece of code.

  • First of all, import the requests library.
  • Then provide the URL of the web page you want to scrape.
  • Send an HTTP request to the specified URL and save the response from the server in a response object called r. 
  • Finally, print r.content to get the raw HTML content of the web page. It is of type bytes; r.text gives the same content decoded into a string.
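
Before parsing, it is also worth checking that the request actually succeeded; a small sketch:

import requests

URL = 'https://www.python.engineering/data-structures/'
r = requests.get(URL)

# raise_for_status() raises an exception on 4xx/5xx responses
# instead of letting us parse an error page by mistake.
r.raise_for_status()

print(r.status_code)    # e.g. 200
print(type(r.content))  # bytes: the raw payload
print(type(r.text))     # str: the payload decoded using the detected encoding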

Step 3: Parse the HTML content

The great thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries like html5lib, lxml, html.parser, etc. So a BeautifulSoup object can be created and a parser library specified at the same time.

In our example:

 soup = BeautifulSoup(r.content, 'html5lib') 

We create a BeautifulSoup object by passing two arguments:

  • r.content : the raw HTML content.
  • html5lib : the name of the HTML parser we want to use.

Now, if soup.prettify() is printed, it gives a visual representation of the parse tree generated from the raw HTML content.
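
To see how the choice of parser matters, here is a small sketch that feeds a deliberately malformed snippet to two different parsers (html.parser ships with the standard library, so it needs no installation):

from bs4 import BeautifulSoup

html = '<ul><li>one<li>two</ul>'  # sloppy HTML with unclosed tags

# html5lib repairs markup the way a web browser would;
# html.parser is stricter and repairs less aggressively.
print(BeautifulSoup(html, 'html5lib').prettify())
print(BeautifulSoup(html, 'html.parser').prettify())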

Step 4: Search and Navigate the Parse Tree

Now we would like to extract some useful data from the HTML content. The soup object contains all the data in a nested structure that can be retrieved programmatically. In our example, we are scraping a web page that contains multiple quotes. So, we would like to create a program to save these quotes (and all the relevant information about them).

# This won't work in the online IDE
import requests
from bs4 import BeautifulSoup

URL = 'http://www.values.com/inspirational-quotes'
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

# Python program to scrape a website
# and save the quotes from it
import requests
from bs4 import BeautifulSoup
import csv

URL = 'http://www.values.com/inspirational-quotes'
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes = []  # a list to store the quotes

table = soup.find('div', attrs={'id': 'container'})

for row in table.findAll('div', attrs={'class': 'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
# open in text mode with newline='', as the csv module requires in Python 3
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

Before moving on, we recommend that you look through the HTML content of the web page, which we printed using the soup.prettify() method, and try to find a pattern or a way to navigate to the quotes.

  • All the quotes are inside a div container whose id is container. So we find that div element (called table in the code above) using the find() method:
     table = soup.find('div', attrs={'id': 'container'}) 

    The first argument is the HTML tag you want to search for, and the second argument is a dictionary-type element specifying the additional attributes associated with that tag. The find() method returns the first matching element (or None if nothing matches; a more defensive version of these lookups is sketched after this list). You can try printing table.prettify() to get a feel for what this piece of code does.

  • Now, within the table element, notice that each quote is inside a div container whose class is quote. So we loop through each div whose class is quote. 
    Here we use the findAll() method, which is similar to find() in terms of arguments but returns a list of all matching elements. Each quote is then iterated over using a variable named row. 
    Judging by the fields the loop below reads, each quote div contains an h5 tag (the theme), an a tag whose href links to the quote (the URL), an img tag (the image), an h6 tag (the quote's lines), and a p tag (the author).

    Now let's look at this piece of code:

     for row in table.findAll('div', attrs={'class': 'quote'}):
         quote = {}
         quote['theme'] = row.h5.text
         quote['url'] = row.a['href']
         quote['img'] = row.img['src']
         quote['lines'] = row.h6.text
         quote['author'] = row.p.text
         quotes.append(quote) 

    We create a dictionary to store all the information about a quote. The nested structure can be accessed using dot notation. To access the text inside an HTML element, we use .text:

     quote['theme'] = row.h5.text 

    We can also add, remove, modify, and access a tag's attributes. This is done by treating the tag like a dictionary:

     quote['url'] = row.a['href'] 

    Finally, all quotes are added to a list called quotes.

  • Finally, we would like to save all our data in a CSV file.
     filename = 'inspirational_quotes.csv'
     with open(filename, 'w', newline='') as f:
         w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
         w.writeheader()
         for quote in quotes:
             w.writerow(quote) 

    Here we create a CSV file named inspirational_quotes.csv and save all the quotes in it for future reference.
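
The loop above assumes every quote div has exactly the expected structure. As a sketch (not part of the original program), here is a more defensive variant: find() returns None when nothing matches, and tag.get() returns None instead of raising a KeyError for a missing attribute, so malformed rows are skipped rather than crashing the scraper:

table = soup.find('div', attrs={'id': 'container'})
if table is not None:
    for row in table.findAll('div', attrs={'class': 'quote'}):
        link = row.find('a')
        img = row.find('img')
        if link is None or img is None:
            continue  # skip rows that lack the expected tags
        quotes.append({
            'theme': row.h5.text if row.h5 is not None else '',
            'url': link.get('href'),
            'img': img.get('src'),
            'lines': row.h6.text if row.h6 is not None else '',
            'author': row.p.text if row.p is not None else '',
        })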

So, this was a simple example of how to create a web scraper in Python. From here, you can try to scrape any other website of your choice. If you have any questions, please post them below in the comments section.

Note: Web scraping is considered illegal in many cases. It can also cause your IP to be blocked by the site.
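
If you do scrape a site, a few habits reduce the chance of being blocked. Here is a sketch (page_urls is a hypothetical list of pages and the User-Agent string is a placeholder; always check the site's terms of service and robots.txt first):

import time
import requests

# Placeholder User-Agent identifying the scraper and a contact address.
headers = {'User-Agent': 'my-quotes-scraper/1.0 (contact@example.com)'}

page_urls = ['http://www.values.com/inspirational-quotes']  # hypothetical list of pages

for url in page_urls:
    r = requests.get(url, headers=headers)
    time.sleep(2)  # pause between requests to limit load on the server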

This blog is contributed by Nikhil Kumar. If you like Python.Engineering and would like to contribute, you can also write an article using contribute.python.engineering or mail your article to contribute@python.engineering. See your article appearing on the Python.Engineering home page and help other geeks.

Please post comments if you find anything incorrect, or if you would like to share more information about the topic discussed above.