Scraping a website with Python: A Beginner’s Guide


In this article we will learn how to build a Python HTML scraper that reads a site's page code directly, without official access, and extracts the data we need.

Difference from API calls

An alternative way to retrieve site data is an API call. An API is an official channel, provided by the site owner, for fetching data directly from the database or shared files. It usually requires the owner's permission and a special token. However, an API is not always available, which is why scraping is so attractive, even though its legality raises questions.
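
For comparison, a typical API call made with the requests library might look like the sketch below. The endpoint, parameters, and token here are made up for illustration; a real site's API documentation defines them.

import requests

# hypothetical endpoint and token, for illustration only
api_url = 'https://api.website.com/v1/listings'
headers = {'Authorization': 'Bearer YOUR_API_TOKEN'}
params = {'q': 'london', 'page_size': 25}

response = requests.get(api_url, headers=headers, params=params)
data = response.json()  # structured data (JSON) straight from the owner, no HTML parsing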

Legal considerations

Scraping can violate copyright or a site's terms of use, especially when it is done for profit, competitive advantage, or harm (for example, through overly frequent requests). However, scraping publicly available data for personal, academic, or other harmless non-commercial purposes is generally tolerated.

If the data is paid for, requires registration, has explicit protection against scraping, contains sensitive data, or contains users' personal information, then any type of scraping should be avoided.

Installing Beautiful Soup in Python

Beautiful Soup is a Python library for scraping website data via HTML code.

Install the latest version of the library.

$ pip install beautifulsoup4

To make requests, install requests (a library for sending HTTP requests):

$ pip install requests
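
The examples below pass 'lxml' to Beautiful Soup as the parser, so install the lxml package too (the built-in 'html.parser' also works if you prefer to avoid the extra dependency):

$ pip install lxml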

Import the libraries in a Python file or Jupyter notebook:

from bs4 import BeautifulSoup
import requests

And a few other libraries you'll need for scraping in Python:

import re
from re import sub
from decimal import Decimal
import io
from datetime import datetime
import pandas as pd

Introduction

Imagine we want to scrape a platform that hosts publicly available real estate listings. For each property we want its price, its address, the distance to the nearest station, the station name, and the type of transportation, so that we can see how property prices are distributed relative to the availability of public transportation in a particular city.

Suppose that the query leads to a results page that looks like this:

(Screenshot: search results page for scraping in Python)

Once we know which elements of the site store the necessary data, we need to come up with scraping logic that will allow us to get all the information we need from each ad.

We have the following questions to answer:

  • How do we get one data point for one property (for example, the price tag of the first ad)?
  • How do we get the same data point for every property on a page (for example, all price tags from one page)?
  • How do we get the same data point for every property from all result pages (for example, all price tags from all result pages)?
  • How do we eliminate inconsistency when a data point can have different types (for example, some ads show "Check the price" instead of a number, leaving a column that mixes numeric and string values and cannot be analyzed)?
  • What is the best way to extract nested information (for example, each ad contains public transportation details such as "0.5 miles to XY subway station")?

The logic for retrieving a single data point

You can find all the code examples for scraping in Python in the Jupyter Notebook file on the author's GitHub.

Querying the site code

First, we take the search URL we used in the browser and query it from the Python script:

# search
url = 'https://www.website.com/london/page_size=25&q=london&pn=1'

# retrieving html
html_text = requests.get(url).text

# using lxml parser
soup = BeautifulSoup(html_text, 'lxml')

The soup variable contains the full HTML code of the search results page.

Find property tags

For this we will need a browser. Some popular browsers offer a convenient way to inspect a particular element directly. In Google Chrome, you can right-click any element on the page and select "Inspect". The page source opens on the right with the selected element highlighted.

HTML classes and the id attribute

HTML classes and ids are mainly used to reference elements from a CSS style sheet so that the data is displayed consistently.
In the example above, the class used to retrieve the price from one listing is also used to retrieve prices from other listings (which matches the primary purpose of a class).

Note that an HTML class can also reference price tags outside of the ad section (e.g., special offers that are not related to the search query but still appear on the results page). However, for the purposes of this article, we focus only on prices in real estate listings.

That's why we focus on the listing first and only look for the HTML class in the source code for a particular listing:

# using lxml parser
soup = BeautifulSoup(html_text, 'lxml')

# finding the single ad
ad = soup.find('div', class_ = 'css-ad-wrapper-123456')

# finding the price
price = ad.find('p', class_ = 'css-aaabbbccc').text

Using .text at the end of the find() call returns only the plain text, as shown in the browser. Without .text it returns the full HTML source of the element referenced by the class.

Important note: we always need to specify the HTML tag as well as the class; in this case it is p.
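
To make the difference visible, here is a small self-contained sketch with a made-up HTML fragment (the class names are invented, in the spirit of the placeholders used throughout this article):

from bs4 import BeautifulSoup

# a made-up fragment that mimics one listing
html = '<div class="css-ad-wrapper-123456"><p class="css-aaabbbccc">£450,000</p></div>'
soup = BeautifulSoup(html, 'lxml')

tag = soup.find('p', class_ = 'css-aaabbbccc')
print(tag)       # <p class="css-aaabbbccc">£450,000</p> - the full source of the element
print(tag.text)  # £450,000 - plain text only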

The logic to get all data points from one page

To get the price for every listing, we apply the find_all() method instead of find():

ads = soup.find_all('div', class_ = 'css-ad-wrapper-123456')

The ads variable now contains the HTML code of each ad on the first results page as a list of elements. This format is convenient because it lets you access the source code of a specific ad by index.

To get all the price tags, we use a dictionary to collect the data:

map = {}
id = 0
# retrieving all elements:
ads = soup.find_all('div', class_ = 'css-ad-wrapper-123456')

for i in range(len(ads)):

    ad = ads[i]
    id += 1
    map[id] = {}

    # finding the price
    price = ad.find('p', class_ = 'css-aaabbbccc').text
    # finding the address
    address = ad.find('p', class_ = 'css-address-123456').text

    map[id]["address"] = address
    map[id]["price"] = price

Important note: using an id as the dictionary key lets us look up any individual ad later.

Getting data points from all pages

Typically, search results are either paginated or scroll down infinitely.

Option 1: Website with pagination

The URLs retrieved from a search query usually contain information about the current page number.

(Screenshot: links to the search results pages)

As you can see in the figure above, the ending of the URL refers to the page number of the results.

Important note: The page number in the URL usually becomes visible from the second page. Using a basic URL with an extra &pn=1 fragment to call the first page will still work (in most cases).

Wrapping the ad loop in an outer for-loop lets us go through all the result pages:

url = 'https://www.website.com/london/page_size=25&q=london&pn='

map = {}

id = 0

# max number of pages
max_pages = 15

for p in range(max_pages):

    cur_url = url + str(p + 1)

    print("Scraping the page #: %d" % (p + 1))

    html_text = requests.get(cur_url).text
    soup = BeautifulSoup(html_text, 'lxml')

    ads = soup.find_all('div', class_ = 'css-ad-wrapper-123456')

    for i in range(len(ads)):

        ad = ads[i]
        id += 1
        map[id] = {}

        price = ad.find('p', class_ = 'css-aaabbbccc').text
        address = ad.find('p', class_ = 'css-address-123456').text
        map[id]["address"] = address
        map[id]["price"] = price

Determining the last page of results

You may be wondering how to determine the last page of results. In most cases, once the last page is reached, any query with a page number larger than the real one simply takes us back to the first page. Consequently, just setting a very large page count and waiting for the script to finish does not work: after a while it starts collecting duplicate values.

To solve this problem, we check whether the page still contains a button linking to the next page:

url = 'https://www.website.com/london/page_size=25&q=london&pn='
map = {}
id = 0
# using a very large number
max_pages = 9999
for p in range(max_pages):

    cur_url = url + str(p + 1)
    print("Scraping the page #: %d" % (p + 1))
    html_text = requests.get(cur_url).text
    soup = BeautifulSoup(html_text, 'lxml')
    ads = soup.find_all('div', class_ = 'css-ad-wrapper-123456')

    # searching for the link
    page_nav = soup.find_all('a', class_ = 'css-button-123456')

    if(len(page_nav) == 0):
        print("Max page number: %d" % (p))
        break
    (...)

Option 2: A site with infinite scrolling

In this case the HTML scraper won't work. We will discuss alternative methods at the end of this article.

Eliminating inconsistent data

If we want to filter out unusable data right at the scraping stage, we can do it with a small helper function.

A function to detect anomalies

def is_skipped(price):
    '''
    Identify what "prices" are not actually prices
       (for example "Check the price")
    '''
    for i in range(len(price)):
        if(price[i] != '£' and price[i] != ','
           and (not price[i].isdigit())):
              return True
    return False

And apply it when collecting data:

(...)
for i in range(len(ads)):

        ad = ads[i]
        id += 1
        map[id] = {}

        price = ad.find('p', class_ = 'css-aaabbbccc').text
        # skipping the ad without a price
        if(is_skipped(price)): continue
        map[id]["price"] = price

On-the-fly data formatting

You may have noticed that the price is stored as a string, complete with commas and the currency symbol. We can fix this at the scraping stage as well:

def to_num(price):
    value = Decimal(sub(r'[^\d.]', '', price))
    return float(value)
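
For example, assuming the prices look like the placeholder values used in this article:

print(to_num('£1,250,000'))  # 1250000.0
print(to_num('£825.50'))     # 825.5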

We use this function:

(...)
for i in range(len(ads)):

        ad = ads[i]
        id += 1
        map[id] = {}

        price = ad.find('p', class_ = 'css-aaabbbccc').text
        if(is_skipped(price)): continue
        map[id]["price"] = to_num(price)
        (...)

Get nested data

Public transportation information has a nested structure. We need data about distance, station name, and type of transport.

Selecting the information according to the rules

Each entry has the form "<number of miles> miles <station name>", so we use the word "miles" as the delimiter.

map[id]["distance"] = []
map[id]["station"] = []
transport = ad.find_all('div', class_ = 'css-transport-123')
for i in range(len(transport)):
       s = transport[i].text
       x = s.split(' miles ')
       map[id]["distance"].append(float(x[0]))
       map[id]["station"].append(x[1])

Initially, the transport variable holds two elements, because the ad has two lines of public transportation information (for example, "0.3 miles Sloane Square" and "0.5 miles South Kensington"). We loop over them by index and split each line into two values: the distance and the station name.

Find additional HTML attributes for visual information

In the page source we can find the testid attribute, which indicates the type of public transport. It is not displayed in the browser, but it determines which icon is shown on the page. To reach it, we use the css-StyledIcon class:

map[id]["distance"] = []
map[id]["station"] = []
map[id]["transport_type"] = []
transport = ad.find_all('div', class_ = 'css-transport-123')
type = ad.find_all('span', class_ = 'css-StyledIcon')
for i in range(len(transport)):
       s = transport[i].text
       x = s.split(' miles ')
       map[id]["distance"].append(float(x[0]))
       map[id]["station"].append(x[1])
       map[id]["transport_type"].append(type[i]['testid'])

Convert to dataframe and export to CSV

When the scraping is done, all the extracted data is available in the dictionary of dictionaries.

Let's look at just one ad first, to better demonstrate the final steps of the transformation.

(Screenshot: data for a single ad)

Convert the dictionary to a list of lists to get rid of nesting

result = []
cur_row = 0
for idx in range(len(map[1]["distance"])):
    result.append([])

    result[cur_row].append(str(1))  # the ad id: the key of this ad in the dictionary
    result[cur_row].append(str(map[1]["price"]))
    result[cur_row].append(str(map[1]["address"]))
    result[cur_row].append(str(map[1]["distance"][idx]))
    result[cur_row].append(str(map[1]["station"][idx]))
    result[cur_row].append(str(map[1]["transport_type"][idx]))

    cur_row += 1
(Screenshot: the same data without nesting)

Create a dataframe

We create the dataframe from the list of lists and export it to CSV:

df = pd.DataFrame(result, columns = ["ad_id", "price", "address", "distance", "station", "transport_type"])

filename = 'test.csv'
df.to_csv(filename)

Convert all ads to a dataframe:

result = []
cur_row = 0
for id in map.keys():
    cur_price = map[id]["price"]
    cur_address = map[id]["address"]
    for idx in range(len(map[id]["distance"])):
        result.append([])
        result[cur_row].append(int(id))
        result[cur_row].append(float(cur_price))
        result[cur_row].append(str(cur_address))
        result[cur_row].append(float(map[id]["distance"][idx]))
        result[cur_row].append(str(map[id]["station"][idx]))
        result[cur_row].append(str(map[id]["transport_type"][idx]))
        cur_row += 1
# converting to dataframe
df = pd.DataFrame(result, columns = ["ad_id", "price","address", "distance", "station", "transport_type"])
# export to csv
filename = 'test.csv'
df.to_csv(filename)

We did it! Now our scraper is ready for testing.

Limitations of HTML scraping and its alternatives

This example shows how easy HTML scraping in Python can be in a standard case: it requires little digging through documentation and relies more on creative thinking than on web development experience.

However, HTML scrapers have disadvantages:

  • You can only access the information that is present in the HTML code loaded when the URL is requested. Content loaded afterwards via JavaScript or Ajax is out of reach.
  • HTML classes and identifiers may change when the website is updated.
  • The scraper is easy to detect if its requests look abnormal for the website (for example, a very large number of requests in a short period of time); pausing between requests, as in the sketch below, keeps the pattern modest.
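
One simple way to keep the request pattern modest is to pause between page requests and to send an identifying User-Agent header. This is a minimal sketch; the header value is an example, not a requirement of any particular site:

import time
import requests

url = 'https://www.website.com/london/page_size=25&q=london&pn='
max_pages = 15
headers = {'User-Agent': 'my-research-scraper (contact: you@example.com)'}  # example value

for p in range(max_pages):
    html_text = requests.get(url + str(p + 1), headers=headers).text
    # ... parse the page with Beautiful Soup as before ...
    time.sleep(2)  # pause so the traffic does not look abnormal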


Alternatives:

  • Shell scripts - download the entire page and process the HTML with regular expressions.
  • Screen scrapers - imitate a real user by driving a browser (Selenium, PhantomJS); see the sketch after this list.
  • Scraping software - designed for standard cases and does not require writing code (webscraper.io).
  • Web scraping services - no code required, handle the scraping for you, usually paid (zyte.com).
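
As an illustration of the screen scraper approach, here is a minimal Selenium sketch. It assumes Chrome and a matching chromedriver are installed and reuses the placeholder URL from this article; the rendered page source is then handed to Beautiful Soup exactly as before, which also covers sites with infinite scrolling:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()  # requires Chrome and chromedriver
driver.get('https://www.website.com/london/page_size=25&q=london&pn=1')

# scroll down a few times so JavaScript can load additional results
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# the fully rendered HTML can now be parsed as before
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()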
