Hi everyone. My name is Iveta and I’m a Content Manager at Oxylabs. Today, I’m here to explain everything about scraping dynamic JavaScript AJAX websites with BeautifulSoup. Let’s get started.
First, you’ll need the driver for your browser. The driver for Chrome can be downloaded from this page. You should download the zip file containing the driver and unzip it. Visit this link for information about drivers for other browsers. You’ll also need the Python Selenium package. This package can be installed using the pip command or, if you’re using Anaconda, from the conda-forge channel.
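For reference, the install commands look like this:

pip install selenium
conda install -c conda-forge selenium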
The basic skeleton of the Python script for launching a browser, loading the page, and then closing the browser is simple. The executable_path is the complete path of the driver. On Windows, backslashes need to be changed to forward slashes.
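Here’s a minimal sketch of that skeleton, written in the Selenium 3 style this walkthrough assumes (newer Selenium versions pass the driver path via a Service object instead of executable_path; the path and URL below are examples):

from selenium import webdriver

# executable_path is the complete path of the driver; note the forward slashes
driver = webdriver.Chrome(executable_path='c:/drivers/chromedriver.exe')

# Load the page
driver.get('https://quotes.toscrape.com/js/')

# ... scraping logic goes here ...

# Close the browser
driver.quit()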
Now, our objective is to find the author element.
You should simply load the website in Chrome, right-click the author name, and click Inspect. This should load Developer Tools with the author element highlighted.
This is a small element with its class attribute set to author. Selenium allows various methods to locate HTML elements. Here are a few of the methods that can be useful.
There are also a few other methods which may be useful in other scenarios. We think that the most useful methods are find_element_by_css_selector and find_element_by_xpath; either of these should be able to handle most scenarios. Let’s modify the code so that the first author can be printed.
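Using a CSS selector, printing the first author might look like this (a sketch in the Selenium 3 style; in Selenium 4 the equivalent is driver.find_element(By.CSS_SELECTOR, '.author')):

# Locate the first element with class "author" and print its text
author_element = driver.find_element_by_css_selector('.author')
print(author_element.text)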
If you want to print all the authors, you should know that all the find_element methods have a counterpart – find_elements. Note the pluralization. To find all the authors, simply change one line.
This returns a list of elements. We can simply run a loop to print all the authors.
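Putting those two steps together, a sketch might look like this:

# Note the plural: find_elements returns a list of matching elements
author_elements = driver.find_elements_by_css_selector('.author')
for author in author_elements:
    print(author.text)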
We already discussed that the Beautiful Soup object needs HTML.
For web scraping static public websites, the HTML can be retrieved using the requests library. The next step is parsing this HTML string into the BeautifulSoup object.
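For a static page, that flow might look like this (a sketch using the built-in html.parser; lxml works too if it’s installed):

import requests
from bs4 import BeautifulSoup

# Retrieve the raw HTML of a static page
response = requests.get('https://quotes.toscrape.com/')

# Parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')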
Let’s find out how to scrape a dynamic website with BeautifulSoup.
The rendered HTML of the page is available in the attribute page_source.
Once the soup object is available, all Beautiful Soup methods can be used as usual.
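Combining Selenium and Beautiful Soup, a sketch might look like this (assuming the author name sits in a small tag with class author, as on the page we inspected earlier):

from bs4 import BeautifulSoup

# Hand Selenium's rendered HTML to Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Beautiful Soup methods work as usual
for author in soup.find_all('small', class_='author'):
    print(author.text)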
Once the script is ready, there’s no need for the browser to be visible while the script is running. The browser can be hidden, and the script will still run fine. A browser running this way is known as a headless browser.
To make the browser headless, import ChromeOptions. If needed, you’ll also find Options classes for other browsers by simply searching for them via Google.
Now, you need to create an object of this class, and set the headless attribute to True.
Finally, you should send this object while creating the Chrome instance. Now, when you run the script, the browser won’t be visible.
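A sketch of the headless setup (the headless attribute works in Selenium 3 and early Selenium 4; later versions use options.add_argument('--headless') instead):

from selenium import webdriver
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.headless = True  # hide the browser window

driver = webdriver.Chrome(executable_path='c:/drivers/chromedriver.exe', options=options)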
So, loading the browser is expensive: it takes up CPU, RAM, and bandwidth that aren’t really needed. When you scrape a website, it’s the data that’s important. All the CSS, images, and rendering aren’t really needed.
The fastest and most efficient way of scraping dynamic public web pages with Python is to find the actual place where the data is located.
There are two places where this data can be located: the main page itself, as JSON embedded in a script tag, or other files that are loaded asynchronously, where the data can be in JSON format or as partial HTML.
Now, we can take a look at a few examples. Let’s open our chosen website quotes.toscrape.com/js in Chrome.
Once the page is loaded, press Ctrl+U to view the source. Press Ctrl+F to bring up the search box, and search for the author “Albert.” We can immediately see that the data is embedded as a JSON object on the page. You should also note that this is part of a script where the data is being assigned to a variable named data.
In this case, we can use the Requests library to get the page, and use Beautiful Soup to parse the page and get the script element.
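A sketch of that approach for quotes.toscrape.com/js (the regex and the field names such as author and name are assumptions based on the JSON visible in the page source):

import json
import re

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/js/')
soup = BeautifulSoup(response.text, 'html.parser')

# Find the inline script that assigns the JSON array to the variable `data`
script = soup.find('script', string=re.compile(r'var data\s*='))

# Pull the JSON array out of the assignment and parse it
match = re.search(r'var data\s*=\s*(\[.*?\]);', script.string, re.DOTALL)
quotes = json.loads(match.group(1))

for quote in quotes:
    print(quote['author']['name'])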
Web scraping dynamic sites can follow a completely different path. Sometimes the data is loaded on a separate page altogether.
A website called Librivox is a great example of this case.
You should open Developer Tools, go to the Network tab, and filter by XHR. Now, you should open this link or search for any book. You’ll see that the data is HTML embedded in JSON.
In this case, there are a few things you should note, which the sketch below illustrates.
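As a rough sketch of how such data could be fetched directly (the URL, query parameters, and the 'results' key below are assumptions for illustration; copy the real request URL and inspect the actual JSON keys in the Network tab):

import requests
from bs4 import BeautifulSoup

# Hypothetical XHR endpoint; copy the actual URL from the Network tab
url = 'https://librivox.org/advanced_search'
params = {'title': 'Pride and Prejudice', 'search_form': 'advanced'}  # example query

response = requests.get(url, params=params)
data = response.json()

# The JSON is assumed to carry the partial HTML under a key such as 'results'
soup = BeautifulSoup(data['results'], 'html.parser')
print(soup.get_text(' ', strip=True))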