
Hi everyone. My name is Iveta and I'm a Content Manager at Oxylabs. Today, I'm here to explain everything about scraping dynamic JavaScript AJAX websites with BeautifulSoup. Let's get started.
Collecting public data from most websites is comparatively easy. However, many websites are dynamic and use JavaScript to load their content. These web pages require a different approach to collect the required data.
Dynamically loading content using JavaScript is also known as AJAX (Asynchronous JavaScript and XML). Before we start this tutorial, we need to find a public website to scrape. In this case, we choose the website quotes.toscrape.com.

First of all, the easiest way to determine if a website is dynamic is by using Chrome or Edge, because both of these browsers use Chromium under the hood. Open Developer Tools by pressing the F12 key. Ensure that the focus is on Developer Tools and press the Ctrl+Shift+P key combination to open the Command Menu. You'll see a lot of commands. Start typing "disable", and the commands will be filtered to show Disable JavaScript. Select this option to disable JavaScript. Now reload the page by pressing Ctrl+R or F5. If this is a dynamic website, most of the content will disappear. In some cases, a website will still show the data but fall back to basic functionality. For example, a website can have an infinite scroll; if JavaScript is disabled, regular pagination will be shown instead.

Let's jump to another topic: can BeautifulSoup render JavaScript? The short answer is no. It's important to understand words like parsing and rendering. Parsing is simply converting a string representation into an actual Python object. Rendering is essentially interpreting HTML, JavaScript, CSS, and images into what we see in the browser. BeautifulSoup is a Python library for pulling data out of HTML files. This involves parsing an HTML string into a BeautifulSoup object, so first we need the HTML as a string to begin with. Dynamic websites don't have the data in the HTML directly, which means that BeautifulSoup cannot work with dynamic websites on its own.

The Selenium library can automate loading and rendering websites in a browser like Chrome or Firefox. Even though Selenium supports pulling data out of HTML, it's possible to extract the complete HTML and use Beautiful Soup instead to extract the data.

Let's begin dynamic web scraping with Python using Selenium first. Installing Selenium involves installing three things. The first is the browser of your choice, which you probably already have: Chrome, Firefox, Edge, Internet Explorer, Safari, and Opera are supported. In this tutorial, we'll be using Chrome.
The second is the driver for your browser. The driver for Chrome can be downloaded from this page: download the zip file containing the driver and unzip it. Visit this link for information about drivers for other browsers. The third is the Python Selenium package. It can be installed using the pip command or, if you're using Anaconda, from the conda-forge channel.
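For reference, either of these commands installs the package:

```
pip install selenium

# or, with Anaconda:
conda install -c conda-forge selenium
```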
The basic skeleton of the Python script for launching a browser, loading the page, and then closing the browser is simple. The executable_path is the complete path of the driver. On Windows, backslashes in the path need to be changed to forward slashes.
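A minimal sketch of that skeleton, using the Selenium 3 API that this tutorial follows; the driver path is hypothetical, so adjust it for your machine:

```python
from selenium import webdriver

# Complete path to the unzipped ChromeDriver (hypothetical location)
driver = webdriver.Chrome(executable_path='C:/drivers/chromedriver.exe')

driver.get('https://quotes.toscrape.com/js/')
# ... locate elements and extract data here ...

driver.quit()  # close the browser
```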
Now, our objective is to find the author element.
You should simply load the website in Chrome, right-click the author name, and click Inspect. This should load Developer Tools with the author element highlighted.
This is a small element with its class attribute set to author. Selenium provides various methods to locate HTML elements. Here are a few of the methods that can be useful.
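For example, a sketch using the Selenium 3 find_element_by_* API; the id and name values are hypothetical, while the author class and small tag come from the element we just inspected:

```python
driver.find_element_by_id('author')          # by the id attribute (hypothetical)
driver.find_element_by_name('author')        # by the name attribute (hypothetical)
driver.find_element_by_class_name('author')  # by a CSS class name
driver.find_element_by_tag_name('small')     # by the HTML tag name
```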
There are also a few other methods which may be useful for other scenarios. We think that the most useful methods are find_element_by_css_selector and find_element_by_xpath; either of these can handle most scenarios. Let's modify the code so that the first author can be printed.
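A sketch of that change, selecting the element by its author class:

```python
# Select the first element whose class attribute is 'author' and print its text
author_element = driver.find_element_by_css_selector('.author')
print(author_element.text)

# The same element located with XPath instead:
# author_element = driver.find_element_by_xpath('//small[@class="author"]')
```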
If you want to print all the authors, you should know that all the find_element methods have a counterpart: find_elements. Note the pluralization. To find all the authors, simply change one line.
This returns a list of elements. We can simply run a loop to print all the authors.
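Putting both changes together:

```python
# Note the plural method name: find_elements returns a list
authors = driver.find_elements_by_css_selector('.author')
for author in authors:
    print(author.text)
```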
We already discussed that the Beautiful Soup object needs HTML.
For scraping static public websites, the HTML can be retrieved using the requests library. The next step is parsing this HTML string into a BeautifulSoup object.
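A minimal sketch for the static case, using the non-JavaScript version of our example site:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the static page and parse the HTML string into a BeautifulSoup object
response = requests.get('https://quotes.toscrape.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('small', class_='author').text)
```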
Let's find out how to scrape a dynamic website with BeautifulSoup.
The rendered HTML of the page is available in the attribute page_source.
Once the soup object is available, all Beautiful Soup methods can be used as usual.
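A sketch that hands Selenium's rendered HTML over to Beautiful Soup, reusing the driver created earlier:

```python
from bs4 import BeautifulSoup

driver.get('https://quotes.toscrape.com/js/')

# page_source holds the HTML after JavaScript has rendered the page
soup = BeautifulSoup(driver.page_source, 'html.parser')

for author in soup.find_all('small', class_='author'):
    print(author.text)
```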
Once the script is ready, there's no need for the browser to be visible while it runs. The browser can be hidden, and the script will still run fine. A browser running this way is known as a headless browser.
To make the browser headless, import ChromeOptions. If needed, you’ll also find Options classes for other browsers by simply searching for them via Google.
Now, you need to create an object of this class, and set the headless attribute to True.
Finally, you should send this object while creating the Chrome instance. Now when you run the script, the browser won't be visible.
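Putting those three steps together for Chrome; the driver path is hypothetical, as before:

```python
from selenium import webdriver
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.headless = True  # run the browser without a visible window

driver = webdriver.Chrome(executable_path='C:/drivers/chromedriver.exe',
                          options=options)
```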
Still, loading the browser is expensive: it takes up CPU, RAM, and bandwidth that aren't really needed. When you scrape a website, it's the data that's important. All the CSS, images, and rendering aren't really needed.
The fastest and most efficient way of scraping dynamic public web pages with Python is to locate the actual place where the data is located.
There are two places where this data can be located: the main page itself, in JSON format, embedded in a <script> tag, or other files that are loaded asynchronously, where the data can be in JSON format or as partial HTML.
Now, we can take a look at a few examples. Let’s open our chosen website quotes.toscrape.com/js in Chrome.
Once the page is loaded, press Ctrl+U to view the source. Press Ctrl+F to bring up the search box and search for the author "Albert." We can immediately see that the data is embedded as a JSON object on the page. You should also note that this is the part of the script where this data is being assigned to a variable named data.
In this case, we can use the Requests library to get the page, and use Beautiful Soup to parse the page and get the script element.
You should also note that there are multiple script elements. The one that contains the data we need doesn't have an src attribute; let's use this to extract the script element. Remember that this script contains other JavaScript code apart from the data that we're interested in. For this reason, we're going to use a regular expression to extract the data. The result is a list containing one item. Now we can use the json library to convert this string into a Python object; in this case, the output will be a Python list. Note that this list can now be converted into any format as required. Also, each item contains a link to the author page, which means you can read these links and create a spider to get public data from all of those pages.
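A sketch of the whole approach; the regular expression assumes the JSON is assigned as var data = ...; on its own line, which is what view-source shows for this site:

```python
import json
import re

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/js/')
soup = BeautifulSoup(response.text, 'html.parser')

# The script element we need is the one without an src attribute
script = soup.find('script', src=False)

# Pull out just the JSON that is assigned to the `data` variable;
# findall returns a list containing one item
matches = re.findall(r'var data =(.+?);\n', script.string, re.DOTALL)
data = json.loads(matches[0])

for quote in data:
    print(quote['author']['name'])
```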
Web scraping dynamic sites can follow a completely different path. Sometimes the data is loaded on a separate page altogether.
A website called Librivox is a great example of this case.
You should open Developer Tools, go to the Network tab, and filter by XHR. Now open this link or search for any book. You'll see that the data is HTML embedded in JSON.
In this case, you should note a few things about the request, which are covered in the sketch below.
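A sketch of such a script; the exact query URL, the X-Requested-With header, and the 'results' key are assumptions based on what the Network tab shows, so copy the real values from your own Developer Tools:

```python
import requests
from bs4 import BeautifulSoup

# URL copied from the XHR request in the Network tab (assumption; use yours)
url = ('https://librivox.org/advanced_search'
       '?title=&author=&reader=&keywords=&genre_id=0'
       '&status=all&project_type=either&recorded_language='
       '&sort_order=alpha&search_page=1&search_form=advanced')

# This header marks the request as AJAX, the way the browser sends it
headers = {'X-Requested-With': 'XMLHttpRequest'}

response = requests.get(url, headers=headers)

# The JSON response holds partial HTML under the 'results' key
data = response.json()
soup = BeautifulSoup(data['results'], 'html.parser')

# Book titles are links inside h3 headings in that partial HTML (assumption)
for book in soup.select('h3 > a'):
    print(book.text)
```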
This script extracts the data we need from the JSON response. I hope this tutorial has helped you learn to collect the required data from JavaScript-rendered websites. If you have any questions about any other specific topic, drop us a line at [email protected] or write a comment below. Also, if you found this content beneficial, press the like button and share it on social media. Thanks for watching and see you in other videos!