
Scraping Dynamic JavaScript Websites – Beautiful Soup Python


Hi everyone. My name is Iveta and I'm a Content Manager at Oxylabs. Today, I'm here to explain to you everything about scraping dynamic JavaScript AJAX websites with Beautiful Soup. Let's get started.

Collecting public data from most websites can be comparatively easy. However, many websites are dynamic and use JavaScript to load their content. These web pages require a different approach to collect the required data.

Dynamically loading content using JavaScript is also known as AJAX (Asynchronous JavaScript and XML). So, before we start this tutorial, we need to find a public website to scrape. In this case, we chose an example website.

First of all, the easiest way to determine whether a website is dynamic is by using Chrome or Edge, because both of these browsers use Chromium under the hood. Open Developer Tools by pressing the F12 key. Ensure that the focus is on Developer Tools and press the Ctrl+Shift+P key combination to open the Command Menu. You'll see a lot of commands. Start typing "disable," and the commands will be filtered to show Disable JavaScript. Select this option to disable JavaScript.

Now reload the page by pressing Ctrl+R or F5. If this is a dynamic website, most of the content will disappear. In some cases, the website will still show the data but will fall back to basic functionality. For example, a website can have an infinite scroll; if JavaScript is disabled, regular pagination will be shown instead.

Let's jump to another topic: can Beautiful Soup render JavaScript? The short answer is no.

It's important to understand words like parsing and rendering. Parsing is simply converting a string representation into an actual Python object. Rendering is essentially interpreting HTML, JavaScript, CSS, and images into something that we see in the browser.

Beautiful Soup is a Python library for pulling data out of HTML files. This involves parsing an HTML string into a BeautifulSoup object, so first we need the HTML as a string to begin with. Dynamic websites don't have the data in the HTML directly, which means Beautiful Soup cannot work with dynamic websites on its own.

The Selenium library can automate loading and rendering websites in a browser like Chrome or Firefox.
Even though Selenium supports pulling data out of HTML, it's possible to extract the complete HTML and use Beautiful Soup instead to extract the data.

Let's begin dynamic web scraping with Python using Selenium first. Installing Selenium involves installing three things.

The first is the browser of your choice (which you probably already have): Chrome, Firefox, Edge, Internet Explorer, Safari, and Opera browsers are supported. In this tutorial, we'll be using Chrome.

The second is the driver for your browser. The driver for Chrome can be downloaded from this page. You should download the zip file containing the driver and unzip it. Visit this link for information about drivers for other browsers.

The third is the Python Selenium package. This package can be installed using the pip command, or, if you're using Anaconda, from the conda-forge channel.
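As a quick sketch, the install commands would look like this; pick the one matching your setup:

```shell
# Install the Selenium package with pip
pip install selenium

# Or, if you're using Anaconda, install from the conda-forge channel
conda install -c conda-forge selenium
```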

The basic skeleton of the Python script for launching a browser, loading the page, and then closing the browser is simple. The executable_path is the complete path of the driver. On Windows, backslashes in the path need to be changed to forward slashes.

Now, our objective is to find the author element.

You should simply load the website in Chrome,  right-click the author name, and click Inspect. This should load Developer Tools  with the author element highlighted.

This is a small element with its class attribute set to author.

Selenium allows various methods to locate HTML elements. Here are a few of the methods that can be useful.

There are also a few other methods, which may be useful for other scenarios, for example: We think that the most useful methods are find_element_by_css_selector and find_element_by_xpath. Either of these should be able to handle most scenarios. Let's modify the code so that the first author can be printed.

If you want to print all the authors,  you should know that all the find_element methods have a counterpart – find_elements. In  this case, you should note the pluralization. To find all the authors, simply change one line.

This returns a list of elements. We can  simply run a loop to print all the authors.

We already discussed that the  Beautiful Soup object needs HTML.

For web scraping static public websites, the  HTML can be retrieved using the requests library. The next step is parsing this HTML  string into the BeautifulSoup object.
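For a static page, the flow might look like this sketch. The inline HTML string stands in for the response body so the snippet is self-contained; on a live site you would use the requests call shown in the comment.

```python
from bs4 import BeautifulSoup

# For a live static page, you would first fetch the HTML with requests:
#   import requests
#   html = requests.get("https://example.com").text
# Here an inline string stands in for that response.
html = '<html><body><small class="author">Albert Einstein</small></body></html>'

# Parse the HTML string into the BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")
print(soup.find("small", class_="author").text)
```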

Let's find out how to scrape a dynamic website with Beautiful Soup.

The rendered HTML of the page is  available in the attribute page_source.

Once the soup object is available, all  Beautiful Soup methods can be used as usual.

Once the script is ready, there's no need for the browser to be visible when the script is running. The browser can be hidden, and the script will still run fine. A browser running in this mode is known as a headless browser.

To make the browser headless, import  ChromeOptions. If needed, you’ll also find Options classes for other browsers  by simply searching for them via Google.

Now, you need to create an object of this  class, and set the headless attribute to True.

Finally, you should send this object while creating the Chrome instance. Now when you run the script, the browser won't be visible.

So, loading the browser is expensive: it takes up CPU, RAM, and bandwidth that aren't really needed. When you scrape a website, it's the data that's important. All the CSS, images, and rendering aren't really needed.

The fastest and most efficient  way of scraping dynamic public web pages with Python is to locate the  actual place where the data is located.

There are two places where this data can be located. The first is the main page itself, in JSON format, embedded in a script tag. The second is other files which are loaded asynchronously; there, the data can be in JSON format, or as partial HTML.

Now, we can take a look at a few examples.  Let’s open our chosen website in Chrome.

Once the page is loaded, press Ctrl+U to view the source. Press Ctrl+F to bring up the search box, and search for the author "Albert." We can immediately see that the data is embedded as a JSON object on the page. You should also note that this is part of a script, where the data is being assigned to a variable named data.

In this case, we can use the  Requests library to get the page, and use Beautiful Soup to parse the  page and get the script element.

You should also note that there are multiple script elements. The one which contains the data that we need doesn't have an src attribute. Let's use this to extract the script element.

Remember that this script contains other JavaScript code apart from the data that we're interested in. For this reason, we're going to use a regular expression to extract this data. The data variable is a list containing one item. Now we can use the json library to convert this string data into a Python object; in this case, the output will be the Python object.

You should note that this list can now be converted to any format as required. Also, each item contains a link to the author page. It means that you can read these links and create a spider to get public data from all these pages.
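Here is a self-contained sketch of that extraction using only the standard library. The script_text string is an inline stand-in for the contents of the script tag without an src attribute (the variable name data mirrors the pattern described above); on the live page you would fetch the HTML with requests and select that script tag with Beautiful Soup first.

```python
import json
import re

# Stand-in for the contents of the <script> tag that has no src attribute
script_text = """
var something_else = 1;
var data = [{"name": "Albert Einstein", "link": "/author/Albert-Einstein"}];
for (var i in data) { /* render the page */ }
"""

# A regular expression pulls out just the JSON assigned to the `data` variable
match = re.search(r"var data =(.+?);\n", script_text, re.DOTALL)
raw_json = match.group(1).strip()

# Convert the string data into a Python object
data = json.loads(raw_json)
print(data[0]["name"])
```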

Web scraping dynamic sites can  follow a completely different path. Sometimes the data is loaded  on a separate page altogether.

A website called Librivox is  a great example of this case.

You should open Developer Tools, go to the Network tab and filter by XHR. Now, you should open this link or search for any book. You'll see that the data is HTML embedded in JSON.
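Handling such a response might look like the sketch below. The inline body stands in for the XHR response, and the "results" field holding partial HTML is illustrative of the pattern, not the site's exact schema.

```python
import json
from bs4 import BeautifulSoup

# Stand-in for the JSON body of the XHR response; on the live site you
# would fetch this with requests from the URL seen in the Network tab.
body = '{"status": "SUCCESS", "results": "<li class=\\"title\\">Pride and Prejudice</li>"}'

payload = json.loads(body)
# The interesting field is itself HTML, so parse it with Beautiful Soup
soup = BeautifulSoup(payload["results"], "html.parser")
print(soup.find("li", class_="title").text)
```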

In this case, you should note a  few things that are listed here.

Now, you can see the script to extract this data.

I hope this tutorial has helped you learn to collect the required data from JavaScript-rendered websites. If you have any questions about any other specific topic, drop us a line at [email protected] or write a comment below. Also, if you find this content beneficial, press the like button and share it on social media. Thanks for watching and see you in other videos!