
Web Scraping with Python 101 – Extract Data from any Website

Hi, welcome. This is Ander from ZenRows. Today we'll show you, step by step, how to scrape a website using Python. For the demo, we picked a remote job board. Let's dive in! Once on the page, right-click and choose "Inspect" to open DevTools on the Elements tab. It shows all the elements on the page. Then click the element picker in the toolbar, which highlights every item the mouse hovers over. Select the wrapper of the job offer list. It has an ID of "initial_job_list". We'll go to the Console to get it using "document.getElementById". Click on it, and it will take you back to the Elements tab. Once there, we can see that every job offer has a "job-tile" class. Once again, we go to the Console and select them with "querySelectorAll". And there, we see an array with more than a hundred elements, each one of them a job offer. Inside each of them, we can see another item with the "job-tile-title" class, which is the job position.

After inspecting the page, we go to the editor and import "requests", the library we'll use to obtain the HTML. Next, create a variable with the URL we want and call "requests.get" with it. That's the function that will fetch the content for us. And finally, we print the response. We go to the console and run the script by executing python followed by the name of the file you just created. We see that it's a 200 response, meaning that everything went OK. Now we're printing "response.ok", a boolean that marks whether the result is right or wrong; in this case, true means everything went OK. Next, we print the text and see that the HTML is there. Once again, everything went OK.
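The steps above can be sketched as a minimal script. The URL here is a stand-in, since the job board's address isn't spelled out in the transcript:

```python
import requests

# Hypothetical URL: substitute the job board address used in the video
url = "https://example.com"
response = requests.get(url)  # fetch the page's HTML

print(response.status_code)  # 200 when everything went OK
print(response.ok)           # True for any status code below 400
print(response.text[:100])   # the start of the HTML
```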

Moving to a different topic, we'll address the primary pain with web scraping: getting blocked. The main reason you can get blocked is making too many requests from the same IP, and using proxies avoids that because each request gets a different IP. In this case, we're adding the ZenRows proxy. Its URL consists of your key "" and port 8001. With that, and adding "proxies" to the request, everything gets routed through ZenRows, and you won't have to worry about getting blocked by IP. If we run it, the script does exactly the same as before. We'll continue without the proxy part for simplicity, but let's keep in mind that we can add it back anytime.
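A sketch of the proxy setup. "YOUR_KEY" and the proxy hostname are placeholders of ours, not copied from the video; port 8001 is the one mentioned:

```python
import requests

# YOUR_KEY and the hostname are placeholders; port 8001 is from the video
proxy_url = "http://YOUR_KEY:@proxy.zenrows.com:8001"
proxies = {"http": proxy_url, "https": proxy_url}

# With the dict in place, pass it to the request so every call is routed
# through the proxy (commented out here, since it needs a real key):
# response = requests.get(url, proxies=proxies)
```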

Next, we're importing "BeautifulSoup". This library will handle the HTML and allow us to use selectors as we did in the browser with "document.querySelectorAll". We create the object using BeautifulSoup, passing the response's content and the kind of parser we want, which is HTML in this case. Then we get the page's title and print it. The text is precisely the same as the title shown on the page. Now we want to start using the selectors we saw earlier. First, we need the main wrapper, the element with the ID "initial_job_list". We do that using the "soup.find" function, passing the ID we want. It returns a new object, which we call "jobs_wrapper", and then we use that wrapper's ".find". Here we're using nesting: finding an element inside another element. We're getting the first "job-tile" that matches, selecting by class. Note that it is "class" with an underscore, because "class" is a reserved word in Python. Then, inside the job, we get the job title by selecting the class "job-tile-title", same as before, using nesting with "job.find". Then, print to see that everything went right. Executing again, we get "Senior PHP Developer", which was the first item on the list.
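Here is how those selections look in code. The inline HTML is a hand-made miniature mirroring the IDs and classes seen in DevTools, standing in for the real response content:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the real page (normally you'd pass response.content)
html = """
<html><head><title>Remote jobs board</title></head><body>
  <div id="initial_job_list">
    <div class="job-tile"><span class="job-tile-title">Senior PHP Developer</span></div>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)  # the page title, same as shown in the browser

jobs_wrapper = soup.find(id="initial_job_list")  # the main wrapper
job = jobs_wrapper.find(class_="job-tile")       # first match; note class_ with underscore
title = job.find(class_="job-tile-title")        # nested find inside the job
print(title.text)
```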

Now, back to the browser. We want to get some more information, not just the job title. Back in DevTools, we see the "job-tile", which is the element we got in the script. But its parent contains an ID, which will probably be helpful for deduplicating in the future, or whatever else we want to use it for. Inside the element, we also have a "p", a paragraph. It contains an "a" tag with the company and a "span" with the location. And then, to the right, the category in a different tag. We should note them all. Take another look if you're interested in something else, and then go back to the editor. We'll continue where we left off.

With the job element in hand, we want to generate a more populated outcome. For the moment, it will be a dictionary with all the items in there, which we then print. The first item, the ID, comes from "job.parent", which refers to the parent element of the job we had earlier, followed by ".get("id")", which reads the ID attribute. Next is the title, which we already have from the previous step. No problem there, but we want to print "text.strip()", which trims the whitespace on the left and the right. After that, the link. In the browser, we saw that the job title item contains an "href" attribute, the link of an "a" tag, so we want to store that too. We have a problem with the company, because there was no selector to get it, and the same goes for the location. We'll generate an extra element by getting the paragraph with "job.find("p")". Then we take its text and split it. The separator is a middle dot character ("·", printed by shift+3 on some keyboard layouts). Splitting on it, we have the company in the first item and the location in the second one. Add them to the result as before, stripping the content to avoid whitespace. The last element, the category, uses the selector we just saw, so it's the same thing: "job.find", using "class" with the underscore equal to the selector we had, and then ".text", just as before. Now, we save and run the script. We see a dictionary with several items, and everything looks fine. We got all the content from the first position. So far, so good.
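Putting that extraction together. The sample markup below is invented to mimic the structure just described (parent ID, the "·"-separated paragraph, the category tag); the "job-tile-category" class in particular is a placeholder of ours, since the transcript doesn't name that selector:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a single job tile and its parent
html = """
<li id="job-1234">
  <div class="job-tile">
    <a href="/remote-jobs/senior-php-developer" class="job-tile-title"> Senior PHP Developer </a>
    <p><a>Acme Corp</a> · <span>Worldwide</span></p>
    <span class="job-tile-category">Software Development</span>
  </div>
</li>
"""

job = BeautifulSoup(html, "html.parser").find(class_="job-tile")
company, location = job.find("p").text.split("·")  # split on the middle dot

result = {
    "id": job.parent.get("id"),  # the parent element's ID attribute
    "title": job.find(class_="job-tile-title").text.strip(),  # strip() trims whitespace
    "link": job.find(class_="job-tile-title").get("href"),
    "company": company.strip(),
    "location": location.strip(),
    "category": job.find(class_="job-tile-category").text.strip(),
}
print(result)
```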

The problem is, we only have one job offer, and we want them all. Before doing that, we'll extract the result-building part into a helper function that we will call for each element on the list. We'll see a bit later how to do that. For now, just move the result part that we generated in the previous step into a helper function. Here we have to return the result instead of storing it in an internal variable. Nothing really changes in the process; we just moved part of the logic to a function so we can reuse it soon. Then, we call the newly created function, and it works as before. Save, run, and we get exactly the same result as before. Perfect! Next, we're going to get all the elements. Replace "find" with "find_all", which returns an array with all the items that match the class "job-tile", and rename the jobs variable accordingly. Then, we rename the output variable to "results", in plural, just to be precise. Finally, results will be an array built from each job in jobs, with the data extracted using the newly created function. Save, run, and see a result similar to the previous one, but now an array with multiple elements: around one hundred and forty, each one representing an item on the job board.
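The refactor can be sketched like this, again over invented miniature markup; the helper name "extract_job" is ours, since the transcript doesn't name the function:

```python
from bs4 import BeautifulSoup

# Two invented job tiles standing in for the ~140 real ones
html = """
<div id="initial_job_list">
  <li id="job-1"><div class="job-tile"><a href="/a" class="job-tile-title">Senior PHP Developer</a></div></li>
  <li id="job-2"><div class="job-tile"><a href="/b" class="job-tile-title">Data Engineer</a></div></li>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def extract_job(job):
    # return the result instead of storing it in a local variable
    return {
        "id": job.parent.get("id"),
        "title": job.find(class_="job-tile-title").text.strip(),
        "link": job.find(class_="job-tile-title").get("href"),
    }

jobs = soup.find(id="initial_job_list").find_all(class_="job-tile")  # all matches, not just the first
results = [extract_job(job) for job in jobs]
print(len(results), results[0]["title"])
```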

I copied the output to a new file to see the JSON format clearly. We can now check, for example, how many offers there are for the company "", or for positions with "software-dev" in the URL, whatever we want. We got a properly structured JSON, and we can query it, but that's not the best format for human interaction. Next, we're going to export the data to a CSV file and check how it looks in Excel.
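The same inspection can be done from Python; the sample entries here are invented:

```python
import json

# Invented sample of the results list built earlier
results = [
    {"title": "Senior PHP Developer", "link": "/remote-jobs/software-dev/php-1"},
    {"title": "Data Engineer", "link": "/remote-jobs/data/engineer-2"},
]

print(json.dumps(results, indent=2))  # pretty-printed JSON for easy reading

# query it, e.g. offers whose link mentions "software-dev"
software_dev = [r for r in results if "software-dev" in r["link"]]
print(len(software_dev))
```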

For that, we import pandas, a library that helps us by providing a DataFrame and a function to store that data in CSV format. We create a new DataFrame using the pandas library, passing in the results. Then we call "to_csv". There's also "to_sql", but we're not going to use it for the moment. We pass the target file name, for example "offers.csv", and then "index=False", which removes the index that the "pandas" library generates automatically. We execute it, and the file is generated, so we can open it with Excel or LibreOffice.
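The export step, sketched end to end with a couple of invented rows:

```python
import pandas as pd

# Invented rows standing in for the scraped results
results = [
    {"title": "Senior PHP Developer", "company": "Acme Corp"},
    {"title": "Data Engineer", "company": "Toptal"},
]

df = pd.DataFrame(results)            # one row per job offer
df.to_csv("offers.csv", index=False)  # index=False drops pandas' auto-generated index column
```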

It exported all the data to a regular CSV file that Excel can open. You can browse and take a look; everything is in there. We can look for data manually or create filters and, using them, look for companies, for example "Toptal". The same goes for positions: how many data engineer offers are there? At this point, this is a good option; it's great because we can filter and take a look. But it is not an actual database. We'll learn how to dump all this information into a database like MongoDB or MySQL for processing in a future video.

One aspect of programming that we didn't cover here is error handling. We just assumed that everything went fine and the web page would always return a 200 status code. But that's not always the case. For those cases, we'll add try/except error handling, just to be sure that everything keeps working. Indent everything to the right, then add "except Exception as e", and we're going to print it for the moment. Nothing fancy, just avoiding a situation where something breaks and we don't know how or why. The second point is, we'll only get to the data-processing part if "response.ok" is true, meaning that everything went fine; for example, the status code was 200, which is the case we've seen so far. Otherwise, we print the response to check its status code and what happened. Another topic that we didn't cover is reusability. For that, we want to move the storage part to its own function. Why do this? As in any programming task, reusability is essential, and single responsibility is also noteworthy. We're now moving everything related to pandas and storage to its own function, and maybe in the future to its own file, so we can edit or modify it easily. That means we could store data in MySQL, for example, just by modifying that function or that file, and not the whole snippet we've seen so far. This is a simple case, but the idea is the same: single responsibility for everything. Good practices from general software development must be followed here too; scraping is just another aspect of developing software.
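A sketch of the resulting structure; the function names are ours, and the parsing body is elided since it's the same as in the earlier steps:

```python
import requests
import pandas as pd

def store_results(results, filename="offers.csv"):
    # single responsibility: storage lives here, so switching to MySQL later
    # only means changing this function
    pd.DataFrame(results).to_csv(filename, index=False)

def scrape(url):
    try:
        response = requests.get(url)
        if response.ok:
            pass  # parse the HTML and build results, as in the steps above
        else:
            print(response)  # inspect the failing status code
    except Exception as e:
        print(e)  # nothing fancy: just don't fail silently
```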

Once again in the browser, we'll explore the job offer page. Up until now, we found everything on the home page, so this will be completely different. As we can see, there's a wall of text on the left and some info on the right. Exploring with DevTools, just as before, we see that a "job-description" element contains the full description, and there's an element on the right with the metadata.

Back to the editor. We'll modify our script to get the information from the job offer page, not just from the home page. First, comment out storing the results; we're not interested in that now. We're just changing some code to show how to browse from the home page to the result pages one by one. And to avoid scraping more than a hundred results, we'll take just the first two. We'll make three calls in total: home page, first job offer, second job offer. We'll do something similar to the previous step for the job details. We're going to iterate over the results, but instead of taking information from the job object, we'll make a new request. In this case, the URL will be the domain plus the link we got from the job, the one we saw in the "href" extracted in a previous step. Then the "BeautifulSoup" part is the same: creating a "soup" object from "response.text" and parsing it as HTML, just as before. Nothing new here. Next, we append an object containing the title to the details. In this case, the title will be the "h1", the one we saw in the browser up there: "h1" dot text and strip, just as we did a few minutes ago. The second element will be the description, which is a whole lot of text, not small items as before but the entire block. We'll see later that it is huge. It's not easily readable, but we can store it for whatever we might need, for example later analysis for machine learning training or sentiment analysis. The last item will be the meta. Something different happens here: we take the whole element by its ID. As we saw, the metadata has lots of parts, so in this case we store the whole block as HTML, not as text. It might be helpful to extract different parts from there, like category, job type, or salary; we leave that to the viewer. Finally, we print it.
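The detail-page parsing, against an invented miniature of a job offer page; the "job-meta" ID is a placeholder of ours, since the transcript doesn't name the metadata element's ID, and in the real script the HTML comes from requesting the domain plus each job's "href":

```python
from bs4 import BeautifulSoup

# Invented stand-in for a job offer page (normally: response.text from the detail URL)
html = """
<html><body>
  <h1> Senior PHP Developer </h1>
  <div class="job-description"><p>A very long wall of text describing the role...</p></div>
  <div id="job-meta"><span>Full-time</span><span>Software Development</span></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

detail = {
    "title": soup.find("h1").text.strip(),
    "description": soup.find(class_="job-description").text.strip(),  # the whole text block
    "meta": str(soup.find(id="job-meta")),  # keep the raw HTML, since it has many parts
}
print(detail["title"])
```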
We see that the title is there; the description is also there, with its extended text; and the metadata is there, with all the HTML that we saw.

And this is the end of today's tutorial. We hope you understood all the content and enjoyed learning how to scrape a website with Python. Thanks for watching.