
Web Scraping with Python 101 – Extract Data from any Website

Hi, welcome. This is Ander from ZenRows. Today we'll show you, step by step, how to scrape a website using Python. For the demo, we picked a remote job board. Let's dive in! Once on the page, right-click and choose "Inspect" to open DevTools on the Elements tab. It shows all the elements on the page. Then click the element picker in the toolbar, which highlights every item the mouse hovers over. Select the wrapper of the job offer list. It has an ID of "initial_job_list". We'll go to the Console to get it using "document.getElementById". Click on it, and it will take you back to the Elements tab. Once there, we can see that every job offer has a "job-tile" class. Once again, we go to the Console and select them with "querySelectorAll". And there, we see an array with more than a hundred elements, each one of them a job offer. Inside each of them, we can see another item with the "job-tile-title" class, which is the job position.

After inspecting the page, we go to the editor and import "requests", the library we'll use to obtain the HTML. Next, create a variable with the URL we want and call "requests.get" with it. That's the function that will fetch the content for us. And finally, we print the response. We go to the console and run the script by executing python followed by the name of the file you just created. We see that it's a 200 response, meaning that everything went OK. Now we're printing "response.ok", a boolean that marks whether the result is right or wrong; in this case, true means everything went OK. Next, we print the text and see that the HTML is there. Once again, everything went OK.
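The steps above can be sketched as a minimal script. The URL here is a stand-in, since the job board's address isn't spelled out in the transcript:

```python
import requests

# Hypothetical URL: substitute the job board address used in the video
url = "https://example.com"
response = requests.get(url)  # fetch the page's HTML

print(response.status_code)  # 200 when everything went OK
print(response.ok)           # True for any status code below 400
print(response.text[:100])   # the start of the HTML
```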

Moving to a different topic, we'll address the primary pain with web scraping: getting blocked. The main reason you can get blocked is making too many requests from the same IP, and using proxies avoids that because each request gets a different IP. In this case, we're adding the ZenRows proxy. Its URL consists of your key "" and port 8001. With that, and adding "proxies" to the request, everything gets routed through ZenRows, and you won't have to worry about getting blocked by IP. If we run it, the script does exactly the same as before. We'll continue without the proxy part for simplicity, but let's keep in mind that we can add it back anytime.
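A sketch of the proxy setup. "YOUR_KEY" and the proxy hostname are placeholders of ours, not copied from the video; port 8001 is the one mentioned:

```python
import requests

# YOUR_KEY and the hostname are placeholders; port 8001 is from the video
proxy_url = "http://YOUR_KEY:@proxy.zenrows.com:8001"
proxies = {"http": proxy_url, "https": proxy_url}

# With the dict in place, pass it to the request so every call is routed
# through the proxy (commented out here, since it needs a real key):
# response = requests.get(url, proxies=proxies)
```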

Next, we're importing "BeautifulSoup". This library will handle the HTML and allow us to use selectors as we did in the browser with "document.querySelectorAll". We create the object using BeautifulSoup, passing the response's content and the kind of parser we want, which is HTML in this case. Then we get the page's title and print it. The text is precisely the same as the title shown on the page. Now we want to start using the selectors we saw earlier. First, we need the main wrapper, the element with the ID "initial_job_list". We do that using the "soup.find" function, passing the ID we want. It returns a new object, which we call "jobs_wrapper", and then we use that wrapper's ".find". Here we're using nesting: finding an element inside another element. We're getting the first "job-tile" that matches, selecting by class. Note that it is "class" with an underscore, because "class" is a reserved word in Python. Then, inside the job, we get the job title by selecting the class "job-tile-title", same as before, using nesting with "job.find". Then, print to see that everything went right. Executing again, we get "Senior PHP Developer", which was the first item on the list.
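Here is how those selections look in code. The inline HTML is a hand-made miniature mirroring the IDs and classes seen in DevTools, standing in for the real response content:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the real page (normally you'd pass response.content)
html = """
<html><head><title>Remote jobs board</title></head><body>
  <div id="initial_job_list">
    <div class="job-tile"><span class="job-tile-title">Senior PHP Developer</span></div>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)  # the page title, same as shown in the browser

jobs_wrapper = soup.find(id="initial_job_list")  # the main wrapper
job = jobs_wrapper.find(class_="job-tile")       # first match; note class_ with underscore
title = job.find(class_="job-tile-title")        # nested find inside the job
print(title.text)
```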

Now, back to the browser. We want to get some more information, not just the job title. Back in DevTools, we see the "job-tile", which is the element we got in the script. But its parent contains an ID, which will probably be helpful for deduplicating in the future, or whatever else we want to use it for. Inside the element, we also have a "p", a paragraph. It contains an "a" tag with the company and a "span" with the location. And then, to the right, the category in a different tag. We should note them all. Take another look if you're interested in something else, and then go back to the editor. We'll continue where we left off.

With the job element in hand, we want to generate a more populated outcome. For the moment, it will be a dictionary with all the items in there, which we then print. The first item, the ID, comes from "job.parent", which refers to the parent element of the job we had earlier, followed by ".get("id")", which reads the ID attribute. Next is the title, which we already have from the previous step. No problem there, but we want to print "text.strip()", which trims the whitespace on the left and the right. After that, the link. In the browser, we saw that the job title item contains an "href" attribute, the link of an "a" tag, so we want to store that too. We have a problem with the company, because there was no selector to get it, and the same goes for the location. We'll generate an extra element by getting the paragraph with "job.find("p")". Then we take its text and split it. The separator is a middle dot character ("·", printed by shift+3 on some keyboard layouts). Splitting on it, we have the company in the first item and the location in the second one. Add them to the result as before, stripping the content to avoid whitespace. The last element, the category, uses the selector we just saw, so it's the same thing: "job.find", using "class" with the underscore equal to the selector we had, and then ".text", just as before. Now, we save and run the script. We see a dictionary with several items, and everything looks fine. We got all the content from the first position. So far, so good.
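Putting that extraction together. The sample markup below is invented to mimic the structure just described (parent ID, the "·"-separated paragraph, the category tag); the "job-tile-category" class in particular is a placeholder of ours, since the transcript doesn't name that selector:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a single job tile and its parent
html = """
<li id="job-1234">
  <div class="job-tile">
    <a href="/remote-jobs/senior-php-developer" class="job-tile-title"> Senior PHP Developer </a>
    <p><a>Acme Corp</a> · <span>Worldwide</span></p>
    <span class="job-tile-category">Software Development</span>
  </div>
</li>
"""

job = BeautifulSoup(html, "html.parser").find(class_="job-tile")
company, location = job.find("p").text.split("·")  # split on the middle dot

result = {
    "id": job.parent.get("id"),  # the parent element's ID attribute
    "title": job.find(class_="job-tile-title").text.strip(),  # strip() trims whitespace
    "link": job.find(class_="job-tile-title").get("href"),
    "company": company.strip(),
    "location": location.strip(),
    "category": job.find(class_="job-tile-category").text.strip(),
}
print(result)
```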

The problem is, we only have one job offer, and we want them all. Before doing that, we'll extract the result-building part into a helper function that we will call for each element on the list. We'll see a bit later how to do that. For now, just move the result part that we generated in the previous step into a helper function. Here we have to return the result instead of storing it in an internal variable. Nothing really changes in the process; we just moved part of the logic to a function so we can reuse it soon. Then, we call the newly created function, and it works as before. Save, run, and we get exactly the same result as before. Perfect! Next, we're going to get all the elements. Replace "find" with "find_all", which returns an array with all the items that match the class "job-tile", and rename the jobs variable accordingly. Then, we rename the output variable to "results", in plural, just to be precise. Finally, results will be an array built from each job in jobs, with the data extracted using the newly created function. Save, run, and see a result similar to the previous one, but now an array with multiple elements: around one hundred and forty, each one representing an item on the job board.
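The refactor can be sketched like this, again over invented miniature markup; the helper name "extract_job" is ours, since the transcript doesn't name the function:

```python
from bs4 import BeautifulSoup

# Two invented job tiles standing in for the ~140 real ones
html = """
<div id="initial_job_list">
  <li id="job-1"><div class="job-tile"><a href="/a" class="job-tile-title">Senior PHP Developer</a></div></li>
  <li id="job-2"><div class="job-tile"><a href="/b" class="job-tile-title">Data Engineer</a></div></li>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def extract_job(job):
    # return the result instead of storing it in a local variable
    return {
        "id": job.parent.get("id"),
        "title": job.find(class_="job-tile-title").text.strip(),
        "link": job.find(class_="job-tile-title").get("href"),
    }

jobs = soup.find(id="initial_job_list").find_all(class_="job-tile")  # all matches, not just the first
results = [extract_job(job) for job in jobs]
print(len(results), results[0]["title"])
```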

I copied the output to a new file to see the JSON format clearly. We can now check, for example, how many offers there are for the company "", or for positions with "software-dev" in the URL, whatever we want. We got a properly structured JSON, and we can query it, but that's not the best format for human interaction. Next, we're going to export the data to a CSV file and check how it looks in Excel.
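The same inspection can be done from Python; the sample entries here are invented:

```python
import json

# Invented sample of the results list built earlier
results = [
    {"title": "Senior PHP Developer", "link": "/remote-jobs/software-dev/php-1"},
    {"title": "Data Engineer", "link": "/remote-jobs/data/engineer-2"},
]

print(json.dumps(results, indent=2))  # pretty-printed JSON for easy reading

# query it, e.g. offers whose link mentions "software-dev"
software_dev = [r for r in results if "software-dev" in r["link"]]
print(len(software_dev))
```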

For that, we import pandas, a library that helps us by providing a DataFrame and a function to store that data in CSV format. We create a new DataFrame using the pandas library, passing in the results. Then we call "to_csv". There's also "to_sql", but we're not going to use it for the moment. We pass the target file name, for example "offers.csv", and then "index=False", which removes the index that the "pandas" library generates automatically. We execute it, and the file is generated, so we can open it with Excel or LibreOffice.
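The export step, sketched end to end with a couple of invented rows:

```python
import pandas as pd

# Invented rows standing in for the scraped results
results = [
    {"title": "Senior PHP Developer", "company": "Acme Corp"},
    {"title": "Data Engineer", "company": "Toptal"},
]

df = pd.DataFrame(results)            # one row per job offer
df.to_csv("offers.csv", index=False)  # index=False drops pandas' auto-generated index column
```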

It exported all the data to a regular CSV file that Excel can open. You can browse and take a look; everything is in there. We can look for data manually or create filters and, using them, look for companies, for example "Toptal". The same goes for positions: how many data engineer offers are there? At this point, this is a good option; it's great because we can filter and take a look. But it is not an actual database. We'll learn how to dump all this information into a database like MongoDB or MySQL for processing in a future video.

One aspect of programming that we didn't cover here is error handling. We just assumed that everything went fine and the web page would always return a 200 status code. But that's not always the case. For those cases, we'll add try/except error handling, just to be sure that everything keeps working. Indent everything to the right, then add "except Exception as e", and we're going to print it for the moment. Nothing fancy, just avoiding a situation where something breaks and we don't know how or why. The second point is, we'll only get to the data-processing part if "response.ok" is true, meaning that everything went fine; for example, the status code was 200, which is the case we've seen so far. Otherwise, we print the response to check its status code and what happened. Another topic that we didn't cover is reusability. For that, we want to move the storage part to its own function. Why do this? As in any programming task, reusability is essential, and single responsibility is also noteworthy. We're now moving everything related to pandas and storage to its own function, and maybe in the future to its own file, so we can edit or modify it easily. That means we could store data in MySQL, for example, just by modifying that function or that file, and not the whole snippet we've seen so far. This is a simple case, but the idea is the same: single responsibility for everything. Good practices from general software development must be followed here too; scraping is just another aspect of developing software.
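A sketch of the resulting structure; the function names are ours, and the parsing body is elided since it's the same as in the earlier steps:

```python
import requests
import pandas as pd

def store_results(results, filename="offers.csv"):
    # single responsibility: storage lives here, so switching to MySQL later
    # only means changing this function
    pd.DataFrame(results).to_csv(filename, index=False)

def scrape(url):
    try:
        response = requests.get(url)
        if response.ok:
            pass  # parse the HTML and build results, as in the steps above
        else:
            print(response)  # inspect the failing status code
    except Exception as e:
        print(e)  # nothing fancy: just don't fail silently
```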

Once again in the browser, we'll explore the job offer page. Up until now, we found everything on the home page, so this will be completely different. As we can see, there's a wall of text on the left and some info on the right. Exploring with DevTools, just as before, we see that a "job-description" element contains the full description, and there's an element on the right with the metadata.

Back to the editor. We'll modify our script to get the information from the job offer page, not just from the home page. First, comment out storing the results; we're not interested in that now. We're just changing some code to show how to browse from the home page to the result pages one by one. And to avoid scraping more than a hundred results, we'll take just the first two. We'll make three calls in total: home page, first job offer, second job offer. We'll do something similar to the previous step for the job details. We're going to iterate over the results, but instead of taking information from the job object, we'll make a new request. In this case, the URL will be the domain plus the link we got from the job, the one we saw in the "href" extracted in a previous step. Then the "BeautifulSoup" part is the same: creating a "soup" object from "response.text" and parsing it as HTML, just as before. Nothing new here. Next, we append an object containing the title to the details. In this case, the title will be the "h1", the one we saw in the browser up there: "h1" dot text and strip, just as we did a few minutes ago. The second element will be the description, which is a whole lot of text, not small items as before but the entire block. We'll see later that it is huge. It's not easily readable, but we can store it for whatever we might need, for example later analysis for machine learning training or sentiment analysis. The last item will be the meta. Something different happens here: we take the whole element by its ID. As we saw, the metadata has lots of parts, so in this case we store the whole block as HTML, not as text. It might be helpful to extract different parts from there, like category, job type, or salary; we leave that to the viewer. Finally, we print it.
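The detail-page parsing, against an invented miniature of a job offer page; the "job-meta" ID is a placeholder of ours, since the transcript doesn't name the metadata element's ID, and in the real script the HTML comes from requesting the domain plus each job's "href":

```python
from bs4 import BeautifulSoup

# Invented stand-in for a job offer page (normally: response.text from the detail URL)
html = """
<html><body>
  <h1> Senior PHP Developer </h1>
  <div class="job-description"><p>A very long wall of text describing the role...</p></div>
  <div id="job-meta"><span>Full-time</span><span>Software Development</span></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

detail = {
    "title": soup.find("h1").text.strip(),
    "description": soup.find(class_="job-description").text.strip(),  # the whole text block
    "meta": str(soup.find(id="job-meta")),  # keep the raw HTML, since it has many parts
}
print(detail["title"])
```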
We see that the title is there; the description is also there, with its extended text; and the metadata is there, with all the HTML that we saw.

And this is the end of today's tutorial. We hope you understood all the content and enjoyed learning how to scrape a website with Python. Thanks for watching.