Implementing Web Scraping in Python with Scrapy


To overcome the limitations of scraping with BeautifulSoup alone, it is possible to combine multithreading / multiprocessing with BeautifulSoup and hand-build a spider that crawls the website and extracts data. Using Scrapy instead saves that time and effort.

With the help of Scrapy one can:

  • Fetch millions of records efficiently
  • Run the spider on a server
  • Fetch data asynchronously
  • Run the spider in multiple processes

Scrapy provides a complete workflow for creating a spider, running it, and then saving the scraped data. This looks quite involved at first, but it's for the best.

Let's talk about installing, building and testing the spider.

Step 1: Create a virtual environment

It is good practice to create a virtual environment, as it isolates the project and does not affect other programs on the machine. To create a virtual environment, first install the venv package:

  sudo apt-get install python3-venv  

Create a folder for the project and create the virtual environment inside it:

  mkdir scrapy-project && cd scrapy-project
  python3 -m venv myvenv

If the above command throws an error try this:

  python3.5 -m venv myvenv  

After creating the virtual environment, activate it using:

  source myvenv/bin/activate  

Step 2: Installing Scrapy Module

Install Scrapy using:

  pip install scrapy  

To install Scrapy for a particular Python version:

  python3.5 -m pip install scrapy  

Replace version 3.5 with another version such as 3.6.
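
To quickly confirm that the installation succeeded, you can print the installed version with Scrapy's built-in version command:

  scrapy version  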

Step 3: Create a Scrapy Project

When using Scrapy, you need to create a scrapy project.

  scrapy startproject gfg  

In Scrapy, data is retrieved by spiders. To create one, go to the spiders folder inside the project and create one Python file there, for example gfgfetch.py.
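
For orientation, scrapy startproject generates roughly the following layout (file names come from Scrapy's default project template); the new spider file goes inside the spiders folder:

  gfg/
      scrapy.cfg            # deploy configuration file
      gfg/                  # project's Python module
          __init__.py
          items.py          # item definitions
          middlewares.py    # project middlewares
          pipelines.py      # item pipelines
          settings.py       # project settings
          spiders/          # put gfgfetch.py here
              __init__.py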

Step 4: Create the spider

Go to the spiders folder and create gfgfetch.py. When creating a spider, always create one class with a unique name and define the required attributes and methods. First, name the spider by assigning the name variable, then provide the starting URL from which the spider will start crawling. Define methods that help crawl much deeper into the site. For now, let's scrape all the URLs present and store all of those URLs.

import scrapy

class ExtractUrls(scrapy.Spider):

    # This name must always be unique
    name = "extract"

    # Function to be called
    def start_requests(self):

        # enter URL here
        urls = ['https://www.python.engineering/', ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

The main motive is to get each URL, request it, and extract all URLs or anchor tags from the response. To do this, we need to create another method, parse, to fetch data from a given URL.

Step 5: Fetch data from the page

Test a few things before writing the parse function, such as how to fetch any data from a given page. To do this, use the Scrapy shell. It works like the Python interpreter, but with the ability to scrape data from a given URL. In short, it is a Python interpreter with Scrapy functionality.

  scrapy shell URL

Note: Make sure you are in the same directory as the scrapy.cfg file, otherwise this won't work.

  scrapy shell https://www.python.engineering/  

Now use selectors to get data from this page. These selectors can be either CSS or XPath. For now, let's extract all the URLs using the CSS selector.

  • To get the anchor tags:
      response.css('a')  
  • To extract the data:
      links = response.css('a').extract()  
  • For example, links[0] will show something like this:
      '<a href="https://www.python.engineering/" title="Python.Engineering" rel="home">Python.Engineering</a>' 
  • To get only the href attribute, use the ::attr(href) selector:
      links = response.css('a::attr(href)').extract()  

This will get all the href data, which is very useful. Take those links and start requesting them, as in the shell sketch below.
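
To try this end to end without leaving the shell, the sketch below (assuming the shell was started against https://www.python.engineering/) uses the shell's built-in fetch() helper and response.urljoin() to resolve relative hrefs before requesting one of the extracted links:

  >>> links = response.css('a::attr(href)').extract()
  >>> len(links)                                    # number of anchors found on the page
  >>> links[:3]                                     # peek at the first few hrefs
  >>> fetch(response.urljoin(links[0]))             # request one of the extracted links
  >>> response.css('title::text').extract_first()   # title of the page just fetched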

Now, let's create the parse method to get all the URLs and then yield them. The spider goes to each of those URLs, gets more links from that page, and this happens over and over again. In short, we fetch all the URLs present on each page.

Scrapy, by default, filters out URLs that have already been visited, so it won't crawl the same URL path twice. However, two different pages can contain two or more identical links. For example, every page has a header link, which means that header link appears every time a page is requested. Try to rule out unwanted links by checking them, as the parse function below does with a simple domain check.

# Parse function
def parse(self, response):

    # Additional step to get the page title
    title = response.css('title::text').extract_first()

    # Get anchor tags
    links = response.css('a::attr(href)').extract()

    for link in links:
        yield {
            'title': title,
            'links': link
        }

        # Follow only links that stay on the same site
        if 'pythonengineering' in link:
            yield scrapy.Request(url=link, callback=self.parse)

Below is the implementation of the scraper:

# scrapy module import
import scrapy

class ExtractUrls(scrapy.Spider):

    name = "extract"

    # Request function
    def start_requests(self):
        urls = ['http://www.python.engineering', ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Parse function
    def parse(self, response):

        # Additional step to get the page title
        title = response.css('title::text').extract_first()

        # Get anchor tags
        links = response.css('a::attr(href)').extract()

        for link in links:
            yield {
                'title': title,
                'links': link
            }

            # Follow only links that stay on the same site
            if 'pythonengineering' in link:
                yield scrapy.Request(url=link, callback=self.parse)

Step 6: In the last step, run the spider and save the output as a simple JSON file

  scrapy crawl NAME_OF_SPIDER -o links.json  

Here the name of the spider is "extract" for this example. It will fetch the data within a few seconds.

Output:
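
A sketch of what links.json might contain (the actual titles and links depend on the site at crawl time; the second URL below is only an illustrative placeholder):

  [
      {"title": "Python.Engineering", "links": "https://www.python.engineering/"},
      {"title": "Python.Engineering", "links": "https://www.python.engineering/some-page/"},
      ...
  ]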

Note: Scraping any web page without permission may be illegal. Do not perform scraping operations without permission.

Link: https://doc.scrapy.org/en/




