
How to make a spider bot in PHP?

Note: In this article, I use the terms spider bot and web crawler interchangeably. Some people use them in different contexts, but here both mean the same thing.

There are many things you can do to improve these spider bots and make them more advanced: add features such as maintaining a popularity index, or implement anti-spam measures such as penalizing websites with no content or websites that use "bait and switch" tactics, for example adding keywords that have nothing to do with the page content. Alternatively, you can try to generate the keywords and description from the page itself, which Googlebot does all the time.

Simple SpiderBot:
The simple version does not recurse; it just prints all the links it finds on a web page. Note that all of our main logic lives in the followLink function.
  • Program:
    <?php
    function followLink($url) {
        // We need these options when creating the context
        $options = array(
            'http' => array(
                'method' => "GET",
                'user_agent' => "gfgBot/0.1" // PHP's option key is user_agent
            )
        );
        // Create context for communication
        $context = stream_context_create($options);
        // Create new HTML DomDocument for web scraping
        $doc = new DOMDocument();
        @$doc->loadHTML(file_get_contents($url, false, $context));

        // Get all anchor nodes in the DOM
        $links = $doc->getElementsByTagName('a');

        // Iterate over all anchor nodes
        // found in the document
        foreach ($links as $i)
            echo $i->getAttribute('href') . "\n";
    }
    followLink("http://example.com");
    ?>
  • Output:
    https://www.iana.org/domains/example
    That is not very impressive: we get only one link, because example.com contains only one link and, since we are not recursing, we do not follow it. You can run followLink("http://apple.com") if you want to see it in action on a larger page. However, if you point it at GeeksforGeeks, you may receive an error, as GeeksforGeeks will block our request (for security reasons, of course).
  • Explanation:
    • Line 3: we create an array $options. You don't need to understand much about it, other than that it is required when creating the context. Note the user agent, gfgBot: you can change it to whatever you like. You could even use Googlebot's name to make a site think your crawler is Google's spider, if the site relies on the user agent for bot detection.
    • Line 10: we create a context for communication. You need a context for everything: to tell a story, you need context. To create a window in OpenGL, you need a context; the same goes for HTML5 Canvas and for PHP network communication. Sorry if I went out of "context", but I had to.
    • Line 13: we create a DomDocument, which is basically a data structure for DOM processing, typically used for HTML and XML files.
    • Line 14: we load the HTML by passing in the content of the page. This step may generate warnings (real-world HTML is often malformed), so we suppress them with @.
    • Line 17: we get what is basically an array of all the anchor nodes found in the DOM.
    • Line 21: we print all the links referenced by these anchor nodes.
    • Line 24: we call followLink on example.com to get all of its links. The page has only one link, and that is what is displayed.
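The parsing steps from the explanation can also be tried without any network access. The following is a minimal sketch, not the article's own code: extractLinks is a hypothetical helper name, and it assumes the page HTML is already available as a string. Instead of the @ operator it uses libxml_use_internal_errors, which keeps parser warnings out of the output while leaving other errors visible.

```php
<?php
// Hypothetical helper (not part of the original program): returns the href
// of every anchor tag found in an HTML string.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];   // DOMDocument::loadHTML() rejects an empty string in PHP 8
    }

    // Collect libxml parse warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Sample markup made up for this illustration
$sample = '<html><body><a href="/home.html">Home</a>'
        . '<a name="top">No href here</a>'
        . '<a href="https://www.iana.org/domains/example">More</a></body></html>';

print_r(extractLinks($sample));
```

Keeping the parsing in its own function means the download step (file_get_contents with a context) can be swapped out or tested separately.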
Slightly more complex Spider-Bot:
In the previous code we had a basic spider bot, and it worked, but it was more of a scraper than a crawler. We did not recurse: we never "followed" the links we collected. In this iteration we do exactly that, and we assume there is a database into which links are inserted (for indexing). Every link is inserted through the insertIntoDatabase function.
    • Program:
      <?php
      // List of all links we crawled!
      $crawledLinks = array();

      function followLink($url, $depth = 0) {
          global $crawledLinks;
          $crawling = array();

          // Give up to prevent any seemingly endless loop
          if ($depth > 5) {
              echo " ";
              return;
          }

          $options = array(
              'http' => array(
                  'method' => "GET",
                  'user_agent' => "gfgBot/0.1" // PHP's option key is user_agent
              )
          );
          $context = stream_context_create($options);

          $doc = new DOMDocument();
          // A failed request returns false; only parse when there is content
          $html = @file_get_contents($url, false, $context);
          if ($html !== false && $html !== '')
              @$doc->loadHTML($html);
          $links = $doc->getElementsByTagName('a');

          foreach ($links as $i) {
              $link = $i->getAttribute('href');
              if (ignoreLink($link)) continue;

              $link = convertLink($url, $link);
              if (!in_array($link, $crawledLinks)) {
                  $crawledLinks[] = $link;
                  $crawling[] = $link;
                  insertIntoDatabase($link, $depth);
              }
          }

          foreach ($crawling as $crawlURL) {
              echo "<div style='margin-left:" . (10 * $depth) . "px;'>" .
                   "[+] Crawling <u>$crawlURL</u></div>\n";
              followLink($crawlURL, $depth + 1);
          }

          if (count($crawling) == 0)
              echo "<div style='margin-left:" . (10 * $depth) . "px;'>" .
                   "[!] Didn't Find any Links in <u>$url!</u></div>\n";
      }

      // Converts a relative URL to an absolute URL.
      // No conversion is performed if it is already an absolute URL.
      function convertLink($site, $path) {
          if (substr_compare($path, "//", 0, 2) == 0)
              return parse_url($site)['scheme'] . ':' . $path;
          elseif (substr_compare($path, "http://", 0, 7) == 0 or
                  substr_compare($path, "https://", 0, 8) == 0 or
                  substr_compare($path, "www.", 0, 4) == 0)
              return $path; // Absolutely Absolute URL!!
          else
              return $site . '/' . $path; // naive join; can yield double slashes
      }

      // Do we want to ignore the link?
      function ignoreLink($url) {
          return $url === "" or $url[0] == "#" or substr($url, 0, 11) == "javascript:";
      }

      // Print a message; a real crawler would write to a database here.
      // followLink() has already recorded the link in $crawledLinks.
      function insertIntoDatabase($link, $depth) {
          echo "<div style='margin-left:" . (10 * $depth) . "px;'>" .
               "Inserting new Link: - $link</div>\n";
      }

      followLink("http://guimp.com/");
      ?>

    • Output:
      Inserting new Link: - http://guimp.com//home.html
      [+] Crawling http://guimp.com//home.html
      Inserting new Link: - http://www.guimp.com
      Inserting new Link: - http://guimp.com//home.html/pong.html
      Inserting new Link: - http://guimp.com//home.html/blog.html
      [+] Crawling http://www.guimp.com
      Inserting new Link: - http://www.guimp.com/home.html
      [+] Crawling http://www.guimp.com/home.html
      Inserting new Link: - http://www.guimp.com/home.html/pong.html
      Inserting new Link: - http://www.guimp.com/home.html/blog.html
      [+] Crawling http://www.guimp.com/home.html/pong.html
      [!] Didn't Find any Links in http://www.guimp.com/home.html/pong.html!
      [+] Crawling http://www.guimp.com/home.html/blog.html
      [!] Didn't Find any Links in http://www.guimp.com/home.html/blog.html!
      [+] Crawling http://guimp.com//home.html/pong.html
      [!] Didn't Find any Links in http://guimp.com//home.html/pong.html!
      [+] Crawling http://guimp.com//home.html/blog.html
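The output contains URLs such as http://guimp.com//home.html/pong.html, which come from appending each href to the full URL of the page it was found on. One possible way to resolve relative links more carefully is sketched below. resolveUrl is a hypothetical name, not a function from the program above, and the sketch covers only the common cases (absolute, protocol-relative, root-relative and document-relative links). It does not normalize ".." segments, and it treats a bare "www." link as a relative path.

```php
<?php
// Hypothetical helper: resolves $href against the page URL $base.
// A simplified sketch, not a full RFC 3986 implementation.
function resolveUrl(string $base, string $href): string
{
    // Already absolute: starts with a scheme such as http: or https:
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative link: //host/path
    if (strncmp($href, '//', 2) === 0) {
        return $scheme . ':' . $href;
    }

    // Root-relative link: /path
    if (strncmp($href, '/', 1) === 0) {
        return $origin . $href;
    }

    // Document-relative link: keep the directory part of the base path
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolveUrl('http://guimp.com/', 'home.html'), "\n";          // http://guimp.com/home.html
echo resolveUrl('http://guimp.com/home.html', 'pong.html'), "\n"; // http://guimp.com/pong.html
echo resolveUrl('https://example.com/a/', '//cdn.example.com/x.js'), "\n"; // https://cdn.example.com/x.js
```

Because the scheme check also matches links such as mailto:, those are returned unchanged; a crawler would still need to skip them before fetching.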
      
      
      
                         

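The recursive crawl above tracks visited links with in_array and stops at a fixed depth. The same idea can be sketched with array keys acting as a set, and with the page download passed in as a callable so that the logic can be tried offline. Everything here is an assumption made for illustration: the crawl function, the $fetchLinks callable and the site.test URLs are not part of the original program.

```php
<?php
// Hypothetical sketch: depth-limited recursive crawl with a visited set.
// $fetchLinks is an assumed callable returning the absolute links on a page.
function crawl(string $url, callable $fetchLinks, int $maxDepth, array &$visited, int $depth = 0): void
{
    if ($depth > $maxDepth || isset($visited[$url])) {
        return;   // too deep, or this URL was already crawled
    }

    // Array keys act as a set; the value is the depth at which
    // this depth-first walk first reached the URL
    $visited[$url] = $depth;

    foreach ($fetchLinks($url) as $link) {
        crawl($link, $fetchLinks, $maxDepth, $visited, $depth + 1);
    }
}

// A tiny in-memory "website" standing in for real HTTP requests
$fakeSite = [
    'http://site.test/'  => ['http://site.test/a', 'http://site.test/b'],
    'http://site.test/a' => ['http://site.test/', 'http://site.test/b'],
    'http://site.test/b' => ['http://site.test/c'],
    'http://site.test/c' => [],
];
$fetch = function (string $url) use ($fakeSite): array {
    return $fakeSite[$url] ?? [];
};

$visited = [];
crawl('http://site.test/', $fetch, 5, $visited);
print_r($visited);
```

Looking up a key with isset stays fast as the set grows, whereas in_array scans the whole list on every check. In a real crawler the callable would download the page and return its resolved links.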