Note: In this article, I will use the words spider-bot and web-crawler interchangeably. Some people may use them in other contexts, but in this article both words mean the same thing.
There are many things you can do to make such spider bots more advanced: add features like maintaining a popularity index, or implement anti-spam measures such as penalizing websites with no content, or websites that use "bait-and-switch" strategies like stuffing in keywords that have nothing to do with the page's content! Alternatively, you can try to generate the keywords and description from the page itself, which Googlebot does all the time.
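Generating keywords and a description from a page can be sketched with the same DOM machinery the bot below uses. Here is a minimal, illustrative version; the name getPageSummary is made up for this sketch and is not part of the bot's code:

```php
<?php
// Hypothetical helper: build an index entry (title, description,
// keywords) from a page's markup, roughly the way a search-engine
// crawler might when it indexes a page.
function getPageSummary($html) {
    $doc = new DomDocument();
    @$doc->loadHTML($html); // suppress warnings on malformed HTML

    $summary = array('title' => '', 'description' => '', 'keywords' => '');

    // The <title> tag usually makes a good headline for the entry
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0)
        $summary['title'] = trim($titles->item(0)->textContent);

    // Fall back on the page's own <meta> description/keywords, if present
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name == 'description' or $name == 'keywords')
            $summary[$name] = $meta->getAttribute('content');
    }

    return $summary;
}
?>
```

A real crawler would then score and store these fields; comparing the declared keywords against the actual page text is one simple anti-spam check.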
Simple SpiderBot: The simple version will not be recursive; it will simply print all the links it finds on the web page. Note that all of our main logic happens in the followLink function!
- Program:

<?php
function followLink($url) {

    // We need these options when creating the context
    $options = array(
        'http' => array(
            'method'     => "GET",
            'user-agent' => "gfgBot/0.1"
        )
    );

    // Create the context for the communication
    $context = stream_context_create($options);

    // Create a new HTML DomDocument for web scraping
    $doc = new DomDocument();
    @$doc->loadHTML(file_get_contents($url, false, $context));

    // Get all the anchor nodes in the DOM
    $links = $doc->getElementsByTagName('a');

    // Iterate over all the anchor nodes
    // found in the document
    foreach ($links as $i)
        echo $i->getAttribute('href') . '<br/>';
}

followLink("http://example.com");
?>
Output:

https://www.iana.org/domains/example

Now, this was not very exciting: we only got one link. That is because there is only one link on example.com, and since we are not recursing, we do not follow the link that we got. You can run followLink("http://apple.com") if you want to see it in action. However, if you try geeksforgeeks.org, you may get an error, as GeeksforGeeks will block our request (for security reasons, of course).
Explanation:
- Line 3: We create the array $options. You don't need to understand much about it, other than that it is required for creating the context. Note the user-agent name, gfgBot: you can change it to whatever you like. You could even use GoogleBot to trick a site into thinking your crawler is Google's spider-bot, if it uses this method for bot detection.
- Line 10: We create the context for the communication. For everything you need a context: to tell a story you need a context, to create a window in OpenGL you need a context, the same for HTML5 Canvas, and the same for PHP network communication! Sorry, I got out of "context" there, but I had to.
- Line 13: We create a DomDocument, which is basically a data structure for DOM handling, typically used for HTML and XML files.
- Line 14: We load the HTML by providing the contents of the document. This step may generate some warnings (for example, when the HTML is malformed), so we suppress all the warnings.
- Line 17: We create basically an array of all the anchor nodes that we find in the DOM.
- Line 21: We print all the links those anchor nodes reference.
- Line 24: We get all the links on example.com. It has only one link, which is printed.
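To see the user-agent trick from Lines 3 and 10 in isolation: a context is created once and can then be reused for every request. The Googlebot string below is purely illustrative; impersonating Google's crawler may get you blocked or breach a site's terms of use:

```php
<?php
// Standalone sketch: a context with a spoofed user-agent string
$options = array(
    'http' => array(
        'method'     => "GET",
        'user-agent' => "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        'timeout'    => 5 // don't hang forever on a slow server
    )
);

$context = stream_context_create($options);

// Every stream opened with this context sends that User-Agent header
$page = @file_get_contents("http://example.com", false, $context);
?>
```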
Slightly more complex Spider-Bot: In the previous code we had a basic spider-bot, and it was good, but it was more of a scraper than a crawler (for the difference between a scraper and a crawler, see this article). We did not recurse: we did not "follow" the links that we got. So in this iteration we will do exactly that, and we will also assume that we have a database into which we insert the links (for indexing). Any link will be inserted into the database through the insertIntoDatabase function!
- Program:
<?php

// List of all the links we have already crawled!
$crawledLinks = array();

function followLink($url, $depth = 0) {
    global $crawledLinks;
    $crawling = array();

    // Give up, to prevent a seemingly infinite loop
    if ($depth > 5) {
        echo " ";
        return;
    }

    $options = array(
        'http' => array(
            'method'     => "GET",
            'user-agent' => "gfgBot/0.1"
        )
    );

    $context = stream_context_create($options);

    $doc = new DomDocument();
    @$doc->loadHTML(file_get_contents($url, false, $context));

    $links = $doc->getElementsByTagName('a');

    foreach ($links as $i) {
        $link = $i->getAttribute('href');

        if (ignoreLink($link))
            continue;

        $link = convertLink($url, $link);

        if (!in_array($link, $crawledLinks)) {
            $crawledLinks[] = $link;
            $crawling[]     = $link;
            insertIntoDatabase($link, $depth);
        }
    }

    foreach ($crawling as $crawlURL) {
        echo("<p style='margin-left:" . (10 * $depth) . "px;'>" .
             "[+] Crawling <u>$crawlURL</u></p>");
        followLink($crawlURL, $depth + 1);
    }

    if (count($crawling) == 0)
        echo("<p style='margin-left:" . (10 * $depth) . "px;'>" .
             "[!] Didn't find any links in <u>$url!</u></p>");
}

// Converts a relative URL into an absolute one.
// No conversion is done if it is already an absolute URL.
function convertLink($site, $path) {
    if (substr_compare($path, "//", 0, 2) == 0)
        // Protocol-relative URL: prepend the scheme of the current site
        return parse_url($site)['scheme'] . ':' . $path;
    elseif (substr_compare($path, "http://", 0, 7) == 0
            or substr_compare($path, "https://", 0, 8) == 0
            or substr_compare($path, "www.", 0, 4) == 0)
        return $path; // Already an absolute URL!
    else
        return $site . '/' . $path;
}

// Do we want to ignore this link?
function ignoreLink($url) {
    return $url[0] == "#"
        or substr($url, 0, 11) == "javascript:";
}

// Print a message; the link itself was already recorded in
// $crawledLinks by followLink() before this function was called
function insertIntoDatabase($link, $depth) {
    echo("<p style='margin-left:" . (10 * $depth) . "px'>" .
         "Inserting new Link:- $link</p>");
}

followLink("http://guimp.com/");
?>
- Output:

Inserting new Link:- http://guimp.com//home.html
[+] Crawling http://guimp.com//home.html
Inserting new Link:- http://www.guimp.com
Inserting new Link:- http://guimp.com//home.html/pong.html
Inserting new Link:- http://guimp.com//home.html/blog.html
[+] Crawling http://www.guimp.com
Inserting new Link:- http://www.guimp.com/home.html
[+] Crawling http://www.guimp.com/home.html
Inserting new Link:- http://www.guimp.com/home.html/pong.html
Inserting new Link:- http://www.guimp.com/home.html/blog.html
[+] Crawling http://www.guimp.com/home.html/pong.html
[!] Didn't find any links in http://www.guimp.com/home.html/pong.html!
[+] Crawling http://www.guimp.com/home.html/blog.html
[!] Didn't find any links in http://www.guimp.com/home.html/blog.html!
[+] Crawling http://guimp.com//home.html/pong.html
[!] Didn't find any links in http://guimp.com//home.html/pong.html!
[+] Crawling http://guimp.com//home.html/blog.html
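The insertIntoDatabase function above only prints and records the link in an array. With an actual database behind it, a drop-in sketch might look like this; it assumes the pdo_sqlite extension is available, and the file name crawler.db and the table layout are my own, not from the article:

```php
<?php
// Hypothetical database-backed version of insertIntoDatabase(),
// using SQLite through PDO
function insertIntoDatabase($link, $depth) {
    static $db = null;
    if ($db === null) {
        $db = new PDO('sqlite:crawler.db');
        $db->exec("CREATE TABLE IF NOT EXISTS links (
                       url   TEXT PRIMARY KEY,
                       depth INTEGER
                   )");
    }

    // PRIMARY KEY on url plus INSERT OR IGNORE keeps the index free of
    // duplicate links, mirroring the in_array() check in followLink()
    $stmt = $db->prepare("INSERT OR IGNORE INTO links (url, depth)
                          VALUES (:url, :depth)");
    $stmt->execute(array(':url' => $link, ':depth' => $depth));
}
?>
```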