Python | A program to scan a web page and get the most frequent words


First, create a web crawler using the requests module and the beautifulsoup4 module to fetch the text of a web page and store its words in a list. Unwanted words or characters (such as special characters or extra spaces) can then be filtered out to reduce noise. After counting each word, we can also print the most frequent words (say, the top 10 or 20).

Library modules and functions used:

requests : Allows you to send HTTP/1.1 requests and more.
beautifulsoup4 : For pulling data out of HTML and XML files.
operator : Exports a set of efficient functions corresponding to the intrinsic operators of Python.
collections : Implements high-performance container datatypes.
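As a quick standalone illustration of the last two modules (not part of the crawler itself), Counter can count hashable items and report the most frequent ones, while operator.itemgetter(1) can serve as a sort key that orders (word, count) pairs by their counts:

```python
from collections import Counter
import operator

# Counter tallies each item; most_common(n) returns the n top (item, count) pairs
words = ['to', 'in', 'to', 'is', 'to', 'in']
c = Counter(words)
print(c.most_common(2))  # [('to', 3), ('in', 2)]

# operator.itemgetter(1) extracts the count from each (word, count) pair,
# so sorted() orders the pairs by count (ascending here)
pairs = sorted(c.items(), key=operator.itemgetter(1))
print(pairs)  # [('is', 1), ('in', 2), ('to', 3)]
```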

Below is the implementation of the discussed idea:

# Python3 program to count word frequency
# after crawling a web page

import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter

'''Function defining the web crawler / core
spider, which will fetch information from
a given website and pass the contents to
the second function clean_wordlist()'''
def start(url):

    # empty list to store the contents of
    # the website fetched by our web crawler
    wordlist = []
    source_code = requests.get(url).text

    # BeautifulSoup object which will
    # ping the requested URL for data
    soup = BeautifulSoup(source_code, 'html.parser')

    # The text on this web page is stored in
    # <div> tags with the class <entry-content>
    for each_text in soup.findAll('div', {'class': 'entry-content'}):
        content = each_text.text

        # use split() to break the sentence into
        # words and convert them to lowercase
        words = content.lower().split()

        for each_word in words:
            wordlist.append(each_word)
        clean_wordlist(wordlist)

# Function removes all unnecessary characters
def clean_wordlist(wordlist):

    clean_list = []
    for word in wordlist:
        symbols = '!@#$%^&*()_-+={[}]|;:"<>?/., '

        for i in range(len(symbols)):
            word = word.replace(symbols[i], '')

        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)

# Creates a dictionary containing each word's
# count and the top occurring words
def create_dictionary(clean_list):
    word_count = {}

    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    ''' To print the count of every word in
    the crawled page ->

    # operator.itemgetter() takes one
    # parameter, either 1 (denotes keys)
    # or 0 (denotes the corresponding values)

    for key, value in sorted(word_count.items(),
                             key=operator.itemgetter(1)):
        print("%s: %s" % (key, value))

    <- '''

    c = Counter(word_count)

    # returns the most occurring elements
    top = c.most_common(10)
    print(top)

# Driver code
if __name__ == '__main__':
    start("https://www.python.engineering/programming-language-choose/amp/")

Output:

[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
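The cleaning and counting steps above can also be compressed using the re module together with Counter. This is a minimal alternative sketch operating on an in-memory string (the hypothetical helper top_words is not part of the program above, and the fetching step would stay the same):

```python
import re
from collections import Counter

def top_words(text, n=10):
    # \w+ keeps only alphanumeric runs, which drops the same
    # punctuation that clean_wordlist() strips character by character
    words = re.findall(r'\w+', text.lower())
    return Counter(words).most_common(n)

sample = "C is a language; Python is a language too."
print(top_words(sample, 3))
```

Because re.findall does the splitting and cleaning in one pass, this avoids the nested replace() loop, though the character set it keeps differs slightly from the explicit symbols string used above.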
