Python | A program to scan a web page and get the most frequent words

First, build a simple web crawler with the requests and Beautiful Soup modules; it fetches a web page and stores the words it finds in a list. The list may contain unwanted entries, such as special characters or empty strings, which are filtered out so that only real words are counted. After counting how often each word occurs, we can print the most frequent ones (say, the top 10 or 20).
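
The filter-and-count steps described above can be tried without any network access. The following is a minimal sketch, assuming the page text has already been extracted into a string. The sample sentence and the helper name top_words are made up for illustration, and str.strip is used here as one simple way to drop punctuation attached to a word; it is not necessarily how the full program does it.

```python
from collections import Counter
import string


def top_words(text, n=3):
    """Return the n most frequent words in text (hypothetical helper)."""
    words = []
    for raw in text.lower().split():
        # drop punctuation attached to either end of the word
        word = raw.strip(string.punctuation)
        if word:
            words.append(word)
    return Counter(words).most_common(n)


# made-up sample text standing in for a crawled page
sample = "Python is easy. Python is popular, and Python is free!"
print(top_words(sample, 2))
```

Returning the list instead of printing it makes the helper easy to reuse on text from any source.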

Library modules and functions used:

requests : Lets you send HTTP/1.1 requests and much more.
beautifulsoup4 : For pulling data out of HTML and XML files.
operator : Exports a set of efficient functions corresponding to the intrinsic operators.
collections : Implements high-performance container datatypes.
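
To make the roles of operator and collections more concrete, here is a small sketch on a hand-made dictionary (the counts are illustrative and not taken from a real page). operator.itemgetter(1) picks the count out of each (word, count) pair so the pairs can be sorted by it, while Counter.most_common returns the highest counts directly.

```python
import operator
from collections import Counter

# illustrative counts, not from a real page
word_count = {"python": 4, "code": 2, "web": 7}

# sort (word, count) pairs by count, lowest first
by_count = sorted(word_count.items(), key=operator.itemgetter(1))
print(by_count)  # [('code', 2), ('python', 4), ('web', 7)]

# Counter accepts an existing mapping of counts
top = Counter(word_count).most_common(2)
print(top)       # [('web', 7), ('python', 4)]
```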

Below is the implementation of the discussed idea:

# Python3 program for a word frequency
# counter after crawling a web page

import requests
from bs4 import BeautifulSoup
import operator  # only needed for the optional listing in create_dictionary()
from collections import Counter


def start(url):
    '''Web crawler / core spider: fetches the
    given page and passes its words on to the
    second function, clean_wordlist().'''

    # empty list to store the words
    # fetched by our web crawler
    wordlist = []
    source_code = requests.get(url).text

    # BeautifulSoup object that parses
    # the HTML returned for the URL
    soup = BeautifulSoup(source_code, 'html.parser')

    # The text on this web page is stored in
    # "div" tags with the class "entry-content"
    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text

        # convert the text to lowercase and
        # use split() to break it into words
        words = content.lower().split()

        for each_word in words:
            wordlist.append(each_word)

    clean_wordlist(wordlist)


# Removes all unwanted symbols from the words
def clean_wordlist(wordlist):
    symbols = '!@#$%^&*()_-+={[}]|;:"<>?/., '
    clean_list = []

    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')

        # skip words that were made up of symbols only
        if len(word) > 0:
            clean_list.append(word)

    create_dictionary(clean_list)


# Creates a dictionary holding the count of each
# word and prints the 10 most frequent words
def create_dictionary(clean_list):
    word_count = {}

    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    # To print the count of every word on the
    # crawled page, uncomment the loop below.
    # operator.itemgetter() takes one parameter:
    # 0 selects the key (the word) and 1 selects
    # the value (its count) of each item.
    #
    # for key, value in sorted(word_count.items(),
    #                          key=operator.itemgetter(1)):
    #     print("%s : %s" % (key, value))

    c = Counter(word_count)

    # returns the most common elements
    top = c.most_common(10)
    print(top)


# Driver code
if __name__ == '__main__':
    start("https://python.engineering/programming-language-choose/amp/")

Output:

[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
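
Several words in this output share the same count. On Python 3.7 and later, Counter.most_common lists elements with equal counts in the order in which they were first encountered. The sketch below illustrates this with a made-up word list; it is only meant as an aid for reading tied results, not as part of the program above.

```python
from collections import Counter

# made-up word list: "the", "a" and "c" each occur twice
words = ["the", "a", "c", "a", "the", "c", "you"]

ranked = Counter(words).most_common(3)
print(ranked)  # [('the', 2), ('a', 2), ('c', 2)]
```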
