Bag of Words (BoW) Model in NLP



The model can be visualized as a table in which each row is a sentence (or document), each column is a word from the vocabulary, and each cell holds the number of times that word occurs in that sentence.
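
For example, for the two sentences "the cat sat" and "the cat sat on the mat", the table looks like this:

Sentence                  the  cat  sat  on  mat
the cat sat                1    1    1    0    0
the cat sat on the mat     2    1    1    1    1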

Applying the Bag of Words model:

Let's take this example paragraph for our task:

Beans. I was trying to explain to somebody as we were flying in, that's corn. That's beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we're lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven't seen in a long time, and somehow he has not aged and I have. And it's great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn't speak at the commencement.

Step #1: First, we'll preprocess the text to:

  • convert the text to lowercase,
  • remove all non-word characters,
  • remove all punctuation marks.

# Python3 code for text preprocessing
import nltk
import re
import numpy as np

# Insert the text to be processed here, e.g.:
# text = """ place text here """

# Split the text into sentences
# (requires the 'punkt' resource: nltk.download('punkt'))
dataset = nltk.sent_tokenize(text)

for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()               # convert to lowercase
    dataset[i] = re.sub(r'\W', ' ', dataset[i])   # replace non-word characters with spaces
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse repeated whitespace

Output:

Pre-processed text
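
For example, after these steps the first sentence "Beans." becomes "beans", and "that's corn" comes out as "that s corn", since the apostrophe is a non-word character and is replaced with a space.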

You can process the text further according to your needs.
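
One common extra step is stop word removal. Here is a minimal sketch using NLTK's English stop word list (this assumes the stopwords corpus has been downloaded with nltk.download('stopwords')):

# Optional: remove English stop words from each sentence
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
for i in range(len(dataset)):
    words = [w for w in dataset[i].split() if w not in stop_words]
    dataset[i] = ' '.join(words)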

Step #2: Getting the most common words in our text.

We will apply the following steps to generate our model.

  • We declare a dictionary to store our bag of words.
  • Next, we break each sentence into words.
  • Now, for each word in the sentence, we check whether it already exists in our dictionary.
  • If it does, we increase its count by 1. If it does not, we add it to the dictionary and set its count to 1.

# Create the Bag of Words model
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count.keys():
            word2count[word] = 1    # first occurrence of this word
        else:
            word2count[word] += 1   # word already seen; increment its count

Output:

Bag of Words Dictionary
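
As a side note, the same counts can be built with the standard library's collections.Counter; a minimal equivalent sketch:

# Equivalent counting with collections.Counter
from collections import Counter

word2count = Counter()
for data in dataset:
    word2count.update(nltk.word_tokenize(data))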

We have 118 words in our model. However, when processing large texts, the number of words can reach millions. We don't need to use all of them, so we select a certain number of the most frequently used words. To do this we use:

import heapq

freq_words = heapq.nlargest(100, word2count, key=word2count.get)

where 100 is the number of words we want to keep. For larger texts, we pick a larger number.

100 most frequent words
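
If word2count is a collections.Counter (as in the sketch above), the same selection can also be written with Counter.most_common, which returns (word, count) pairs rather than bare words:

# Top 100 words via Counter.most_common
freq_words = [word for word, count in word2count.most_common(100)]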

Step #3: Building the Bag of Words model

In this step, we build a vector for each sentence that records, for each frequent word, whether it appears in that sentence. If a frequent word occurs in the sentence, we set its entry to 1; otherwise we set it to 0.

This can be done with the following code:

X = []
for data in dataset:
    vector = []
    for word in freq_words:
        # 1 if this frequent word occurs in the sentence, 0 otherwise
        if word in nltk.word_tokenize(data):
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)

X = np.asarray(X)

Output:

BoW Model
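
For comparison, scikit-learn provides this model out of the box. A minimal sketch, assuming scikit-learn is installed (binary=True reproduces the 0/1 vectors above; note that CountVectorizer's built-in tokenizer differs slightly from NLTK's):

# Equivalent model with scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=100, binary=True)
X = vectorizer.fit_transform(dataset).toarray()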