Word prediction using N-gram and CDF concepts



Problem statement. Given an input word and a text file, predict the next n words that may appear after the input word in the text file.

Examples:

  Input:  is
  Output: is it simply makes sure that there are never

  Input:  is
  Output: is split, all the maximum amount of objects, it

  Input:  the
  Output: the exact same position. There will be some.

Note. To illustrate the example, I've assigned some sample text to a variable in the code. If you want to check the algorithm against real text data, you can find it here.

Solution — We can approach this problem using concepts from probability. First, we calculate the frequency of every word that occurs immediately after the input in the text file (an n-gram; here a 1-gram, because we always look only one word ahead in the data file). Then, from those frequencies, we compute the CDF of each of those words and pick a word at random: we draw a uniform random number and select the word with the smallest CDF value greater than or equal to it. This is inverse transform sampling. Because the CDF gives the cumulative probability of each word in the list, the draw picks each word with probability proportional to its frequency, so the most common successors are chosen most often.
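To make the two steps concrete, here is a small self-contained sketch of the counting and sampling logic. The successor words and counts below are invented for the example; this is a sketch, not the full program:

import random
from collections import Counter

# Hypothetical list of words observed right after some input word.
successors = ['split', 'split', 'it', 'the']
freqs = Counter(successors)              # {'split': 2, 'it': 1, 'the': 1}
total = sum(freqs.values())              # 4

# Build the CDF: 'split' -> 0.50, 'it' -> 0.75, 'the' -> 1.00
cdf, running = {}, 0.0
for word, count in freqs.items():
    running += count / total
    cdf[word] = running

# Inverse transform sampling: draw a uniform number in [0, 1]
# and pick the first word whose CDF is >= that number.
r = random.uniform(0, 1)
predicted = next(w for w, c in cdf.items() if r <= c)
print(predicted)                         # 'split' about half the time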

Having found the CDF, we can easily find the matching word and append it to the output string. Then, if you want, you can feed the predicted word back as the new input and repeat the process to find the next word, or you can simply output the single word found using the CDF. I used the former approach; a sketch of it follows below.
Note. If you enter the same word multiple times, you may get a different output each time, because the next word is sampled at random. How much the outputs vary depends on the size of your data file: the larger the file, the more likely you are to see a different output.
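Here is a minimal sketch of that repeat-the-process loop. The token list is made up, and random.choice over the list of successor occurrences is equivalent to the frequency-weighted CDF draw, because a word that occurs twice as often appears twice as often in the list:

import random

def generate(start, n_words, tokens):
    out, context = [start], start
    for _ in range(n_words - 1):
        # All words that immediately follow the current context word.
        successors = [tokens[i + 1] for i in range(len(tokens) - 1)
                      if tokens[i].lower() == context.lower()]
        if not successors:                  # context never occurs in the corpus
            break
        context = random.choice(successors)  # frequency-weighted draw
        out.append(context)
    return ' '.join(out)

tokens = 'it is split again and it is fine but it is split'.split()
print(generate('is', 6, tokens))             # e.g. 'is split again and it is'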

Code for the above algorithm

import random
from collections import Counter


# This function calculates the frequency of the (i + 1)-th
# word in the whole corpus, where i is the index at which
# the input sentence / word ends.
def next_word_freq(array, sentence):

    sen_len, word_list = len(sentence.split()), []

    for i in range(len(array)):

        # If the sentence matches the words in the range
        # (i, i + sen_len) and the match ends before the
        # end of the corpus, add the following word to
        # word_list.
        if ' '.join(array[i: i + sen_len]).lower() == sentence.lower():
            if i + sen_len < len(array) - 1:
                word_list.append(array[i + sen_len])

    # Return the count of each word in word_list
    return dict(Counter(word_list))


# Calculate the CDF of each word in the counter dictionary.
def CDF(d):

    prob_sum, sum_vals = 0, sum(d.values())

    for k, v in d.items():

        # Calculate the PMF of each word by dividing its
        # frequency by the sum of all frequencies, then add
        # up the PMFs of all words up to and including this
        # one; that running sum is the CDF of this word.
        pmf = v / sum_vals
        prob_sum += pmf
        d[k] = prob_sum

    # Return the cdf dictionary
    return d


# The main function takes the sentence / word as input
# from the user and reads the corpus. For faster processing,
# only a small sample text is used here.
def main(sent, x, n):

    # I'm using this sample text to illustrate the output.
    # If you want to use a text file instead, uncomment the
    # line below to read the corpus from the file.

    # corpus = open('a.txt', 'r').read()

    corpus = '''Unlikely chance unless you do it programmatically.
    However, imagine the game spawns multiple players at the spawn point,
    this would be exactly the same place... I'm not entirely sure if you are
    What does an integer with a spin mean? Why is this
    a mismatch between data and structure? The structure does not
    accept a given number of objects, it could be anything, so the new
    nodes have been created. It just makes sure there are no more
    X leaves within 1 node. Random is not an option, of course.
    My splitting algorithm always created the maximum number of nodes
    already split the current node. But I think I should change
    is this behavior? In fact, all books have different authors. And
    most have a different place too. There will be some with the same
    location, but different authors though. I think my library should
    be able to store books from the same position. It never happens
    equally attractive leaf nodes. If the node is split, all child elements will
    reflect a different part of the parent node.'''

    l = corpus.split()

    # "temp_out" stores each predicted word, which is then
    # assigned back to "sent". "out" stores the final output.
    temp_out = ''
    out = sent + ' '

    for i in range(n - x):

        # Call the next_word_freq method, which returns the
        # frequency of each word that appears right after
        # "sent" in the whole word corpus.
        func_out = next_word_freq(l, sent)

        # cdf_dict stores the cdf of each word in the map
        # above, calculated using the CDF method.
        cdf_dict = CDF(func_out)

        # We use a random number to predict the next word:
        # the word whose CDF is the smallest value greater
        # than or equal to rand (and at most 1).
        rand = random.uniform(0, 1)

        # If cdf_dict is empty, the word you entered does not
        # exist in the corpus, so break the loop and just print
        # what we have so far. A try-except block implements
        # this: if there are not enough values to unpack, the
        # input is missing from the corpus.
        try:
            key, val = zip(*cdf_dict.items())
        except ValueError:
            break

        # Loop through the cdf values and find the smallest
        # value greater than or equal to the random number.
        # That is the cdf value of the predicted word. Append
        # the corresponding key to the output string and
        # update "sent" with "temp_out".
        for j in range(len(val)):
            if rand <= val[j]:
                pos = j
                break

        temp_out = key[pos]
        out = out + temp_out + ' '
        sent = temp_out

    print(out, end=' ')


if __name__ == '__main__':

    inp_sent = 'is'

    # The output will be 10 words, including the input
    # sentence / word. Note: x must be the number of words
    # in the input, hence the split().
    main(inp_sent, len(inp_sent.split()), 10)

# Code provided by Gagan Talreya.
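Since x is the number of words in the input, you can also seed the prediction with a multi-word phrase, as long as the phrase actually occurs in the corpus. A quick usage sketch with the sample text above:

# Predict until the output is 12 words long, starting
# from a two-word phrase that occurs in the sample text.
inp_sent = 'the same'
main(inp_sent, len(inp_sent.split()), 12)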

The above concept is used in areas such as Natural Language Processing. This is a naive approach, meant only to illustrate the concept; in practice, there are many more sophisticated algorithms for word prediction. You can find one of them here.