Python | Positional index



To better understand the motivation, please note that the user is requesting “St. Mary`s school”. Now the inverted index will give us a list of documents containing the terms “saint,” “Mary,” and “school,” independently of each other. However, we really need documents in which the entire phrase “St. Mary`s school” occurs verbatim. To successfully respond to such queries, we need a document index, which also stores the positions of the terms.

Message list
In the case of an inverted index, the transaction list — this is a list of documents in which the term appears. It is usually sorted by document ID and saved as a linked list.

The above figure shows an example of a message list for the term “hello”. This indicates that “hello” appears in documents with docIDs 3, 5, 10, 23 and 27. It also indicates the frequency of document 5 (highlighted in green). An example Python data format is provided that contains a dictionary and linked lists for storing a list of publications.

 {"hello": [5, [3, 5, 10, 23, 27]]} 

In the case of a positional index, the positions at which the term appears in a particular document are also stored along with the docID.

The above figure shows the same posting list implemented for a positional index. The blue rectangles indicate the position of the term “hello” in the respective documents. For example, “hello” appears in document 5 at three positions: 120, 125, and 278. In addition, the frequency of the term is stored for each document. An example Python data format for the same is given.

 {"hello": [5, [{3: [3, [120, 125, 278]]}, {5: [1, [28] ]}, {10: [2, [132, 182]]}, {23: [3, [0, 12, 28]]}, {27: [1, [2]]}]} 

You can also omit the term frequency in separate documents for simplicity (as in the example code). The data format is as follows.

 {"hello": [5, {3: [120, 125, 278]}, {5: [28]}, {10: [132, 182]} , {23: [0, 12, 28]}, {27: [2]}]} 

Steps to build a positional index

  • Retrieve document.
  • Remove stop words, trim resulting words
  • If the word is already in the dictionary, add the document and the corresponding positions in which it appears. Otherwise, create a new entry.
  • Also update the word frequency for each document and also not. documents it appears in.

Code
To implement a positional index, we use an example dataset called “20 newsgroups.”

# importing libraries

import numpy as np

import os

import nltk

from nltk.stem import PorterStemmer

from nltk.tokenize import TweetTokenizer

from natsort import natsorted

import string

 

def read_file (filename):

with open (filename, `r` , encoding = "ascii" , errors = "surrogateescape" ) as f:

stuff = f.read ()

 

f.close ()

 

# Remove header and footer.

stuff = remove_header_footer (stuff)

 

return stuff

 

def remove_header_footer (final_string):

new_final_string = ""

tokens = final_string.split ( ` ` )

  

# Remove [0] tokens and [-1] tokens

for token in tokens [ 1 : - 1 ]:

  new_final_string + = token + ""

return new_final_string

 

def preprocessing (final_string):

# Tokenize.

tokenizer = TweetTokenizer ()

token_list = tokenizer.tokenize (final_string)

  

  Remove punctuation marks.

table = str . maketrans (` `, ` `, ` `)

  token_list = [word.translate (table) for word in token_list]

punctuations = (string.punctuation) .replace ( "" " ," ")

  trans_table = str . maketrans (` `,` `, punctuations)

stripped_words = [word.translate (trans_table) for word in token_list]

token_list = [ str for str in stripped_words if str ]

 

# Change to lowercase.

token_list = [word.lower () for word in token_list]

return token_list

 
# In this example, we create a positional index for folder 1 only.

folder_names = [ "comp.graphics" ]

 
# Initialize stemmer.

stemmer = PorterStemmer ()

 
# Initialize file number

fileno = 0

 
# Initialize the dictionary.

pos_index = {}

 
# Initialize file association (fileno - & gt; filename).

file_map = { }

 

for folder_name in folder_names:

  

# Open files.

file_names = natsorted (os.listdir ( "20_newsgroups /" + folder_name))

 

# For each file.

for file_name in file_names:

 

# Read file content.

  stuff = read_file ( " 20_newsgroups / " + folder_name + "/" + file_name)

 

# This is a list of words in text order.

# We need to keep the order because we want positions.

# The function "pre liner processing "performs basic punctuation removal,

  # remove stop words, etc. .

final_token_list = preprocessing (stuff)

 

# For position and term in tokens.

  for pos, term in enumerate (final_token_list):

 

# Stop the term first.

  term = stemmer.stem (term)

 

# If the term already exists in the positional index dictionary.

  if term in pos_index:

 

# Increase the overall frequency by 1.

pos_index [term] [ 0 ] = pos_index [term] [ 0 ] + 1

 

# Check if the term previously existed in this DocID.

  if fileno in pos_index [term] [ 1 ]:

  pos_index [term] [ 1 ] [fileno] .append (pos)

 

else :

  pos_index [term] [ 1 ] [fileno] = [pos]

 

# If the term does not exist in positional index dictionary

# (first meeting).

else :

 

# Initialize the list.

pos_index [term] = []

# Common frequency 1.

pos_index [term] .append ( 1 )

# The message list is initially empty.

pos_index [term] .append ({}) 

# Add document ID to message list.

pos_index [term] [ 1 ] [fileno] = [pos]

 

# No file map. to the file name.

file_map [fileno] = " 20_newsgroups / " + folder_name + "/" + file_name

 

# Increase the file number. counter for displaying document ID

fileno + = 1

  
# Example positional index for code review.

sample_pos_idx = pos_index [ " andrew " ]

print ( "Positional Index" )

print (sample_pos_idx)

 

file_list = sample_pos_idx [ 1 ]

print ( "Filename, [Positions]" )

for fileno,positions in file_list.items ():

print (file_map [fileno], positions)

Exit:

 Positional Index [10, {215: [2081] , 539: [66], 591: [879], 616: [462, 473], 680: [135], 691: [2081], 714: [4], 809: [333], 979: [0] }] Filename, [Positions] 20_newsgroups / comp.graphics / 38376 [2081] 20_newsgroups / comp.graphics / 38701 [66] 20_newsgroups / comp.graphics / 38753 [879] 20_newsgroups / comp.graphics / 38778 [462, 473] 20_newsgroups /comp.graphics/38842 [135] 20_newsgroups / comp.graphics / 38853 [2081] 20_newsgroups / comp.graphics / 38876 [4] 20_newsgroups / comp.graphics / 38971 [333] 20_newsgroups / comp.graphics / 39663 [0]