Change language

Python | Positional index

| | |

To better understand the motivation, please note that the user is requesting "St. Mary’s school". Now the inverted index will give us a list of documents containing the terms “saint,” “Mary,” and “school,” independently of each other. However, we really need documents in which the entire phrase "St. Mary’s school" occurs verbatim. To successfully respond to such queries, we need a document index, which also stores the positions of the terms.

Message list
In the case of an inverted index, the transaction list — this is a list of documents in which the term appears. It is usually sorted by document ID and saved as a linked list.

The above figure shows an example of a message list for the term "hello". This indicates that “hello” appears in documents with docIDs 3, 5, 10, 23 and 27. It also indicates the frequency of document 5 (highlighted in green). An example Python data format is provided that contains a dictionary and linked lists for storing a list of publications.

 {"hello": [5, [3, 5, 10, 23, 27]]} 

In the case of a positional index, the positions at which the term appears in a particular document are also stored along with the docID.

The above figure shows the same posting list implemented for a positional index. The blue rectangle-method/">rectangles indicate the position of the term "hello" in the respective documents. For example, “hello” appears in document 5 at three positions: 120, 125, and 278. In addition, the frequency of the term is stored for each document. An example Python data format for the same is given.

 {"hello": [5, [{3: [3, [120, 125, 278]]}, {5: [1, [28] ]}, {10: [2, [132, 182]]}, {23: [3, [0, 12, 28]]}, {27: [1, [2]]}]} 

You can also omit the term frequency in separate documents for simplicity (as in the example code). The data format is as follows.

 {"hello": [5, {3: [120, 125, 278]}, {5: [28]}, {10: [132, 182]} , {23: [0, 12, 28]}, {27: [2]}]} 

Steps to build a positional index

  • Retrieve document.
  • Remove stop words, trim resulting words
  • If the word is already in the dictionary, add the document and the corresponding positions in which it appears. Otherwise, create a new entry.
  • Also update the word frequency for each document and also not. documents it appears in.

Code
To implement a positional index, we use an example dataset called "20 newsgroups."

# importing libraries

import numpy as np

import os

import nltk

from nltk.stem import PorterStemmer

from nltk.tokenize import TweetTokenizer

from natsort import natsorted

import string

 

def read_file (filename):

with open (filename, ’r’ , encoding = "ascii" , errors = "surrogateescape" ) as f:

stuff = f.read ()

 

f.close ()

 

# Remove header and footer.

stuff = remove_header_footer (stuff)

 

return stuff

 

def remove_header_footer (final_string):

new_final_string = ""

tokens = final_string.split ( ’ ’ )

  

# Remove [0] tokens and [-1] tokens

for token in tokens [ 1 : - 1 ]:

  new_final_string + = token + ""

return new_final_string

 

def preprocessing (final_string):

# Tokenize.

tokenizer = TweetTokenizer ()

token_list = tokenizer.tokenize (final_string)

  

  Remove punctuation marks.

table = str . maketrans (’ ’, ’ ’, ’ ’)

  token_list = [word.translate (table) for word in token_list]

punctuations = (string.punctuation) .replace ( "" " ," ")

  trans_table = str . maketrans (’ ’,’ ’, punctuations)

stripped_words = [word.translate (trans_table) for word in token_list]

token_list = [ str for str in stripped_words if str ]

 

# Change to lowercase.

token_list = [word.lower () for word in token_list]

return token_list

 
# In this example, we create a positional index for folder 1 only.

folder_names = [ "comp.graphics" ]

 
# Initialize stemmer.

stemmer = PorterStemmer ()

 
# Initialize file number

fileno = 0

 
# Initialize the dictionary.

pos_index = {}

 
# Initialize file association (fileno -" filename).

file_map = { }

 

for folder_name in folder_names:

  

# Open files.

file_names = natsorted (os.listdir ( "20_newsgroups /" + folder_name))

 

# For each file.

for file_name in file_names:

 

# Read file content.

  stuff = read_file ( " 20_newsgroups / " + folder_name + "/" + file_name)

 

# This is a list of words in text order.

# We need to keep the order because we want positions.

# The function "pre liner processing "performs basic punctuation removal,

  # remove stop words, etc. .

final_token_list = preprocessing (stuff)

 

# For position and term in tokens.

  for pos, term in enumerate (final_token_list):

 

# Stop the term first.

  term = stemmer.stem (term)

 

# If the term already exists in the positional index dictionary.

  if term in pos_index:

 

# Increase the overall frequency by 1.

pos_index [term] [ 0 ] = pos_index [term] [ 0 ] + 1

 

# Check if the term previously existed in this DocID.

  if fileno in pos_index [term] [ 1 ]:

  pos_index [term] [ 1 ] [fileno] .append (pos)

 

else :

  pos_index [term] [ 1 ] [fileno] = [pos]

 

# If the term does not exist in positional index dictionary

# (first meeting).

else :

 

# Initialize the list.

pos_index [term] = []

# Common frequency 1.

pos_index [term] .append ( 1 )

# The message list is initially empty.

pos_index [term] .append ({}) 

# Add document ID to message list.

pos_index [term] [ 1 ] [fileno] = [pos]

 

# No file map. to the file name.

file_map [fileno] = " 20_newsgroups / " + folder_name + "/" + file_name

 

# Increase the file number. counter for displaying document ID

fileno + = 1

  
# Example positional index for code review.

sample_pos_idx = pos_index [ " andrew " ]

print ( "Positional Index" )

print (sample_pos_idx)

 

file_list = sample_pos_idx [ 1 ]

print ( "Filename, [Positions]" )

for fileno,positions in file_list.items ():

print (file_map [fileno], positions)

Exit:

 Positional Index [10, {215: [2081] , 539: [66], 591: [879], 616: [462, 473], 680: [135], 691: [2081], 714: [4], 809: [333], 979: [0] }] Filename, [Positions] 20_newsgroups / comp.graphics / 38376 [2081] 20_newsgroups / comp.graphics / 38701 [66] 20_newsgroups / comp.graphics / 38753 [879] 20_newsgroups / comp.graphics / 38778 [462, 473] 20_newsgroups /comp.graphics/38842 [135] 20_newsgroups / comp.graphics / 38853 [2081] 20_newsgroups / comp.graphics / 38876 [4] 20_newsgroups / comp.graphics / 38971 [333] 20_newsgroups / comp.graphics / 39663 [0] 

Shop

Learn programming in R: courses

$

Best Python online courses for 2022

$

Best laptop for Fortnite

$

Best laptop for Excel

$

Best laptop for Solidworks

$

Best laptop for Roblox

$

Best computer for crypto mining

$

Best laptop for Sims 4

$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

12 answers

NUMPYNUMPY

How to specify multiple return types using type-hints

12 answers

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

12 answers

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

12 answers

NUMPYNUMPY

glob exclude pattern

12 answers

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

12 answers

NUMPYNUMPY

Python CSV error: line contains NULL byte

12 answers

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

12 answers

News


Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

sin

How to specify multiple return types using type-hints

exp

Printing words vertically in Python

exp

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

cos

Python add suffix / add prefix to strings in a list

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically