To better understand the motivation, please note that the user is requesting "St. Mary’s school". Now the inverted index will give us a list of documents containing the terms “saint,” “Mary,” and “school,” independently of each other. However, we really need documents in which the entire phrase "St. Mary’s school" occurs verbatim. To successfully respond to such queries, we need a document index, which also stores the positions of the terms.
Message list
In the case of an inverted index, the transaction list — this is a list of documents in which the term appears. It is usually sorted by document ID and saved as a linked list.
The above figure shows an example of a message list for the term "hello". This indicates that “hello” appears in documents with docIDs 3, 5, 10, 23 and 27. It also indicates the frequency of document 5 (highlighted in green). An example Python data format is provided that contains a dictionary and linked lists for storing a list of publications.
{"hello": [5, [3, 5, 10, 23, 27]]}
In the case of a positional index, the positions at which the term appears in a particular document are also stored along with the docID.
The above figure shows the same posting list implemented for a positional index. The blue rectangle-method/">rectangles indicate the position of the term "hello" in the respective documents. For example, “hello” appears in document 5 at three positions: 120, 125, and 278. In addition, the frequency of the term is stored for each document. An example Python data format for the same is given.
{"hello": [5, [{3: [3, [120, 125, 278]]}, {5: [1, [28] ]}, {10: [2, [132, 182]]}, {23: [3, [0, 12, 28]]}, {27: [1, [2]]}]}
You can also omit the term frequency in separate documents for simplicity (as in the example code). The data format is as follows.
{"hello": [5, {3: [120, 125, 278]}, {5: [28]}, {10: [132, 182]} , {23: [0, 12, 28]}, {27: [2]}]}
Steps to build a positional index
- Retrieve document.
- Remove stop words, trim resulting words
- If the word is already in the dictionary, add the document and the corresponding positions in which it appears. Otherwise, create a new entry.
- Also update the word frequency for each document and also not. documents it appears in.
Code
To implement a positional index, we use an example dataset called "20 newsgroups."
|
Exit:
Positional Index [10, {215: [2081] , 539: [66], 591: [879], 616: [462, 473], 680: [135], 691: [2081], 714: [4], 809: [333], 979: [0] }] Filename, [Positions] 20_newsgroups / comp.graphics / 38376 [2081] 20_newsgroups / comp.graphics / 38701 [66] 20_newsgroups / comp.graphics / 38753 [879] 20_newsgroups / comp.graphics / 38778 [462, 473] 20_newsgroups /comp.graphics/38842 [135] 20_newsgroups / comp.graphics / 38853 [2081] 20_newsgroups / comp.graphics / 38876 [4] 20_newsgroups / comp.graphics / 38971 [333] 20_newsgroups / comp.graphics / 39663 [0]