NLP | IOB tags

What are chunks?
Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. It is also possible to define a pattern of words that cannot be part of a chunk; such words are known as chinks.
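To make the chunk and chink idea concrete, here is a minimal plain-Python sketch. It is a simplification: real chunkers match patterns over tag sequences, while this one only splits a tagged sentence wherever an excluded tag (a chink) appears. The helper name split_on_chinks and the sample sentence are assumptions made for illustration and are not part of NLTK.

```python
def split_on_chinks(tagged, chink_tags):
    """Split a tagged sentence into chunks, leaving out chink tokens.

    tagged     -- list of (word, pos) pairs
    chink_tags -- set of part-of-speech tags that may not be inside a chunk
    """
    chunks, current = [], []
    for word, pos in tagged:
        if pos in chink_tags:
            # A chink closes the chunk collected so far and is itself dropped.
            if current:
                chunks.append(current)
                current = []
        else:
            current.append((word, pos))
    if current:
        chunks.append(current)
    return chunks


# Illustrative sentence (not taken from the corpus used later on).
tagged = [("the", "DT"), ("book", "NN"), ("has", "VBZ"),
          ("many", "JJ"), ("chapters", "NNS")]

chunks = split_on_chinks(tagged, {"VBZ"})
for chunk in chunks:
    print(chunk)
```

Here the verb "has" acts as the chink, so the sentence falls apart into two chunks, one on each side of it.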

What are IOB tags?
IOB is a chunk format. IOB tags are similar to part-of-speech tags, but they denote whether a word is inside, outside, or at the beginning of a chunk. Noun phrases are not the only chunk type allowed; several different chunk types can be tagged.

Example: this is a fragment from the conll2000 corpus. Each word is on its own line, followed by its part-of-speech tag and then its IOB tag:

Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
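The file layout is simple enough to parse by hand. The sketch below is only meant to show what the three columns mean; it assumes whitespace-separated columns and blank lines between sentences, and parse_iob_lines is a hypothetical helper, not an NLTK function.

```python
SAMPLE = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
"""


def parse_iob_lines(text):
    """Return one list of (word, pos, iob) triples per sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_iob_lines(SAMPLE)
print(sentences[0][:2])
```

Each triple has the same (word, pos, iob) shape as the tuples discussed in the rest of this article.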

What does this mean?
B-NP: the word begins a noun phrase.
I-NP: the word is inside the current noun phrase.
O: the word is outside any chunk, such as the period at the end of the sentence.
B-VP and I-VP: the word begins, or is inside, a verb phrase.
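The tag meanings above are enough to rebuild the chunks from a tagged sentence. The following is a rough plain-Python sketch of that decoding step; iob_to_chunks is a hypothetical helper written for this example, and the choice to let a stray I- tag open a new chunk is an assumption, not documented NLTK behaviour.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs.

    Tokens tagged O are returned with chunk type None.
    """
    chunks = []
    for word, _pos, iob in triples:
        if iob == "O":
            # Outside any chunk: keep the token on its own.
            chunks.append((None, [word]))
        elif iob.startswith("B-") or not chunks or chunks[-1][0] != iob[2:]:
            # B- opens a new chunk; so does an I- tag that does not
            # continue a chunk of the same type (assumption, see above).
            chunks.append((iob[2:], [word]))
        else:
            # I- tag continuing the current chunk.
            chunks[-1][1].append(word)
    return chunks


sentence = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]

chunks = iob_to_chunks(sentence)
for chunk_type, words in chunks:
    print(chunk_type, words)
```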

Code #1: Reading words in chunks using IOB tags.

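# Load the reader class
from nltk.corpus.reader import ConllChunkCorpusReader

# Initialize the reader: corpus root, file-ID pattern, chunk types
reader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Chunks as trees of tagged tokens
reader.chunked_words()

# Flat list of (word, pos, iob) triples
reader.iob_words()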

Output:

[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), ...]
[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]

Code #2: Chunking sentences using IOB tags.

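# Load the reader class
from nltk.corpus.reader import ConllChunkCorpusReader

# Initialize the reader: corpus root, file-ID pattern, chunk types
reader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))

# One tree per sentence, with the chunks as subtrees
reader.chunked_sents()

# One list of (word, pos, iob) triples per sentence
reader.iob_sents()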

Output:

[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
            Tree('VP', [('had', 'VBD'), ('been', 'VBN')]),
            Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]),
            Tree('PP', [('of', 'IN')]),
            Tree('NP', [('Balcor', 'NNP')]),
            ('.', '.')])]
[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'),
  ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'),
  ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]]

Let's look at the code above:

  • The ConllChunkCorpusReader class is used to read a corpus in IOB format.
  • Paragraphs are not marked in the file; sentences are separated by blank lines, so the para_* methods are not available.
  • The third argument to ConllChunkCorpusReader is a tuple or list naming the chunk types in the file, such as ('NP', 'VP', 'PP').
  • The iob_words() and iob_sents() methods return lists of 3-tuples (word, pos, iob); iob_sents() returns one such list per sentence.
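Because every token comes back as a plain (word, pos, iob) triple, the result is easy to process with ordinary Python. As a small illustration, the sketch below counts the chunks of each type in one sentence by counting B- tags. The triples are typed in by hand from the output above so that the example runs without a corpus file, and count_chunks is a hypothetical helper.

```python
from collections import Counter

# One sentence in the shape returned per sentence by iob_sents(),
# copied by hand from the output shown above.
iob_sent = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]


def count_chunks(triples):
    """Count chunks per type; every chunk starts with exactly one B- tag."""
    return Counter(iob[2:] for _word, _pos, iob in triples
                   if iob.startswith("B-"))


counts = count_chunks(iob_sent)
print(dict(counts))  # {'NP': 3, 'VP': 1, 'PP': 1}
```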

Code #3: Tree leaves, i.e. the tagged tokens.

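# Load the reader class
from nltk.corpus.reader import ConllChunkCorpusReader

# Initialize the reader: corpus root, file-ID pattern, chunk types
reader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Tagged tokens of the first chunk
reader.chunked_words()[0].leaves()

# Tagged tokens of the first sentence
reader.chunked_sents()[0].leaves()

# Tagged tokens of the first sentence of the first paragraph.
# chunked_paras() needs a reader that tracks paragraphs, such as
# ChunkedCorpusReader; ConllChunkCorpusReader has no para_* methods,
# so the call is left commented out here.
# reader.chunked_paras()[0][0].leaves()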

Output:

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
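What leaves() does can be pictured as flattening a chunked sentence back into its tagged tokens. The sketch below mimics this with nested tuples and lists standing in for nltk.Tree objects; the data layout and the flatten_leaves helper are assumptions made for illustration only, and the sentence is the "Mr. Meador" example from Code #2.

```python
# A chunk is (label, [tagged tokens]); a token outside any chunk is (word, pos).
chunked_sent = [
    ("NP", [("Mr.", "NNP"), ("Meador", "NNP")]),
    ("VP", [("had", "VBD"), ("been", "VBN")]),
    ("NP", [("executive", "JJ"), ("vice", "NN"), ("president", "NN")]),
    ("PP", [("of", "IN")]),
    ("NP", [("Balcor", "NNP")]),
    (".", "."),
]


def flatten_leaves(chunked):
    """Return the tagged tokens of a chunked sentence in order."""
    flat = []
    for item in chunked:
        if isinstance(item[1], list):
            # A chunk: take its tagged tokens and drop the chunk label.
            flat.extend(item[1])
        else:
            # A token outside any chunk is already a leaf.
            flat.append(item)
    return flat


tokens = flatten_leaves(chunked_sent)
print(tokens[:3])
```

The chunk labels disappear and only the (word, pos) pairs remain, which is why the leaves of a whole sentence read like ordinary tagged text.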