NLP | IOB tags


What are chunks?
Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. It is also possible to define a pattern of words that cannot be part of a chunk; such words are known as chinks.

What are IOB tags?
IOB is a format for chunks. IOB tags are similar to part-of-speech tags, but they denote whether a word is inside, outside, or at the beginning of a chunk. They are not limited to noun phrases; several different chunk types can be marked.

Example: the following fragment is from the conll2000 corpus. Each word is on its own line, followed by its part-of-speech tag and then its IOB tag:

 Mr. NNP B-NP
 Meador NNP I-NP
 had VBD B-VP
 been VBN I-VP
 executive JJ B-NP
 vice NN I-NP
 president NN I-NP
 of IN B-PP
 Balcor NNP B-NP
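
To make the file layout concrete, here is a minimal sketch that parses text in this format into (word, pos, iob) triples using only the standard library. The helper name `parse_iob_lines` and the `SAMPLE` string are illustrative assumptions, not part of NLTK; the final `. . O` line is taken from the sentence output shown later in this article.

```python
# Sample text in the three-column layout: word, POS tag, IOB tag.
SAMPLE = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""


def parse_iob_lines(text):
    """Return one (word, pos, iob) triple per non-empty line (hypothetical helper)."""
    triples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate sentences; they are skipped here
        word, pos, iob = line.split()
        triples.append((word, pos, iob))
    return triples


triples = parse_iob_lines(SAMPLE)
print(triples[0])   # ('Mr.', 'NNP', 'B-NP')
print(triples[-1])  # ('.', '.', 'O')
```

The sketch assumes exactly three whitespace-separated columns per line; a line with a different number of columns raises a ValueError.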

What does this mean?
B-NP: the beginning of a noun phrase.
I-NP: the word is inside the current noun phrase.
O: the word is outside any chunk (for example, the sentence-final period).
B-VP and I-VP: the beginning and the inside of a verb phrase.
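
The meaning of the B-, I-, and O tags can be sketched as a small grouping routine: a B- tag opens a new chunk, an I- tag extends the open chunk, and O leaves the token outside any chunk. The function `iob_to_chunks` below is a hypothetical helper written for illustration, not an NLTK API.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, words) pairs.

    Tokens tagged O are returned with chunk_type None.
    """
    chunks = []
    current_type = None  # type of the chunk currently being extended, if any
    for word, pos, iob in triples:
        if iob == "O":
            chunks.append((None, [word]))
            current_type = None
            continue
        prefix, chunk_type = iob.split("-", 1)
        if prefix == "B" or chunk_type != current_type:
            chunks.append((chunk_type, [word]))  # start a new chunk
        else:
            chunks[-1][1].append(word)  # I- tag continues the open chunk
        current_type = chunk_type
    return chunks


sentence = [
    ("Mr.", "NNP", "B-NP"),
    ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"),
    ("been", "VBN", "I-VP"),
    (".", ".", "O"),
]
print(iob_to_chunks(sentence))
# [('NP', ['Mr.', 'Meador']), ('VP', ['had', 'been']), (None, ['.'])]
```

One design choice in this sketch: an I- tag that does not continue a chunk of the same type is treated as the start of a new chunk rather than as an error.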

Code #1: Reading words as chunks and as IOB-tagged tokens.

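# Loading libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initialization: read every .iob file in the current directory
# and recognise NP, VP and PP chunks
reader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Words grouped into chunk subtrees
reader.chunked_words()

# Words as (word, pos, iob) triples
reader.iob_words()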

Output:

 [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), ...]
 [('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]

Code #2: Reading sentences as chunk trees and as IOB-tagged tokens.

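# Loading libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initialization: read every .iob file in the current directory
# and recognise NP, VP and PP chunks
reader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Sentences as chunk trees
reader.chunked_sents()

# Sentences as lists of (word, pos, iob) triples
reader.iob_sents()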

Output:

 [Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]), Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]
 [[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]]
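
The two results describe the same sentence in different shapes. As a rough illustration of how they relate, the sketch below turns a chunked sentence into IOB triples. The chunked sentence is modelled here as plain (label, tokens) pairs rather than NLTK Tree objects, and `chunks_to_iob` is a hypothetical helper, so this is an assumption-laden sketch and not how NLTK implements the conversion.

```python
def chunks_to_iob(chunked_sentence):
    """Convert (label, [(word, pos), ...]) pairs into (word, pos, iob) triples.

    A label of None marks tokens that are outside any chunk.
    """
    triples = []
    for label, tokens in chunked_sentence:
        for index, (word, pos) in enumerate(tokens):
            if label is None:
                tag = "O"
            elif index == 0:
                tag = "B-" + label  # first token opens the chunk
            else:
                tag = "I-" + label  # later tokens are inside it
            triples.append((word, pos, tag))
    return triples


chunked = [
    ("NP", [("Mr.", "NNP"), ("Meador", "NNP")]),
    ("VP", [("had", "VBD"), ("been", "VBN")]),
    ("NP", [("executive", "JJ"), ("vice", "NN"), ("president", "NN")]),
    ("PP", [("of", "IN")]),
    ("NP", [("Balcor", "NNP")]),
    (None, [(".", ".")]),
]
for triple in chunks_to_iob(chunked):
    print(triple)
```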

Let's look at the code above:

  • The ConllChunkCorpusReader class is used to read the IOB corpus.
  • There is no paragraph separation, since each sentence is separated by a blank line, so the para_* methods are not available.
  • The third argument to ConllChunkCorpusReader is a tuple or list naming the chunk types in the file, such as ('NP', 'VP', 'PP').
  • The iob_words() and iob_sents() methods return lists of 3-tuples of the form (word, pos, iob).
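
The points above can be sketched in plain Python: sentences are split on blank lines, each token becomes a (word, pos, iob) triple, and a tuple of chunk types controls which chunks are kept. The helper `read_iob_sentences` is hypothetical, the two-sentence sample is an artificial split of the fragment above, and rewriting the tags of unlisted chunk types to O reflects my understanding of how NLTK handles them, so treat that part as an assumption.

```python
def read_iob_sentences(text, chunk_types=None):
    """Split blank-line-separated IOB text into sentences of triples."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        if iob != "O" and chunk_types is not None:
            if iob.split("-", 1)[1] not in chunk_types:
                iob = "O"  # assumption: unlisted chunk types count as outside
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences


# Artificial two-sentence sample, for illustration only.
text = "had VBD B-VP\nbeen VBN I-VP\n\nof IN B-PP\nBalcor NNP B-NP\n"
print(read_iob_sentences(text, chunk_types=("NP", "VP")))
# [[('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP')], [('of', 'IN', 'O'), ('Balcor', 'NNP', 'B-NP')]]
```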

Code #3: Leaves of trees, i.e. tagged tokens.

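# Loading libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initialization
reader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Tagged tokens of the first chunk
reader.chunked_words()[0].leaves()

# Tagged tokens of the first sentence
reader.chunked_sents()[0].leaves()

# Tagged tokens of the first sentence of the first paragraph.
# ConllChunkCorpusReader has no para_* methods, so this call needs a
# reader that tracks paragraphs, such as ChunkedCorpusReader.
# reader.chunked_paras()[0][0].leaves()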

Output:

 [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]
 [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
 [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
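
To show what taking the leaves of a tree amounts to, here is a small sketch that flattens a nested chunk structure into its tagged tokens. The tree is modelled as nested (label, children) tuples instead of an NLTK Tree, the `leaves` function is a hypothetical stand-in for Tree.leaves(), and the sample tree uses only the first few tokens of the sentence above, with the tokens after the noun phrase left as bare leaves for illustration.

```python
def leaves(node):
    """Flatten a (label, children) tree into its (word, pos) tokens, left to right."""
    label, children = node
    tokens = []
    for child in children:
        if isinstance(child[1], list):  # child is itself a (label, children) subtree
            tokens.extend(leaves(child))
        else:
            tokens.append(child)  # child is a (word, pos) leaf
    return tokens


tree = (
    "S",
    [
        ("NP", [("Earlier", "JJR"), ("staff-reduction", "NN"), ("moves", "NNS")]),
        ("have", "VBP"),
        ("trimmed", "VBN"),
    ],
)
print(leaves(tree[1][0]))  # tokens of the NP chunk only
print(leaves(tree))        # tokens of the whole sentence
```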