NLP | Training a Tagger-Based Chunker | Set 2

    The conll2000 corpus defines chunks using IOB tags.

  • An IOB tag indicates where a chunk begins and ends, as well as its type.
  • A part-of-speech tagger can be trained on these IOB tags to build a ChunkerI subclass.
  • First, the corpus's chunked_sents() method yields trees, which are then converted to the format used by the part-of-speech tagger.
  • conll_tag_chunks() uses tree2conlltags() to convert each sentence Tree into a list of 3-tuples of the form (word, pos, iob).
    • pos: part-of-speech tag
    • iob: IOB tag, for example B-NP or I-NP, indicating that the word is at the beginning of, or inside, a noun phrase, respectively.
  • conlltags2tree() is the inverse of tree2conlltags().
  • The 3-tuples are then converted to the 2-tuples that the tagger can recognize.
  • The RegexpParser class uses part-of-speech tags in its chunk patterns, so here part-of-speech tags are treated as if they were the words to tag.
  • conll_tag_chunks() takes chunk trees, extracts 3-tuples (word, pos, iob), and returns a list of 2-tuples of the form (pos, iob).
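The conversion described in these bullets can be sketched in a few lines. This is a minimal version of conll_tag_chunks(); the real helper is assumed to live in the chunkers module used throughout this series:

```python
from nltk.chunk.util import tree2conlltags
from nltk.tree import Tree

def conll_tag_chunks(chunk_sents):
    """Convert chunk trees to lists of (pos, iob) 2-tuples for tagger training."""
    # tree2conlltags() turns each tree into (word, pos, iob) 3-tuples;
    # the word is then dropped so the tagger learns pos -> iob mappings
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(pos, iob) for (word, pos, iob) in sent] for sent in tagged_sents]

t = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])
print(conll_tag_chunks([t]))  # [[('DT', 'B-NP'), ('NN', 'I-NP')]]
```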

Code # 1: Let's figure it out

from nltk.chunk.util import tree2conlltags, conlltags2tree

from nltk.tree import Tree

# conll_tag_chunks() is assumed to come from the chunkers
# module used in this series
from chunkers import conll_tag_chunks

t = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])

print("Tree2conlltags:", tree2conlltags(t))

c = conlltags2tree([('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')])

print("conlltags2tree:", c)

# Convert 3-tuples to 2-tuples.
print("conll_tag_chunks for tree:", conll_tag_chunks([t]))

Output:

 Tree2conlltags: [('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')]
 conlltags2tree: Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])
 conll_tag_chunks for tree: [[('DT', 'B-NP'), ('NN', 'I-NP')]]
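Before using the ready-made TagChunker, it helps to see the idea end to end. The sketch below is a minimal, hypothetical tagger-based chunker (SimpleTagChunker is an illustrative name, not the series' actual TagChunker): it trains a UnigramTagger on (pos, iob) pairs and reassembles a chunk tree in parse():

```python
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger
from nltk.tree import Tree

class SimpleTagChunker(ChunkParserI):
    def __init__(self, train_chunks):
        # Train the tagger on (pos, iob) pairs extracted from the chunk trees
        train_data = [[(pos, iob) for (word, pos, iob) in tree2conlltags(tree)]
                      for tree in train_chunks]
        self.tagger = UnigramTagger(train_data)

    def parse(self, tagged_sent):
        # tagged_sent is a list of (word, pos) pairs; predict an IOB tag
        # for each POS tag, then rebuild the chunk tree
        words, tags = zip(*tagged_sent)
        iob_tags = [iob for (pos, iob) in self.tagger.tag(list(tags))]
        return conlltags2tree(list(zip(words, tags, iob_tags)))

train = [Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])]
chunker = SimpleTagChunker(train)
print(chunker.parse([('a', 'DT'), ('dog', 'NN')]))  # (S (NP a/DT dog/NN))
```

Because the chunker only looks at POS tags, it generalizes to words it has never seen, as the example shows.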

Code # 2: TagChunker class using conll2000 corpus

from chunkers import TagChunker

from nltk.corpus import conll2000

# data
conll_train = conll2000.chunked_sents('train.txt')

conll_test = conll2000.chunked_sents('test.txt')

# chunker initialization
chunker = TagChunker(conll_train)

# testing
score = chunker.evaluate(conll_test)

a = score.accuracy()

p = score.precision()

r = score.recall()

print("Accuracy of TagChunker:", a)

print("Precision of TagChunker:", p)

print("Recall of TagChunker:", r)

Output:

 Accuracy of TagChunker: 0.8950545623403762
 Precision of TagChunker: 0.8114841974355675
 Recall of TagChunker: 0.8644191676944863

Note: Performance on conll2000 is not as good as on treebank_chunk, but conll2000 is a much larger corpus.

Code # 3: TagChunker using the UnigramTagger class

# loading libraries
from chunkers import TagChunker

from nltk.tag import UnigramTagger

# train_chunks and test_chunks are assumed to be the treebank_chunk
# training and testing data from the previous set
uni_chunker = TagChunker(train_chunks,
                         tagger_classes=[UnigramTagger])

score = uni_chunker.evaluate(test_chunks)

a = score.accuracy()

print("Accuracy of TagChunker:", a)

Output:

 Accuracy of TagChunker: 0.9674925924335466 

The tagger_classes argument is passed directly to the backoff_tagger() function, which means the classes must be subclasses of SequentialBackoffTagger. In testing, the default value tagger_classes = [UnigramTagger, BigramTagger] usually gives the best results, but this can vary from case to case.
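The backoff chaining that backoff_tagger() performs can be sketched as follows; this is a minimal version under the assumption that the real helper lives in the series' supporting module:

```python
from nltk.tag import UnigramTagger, BigramTagger

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Each tagger is constructed with the previous one as its backoff,
    # so the last class in the list is consulted first when tagging
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

train_sents = [[('the', 'DT'), ('book', 'NN')]]
tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger])
print(tagger.tag(['the', 'book']))  # [('the', 'DT'), ('book', 'NN')]
```

This is why the classes must be SequentialBackoffTagger subclasses: each constructor has to accept a backoff keyword argument pointing at the previously built tagger.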