NLP | Training a named entity chunker

Named entity chunk trees can be created from the ieer corpus using ieertree2conlltags() and ieer_chunked_sents(). These trees can then be used to train the ClassifierChunker class created for classification-based chunking.

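Code #1: ieertree2conlltags()

import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer


def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    # tree.pos() yields (word, label) pairs; here the label is the
    # entity type, or the root label for words outside any entity
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None

    for ent in ents:
        if ent == tree.label():
            # word sits directly under the root: outside any entity
            iobs.append('O')
            prev = None
        elif prev == ent:
            # same entity type as the previous word: continue the chunk
            iobs.append('I-%s' % ent)
        else:
            # a new entity chunk begins
            iobs.append('B-%s' % ent)
            prev = ent

    # the corpus has no part-of-speech tags, so add them with the tagger
    words, tags = zip(*tag(words))

    return zip(words, tags, iobs)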
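To see what the IOB labels look like without loading the corpus or a tagger, the labeling loop can be run on hand-written data. The sketch below is a standalone illustration rather than part of the recipe: the helper name iob_labels and the sample pairs are hypothetical, and the pairs only imitate what tree.pos() would return for a document whose root label is assumed to be DOCUMENT.

```python
def iob_labels(pairs, root_label):
    """Assign IOB labels to (word, entity_label) pairs.

    Mirrors the loop in ieertree2conlltags(): a word whose label equals
    the root label is outside any entity ('O'), the first word of an
    entity gets 'B-', and following words with the same label get 'I-'.
    """
    iobs = []
    prev = None
    for _, ent in pairs:
        if ent == root_label:
            iobs.append('O')
            prev = None
        elif prev == ent:
            iobs.append('I-%s' % ent)
        else:
            iobs.append('B-%s' % ent)
            prev = ent
    return iobs


# Hypothetical (word, label) pairs, made up for this example
pairs = [
    ('Pierre', 'PERSON'),
    ('Vinken', 'PERSON'),
    ('joined', 'DOCUMENT'),
    ('on', 'DOCUMENT'),
    ('Nov.', 'DATE'),
    ('29', 'DATE'),
]

labels = iob_labels(pairs, 'DOCUMENT')
print(labels)
# ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-DATE', 'I-DATE']
```

Because the loop compares only labels, two separate entities of the same type that sit next to each other end up merged into a single chunk.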

Code #2: ieer_chunked_sents()

import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer


def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    # each parsed document yields one chunk tree
    for doc in ieer.parsed_docs():
        tagged = ieertree2conlltags(doc.text, tag)
        yield conlltags2tree(tagged)
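conlltags2tree() turns the (word, tag, IOB) triples back into a chunk tree. As a rough illustration of that grouping step, the sketch below rebuilds chunks using plain Python lists and tuples. It is a simplified stand-in and not NLTK's implementation; the name group_chunks and the sample triples are hypothetical.

```python
def group_chunks(triples):
    """Group (word, pos, iob) triples into a flat list of items.

    Words tagged 'O' stay as (word, pos) tuples. Each run that starts
    with 'B-X' and continues with 'I-X' becomes ('X', [(word, pos), ...]).
    """
    result = []
    current = None  # the chunk currently being extended, if any
    for word, pos, iob in triples:
        if iob == 'O':
            result.append((word, pos))
            current = None
        elif iob.startswith('B-') or current is None or current[0] != iob[2:]:
            # Start a new chunk on 'B-', or on an 'I-' tag that does not
            # continue the open chunk (a simplifying assumption)
            current = (iob[2:], [(word, pos)])
            result.append(current)
        else:
            current[1].append((word, pos))
    return result


# Hypothetical triples, made up for this example
triples = [
    ('Pierre', 'NNP', 'B-PERSON'),
    ('Vinken', 'NNP', 'I-PERSON'),
    ('will', 'MD', 'O'),
    ('join', 'VB', 'O'),
    ('Nov.', 'NNP', 'B-DATE'),
    ('29', 'CD', 'I-DATE'),
]

chunks = group_chunks(triples)
for item in chunks:
    print(item)
```

Each printed item is either a plain (word, pos) pair or a labeled group, which corresponds to the leaves and subtrees of the chunk tree.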

Of the 94 chunk trees produced, the first 80 are used for training and the remaining 14 for testing.

Code #3: How the classifier performs on the first sentence of the treebank_chunk corpus.

from nltk.corpus import ieer
# chunkers is the local module that holds ieer_chunked_sents and ClassifierChunker
from chunkers import ieer_chunked_sents, ClassifierChunker
from nltk.corpus import treebank_chunk

ieer_chunks = list(ieer_chunked_sents())

print("Length of ieer_chunks:", len(ieer_chunks))

# train the chunker on the first 80 trees
chunker = ClassifierChunker(ieer_chunks[:80])

print("parsing:", chunker.parse(
    treebank_chunk.tagged_sents()[0]))

# evaluate on the remaining trees
score = chunker.evaluate(ieer_chunks[80:])

a = score.accuracy()
p = score.precision()
r = score.recall()

print("Accuracy:", a)
print("Precision:", p)
print("Recall:", r)

Output:

Length of ieer_chunks: 94
parsing: Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
    (',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]),
    Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'),
    ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
    ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
    Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])
Accuracy: 0.8829018388070625
Precision: 0.4088717454194793
Recall: 0.5053635280095352
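The gap between accuracy and precision/recall in this output is easier to read once the two kinds of score are separated: in NLTK's ChunkScore, accuracy is computed over individual IOB tags, while precision and recall are computed over whole chunks. The sketch below reproduces that distinction on hand-made tag sequences. It is a simplified illustration with hypothetical helper names and made-up data, not NLTK's code.

```python
def extract_chunks(iobs):
    """Return a set of (label, start, end) spans from a list of IOB tags."""
    chunks = set()
    start = label = None
    for i, tag in enumerate(iobs + ['O']):  # the sentinel closes a trailing chunk
        inside = tag.startswith('I-') and tag[2:] == label
        if label is not None and not inside:
            chunks.add((label, start, i))
            start = label = None
        if tag != 'O' and not inside:
            start, label = i, tag[2:]
    return chunks


def token_accuracy(gold, pred):
    """Fraction of positions where the predicted IOB tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunk_precision_recall(gold, pred):
    """Precision and recall over exact (label, start, end) chunk matches."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
    correct = len(gold_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    return precision, recall


# Made-up gold and predicted tags; the prediction misses one word of the PERSON chunk
gold = ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-DATE', 'I-DATE']
pred = ['B-PERSON', 'O', 'O', 'O', 'B-DATE', 'I-DATE']

acc = token_accuracy(gold, pred)
precision, recall = chunk_precision_recall(gold, pred)
print("Tag accuracy:", acc)
print("Chunk precision:", precision)
print("Chunk recall:", recall)
```

Here a single wrong tag costs one token of accuracy (5 of 6 tags still match) but invalidates the whole PERSON chunk, so chunk precision and recall both drop to 0.5. The same effect plausibly contributes to the scores above, where tag accuracy is far higher than chunk precision and recall.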

How does it work?
The ieer trees generated by ieer_chunked_sents() are not entirely accurate. The corpus has no explicit sentence breaks, so each document becomes a single tree. The words also carry no part-of-speech tags, so the tags are guessed using nltk.tag.pos_tag().