NLP | Location Tags Extraction



A different kind of ChunkParserI subclass can be used to identify LOCATION chunks. It uses the gazetteers corpus to recognize location words. The gazetteers corpus is a WordListCorpusReader that contains the following location words:

  • Country Names
  • US States and Abbreviations
  • Mexican States
  • Major US cities
  • Canadian provinces

The LocationChunker class iterates over a tagged sentence, looking for words that appear in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced in iob_locations(), and the parse() method converts the IOB tags into a tree.
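To make the IOB scheme concrete, here is a minimal sketch that groups (word, tag, IOB) triples into LOCATION chunks in plain Python. The helper name group_iob_chunks is hypothetical and is not part of NLTK; in the article's code the real conversion into a tree is done by conlltags2tree. The sketch assumes a B- tag opens a chunk, an I- tag continues it, and O closes it.

```python
def group_iob_chunks(iob_triples, label="LOCATION"):
    """Collect runs of B-/I- tagged tokens into lists of (word, tag) pairs."""
    chunks, current = [], []
    for word, tag, iob in iob_triples:
        if iob == "B-" + label:
            # B- always opens a new chunk, closing any open one first.
            if current:
                chunks.append(current)
            current = [(word, tag)]
        elif iob == "I-" + label:
            # I- continues the open chunk (or opens one if none is open).
            current.append((word, tag))
        else:
            # Any other tag (such as O) closes the open chunk.
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


iobs = [
    ("San", "NNP", "B-LOCATION"),
    ("Francisco", "NNP", "I-LOCATION"),
    ("CA", "NNP", "I-LOCATION"),
    ("is", "BE", "O"),
    ("cold", "JJ", "O"),
]
print(group_iob_chunks(iobs))
```

Only the three tokens tagged B-LOCATION or I-LOCATION end up in a chunk; the tokens tagged O are left out.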

Code #1: LocationChunker class

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers


class LocationChunker(ChunkParserI):
    def __init__(self):
        # All location names from the gazetteers corpus
        self.locations = set(gazetteers.words())
        # Largest number of spaces found in any location name
        self.lookahead = 0
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords
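The lookahead value is simply the largest number of spaces in any location name, which tells the chunker how many following words may belong to the same phrase. A minimal sketch of the same computation, using a small made-up set in place of set(gazetteers.words()) (the set below is an assumption for illustration only):

```python
# Hypothetical stand-in for set(gazetteers.words()); the real corpus is much larger.
locations = {"CA", "San Jose", "San Francisco", "New York City"}

# The number of spaces in a name is the number of extra words after the first one.
lookahead = max((loc.count(" ") for loc in locations), default=0)
print(lookahead)  # 2, because "New York City" contains two spaces
```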

Code #2: iob_locations() and parse() methods

    # Methods of the LocationChunker class from Code #1

    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False  # True while the previous token was part of a location

        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False

            # Grow the candidate phrase one word at a time within the lookahead window
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'

                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'

                    loc, inside = True, True
                    i = j
                    break

                if j < l:
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break

            # No location starts at this word: tag it as outside any chunk
            if not loc:
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)

Code #3: Using the LocationChunker class to parse a sentence

from chunkers import sub_leaves
from chunkers import LocationChunker

# Create the chunker before calling parse()
loc = LocationChunker()
t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])

print("Location:", sub_leaves(t, 'LOCATION'))

Output:

Location: [[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], [('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
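The chunkers module that provides sub_leaves is not shown in this article. As a rough sketch of what such a helper might look like, assuming the tree is an nltk.tree.Tree (which offers subtrees(), label(), and leaves()), it could collect the leaves of every subtree with a given label:

```python
def sub_leaves(tree, label):
    """Return the leaves of every subtree whose label equals `label`.

    Assumes `tree` behaves like nltk.tree.Tree: subtrees() accepts a
    filter function, label() returns the node label, and leaves()
    returns the tokens below the node.
    """
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]
```

Each LOCATION subtree contributes one inner list, which is why the output above contains two lists, one per location phrase.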