NLP | Customization Using Tagged Corpus Reader



How can we use Tagged Corpus Reader?

  • Configuring the word tokenizer
  • Configuring the sentence tokenizer
  • Configuring the paragraph block reader
  • Customizing the tag separator
  • Converting tags to a universal set of tags

Code #1: Configuring the word tokenizer

# Loading libraries
from nltk.tokenize import SpaceTokenizer
from nltk.corpus.reader import TaggedCorpusReader

# Read every .pos file in the current directory,
# splitting words on single spaces
x = TaggedCorpusReader('.', r'.*\.pos',
                       word_tokenizer=SpaceTokenizer())

x.words()

Output:

['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Code #2: Configuring the sentence tokenizer

# Loading libraries
from nltk.tokenize import LineTokenizer
from nltk.corpus.reader import TaggedCorpusReader

# Treat each line of the file as one sentence
x = TaggedCorpusReader('.', r'.*\.pos',
                       sent_tokenizer=LineTokenizer())

x.sents()

Output:

[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Configuring the paragraph block reader

  • Paragraphs are assumed to be separated by blank lines.
  • This is handled by the para_block_reader function, which defaults to nltk.corpus.reader.util.read_blankline_block.
  • A number of other block readers are available in nltk.corpus.reader.util; their purpose is to read blocks of text from a stream.
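
To make the blank-line convention concrete, here is a plain-Python sketch of splitting text into paragraph blocks. It only illustrates the idea and is not NLTK's implementation; the function name split_blankline_blocks and the sample text are made up for this example.

```python
import re


def split_blankline_blocks(text):
    """Split text into paragraph blocks separated by one or more blank lines."""
    # A blank line is a line that is empty or contains only whitespace.
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block for block in blocks if block.strip()]


# Hypothetical tagged text: two paragraphs separated by a blank line.
sample = (
    "The/AT expense/NN and/CC time/NN\n"
    "involved/VBN are/BER astronomical/JJ ./.\n"
    "\n"
    "A/AT second/OD paragraph/NN ./.\n"
)

paragraphs = split_blankline_blocks(sample)
print(len(paragraphs))
for paragraph in paragraphs:
    print(repr(paragraph))
```

With NLTK itself, a different block reader from nltk.corpus.reader.util can be supplied through the para_block_reader argument of TaggedCorpusReader.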

Customizing the tag separator

  • If '/' is not used as the word/tag separator, you can pass an alternate string to TaggedCorpusReader as sep.
  • The default is sep='/'. To separate words and tags with '|' instead, as in 'word|tag', pass sep='|'.
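
As a rough illustration of what the separator controls, the following plain-Python sketch splits 'word|tag' tokens into (word, tag) pairs. The helper split_tagged_token and the sample line are hypothetical and only mirror the idea; with NLTK you would simply pass sep='|' to TaggedCorpusReader.

```python
def split_tagged_token(token, sep='/'):
    """Split 'word<sep>tag' into a (word, tag) pair.

    The split happens at the last occurrence of sep, so a word that
    itself contains the separator keeps it. A token without the
    separator gets None as its tag.
    """
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag)


# Hypothetical corpus line that uses '|' instead of '/'.
line = "The|AT expense|NN and|CC time|NN"
tagged = [split_tagged_token(tok, sep='|') for tok in line.split()]
print(tagged)
```

Splitting at the last separator is a design choice of this sketch: it keeps tokens such as '1/2/CD' intact as the word '1/2' with the tag 'CD'.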

Converting tags to a universal tag set
Tagset: a list of part-of-speech (POS) tags used by one or more corpora.
Universal tagset: a simplified, condensed tagset with only 12 part-of-speech tags.
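
Conceptually, the conversion is a lookup from corpus-specific tags to one of the 12 coarse tags. The sketch below illustrates this with a small hand-written subset of Brown-style tags. The dictionary BROWN_TO_UNIVERSAL_SAMPLE and the function to_universal are assumptions made for illustration, not NLTK's own mapping tables, and falling back to 'X' (the catch-all "other" category) for unlisted tags is a choice of this sketch.

```python
# The 12 tags of the universal tagset.
UNIVERSAL_TAGS = {
    'ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN',
    'NUM', 'PRON', 'PRT', 'VERB', 'X', '.',
}

# Hand-written sample of Brown-style tags (illustrative subset only).
BROWN_TO_UNIVERSAL_SAMPLE = {
    'AT': 'DET',
    'NN': 'NOUN',
    'CC': 'CONJ',
    'JJ': 'ADJ',
    'VBN': 'VERB',
    '.': '.',
}


def to_universal(tagged_words, mapping, default='X'):
    """Map each (word, tag) pair to (word, universal_tag)."""
    return [(word, mapping.get(tag, default)) for word, tag in tagged_words]


tagged = [('The', 'AT'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN')]
universal = to_universal(tagged, BROWN_TO_UNIVERSAL_SAMPLE)
print(universal)
```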

Code #3: Mapping corpus tags to the universal tag set

from nltk.corpus.reader import TaggedCorpusReader

# Declare which tagset the corpus files use so the tags can be mapped
x = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')

x.tagged_words(tagset='universal')

Output:

[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]

Code #4: Mapping treebank corpus tags to other tag sets

from nltk.corpus import treebank

# Original Penn Treebank tags
treebank.tagged_words()

# Tags mapped to the universal tagset
treebank.tagged_words(tagset='universal')

# Requesting a tagset with no available mapping yields 'UNK' tags
treebank.tagged_words(tagset='brown')

Output:

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]
[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]