NLP | Regex and Affix tagging

Understanding the concept:

  • RegexpTagger is a subclass of SequentialBackoffTagger. It can be placed just before a DefaultTagger to tag words that the n-gram taggers have missed, which makes it a useful part of a backoff chain.
  • On initialization, the patterns are saved. Each time choose_tag() is called, RegexpTagger iterates over the patterns and returns the tag of the first expression that matches the current word using re.match().
  • So if two expressions match, the tag of the first one is returned without the second expression ever being tried.
  • If a catch-all pattern such as (r'.*', 'NN') is included, usually as the last entry, the RegexpTagger can replace the DefaultTagger class.
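
The first-match behavior described above can be sketched in plain Python. This is only an illustration of the idea, not NLTK's actual implementation; the function name first_matching_tag and the two sample patterns are hypothetical and were chosen so that both match the word "meaning".

```python
import re

# Hypothetical patterns: both expressions match "meaning",
# so their order decides which tag is returned.
PATTERNS = [
    (r'.*ing$', 'VBG'),   # words ending in -ing
    (r'^mean.*', 'VB'),   # words starting with mean-
]

def first_matching_tag(word, patterns=PATTERNS):
    """Return the tag of the first pattern that matches, or None."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag      # later patterns are never tried
    return None             # no pattern matched

print(first_matching_tag('meaning'))  # both patterns match, the first wins
print(first_matching_tag('means'))    # only the second pattern matches
print(first_matching_tag('table'))    # nothing matches
```

Swapping the two entries in PATTERNS would change the tag returned for "meaning", which is why the order of the pattern list matters.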

Code # 1: Python regular expression module and re syntax

patterns = [
    # cardinal numbers
    (r'^\d+$', 'CD'),
    # gerunds, e.g. wondering
    (r'.*ing$', 'VBG'),
    # nouns ending in -ment, e.g. wonderment
    (r'.*ment$', 'NN'),
    # adjectives ending in -ful, e.g. wonderful
    (r'.*ful$', 'JJ'),
]

The RegexpTagger class expects a list of 2-tuples:

  • the first element of each tuple is a regular expression
  • the second element is the tag

Code # 2: Using RegexpTagger

# Loading libraries
from tag_util import patterns
from nltk.tag import RegexpTagger
from nltk.corpus import treebank

# Sentences from index 3000 onward are used as test data
test_data = treebank.tagged_sents()[3000:]

tagger = RegexpTagger(patterns)
print("Accuracy:", tagger.evaluate(test_data))

Output:

 Accuracy: 0.037470321605870924 
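
A score this low is largely expected: any word that matches none of the four patterns gets the tag None, which counts as an error. The plain-Python sketch below (not NLTK code; the helper tag_words and the sample word list are hypothetical) illustrates the effect and shows how a catch-all pattern placed last behaves like a default tag.

```python
import re

patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

# Same list with a catch-all appended, playing the role of a default tag
with_default = patterns + [(r'.*', 'NN')]

def tag_words(words, pats):
    """Tag each word with the first matching pattern's tag, else None."""
    tagged = []
    for word in words:
        tag = next((t for rx, t in pats if re.match(rx, word)), None)
        tagged.append((word, tag))
    return tagged

words = ['the', 'wonderful', 'agreement', 'of', '1987']
print(tag_words(words, patterns))      # 'the' and 'of' are left as None
print(tag_words(words, with_default))  # the catch-all fills the gaps with 'NN'
```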

What is affix tagging?
AffixTagger is a subclass of ContextTagger. For the AffixTagger class, the context is either a prefix or a suffix of a word, so the class learns tags from fixed-length substrings at the beginning or end of a word.
By default it uses three-character suffixes and requires words to be at least five characters long; None is returned as the tag if a word has fewer than five characters.
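
The idea can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: the parameter names affix_length and min_stem_length mirror AffixTagger's arguments, while the functions affix_context and train_affix_model and the tiny training list are hypothetical.

```python
from collections import Counter, defaultdict

def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return the prefix or suffix used as context, or None if the word is too short."""
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    if affix_length > 0:
        return word[:affix_length]   # positive length: prefix
    return word[affix_length:]       # negative length: suffix

def train_affix_model(tagged_words, affix_length=-3):
    """Map each affix to the tag seen most often with it."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        context = affix_context(word, affix_length)
        if context is not None:
            counts[context][tag] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

# Hypothetical hand-made training data
training = [('running', 'VBG'), ('jumping', 'VBG'), ('morning', 'NN'),
            ('payment', 'NN'), ('is', 'VBZ')]
model = train_affix_model(training)

print(affix_context('publishing'))                  # three-character suffix
print(affix_context('publishing', affix_length=3))  # three-character prefix
print(affix_context('the'))                         # too short, so None
print(model)
```

Here "ing" maps to VBG because that tag appears with it twice and NN only once, and "is" contributes nothing because it is shorter than five characters.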

Code # 3: Understanding AffixTagger.

# Loading libraries
from nltk.corpus import treebank
from nltk.tag import AffixTagger

# Split the corpus into training and test sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

print("Train data:", train_data[1])

# Initialize the tagger (default: three-character suffixes)
tag = AffixTagger(train_data)

# Testing
print("Accuracy:", tag.evaluate(test_data))

Output:

 Train data: [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('NV', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]
 Accuracy: 0.27558817181092166

Code # 4: AffixTagger, specifying 3-character prefixes.

# Specify three-character prefixes (a positive affix_length means a prefix)
prefix_tag = AffixTagger(train_data, affix_length=3)

# Testing
accuracy = prefix_tag.evaluate(test_data)
print("Accuracy:", accuracy)

Output:

 Accuracy: 0.23587308439456076 

Code # 5: AffixTagger with 2-character suffixes

# Specify two-character suffixes (a negative affix_length means a suffix)
suffix_tag = AffixTagger(train_data, affix_length=-2)

# Testing
accuracy = suffix_tag.evaluate(test_data)
print("Accuracy:", accuracy)

Output:

 Accuracy: 0.31940427368875457 



