NLP | Text by category


Code #1: Categorization

# Load the Brown corpus
from nltk.corpus import brown

# List the categories the corpus is divided into
brown.categories()

Output:

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

How to categorize a corpus?
The easiest way is to keep one file for each category. Below are two excerpts from the movie_reviews corpus, one per file:

  • movie_pos.txt
  • movie_neg.txt

Using these two files, we have two categories: pos and neg.

Code #2: Let's categorize

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Read every movie_*.txt file in the current directory. The group captured
# by cat_pattern (e.g. "pos" in movie_pos.txt) becomes the file's category.
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

print("Categories:", reader.categories())
print("Negative files:", reader.fileids(categories=['neg']))
print("Positive files:", reader.fileids(categories=['pos']))

Output:

Categories: ['neg', 'pos']
Negative files: ['movie_neg.txt']
Positive files: ['movie_pos.txt']
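
To see how a pattern of the form movie_(\w+)\.txt turns a file name into a category, the idea can be sketched with the standard re module alone. This is only an illustration of the regex, not NLTK's own code; the helper name category_from_fileid is made up for this example.

```python
import re

# Pattern of the same form as cat_pattern: group 1 captures the category.
CAT_PATTERN = r'movie_(\w+)\.txt'

def category_from_fileid(fileid, pattern=CAT_PATTERN):
    """Hypothetical helper: return the text captured by group 1,
    or None when the file name does not match the pattern."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None

fileids = ['movie_neg.txt', 'movie_pos.txt']

# Collect the distinct categories found in the file names.
categories = sorted({category_from_fileid(f) for f in fileids})
print(categories)  # ['neg', 'pos']
```

Note that the dot before txt is escaped, so it matches a literal period rather than any character.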

Code #3: Using cat_map instead of cat_pattern

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_map lists the categories of each file explicitly,
# instead of deriving them from the file name.
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt',
    cat_map={'movie_pos.txt': ['pos'],
             'movie_neg.txt': ['neg']})

print("Categories:", reader.categories())

Output:

Categories: ['neg', 'pos']
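
Because each value in a cat_map is a list, one file can belong to several categories, which a single-group pattern cannot express. The sketch below shows the shape of such a mapping and how it can be inverted to look up files by category. It uses plain dictionaries rather than NLTK, and both the file movie_mixed.txt and the helper files_by_category are invented for the illustration.

```python
from collections import defaultdict

# Mapping in the same shape as cat_map: file ID -> list of categories.
# 'movie_mixed.txt' is a hypothetical file used only for this example.
cat_map = {
    'movie_pos.txt': ['pos'],
    'movie_neg.txt': ['neg'],
    'movie_mixed.txt': ['pos', 'neg'],
}

def files_by_category(mapping):
    """Hypothetical helper: invert file -> categories into category -> files."""
    by_category = defaultdict(list)
    for fileid, cats in mapping.items():
        for cat in cats:
            by_category[cat].append(fileid)
    # Sort the file lists so the result does not depend on insertion order.
    return {cat: sorted(files) for cat, files in by_category.items()}

index = files_by_category(cat_map)
print(sorted(index))   # ['neg', 'pos']
print(index['pos'])    # ['movie_mixed.txt', 'movie_pos.txt']
```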