NLP | Categorized Text Corpus



Code #1: Categorization

# Load the Brown corpus
from nltk.corpus import brown

# List the categories defined for the corpus
brown.categories()

Output:

 ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

How to categorize a corpus?
The easiest way is to have one file for each category. Below are two excerpts from the movie_reviews corpus:

  • movie_pos.txt
  • movie_neg.txt

Using these two files, we have two categories: pos and neg.
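
To try the reader code in this article without the original excerpts, the two files can be created by hand. The sketch below is a minimal setup under assumptions: it writes to a temporary directory, the helper name write_sample_corpus is hypothetical, and the review sentences are made-up placeholder text, not the actual excerpts from movie_reviews.

```python
import tempfile
from pathlib import Path

# Hypothetical placeholder text; the real movie_reviews excerpts are not reproduced here.
SAMPLE_FILES = {
    "movie_pos.txt": "A warm, funny film with a great cast.\n",
    "movie_neg.txt": "A dull plot and flat characters.\n",
}


def write_sample_corpus(root):
    """Write one file per category into root and return the sorted file names."""
    root = Path(root)
    for name, text in SAMPLE_FILES.items():
        (root / name).write_text(text, encoding="utf-8")
    return sorted(p.name for p in root.glob("movie_*.txt"))


corpus_dir = tempfile.mkdtemp()  # throwaway directory for the two files
created = write_sample_corpus(corpus_dir)
print(created)
```

Passing corpus_dir as the first argument of the reader, in place of the current directory, should point it at these files.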

Code #2: Let's categorize

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# The group (\w+) in cat_pattern captures the category name from each file name
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

print("Categorize:", reader.categories())
print("Negative field:", reader.fileids(categories=['neg']))
print("Positive field:", reader.fileids(categories=['pos']))

Output:

 Categorize: ['neg', 'pos']
 Negative field: ['movie_neg.txt']
 Positive field: ['movie_pos.txt']
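
The cat_pattern argument is a regular expression with one capturing group: it is matched against each file ID, and the text captured by the group becomes that file's category. The following is a rough standard-library sketch of that idea, not NLTK's actual implementation; the helper name categories_from_pattern is hypothetical, and skipping non-matching file IDs is a choice made for this sketch.

```python
import re
from collections import defaultdict


def categories_from_pattern(fileids, cat_pattern):
    """Group file IDs by the category captured by the first regex group."""
    by_category = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:  # in this sketch, file IDs that do not match are skipped
            by_category[match.group(1)].append(fileid)
    return dict(by_category)


fileids = ["movie_neg.txt", "movie_pos.txt"]
mapping = categories_from_pattern(fileids, r"movie_(\w+)\.txt")
print(sorted(mapping))
print(mapping["neg"])
print(mapping["pos"])
```

Note the escaped dot in the pattern: an unescaped . would match any character, not only the literal dot before the file extension.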

Code #3: Using cat_map instead of cat_pattern

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_map assigns each file ID an explicit list of categories
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt',
    cat_map={'movie_pos.txt': ['pos'],
             'movie_neg.txt': ['neg']})

print("Categorize:", reader.categories())

Output:

 Categorize: ['neg', 'pos']
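
A cat_map is a plain dictionary from file ID to a list of categories, so one file can carry more than one label. As an illustration only, the sketch below inverts such a mapping with the standard library to list the files in each category; the helper name files_by_category is hypothetical and not part of NLTK.

```python
def files_by_category(cat_map):
    """Invert a file-to-categories mapping into category -> sorted file IDs."""
    inverted = {}
    for fileid, categories in cat_map.items():
        for category in categories:
            inverted.setdefault(category, []).append(fileid)
    # Sort categories and file IDs so the result is stable and easy to read
    return {cat: sorted(files) for cat, files in sorted(inverted.items())}


cat_map = {"movie_pos.txt": ["pos"], "movie_neg.txt": ["neg"]}
inverted = files_by_category(cat_map)
print(list(inverted))
print(inverted["neg"])
print(inverted["pos"])
```

An explicit mapping like this is useful when file names do not follow a pattern that a single regular expression could capture.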