NLP | How to score words with Execnet and Redis

Steps:

  • For each label in the movie_reviews corpus (which only has pos and neg labels), start by getting a list of (label, words) tuples.
  • Then get the word_scores using score_words() from the dist_featx module.
  • The total word count is 39,767, and word_scores is an instance of RedisOrderedDict.
  • Then get the top 1000 words using the keys() method and inspect the first five to see what they are.
  • Delete the keys in Redis after getting all the required words from word_scores, as the data is no longer needed.

Code :

# Import the distributed scoring function and the corpus
from dist_featx import score_words
from nltk.corpus import movie_reviews

# Collect the words for each category (label)
categories = movie_reviews.categories()
print("Categories:", categories)

category_words = [
    (l, movie_reviews.words(categories=[l]))
    for l in categories
]

# Score every word; the result is a RedisOrderedDict
word_scores = score_words(category_words)
print("Length:", len(word_scores))

# Take the 1000 top-scoring words and look at the first five
topn_words = word_scores.keys(end=1000)
print("Top Words:", topn_words[0:5])

# Delete the keys in Redis once everything needed
# has been read from word_scores
from redis import Redis

r = Redis()
print([r.delete(key) for key in
       ['word_fd', 'label_word_fd:neg', 'label_word_fd:pos', 'word_scores']])

Output:

Categories: ['neg', 'pos']
Length: 39767
Top Words: [b'bad', b',', b'and', b'?', b'movie']
[1, 1, 1, 1]

score_words() is a function from dist_featx. Expect to wait a while for it to finish: the overhead of execnet and Redis means it takes significantly longer than a non-distributed, in-memory version of the function.

How does it work?
The dist_featx.py module contains a score_words() function that does the following:

  • Opens gateways and channels.
  • Sends initialization data to each channel.
  • For counting, sends each (label, words) tuple down a channel.
  • Sends a done message to each channel.
  • Waits for a done reply from each channel.
  • Closes the channels and gateways.
  • Calculates the score of each word from the counts.
  • Saves each score in a RedisOrderedDict.

The words are scored and the results saved only once the counting is finished. The code is below:
Code :

# Import libraries
import itertools, execnet, remote_word_count
from nltk.metrics import BigramAssocMeasures
from redis import Redis
from redisprob import RedisHashFreqDist, RedisConditionalHashFreqDist
from rediscollections import RedisOrderedDict

# Count words remotely, then score them locally
def score_words(category_words,
                score_fn=BigramAssocMeasures.chi_sq,
                host='localhost', specs=[('popen', 2)]):
    gateways = []
    channels = []

    # Open the gateways and channels and send each channel its initial data.
    # The key names match the ones deleted in the usage example above.
    for spec, count in specs:
        for i in range(count):
            gw = execnet.makegateway(spec)
            gateways.append(gw)
            channel = gw.remote_exec(remote_word_count)
            channel.send((host, 'word_fd', 'label_word_fd'))
            channels.append(channel)

    cyc = itertools.cycle(channels)

    # Hand the (category, words) tuples to the channels in turn
    for category, words in category_words:
        channel = next(cyc)
        channel.send((category, list(words)))

    # Signal the end of the data and wait for each channel to confirm
    for channel in channels:
        channel.send('done')
        assert 'done' == channel.receive()
        channel.waitclose(5)

    for gateway in gateways:
        gateway.exit()

    r = Redis(host)
    # Frequency distributions filled in by the remote counters
    fd = RedisHashFreqDist(r, 'word_fd')
    cfd = RedisConditionalHashFreqDist(r, 'label_word_fd')
    word_scores = RedisOrderedDict(r, 'word_scores')
    n_xx = cfd.N()

    for category in cfd.conditions():
        n_xi = cfd[category].N()

        for word, n_ii in cfd[category].iteritems():
            word = word.decode()
            n_ix = fd[word]

            if n_ii and n_ix and n_xi and n_xx:
                score = score_fn(n_ii, (n_ix, n_xi), n_xx)
                word_scores[word] = score

    # Final scores
    return word_scores
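
The way score_words() spreads work over its channels with itertools.cycle can be seen in isolation. The following is only a minimal sketch: assign_round_robin is a hypothetical helper that is not part of dist_featx, and plain lists stand in for the execnet channels.

```python
import itertools

def assign_round_robin(items, n_workers):
    """Return one batch per worker, filled in round-robin order.

    Hypothetical helper: plain lists stand in for execnet channels.
    """
    batches = [[] for _ in range(n_workers)]
    cyc = itertools.cycle(batches)
    for item in items:
        # Each item goes to the next worker in the cycle
        next(cyc).append(item)
    return batches

# Two labels and two workers: each worker receives one label
two_labels = assign_round_robin(["neg", "pos"], 2)

# Three items and two workers: the cycle wraps back to the first worker
three_items = assign_round_robin(["a", "b", "c"], 2)
```

With the default specs of two popen gateways and the two movie_reviews labels, this is why each channel ends up counting the words of exactly one label.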

This scoring method is only accurate when comparing two labels; if there are more than two labels, a different scoring method should be used. Your requirements will determine how you store the word scores.
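
To make the four counts passed to score_fn more concrete, here is a rough pure-Python sketch of a chi-square score computed from the same arguments (n_ii, (n_ix, n_xi), n_xx). It follows the usual 2x2 contingency-table formula and is meant as an illustration, not as NLTK's actual implementation; the function name chi_sq_sketch and the sample counts are made up for the example.

```python
def chi_sq_sketch(n_ii, marginals, n_xx):
    """Chi-square score from a 2x2 contingency table (illustrative only).

    n_ii: count of the word within the label
    n_ix: count of the word across all labels
    n_xi: count of all words within the label
    n_xx: count of all words across all labels
    """
    n_ix, n_xi = marginals
    n_io = n_ix - n_ii                # the word, in other labels
    n_oi = n_xi - n_ii                # other words, in this label
    n_oo = n_xx - n_ii - n_io - n_oi  # other words, in other labels
    denominator = ((n_ii + n_io) * (n_ii + n_oi) *
                   (n_io + n_oo) * (n_oi + n_oo))
    if denominator == 0:
        return 0.0
    return n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denominator

# Made-up counts: a word seen 30 times in a 100-word label,
# 40 times overall, in a corpus of 200 words
skewed = chi_sq_sketch(30, (40, 100), 200)

# The same word spread evenly over both labels carries no signal
even = chi_sq_sketch(20, (40, 100), 200)
```

A word that is evenly split between the two labels scores zero, while a word concentrated in one label scores higher, which is what makes the score useful for picking informative words.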
Once the initial data used to construct the frequency-distribution instances has been received, two types of data can arrive over the channel:

  1. A done message: indicates that no more data is coming over the channel.
    The receiver replies with another done message, then exits the loop to close the channel.
  2. A 2-tuple (label, words): iterated over to increment the counters in the RedisHashFreqDist and RedisConditionalHashFreqDist.
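
The receiving loop described by these two cases might look roughly like the sketch below. It is not the actual remote_word_count module: FakeChannel is a hypothetical in-memory stand-in for an execnet channel, and a Counter and a defaultdict of Counters replace RedisHashFreqDist and RedisConditionalHashFreqDist so that the example runs without execnet or a Redis server.

```python
from collections import Counter, defaultdict

class FakeChannel:
    """Hypothetical in-memory stand-in for an execnet channel."""
    def __init__(self, incoming):
        self._incoming = list(incoming)
        self.sent = []

    def __iter__(self):
        return iter(self._incoming)

    def send(self, item):
        self.sent.append(item)

def count_words(channel):
    fd = Counter()               # stands in for RedisHashFreqDist
    cfd = defaultdict(Counter)   # stands in for RedisConditionalHashFreqDist
    for data in channel:
        if data == "done":
            # Case 1: acknowledge with a done reply and leave the loop
            channel.send("done")
            break
        # Case 2: a (label, words) tuple; increment both counters
        label, words = data
        for word in words:
            fd[word] += 1
            cfd[label][word] += 1
    return fd, cfd

channel = FakeChannel([
    ("pos", ["good", "movie"]),
    ("neg", ["bad", "movie"]),
    "done",
])
fd, cfd = count_words(channel)
```

In the distributed version the counters live in Redis, so every remote worker increments the same shared hashes instead of local dictionaries.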