NLP | Distributed Tagging with Execnet, Part 1

What is Execnet?

  • Execnet is a distributed execution library for Python.
  • It lets you create gateways and channels for remote code execution.
  • A gateway is the connection between the calling process and a remote environment.
  • The remote environment can be a local subprocess or an SSH connection to a remote host.
  • A channel is created from a gateway and carries communication between the channel creator and the remote code.
  • In this sense, execnet works somewhat like a Message Passing Interface (MPI): the gateway establishes the connection and the channel sends messages back and forth.

Since many NLTK processes use 100% CPU during computation, execnet is an ideal way to distribute that computation and maximize resource utilization. You can create one gateway per CPU core, and it does not matter whether the cores are on the local machine or spread across remote machines. In many situations, the trained objects and data only need to live on one machine, which sends them to the remote nodes as needed.
Install execnet:
It should be as easy as sudo pip install execnet or sudo easy_install execnet. The current version of execnet, at the time of this writing, is 1.2. The execnet home page, with documentation and API examples, is at http://codespeak.net/execnet/.

How does it work?
The pickle module must be imported in order to serialize the tagger so that it can be sent. Execnet does not natively know how to handle complex objects such as a part-of-speech tagger, so the tagger must be dumped to a string (a bytes object in Python 3) using pickle.dumps().
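
The serialization step can be tried on its own, without NLTK or execnet. The sketch below uses a toy class, ToyTagger, as a hypothetical stand-in for a trained tagger; it is not an NLTK class.

```python
import pickle


class ToyTagger:
    """Hypothetical stand-in for a trained tagger: gives every token the same tag."""

    def __init__(self, default_tag):
        self.default_tag = default_tag

    def tag(self, tokens):
        return [(token, self.default_tag) for token in tokens]


tagger = ToyTagger("NN")

# pickle.dumps() turns the object into bytes, which can be sent as a plain message.
payload = pickle.dumps(tagger)

# On the receiving side, pickle.loads() rebuilds an equivalent object.
restored = pickle.loads(payload)

if __name__ == "__main__":
    print(type(payload).__name__)
    print(restored.tag(["Hello", "world"]))
```

Note that unpickling only works if the receiving interpreter can import the object's class, so for a real NLTK tagger the remote side needs NLTK installed as well.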
The example uses the default tagger behind nltk.tag.pos_tag(), but any pretrained part-of-speech tagger can be used as long as it implements the TaggerI interface. Once the tagger is serialized, execnet is started by creating a gateway with execnet.makegateway().
The default gateway creates a Python subprocess, and calling its remote_exec() method with the remote_tag module creates the channel. With the channel open, the serialized tagger is sent first, followed by the first tokenized sentence of the treebank corpus.
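
The contents of the remote_tag module are not shown in this part. As an assumption, a module of roughly the following shape would fit the protocol described above: receive the pickled tagger first, then tag every sentence that arrives. The function name tag_sentences is hypothetical; the `__channelexec__` check is the name execnet gives to code it runs remotely, with `channel` made available as a global.

```python
# remote_tag.py (assumed contents, for illustration only)
import pickle


def tag_sentences(channel):
    """Unpickle a tagger from the channel, then tag each sentence received."""
    # First message: the pickled tagger sent by the calling process.
    tagger = pickle.loads(channel.receive())
    # Every later message is a tokenized sentence; iteration stops
    # when the other side closes the channel.
    for sentence in channel:
        channel.send(tagger.tag(sentence))


# When execnet executes this module remotely, __name__ is '__channelexec__'
# and a global named channel is provided by execnet.
if __name__ == '__channelexec__':
    tag_sentences(channel)
```

Keeping the loop in a function makes it possible to import the module normally without side effects, since the guard only fires under execnet.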
Visually, the communication process looks like this

Calling channel.receive() now returns a tagged sentence that is equal to the first tagged sentence in the treebank corpus, which confirms that the tagging worked. Finally, exiting the gateway closes the channel and kills the subprocess.

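import pickle
import execnet, remote_tag, nltk.tag, nltk.data
from nltk.corpus import treebank

# Load the default tagger and pickle it so it can be sent over a channel.
# Note: _POS_TAGGER is a private constant and may be missing in newer NLTK releases.
pickled_tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))
gw = execnet.makegateway()

# Run the remote_tag module in the subprocess and open a channel to it.
channel = gw.remote_exec(remote_tag)
channel.send(pickled_tagger)
channel.send(treebank.sents()[0])

tagged_sentence = channel.receive()

# Compare the remote result with the corpus's own tagging
print(tagged_sentence == treebank.tagged_sents()[0])

# Closes the channel and kills the subprocess
gw.exit()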

Output:

 True 



