8 great Python libraries for natural language processing
Natural language processing, or NLP for short, is best described as "AI for speech and text." The magic behind voice commands, speech and text translation, sentiment analysis, text summarization, and many other linguistic applications and analyses, natural language processing has improved dramatically through deep learning.
The Python language provides a convenient front end to all varieties of machine learning, including NLP. In fact, there is an embarrassment of NLP riches to choose from in the Python ecosystem. In this article we'll explore each of the NLP libraries available for Python: their use cases, their strengths, their weaknesses, and their general level of popularity.
Note that some of these libraries provide higher-level versions of the same functionality exposed by others, making that functionality easier to use at the cost of some precision or performance. You'll want to choose a library well suited both to your level of expertise and to the nature of the project.
The CoreNLP library, a product of Stanford University, was built to be a production-ready natural language processing solution, capable of delivering NLP predictions and analyses at scale. CoreNLP is written in Java, but multiple Python packages and APIs are available for it, including a native Python NLP library called Stanza.
CoreNLP includes a broad range of language tools: grammar tagging, named entity recognition, parsing, sentiment analysis, and plenty more. It was designed to be human-language agnostic, and currently supports Arabic, Chinese, French, German, and Spanish in addition to English (with Russian, Swedish, and Danish support available from third parties). CoreNLP also includes a web API server, a convenient way to serve predictions without too much additional work.
The easiest place to start with CoreNLP's Python wrappers is Stanza, the reference implementation created by the Stanford NLP Group. In addition to being well documented, Stanza is also maintained regularly; many of the other Python libraries for CoreNLP have not been updated in some time.
CoreNLP also supports the use of NLTK, a major Python NLP library discussed below. As of version 3.2.3, NLTK includes interfaces to CoreNLP in its parser. Just be sure to use the correct API.
The obvious downside of CoreNLP is that you'll need some familiarity with Java to get it up and running, but that's nothing a careful reading of the documentation can't achieve. Another hurdle could be CoreNLP's licensing. The whole toolkit is licensed under the GPLv3, meaning any use in proprietary software that you distribute to others will require a commercial license.
Gensim does just two things, but does them exceedingly well. Its focus is statistical semantics: analyzing documents for their structure, then scoring other documents based on their similarity.
Gensim can work with very large bodies of text by streaming documents to its analysis engine and performing unsupervised learning on them incrementally. It can create multiple types of models, each suited to different scenarios: Word2Vec, Doc2Vec, FastText, and Latent Dirichlet Allocation.
Gensim's detailed documentation includes tutorials and how-to guides that explain key concepts and illustrate them with hands-on examples. Common recipes are also available on the Gensim GitHub repo.
The latest version, Gensim 4, supports Python 3 only, but brings major optimizations to common algorithms such as Word2Vec, a less complex OOP model, and many other modernizations.
The Natural Language Toolkit, or NLTK for short, is among the best-known and most powerful of the Python natural language processing libraries. Many corpora (data sets) and trained models are available to use with NLTK out of the box, so you can start experimenting with NLTK right away.
As the documentation states, NLTK provides a wide variety of tools for working with text: "classification, tokenization, stemming, tagging, parsing, and semantic reasoning." It can also work with some third-party tools to enhance its functionality, such as the Stanford Tagger, TADM, and MEGAM.
Keep in mind that NLTK was created by and for an academic research audience. It was not designed to serve NLP models in a production environment. The documentation is also somewhat sparse; even the how-tos are thin. Also, there is no 64-bit binary; you'll need to install the 32-bit edition of Python to use it. Finally, NLTK is not the fastest library either, but it can be sped up with parallel processing.
If you are determined to leverage what's inside NLTK, you might start instead with TextBlob (discussed below).
If all you need to do is scrape a popular website and analyze what you find, reach for Pattern. This natural language processing library is far smaller and narrower than the other libraries covered here, but that also means it's focused on doing one common job really well.
Pattern comes with built-ins for scraping a number of popular web services and sources (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), all of which are available as Python modules (e.g., from pattern.web import Twitter). You don't have to reinvent the wheel for getting data from those sites, with all of their individual quirks. You can then perform a variety of common NLP operations on the data, such as sentiment analysis.
Pattern exposes some of its lower-level functionality, allowing you to use NLP functions, n-gram search, vectors, and graphs directly if you like. It also has a built-in helper library for working with common databases (MySQL, SQLite, and MongoDB in the future), making it easy to work with tabular data stored from previous sessions or obtained from third parties.
Polyglot, as the name implies, enables natural language processing applications that deal with multiple languages at once.
The NLP features in Polyglot echo what's found in other NLP libraries: tokenization, named entity recognition, part-of-speech tagging, sentiment analysis, word embeddings, etc. For each of these operations, Polyglot provides models that work with the needed languages.
Note that Polyglot's language support differs greatly from feature to feature. For instance, the language detection system supports almost 200 languages, tokenization supports 165 languages (largely because it uses the Unicode Text Segmentation algorithm), and sentiment analysis supports 136 languages, while part-of-speech tagging supports only 16.
PyNLPI (pronounced "pineapple") has only a basic roster of natural language processing functions, but it has some truly useful data-conversion and data-processing features for NLP data formats.
Most of the NLP functions in PyNLPI are for basic jobs like tokenization or n-gram extraction, along with some statistical functions useful in NLP, like Levenshtein distance between strings or Markov chains. Those functions are implemented in pure Python for convenience, so they're unlikely to have production-level performance.
But PyNLPI shines for working with some of the more exotic data types and formats that have sprung up in the NLP space. PyNLPI can read and process GIZA, Moses++, SoNaR, Taggerdata, and TiMBL data formats, and devotes an entire module to working with FoLiA, the XML document format used to annotate language resources like corpora (bodies of text used for translation or other analysis).
You'll want to reach for PyNLPI whenever you're dealing with those data types.
SpaCy, which taps Python for convenience and Cython for speed, is billed as "industrial-strength natural language processing." Its creators claim it compares favorably to NLTK, CoreNLP, and other competitors in terms of speed, model size, and accuracy. SpaCy contains models for multiple languages, although only 16 of the 64 supported languages have full data pipelines available for them.
SpaCy includes most every feature found in those competing frameworks: part-of-speech tagging, dependency parsing, named entity recognition, tokenization, sentence segmentation, rule-based match operations, word vectors, and tons more. SpaCy also includes optimizations for GPU operations, both for accelerating computation and for storing data on the GPU to avoid copying.
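You can get a feel for spaCy's pipeline style without downloading a trained model by using a blank pipeline; a trained package such as en_core_web_sm would add tagging, parsing, and entity recognition on top of this (the sentence is invented):

```python
import spacy

# A blank pipeline provides spaCy's tokenizer and Doc/Token objects
# without requiring any trained-model download.
nlp = spacy.blank("en")
doc = nlp("SpaCy pairs Python convenience with Cython speed.")

# Doc behaves like a sequence of richly annotated Token objects.
print([token.text for token in doc])
```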
The documentation for SpaCy is excellent. A setup wizard generates command-line installation actions for Windows, Linux, and macOS, and for different Python environments (pip, conda, etc.) as well. Language models install as Python packages, so they can be tracked as part of an application's dependency list.
The latest version of the framework, SpaCy 3.0, provides many upgrades. In addition to using the Ray framework for performing distributed training on multiple machines, it offers a new transformer-based pipeline system for better accuracy, a new training system and workflow configuration model, end-to-end workflow management, and a good deal more.
TextBlob is a friendly front end to the Pattern and NLTK libraries, wrapping both of those libraries in high-level, easy-to-use interfaces. With TextBlob, you spend less time struggling with the intricacies of Pattern and NLTK and more time getting results.
TextBlob smooths the way by leveraging native Python objects and syntax. The quickstart examples show how texts to be processed are simply treated as strings, and common NLP methods like part-of-speech tagging are available as methods on those string objects.
Another advantage of TextBlob is that you can "lift the hood" and alter its functionality as you grow more confident. Many default components, like the sentiment analysis system or the tokenizer, can be swapped out as needed. You can also create high-level objects that combine components (this sentiment analyzer, that classifier, etc.) and reuse them with minimal effort. This way, you can prototype something quickly with TextBlob, then refine it later.