Rayudu Yarlagadda
3 min read · Jan 29, 2019


Working in Natural Language Processing (NLP) and analyzing free-form text is about having the right set of tools. This is an introduction to the essential tools in Python’s Natural Language Toolkit (NLTK).

To understand how chatbots work and how NLP fits into them, let's look at examples with Python's NLTK.

The code we will be working with is here.

Tokenization

Our first step in working with free-form text is to tokenize sentences into words.
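
Here's a minimal sketch of this step, assuming nltk.word_tokenize, with the sentence reconstructed from the output below:

import nltk
# nltk.download('punkt')  # tokenizer models, needed on first run

sentence = "Jim is bringing his bulldog to eat at Friendlys?"
words = nltk.word_tokenize(sentence)
print(len(words), words)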

10 ['Jim', 'is', 'bringing', 'his', 'bulldog', 'to', 'eat', 'at', 'Friendlys', '?']

Next, it's useful to remove stop words like 'is', 'his', etc.
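
A sketch, assuming NLTK's built-in English stop-word list and the words list from the previous step:

from nltk.corpus import stopwords
# nltk.download('stopwords')  # needed on first run

stops = set(stopwords.words('english'))
words = [w for w in words if w.lower() not in stops]
print(len(words), words)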

6 ['Jim', 'bringing', 'bulldog', 'eat', 'Friendlys', '?']

You can easily purge punctuation marks with a similar approach.
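For example (string.punctuation is one option; a regex works too). We keep the result under a separate name, words_nopunct, because the pos tagger later on is run on the list that still contains the '?':

import string

words_nopunct = [w for w in words if w not in string.punctuation]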

Stemming

Our next step is to stem each word.
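
A sketch of the stemming loop. Judging by the stems below ('friend' rather than Porter's 'friendli'), the aggressive LancasterStemmer appears to be the one in use:

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
for w in words_nopunct:
    print(stemmer.stem(w))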

jim
bring
bulldog
eat
friend

Notice that the word 'bringing' is stemmed to 'bring', so we can now match it with the stems for 'bring' and 'brings'. Likewise, the stem 'talk' matches the stems of 'talking', 'talker', 'talked', etc. Stemming is used in many text-classification approaches, for example Naive Bayes.

Part-of-Speech Tagging

Our next tool to try is the part-of-speech (pos) tagger.
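
A sketch, assuming nltk.pos_tag over the stop-word-stripped tokens (which, per the output below, still include the '?'), plus a simple 'NN' filter for the noun list:

# nltk.download('averaged_perceptron_tagger')  # needed on first run

tagged = nltk.pos_tag(words)
print(tagged)
print('Nouns:', [(w, t) for w, t in tagged if t == 'NN'])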

[('Jim', 'NNP'), ('bringing', 'VBG'), ('bulldog', 'JJ'), ('eat', 'NN'), ('Friendlys', 'NNP'), ('?', '.')] Nouns: [('eat', 'NN')]

The off-the-shelf tagger for NLTK uses the Penn Treebank tagset. Wondering why it’s called a ‘Treebank’? See here.

The tags, e.g. 'NNP' and 'VBG', are listed and explained here.

Chunking

A few more elaborate tools are the chunker and entity-extractor.
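
A sketch, assuming nltk.ne_chunk on the pos-tagged full sentence, with tree2conlltags flattening the result into the IOB triples shown below; the second line assumes a filter for organization entities:

from nltk.chunk import tree2conlltags
# nltk.download('maxent_ne_chunker')  # needed on first run
# nltk.download('words')

full_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
iob = tree2conlltags(nltk.ne_chunk(full_tagged))
print(iob)
print([t for t in iob if t[2] == 'B-ORGANIZATION'])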

[('Jim', 'NNP', 'B-PERSON'), ('is', 'VBZ', 'O'), ('bringing', 'VBG', 'O'), ('his', 'PRP$', 'O'), ('bulldog', 'NN', 'O'), ('to', 'TO', 'O'), ('eat', 'VB', 'O'), ('at', 'IN', 'O'), ('Friendlys', 'NNP', 'B-ORGANIZATION'), ('?', '.', 'O')]
[('Friendlys', 'NNP', 'B-ORGANIZATION')]

It correctly identified 'bulldog' as a noun, 'Jim' as a person, and 'Friendlys' as an organization.

Just as with part-of-speech tagging, you'll want to train the named-entity chunker. This is important to understand: if you are extracting entities from legalese, for example, you need to train the model on a legalese corpus (a collection of such documents).

Named entity training data needs to be consistent with the kind of text you will be analyzing.

WordNet Interface

NLTK includes an interface to WordNet. This provides very handy utilities, including word definitions, synonyms, and many more.

We can easily look up the definition of a noun in our sentence: 'bulldog'.
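
A sketch, assuming the first noun synset returned by WordNet:

from nltk.corpus import wordnet as wn
# nltk.download('wordnet')  # needed on first run

print(wn.synsets('bulldog', pos=wn.NOUN)[0].definition())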

('bulldog', 'NN', 'O')
a sturdy thickset short-haired breed with a large head and strong undershot lower jaw; developed originally in England for bull baiting

We can use synset similarity() to measure how closely related two words are in a general sense.
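
The fractional scores below are consistent with Wu-Palmer similarity (wup_similarity); this sketch assumes that measure and the first synset of each word:

bulldog = wn.synsets('bulldog')[0]
for word in ('poodle', 'car', 'space'):
    print(bulldog.wup_similarity(wn.synsets(word)[0]), '# bulldog ->', word)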

0.8387096774193549      # bulldog -> poodle
0.36363636363636365 # bulldog -> car
0.13333333333333333 # bulldog -> space

Notice that "poodle" is very similar to "bulldog", and "car" much less so. A highly dissimilar word is "space": the two concepts sit far apart in WordNet's hypernym taxonomy, so the score is low.

We can use similarity scores in 'bag of words' approaches to text classification. In text classification with Neural Networks, each input neuron carries a floating-point value for one word in the 'bag'. The strictest encoding is binary, zero (0) or one (1), but intermediate similarity values are often useful: "bulldog" isn't exactly the same as "poodle", but in most contexts it's darn close (not zero).
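
As an illustration (a hypothetical helper, not from the original post), a 'soft' bag-of-words row could score each vocabulary word by its best WordNet similarity to the input tokens instead of a hard 0/1:

def soft_bow(tokens, vocab):
    # hypothetical: one float per vocabulary word, via wup_similarity;
    # assumes every word has at least one synset
    row = []
    for v in vocab:
        v_syn = wn.synsets(v)[0]
        scores = [v_syn.wup_similarity(wn.synsets(t)[0]) or 0.0 for t in tokens]
        row.append(max(scores, default=0.0))
    return row

print(soft_bow(['poodle'], ['bulldog', 'car', 'space']))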

We can also look up lists of synonyms and antonyms.
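
A sketch, using a hypothetical helper that walks each synset's lemmas and their antonyms:

def syns_and_ants(word):
    synonyms, antonyms = set(), set()
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
            for ant in lemma.antonyms():
                antonyms.add(ant.name())
    return synonyms, antonyms

print(syns_and_ants('bulldog'))
print(syns_and_ants('rich'))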

{'bulldog', 'English_bulldog'}      # synonymous to 'bulldog'
set() # no antonyms
{'rich', 'deep', 'ample', 'full-bodied', 'robust', 'plenteous', 'rich_people', 'fertile', 'productive', 'racy', 'copious', 'plentiful', 'fat'} # synonyms for 'rich'
{'poor_people', 'poor', 'lean'} # antonyms for 'rich'

WordNet knows that ‘English bulldog’ is synonymous with ‘bulldog’, and it knows that ‘poor’ is an antonym to ‘rich’. Notice that it posited no antonym for our noun, as this wouldn’t make sense.

Using these essential tools you can analyze free-form text and begin to deconstruct it. This is a starting point for analytics, classification, etc.

For further study refer to the NLTK book.

For a deeper dive into text classification, see here for Naive Bayes and here for Neural Networks, both using NLTK.
