Tokenizing Words and Sentences, Stopwords Removal (NLP part 1)

Welcome to the first blog in our NLP course. In this post, we will talk about tokenization of words and sentences, and about removing stopwords from text.

Why Tokenization?

Tokenization is done in order to process your data as discrete units. This means that, for example, if we tokenize text into words, each word becomes the smallest entity (or token) to be processed by our logic; if we tokenize into sentences, each sentence becomes the smallest entity. But why do we actually need it?

Some of you may say that to word-tokenize the text, we can just split it on whitespace and remove all the punctuation marks. Let’s see via an example…

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.

While breaking the above sentence into tokens (or words), how should we break “O’Neill” –

  • o , neill
  • oneill
  • o’neill
  • o’ , neill
  • neill

And for aren’t –

  • are , n’t
  • arent
  • aren’t
  • aren , t

As we can see, splitting the text only by whitespace is not going to give us proper results. These issues are language-specific, and it is really tricky to write down rules for every such case.
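To make this concrete, here is a minimal sketch of the naive approach in plain Python (no NLTK yet); the exact tokens you get depend on how aggressively you strip punctuation:

import string

text = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Naive approach: split on whitespace, then strip punctuation from the ends of each piece.
naive_tokens = [w.strip(string.punctuation) for w in text.split()]
print(naive_tokens)
# "O'Neill" and "aren't" survive as single glued chunks, "boys'" silently loses its
# possessive apostrophe, and "Mr." becomes "Mr" -- none of the splitting decisions
# discussed above are actually being made.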

Hyphenation is another example. In English, hyphens are used for various purposes, ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard) to a copy-editing device that shows word grouping (the hold-him-back-and-drag-him-away maneuver). It is easy to feel that the first example should be regarded as one token (it is indeed more commonly written as just coeducation), that the last should be separated into words, and that the middle case is unclear. Handling hyphens automatically can thus be complex: it can either be treated as a classification problem or, more commonly, handled by heuristic rules, such as allowing short hyphenated prefixes on words but not longer hyphenated forms.
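As a rough illustration of such a heuristic (the prefix list and the splitting rule below are made up for this example, not taken from any standard tokenizer), one might keep short hyphenated prefixes attached but split longer hyphenated compounds:

SHORT_PREFIXES = {"co", "pre", "re", "non"}  # hypothetical prefix list

def split_hyphenated(token):
    # Keep short prefix forms like 'co-education' as one token,
    # split longer hyphenated forms into their component words.
    if "-" not in token:
        return [token]
    head, _, _ = token.partition("-")
    if head.lower() in SHORT_PREFIXES:
        return [token]               # 'co-education' stays a single token
    return token.split("-")          # 'hold-him-back-...' is split apart

print(split_hyphenated("co-education"))                     # ['co-education']
print(split_hyphenated("Hewlett-Packard"))                  # ['Hewlett', 'Packard']
print(split_hyphenated("hold-him-back-and-drag-him-away"))  # ['hold', 'him', 'back', 'and', 'drag', 'him', 'away']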

Stopwords

Stopwords are common words (a, the, of, an, this, etc.) which occur in large numbers in the text but rarely carry much value. These words can hinder our processing, increase complexity and confuse the logic. Therefore, almost all search engines and NLP applications remove stopwords from the text before applying their logic.
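For instance, with NLTK’s built-in English stopword list (installing NLTK is covered in the next section), filtering an already tokenized text is a one-liner; a minimal sketch:

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

tokens = ['this', 'is', 'an', 'example', 'of', 'a', 'tokenized', 'sentence']
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stopword list.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # roughly: ['example', 'tokenized', 'sentence']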

Implementation

Therefore, instead of writing such rules ourselves, we use a ready-made tokenizer (word_tokenize), which follows the Penn Treebank tokenization conventions. Make sure you have installed NLTK, a Python NLP package.

pip install nltk

Open Python and run –

import nltk
nltk.download()

A dialogue box will open; download the “all” collection.
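If you prefer not to use the GUI (for example, on a server), you can instead download just the packages this post needs from the Python prompt:

import nltk
nltk.download('punkt')      # pre-trained models used by the tokenizers
nltk.download('stopwords')  # the stopword lists used later in this post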

Now here is the code to word tokenize your text. Try running this example on your own machine.
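The original listing is not reproduced here, so the following is a minimal sketch of what such a script looks like: tokenize the example sentence into sentences and words, then drop the stopwords.

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

text = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Sentence tokenization: the whole input is a single sentence here.
print(sent_tokenize(text))

# Word tokenization: note how clitics such as "aren't" are split into "are" + "n't".
words = word_tokenize(text)
print(words)

# Stopword removal: drop common English words before further processing.
stop_words = set(stopwords.words('english'))
content_words = [w for w in words if w.lower() not in stop_words]
print(content_words)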

Do share your views in the comments section below.

Click “Follow Blog Via Email” at the bottom of the page to stay updated with the NLP course.
