Welcome to the first post of our NLP course. In this post we will talk about tokenization of words and sentences, and about removing stopwords from a text.
Why Tokenization?
Tokenization is done so that your data can be processed as a sequence of small, well-defined units. For example, if we tokenize the text into words, then each word becomes the smallest entity (or token) processed by our logic; if we tokenize into sentences, then each sentence is the smallest entity. But why do we actually need it?
Some of you may say that to tokenize the text into words, we can just split the text on whitespace and remove all the punctuation marks. Let’s see via an example…
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
While breaking the above sentence into tokens (or words), how should we break “O’Neill” –
- o , neill
- oneill
- o’neill
- o’ , neill
- neill
And for aren’t –
- are , n’t
- arent
- aren’t
- aren , t
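Before answering, it helps to see what the naive approach actually produces. Here is a minimal sketch (the variable names are purely illustrative) that removes all punctuation and then splits the sentence on whitespace, exactly as suggested above –

import string

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Naive approach: delete every punctuation character, then split on whitespace
cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
naive_tokens = cleaned.split()
print(naive_tokens)
# ['Mr', 'ONeill', 'thinks', 'that', 'the', 'boys', 'stories', 'about',
#  'Chiles', 'capital', 'arent', 'amusing']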
As we can see, splitting the terms only by whitespace (and throwing away punctuation) is not going to give us proper results. These issues are language specific, and it is really tricky to write down rules for such cases. In English, hyphenation is used for various purposes, ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard) to a copyediting device to show word grouping (the hold-him-back-and-drag-him-away maneuver).
It is easy to feel that the first example should be regarded as one token (it is indeed more commonly written as just coeducation), that the last should be separated into words, and that the middle case is unclear. Handling hyphens automatically can thus be complex: it can be treated as a classification problem, or, more commonly, handled by heuristic rules, such as allowing short hyphenated prefixes on words but not longer hyphenated forms.
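As a purely illustrative sketch (the function name and the length threshold below are my own, not a standard rule), such a heuristic might look like this –

def split_hyphenated(token, max_prefix_len=3):
    # Illustrative heuristic: keep a short hyphenated prefix attached (co-education),
    # but break longer or multi-part hyphenated forms into separate words.
    parts = token.split('-')
    if len(parts) == 2 and len(parts[0]) <= max_prefix_len:
        return [token]
    return parts

for t in ['co-education', 'Hewlett-Packard', 'hold-him-back-and-drag-him-away']:
    print(t, '->', split_hyphenated(t))

Note that a rule this simple also splits Hewlett-Packard, which is exactly why the middle case above is called unclear; in practice such rules are combined with exception lists or learned classifiers.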
Stopwords
Stopwords are common words (a, the, of, an, this, etc.) which occur in large numbers in a text but rarely carry much value. These words can hinder our processing, increase complexity and confuse the logic. Therefore, almost all search engines and NLP applications remove stopwords from the text before applying their logic.
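NLTK ships with a ready-made stopword list for English (and several other languages). A quick way to peek at it, assuming the stopwords corpus has already been downloaded (see the next section), is –

from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
print(len(english_stopwords))   # the exact count depends on your NLTK version
print(english_stopwords[:10])   # first few entries: 'i', 'me', 'my', ...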
Implementation
For word tokenization we use a pre-trained tokenizer (word_tokenize), which is trained on the Penn Treebank dataset. Make sure you have installed NLTK, a Python NLP package.
pip install nltk
Open Python and type the following –
import nltk
nltk.download()
A dialogue box will open; download the “all” collection of packages.
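If you would rather not download everything, you can fetch just the resources used in this post (these identifiers work in recent NLTK versions) –

import nltk
nltk.download('punkt')       # models used by word_tokenize and sent_tokenize
nltk.download('stopwords')   # stopword lists, including English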
Now here is the code to word-tokenize your text. Try it yourself by running this example on your machine.
import nltk
from nltk.corpus import stopwords

text = "Hi, Laptops from Hewlett-Packard aren't running MacOS. We would love an Apple Mac for our work."

# Word tokenization: split the text into words, punctuation and clitics like "n't"
print('Word tokenization – ')
word_token = nltk.word_tokenize(text)
print(word_token)

# Sentence tokenization: split the text into individual sentences
print('\nSentence tokenization – ')
print(nltk.sent_tokenize(text))

# Stopword removal: keep only the tokens that are not in NLTK's English stopword list
print('\nWord tokens after removing stopwords – ')
stop = set(stopwords.words('english'))
ans = [token for token in word_token if token not in stop]
print(ans)
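If everything is set up correctly, you should see that word_tokenize keeps “Hewlett-Packard” as a single token, splits “aren't” into “are” and “n't”, and separates punctuation into its own tokens, while sent_tokenize returns the two sentences. Also note that the stopword check above is case-sensitive, so a capitalised word like “We” survives even though “we” is in the stopword list; lower-casing the tokens before the comparison is a common fix.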
Do share your views in the comments section below.
Click “Follow Blog Via Email” at the bottom of the page to stay updated with the NLP course.