Text normalization aims to transform text into a more predictable and consistent form. The main techniques include tokenization, lowercasing, punctuation removal, stop word removal, stemming, and lemmatization.
Tokenization is the process of splitting text into individual words or tokens. NLTK provides several tokenizers, such as the word_tokenize function.
import nltk
from nltk.tokenize import word_tokenize
# Download the necessary data
nltk.download('punkt')
text = "Hello, how are you today?"
tokens = word_tokenize(text)
print(tokens)
In this code, we first import the word_tokenize function from NLTK’s tokenize module. We then download the punkt data, which is required for tokenization. Finally, we tokenize the input text and print the resulting tokens.
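If the punkt data is installed, the output should look like this (note that word_tokenize splits punctuation into separate tokens):
['Hello', ',', 'how', 'are', 'you', 'today', '?']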
Lowercasing is a simple yet effective normalization technique that converts all text to lowercase.
text = "Hello, HOW are you?"
lowercased_text = text.lower()
print(lowercased_text)
Here, we use the built-in lower() method of Python strings to convert the text to lowercase.
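In a typical pipeline, lowercasing is applied per token rather than to the raw string. A minimal sketch, reusing the tokens list from the tokenization example above:
# Lowercase each token individually
lowercased_tokens = [token.lower() for token in tokens]
print(lowercased_tokens)
This should print ['hello', ',', 'how', 'are', 'you', 'today', '?'].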
Punctuation marks can be removed using regular expressions.
import re
text = "Hello, how are you today?"
# Remove punctuation using regular expression
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)
The re.sub() function replaces all non-alphanumeric, non-whitespace characters with an empty string. (Note that \w also matches underscores, so underscores would be kept.)
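If you prefer to avoid regular expressions, the same result (for ASCII punctuation) can be achieved with Python’s built-in str.translate() and the string.punctuation constant; a minimal sketch reusing the text variable above:
import string
# Map every ASCII punctuation character to None
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)
Keep in mind that string.punctuation covers only ASCII punctuation, so Unicode punctuation such as curly quotes would survive.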
Stop words are common words such as “the”, “and”, “is” that do not carry much semantic meaning. NLTK provides a list of stop words for different languages.
from nltk.corpus import stopwords
# Download the stopwords data
nltk.download('stopwords')
text = "Hello, how are you today?"
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
In this code, we first download the stop words data for English. We then tokenize the text and create a set of stop words. Finally, we filter out the stop words from the tokens.
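For the sample sentence this prints ['Hello', ',', 'today', '?']: stop word filtering does not touch punctuation tokens. One common refinement is to also require str.isalpha(); a sketch reusing the tokens and stop_words variables above:
# Keep only alphabetic tokens that are not stop words
filtered_tokens = [word for word in tokens if word.isalpha() and word.lower() not in stop_words]
print(filtered_tokens)
This leaves just ['Hello', 'today'].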
Stemming is the process of reducing words to their base or root forms. NLTK provides several stemmers, such as the PorterStemmer.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runs", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
Here, we create an instance of the PorterStemmer and apply it to a list of words to get their stemmed forms.
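This should print ['run', 'run', 'ran']: the regular forms are reduced to their stem, but the irregular past tense 'ran' is left untouched, a limitation we return to below.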
Lemmatization is similar to stemming but produces more meaningful base forms: actual dictionary words rather than truncated stems. NLTK’s WordNetLemmatizer can be used for lemmatization.
from nltk.stem import WordNetLemmatizer
# Download the WordNet data
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "ran"]
# Pass pos='v' so the words are lemmatized as verbs (the default part of speech is noun)
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)
We first download the wordnet data, which is required for lemmatization. Then we create an instance of the WordNetLemmatizer and apply it to a list of words, passing pos='v' so each word is treated as a verb; with the default noun setting, 'ran' would come back unchanged.
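With the verb part of speech supplied, the output should be ['run', 'run', 'run']: unlike the Porter stemmer, the lemmatizer maps even the irregular form 'ran' back to its dictionary base form.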
A common pitfall: the PorterStemmer may not handle irregular verbs correctly, as we saw with 'ran', whereas lemmatization with the right part of speech can.
Text normalization is a fundamental step in NLP that can significantly improve the performance of various NLP tasks. NLTK provides a wide range of tools to perform text normalization, including tokenization, lowercasing, stop word removal, stemming, and lemmatization. By understanding the core concepts, typical usage scenarios, common pitfalls, and best practices, you can effectively apply text normalization techniques in real-world situations.
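As a closing example, here is a minimal end-to-end sketch tying the techniques together. The normalize() helper name, the sample sentence, and the choice to lemmatize every token as a verb are this example’s own assumptions, not an NLTK convention:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the required data (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def normalize(text):
    """Tokenize, lowercase, drop punctuation and stop words, lemmatize."""
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens that are not stop words
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Lemmatizing everything as a verb is an assumption of this sketch;
    # real pipelines often POS-tag first and pass the tag per word
    return [lemmatizer.lemmatize(t, pos='v') for t in tokens]

print(normalize("Hello, how are you today? I was running earlier."))
# Expected output (roughly): ['hello', 'today', 'run', 'earlier']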