Text classification is the process of assigning one or more predefined categories to a text document based on its content. For example, emails can be classified as spam or non-spam, and news articles as sports, politics, or entertainment.
The Natural Language Toolkit (NLTK) is a Python library that provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
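As a quick, standalone illustration of those utilities (separate from the classification pipeline below), here is a minimal tokenization sketch; it assumes the 'punkt' tokenizer models have been downloaded (newer NLTK releases may ask for 'punkt_tab' instead):
import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize
# Split a raw sentence into word and punctuation tokens
print(word_tokenize("NLTK makes text processing straightforward."))
# ['NLTK', 'makes', 'text', 'processing', 'straightforward', '.']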
In text classification, feature extraction is the process of transforming text data into a format that machine learning algorithms can understand. Common features include word frequencies, the presence of specific words, and n-grams.
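To make this concrete, here is a minimal sketch of presence-of-word features over a small vocabulary; the vocabulary and the helper name word_presence_features are hypothetical, chosen only for illustration:
# Hypothetical vocabulary, for illustration only
vocabulary = ["good", "bad", "boring"]
def word_presence_features(tokens):
    # One boolean feature per vocabulary word
    return {f"contains({w})": (w in tokens) for w in vocabulary}
print(word_presence_features(["a", "good", "movie"]))
# {'contains(good)': True, 'contains(bad)': False, 'contains(boring)': False}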
NLTK provides several machine learning classifiers, including Naive Bayes, Decision Tree, and Maximum Entropy classifiers, all of which can be used for text classification tasks.
We will use the movie reviews dataset from NLTK for this tutorial. It contains 2,000 movie reviews, 1,000 labeled positive and 1,000 labeled negative.
import nltk
from nltk.corpus import movie_reviews
# Download the movie reviews dataset if not already downloaded
nltk.download('movie_reviews')
# Get all fileids and their corresponding categories
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
import random
# Shuffle the documents: the corpus lists all negative reviews before all
# positive ones, so an unshuffled train/test split would be badly skewed
random.shuffle(documents)
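A quick sanity check on the loaded data (the label and tokens you see will vary, since the shuffle is random):
print(len(documents))       # 2000 reviews in total
print(documents[0][1])      # label of the first shuffled review, e.g. 'neg'
print(documents[0][0][:8])  # its first eight tokens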
Text preprocessing involves cleaning and normalizing the text data. We will lowercase all words and remove stopwords and non-alphabetic tokens such as punctuation.
from nltk.corpus import stopwords
nltk.download('stopwords')
# Get the English stopword list
stop_words = set(stopwords.words('english'))
def preprocess_text(words):
    # Lowercase each token, keep only alphabetic words (which also drops
    # punctuation), and remove stopwords
    return [word.lower() for word in words
            if word.isalpha() and word.lower() not in stop_words]
# Preprocess the documents
documents = [(preprocess_text(words), category) for words, category in documents]
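For example, applied to a small hand-made token list:
sample = ["The", "movie", "was", ",", "surprisingly", ",", "NOT", "bad", "!"]
print(preprocess_text(sample))
# ['movie', 'surprisingly', 'bad']  ('not' is on NLTK's English stopword list)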
We will use the 2,000 most common words in the corpus as binary presence features.
# Get all words in the dataset
all_words = []
for words, _ in documents:
    all_words.extend(words)
# Calculate the frequency distribution of words
from nltk.probability import FreqDist
all_words = FreqDist(all_words)
# Select the 2000 most frequent words as the feature vocabulary
# (FreqDist.keys() is not ordered by frequency, so use most_common)
word_features = [word for word, _ in all_words.most_common(2000)]
def document_features(document):
    # One binary feature per vocabulary word: is it present in the document?
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features
# Extract features from the documents
feature_sets = [(document_features(words), category) for words, category in documents]
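Each entry in feature_sets pairs a feature dictionary with its label; a quick look at one entry (exact values depend on the shuffle):
features, label = feature_sets[0]
print(label)                       # e.g. 'pos'
print(list(features.items())[:3])  # e.g. [('film', True), ('one', False), ('movie', True)]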
We will use the Naive Bayes classifier from NLTK to train our model.
# Split into a training set (1900 reviews) and a test set (100 reviews)
train_set, test_set = feature_sets[:1900], feature_sets[1900:]
# Train the Naive Bayes classifier
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
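Once trained, the classifier can label any feature dictionary, for example a single held-out review:
features, true_label = test_set[0]
predicted = classifier.classify(features)
print("Predicted:", predicted, "- actual:", true_label)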
We will evaluate the performance of our classifier using accuracy, precision, recall, and F1-score.
# Calculate accuracy
from nltk.classify.util import accuracy
print("Accuracy:", accuracy(classifier, test_set))
# Get the most informative features
classifier.show_most_informative_features(10)
# Calculate precision, recall, and F1-score (requires scikit-learn)
from sklearn.metrics import classification_report
# Get true labels and predicted labels
true_labels = [label for _, label in test_set]
predicted_labels = [classifier.classify(features) for features, _ in test_set]
# Print classification report
print(classification_report(true_labels, predicted_labels))
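Finally, the trained model can score raw, unseen text by reusing preprocess_text and document_features from above. This sketch assumes the 'punkt' tokenizer models are available; the helper classify_review and the sample sentence are illustrative:
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def classify_review(text):
    # Tokenize, clean, and featurize the raw text, then classify it
    tokens = preprocess_text(word_tokenize(text))
    return classifier.classify(document_features(tokens))
print(classify_review("A gripping, beautifully acted film with a clever plot."))
# e.g. 'pos'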
In this tutorial, we learned how to perform text classification using NLTK. We covered the core concepts and walked through the full process step by step, from data preparation to model evaluation. The same workflow can be adapted to many real-world text classification tasks.