Text summarization is the process of reducing a text document to its most important information while preserving its overall meaning. There are two main types of text summarization: extractive summarization, which selects the most important sentences directly from the original text, and abstractive summarization, which generates new sentences that convey the key ideas. This guide focuses on extractive summarization.
NLTK is a Python library that provides a wide range of tools and resources for natural language processing tasks. It includes algorithms for tokenization, stemming, tagging, parsing, and more. For text summarization, we will mainly use NLTK’s tokenization and frequency analysis capabilities.
Tokenization is the process of splitting text into individual words or sentences. In the context of text summarization, sentence tokenization is used to split the text into sentences, and word tokenization is used to split sentences into words.
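As a quick illustration, here is a minimal sketch of both forms of tokenization; the sample sentence is made up purely for demonstration:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models (first run only)

sample = "NLTK makes tokenization easy. It works at the sentence and word level."
print(sent_tokenize(sample))
# ['NLTK makes tokenization easy.', 'It works at the sentence and word level.']
print(word_tokenize(sample))
# ['NLTK', 'makes', 'tokenization', 'easy', '.', 'It', 'works', ...]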
Frequency analysis involves counting the occurrence of each word in the text. Words that occur more frequently are considered more important and are likely to be included in the summary.
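For example, a rough sketch of frequency counting with stopwords and punctuation filtered out might look like this (the sample text and resulting counts are purely illustrative):

import string
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # first run only
nltk.download('stopwords')  # first run only

sample = "The cat sat on the mat. The cat slept."
stop_words = set(stopwords.words('english'))
words = [w for w in word_tokenize(sample.lower())
         if w not in stop_words and w not in string.punctuation]
print(Counter(words))
# Counter({'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1})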
Here is a step-by-step guide to building a simple extractive text summarizer using NLTK:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
import string
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
def summarize_text(text, num_sentences=3):
    # Step 1: Sentence Tokenization
    sentences = sent_tokenize(text)

    # Step 2: Word Tokenization and Frequency Analysis
    stop_words = set(stopwords.words('english'))
    word_frequencies = defaultdict(int)
    for sentence in sentences:
        # Word tokenization
        words = word_tokenize(sentence.lower())
        for word in words:
            if word not in stop_words and word not in string.punctuation:
                word_frequencies[word] += 1

    # Step 3: Calculate Sentence Scores
    sentence_scores = defaultdict(int)
    for sentence in sentences:
        for word in word_tokenize(sentence.lower()):
            if word in word_frequencies:
                sentence_scores[sentence] += word_frequencies[word]

    # Step 4: Select Top Sentences
    sorted_sentences = sorted(sentence_scores.items(), key=lambda item: item[1], reverse=True)
    top_sentences = [sentence for sentence, score in sorted_sentences[:num_sentences]]

    # Step 5: Generate Summary
    summary = ' '.join(top_sentences)
    return summary
# Example usage
text = """
Natural language processing (NLP) is a subfield of artificial intelligence (AI)
that focuses on the interaction between computers and human language.
It involves teaching computers to understand, interpret, and generate human language.
NLP has many applications, including machine translation, speech recognition,
and text summarization.
"""
summary = summarize_text(text)
print(summary)
In this implementation, sent_tokenize splits the input text into sentences, a frequency table is built from the words that remain after stopwords and punctuation are removed, each sentence is scored by summing the frequencies of its words, and the top num_sentences sentences are selected based on their scores.
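Note that summarize_text joins the selected sentences in score order, so they can appear out of their original sequence. As an optional refinement, a small sketch (using a hypothetical helper and reusing the sentences list and sentence_scores dictionary from Steps 1 and 3) could re-emit the selection in document order:

def pick_in_document_order(sentences, sentence_scores, num_sentences=3):
    # Take the highest-scoring sentences, then emit them in the order
    # they appear in the original text rather than in score order.
    top = set(sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences])
    return ' '.join(s for s in sentences if s in top)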
Building a text summarizer with NLTK is a straightforward process that can be used to quickly generate extractive summaries. By understanding the core concepts, typical usage scenarios, common pitfalls, and best practices, you can build a more effective text summarizer. While extractive summarization has its limitations, it is a useful technique for many real-world applications.