Keywords are the most relevant and representative words or phrases in a text. They help in summarizing the content, indexing documents, and facilitating information retrieval. For example, in a news article about climate change, keywords could be “climate change”, “global warming”, “carbon emissions”, etc.
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how important a word is to a document in a collection or corpus. The term frequency (TF) measures how often a word appears in a document, while the inverse document frequency (IDF) measures how common or rare the word is across the entire corpus. The product of TF and IDF gives the TF-IDF score, which is higher for words that are frequent in a particular document but rare in the rest of the corpus.
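To make the formula concrete, here is a minimal sketch that computes TF-IDF by hand on a toy three-document corpus (the documents are invented for illustration). It uses raw relative frequency for TF and the textbook logarithmic IDF; scikit-learn's TfidfVectorizer, used later in this post, applies a smoothed variant, so its exact scores will differ.
import math

# A toy corpus of three short "documents" (hypothetical example)
docs = [
    "climate change drives global warming",
    "carbon emissions accelerate climate change",
    "the economy grew last quarter",
]

def tf_idf(term, doc, docs):
    # Assumes the term occurs in at least one document (so df > 0)
    words = doc.split()
    tf = words.count(term) / len(words)              # term frequency
    df = sum(1 for d in docs if term in d.split())   # document frequency
    idf = math.log(len(docs) / df)                   # inverse document frequency
    return tf * idf

print(tf_idf("climate", docs[0], docs))  # in 2 of 3 docs -> lower IDF, ~0.08
print(tf_idf("warming", docs[0], docs))  # in 1 of 3 docs -> higher IDF, ~0.22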
POS tagging is the process of assigning a part of speech (noun, verb, adjective, and so on) to each word in a sentence. It is useful for keyword extraction because some parts of speech, such as nouns and adjectives, are more likely to be keywords than others, such as prepositions and conjunctions.
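As a quick illustration, the sketch below runs NLTK's off-the-shelf tagger on a single sentence; it assumes NLTK and the tagger data from the setup section below are already installed. The tags follow the Penn Treebank tagset, where NN* marks nouns, JJ* adjectives, IN prepositions, and so on.
from nltk import pos_tag, word_tokenize

tokens = word_tokenize("Rising carbon emissions accelerate global warming")
print(pos_tag(tokens))  # each token is paired with a Penn Treebank tag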
To follow along with the code examples in this blog post, you need to have Python installed on your system. You can install NLTK using pip:
pip install nltk
After installing NLTK, you also need to download some NLTK data: the stopwords list, the punkt tokenizer, and the averaged perceptron POS tagger. You can do this in Python:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
Now let's put everything together into a complete script that extracts keywords using both POS tagging and TF-IDF:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import string
# Download NLTK data if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens
# Example text
text = "Natural language processing (NLP) is a subfield of artificial intelligence. It focuses on the interaction between computers and human language."
preprocessed_tokens = preprocess_text(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(preprocessed_tokens)
# Filter keywords based on POS tags (nouns and adjectives)
keywords = [word for word, pos in pos_tags if pos.startswith('NN') or pos.startswith('JJ')]
print("Keywords based on POS tagging:", keywords)
# Wrap the text in a list for TF-IDF vectorization
corpus = [text]
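# Note: with a corpus of just one document, the IDF term cannot discriminate
# between words; in practice you would fit the vectorizer on many documents.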
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
# Extract the TF-IDF scores for the (single) document
scores = tfidf_matrix.toarray()[0]
# Pair each feature index with its score, keeping only non-zero entries
word_scores = [(index, score) for index, score in enumerate(scores) if score > 0]
# Sort by score in descending order
sorted_word_scores = sorted(word_scores, key=lambda pair: pair[1], reverse=True)
# Collect the five highest-scoring words
top_keywords = [feature_names[index] for index, _ in sorted_word_scores[:5]]
print("Top keywords based on TF-IDF:", top_keywords)
Extracting keywords from text using NLTK is a powerful technique that can be applied in many real-world scenarios. By understanding the core concepts and typical usage scenarios, and by following best practices, you can effectively extract meaningful keywords from text data. NLTK provides a rich set of tools, such as tokenization and POS tagging, which can be combined with TF-IDF (here computed via scikit-learn) to achieve better results. However, it is important to be aware of common pitfalls, such as fitting TF-IDF on a single-document corpus as in the example above, where the IDF component has nothing to compare against, and to take steps to avoid them.
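As one illustration of combining the two signals, the short sketch below keeps only the TF-IDF keywords that also passed the POS filter. It reuses the keywords and top_keywords variables from the script above; taking the intersection is just one simple combination strategy, not the only option.
# Keep only the high-TF-IDF words that the POS filter also marked
# as nouns or adjectives
combined_keywords = [word for word in top_keywords if word in keywords]
print("Keywords agreed on by both methods:", combined_keywords)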