Classification is a supervised machine-learning task where the goal is to assign input data to one of several predefined classes. In the context of a spam classifier, the two classes are “ham” (legitimate messages) and “spam”.
Raw text data cannot be directly used by most machine-learning algorithms. Feature extraction is the process of converting text into a numerical format that the algorithms can work with. Common techniques for text feature extraction include bag-of-words and TF-IDF (Term Frequency-Inverse Document Frequency).
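To make the difference concrete, here is a minimal sketch (using scikit-learn’s built-in vectorizers on a made-up three-message corpus, not the real dataset): both techniques produce one numeric row per message, but TF-IDF down-weights terms that appear in many messages.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
toy_corpus = ["win a free prize now", "free entry in a contest", "are we still meeting for lunch"]
# Bag-of-words: raw token counts per message
bow = CountVectorizer().fit_transform(toy_corpus)
# TF-IDF: counts re-weighted so terms shared across many messages count less
tfidf = TfidfVectorizer().fit_transform(toy_corpus)
print(bow.toarray())
print(tfidf.toarray())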
The data is typically split into two subsets: a training set and a testing set. The model is trained on the training set, learning the patterns that distinguish between ham and spam messages. The testing set is used to evaluate the performance of the trained model.
We’ll use the well-known SMS Spam Collection dataset. First, let’s load the data and preprocess it.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
# Select relevant columns
data = data[['v1', 'v2']]
data.columns = ['label', 'message']
# Encode the labels
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['label'])
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.2, random_state=42)
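Before vectorizing, it can be worth checking how imbalanced the classes are, since the SMS Spam Collection is dominated by ham messages and that affects how you should read the accuracy figure later. A quick check on the DataFrame loaded above:
# Show the share of each class (0 = ham, 1 = spam after encoding)
print(data['label'].value_counts(normalize=True))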
We’ll use the TF-IDF vectorizer to convert text messages into numerical features.
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)
# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)
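To sanity-check the result, you can inspect the shape of the sparse matrix and the size of the learned vocabulary (get_feature_names_out is available in recent scikit-learn versions):
# Rows are messages, columns are terms from the training vocabulary
print(X_train_tfidf.shape)
print(len(vectorizer.get_feature_names_out()))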
We’ll use the Multinomial Naive Bayes classifier, which is a popular choice for text classification tasks.
from sklearn.naive_bayes import MultinomialNB
# Initialize the classifier
classifier = MultinomialNB()
# Train the classifier
classifier.fit(X_train_tfidf, y_train)
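Once the model is trained, any new text must be transformed with the already-fitted vectorizer before prediction. A small sketch with a made-up message:
# Vectorize a new message with the fitted vectorizer, then predict and decode the label
sample = ["Congratulations! You have won a free ticket, call now"]
sample_tfidf = vectorizer.transform(sample)
print(label_encoder.inverse_transform(classifier.predict(sample_tfidf)))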
We’ll use several metrics to evaluate the performance of the model. Since the labels were encoded with ham as 0 and spam as 1, the precision, recall, and F1 scores below are computed for the spam class.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Make predictions on the testing data
y_pred = classifier.predict(X_test_tfidf)
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 - score: {f1}")
Building a spam classifier using scikit-learn is a relatively straightforward process. By understanding the core concepts, following the step-by-step process, and avoiding common pitfalls, you can build an effective spam classifier. This can be applied in various real-world scenarios to filter out unwanted messages and improve the user experience.