As mentioned earlier, imbalanced datasets occur when the distribution of classes is uneven. This can cause problems because most machine learning algorithms are designed to maximize overall accuracy. In an imbalanced dataset, simply predicting the majority class all the time can result in a high accuracy score, but it fails to capture the minority class, which is often the class of interest.
When dealing with imbalanced datasets, traditional accuracy is not a good metric. Instead, we should use metrics such as precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC). Precision measures the proportion of true positive predictions among all positive predictions, recall measures the proportion of true positive predictions among all actual positive samples, and the F1-score is the harmonic mean of precision and recall. ROC-AUC summarizes how well the model ranks positive samples above negative ones across all classification thresholds.
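As a quick illustration, all of these metrics are available in scikit-learn. The labels, predictions, and probabilities below are made-up values for a toy imbalanced problem:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
# Hypothetical labels and predictions; class 1 is the rare class of interest
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
# Hypothetical predicted probabilities for class 1, needed for ROC-AUC
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 1/2
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 1/2
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))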
Sampling techniques aim to balance the class distribution by either increasing the number of samples in the minority class (oversampling) or decreasing the number of samples in the majority class (undersampling).
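Both strategies are available in the imblearn library. As a minimal sketch, here is random oversampling, which simply duplicates minority class samples until the classes are balanced (random undersampling is shown later in this section); the small dataset is generated purely for illustration:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
# Small synthetic imbalanced dataset (class 1 is the minority)
X_demo, y_demo = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(Counter(y_demo))  # roughly 180 majority vs. 20 minority samples
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X_demo, y_demo)
print(Counter(y_res))   # both classes now have the same number of samples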
Cost-sensitive learning assigns different misclassification costs to different classes. The model is penalized more heavily for misclassifying the minority class, which encourages it to focus on classifying that class correctly.
Ensemble methods combine multiple models to improve performance. In the context of imbalanced datasets, ensemble methods such as balanced bagging resample the data seen by each base estimator, handling the imbalance without discarding majority class information up front.
In fraud detection, the number of legitimate transactions is much larger than the number of fraudulent transactions. A model trained on this imbalanced dataset might predict all transactions as legitimate to achieve high accuracy, but it fails to detect fraud. Handling the imbalance is crucial to identify fraudulent transactions accurately.
Similarly, in medical diagnosis, the number of healthy patients is often much larger than the number of patients with a rare disease. A good model should detect the rare disease accurately, which requires handling the imbalanced dataset.
In anomaly detection, normal events are much more common than abnormal events. An imbalanced dataset can lead to a model that fails to detect anomalies, and handling the imbalance can improve the detection performance.
One of the most popular oversampling techniques is the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE creates synthetic samples for the minority class by interpolating between existing minority class samples and their nearest minority class neighbors.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Generate a synthetic imbalanced dataset; weights=[0.1, 0.9] makes class 0 the minority (10%)
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
# Split the dataset into training and testing sets
# stratify=y keeps the class ratio consistent between the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Train a logistic regression model on the resampled data
model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
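To see what SMOTE actually did, it is worth comparing the class distribution before and after resampling. A quick check, using the variables from the snippet above:
from collections import Counter
print(Counter(y_train))            # original imbalanced distribution
print(Counter(y_train_resampled))  # both classes now equally represented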
Random undersampling is a simple undersampling technique that randomly removes samples from the majority class to balance the class distribution.
from imblearn.under_sampling import RandomUnderSampler
# Apply random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)
# Train a logistic regression model on the resampled data
model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
In scikit-learn, many classifiers support cost-sensitive learning through the class_weight parameter.
# Train a logistic regression model with cost-sensitive learning
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
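Under the hood, class_weight='balanced' weights each class inversely proportional to its frequency, i.e., n_samples / (n_classes * n_samples_in_class). We can also pass explicit costs as a dictionary; the 10:1 ratio below is an arbitrary value chosen for illustration:
# Penalize errors on the minority class (class 0 in this dataset) ten times more
model = LogisticRegression(class_weight={0: 10, 1: 1})
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))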
One popular ensemble method for imbalanced datasets is the BalancedBaggingClassifier from the imblearn library.
from imblearn.ensemble import BalancedBaggingClassifier
# Create a BalancedBaggingClassifier
# Note: recent imblearn versions use `estimator`; older releases named it `base_estimator`
bbc = BalancedBaggingClassifier(estimator=LogisticRegression(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
# Fit the model
bbc.fit(X_train, y_train)
# Make predictions on the test set
y_pred = bbc.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
Oversampling techniques, especially simple oversampling methods like random oversampling, can lead to overfitting. Since the same samples are replicated, the model might learn the noise in these samples and perform poorly on new data.
Undersampling can result in information loss because it removes samples from the majority class. This can lead to a decrease in the model’s ability to learn the patterns in the majority class.
Using traditional accuracy as the evaluation metric can be misleading in imbalanced datasets. It is important to use appropriate metrics such as precision, recall, F1-score, and ROC-AUC.
Always report metrics beyond accuracy, such as precision, recall, F1-score, and ROC-AUC, when evaluating models on imbalanced datasets.
Combining oversampling and undersampling techniques can sometimes yield better results than using a single technique. For example, we can first undersample the majority class to a certain extent and then oversample the minority class.
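A minimal sketch of this combination using the Pipeline from imblearn, reusing the imports and train/test split from the SMOTE example above (the 0.5 ratio is an arbitrary choice for illustration):
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
# First shrink the majority class until the minority/majority ratio is 0.5,
# then let SMOTE synthesize minority samples up to parity
combined = Pipeline(steps=[
    ('undersample', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ('oversample', SMOTE(random_state=42)),
    ('model', LogisticRegression()),
])
combined.fit(X_train, y_train)
print(classification_report(y_test, combined.predict(X_test)))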
There is no one-size-fits-all solution for handling imbalanced datasets. It is important to experiment with different approaches, such as sampling techniques, cost-sensitive learning, and ensemble methods, to find the best fit for a particular dataset.
Use cross-validation to ensure the stability of the model’s performance. Cross-validation helps to estimate the model’s performance on unseen data more accurately.
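One caveat: resampling must be applied inside each cross-validation fold rather than before splitting, otherwise synthetic samples leak information into the validation folds. The imblearn Pipeline handles this automatically; a sketch, again reusing the earlier imports:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from imblearn.pipeline import Pipeline
cv_model = Pipeline(steps=[
    ('smote', SMOTE(random_state=42)),
    ('model', LogisticRegression()),
])
# Stratified folds preserve the class ratio in every split;
# macro-averaged F1 treats both classes equally
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_model, X, y, scoring='f1_macro', cv=cv)
print(scores.mean(), scores.std())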
Handling imbalanced datasets is a crucial task in machine learning, and scikit-learn, together with the companion imbalanced-learn (imblearn) library, provides a variety of techniques to address this issue. By understanding the core concepts, typical usage scenarios, common pitfalls, and best practices, we can effectively handle imbalanced datasets and build models that perform well on both the majority and minority classes. It is important to choose the appropriate technique based on the characteristics of the dataset and to use appropriate evaluation metrics to measure the model’s performance.