How to Handle Imbalanced Datasets in scikit-learn

In machine learning, imbalanced datasets are a common challenge. An imbalanced dataset is one where the distribution of classes is significantly skewed, with one class having far more samples than the others. For instance, in a medical diagnosis dataset, the number of healthy patients might far exceed the number of patients with a rare disease. This imbalance can lead to suboptimal performance of machine learning models, as they tend to be biased towards the majority class. Scikit-learn, a popular machine learning library in Python, together with its companion library imbalanced-learn (imported as imblearn), provides several techniques to handle imbalanced datasets. In this blog post, we will explore these techniques and cover their core concepts, typical usage scenarios, common pitfalls, and best practices.

Table of Contents

  1. Core Concepts
  2. Typical Usage Scenarios
  3. Common Techniques in scikit-learn
    • Resampling Methods
      • Oversampling
      • Undersampling
    • Cost-Sensitive Learning
    • Ensemble Methods
  4. Common Pitfalls
  5. Best Practices
  6. Conclusion
  7. References

Core Concepts

Imbalanced Datasets

As mentioned earlier, a dataset is imbalanced when its classes are unevenly distributed. This causes problems because most learning algorithms, by default, optimize an objective that weights every sample equally, so the majority class dominates training. On a dataset with 99% majority samples, a model that always predicts the majority class scores 99% accuracy, yet it never identifies the minority class, which is often the class of interest.
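The accuracy trap described above can be reproduced in a few lines. The following is a minimal sketch on hypothetical toy labels (95 majority samples, 5 minority samples), using scikit-learn's DummyClassifier as a stand-in for a model that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 95 majority samples (0) and 5 minority samples (1).
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))  # the features are irrelevant to this baseline

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred, pos_label=1)

print(f"accuracy: {accuracy:.2f}")                # 0.95
print(f"minority recall: {minority_recall:.2f}")  # 0.00
```

The baseline looks strong on accuracy while finding none of the minority samples, which is why accuracy alone should not be used to judge a model on skewed data.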

Evaluation Metrics

When dealing with imbalanced datasets, traditional accuracy is not a good metric. Instead, we should use metrics such as precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC). Precision measures the proportion of true positive predictions among all positive predictions, recall measures the proportion of true positive predictions among all actual positive samples, and the F1-score is the harmonic mean of precision and recall.
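To make these definitions concrete, here is a small worked example on hypothetical labels. It derives precision, recall, and F1 from the confusion-matrix counts and checks them against scikit-learn's own metric functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = positive / minority class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 2 / 3: share of predicted positives that are correct
recall = tp / (tp + fn)     # 2 / 4: share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, 4 / 7

print(f"precision: {precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"recall:    {recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
print(f"f1:        {f1:.3f} (sklearn: {f1_score(y_true, y_pred):.3f})")
```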

Sampling Techniques

Sampling techniques aim to balance the class distribution by either increasing the number of samples in the minority class (oversampling) or decreasing the number of samples in the majority class (undersampling).

Cost-Sensitive Learning

Cost-sensitive learning assigns different misclassification costs to different classes. The model is penalized more heavily for misclassifying minority samples, which encourages it to classify the minority class correctly.
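As an illustration of how such costs can be chosen, scikit-learn's "balanced" heuristic sets each class weight to n_samples / (n_classes * class_count). The sketch below uses hypothetical labels with a 900/100 split and compares the library's result with the formula computed by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 900 majority samples (0) and 100 minority samples (1).
y = np.array([0] * 900 + [1] * 100)
classes = np.unique(y)

# Weights computed by scikit-learn's "balanced" heuristic.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same heuristic written out: n_samples / (n_classes * count per class).
manual = len(y) / (len(classes) * np.bincount(y))

print({int(c): round(float(w), 3) for c, w in zip(classes, weights)})
# {0: 0.556, 1: 5.0}
```

With these weights, an error on a minority sample costs nine times as much as an error on a majority sample, which matches the 9:1 class ratio.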

Ensemble Methods

Ensemble methods combine multiple models to improve performance. For imbalanced datasets, some ensembles address the skew directly, for example by training each base model on a rebalanced bootstrap sample of the training data.

Typical Usage Scenarios

Fraud Detection

In fraud detection, the number of legitimate transactions is much larger than the number of fraudulent transactions. A model trained on this imbalanced dataset might predict all transactions as legitimate to achieve high accuracy, but it fails to detect fraud. Handling the imbalance is crucial to identify fraudulent transactions accurately.

Medical Diagnosis

As mentioned earlier, in medical diagnosis, the number of healthy patients is often much larger than the number of patients with a rare disease. A good model should be able to detect the rare disease accurately, which requires handling the imbalanced dataset.

Anomaly Detection

In anomaly detection, normal events are much more common than abnormal events. An imbalanced dataset can lead to a model that fails to detect anomalies, and handling the imbalance can improve the detection performance.

Common Techniques in scikit-learn

Resampling Methods

Oversampling

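One of the most popular oversampling techniques is the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE creates synthetic samples for the minority class by interpolating between existing minority class samples and their nearest minority-class neighbors. It is implemented in imbalanced-learn (imported as imblearn) rather than in scikit-learn itself.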

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Generate a synthetic imbalanced dataset (with these weights, class 0 is the minority at about 10%)
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)

# Split the dataset into training and testing sets; stratify=y keeps the class ratio the same in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Apply SMOTE to the training set only, so the test set keeps its original distribution
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a logistic regression model on the resampled data
model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))
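To show what "interpolating between existing minority class samples" means, here is a simplified sketch of the idea in NumPy. The helper smote_like_samples is hypothetical and written only for illustration; it is not imbalanced-learn's implementation and it assumes the minority points contain no exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=5, random_state=0):
    """Illustrative SMOTE-style interpolation (hypothetical helper, not imblearn's code)."""
    rng = np.random.default_rng(random_state)

    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick a random base point and one of its k true neighbours (columns 1..k).
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]

    # Place the new point at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Hypothetical minority-class points in two dimensions.
X_min = np.random.default_rng(42).normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact copies.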

Undersampling

Random undersampling is a simple undersampling technique that randomly removes samples from the majority class to balance the class distribution.

from imblearn.under_sampling import RandomUnderSampler

# Apply random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

# Train a logistic regression model on the resampled data
model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

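Cost-Sensitive Learning

In scikit-learn, many classifiers support cost-sensitive learning through the class_weight parameter. Setting class_weight='balanced' weights each class inversely proportional to its frequency in the training data.

# Train a logistic regression model with cost-sensitive learning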
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

Ensemble Methods

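One popular ensemble method for imbalanced datasets is the BalancedBaggingClassifier from imbalanced-learn (imported as imblearn). It is a bagging classifier that rebalances each bootstrap sample, by default through random undersampling, before fitting a base estimator on it.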

from imblearn.ensemble import BalancedBaggingClassifier

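# Create a BalancedBaggingClassifier
# Note: recent imbalanced-learn releases name this argument `estimator`;
# older releases used `base_estimator`.
bbc = BalancedBaggingClassifier(estimator=LogisticRegression(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)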

# Fit the model
bbc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bbc.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

Common Pitfalls

Overfitting with Oversampling

Oversampling can lead to overfitting, especially with simple methods like random oversampling. Because random oversampling replicates existing minority samples exactly, the model may memorize those points, including their noise, and perform poorly on new data.

Information Loss with Undersampling

Undersampling can result in information loss because it removes samples from the majority class. This can lead to a decrease in the model’s ability to learn the patterns in the majority class.

Incorrect Evaluation Metrics

Using traditional accuracy as the evaluation metric can be misleading in imbalanced datasets. It is important to use appropriate metrics such as precision, recall, F1-score, and ROC-AUC.

Best Practices

Use Appropriate Evaluation Metrics

Always use evaluation metrics other than accuracy, such as precision, recall, F1-score, and ROC-AUC, to evaluate the performance of models on imbalanced datasets.
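One practical detail when computing ROC-AUC: the metric ranks samples by a continuous score, so it should be given predicted probabilities (or decision scores) rather than hard class labels. The sketch below uses a synthetic dataset whose parameters are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (illustrative parameters).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
scores = model.predict_proba(X_test)[:, 1]
auc_from_scores = roc_auc_score(y_test, scores)

# Passing hard labels collapses the ROC curve to a single operating point.
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))

print(f"ROC-AUC from probabilities: {auc_from_scores:.3f}")
print(f"ROC-AUC from hard labels:   {auc_from_labels:.3f}")
```

The two numbers generally differ; the one computed from probabilities is the figure usually meant by ROC-AUC.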

Combine Sampling Techniques

Combining oversampling and undersampling techniques can sometimes yield better results than using a single technique. For example, we can first undersample the majority class to a certain extent and then oversample the minority class.

Experiment with Different Techniques

There is no one-size-fits-all solution for handling imbalanced datasets. It is important to experiment with different techniques such as sampling techniques, cost-sensitive learning, and ensemble methods to find the best approach for a particular dataset.

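Cross-Validation

Use cross-validation, ideally with stratified folds so that each fold preserves the class ratio, to check that the model’s performance is stable and to estimate its performance on unseen data more accurately. Apply any resampling inside each training fold only; resampling before splitting lets copies or interpolations of validation samples leak into training and inflates the scores.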

Conclusion

Handling imbalanced datasets is a crucial task in machine learning, and scikit-learn, together with imbalanced-learn, provides a variety of techniques to address this issue. By understanding the core concepts, typical usage scenarios, common pitfalls, and best practices, we can effectively handle imbalanced datasets and build models that perform well on both the majority and minority classes. It is important to choose the appropriate technique based on the characteristics of the dataset and to use appropriate evaluation metrics to measure the model’s performance.

References

  1. scikit-learn Documentation: https://scikit-learn.org/stable/
  2. imbalanced-learn Documentation: https://imbalanced-learn.org/stable/
  3. “Learning from Imbalanced Data” by He and Garcia, IEEE Transactions on Knowledge and Data Engineering, 2009.