The cornerstone of scikit-learn’s API is the concept of estimators. An estimator is any object that can learn from data. This includes classifiers, regressors, transformers, and clusterers. All estimators in scikit-learn follow a common interface with two main methods: fit() and predict() (or transform() for transformers).
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate some sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Create an estimator (Linear Regression in this case)
estimator = LinearRegression()
# Fit the estimator to the data
estimator.fit(X, y)
# Make predictions
new_X = np.array([[6]])
prediction = estimator.predict(new_X)
print(f"Prediction for X = 6: {prediction[0]}")
In this code, LinearRegression is an estimator. The fit() method is used to train the model on the input data X and target data y. The predict() method is then used to make predictions on new data.
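As a small aside that is not part of the original example, a fitted estimator exposes what it learned as attributes whose names end in an underscore; for LinearRegression these are coef_ and intercept_. A minimal sketch:
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
estimator = LinearRegression()
estimator.fit(X, y)
# Attributes ending in "_" are set by fit() and hold the learned parameters
print(f"Learned coefficient: {estimator.coef_[0]}")  # close to 2.0 for this data
print(f"Learned intercept: {estimator.intercept_}")  # close to 0.0 for this data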
Transformers are a special type of estimator used for data preprocessing. They transform the input data in some way, such as scaling, encoding categorical variables, or imputing missing values. Transformers have a fit_transform() method, which combines the fit() and transform() operations.
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate some sample data
X = np.array([[1], [2], [3], [4], [5]])
# Create a transformer (Standard Scaler in this case)
transformer = StandardScaler()
# Fit and transform the data
X_transformed = transformer.fit_transform(X)
print(f"Transformed data: {X_transformed}")
Here, StandardScaler is a transformer. The fit_transform() method first fits the scaler to the data to learn the mean and standard deviation, and then transforms the data accordingly.
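For reference, calling fit_transform() is equivalent to calling fit() followed by transform(); the separate form is useful when the same learned scaling must later be applied to new data. A minimal sketch:
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
scaler = StandardScaler()
scaler.fit(X)  # learn the mean and standard deviation
print(f"Learned mean: {scaler.mean_}, scale: {scaler.scale_}")
X_scaled = scaler.transform(X)  # apply the learned scaling
new_X_scaled = scaler.transform(np.array([[6]]))  # reuse it on unseen data
print(f"Scaled new value: {new_X_scaled}")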
Pipelines are used to chain multiple estimators together. They are especially useful when you need to perform a sequence of data preprocessing steps followed by a machine learning model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate some sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
# Fit the pipeline to the data
pipeline.fit(X, y)
# Make predictions
new_X = np.array([[6]])
prediction = pipeline.predict(new_X)
print(f"Prediction using pipeline for X = 6: {prediction[0]}")
In this code, the pipeline first applies the StandardScaler to the data and then fits a LinearRegression model.
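The fitted steps of a pipeline remain accessible through its named_steps attribute, which can be handy for inspecting intermediate results. A short sketch, restating the same setup:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
pipeline.fit(X, y)
# Each fitted step can be looked up by the name given in the pipeline definition
print(f"Scaler mean: {pipeline.named_steps['scaler'].mean_}")
print(f"Regression coefficient: {pipeline.named_steps['regressor'].coef_}")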
Scikit-learn’s API is widely used for classification tasks, such as spam detection, image classification, and sentiment analysis.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a classifier
classifier = DecisionTreeClassifier()
# Fit the classifier to the training data
classifier.fit(X_train, y_train)
# Make predictions on the test data
y_pred = classifier.predict(X_test)
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the classifier: {accuracy}")
This code demonstrates how to use scikit - learn for a simple classification task using the Iris dataset.
Regression tasks, such as predicting house prices or stock prices, can also be easily implemented using scikit-learn.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Load the California housing dataset
# Note: the Boston housing dataset was deprecated in scikit-learn 1.0 and removed in 1.2,
# so we use the California housing dataset instead.
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a regressor
regressor = Ridge()
# Fit the regressor to the training data
regressor.fit(X_train, y_train)
# Make predictions on the test data
y_pred = regressor.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error of the regressor: {mse}")
This code shows how to perform a regression task using the California housing dataset.
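Besides the mean squared error, every scikit-learn regressor also provides a score() method, which returns the coefficient of determination (R²) on the given data. Restating the setup for completeness, a short sketch:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)
regressor = Ridge()
regressor.fit(X_train, y_train)
# score() returns the R^2 of the predictions on the given data
print(f"R^2 on the test set: {regressor.score(X_test, y_test)}")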
Overfitting occurs when a model performs well on the training data but poorly on new, unseen data. This can happen if the model is too complex for the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a very deep decision tree which may overfit
classifier = DecisionTreeClassifier(max_depth=None)
classifier.fit(X_train, y_train)
# Evaluate on training data
y_train_pred = classifier.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
# Evaluate on test data
y_test_pred = classifier.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Training accuracy: {train_accuracy}")
print(f"Test accuracy: {test_accuracy}")
In this code, the decision tree with no maximum depth may overfit the training data, resulting in a high training accuracy but a lower test accuracy.
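One common mitigation, sketched below, is to constrain the tree's complexity, for example by setting max_depth, and then compare the training and test accuracy again:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
# Limit the depth of the tree to reduce overfitting
classifier = DecisionTreeClassifier(max_depth=3, random_state=42)
classifier.fit(X_train, y_train)
print(f"Training accuracy: {accuracy_score(y_train, classifier.predict(X_train))}")
print(f"Test accuracy: {accuracy_score(y_test, classifier.predict(X_test))}")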
Data leakage occurs when information from the test set is accidentally used during the training process. This can lead to overly optimistic performance estimates.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
# Incorrect way: Scaling the whole data before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with data leakage: {accuracy}")
To avoid data leakage, the scaler should be fit only on the training data and then applied to the test data.
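A minimal leakage-free version of the example above fits the scaler on the training split only and then reuses the learned statistics on the test split (equivalently, both steps could be wrapped in a Pipeline):
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
# Correct way: fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
classifier = DecisionTreeClassifier()
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(f"Accuracy without data leakage: {accuracy_score(y_test, y_pred)}")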
Cross-validation is a technique used to evaluate the performance of a model more robustly. It involves splitting the data into multiple subsets and training and evaluating the model on different combinations of these subsets.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
classifier = DecisionTreeClassifier()
# Perform 5-fold cross-validation
scores = cross_val_score(classifier, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {np.mean(scores)}")
This code demonstrates how to use cross-validation to evaluate a decision tree classifier.
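A natural combination of the two previous ideas, sketched below, is to pass a Pipeline to cross_val_score so that any preprocessing (such as scaling) is re-fit on the training folds of each split, keeping the evaluation free of data leakage:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
import numpy as np
iris = load_iris()
# The scaler is re-fit inside each cross-validation fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])
scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print(f"Mean cross-validation score: {np.mean(scores)}")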
Hyperparameters are parameters that are not learned from the data but are set before training. Tuning these hyperparameters can improve the performance of the model.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data
y = iris.target
# Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4]
}
classifier = DecisionTreeClassifier()
# Perform grid search
grid_search = GridSearchCV(classifier, param_grid, cv=5)
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
This code shows how to use GridSearchCV to perform hyperparameter tuning for a decision tree classifier.
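As a follow-up, GridSearchCV (with its default refit=True) retrains the model on the full data using the best parameters and exposes it as best_estimator_; the search object itself can then be used directly for prediction. A short sketch:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
param_grid = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 3, 4]}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(iris.data, iris.target)
# The best model, refit on the full dataset, is available directly
best_model = grid_search.best_estimator_
print(f"Best model: {best_model}")
# The search object delegates predict() to the best model
print(f"Prediction for the first sample: {grid_search.predict(iris.data[:1])}")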
Scikit-learn’s API design is effective because it provides a standardized and intuitive way to access and use machine learning algorithms. The concepts of estimators, transformers, and pipelines make it easy to build complex machine learning workflows. However, users need to be aware of common pitfalls such as overfitting and data leakage. By following best practices like cross-validation and hyperparameter tuning, users can make the most of scikit-learn and build more accurate and robust machine learning models.