Linear regression is a statistical method that models the relationship between a dependent variable $y$ and one or more independent variables $x$ by fitting a linear equation to the observed data. The simplest form is the univariate case, where the relationship is modeled as $y = mx + c$, with $m$ the slope of the line and $c$ the y-intercept.
Numpy provides the numpy.polyfit() function, which can be used to perform polynomial fits. For a linear fit, we use a polynomial of degree 1. This function computes the coefficients of a polynomial $p(x)$ of degree $n$ that is the best fit (in a least-squares sense) for a given set of data points $(x_i, y_i)$.
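As a quick aside, fitting a higher-degree polynomial only requires changing the degree argument; here is a minimal self-contained sketch with made-up data:
import numpy as np
# Made-up points following a roughly quadratic trend
x_quad = np.array([0, 1, 2, 3, 4])
y_quad = np.array([1.1, 2.0, 5.2, 9.8, 17.1])
# A degree-2 fit returns three coefficients a, b, c for a*x**2 + b*x + c
a, b, c = np.polyfit(x_quad, y_quad, 2)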
First, we need to import the Numpy library.
import numpy as np
Let’s generate some sample data to perform the linear fit on.
# Fix the random seed so the example is reproducible
np.random.seed(0)
# Generate x values
x = np.array([1, 2, 3, 4, 5])
# Generate y values along the line y = 2x + 1, with Gaussian noise added
y = 2 * x + 1 + np.random.randn(5)
We use the np.polyfit() function to perform the linear fit.
# Perform linear fit (degree = 1)
m, c = np.polyfit(x, y, 1)
print(f"Slope (m): {m}")
print(f"Y - intercept (c): {c}")
We can use the obtained coefficients to predict new values.
# New x values for prediction
new_x = np.array([6, 7])
predicted_y = m * new_x + c
print(f"Predicted y values: {predicted_y}")
We can use the Matplotlib library to visualize the original data points and the fitted line.
import matplotlib.pyplot as plt
# Plot the original data points
plt.scatter(x, y, label='Original data')
# Generate points for the fitted line
line_x = np.linspace(min(x), max(x), 100)
line_y = m * line_x + c
# Plot the fitted line
plt.plot(line_x, line_y, 'r-', label='Fitted line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
We can calculate the residuals (the differences between the actual and predicted values) to analyze the goodness of fit.
# Calculate predicted values for original x
predicted_y_original = m * x + c
# Calculate residuals
residuals = y - predicted_y_original
# Calculate the sum of squared residuals
ssr = np.sum(residuals**2)
print(f"Sum of squared residuals: {ssr}")
As a common preprocessing practice, the data can be standardized (zero mean, unit variance) before fitting, for example with scikit-learn's StandardScaler.
from sklearn.preprocessing import StandardScaler
# Create one scaler per variable
scaler_x = StandardScaler()
scaler_y = StandardScaler()
# StandardScaler expects 2-D input, so reshape, transform, then flatten back
x_normalized = scaler_x.fit_transform(x.reshape(-1, 1)).flatten()
y_normalized = scaler_y.fit_transform(y.reshape(-1, 1)).flatten()
# Fit a line in the normalized space
m_normalized, c_normalized = np.polyfit(x_normalized, y_normalized, 1)
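The normalized-space coefficients can be mapped back to the original units using the statistics stored on the fitted scalers (StandardScaler exposes the fitted mean as mean_ and the standard deviation as scale_); a sketch of the back-transformation:
# y = mu_y + sigma_y * (m_n * (x - mu_x) / sigma_x + c_n)
mu_x, sigma_x = scaler_x.mean_[0], scaler_x.scale_[0]
mu_y, sigma_y = scaler_y.mean_[0], scaler_y.scale_[0]
m_recovered = m_normalized * sigma_y / sigma_x
c_recovered = mu_y + sigma_y * c_normalized - m_recovered * mu_x
print(f"Recovered slope: {m_recovered}, intercept: {c_recovered}")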
To check how well the fit generalizes, we can split the data into training and test sets, fit on the training portion only, and predict on the held-out portion.
from sklearn.model_selection import train_test_split
# Hold out 20% of the data (a single point here) for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Fit only on the training data
m_train, c_train = np.polyfit(x_train, y_train, 1)
# Predict on the held-out test data
predicted_y_test = m_train * x_test + c_train
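We can then quantify how well the training-set fit predicts the held-out data, for example with the mean squared error (with only five samples the test set contains a single point, so this is purely illustrative):
# Mean squared error on the held-out test point(s)
test_mse = np.mean((y_test - predicted_y_test)**2)
print(f"Test MSE: {test_mse}")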
Numpy’s linear fit capabilities, chiefly through the np.polyfit() function, provide a simple yet powerful way to perform linear regression on data. With the fundamental concepts, usage patterns, and common practices covered here, readers can use Numpy effectively for linear fitting tasks. Whether for simple data analysis or more complex scientific computing projects, Numpy’s linear fit is a valuable tool in the data scientist’s toolkit.