Random sampling is the process of selecting a sample from a population by chance. In simple random sampling, each member of the population has an equal chance of being included in the sample. This makes the sample more likely, though not guaranteed, to be representative of the population.
numpy.random.choice
This function draws a random sample from a given 1-D array. The replace argument controls whether the same element can be drawn more than once.
import numpy as np
# Create a population
population = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Sampling with replacement
sample_with_replacement = np.random.choice(population, size=5, replace=True)
print("Sample with replacement:", sample_with_replacement)
# Sampling without replacement
sample_without_replacement = np.random.choice(population, size=5, replace=False)
print("Sample without replacement:", sample_without_replacement)
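The examples above give every element the same chance of selection. np.random.choice also accepts a p argument with one probability per element, which is useful when some members should be drawn more often than others. The sketch below is illustrative only: the weights and the seed are made-up values, not part of the original example.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

population = np.array([1, 2, 3, 4, 5])
# Hypothetical weights: non-negative, one per element, summing to 1
weights = np.array([0.5, 0.2, 0.1, 0.1, 0.1])

# Draw 1000 values with replacement, favouring the first element
weighted_sample = np.random.choice(population, size=1000, replace=True, p=weights)

# The share of draws equal to 1 should be close to its weight of 0.5
share_of_ones = np.mean(weighted_sample == 1)
print("Share of 1s:", share_of_ones)
```

Because the draws are random, the observed share will be near 0.5 rather than exactly equal to it.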
numpy.random.randint
This function generates random integers from the half-open interval [low, high): the lower bound is included and the upper bound is excluded.
# Generate a sample of 5 random integers between 0 and 10 (inclusive)
random_integers = np.random.randint(0, 11, size=5)
print("Random integers:", random_integers)
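The upper bound passed to np.random.randint is exclusive, which is why the call above uses 11 to get values up to 10. A quick way to convince yourself is to draw many values and look at the extremes. The seed and sample size here are arbitrary choices.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# Draw many integers with low=0 and high=11
draws = np.random.randint(0, 11, size=10_000)

# The smallest possible value is 0 and the largest possible value is 10;
# 11 itself is never returned because the upper bound is excluded
print("Smallest value drawn:", draws.min())
print("Largest value drawn:", draws.max())
```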
numpy.random.normal
This function is used to generate random samples from a normal (Gaussian) distribution.
# Generate a sample of 10 numbers from a normal distribution with mean 0 and standard deviation 1
normal_sample = np.random.normal(loc=0, scale=1, size=10)
print("Normal sample:", normal_sample)
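With only 10 values, the sample mean and standard deviation can be far from the loc and scale you asked for. The sketch below draws a much larger sample to show the estimates settling near the requested parameters. The values loc=5.0 and scale=2.0 are picked for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# Draw a large sample from a normal distribution with mean 5 and std 2
large_sample = np.random.normal(loc=5.0, scale=2.0, size=100_000)

# Sample estimates of the distribution parameters
sample_mean = large_sample.mean()
sample_std = large_sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

print("Sample mean:", sample_mean)
print("Sample std:", sample_std)
```

The estimates should land close to 5 and 2, though not exactly on them.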
When exploring a large dataset, it can be time-consuming to work with the entire dataset. Sampling a small subset of the data can help you quickly understand its characteristics, such as the distribution of variables and the presence of outliers.
import pandas as pd
# Generate a large dataset
data = pd.DataFrame(np.random.randn(10000, 5), columns=['A', 'B', 'C', 'D', 'E'])
# Take a sample of 100 rows
sample_data = data.sample(n=100)
print("Sample data shape:", sample_data.shape)
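If the data lives in a plain NumPy array rather than a DataFrame, one way to get a similar row sample is to draw row indices without replacement and index the array with them. This is a sketch of that idea, not a description of how pandas implements sample internally.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# A large 2-D array standing in for a dataset: 10000 rows, 5 columns
data = np.random.randn(10000, 5)

# Passing an integer to np.random.choice samples from range(n),
# so this picks 100 distinct row indices
row_indices = np.random.choice(data.shape[0], size=100, replace=False)

# Fancy indexing pulls out the selected rows
sample_rows = data[row_indices]
print("Sample rows shape:", sample_rows.shape)
```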
In machine learning, it is common to split the dataset into training and testing sets. You can use sampling techniques to create these subsets.
from sklearn.model_selection import train_test_split
# Assume X is the feature matrix and y is the target vector
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
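To see the sampling step behind such a split, here is a minimal NumPy-only sketch that shuffles the row indices and cuts them into two disjoint groups. It is a simplified illustration under assumed names (test_fraction, shuffled_indices), not scikit-learn's actual implementation, and it will not reproduce the exact rows that train_test_split selects.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed so the run is repeatable

X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)

test_fraction = 0.2  # hypothetical name, mirrors test_size above
n_samples = X.shape[0]
n_test = round(n_samples * test_fraction)

# Shuffle all row indices, then split them into two disjoint groups
shuffled_indices = np.random.permutation(n_samples)
test_idx = shuffled_indices[:n_test]
train_idx = shuffled_indices[n_test:]

# Apply the same indices to features and targets so rows stay aligned
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```

Using one shuffled index array for both X and y is what keeps each feature row paired with its label.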
When you want your results to be reproducible, it is important to set a random seed. This ensures that the same sequence of random numbers is generated every time the code is run.
np.random.seed(42)
sample = np.random.choice(population, size=5, replace=False)
print("Reproducible sample:", sample)
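np.random.seed sets the state of NumPy's global random number generator. Newer NumPy versions also provide np.random.default_rng, which returns a separate Generator object that carries its own seed, so reproducibility does not depend on global state. A minimal sketch, with 42 as an arbitrary seed:

```python
import numpy as np

population = np.arange(1, 11)  # the integers 1 through 10

# Two independent generators created with the same seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.choice(population, size=5, replace=False)
sample_b = rng_b.choice(population, size=5, replace=False)

# Same seed and same sequence of calls give the same sample
print("Samples match:", np.array_equal(sample_a, sample_b))
```

Note that a Generator seeded with 42 does not produce the same numbers as np.random.seed(42); the two APIs use different underlying streams.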
When sampling, it is crucial to ensure that the sample is representative of the population. If the sample is biased, the conclusions drawn from the sample may not be valid for the population. You can use statistical tests and visualizations to check for sampling bias.
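One simple check, sketched below, is to compare the sample mean with the population mean and express the gap in standard-error units. The synthetic population, the deliberately biased sample, and the helper name standardized_gap are all assumptions made for illustration. In practice you often do not know the population values and would compare against known reference statistics instead.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# Synthetic population for illustration only
population = np.random.normal(loc=50, scale=10, size=100_000)

# A simple random sample and a deliberately biased one (largest values only)
random_sample = np.random.choice(population, size=1000, replace=False)
biased_sample = np.sort(population)[-1000:]

def standardized_gap(sample, population):
    """Gap between sample mean and population mean, in standard-error units."""
    standard_error = population.std() / np.sqrt(len(sample))
    return (sample.mean() - population.mean()) / standard_error

random_gap = standardized_gap(random_sample, population)
biased_gap = standardized_gap(biased_sample, population)

# A gap of a few units or less is consistent with chance;
# a very large gap suggests the sample is not representative
print("Random sample gap:", random_gap)
print("Biased sample gap:", biased_gap)
```

A check on the mean alone can miss other kinds of bias, so it is worth comparing spread and distribution shape as well.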
NumPy's random sampling functions give data scientists and researchers a compact set of tools for working with data. With the functions and practices covered here, you can apply sampling to tasks such as data exploration, model training, and hypothesis testing. Remember to set a random seed for reproducibility and to check for sampling bias so that your results remain valid.