---
title: "Loading Data in Scikit-Learn"
sidebar_label: Data Loading
description: "How to use Scikit-Learn's built-in datasets, fetchers, and external loaders to prepare data for modeling."
tags: [scikit-learn, data-loading, python, machine-learning, datasets]
---

Before you can train a model, you need to get your data into a format that Scikit-Learn understands. Scikit-Learn works primarily with **NumPy arrays** or **Pandas DataFrames**, but it also provides built-in tools to help you get started quickly.

## 1. The Scikit-Learn Data Format

Regardless of how you load your data, Scikit-Learn expects two main components:

1. **The Feature Matrix ($X$):** A 2D array of shape `(n_samples, n_features)`.
2. **The Target Vector ($y$):** A 1D array of shape `(n_samples,)` containing the labels or values you want to predict.
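
For example, a tiny dataset with three samples and two features might look like this (a minimal sketch using NumPy; the values are made up for illustration):

```python
import numpy as np

# Feature matrix X: shape (n_samples, n_features) = (3, 2)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

# Target vector y: shape (n_samples,) = (3,)
y = np.array([0, 0, 1])

print(X.shape, y.shape)  # (3, 2) (3,)
```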

## 2. Built-in "Toy" Datasets

Scikit-Learn comes bundled with small datasets that require no internet connection. These are perfect for testing your code or learning new algorithms.

* `load_iris()`: Classic classification dataset (flowers).
* `load_diabetes()`: Regression dataset.
* `load_digits()`: Classification dataset (handwritten digits).

```python
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Access data and labels
X = iris.data
y = iris.target

print(f"Features: {iris.feature_names}")
print(f"Target Names: {iris.target_names}")

```

## 3. Fetching Large Real-World Datasets

For larger datasets, Scikit-Learn provides "fetchers" that download data from the internet and cache it locally in your `~/scikit_learn_data` folder.

* `fetch_california_housing()`: Predict median house values.
* `fetch_20newsgroups()`: Text dataset for NLP.
* `fetch_lfw_people()`: Labeled Faces in the Wild (for face recognition).

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(f"Dataset shape: {housing.data.shape}")

```

## 4. Loading from External Sources

In a professional environment, you will rarely use the built-in datasets. You will likely load data from **CSVs**, **SQL Databases**, or **Pandas DataFrames**.

### From Pandas to Scikit-Learn

Scikit-Learn is designed to be "Pandas-friendly." You can pass DataFrames directly into models.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load your own CSV
df = pd.read_csv('my_data.csv')

# Split into X and y
X = df[['feature1', 'feature2']] # Select specific columns
y = df['target_column']

# Train model directly
model = LinearRegression().fit(X, y)

```

## 5. Generating Synthetic Data

Sometimes you need to create "fake" data to test how an algorithm handles specific scenarios (like high noise or non-linear patterns).

```python
from sklearn.datasets import make_blobs, make_moons

# Create 3 distinct clusters for a classification task
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# Create two interleaving half-moons for a non-linear pattern
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=42)

```

## 6. Understanding the "Bunch" Object

When you use `load_*` or `fetch_*`, Scikit-Learn returns a **`Bunch` object**. This is essentially a dictionary that contains:

* `.data`: The feature matrix.
* `.target`: The labels.
* `.feature_names`: The names of the columns.
* `.DESCR`: A full text description of where the data came from.

:::tip
Use `as_frame=True` in your loader to get the data returned as a Pandas DataFrame immediately: `data = load_iris(as_frame=True).frame`
:::
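
A quick way to inspect what a `Bunch` contains (a small sketch reusing the Iris loader from earlier):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary, so you can list its keys
print(list(iris.keys()))

# Attribute access and key access are equivalent
print(iris["data"].shape)  # same as iris.data.shape
```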

## References for More Details

* **[Sklearn Dataset Loading Guide](https://scikit-learn.org/stable/datasets.html):** Exploring all 20+ available fetchers and loaders.
* **[OpenML Integration](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html):** Accessing thousands of community-uploaded datasets via `fetch_openml`.

---

**Now that you can load data, the next step is to ensure it's in the right shape and split correctly for training and testing.**
---
title: Data Preparation in Scikit-Learn
sidebar_label: Data Preparation
description: "Transforming raw data into model-ready features using Scikit-Learn's preprocessing and imputation tools."
tags: [scikit-learn, preprocessing, encoding, scaling, imputation]
---

Before feeding data into an algorithm, it must be cleaned and transformed. Scikit-Learn provides a robust suite of **Transformers**—classes that follow a standard `.fit()` and `.transform()` API—to automate this work.

## 1. Handling Missing Values

Most Machine Learning models cannot handle `NaN` (Not a Number) or `null` values. The `SimpleImputer` class helps fill these gaps.

```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = [[1, 2], [np.nan, 3], [7, 6]]

# strategy='mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)

```

## 2. Encoding Categorical Data

Computers understand numbers, not words. If you have a column for "City" (New York, Paris, Tokyo), you must encode it.

### A. One-Hot Encoding (Nominal)

Creates a new binary column for each category. Best for data without a natural order.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
cities = [['New York'], ['Paris'], ['Tokyo']]
encoded_cities = encoder.fit_transform(cities)

```

### B. Ordinal Encoding (Ranked)

Converts categories into integers (e.g., 0, 1, 2). Use this when the order matters (e.g., Small, Medium, Large).
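
A minimal sketch using `OrdinalEncoder`, with the category order spelled out explicitly (the example categories are assumptions for illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small < Medium < Large
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes = [['Small'], ['Large'], ['Medium']]
encoded_sizes = encoder.fit_transform(sizes)  # [[0.], [2.], [1.]]
```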

## 3. Feature Scaling

As discussed in our [Data Engineering module](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling), scaling ensures that features with large ranges (like Salary) don't overpower features with small ranges (like Age).

### Standardization (`StandardScaler`)

Rescales data to have a mean of 0 and a standard deviation of 1.

$$
z = \frac{x - \mu}{\sigma}
$$

### Normalization (`MinMaxScaler`)

Rescales data to a fixed range, usually $[0, 1]$.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)

```
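
`MinMaxScaler` follows the same API; a minimal sketch, reusing `X_filled` from the imputation example above:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_filled)
```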

## 4. The `fit` vs `transform` Rule

One of the most important concepts in Scikit-Learn is the distinction between these two methods:

* **`.fit()`**: The transformer calculates the parameters (e.g., the mean and standard deviation of your data). **Only do this on Training data.**
* **`.transform()`**: The transformer applies those calculated parameters to the data.
* **`.fit_transform()`**: Does both in one step.

```mermaid
graph TD
Train[Training Data] --> Fit[Fit: Learn Mean/Std]
Fit --> TransTrain[Transform Training Data]
Fit --> TransTest[Transform Test Data]

style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333

```

:::warning
Never `fit` on your Test data. This leads to **Data Leakage**, where your model "cheats" by seeing the distribution of the test set during training.
:::
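
A sketch of the correct pattern, assuming the data has already been split into `X_train` and `X_test`:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Reuse those same parameters on the test data -- no second fit
X_test_scaled = scaler.transform(X_test)
```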

## 5. ColumnTransformer: Selective Processing

In real datasets, you have a mix of types: some columns need scaling, others need encoding, and some need nothing. `ColumnTransformer` allows you to apply different prep steps to different columns simultaneously.

```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['age', 'income']),
('cat', OneHotEncoder(), ['city', 'gender'])
])

# X_processed = preprocessor.fit_transform(df)

```

---

## References for More Details

* **[Scikit-Learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html):** Discovering advanced transformers like `PowerTransformer` or `PolynomialFeatures`.
* **[Imputing Missing Values](https://scikit-learn.org/stable/modules/impute.html):** Learning about `IterativeImputer` (MICE) and `KNNImputer`.

---

**Manual data preparation can get messy and hard to replicate. To solve this, Scikit-Learn uses a powerful tool to chain all these steps together into a single object.**
---
title: Hyperparameter Tuning
sidebar_label: Hyperparameter Tuning
description: "Optimizing model performance using GridSearchCV, RandomizedSearchCV, and Halving techniques."
tags: [scikit-learn, hyperparameter-tuning, grid-search, optimization, model-selection]
---

In Machine Learning, there is a crucial difference between **Parameters** and **Hyperparameters**:

* **Parameters:** Learned by the model during training (e.g., the coefficients in a linear regression or the weights in a neural network).
* **Hyperparameters:** Set by the engineer *before* training starts (e.g., the depth of a Decision Tree or the number of neighbors in KNN).

**Hyperparameter Tuning** is the automated search for the best combination of these settings to minimize error.

## 1. Why Tune Hyperparameters?

Most algorithms come with default settings that work reasonably well, but they are rarely optimal for your specific data. Proper tuning can often bridge the gap between a mediocre model and a state-of-the-art one.

## 2. GridSearchCV: The Exhaustive Search

`GridSearchCV` takes a predefined list of values for each hyperparameter and tries **every possible combination**.

* **Pros:** Guaranteed to find the best combination within the provided grid.
* **Cons:** Computationally expensive. If you have 5 parameters with 5 values each, you must evaluate $5^5 = 3,125$ combinations, and each one is refit once per cross-validation fold.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")

```

## 3. RandomizedSearchCV: The Efficient Alternative

Instead of trying every combination, `RandomizedSearchCV` picks a fixed number of random combinations from a distribution.

* **Pros:** Much faster than GridSearch. It often finds a result almost as good as GridSearch in a fraction of the time.
* **Cons:** Not guaranteed to find the absolute best "peak" in the parameter space.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
'n_estimators': randint(50, 500),
'max_depth': [None, 10, 20, 30, 40, 50],
}

random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20, cv=5)
random_search.fit(X_train, y_train)

```

## 4. Advanced: Successive Halving

For massive datasets, even Random Search is slow. Scikit-Learn offers **`HalvingGridSearchCV`** (still experimental, as shown in the sketch below). It trains all combinations on a small amount of data, throws away the bottom 50%, and keeps "promising" candidates for the next round with more data.

```mermaid
graph TD
S1[Round 1: 100 candidates, 10% data] --> S2[Round 2: 50 candidates, 20% data]
S2 --> S3[Round 3: 25 candidates, 40% data]
S3 --> S4[Final Round: Best candidates, 100% data]

style S1 fill:#fff3e0,stroke:#ef6c00,color:#333
style S4 fill:#e8f5e9,stroke:#2e7d32,color:#333

```
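
A minimal sketch of `HalvingGridSearchCV`; it is still experimental, so the explicit `enable_halving_search_cv` import is required. This reuses the `param_grid` and training data from the GridSearch example above:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the experimental API)
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

# factor=2 keeps the top half of candidates at each round
halving_search = HalvingGridSearchCV(RandomForestClassifier(), param_grid, factor=2, cv=5)
halving_search.fit(X_train, y_train)

print(f"Best Parameters: {halving_search.best_params_}")
```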

## 5. Avoiding the Validation Trap

If you tune your hyperparameters using the **Test Set**, you are "leaking" information. The model will look great on that test set, but fail on new data.

**The Solution:** Use **Nested Cross-Validation** or ensure that your `GridSearchCV` only uses the **Training Set** (it will internally split the training data into smaller validation folds).

```mermaid
graph LR
FullData[Full Dataset] --> Split{Initial Split}
Split --> Train[Training Set]
Split --> Test[Hold-out Test Set]

subgraph Optimization [GridSearch with Internal CV]
Train --> CV1[Fold 1]
Train --> CV2[Fold 2]
Train --> CV3[Fold 3]
end

Optimization --> BestModel[Best Hyperparameters]
BestModel --> FinalEval[Final Evaluation on Test Set]

```
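
A sketch of that workflow, assuming `X` and `y` are already loaded:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set that the search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV splits X_train internally into validation folds
grid_search = GridSearchCV(RandomForestClassifier(), {'max_depth': [5, 10, None]}, cv=5)
grid_search.fit(X_train, y_train)

# The hold-out test set is used exactly once, for the final score
print(grid_search.score(X_test, y_test))
```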

## 6. Tuning Strategy Summary

| Method | Best for... | Resource Usage |
| --- | --- | --- |
| **Manual Tuning** | Initial exploration / small models | Low |
| **GridSearch** | Small number of parameters | High |
| **RandomSearch** | Many parameters / large search space | Moderate |
| **Halving Search** | Large datasets / expensive training | Low-Moderate |

## References for More Details

* **[Sklearn Tuning Guide](https://scikit-learn.org/stable/modules/grid_search.html):** Deep dive into `HalvingGridSearchCV` and custom scoring.

---

**Now that your model is fully optimized and tuned, it's time to evaluate its performance using metrics that go beyond simple "Accuracy."**