---
title: "Loading Data in Scikit-Learn"
sidebar_label: Data Loading
description: "How to use Scikit-Learn's built-in datasets, fetchers, and external loaders to prepare data for modeling."
tags: [scikit-learn, data-loading, python, machine-learning, datasets]
---

Before you can train a model, you need to get your data into a format that Scikit-Learn understands. Scikit-Learn works primarily with **NumPy arrays** or **Pandas DataFrames**, but it also provides built-in tools to help you get started quickly.

## 1. The Scikit-Learn Data Format

Regardless of how you load your data, Scikit-Learn expects two main components:

1. **The Feature Matrix ($X$):** A 2D array of shape `(n_samples, n_features)`.
2. **The Target Vector ($y$):** A 1D array of shape `(n_samples,)` containing the labels or values you want to predict.
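
For example, a tiny dataset with three samples and two features might look like this (a minimal sketch using NumPy; the values are made up for illustration):

```python
import numpy as np

# Feature matrix X: shape (n_samples, n_features) = (3, 2)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

# Target vector y: shape (n_samples,) = (3,)
y = np.array([0, 0, 1])

print(X.shape, y.shape)  # (3, 2) (3,)
```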

## 2. Built-in "Toy" Datasets

Scikit-Learn comes bundled with small datasets that require no internet connection. These are perfect for testing your code or learning new algorithms.

* `load_iris()`: Classic classification dataset (flowers).
* `load_diabetes()`: Regression dataset.
* `load_digits()`: Classification dataset (handwritten digits).

```python
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Access data and labels
X = iris.data
y = iris.target

print(f"Features: {iris.feature_names}")
print(f"Target Names: {iris.target_names}")

```

## 3. Fetching Large Real-World Datasets

For larger datasets, Scikit-Learn provides "fetchers" that download data from the internet and cache it locally in your `~/scikit_learn_data` folder.

* `fetch_california_housing()`: Predict median house values.
* `fetch_20newsgroups()`: Text dataset for NLP.
* `fetch_lfw_people()`: Labeled Faces in the Wild (for face recognition).

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(f"Dataset shape: {housing.data.shape}")

```

## 4. Loading from External Sources

In a professional environment, you will rarely use the built-in datasets. You will likely load data from **CSVs**, **SQL Databases**, or **Pandas DataFrames**.

### From Pandas to Scikit-Learn

Scikit-Learn is designed to be "Pandas-friendly." You can pass DataFrames directly into models.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load your own CSV
df = pd.read_csv('my_data.csv')

# Split into X and y
X = df[['feature1', 'feature2']] # Select specific columns
y = df['target_column']

# Train model directly
model = LinearRegression().fit(X, y)

```

## 5. Generating Synthetic Data

Sometimes you need to create "fake" data to test how an algorithm handles specific scenarios (like high noise or non-linear patterns).

```python
from sklearn.datasets import make_blobs, make_moons

# Create 3 distinct clusters for a classification task
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# Create two interleaving half-moons for a non-linear pattern
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=42)

```

## 6. Understanding the "Bunch" Object

When you use `load_*` or `fetch_*`, Scikit-Learn returns a **`Bunch` object**. This is essentially a dictionary that contains:

* `.data`: The feature matrix.
* `.target`: The labels.
* `.feature_names`: The names of the columns.
* `.DESCR`: A full text description of where the data came from.

:::tip
Use `as_frame=True` in your loader to get the data returned as a Pandas DataFrame immediately: `data = load_iris(as_frame=True).frame`
:::
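
A quick way to inspect what a `Bunch` contains (a small sketch reusing the Iris loader from earlier):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary, so you can list its keys
print(list(iris.keys()))

# Attribute access and key access are equivalent
print(iris["data"].shape)  # same as iris.data.shape
```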

## References for More Details

* **[Sklearn Dataset Loading Guide](https://scikit-learn.org/stable/datasets.html):** Exploring all 20+ available fetchers and loaders.
* **[OpenML Integration](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html):** Accessing thousands of community-uploaded datasets via `fetch_openml`.

---

**Now that you can load data, the next step is to ensure it's in the right shape and split correctly for training and testing.**
---
title: Data Preparation in Scikit-Learn
sidebar_label: Data Preparation
description: "Transforming raw data into model-ready features using Scikit-Learn's preprocessing and imputation tools."
tags: [scikit-learn, preprocessing, encoding, scaling, imputation]
---

Before feeding data into an algorithm, it must be cleaned and transformed. Scikit-Learn provides a robust suite of **Transformers**—classes that follow a standard `.fit()` and `.transform()` API—to automate this work.

## 1. Handling Missing Values

Most Machine Learning models cannot handle `NaN` (Not a Number) or `null` values. The `SimpleImputer` class helps fill these gaps.

```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = [[1, 2], [np.nan, 3], [7, 6]]

# strategy='mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)

```

## 2. Encoding Categorical Data

Computers understand numbers, not words. If you have a column for "City" (New York, Paris, Tokyo), you must encode it.

### A. One-Hot Encoding (Nominal)

Creates a new binary column for each category. Best for data without a natural order.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
cities = [['New York'], ['Paris'], ['Tokyo']]
encoded_cities = encoder.fit_transform(cities)

```

### B. Ordinal Encoding (Ranked)

Converts categories into integers (e.g., 0, 1, 2). Use this when the order matters (e.g., Small, Medium, Large).
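
A minimal sketch using `OrdinalEncoder`, with the category order spelled out explicitly (the example categories are assumptions for illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small < Medium < Large
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes = [['Small'], ['Large'], ['Medium']]
encoded_sizes = encoder.fit_transform(sizes)  # [[0.], [2.], [1.]]
```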

## 3. Feature Scaling

As discussed in our [Data Engineering module](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling), scaling ensures that features with large ranges (like Salary) don't overpower features with small ranges (like Age).

### Standardization (`StandardScaler`)

Rescales data to have a mean of 0 and a standard deviation of 1.

$$
z = \frac{x - \mu}{\sigma}
$$

### Normalization (`MinMaxScaler`)

Rescales data to a fixed range, usually $[0, 1]$.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)

```
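
`MinMaxScaler` follows the same API; a minimal sketch, reusing `X_filled` from the imputation example above:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_filled)
```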

## 4. The `fit` vs `transform` Rule

One of the most important concepts in Scikit-Learn is the distinction between these two methods:

* **`.fit()`**: The transformer calculates the parameters (e.g., the mean and standard deviation of your data). **Only do this on Training data.**
* **`.transform()`**: The transformer applies those calculated parameters to the data.
* **`.fit_transform()`**: Does both in one step.

```mermaid
graph TD
Train[Training Data] --> Fit[Fit: Learn Mean/Std]
Fit --> TransTrain[Transform Training Data]
Fit --> TransTest[Transform Test Data]

style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333

```

:::warning
Never `fit` on your Test data. This leads to **Data Leakage**, where your model "cheats" by seeing the distribution of the test set during training.
:::
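
A sketch of the correct pattern, assuming the data has already been split into `X_train` and `X_test`:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Reuse those same parameters on the test data -- no second fit
X_test_scaled = scaler.transform(X_test)
```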

## 5. ColumnTransformer: Selective Processing

In real datasets, you have a mix of types: some columns need scaling, others need encoding, and some need nothing. `ColumnTransformer` allows you to apply different prep steps to different columns simultaneously.

```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['age', 'income']),
('cat', OneHotEncoder(), ['city', 'gender'])
])

# X_processed = preprocessor.fit_transform(df)

```

---

## References for More Details

* **[Scikit-Learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html):** Discovering advanced transformers like `PowerTransformer` or `PolynomialFeatures`.
* **[Imputing Missing Values](https://scikit-learn.org/stable/modules/impute.html):** Learning about `IterativeImputer` (MICE) and `KNNImputer`.

---

**Manual data preparation can get messy and hard to replicate. To solve this, Scikit-Learn uses a powerful tool to chain all these steps together into a single object.**
---
title: Hyperparameter Tuning
sidebar_label: Hyperparameter Tuning
description: "Optimizing model performance using GridSearchCV, RandomizedSearchCV, and Halving techniques."
tags: [scikit-learn, hyperparameter-tuning, grid-search, optimization, model-selection]
---

In Machine Learning, there is a crucial difference between **Parameters** and **Hyperparameters**:

* **Parameters:** Learned by the model during training (e.g., the coefficients in a linear regression or the weights in a neural network).
* **Hyperparameters:** Set by the engineer *before* training starts (e.g., the depth of a Decision Tree or the number of neighbors in KNN).

**Hyperparameter Tuning** is the automated search for the best combination of these settings to minimize error.

## 1. Why Tune Hyperparameters?

Most algorithms come with default settings that work reasonably well, but they are rarely optimal for your specific data. Proper tuning can often bridge the gap between a mediocre model and a state-of-the-art one.

## 2. GridSearchCV: The Exhaustive Search

`GridSearchCV` takes a predefined list of values for each hyperparameter and tries **every possible combination**.

* **Pros:** Guaranteed to find the best combination within the provided grid.
* **Cons:** Computationally expensive. If you have 5 parameters with 5 values each, you must evaluate $5^5 = 3,125$ combinations, and each one is refit once per cross-validation fold.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")

```

## 3. RandomizedSearchCV: The Efficient Alternative

Instead of trying every combination, `RandomizedSearchCV` picks a fixed number of random combinations from a distribution.

* **Pros:** Much faster than GridSearch. It often finds a result almost as good as GridSearch in a fraction of the time.
* **Cons:** Not guaranteed to find the absolute best "peak" in the parameter space.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
'n_estimators': randint(50, 500),
'max_depth': [None, 10, 20, 30, 40, 50],
}

random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20, cv=5)
random_search.fit(X_train, y_train)

```

## 4. Advanced: Successive Halving

For massive datasets, even Random Search is slow. Scikit-Learn offers **`HalvingGridSearchCV`** (still experimental, as shown in the sketch below). It trains all combinations on a small amount of data, throws away the bottom 50%, and keeps "promising" candidates for the next round with more data.

```mermaid
graph TD
S1[Round 1: 100 candidates, 10% data] --> S2[Round 2: 50 candidates, 20% data]
S2 --> S3[Round 3: 25 candidates, 40% data]
S3 --> S4[Final Round: Best candidates, 100% data]

style S1 fill:#fff3e0,stroke:#ef6c00,color:#333
style S4 fill:#e8f5e9,stroke:#2e7d32,color:#333

```
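
A minimal sketch of `HalvingGridSearchCV`; it is still experimental, so the explicit `enable_halving_search_cv` import is required. This reuses the `param_grid` and training data from the GridSearch example above:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the experimental API)
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

# factor=2 keeps the top half of candidates at each round
halving_search = HalvingGridSearchCV(RandomForestClassifier(), param_grid, factor=2, cv=5)
halving_search.fit(X_train, y_train)

print(f"Best Parameters: {halving_search.best_params_}")
```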

## 5. Avoiding the Validation Trap

If you tune your hyperparameters using the **Test Set**, you are "leaking" information. The model will look great on that test set, but fail on new data.

**The Solution:** Use **Nested Cross-Validation** or ensure that your `GridSearchCV` only uses the **Training Set** (it will internally split the training data into smaller validation folds).

```mermaid
graph LR
FullData[Full Dataset] --> Split{Initial Split}
Split --> Train[Training Set]
Split --> Test[Hold-out Test Set]

subgraph Optimization [GridSearch with Internal CV]
Train --> CV1[Fold 1]
Train --> CV2[Fold 2]
Train --> CV3[Fold 3]
end

Optimization --> BestModel[Best Hyperparameters]
BestModel --> FinalEval[Final Evaluation on Test Set]

```
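
A sketch of that workflow, assuming `X` and `y` are already loaded:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set that the search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV splits X_train internally into validation folds
grid_search = GridSearchCV(RandomForestClassifier(), {'max_depth': [5, 10, None]}, cv=5)
grid_search.fit(X_train, y_train)

# The hold-out test set is used exactly once, for the final score
print(grid_search.score(X_test, y_test))
```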

## 6. Tuning Strategy Summary

| Method | Best for... | Resource Usage |
| --- | --- | --- |
| **Manual Tuning** | Initial exploration / small models | Low |
| **GridSearch** | Small number of parameters | High |
| **RandomSearch** | Many parameters / large search space | Moderate |
| **Halving Search** | Large datasets / expensive training | Low-Moderate |

## References for More Details

* **[Sklearn Tuning Guide](https://scikit-learn.org/stable/modules/grid_search.html):** Deep dive into `HalvingGridSearchCV` and custom scoring.

---

**Now that your model is fully optimized and tuned, it's time to evaluate its performance using metrics that go beyond simple "Accuracy."**