`docs/machine-learning/data-engineering-basics/data-formats/json.mdx` (new file, +120 lines)

---
title: "JSON: The Semi-Structured Standard"
sidebar_label: JSON
description: "Mastering JSON for Machine Learning: handling nested data, converting dictionaries, and efficient parsing for NLP pipelines."
tags: [data-engineering, json, api, semi-structured-data, python, nlp]
---

**JSON (JavaScript Object Notation)** is a lightweight, text-based format for storing and transporting data. While CSVs are perfect for simple tables, JSON excels at representing **hierarchical** or **nested** data—where one observation might contain lists or other sub-observations.

## 1. JSON Syntax vs. Python Dictionaries

JSON structure is almost identical to a Python dictionary. It uses key-value pairs and supports several data types:

* **Objects:** Enclosed in `{}` (Maps to Python `dict`).
* **Arrays:** Enclosed in `[]` (Maps to Python `list`).
* **Values:** Strings, Numbers, Booleans (`true`/`false`), and `null`.

```json
{
  "user_id": 101,
  "metadata": {
    "login_count": 5,
    "tags": ["premium", "active"]
  },
  "is_active": true
}

```
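
A quick sketch of how these types map into Python using the standard `json` module, round-tripping the record above:

```python
import json

raw = '{"user_id": 101, "metadata": {"login_count": 5, "tags": ["premium", "active"]}, "is_active": true}'

record = json.loads(raw)           # JSON text -> Python objects
print(type(record))                # <class 'dict'>
print(record["metadata"]["tags"])  # ['premium', 'active']  (JSON array -> list)
print(record["is_active"])         # True  (JSON true -> Python True)

encoded = json.dumps(record, indent=2)  # and back to a JSON string
```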

## 2. Why JSON is Critical for ML

### A. Natural Language Processing (NLP)

Text data often comes with complex metadata (author, timestamp, geolocation, and nested entity tags). JSON allows all this info to stay bundled with the raw text.

### B. Configuration Files

Many ML frameworks and training scripts use JSON (or its cousin, YAML) to store **hyperparameters** and run configurations.

```json
{
  "model": "ResNet-50",
  "learning_rate": 0.001,
  "optimizer": "Adam"
}

```

### C. API Responses

As discussed in the [APIs section](/tutorial/machine-learning/data-engineering-basics/data-collection/apis), almost every web service returns data in JSON format.

## 3. The "Flattening" Problem

Most machine learning models (like linear regression or XGBoost) expect a **flat** 2D array of rows and columns; they cannot "see" inside a nested JSON object. Data engineers must therefore **flatten** (or **normalize**) the data first.

```mermaid
graph LR
Nested[Nested JSON] --> Normalize["pd.json_normalize()"]
Normalize --> Flat[Flat DataFrame]
style Normalize fill:#f3e5f5,stroke:#7b1fa2,color:#333

```

**Example in Python:**

```python
import pandas as pd

raw_json = [
    {"name": "Alice", "info": {"age": 25, "city": "NY"}},
    {"name": "Bob", "info": {"age": 30, "city": "SF"}}
]

# Flattens 'info' into 'info.age' and 'info.city' columns
df = pd.json_normalize(raw_json)
print(df)  # columns: name, info.age, info.city

```

## 4. Performance Trade-offs

| Feature | JSON | CSV | Parquet |
| --- | --- | --- | --- |
| **Flexibility** | **Very High** (Schema-less) | Low (Fixed Columns) | Medium (Evolving Schema) |
| **Parsing Speed** | Slow (Heavy string parsing) | Medium | **Very Fast** |
| **File Size** | Large (Repeated Keys) | Medium | Small (Binary) |

:::note
In a JSON file, each key (e.g., `"user_id"`) is repeated for every single record, which wastes disk space compared to CSV, where the column names appear only once in the header row.
:::
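
A toy way to see the overhead yourself (not a benchmark; the exact numbers depend on the data):

```python
import csv
import io
import json

rows = [{"user_id": i, "login_count": i % 7, "is_active": True} for i in range(1_000)]

# JSON: every record repeats every key
json_size = len(json.dumps(rows).encode("utf-8"))

# CSV: column names appear once, in the header
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "login_count", "is_active"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue().encode("utf-8"))

print(json_size, csv_size)  # the JSON payload is several times larger
```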

## 5. JSONL: The Big Data Variant

Standard JSON files (a single large array or object) typically have to be loaded into memory in full before they can be parsed. For datasets with millions of records, we use **JSONL (JSON Lines)** instead.

* Each line in the file is a separate, valid JSON object.
* **Benefit:** You can stream the file line by line without ever holding the whole dataset in RAM.

```text
{"id": 1, "text": "Hello world"}
{"id": 2, "text": "Machine Learning is fun"}

```
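
A minimal streaming-reader sketch, assuming the two-line file above is saved as `dataset.jsonl`:

```python
import json

# Read a JSONL file one record at a time; only one line is ever held in memory
with open("dataset.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # tolerate trailing blank lines
            continue
        record = json.loads(line)
        print(record["id"], record["text"])
```

pandas can also consume the same file in chunks via `pd.read_json("dataset.jsonl", lines=True, chunksize=10_000)`, which yields DataFrames instead of dictionaries.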

## 6. Best Practices for ML Engineers

1. **Validation:** Use JSON Schema to check that the data you're ingesting hasn't silently changed structure.
2. **Encoding:** Always use `UTF-8` to avoid character corruption in text data.
3. **Compression:** Since JSON is text-heavy, store raw JSON files compressed as `.gz` or `.zip`; this can save up to 90% of the space (see the sketch below).
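
A minimal sketch combining points 2 and 3, using only the standard library (the file name `records.json.gz` is just an example):

```python
import gzip
import json

records = [{"id": 1, "text": "Hello world"}, {"id": 2, "text": "café"}]

# Write gzip-compressed JSON with explicit UTF-8 encoding
with gzip.open("records.json.gz", "wt", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Read it back; gzip.open decompresses transparently
with gzip.open("records.json.gz", "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[1]["text"])  # café
```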

## References for More Details

* **[Python `json` Module](https://docs.python.org/3/library/json.html):** Learning `json.loads()` and `json.dumps()`.

* **[Pandas `json_normalize` Guide](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html):** Mastering complex flattening of API data.

---

JSON is the king of flexibility, but for "Big Data" production environments where speed and storage are everything, we move to binary formats.

`docs/machine-learning/data-engineering-basics/data-formats/parquet.mdx` (new file, +91 lines)

---
title: "Parquet: The Big Data Gold Standard"
sidebar_label: Parquet
description: "Understanding Columnar storage, compression benefits, and why Parquet is the preferred format for high-performance ML pipelines."
tags: [data-engineering, parquet, big-data, columnar-storage, performance, cloud-storage]
---

**Apache Parquet** is an open-source, column-oriented data file format designed for efficient data storage and retrieval. Unlike CSV or JSON, which store data row-by-row, Parquet organizes data by **columns**. This single architectural shift makes it the industry standard for modern data lakes and ML feature stores.

## 1. Row-based vs. Columnar Storage

To understand Parquet, you must understand the difference in how data is laid out on your hard drive.

* **Row-based (CSV/SQL):** Stores all data for "User 1," then all data for "User 2."
* **Columnar (Parquet):** Stores all "User IDs" together, then all "Ages" together, then all "Incomes" together.


```mermaid
graph LR
subgraph Row_Storage [Row-Based: CSV]
R1[Row 1: ID, Age, Income]
R2[Row 2: ID, Age, Income]
end

subgraph Col_Storage [Column-Based: Parquet]
C1[IDs: 1, 2, 3...]
C2[Ages: 25, 30, 35...]
C3[Incomes: 50k, 60k...]
end

```

## 2. Why Parquet is Superior for ML

### A. Column Projection (Selective Reading)

In ML, you might have a dataset with 500 columns, but your specific model only needs 5 features.

* **CSV:** You must scan the entire file just to extract those 5 columns.
* **Parquet:** The reader "jumps" directly to the 5 columns you need and skips the other 495, cutting I/O by well over 90% in this scenario.

### B. Drastic Compression

Because Parquet stores similar data types together, it can use highly efficient compression algorithms (like Snappy or Gzip).

* **Example:** In an "Age" column, numbers are similar. Parquet can store "30, 30, 30, 31" as "3x30, 1x31" (**Run-Length Encoding**).

### C. Schema Preservation

Parquet is a binary format that stores **metadata**. It "knows" that a column is a 64-bit float or a Timestamp. You never have to worry about a "Date" column being accidentally read as a string.
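
A small sketch of that guarantee (assuming `pyarrow` or `fastparquet` is installed): dtypes survive a Parquet round trip, whereas a CSV round trip would hand the timestamp column back as plain strings.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-15", "2024-03-30"]),
    "income": [50_000.0, 60_000.0, 75_000.0],
})

df.to_parquet("users.parquet")            # the schema (dtypes) is stored in the file
restored = pd.read_parquet("users.parquet")

print(restored.dtypes)  # 'signup' comes back as datetime64[ns], not object
```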

## 3. Parquet vs. CSV: The Benchmarks

| Feature | CSV | Parquet |
| --- | --- | --- |
| **Storage Size** | 1.0x (Large) | **~0.2x (Small)** |
| **Query Speed** | Slow | **Very Fast** |
| **Cost (Cloud)** | Expensive (S3 scans more data) | **Cheap** (S3 scans less data) |
| **ML Readiness** | Requires manual type casting | **Plug-and-play** |

## 4. Using Parquet in Python

Pandas and PyArrow make it easy to switch from CSV to Parquet.

```python
import pandas as pd

# Requires 'pyarrow' or 'fastparquet' to be installed
df = pd.DataFrame({
    "feature_1": [0.1, 0.2, 0.3],
    "feature_2": [1, 2, 3],
    "target": [0, 1, 0],
})

# Saving a DataFrame to Parquet with Snappy compression
df.to_parquet('large_dataset.parquet', compression='snappy')

# Reading only specific columns (the magic of Parquet!)
df_subset = pd.read_parquet('large_dataset.parquet', columns=['feature_1', 'target'])

```

## 5. When to use Parquet

1. **Production Pipelines:** Always use Parquet for data passed between different stages of a pipeline (a partitioned-write sketch follows below).
2. **Large Datasets:** Once your data grows beyond a few hundred megabytes, the speed gains become obvious.
3. **Cloud Storage:** If storing data in AWS S3 or Google Cloud Storage, Parquet will save you significant money on data scan and egress costs.
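
As a sketch of point 1 (assuming the `pyarrow` engine and a hypothetical `event_date` column), a common pipeline pattern is to write Parquet partitioned by a key so that downstream stages read only the slices they need:

```python
import pandas as pd

events = pd.DataFrame({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 20.00],
})

# Writes a directory tree: events/event_date=2024-05-01/..., events/event_date=2024-05-02/...
events.to_parquet("events", engine="pyarrow", partition_cols=["event_date"])

# A later stage can read back a single partition cheaply
day = pd.read_parquet("events", filters=[("event_date", "=", "2024-05-01")])
```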

## References for More Details

* **[Apache Parquet Official Documentation](https://parquet.apache.org/):** Deep diving into the binary file structure.

* **[Databricks - Why Parquet?](https://www.databricks.com/glossary/what-is-parquet):** Understanding Parquet's role in the "Lakehouse" architecture.

---

Parquet is the king of analytical data storage. However, some streaming applications require a format that is optimized for high-speed row writes rather than column reads.

`docs/machine-learning/data-engineering-basics/data-formats/xml.mdx` (new file, +106 lines)

---
title: "XML: Extensible Markup Language"
sidebar_label: XML
description: "Handling hierarchical data in XML: parsing techniques, its role in Computer Vision annotations, and converting XML to ML-ready formats."
tags: [data-engineering, xml, data-formats, computer-vision, pascal-voc, web-services]
---

**XML** is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. While JSON has largely replaced XML for web APIs, XML remains a cornerstone in industrial systems and **Object Detection** datasets.

## 1. Anatomy of an XML Document

XML uses a tree-like structure consisting of **tags**, **attributes**, and **content**.

```xml
<annotation>
  <filename>image_01.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>120</ymin>
      <xmax>250</xmax>
      <ymax>300</ymax>
    </bndbox>
  </object>
</annotation>

```

## 2. XML in Machine Learning: Use Cases

### A. Computer Vision (Pascal VOC)

One of the most famous datasets in ML history, **Pascal VOC**, uses one XML file per image to store the class labels and bounding-box coordinates used for object detection.

### B. Enterprise Data Integration

Many older banking, insurance, and manufacturing systems exchange data exclusively via XML over SOAP (Simple Object Access Protocol).

### C. Configuration & Metadata

XML is often used to store metadata for scientific datasets where complex, nested relationships must be strictly defined by a **Schema (XSD)**.

## 3. Parsing XML in Python

Because XML is a tree, we don't read it like a flat file. We "traverse" the tree using libraries like `ElementTree` or `lxml`.

```python
import xml.etree.ElementTree as ET

# Parse the Pascal VOC-style annotation shown in section 1
tree = ET.parse('annotation.xml')
root = tree.getroot()

# Accessing specific data
filename = root.find('filename').text
for obj in root.findall('object'):
    name = obj.find('name').text
    print(f"Detected object: {name}")

```

## 4. XML vs. JSON

| Feature | XML | JSON |
| --- | --- | --- |
| **Metadata** | Supports Attributes + Elements | Only Key-Value pairs |
| **Strictness** | High (Requires XSD validation) | Low (Flexible) |
| **Size** | Verbose (Closing tags increase size) | Compact |
| **Readability** | High (Document-centric) | High (Data-centric) |

## 5. The Challenge: Deep Nesting

Just like [JSON](/tutorial/machine-learning/data-engineering-basics/data-formats/json), XML is hierarchical. To use it in a standard ML model (like a Random Forest), you must **Flatten** the tree into a table.

```mermaid
graph TD
XML[XML Root] --> Branch1[Branch: Metadata]
XML --> Branch2[Branch: Observations]
Branch2 --> Leaf[Leaf: Data Point]
Leaf --> Flatten[Flattening Logic]
Flatten --> CSV[2D Feature Matrix]

style XML fill:#f3e5f5,stroke:#7b1fa2,color:#333
style CSV fill:#e1f5fe,stroke:#01579b,color:#333

```
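
A minimal sketch of that flattening step for the Pascal VOC-style annotation from section 1 (assuming it is saved as `annotation.xml`): each `<object>` becomes one row, with the image-level fields repeated.

```python
import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse("annotation.xml").getroot()

# Image-level fields, repeated on every row
filename = root.find("filename").text
width = int(root.find("size/width").text)
height = int(root.find("size/height").text)

rows = []
for obj in root.findall("object"):
    box = obj.find("bndbox")
    rows.append({
        "filename": filename,
        "width": width,
        "height": height,
        "label": obj.find("name").text,
        "xmin": int(box.find("xmin").text),
        "ymin": int(box.find("ymin").text),
        "xmax": int(box.find("xmax").text),
        "ymax": int(box.find("ymax").text),
    })

df = pd.DataFrame(rows)  # one row per annotated object -> a flat 2D feature matrix
```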

## 6. Best Practices

1. **Use `lxml` for Speed:** The built-in `ElementTree` is fine for small files, but `lxml` is significantly faster when processing large datasets.
2. **Beware of "XML Bombs":** Malicious XML files can use entity expansion to crash your parser (a denial-of-service attack). Use **defusedxml** if you are parsing untrusted data from the web (see the sketch below).
3. **Schema Validation:** Always validate your XML against an `.xsd` file if one is available, so your ML pipeline doesn't break due to a missing tag.
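
A sketch of point 2, assuming the third-party `defusedxml` package is installed; it mirrors the `ElementTree` API but rejects entity-expansion payloads:

```python
# pip install defusedxml
import defusedxml.ElementTree as ET

untrusted = "<annotation><filename>image_01.jpg</filename></annotation>"

root = ET.fromstring(untrusted)  # would raise on a "billion laughs" entity bomb
print(root.find("filename").text)
```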


## References for More Details

* **[Python ElementTree Documentation](https://docs.python.org/3/library/xml.etree.elementtree.html):** Learning the standard library approach.
* **[Pascal VOC Dataset Format](http://host.robots.ox.ac.uk/pascal/VOC/):** Seeing how XML is used in real-world ML projects.

---

XML completes our look at "Text-Based" formats. While these are great for humans to read, they are slow for machines to process. Next, we look at the high-speed binary formats used in Big Data.