`docs/machine-learning/data-engineering-basics/data-formats/json.mdx` (new file, +120 lines)

---
title: "JSON: The Semi-Structured Standard"
sidebar_label: JSON
description: "Mastering JSON for Machine Learning: handling nested data, converting dictionaries, and efficient parsing for NLP pipelines."
tags: [data-engineering, json, api, semi-structured-data, python, nlp]
---

**JSON (JavaScript Object Notation)** is a lightweight, text-based format for storing and transporting data. While CSVs are perfect for simple tables, JSON excels at representing **hierarchical** or **nested** data—where one observation might contain lists or other sub-observations.

## 1. JSON Syntax vs. Python Dictionaries

JSON structure is almost identical to a Python dictionary. It uses key-value pairs and supports several data types:

* **Objects:** Enclosed in `{}` (Maps to Python `dict`).
* **Arrays:** Enclosed in `[]` (Maps to Python `list`).
* **Values:** Strings, Numbers, Booleans (`true`/`false`), and `null`.

```json
{
  "user_id": 101,
  "metadata": {
    "login_count": 5,
    "tags": ["premium", "active"]
  },
  "is_active": true
}

```
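
A quick sketch of how these types map into Python using the standard `json` module, round-tripping the record above:

```python
import json

raw = '{"user_id": 101, "metadata": {"login_count": 5, "tags": ["premium", "active"]}, "is_active": true}'

record = json.loads(raw)           # JSON text -> Python objects
print(type(record))                # <class 'dict'>
print(record["metadata"]["tags"])  # ['premium', 'active']  (JSON array -> list)
print(record["is_active"])         # True  (JSON true -> Python True)

encoded = json.dumps(record, indent=2)  # and back to a JSON string
```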

## 2. Why JSON is Critical for ML

### A. Natural Language Processing (NLP)

Text data often comes with complex metadata (author, timestamp, geolocation, and nested entity tags). JSON allows all this info to stay bundled with the raw text.

### B. Configuration Files

Many ML frameworks and training scripts use JSON (or its cousin, YAML) to store **hyperparameters** and run configurations.

```json
{
  "model": "ResNet-50",
  "learning_rate": 0.001,
  "optimizer": "Adam"
}

```

### C. API Responses

As discussed in the [APIs section](/tutorial/machine-learning/data-engineering-basics/data-collection/apis), almost every web service returns data in JSON format.

## 3. The "Flattening" Problem

Most machine learning models (like linear regression or XGBoost) expect a **flat** 2D array of rows and columns; they cannot "see" inside a nested JSON object. Data engineers must therefore **flatten** (or **normalize**) the data first.

```mermaid
graph LR
Nested[Nested JSON] --> Normalize["pd.json_normalize()"]
Normalize --> Flat[Flat DataFrame]
style Normalize fill:#f3e5f5,stroke:#7b1fa2,color:#333

```

**Example in Python:**

```python
import pandas as pd

raw_json = [
    {"name": "Alice", "info": {"age": 25, "city": "NY"}},
    {"name": "Bob", "info": {"age": 30, "city": "SF"}}
]

# Flattens 'info' into 'info.age' and 'info.city' columns
df = pd.json_normalize(raw_json)
print(df)  # columns: name, info.age, info.city

```

## 4. Performance Trade-offs

| Feature | JSON | CSV | Parquet |
| --- | --- | --- | --- |
| **Flexibility** | **Very High** (Schema-less) | Low (Fixed Columns) | Medium (Evolving Schema) |
| **Parsing Speed** | Slow (Heavy string parsing) | Medium | **Very Fast** |
| **File Size** | Large (Repeated Keys) | Medium | Small (Binary) |

:::note
In a JSON file, each key (e.g., `"user_id"`) is repeated for every single record, which wastes disk space compared to CSV, where the column names appear only once in the header row.
:::
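
A toy way to see the overhead yourself (not a benchmark; the exact numbers depend on the data):

```python
import csv
import io
import json

rows = [{"user_id": i, "login_count": i % 7, "is_active": True} for i in range(1_000)]

# JSON: every record repeats every key
json_size = len(json.dumps(rows).encode("utf-8"))

# CSV: column names appear once, in the header
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "login_count", "is_active"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue().encode("utf-8"))

print(json_size, csv_size)  # the JSON payload is several times larger
```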

## 5. JSONL: The Big Data Variant

Standard JSON files (a single large array or object) typically have to be loaded into memory in full before they can be parsed. For datasets with millions of records, we use **JSONL (JSON Lines)** instead.

* Each line in the file is a separate, valid JSON object.
* **Benefit:** You can stream the file line by line without ever holding the whole dataset in RAM.

```text
{"id": 1, "text": "Hello world"}
{"id": 2, "text": "Machine Learning is fun"}

```
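
A minimal streaming-reader sketch, assuming the two-line file above is saved as `dataset.jsonl`:

```python
import json

# Read a JSONL file one record at a time; only one line is ever held in memory
with open("dataset.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # tolerate trailing blank lines
            continue
        record = json.loads(line)
        print(record["id"], record["text"])
```

pandas can also consume the same file in chunks via `pd.read_json("dataset.jsonl", lines=True, chunksize=10_000)`, which yields DataFrames instead of dictionaries.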

## 6. Best Practices for ML Engineers

1. **Validation:** Use JSON Schema to check that the data you're ingesting hasn't silently changed structure.
2. **Encoding:** Always use `UTF-8` to avoid character corruption in text data.
3. **Compression:** Since JSON is text-heavy, store raw JSON files compressed as `.gz` or `.zip`; this can save up to 90% of the space (see the sketch below).
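
A minimal sketch combining points 2 and 3, using only the standard library (the file name `records.json.gz` is just an example):

```python
import gzip
import json

records = [{"id": 1, "text": "Hello world"}, {"id": 2, "text": "café"}]

# Write gzip-compressed JSON with explicit UTF-8 encoding
with gzip.open("records.json.gz", "wt", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Read it back; gzip.open decompresses transparently
with gzip.open("records.json.gz", "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[1]["text"])  # café
```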

## References for More Details

* **[Python `json` Module](https://docs.python.org/3/library/json.html):** Learning `json.loads()` and `json.dumps()`.

* **[Pandas `json_normalize` Guide](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html):** Mastering complex flattening of API data.

---

JSON is the king of flexibility, but for "Big Data" production environments where speed and storage are everything, we move to binary formats.

`docs/machine-learning/data-engineering-basics/data-formats/parquet.mdx` (new file, +91 lines)

---
title: "Parquet: The Big Data Gold Standard"
sidebar_label: Parquet
description: "Understanding Columnar storage, compression benefits, and why Parquet is the preferred format for high-performance ML pipelines."
tags: [data-engineering, parquet, big-data, columnar-storage, performance, cloud-storage]
---

**Apache Parquet** is an open-source, column-oriented data file format designed for efficient data storage and retrieval. Unlike CSV or JSON, which store data row-by-row, Parquet organizes data by **columns**. This single architectural shift makes it the industry standard for modern data lakes and ML feature stores.

## 1. Row-based vs. Columnar Storage

To understand Parquet, you must understand the difference in how data is laid out on your hard drive.

* **Row-based (CSV/SQL):** Stores all data for "User 1," then all data for "User 2."
* **Columnar (Parquet):** Stores all "User IDs" together, then all "Ages" together, then all "Incomes" together.


```mermaid
graph LR
subgraph Row_Storage [Row-Based: CSV]
R1[Row 1: ID, Age, Income]
R2[Row 2: ID, Age, Income]
end

subgraph Col_Storage [Column-Based: Parquet]
C1[IDs: 1, 2, 3...]
C2[Ages: 25, 30, 35...]
C3[Incomes: 50k, 60k...]
end

```

## 2. Why Parquet is Superior for ML

### A. Column Projection (Selective Reading)

In ML, you might have a dataset with 500 columns, but your specific model only needs 5 features.

* **CSV:** You must scan the entire file just to extract those 5 columns.
* **Parquet:** The reader "jumps" directly to the 5 columns you need and skips the other 495, cutting I/O by well over 90% in this scenario.

### B. Drastic Compression

Because Parquet stores similar data types together, it can use highly efficient compression algorithms (like Snappy or Gzip).

* **Example:** In an "Age" column, numbers are similar. Parquet can store "30, 30, 30, 31" as "3x30, 1x31" (**Run-Length Encoding**).

### C. Schema Preservation

Parquet is a binary format that stores **metadata**. It "knows" that a column is a 64-bit float or a Timestamp. You never have to worry about a "Date" column being accidentally read as a string.
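
A small sketch of that guarantee (assuming `pyarrow` or `fastparquet` is installed): dtypes survive a Parquet round trip, whereas a CSV round trip would hand the timestamp column back as plain strings.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-15", "2024-03-30"]),
    "income": [50_000.0, 60_000.0, 75_000.0],
})

df.to_parquet("users.parquet")            # the schema (dtypes) is stored in the file
restored = pd.read_parquet("users.parquet")

print(restored.dtypes)  # 'signup' comes back as datetime64[ns], not object
```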

## 3. Parquet vs. CSV: The Benchmarks

| Feature | CSV | Parquet |
| --- | --- | --- |
| **Storage Size** | 1.0x (Large) | **~0.2x (Small)** |
| **Query Speed** | Slow | **Very Fast** |
| **Cost (Cloud)** | Expensive (S3 scans more data) | **Cheap** (S3 scans less data) |
| **ML Readiness** | Requires manual type casting | **Plug-and-play** |

## 4. Using Parquet in Python

Pandas and PyArrow make it easy to switch from CSV to Parquet.

```python
import pandas as pd

# Requires 'pyarrow' or 'fastparquet' to be installed
df = pd.DataFrame({
    "feature_1": [0.1, 0.2, 0.3],
    "feature_2": [1, 2, 3],
    "target": [0, 1, 0],
})

# Saving a DataFrame to Parquet with Snappy compression
df.to_parquet('large_dataset.parquet', compression='snappy')

# Reading only specific columns (the magic of Parquet!)
df_subset = pd.read_parquet('large_dataset.parquet', columns=['feature_1', 'target'])

```

## 5. When to use Parquet

1. **Production Pipelines:** Always use Parquet for data passed between different stages of a pipeline (a partitioned-write sketch follows below).
2. **Large Datasets:** Once your data grows beyond a few hundred megabytes, the speed gains become obvious.
3. **Cloud Storage:** If storing data in AWS S3 or Google Cloud Storage, Parquet will save you significant money on data scan and egress costs.
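
As a sketch of point 1 (assuming the `pyarrow` engine and a hypothetical `event_date` column), a common pipeline pattern is to write Parquet partitioned by a key so that downstream stages read only the slices they need:

```python
import pandas as pd

events = pd.DataFrame({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 20.00],
})

# Writes a directory tree: events/event_date=2024-05-01/..., events/event_date=2024-05-02/...
events.to_parquet("events", engine="pyarrow", partition_cols=["event_date"])

# A later stage can read back a single partition cheaply
day = pd.read_parquet("events", filters=[("event_date", "=", "2024-05-01")])
```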

## References for More Details

* **[Apache Parquet Official Documentation](https://parquet.apache.org/):** Deep diving into the binary file structure.

* **[Databricks - Why Parquet?](https://www.databricks.com/glossary/what-is-parquet):** Understanding Parquet's role in the "Lakehouse" architecture.

---

Parquet is the king of analytical data storage. However, some streaming applications require a format that is optimized for high-speed row writes rather than column reads.

`docs/machine-learning/data-engineering-basics/data-formats/xml.mdx` (new file, +106 lines)

---
title: "XML: Extensible Markup Language"
sidebar_label: XML
description: "Handling hierarchical data in XML: parsing techniques, its role in Computer Vision annotations, and converting XML to ML-ready formats."
tags: [data-engineering, xml, data-formats, computer-vision, pascal-voc, web-services]
---

**XML** is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. While JSON has largely replaced XML for web APIs, XML remains a cornerstone in industrial systems and **Object Detection** datasets.

## 1. Anatomy of an XML Document

XML uses a tree-like structure consisting of **tags**, **attributes**, and **content**.

```xml
<annotation>
  <filename>image_01.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>120</ymin>
      <xmax>250</xmax>
      <ymax>300</ymax>
    </bndbox>
  </object>
</annotation>

```

## 2. XML in Machine Learning: Use Cases

### A. Computer Vision (Pascal VOC)

One of the most famous datasets in ML history, **Pascal VOC**, uses one XML file per image to store the class labels and bounding-box coordinates used for object detection.

### B. Enterprise Data Integration

Many older banking, insurance, and manufacturing systems exchange data exclusively via XML over SOAP (Simple Object Access Protocol).

### C. Configuration & Metadata

XML is often used to store metadata for scientific datasets where complex, nested relationships must be strictly defined by a **Schema (XSD)**.

## 3. Parsing XML in Python

Because XML is a tree, we don't read it like a flat file. We "traverse" the tree using libraries like `ElementTree` or `lxml`.

```python
import xml.etree.ElementTree as ET

# Parse the Pascal VOC-style annotation shown in section 1
tree = ET.parse('annotation.xml')
root = tree.getroot()

# Accessing specific data
filename = root.find('filename').text
for obj in root.findall('object'):
    name = obj.find('name').text
    print(f"Detected object: {name}")

```

## 4. XML vs. JSON

| Feature | XML | JSON |
| --- | --- | --- |
| **Metadata** | Supports Attributes + Elements | Only Key-Value pairs |
| **Strictness** | High (Requires XSD validation) | Low (Flexible) |
| **Size** | Verbose (Closing tags increase size) | Compact |
| **Readability** | High (Document-centric) | High (Data-centric) |

## 5. The Challenge: Deep Nesting

Just like [JSON](/tutorial/machine-learning/data-engineering-basics/data-formats/json), XML is hierarchical. To use it in a standard ML model (like a Random Forest), you must **Flatten** the tree into a table.

```mermaid
graph TD
XML[XML Root] --> Branch1[Branch: Metadata]
XML --> Branch2[Branch: Observations]
Branch2 --> Leaf[Leaf: Data Point]
Leaf --> Flatten[Flattening Logic]
Flatten --> CSV[2D Feature Matrix]

style XML fill:#f3e5f5,stroke:#7b1fa2,color:#333
style CSV fill:#e1f5fe,stroke:#01579b,color:#333

```
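
A minimal sketch of that flattening step for the Pascal VOC-style annotation from section 1 (assuming it is saved as `annotation.xml`): each `<object>` becomes one row, with the image-level fields repeated.

```python
import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse("annotation.xml").getroot()

# Image-level fields, repeated on every row
filename = root.find("filename").text
width = int(root.find("size/width").text)
height = int(root.find("size/height").text)

rows = []
for obj in root.findall("object"):
    box = obj.find("bndbox")
    rows.append({
        "filename": filename,
        "width": width,
        "height": height,
        "label": obj.find("name").text,
        "xmin": int(box.find("xmin").text),
        "ymin": int(box.find("ymin").text),
        "xmax": int(box.find("xmax").text),
        "ymax": int(box.find("ymax").text),
    })

df = pd.DataFrame(rows)  # one row per annotated object -> a flat 2D feature matrix
```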

## 6. Best Practices

1. **Use `lxml` for Speed:** The built-in `ElementTree` is fine for small files, but `lxml` is significantly faster when processing large datasets.
2. **Beware of "XML Bombs":** Malicious XML files can use entity expansion to crash your parser (a denial-of-service attack). Use **defusedxml** if you are parsing untrusted data from the web (see the sketch below).
3. **Schema Validation:** Always validate your XML against an `.xsd` file if one is available, so your ML pipeline doesn't break due to a missing tag.
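
A sketch of point 2, assuming the third-party `defusedxml` package is installed; it mirrors the `ElementTree` API but rejects entity-expansion payloads:

```python
# pip install defusedxml
import defusedxml.ElementTree as ET

untrusted = "<annotation><filename>image_01.jpg</filename></annotation>"

root = ET.fromstring(untrusted)  # would raise on a "billion laughs" entity bomb
print(root.find("filename").text)
```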


## References for More Details

* **[Python ElementTree Documentation](https://docs.python.org/3/library/xml.etree.elementtree.html):** Learning the standard library approach.
* **[Pascal VOC Dataset Format](http://host.robots.ox.ac.uk/pascal/VOC/):** Seeing how XML is used in real-world ML projects.

---

XML completes our look at "Text-Based" formats. While these are great for humans to read, they are slow for machines to process. Next, we look at the high-speed binary formats used in Big Data.