sunilgentyala/zero-trust-data-pipeline-framework

 
 

Zero-Trust Data Pipeline Verification Framework

Reference implementation of the framework described in:

Mudusu, S. K., & Gentyala, S. (2026). Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering. Journal of Recent Trends in Computer Science and Engineering, 14(2), 10–25. https://jrtcse.com/index.php/home/article/view/JRTCSE.2026.14.2.2/JRTCSE.2026.14.2.2


What this implements

The paper proposes a zero-trust approach to data pipelines that feed AI systems: no data source is implicitly trusted, every record must pass verifiable quality gates, and every pipeline action is logged for audit. This repository translates those concepts into working Python code.

Concretely:

  • Secure ingestion — checksum every input file before parsing; reject unknown extensions and oversized files
  • Data validation — detect nulls, duplicates, missing required fields, and invalid date formats
  • Policy enforcement — evaluate declarative YAML rules (required columns, PII detection, null limits, file type allowlists)
  • Lineage tracking — record source, timestamp, and transformation steps in SQLite
  • Audit logging — append-only event log for every pipeline action, exportable to JSONL
  • Trust scoring — aggregate all stage results into a 0–100 AI-readiness score with letter grade
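The secure-ingestion step can be illustrated with a minimal sketch. This is not the code in src/ztdp/ingestion.py; the function name guard_and_checksum, the exception SketchIngestionError, the extension allowlist, and the 50 MiB cap are all assumptions chosen for illustration. It shows the order of operations described above: reject by extension and size first, then hash the raw bytes, all before any parsing.

```python
import hashlib
from pathlib import Path

ALLOWED_EXTENSIONS = {".csv", ".json"}  # hypothetical allowlist
MAX_BYTES = 50 * 1024 * 1024            # hypothetical size cap (50 MiB)


class SketchIngestionError(Exception):
    """Raised when a file fails a pre-parse guard (illustrative name)."""


def guard_and_checksum(path, allowed=ALLOWED_EXTENSIONS, max_bytes=MAX_BYTES):
    """Return the SHA-256 hex digest of `path`, or raise before any parsing happens."""
    p = Path(path)
    if p.suffix.lower() not in allowed:
        raise SketchIngestionError(f"extension {p.suffix!r} is not allowed")
    size = p.stat().st_size
    if size > max_bytes:
        raise SketchIngestionError(f"{size} bytes exceeds limit of {max_bytes}")
    digest = hashlib.sha256()
    with p.open("rb") as fh:
        # Hash in fixed-size chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the raw bytes rather than the parsed records means the digest can be compared against a value published by the data provider, independent of how the file is later parsed.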

Repository structure

zero-trust-data-pipeline-framework/
├── src/ztdp/
│   ├── ingestion.py        # File loading, checksum, format guard
│   ├── validation.py       # Null counts, duplicates, field checks
│   ├── policy_engine.py    # YAML policy loader and rule evaluator
│   ├── lineage.py          # SQLite-backed lineage tracker
│   ├── audit.py            # Append-only audit logger
│   ├── trust_score.py      # 0–100 weighted trust score
│   ├── config.py           # Configuration dataclasses
│   └── exceptions.py       # Typed exceptions per stage
├── examples/
│   ├── sample_pipeline.py  # End-to-end demonstration
│   ├── sample_input.csv    # 15-row healthcare sample dataset
│   └── policies.yaml       # Example policy definition
├── tests/                  # Pytest test suite (46 tests)
├── docs/                   # Architecture, mapping, audit model, test results
├── .github/workflows/ci.yml
├── Dockerfile
└── pyproject.toml

Installation

git clone https://github.com/reachsunilmudusu-rgb/zero-trust-data-pipeline-framework.git
cd zero-trust-data-pipeline-framework

python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install -e ".[dev]"

Quick start

from ztdp.ingestion import ingest_file
from ztdp.validation import validate
from ztdp.policy_engine import load_policies, enforce
from ztdp.lineage import LineageTracker
from ztdp.audit import AuditLogger
from ztdp.trust_score import calculate_trust_score

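# Stage 1: load the file (checksum and format guard run before parsing)
ingestion   = ingest_file("examples/sample_input.csv")
# Stage 2: check nulls, duplicates, and required fields
validation  = validate(ingestion.records, required_fields=["patient_id", "admission_date"])
# Stage 3: evaluate the declarative YAML rules against the earlier results
policy      = load_policies("examples/policies.yaml")
pol_result  = enforce(ingestion.records, ingestion, validation, policy)

# Stage 4: record where the dataset came from and which steps were applied
tracker = LineageTracker()
tracker.record(ingestion.dataset_id, ingestion.source_path, ingestion.ingested_at,
               transformation_steps=["validate", "policy_check"])

# Stage 5: append an audit event for the ingest action
logger = AuditLogger()
logger.log("pipeline", "ingest", ingestion.dataset_id, "success", f"{ingestion.row_count} rows loaded")

# Stage 6: aggregate all stage results into the 0–100 trust score
trust = calculate_trust_score(ingestion, validation, pol_result,
                              lineage_recorded=True, audit_recorded=True)
print(trust.summary)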
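
To make the idea of an append-only, JSONL-exportable audit log concrete, here is a minimal sketch. It is not the AuditLogger from src/ztdp/audit.py, whose storage and event schema are documented in docs/; the class name JsonlAuditLog, the field names (stage, action, status, detail), and the choice to write straight to a JSONL file are assumptions. The five positional arguments mirror the order used in the logger.log call above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class JsonlAuditLog:
    """Illustrative append-only log: one JSON object per line, never rewritten."""

    def __init__(self, path):
        self.path = Path(path)

    def log(self, stage, action, dataset_id, status, detail=""):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "stage": stage,
            "action": action,
            "dataset_id": dataset_id,
            "status": status,
            "detail": detail,
        }
        # Mode "a" always writes at the end of the file; existing lines are untouched.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event, sort_keys=True) + "\n")
        return event

    def events_for(self, dataset_id):
        """Return all logged events for one dataset, in the order they were written."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            events = [json.loads(line) for line in fh if line.strip()]
        return [e for e in events if e["dataset_id"] == dataset_id]
```

Because each event is a self-contained line, the file can be tailed, shipped to a log collector, or filtered per dataset without loading a database.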

Run the full end-to-end pipeline:

python examples/sample_pipeline.py

Expected output:

============================================================
Zero-Trust Data Pipeline — Verification Run
============================================================

[1] Ingesting file ...
    dataset_id : <uuid>
    rows       : 15
    checksum   : <sha256>...

[2] Validating data ...
    valid      : True
    null %     : 0.0%
    duplicates : 0

[3] Enforcing policies ...
    [PASS] required_columns: All required columns present
    [PASS] pii_columns: PII columns detected — flag for downstream masking: ['patient_id', 'age']
    [PASS] max_null_percentage: Null percentage 0.0% within limit of 10.0%
    [PASS] allowed_file_types: File type 'csv' is allowed
    [PASS] checksum_required: Checksum present
    [PASS] max_duplicate_percentage: Duplicate rate 0.0% within limit of 5.0%

[4] Recording lineage ...
    lineage_id : <uuid>

[5] Calculating trust score ...
    Trust score 100/100 (grade A) — checksum 20/20, validation 25/25, policy 25/25, lineage 20/20, audit 10/10

============================================================
Verification Summary
============================================================
  Dataset ID     : <uuid>
  Source         : sample_input.csv
  Rows ingested  : 15
  Validation     : PASSED
  Policy         : PASSED
  Trust Score    : 100/100 (Grade A)
  Audit events   : 4
============================================================
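
The score line in the output above implies per-stage maximums of 20 (checksum), 25 (validation), 25 (policy), 20 (lineage), and 10 (audit), which sum to 100. The sketch below aggregates those weights; it is not the code in src/ztdp/trust_score.py. The function name sketch_trust_score and the all-or-nothing credit per stage are assumptions (the actual module may award partial credit), and letter-grade thresholds are omitted because they are not stated here.

```python
# Maximum points per stage, read off the sample output above.
WEIGHTS = {"checksum": 20, "validation": 25, "policy": 25, "lineage": 20, "audit": 10}


def sketch_trust_score(passed):
    """Aggregate stage outcomes into a 0-100 score.

    `passed` maps a stage name to True/False. A missing stage counts as failed.
    Assumption: each stage earns either its full weight or nothing.
    """
    unknown = set(passed) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    breakdown = {
        name: (weight if passed.get(name, False) else 0)
        for name, weight in WEIGHTS.items()
    }
    return sum(breakdown.values()), breakdown


score, parts = sketch_trust_score(
    {"checksum": True, "validation": True, "policy": True, "lineage": False, "audit": True}
)
print(score, parts["lineage"])  # 80 0
```

Under this all-or-nothing assumption, a run that skips lineage recording drops to 80, still above the threshold of 75 used in the verification checklist, while additionally failing validation would bring it to 55.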

Running tests

pytest -q

The suite contains 46 tests that cover every module with both positive and negative cases. See docs/test_results.md for the full expected output.


Docker

docker build -t ztdp .
docker run --rm ztdp

Verification checklist

  • File accepted by ingestion (no IngestionError)
  • Checksum generated and non-empty
  • ValidationResult.is_valid == True
  • Null percentage within policy limit
  • No unexpected duplicate rows
  • PolicyResult.passed == True
  • All required columns present
  • PII columns flagged for masking
  • LineageRecord written to store
  • Audit log has ≥ 1 event for the dataset
  • TrustScoreResult.score >= 75
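
The checklist above can also be evaluated programmatically, for example as a gate in CI. The following is a sketch only: RunSnapshot is a hypothetical flat summary, not a class from the repository, and its field names are assumptions. In practice the values would be copied from the ingestion, validation, policy, lineage, audit, and trust-score results.

```python
from dataclasses import dataclass


@dataclass
class RunSnapshot:
    """Hypothetical flat summary of one pipeline run (not a repository class)."""
    checksum: str
    is_valid: bool
    null_percentage: float
    null_limit: float
    duplicate_rows: int
    policy_passed: bool
    lineage_recorded: bool
    audit_event_count: int
    trust_score: int


def failed_checks(run, min_score=75):
    """Return the names of checklist items that the run does not satisfy."""
    checks = [
        ("checksum generated and non-empty", bool(run.checksum)),
        ("validation passed", run.is_valid),
        ("null percentage within policy limit", run.null_percentage <= run.null_limit),
        ("no unexpected duplicate rows", run.duplicate_rows == 0),
        ("policy passed", run.policy_passed),
        ("lineage record written", run.lineage_recorded),
        ("at least one audit event", run.audit_event_count >= 1),
        (f"trust score >= {min_score}", run.trust_score >= min_score),
    ]
    return [name for name, ok in checks if not ok]
```

An empty list means the run passes; otherwise the returned names say which items to investigate.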

Documentation

Document              Description
Architecture          Module layout and data flow
Framework Mapping     Paper concept → implementation module
Verification Process  What constitutes a passing pipeline run
Audit Model           Event schema and query examples
Test Results          Expected pytest output and pipeline run

Citation

Mudusu, S. K., & Gentyala, S. (2026). Zero-Trust Data Pipelines for AI Systems:
A Framework for Secure, Verifiable, and Auditable Data Engineering.
Journal of Recent Trends in Computer Science and Engineering, 14(2), 10–25.

About

Reference implementation of Zero-Trust Data Pipelines for AI Systems — checksum validation, YAML policy enforcement, trust scoring, lineage tracking, and append-only audit logs. Companion code for Mudusu & Gentyala (2026), DOI: 10.70589/JRTCSE.2026.14.2.2.
