Reference implementation of the framework described in:
Mudusu, S. K., & Gentyala, S. (2026). Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering. Journal of Recent Trends in Computer Science and Engineering, 14(2), 10–25. https://jrtcse.com/index.php/home/article/view/JRTCSE.2026.14.2.2/JRTCSE.2026.14.2.2
The paper proposes a zero-trust approach to data pipelines feeding AI systems — where no data source is implicitly trusted, every record must pass verifiable quality gates, and all pipeline actions are logged for audit. This repository translates those concepts into working Python code.
Concretely:
- Secure ingestion — checksum every input file before parsing; reject unknown extensions and oversized files
- Data validation — detect nulls, duplicates, missing required fields, and invalid date formats
- Policy enforcement — evaluate declarative YAML rules (required columns, PII detection, null limits, file type allowlists)
- Lineage tracking — record source, timestamp, and transformation steps in SQLite
- Audit logging — append-only event log for every pipeline action, exportable to JSONL
- Trust scoring — aggregate all stage results into a 0–100 AI-readiness score with letter grade
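The secure-ingestion bullet above can be sketched roughly as follows. This is a minimal illustration, not the repository's `ingest_file`: the function name `guard_and_checksum`, the extension allowlist, and the 10 MB size limit are assumptions made for the example. SHA-256 is assumed because the sample pipeline output labels the checksum as `<sha256>`.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical defaults for this sketch; the repository's real limits may differ.
ALLOWED_EXTENSIONS = {".csv", ".json"}
MAX_BYTES = 10 * 1024 * 1024


def guard_and_checksum(path: str) -> str:
    """Reject disallowed files, then return the SHA-256 hex digest of the contents."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {p.suffix!r}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} bytes")
    digest = hashlib.sha256()
    with p.open("rb") as fh:
        # Hash in chunks so a large file is never held in memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file, so the sketch does not depend on repo data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"patient_id,admission_date\n1,2024-01-01\n")
    print(guard_and_checksum(str(sample)))
```

Both guards run before any parsing, so a rejected file is never read past its metadata.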
zero-trust-data-pipeline-framework/
├── src/ztdp/
│   ├── ingestion.py        # File loading, checksum, format guard
│   ├── validation.py       # Null counts, duplicates, field checks
│   ├── policy_engine.py    # YAML policy loader and rule evaluator
│   ├── lineage.py          # SQLite-backed lineage tracker
│   ├── audit.py            # Append-only audit logger
│   ├── trust_score.py      # 0–100 weighted trust score
│   ├── config.py           # Configuration dataclasses
│   └── exceptions.py       # Typed exceptions per stage
├── examples/
│   ├── sample_pipeline.py  # End-to-end demonstration
│   ├── sample_input.csv    # 15-row healthcare sample dataset
│   └── policies.yaml       # Example policy definition
├── tests/                  # Pytest test suite (46 tests)
├── docs/                   # Architecture, mapping, audit model, test results
├── .github/workflows/ci.yml
├── Dockerfile
└── pyproject.toml
git clone https://github.com/reachsunilmudusu-rgb/zero-trust-data-pipeline-framework.git
cd zero-trust-data-pipeline-framework
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

from ztdp.ingestion import ingest_file
from ztdp.validation import validate
from ztdp.policy_engine import load_policies, enforce
from ztdp.lineage import LineageTracker
from ztdp.audit import AuditLogger
from ztdp.trust_score import calculate_trust_score
ingestion = ingest_file("examples/sample_input.csv")
validation = validate(ingestion.records, required_fields=["patient_id", "admission_date"])
policy = load_policies("examples/policies.yaml")
pol_result = enforce(ingestion.records, ingestion, validation, policy)
tracker = LineageTracker()
tracker.record(ingestion.dataset_id, ingestion.source_path, ingestion.ingested_at,
               transformation_steps=["validate", "policy_check"])
logger = AuditLogger()
logger.log("pipeline", "ingest", ingestion.dataset_id, "success", f"{ingestion.row_count} rows loaded")
trust = calculate_trust_score(ingestion, validation, pol_result,
                              lineage_recorded=True, audit_recorded=True)
print(trust.summary)

Run the full end-to-end pipeline:

python examples/sample_pipeline.py

Expected output:
============================================================
Zero-Trust Data Pipeline — Verification Run
============================================================
[1] Ingesting file ...
dataset_id : <uuid>
rows : 15
checksum : <sha256>...
[2] Validating data ...
valid : True
null % : 0.0%
duplicates : 0
[3] Enforcing policies ...
[PASS] required_columns: All required columns present
[PASS] pii_columns: PII columns detected — flag for downstream masking: ['patient_id', 'age']
[PASS] max_null_percentage: Null percentage 0.0% within limit of 10.0%
[PASS] allowed_file_types: File type 'csv' is allowed
[PASS] checksum_required: Checksum present
[PASS] max_duplicate_percentage: Duplicate rate 0.0% within limit of 5.0%
[4] Recording lineage ...
lineage_id : <uuid>
[5] Calculating trust score ...
Trust score 100/100 (grade A) — checksum 20/20, validation 25/25, policy 25/25, lineage 20/20, audit 10/10
============================================================
Verification Summary
============================================================
Dataset ID : <uuid>
Source : sample_input.csv
Rows ingested : 15
Validation : PASSED
Policy : PASSED
Trust Score : 100/100 (Grade A)
Audit events : 4
============================================================
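The component breakdown in the sample output (checksum 20, validation 25, policy 25, lineage 20, audit 10) suggests a weighted sum. The sketch below is one way such a score could be computed and is not the repository's `calculate_trust_score`. The all-or-nothing scoring per component and the letter-grade cut-offs are assumptions; only the component maxima come from the output above.

```python
from __future__ import annotations

# Component maxima taken from the sample output line above.
WEIGHTS = {"checksum": 20, "validation": 25, "policy": 25, "lineage": 20, "audit": 10}

# Grade cut-offs are hypothetical; the repository may use different boundaries.
GRADE_FLOORS = [(90, "A"), (75, "B"), (60, "C"), (40, "D")]


def trust_score(passed: dict[str, bool]) -> tuple[int, str]:
    """All-or-nothing scoring: a component earns its full weight only if it passed."""
    score = sum(weight for name, weight in WEIGHTS.items() if passed.get(name, False))
    grade = next((letter for floor, letter in GRADE_FLOORS if score >= floor), "F")
    return score, grade


all_ok = {name: True for name in WEIGHTS}
print(trust_score(all_ok))                        # (100, 'A')
print(trust_score({**all_ok, "lineage": False}))  # (80, 'B')
```

A real implementation would likely award partial credit, for example scaling the validation component by the share of rows that pass.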
pytest -q

The suite contains 46 tests covering all modules with positive and negative cases. See docs/test_results.md for the full expected output.
docker build -t ztdp .
docker run --rm ztdp

- File accepted by ingestion (no `IngestionError`)
- Checksum generated and non-empty
- `ValidationResult.is_valid == True`
- Null percentage within policy limit
- No unexpected duplicate rows
- `PolicyResult.passed == True`
- All required columns present
- PII columns flagged for masking
- `LineageRecord` written to store
- Audit log has ≥ 1 event for the dataset
- `TrustScoreResult.score >= 75`
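The checklist above could be turned into a single programmatic gate along these lines. `RunSummary` and `failed_checks` are hypothetical names introduced for this sketch, and the flat field layout does not mirror the repository's result classes. The 10.0% null limit is taken from the sample policy output, and "no unexpected duplicates" is simplified here to "zero duplicate rows".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical flat view of one pipeline run, for illustration only."""
    checksum: str
    is_valid: bool
    null_pct: float
    duplicate_rows: int
    policy_passed: bool
    lineage_written: bool
    audit_events: int
    trust_score: int


def failed_checks(run: RunSummary, max_null_pct: float = 10.0) -> list[str]:
    """Return the names of the checklist items that the run does not satisfy."""
    checks = {
        "checksum present": bool(run.checksum),
        "validation passed": run.is_valid,
        "nulls within limit": run.null_pct <= max_null_pct,
        "no duplicates": run.duplicate_rows == 0,
        "policy passed": run.policy_passed,
        "lineage written": run.lineage_written,
        "audit event logged": run.audit_events >= 1,
        "trust score >= 75": run.trust_score >= 75,
    }
    return [name for name, ok in checks.items() if not ok]


run = RunSummary("ab12", True, 0.0, 0, True, True, 4, 100)
print(failed_checks(run))  # []
```

Returning the names of failed items, rather than a single boolean, makes it easier to report why a run was rejected.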
| Document | Description |
|---|---|
| Architecture | Module layout and data flow |
| Framework Mapping | Paper concept → implementation module |
| Verification Process | What constitutes a passing pipeline run |
| Audit Model | Event schema and query examples |
| Test Results | Expected pytest output and pipeline run |
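As a rough illustration of the append-only audit log with JSONL export that this README describes, the sketch below keeps events in memory and serializes one JSON object per line. `MiniAuditLog` is a hypothetical class, not the repository's `AuditLogger`, and the field names (`actor`, `action`, `dataset_id`, `status`, `detail`) are assumptions inferred from the argument order of the `logger.log(...)` call in the quick-start code. The actual schema is documented in the Audit Model document.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone


class MiniAuditLog:
    """Append-only in-memory event log: events can be added but never edited."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def log(self, actor: str, action: str, dataset_id: str,
            status: str, detail: str = "") -> None:
        # Field names are assumptions made for this sketch.
        self._events.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "dataset_id": dataset_id,
            "status": status,
            "detail": detail,
        })

    def for_dataset(self, dataset_id: str) -> list[dict]:
        # Return copies so callers cannot mutate stored events.
        return [dict(e) for e in self._events if e["dataset_id"] == dataset_id]

    def to_jsonl(self) -> str:
        # JSONL: one JSON object per line, in insertion order.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._events)


log = MiniAuditLog()
log.log("pipeline", "ingest", "ds-1", "success", "15 rows loaded")
log.log("pipeline", "validate", "ds-1", "success")
print(log.to_jsonl())
```

Exposing only `log`, a copying query method, and an export method is what keeps the log append-only at the API level. A durable implementation would also need a storage layer that prevents rewrites.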