55 changes: 55 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,61 @@ Changelog for dcm-anon. Format follows [Keep a Changelog](https://keepachangelog

---

## [0.7.0] - 2026-06-07

Minor release. Closes the largest functional gap for the declared audience
(longitudinal multi-hospital cohorts) and hardens two safety-critical paths
where residual risk could previously slip through: the keyword-only face gate
and the verifier's accept-set. The guiding principle is unchanged: every
easy-to-verify claim must be exactly true, because it is what earns trust in
the hard-to-verify ones.

### Added
- **Deterministic per-patient date shifting (`--date-shift`).** PS3.15 Retain
Longitudinal Temporal Information — Modified Dates Option (CID 7050 code
113107). Every DA/DT value (recursing into sequences) is moved by a single
per-patient offset derived by HMAC-SHA256 from `--salt`, so a patient's
studies stay exactly the same number of days apart while the absolute calendar
position is hidden — the property a longitudinal cohort needs and that date
*removal* destroyed. Requires `--salt`; window is `--date-shift-max-days`
(default ±365). Provenance code 113107 is written so the choice is auditable.
Explicitly **not** HIPAA Safe Harbor (which mandates date removal); for use
under a GDPR/pseudonymous regime only. New module `dcm_anon/dateshift.py`;
public API `shift_dates`.
- The independent verifier learns about retained dates (`dates_retained=`): when
`--date-shift` ran it no longer flags the intentionally-shifted dates, while
still catching every non-date identifier — independence over names / IDs /
institutions is unchanged.
- `tests/test_dateshift.py` and new safety-gate / verifier cases.
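
The derivation inside `dcm_anon` is not shown in this changelog, so the following is only a minimal sketch of how a deterministic per-patient offset in `[-N, +N]` could be derived with HMAC-SHA256. The helper name `offset_days`, the use of the patient ID as the HMAC message, and the modulo reduction are illustrative assumptions, not the tool's confirmed implementation.

```python
import hashlib
import hmac
from datetime import date, timedelta


def offset_days(salt: str, patient_id: str, max_days: int = 365) -> int:
    """Hypothetical helper: map (salt, patient) to a stable offset in [-max_days, +max_days]."""
    digest = hmac.new(salt.encode("utf-8"), patient_id.encode("utf-8"), hashlib.sha256).digest()
    # Read the first 8 bytes as an unsigned integer and fold it into the window.
    # The modulo bias is negligible for a 64-bit value and a window this small.
    span = 2 * max_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_days


# Two visits of the same (made-up) patient, 90 days apart.
visit_1, visit_2 = date(2024, 1, 10), date(2024, 4, 9)
shift = timedelta(days=offset_days("cohort-A-2024", "PAT-001"))

# The same salt and patient always give the same offset, so the interval survives.
assert (visit_2 + shift) - (visit_1 + shift) == visit_2 - visit_1
```

Because the offset depends only on the salt and the patient, every file of that patient moves by the same amount in every run, which is why the option cannot work without `--salt`.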

### Changed
- **Recognizable-face gate is now genuinely fail-closed.** It previously fired
only when an *English* keyword (HEAD/BRAIN/…) matched the description, so a
cranial CT/MR with a blank, coded, or non-English description passed — a false
negative in a safety gate. It now fires on any face-capable modality
(CT/MR/PET/NM) **unless** there is positive evidence of a non-cranial body
part (an accent-normalised, multilingual EN/ES/FR/DE/IT/PT match). Ambiguity
resolves to risk. Clear it as before with `--accept-face-risk`.
- **Independent verifier accept-set tightened.** Free-text words the tool never
emits (`ANONYMOUS`, `REMOVED`) were dropped from the "clean value" set: they
could only ever mask a real residual that happened to read that way. The set
is now restricted to the exact placeholders `dcm_anon` writes plus
structurally-empty values, so the scan errs toward flagging, not toward a
false green.
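
The fail-closed rule for the face gate can be sketched as follows. This is an illustration under stated assumptions: the function names, the tiny vocabulary, and the whitespace tokenisation are hypothetical, and the real multilingual term list is not shown in this changelog. The modality codes are copied as the changelog writes them.

```python
from __future__ import annotations

import unicodedata

FACE_CAPABLE_MODALITIES = {"CT", "MR", "PET", "NM"}  # as listed in the changelog entry

# Illustrative, deliberately tiny vocabulary of non-cranial body parts (assumption).
NON_CRANIAL_TERMS = {
    "abdomen", "abdominal", "chest", "thorax", "torax", "pelvis", "knee",
    "rodilla", "genou", "knie", "ginocchio", "joelho",
}


def normalise(text: str) -> str:
    """Lower-case and strip accents so 'TÓRAX' and 'torax' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def face_gate_fires(modality: str, description: str | None) -> bool:
    """Return True when the study must be treated as possibly showing a face."""
    if modality.upper() not in FACE_CAPABLE_MODALITIES:
        return False
    words = normalise(description or "").replace("/", " ").split()
    # Fail closed: only positive evidence of a non-cranial body part clears the gate.
    return not any(word in NON_CRANIAL_TERMS for word in words)


assert face_gate_fires("CT", "") is True                      # blank description
assert face_gate_fires("MR", "Cráneo sin contraste") is True  # non-English head scan
assert face_gate_fires("CT", "TÓRAX con contraste") is False  # positive non-cranial evidence
```

The design choice is the direction of the default: the old rule needed a keyword to raise the alarm, whereas this one needs a keyword to lower it.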

### Fixed
- **README test count is now exact and CI-enforced.** The stale "197 tests"
claim is replaced with the real test-function count, and
`tests/test_version_coherence.py` fails the build if the README number ever
drifts from the suite again. The honesty pitch starts with the easy number.
- README opening now states plainly what the tool does and does **not** buy
(it does the technical de-identification and the auditable evidence of it; it
does not establish your Art. 9(2) lawful basis — no tool can), removing the
apparent contradiction between the CNIL framing and the disclaimer.
- Landing page no longer quotes firm prices for a managed tier that is not yet
purchasable; pricing will be published when the tier and its DPA ship.

---

## [0.6.1] - 2026-06-01

Patch release. Compliance-citation correctness, independent-verifier test
7 changes: 2 additions & 5 deletions CITATION.cff
@@ -10,15 +10,12 @@ authors:
repository-code: "https://github.com/Ces107/dcm-anon"
url: "https://github.com/Ces107/dcm-anon"
license: MIT
version: "0.4.0"
date-released: "2026-05-19"
version: "0.7.0"
date-released: "2026-06-07"
identifiers:
- type: doi
value: "10.5281/zenodo.20267651"
description: "Concept DOI (always resolves to latest version)"
- type: doi
value: "10.5281/zenodo.20282264"
description: "Version-specific DOI for v0.4.0"
keywords:
- DICOM
- medical-imaging
43 changes: 34 additions & 9 deletions README.md
@@ -9,7 +9,9 @@ Deny-by-default DICOM de-identifier (PS3.15 Basic Profile + Structured Report sc
[![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)](pyproject.toml)
[![Manifest format](https://img.shields.io/badge/manifest_format-v1.2-blueviolet)](dcm_anon/manifest.py)

`dcm-anon` is a **deny-by-default** DICOM de-identifier: it removes every private/vendor tag and blanks any unrecognised person-name by default (not just a fixed allowlist), scrubs the Structured Report content tree, writes the PS3.15 provenance attributes the standard mandates, and **fails closed** — refusing to certify clean the channels a header tool cannot clear (burned-in pixels, recognisable faces in head CT/MR, encapsulated PDF/CDA). Every run emits a verbatim-cited, hash-chained compliance manifest verifiable by an independent post-run scan. It is built around the gap CNIL fined 800,000 EUR for in 2024 — a false anonymisation claim — so it never claims more than it did.
`dcm-anon` is a **deny-by-default** DICOM de-identifier: it removes every private/vendor tag and blanks any unrecognised person-name by default (not just a fixed allowlist), scrubs the Structured Report content tree, writes the PS3.15 provenance attributes the standard mandates, and **fails closed** — refusing to certify clean the channels a header tool cannot clear (burned-in pixels, recognisable faces in head CT/MR, encapsulated PDF/CDA). Every run emits a verbatim-cited, hash-chained compliance manifest verifiable by an independent post-run scan.

**What this does and does not buy you.** The CNIL/Cegedim €800K decision had two halves: a *technical* de-identification step (which was fine) and an *upstream legal* gap — no Art. 9(2) lawful basis, with the pseudonymisation treated as if it removed that obligation. `dcm-anon` is an engineering tool: it does the technical step correctly and produces the **auditable, machine-verifiable evidence** that it did — which action ran on which tag, under which clause, with an independent residual scan and a tamper-evident hash chain. It does **not** establish your Art. 9(2) lawful basis or write your DPIA; that is the controller's job and no tool can do it for you. What it does is make that division of responsibility explicit in the output (a `pseudonymous`-not-`anonymous` label plus an Art. 9 disclosure) so the legal gap is surfaced, not papered over. If you are looking for software that makes a lawful basis unnecessary, no such software exists — and a tool that claimed to be it would be making exactly the false-assurance error that drew the fine.
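
To make "tamper-evident hash chain" concrete, here is a generic sketch of the idea. It is not the `dcm-anon` manifest format, which is not described in this section: the record fields, the genesis value, and both function names are assumptions chosen for illustration.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # arbitrary starting value (assumption)


def chain_entries(entries: list[dict]) -> list[dict]:
    """Link each entry to its predecessor by hashing (previous hash + canonical JSON)."""
    chained, prev = [], GENESIS
    for entry in entries:
        payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        chained.append({"entry": entry, "prev": prev, "hash": digest})
        prev = digest
    return chained


def chain_is_intact(chained: list[dict]) -> bool:
    """Recompute every link; an edited or reordered entry breaks the chain.

    Truncation at the tail is not detectable from the chain alone and would
    need the final hash to be stored separately.
    """
    prev = GENESIS
    for link in chained:
        payload = json.dumps(link["entry"], sort_keys=True, separators=(",", ":"))
        expected = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        if link["prev"] != prev or link["hash"] != expected:
            return False
        prev = expected
    return True


log = chain_entries([
    {"tag": "0010,0010", "action": "REPLACE"},
    {"tag": "0008,0020", "action": "SHIFT(DA)"},
])
assert chain_is_intact(log)

log[0]["entry"]["action"] = "KEEP"  # tamper with the first record
assert not chain_is_intact(log)
```

The point for an auditor is that verification needs only the log itself and a hash function, not trust in the tool that wrote it.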

> **Self-hostable companion.** [**dcm-anon-vault**](https://github.com/Ces107/dcm-anon-vault) wraps the same engine as a multi-user REST API with persisted SHA-256 audit retention and deterministic UID re-mapping for longitudinal cohort linkage. It ships with a Dockerfile, `docker-compose.yml` and `fly.toml`, so you can deploy it yourself on Fly.io or your own VPS today (see [`docs/deploy.md`](https://github.com/Ces107/dcm-anon-vault/blob/main/docs/deploy.md)). A managed hosted tier with a DPA on file for EU hospital procurement is in preparation — register for [early access](https://ces107.github.io/dcm-anon/#early-access).

Expand Down Expand Up @@ -99,6 +101,10 @@ dcm-anon /data/study_0001 /data/anon/study_0001
# Deterministic UIDs — same salt + same source = same output every run
dcm-anon /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024

# Longitudinal cohort: shift dates per-patient instead of removing them
# (preserves inter-study intervals; requires --salt; GDPR/pseudonymous, NOT Safe Harbor)
dcm-anon /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024 --date-shift

# Preview without writing files (audit log still emitted)
dcm-anon /data/study_0001 /data/anon/study_0001 --dry-run

@@ -334,7 +340,14 @@ exit code 3) rather than pretending:
- **Recognisable faces are not removed.** A head CT/MR/PET reconstructs an
identifiable face (Schwarz, NEJM 2019); metadata de-identification is
insufficient. dcm-anon flags and fails closed — deface externally
(pydeface / mri_reface), then pass `--accept-face-risk`.
(pydeface / mri_reface), then pass `--accept-face-risk`. The gate is
genuinely fail-closed: on a face-capable modality (CT/MR/PET/NM) it fires
unless there is positive evidence of a **non-cranial** body part (an
accent-normalised, multilingual EN/ES/FR/DE/IT/PT match). A head scan with a
blank, coded, or non-English description is treated as a face, not waved
through; that is the false negative the previous keyword-only rule allowed. The
conservative cost (an abdominal CT with no usable body-part description also
fails closed) is cleared with `--accept-face-risk`.
- **Encapsulated PDF/CDA interiors are not cleaned.** The opaque byte stream is
not inspected; the object is quarantined unless `--allow-encapsulated`.
- **SR free-text names are best-effort.** `--scrub-sr` redacts regex/blacklist
Expand All @@ -344,8 +357,17 @@ exit code 3) rather than pretending:
reversible by the salt-holder (a secret key, HMAC-keyed); without a salt,
distinct patients collapse to one pseudonym (cohort separation lost). See
`SECURITY.md`.
- **Dates are removed, not shifted.** A consistent per-patient date-shift option
(PS3.15 Retain Longitudinal Modified Dates) is not yet implemented.
- **Dates: removed by default, or shifted with `--date-shift`.** By default
every date is removed. For longitudinal cohorts, `--date-shift` (PS3.15 Retain
Longitudinal Modified Dates, CID 7050 code 113107) instead moves every date by
a single deterministic per-patient offset derived from `--salt`, so a
patient's studies stay the same number of days apart while their absolute
calendar position is hidden. It **requires `--salt`** (the offset must be
reproducible across files and runs) and is **not HIPAA Safe Harbor**, which
mandates date removal — use it only under a GDPR/pseudonymous regime. The
output's independent verifier is told dates were intentionally retained, so it
no longer flags them while still catching every non-date identifier; the
provenance code 113107 is written so the choice is auditable.
- **Tag table is pinned to a DICOM edition.** Tags introduced in a later edition
are caught only by the deny-by-default identifier-VR sweep, not by name.
- **No DICOMDIR update, no DICOM network (C-STORE/Q-R), no GUI.** Regenerate the
Expand All @@ -359,13 +381,16 @@ exit code 3) rather than pretending:
pytest tests/ -v --cov=dcm_anon --cov-report=term-missing
```

197 tests, coverage ≥80% gated in CI. Suite covers: deny-by-default private +
217 test functions (run `pytest tests/ -q`; the exact count is enforced against
this README by `tests/test_version_coherence.py`, so the number cannot silently
drift). Coverage ≥80% gated in CI. Suite covers: deny-by-default private +
unknown-PN removal (adversarial leak fixtures), file-meta AE scrub, multi-valued
UID remap, PS3.15 provenance attributes, fail-closed safety gates (burned-in /
face / encapsulated / SR), SR content-tree scrubbing, false-green verification
paths (vacuous pass, dependency failure, dry-run), HMAC UID + per-patient
pseudonyms, manifest SHA-chain integrity and tamper detection, and the public
completeness proof.
multilingual fail-closed face / encapsulated / SR), deterministic per-patient
date-shift (interval preservation), SR content-tree scrubbing, false-green
verification paths (vacuous pass, dependency failure, dry-run), HMAC UID +
per-patient pseudonyms, manifest SHA-chain integrity and tamper detection, and
the public completeness proof.

## Completeness proof (run it yourself)

2 changes: 2 additions & 0 deletions dcm_anon/__init__.py
@@ -13,6 +13,7 @@
render_markdown_report,
)
from dcm_anon.cli import main
from dcm_anon.dateshift import shift_dates
from dcm_anon.manifest import (
ComplianceManifest,
build_manifest,
@@ -57,5 +58,6 @@
"render_markdown",
"render_markdown_report",
"scan_outputs",
"shift_dates",
"verify_manifest",
]
2 changes: 1 addition & 1 deletion dcm_anon/_version.py
@@ -7,4 +7,4 @@
installs and silently mislabelled which code produced a compliance manifest.
"""

__version__ = "0.6.1"
__version__ = "0.7.0"
26 changes: 26 additions & 0 deletions dcm_anon/cli.py
@@ -70,6 +70,18 @@ def build_arg_parser(version: str) -> argparse.ArgumentParser:
help="Retain ALL private (odd-group) tags. NOT recommended: "
"PS3.15 mandates their removal and vendors store PHI there. "
"Default removes every private element.")
# Longitudinal date handling. Default removes dates; --date-shift retains the
# time axis by moving every date a deterministic per-patient offset instead.
parser.add_argument("--date-shift", action="store_true",
help="Shift every date by a deterministic per-patient offset "
"instead of removing it (PS3.15 Retain Modified Dates, "
"113107). Preserves inter-study intervals for longitudinal "
"cohorts. REQUIRES --salt. NOT HIPAA Safe Harbor (which "
"mandates date removal) — use only under a GDPR/pseudonymous "
"regime.")
parser.add_argument("--date-shift-max-days", type=int, default=365, metavar="N",
help="Half-width of the per-patient date-shift window in days "
"(offset drawn from [-N, +N]; default 365).")
# Fail-closed waivers — the tool refuses to certify clean when it detects a
# channel it cannot clear. Each flag is an explicit, logged acknowledgement.
parser.add_argument("--allow-burned-in", action="store_true",
@@ -247,6 +259,17 @@ def _run_verify_mode(args: argparse.Namespace) -> int:
def _validate_anonymize_args(args: argparse.Namespace) -> str | None:
    if args.src is None or args.dst is None:
        return "src and dst are required unless --verify-manifest is used"
    if args.date_shift and not args.salt:
        return ("--date-shift requires --salt: the per-patient date offset must be "
                "deterministic so a patient's studies stay aligned across runs")
    if args.date_shift and args.date_shift_max_days <= 0:
        return "--date-shift-max-days must be a positive integer"
    if args.date_shift and args.manifest_mode == "hipaa":
        return ("--date-shift retains (shifted) dates, which HIPAA Safe Harbor "
                "forbids (§164.514(b)(2)(i)(C) requires removing dates more "
                "specific than the year). Emitting a Safe-Harbor manifest over "
                "retained dates would be a false compliance claim. Use "
                "--manifest-mode gdpr, or drop --date-shift for HIPAA output.")
    return None


@@ -274,6 +297,8 @@ def _run_anonymize_mode(args: argparse.Namespace) -> int:
allow_sr=args.allow_sr,
scrub_sr=args.scrub_sr,
sr_profile=args.sr_profile,
date_shift=args.date_shift,
date_shift_max_days=args.date_shift_max_days,
progress_cb=_build_progress_cb(total, quiet=args.quiet),
)

@@ -302,6 +327,7 @@ def _run_anonymize_mode(args: argparse.Namespace) -> int:
sample_size=args.verify_output_sample,
pixel_ocr=args.verify_output_pixel_ocr,
strict_ocr=not args.no_strict_ocr,
dates_retained=args.date_shift,
)
manifest_obj = build_manifest(
summary,
107 changes: 107 additions & 0 deletions dcm_anon/dateshift.py
@@ -0,0 +1,107 @@
"""Consistent per-patient date shifting (PS3.15 Retain Longitudinal Temporal
Information — Modified Dates Option, CID 7050 code 113107).

The default profile *removes* dates, which destroys the time axis a longitudinal
or multi-visit cohort is built on. This module instead **shifts** every DA/DT
value by a single, deterministic, per-patient offset (derived by
:meth:`UIDMapper.date_offset_days` from the salt), so that:

- intervals between a patient's studies are preserved exactly (visit 2 minus
visit 1 is unchanged), and
- the absolute calendar position is moved by an amount no one can recover
without the salt.

This is a GDPR-oriented pseudonymisation control. It is deliberately NOT applied
under the default (date-removing) profile, and it is incompatible with HIPAA
Safe Harbor, which mandates removal of dates more specific than the year
(§164.514(b)(2)(i)(C)) — the CLI/manifest say so rather than over-claiming.

The shift operates by VR (DA, DT), recursing into sequences, so it reaches
date elements the flat tag table does not enumerate (deny-by-default for dates,
the same posture the rest of the pipeline takes).
"""
from __future__ import annotations

from datetime import date, timedelta

from pydicom.dataset import Dataset

# VRs that carry a calendar date we can shift. TM (time-of-day) is NOT shifted:
# a wall-clock time is not on the longitudinal axis and shifting it buys nothing.
_DATE_VR = "DA"
_DATETIME_VR = "DT"


def _shift_da(value: str, offset_days: int) -> str | None:
    """Shift a DICOM DA value (``YYYYMMDD``). Return the shifted string, or
    ``None`` if the value is empty / not a parseable date (left untouched)."""
    text = value.strip()
    if len(text) < 8 or not text[:8].isdigit():
        return None
    try:
        base = date(int(text[:4]), int(text[4:6]), int(text[6:8]))
    except ValueError:
        return None
    shifted = base + timedelta(days=offset_days)
    return f"{shifted.year:04d}{shifted.month:02d}{shifted.day:02d}"


def _shift_dt(value: str, offset_days: int) -> str | None:
    """Shift the date portion of a DICOM DT value (``YYYYMMDD`` prefix plus an
    optional ``HHMMSS.FFFFFF&ZZXX`` tail). The time/timezone tail is preserved
    verbatim; only the calendar date moves."""
    text = value.strip()
    shifted_date = _shift_da(text[:8], offset_days)
    if shifted_date is None:
        return None
    return shifted_date + text[8:]


def _shift_element_value(value: object, vr: str, offset_days: int) -> object | None:
    """Shift a single element value (handling multi-valued DA/DT). Return the new
    value, or ``None`` if nothing parseable was found (caller leaves it as-is)."""
    shifter = _shift_da if vr == _DATE_VR else _shift_dt

    if isinstance(value, (list, tuple)):
        out: list[str] = []
        changed = False
        for member in value:
            new = shifter(str(member), offset_days)
            if new is None:
                out.append(str(member))
            else:
                out.append(new)
                changed = True
        return out if changed else None

    return shifter(str(value), offset_days)


def shift_dates(ds: Dataset, offset_days: int) -> list[str]:
    """Shift every DA/DT element in *ds* (recursing into sequences) by
    *offset_days*. In-place. Returns audit entries in the
    ``"GGGG,EEEE:SHIFT(VR)"`` form used by ``pipeline._format_touch``.

    A zero offset still rewrites the values to themselves and is logged, so the
    audit shows the option ran (a per-patient offset of 0 is a legitimate draw).
    """
    touched: list[str] = []
    for elem in ds:
        if elem.VR == "SQ" and elem.value:
            for item in elem.value:
                if isinstance(item, Dataset):
                    touched.extend(shift_dates(item, offset_days))
            continue
        if elem.VR not in (_DATE_VR, _DATETIME_VR):
            continue
        if elem.value in (None, ""):
            continue
        new_value = _shift_element_value(elem.value, elem.VR, offset_days)
        if new_value is None:
            continue
        elem.value = new_value
        touched.append(f"{elem.tag.group:04X},{elem.tag.element:04X}:SHIFT({elem.VR})")
    return touched


__all__ = ["shift_dates"]
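
To make the behaviour of the two string helpers concrete, here is a standalone sketch that mirrors their logic using only the standard library (no pydicom). `shift_da` and `shift_dt` are local stand-ins written for this example, and the sample values are invented.

```python
from __future__ import annotations

from datetime import date, timedelta


def shift_da(value: str, offset_days: int) -> str | None:
    """Stand-in for the DA helper: shift YYYYMMDD, or return None if unparseable."""
    text = value.strip()
    if len(text) < 8 or not text[:8].isdigit():
        return None
    try:
        base = date(int(text[:4]), int(text[4:6]), int(text[6:8]))
    except ValueError:
        return None
    shifted = base + timedelta(days=offset_days)
    return f"{shifted.year:04d}{shifted.month:02d}{shifted.day:02d}"


def shift_dt(value: str, offset_days: int) -> str | None:
    """Stand-in for the DT helper: move the date prefix, keep the time tail verbatim."""
    text = value.strip()
    shifted_date = shift_da(text[:8], offset_days)
    return None if shifted_date is None else shifted_date + text[8:]


# Leap day plus 365 days lands on 28 February of the next year.
assert shift_da("20240229", 365) == "20250228"

# Only the calendar date moves; time of day and UTC offset are untouched.
assert shift_dt("20240115093000.000000+0100", -30) == "20231216093000.000000+0100"

# Unparseable or reduced-precision values come back as None (left as-is).
assert shift_da("20241301", 10) is None
assert shift_dt("202406", 10) is None
```

The last assertion is worth noting when reading the module: with this logic a reduced-precision DT such as `202406` is left unshifted rather than moved, so whether such values need separate handling depends on the rest of the pipeline, which this diff does not show.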