diff --git a/CHANGELOG.md b/CHANGELOG.md
index d2c17f7..fb40819 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,61 @@ Changelog for dcm-anon. Format follows [Keep a Changelog](https://keepachangelog
 
 ---
 
+## [0.7.0] - 2026-06-07
+
+Minor release. Closes the largest functional gap for the declared audience
+(longitudinal multi-hospital cohorts) and hardens two safety-critical paths
+through which a false negative could previously slip. Guiding principle is
+unchanged: every easy-to-verify claim must be exactly true, because it is what
+earns trust in the hard-to-verify ones.
+
+### Added
+- **Deterministic per-patient date shifting (`--date-shift`).** PS3.15 Retain
+  Longitudinal Temporal Information — Modified Dates Option (CID 7050 code
+  113107). Every DA/DT value (recursing into sequences) is moved by a single
+  per-patient offset derived by HMAC-SHA256 from `--salt`, so a patient's
+  studies stay exactly the same number of days apart while the absolute calendar
+  position is hidden — the property a longitudinal cohort needs and that date
+  *removal* destroyed. Requires `--salt`; window is `--date-shift-max-days`
+  (default ±365). Provenance code 113107 is written so the choice is auditable.
+  Explicitly **not** HIPAA Safe Harbor (which mandates date removal); for use
+  under a GDPR/pseudonymous regime only. New module `dcm_anon/dateshift.py`;
+  public API `shift_dates`.
+- The independent verifier learns about retained dates (`dates_retained=`): when
+  `--date-shift` ran it no longer flags the intentionally-shifted dates, while
+  still catching every non-date identifier — independence over names / IDs /
+  institutions is unchanged.
+- `tests/test_dateshift.py` and new safety-gate / verifier cases.
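The offset derivation described in the `--date-shift` entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's code: the helper names `offset_for`, `shift_da` and `as_date`, the UTF-8 encoding of the salt and patient key, and the 16-hex-digit digest prefix are all chosen for the example.

```python
import hashlib
import hmac
from datetime import date, timedelta


def offset_for(salt: str, patient_key: str, max_days: int = 365) -> int:
    """Deterministic offset in [-max_days, +max_days] for one patient."""
    if not salt or max_days <= 0:
        raise ValueError("need a non-empty salt and a positive window")
    digest = hmac.new(salt.encode("utf-8"), patient_key.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    span = 2 * max_days + 1
    # Assumption: a 16-hex-digit prefix; the real prefix length is not shown in this diff.
    return int(digest[:16], 16) % span - max_days


def as_date(value: str) -> date:
    """Parse a DICOM DA string (YYYYMMDD)."""
    return date(int(value[:4]), int(value[4:6]), int(value[6:8]))


def shift_da(value: str, offset_days: int) -> str:
    """Shift a DICOM DA string by offset_days and re-serialise it."""
    moved = as_date(value) + timedelta(days=offset_days)
    return f"{moved.year:04d}{moved.month:02d}{moved.day:02d}"


offset = offset_for("cohort-A-2024", "PATIENT-001")
visit_1, visit_2 = "20240110", "20240324"
shifted_1, shifted_2 = shift_da(visit_1, offset), shift_da(visit_2, offset)

# Both visits move by the same offset, so the gap between them is unchanged.
gap_before = as_date(visit_2) - as_date(visit_1)
gap_after = as_date(shifted_2) - as_date(shifted_1)
print(gap_before == gap_after)
```

Because the offset depends only on the salt and the patient key, re-running with the same salt reproduces the same shifted dates, which is why the option refuses to run without `--salt`.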
+
+### Changed
+- **Recognizable-face gate is now genuinely fail-closed.** It previously fired
+  only when an *English* keyword (HEAD/BRAIN/…) matched the description, so a
+  cranial CT/MR with a blank, coded, or non-English description passed — a false
+  negative in a safety gate. It now fires on any face-capable modality
+  (CT/MR/PET/NM) **unless** there is positive evidence of a non-cranial body
+  part (an accent-normalised, multilingual EN/ES/FR/DE/IT/PT match). Ambiguity
+  resolves to risk. Clear it as before with `--accept-face-risk`.
+- **Independent verifier accept-set tightened.** Free-text words the tool never
+  emits (`ANONYMOUS`, `REMOVED`) were dropped from the "clean value" set: they
+  could only ever mask a real residual that happened to read that way. The set
+  is now restricted to the exact placeholders `dcm_anon` writes plus
+  structurally-empty values, so the scan errs toward flagging, not toward a
+  false green.
+
+### Fixed
+- **README test count is now exact and CI-enforced.** The stale "197 tests"
+  claim is replaced with the real test-function count, and
+  `tests/test_version_coherence.py` fails the build if the README number ever
+  drifts from the suite again. The honesty pitch starts with the easy number.
+- README opening now states plainly what the tool does and does **not** buy
+  (it does the technical de-identification and the auditable evidence of it; it
+  does not establish your Art. 9(2) lawful basis — no tool can), removing the
+  apparent contradiction between the CNIL framing and the disclaimer.
+- Landing page no longer quotes firm prices for a managed tier that is not yet
+  purchasable; pricing will be published when the tier and its DPA ship.
+
+---
+
 ## [0.6.1] - 2026-06-01
 
 Patch release. Compliance-citation correctness, independent-verifier test
diff --git a/CITATION.cff b/CITATION.cff
index 154140a..76ce674 100644
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -10,15 +10,12 @@ authors:
 repository-code: "https://github.com/Ces107/dcm-anon"
 url: "https://github.com/Ces107/dcm-anon"
 license: MIT
-version: "0.4.0"
-date-released: "2026-05-19"
+version: "0.7.0"
+date-released: "2026-06-07"
 identifiers:
   - type: doi
     value: "10.5281/zenodo.20267651"
     description: "Concept DOI (always resolves to latest version)"
-  - type: doi
-    value: "10.5281/zenodo.20282264"
-    description: "Version-specific DOI for v0.4.0"
 keywords:
   - DICOM
   - medical-imaging
diff --git a/README.md b/README.md
index 5570b91..13d333f 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,9 @@ Deny-by-default DICOM de-identifier (PS3.15 Basic Profile + Structured Report sc
 [![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)](pyproject.toml)
 [![Manifest format](https://img.shields.io/badge/manifest_format-v1.2-blueviolet)](dcm_anon/manifest.py)
 
-`dcm-anon` is a **deny-by-default** DICOM de-identifier: it removes every private/vendor tag and blanks any unrecognised person-name by default (not just a fixed allowlist), scrubs the Structured Report content tree, writes the PS3.15 provenance attributes the standard mandates, and **fails closed** — refusing to certify clean the channels a header tool cannot clear (burned-in pixels, recognisable faces in head CT/MR, encapsulated PDF/CDA). Every run emits a verbatim-cited, hash-chained compliance manifest verifiable by an independent post-run scan. It is built around the gap CNIL fined 800,000 EUR for in 2024 — a false anonymisation claim — so it never claims more than it did.
+`dcm-anon` is a **deny-by-default** DICOM de-identifier: it removes every private/vendor tag and blanks any unrecognised person-name by default (not just a fixed allowlist), scrubs the Structured Report content tree, writes the PS3.15 provenance attributes the standard mandates, and **fails closed** — refusing to certify clean the channels a header tool cannot clear (burned-in pixels, recognisable faces in head CT/MR, encapsulated PDF/CDA). Every run emits a verbatim-cited, hash-chained compliance manifest verifiable by an independent post-run scan.
+
+**What this does and does not buy you.** The CNIL/Cegedim €800K decision had two halves: a *technical* de-identification step (which was fine) and an *upstream legal* gap — no Art. 9(2) lawful basis, with the pseudonymisation treated as if it removed that obligation. `dcm-anon` is an engineering tool: it does the technical step correctly and produces the **auditable, machine-verifiable evidence** that it did — which action ran on which tag, under which clause, with an independent residual scan and a tamper-evident hash chain. It does **not** establish your Art. 9(2) lawful basis or write your DPIA; that is the controller's job and no tool can do it for you. What it does is make that division of responsibility explicit in the output (a `pseudonymous`-not-`anonymous` label plus an Art. 9 disclosure) so the legal gap is surfaced, not papered over. If you are looking for software that makes a lawful basis unnecessary, no such software exists — and a tool that claimed to be it would be making exactly the false-assurance error that drew the fine.
 
 > **Self-hostable companion.** [**dcm-anon-vault**](https://github.com/Ces107/dcm-anon-vault) wraps the same engine as a multi-user REST API with persisted SHA-256 audit retention and deterministic UID re-mapping for longitudinal cohort linkage. It ships with a Dockerfile, `docker-compose.yml` and `fly.toml`, so you can deploy it yourself on Fly.io or your own VPS today (see [`docs/deploy.md`](https://github.com/Ces107/dcm-anon-vault/blob/main/docs/deploy.md)). A managed hosted tier with a DPA on file for EU hospital procurement is in preparation — register for [early access](https://ces107.github.io/dcm-anon/#early-access).
 
@@ -99,6 +101,10 @@ dcm-anon /data/study_0001 /data/anon/study_0001
 # Deterministic UIDs — same salt + same source = same output every run
 dcm-anon /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024
 
+# Longitudinal cohort: shift dates per-patient instead of removing them
+# (preserves inter-study intervals; requires --salt; GDPR/pseudonymous, NOT Safe Harbor)
+dcm-anon /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024 --date-shift
+
 # Preview without writing files (audit log still emitted)
 dcm-anon /data/study_0001 /data/anon/study_0001 --dry-run
 
@@ -334,7 +340,14 @@ exit code 3) rather than pretending:
 - **Recognisable faces are not removed.** A head CT/MR/PET reconstructs an
   identifiable face (Schwarz, NEJM 2019); metadata de-identification is
   insufficient. dcm-anon flags and fails closed — deface externally
-  (pydeface / mri_reface), then pass `--accept-face-risk`.
+  (pydeface / mri_reface), then pass `--accept-face-risk`. The gate is
+  genuinely fail-closed: on a face-capable modality (CT/MR/PET/NM) it fires
+  unless there is positive evidence of a **non-cranial** body part (an
+  accent-normalised, multilingual EN/ES/FR/DE/IT/PT match). A head scan with a
+  blank, coded, or non-English description is treated as a face, not waved
+  through — the false negative the previous keyword-only rule allowed. The
+  conservative cost (an abdominal CT with no body-part label also fails closed)
+  is cleared with `--accept-face-risk`.
 - **Encapsulated PDF/CDA interiors are not cleaned.** The opaque byte stream
   is not inspected; the object is quarantined unless `--allow-encapsulated`.
 - **SR free-text names are best-effort.** `--scrub-sr` redacts regex/blacklist
@@ -344,8 +357,17 @@ exit code 3) rather than pretending:
   reversible by the salt-holder (a secret key, HMAC-keyed); without a salt,
   distinct patients collapse to one pseudonym (cohort separation lost). See
   `SECURITY.md`.
-- **Dates are removed, not shifted.** A consistent per-patient date-shift option
-  (PS3.15 Retain Longitudinal Modified Dates) is not yet implemented.
+- **Dates: removed by default, or shifted with `--date-shift`.** By default
+  every date is removed. For longitudinal cohorts, `--date-shift` (PS3.15 Retain
+  Longitudinal Modified Dates, CID 7050 code 113107) instead moves every date by
+  a single deterministic per-patient offset derived from `--salt`, so a
+  patient's studies stay the same number of days apart while their absolute
+  calendar position is hidden. It **requires `--salt`** (the offset must be
+  reproducible across files and runs) and is **not HIPAA Safe Harbor**, which
+  mandates date removal — use it only under a GDPR/pseudonymous regime. The
+  output's independent verifier is told dates were intentionally retained, so it
+  no longer flags them while still catching every non-date identifier; the
+  provenance code 113107 is written so the choice is auditable.
 - **Tag table is pinned to a DICOM edition.** Tags introduced in a later edition
   are caught only by the deny-by-default identifier-VR sweep, not by name.
 - **No DICOMDIR update, no DICOM network (C-STORE/Q-R), no GUI.** Regenerate the
@@ -359,13 +381,16 @@ exit code 3) rather than pretending:
 pytest tests/ -v --cov=dcm_anon --cov-report=term-missing
 ```
 
-197 tests, coverage ≥80% gated in CI. Suite covers: deny-by-default private +
+217 test functions (run `pytest tests/ -q`; the exact count is enforced against
+this README by `tests/test_version_coherence.py`, so the number cannot silently
+drift). Coverage ≥80% gated in CI. Suite covers: deny-by-default private +
 unknown-PN removal (adversarial leak fixtures), file-meta AE scrub, multi-valued
 UID remap, PS3.15 provenance attributes, fail-closed safety gates (burned-in /
-face / encapsulated / SR), SR content-tree scrubbing, false-green verification
-paths (vacuous pass, dependency failure, dry-run), HMAC UID + per-patient
-pseudonyms, manifest SHA-chain integrity and tamper detection, and the public
-completeness proof.
+multilingual fail-closed face / encapsulated / SR), deterministic per-patient
+date-shift (interval preservation), SR content-tree scrubbing, false-green
+verification paths (vacuous pass, dependency failure, dry-run), HMAC UID +
+per-patient pseudonyms, manifest SHA-chain integrity and tamper detection, and
+the public completeness proof.
 
 ## Completeness proof (run it yourself)
 
diff --git a/dcm_anon/__init__.py b/dcm_anon/__init__.py
index a2fab83..3c7f635 100644
--- a/dcm_anon/__init__.py
+++ b/dcm_anon/__init__.py
@@ -13,6 +13,7 @@
     render_markdown_report,
 )
 from dcm_anon.cli import main
+from dcm_anon.dateshift import shift_dates
 from dcm_anon.manifest import (
     ComplianceManifest,
     build_manifest,
@@ -57,5 +58,6 @@
     "render_markdown",
     "render_markdown_report",
     "scan_outputs",
+    "shift_dates",
     "verify_manifest",
 ]
diff --git a/dcm_anon/_version.py b/dcm_anon/_version.py
index 0b88d28..0c305dc 100644
--- a/dcm_anon/_version.py
+++ b/dcm_anon/_version.py
@@ -7,4 +7,4 @@
 installs and silently mislabelled which code produced a compliance manifest.
 """
 
-__version__ = "0.6.1"
+__version__ = "0.7.0"
diff --git a/dcm_anon/cli.py b/dcm_anon/cli.py
index cb7c9ea..50582b3 100644
--- a/dcm_anon/cli.py
+++ b/dcm_anon/cli.py
@@ -70,6 +70,18 @@ def build_arg_parser(version: str) -> argparse.ArgumentParser:
                         help="Retain ALL private (odd-group) tags. NOT recommended: "
                              "PS3.15 mandates their removal and vendors store PHI there. "
                              "Default removes every private element.")
+    # Longitudinal date handling. Default removes dates; --date-shift retains the
+    # time axis by moving every date a deterministic per-patient offset instead.
+    parser.add_argument("--date-shift", action="store_true",
+                        help="Shift every date by a deterministic per-patient offset "
+                             "instead of removing it (PS3.15 Retain Modified Dates, "
+                             "113107). Preserves inter-study intervals for longitudinal "
+                             "cohorts. REQUIRES --salt. NOT HIPAA Safe Harbor (which "
+                             "mandates date removal) — use only under a GDPR/pseudonymous "
+                             "regime.")
+    parser.add_argument("--date-shift-max-days", type=int, default=365, metavar="N",
+                        help="Half-width of the per-patient date-shift window in days "
+                             "(offset drawn from [-N, +N]; default 365).")
     # Fail-closed waivers — the tool refuses to certify clean when it detects a
     # channel it cannot clear. Each flag is an explicit, logged acknowledgement.
     parser.add_argument("--allow-burned-in", action="store_true",
@@ -247,6 +259,17 @@ def _run_verify_mode(args: argparse.Namespace) -> int:
 def _validate_anonymize_args(args: argparse.Namespace) -> str | None:
     if args.src is None or args.dst is None:
         return "src and dst are required unless --verify-manifest is used"
+    if args.date_shift and not args.salt:
+        return ("--date-shift requires --salt: the per-patient date offset must be "
+                "deterministic so a patient's studies stay aligned across runs")
+    if args.date_shift and args.date_shift_max_days <= 0:
+        return "--date-shift-max-days must be a positive integer"
+    if args.date_shift and args.manifest_mode == "hipaa":
+        return ("--date-shift retains (shifted) dates, which HIPAA Safe Harbor "
+                "forbids (§164.514(b)(2)(i)(C) requires removing dates more "
+                "specific than the year). Emitting a Safe-Harbor manifest over "
+                "retained dates would be a false compliance claim. Use "
+                "--manifest-mode gdpr, or drop --date-shift for HIPAA output.")
     return None
 
 
@@ -274,6 +297,8 @@ def _run_anonymize_mode(args: argparse.Namespace) -> int:
         allow_sr=args.allow_sr,
         scrub_sr=args.scrub_sr,
         sr_profile=args.sr_profile,
+        date_shift=args.date_shift,
+        date_shift_max_days=args.date_shift_max_days,
         progress_cb=_build_progress_cb(total, quiet=args.quiet),
     )
 
@@ -302,6 +327,7 @@ def _run_anonymize_mode(args: argparse.Namespace) -> int:
             sample_size=args.verify_output_sample,
             pixel_ocr=args.verify_output_pixel_ocr,
             strict_ocr=not args.no_strict_ocr,
+            dates_retained=args.date_shift,
         )
     manifest_obj = build_manifest(
         summary,
diff --git a/dcm_anon/dateshift.py b/dcm_anon/dateshift.py
new file mode 100644
index 0000000..eb3013e
--- /dev/null
+++ b/dcm_anon/dateshift.py
@@ -0,0 +1,107 @@
+"""Consistent per-patient date shifting (PS3.15 Retain Longitudinal Temporal
+Information — Modified Dates Option, CID 7050 code 113107).
+
+The default profile *removes* dates, which destroys the time axis a longitudinal
+or multi-visit cohort is built on. This module instead **shifts** every DA/DT
+value by a single, deterministic, per-patient offset (derived by
+:meth:`UIDMapper.date_offset_days` from the salt), so that:
+
+- intervals between a patient's studies are preserved exactly (visit 2 minus
+  visit 1 is unchanged), and
+- the absolute calendar position is moved by an amount no one can recover
+  without the salt.
+
+This is a GDPR-oriented pseudonymisation control. It is deliberately NOT applied
+under the default (date-removing) profile, and it is incompatible with HIPAA
+Safe Harbor, which mandates removal of dates more specific than the year
+(§164.514(b)(2)(i)(C)) — the CLI/manifest say so rather than over-claiming.
+
+The shift operates by VR (DA, DT), recursing into sequences, so it reaches
+date elements the flat tag table does not enumerate (deny-by-default for dates,
+the same posture the rest of the pipeline takes).
+"""
+from __future__ import annotations
+
+from datetime import date, timedelta
+
+from pydicom.dataset import Dataset
+
+# VRs that carry a calendar date we can shift. TM (time-of-day) is NOT shifted:
+# a wall-clock time is not on the longitudinal axis and shifting it buys nothing.
+_DATE_VR = "DA"
+_DATETIME_VR = "DT"
+
+
+def _shift_da(value: str, offset_days: int) -> str | None:
+    """Shift a DICOM DA value (``YYYYMMDD``). Return the shifted string, or
+    ``None`` if the value is empty / not a parseable date (left untouched)."""
+    text = value.strip()
+    if len(text) < 8 or not text[:8].isdigit():
+        return None
+    try:
+        base = date(int(text[:4]), int(text[4:6]), int(text[6:8]))
+    except ValueError:
+        return None
+    shifted = base + timedelta(days=offset_days)
+    return f"{shifted.year:04d}{shifted.month:02d}{shifted.day:02d}"
+
+
+def _shift_dt(value: str, offset_days: int) -> str | None:
+    """Shift the date portion of a DICOM DT value (``YYYYMMDD`` prefix plus an
+    optional ``HHMMSS.FFFFFF&ZZXX`` tail). The time/timezone tail is preserved
+    verbatim; only the calendar date moves."""
+    text = value.strip()
+    shifted_date = _shift_da(text[:8], offset_days)
+    if shifted_date is None:
+        return None
+    return shifted_date + text[8:]
+
+
+def _shift_element_value(value: object, vr: str, offset_days: int) -> object | None:
+    """Shift a single element value (handling multi-valued DA/DT). Return the new
+    value, or ``None`` if nothing parseable was found (caller leaves it as-is)."""
+    shifter = _shift_da if vr == _DATE_VR else _shift_dt
+    # pydicom's MultiValue is a MutableSequence, not a list, so test for iterability.
+    if not isinstance(value, (str, bytes)) and hasattr(value, "__iter__"):
+        out: list[str] = []
+        changed = False
+        for member in value:
+            new = shifter(str(member), offset_days)
+            if new is None:
+                out.append(str(member))
+            else:
+                out.append(new)
+                changed = True
+        return out if changed else None
+
+    return shifter(str(value), offset_days)
+
+
+def shift_dates(ds: Dataset, offset_days: int) -> list[str]:
+    """Shift every DA/DT element in *ds* (recursing into sequences) by
+    *offset_days*. In-place. Returns audit entries in the
+    ``"GGGG,EEEE:SHIFT(VR)"`` form used by ``pipeline._format_touch``.
+
+    A zero offset still rewrites the values to themselves and is logged, so the
+    audit shows the option ran (a per-patient offset of 0 is a legitimate draw).
+    """
+    touched: list[str] = []
+    for elem in ds:
+        if elem.VR == "SQ" and elem.value:
+            for item in elem.value:
+                if isinstance(item, Dataset):
+                    touched.extend(shift_dates(item, offset_days))
+            continue
+        if elem.VR not in (_DATE_VR, _DATETIME_VR):
+            continue
+        if elem.value in (None, ""):
+            continue
+        new_value = _shift_element_value(elem.value, elem.VR, offset_days)
+        if new_value is None:
+            continue
+        elem.value = new_value
+        touched.append(f"{elem.tag.group:04X},{elem.tag.element:04X}:SHIFT({elem.VR})")
+    return touched
+
+
+__all__ = ["shift_dates"]
diff --git a/dcm_anon/pipeline.py b/dcm_anon/pipeline.py
index d858695..3742227 100644
--- a/dcm_anon/pipeline.py
+++ b/dcm_anon/pipeline.py
@@ -42,6 +42,8 @@ class AnonymizationConfig:
     allow_sr: bool = False
     scrub_sr: bool = False
     sr_profile: str = "default"
+    date_shift: bool = False
+    date_shift_max_days: int = 365
     progress_cb: ProgressCallback | None = None
     registry: ActionRegistry = field(default_factory=lambda: DEFAULT_REGISTRY)
 
@@ -73,11 +75,18 @@ def _apply_point_actions(
     keep: TagSet,
     mapper: UIDMapper,
     registry: ActionRegistry,
+    *,
+    preserve_dates: bool = False,
 ) -> list[str]:
     touched: list[str] = []
     for tag, action in PHI_TAGS.items():
         if tag in keep or tag not in ds:
             continue
+        # With date-shift active, DA/DT elements are retained here and shifted
+        # by a per-patient offset in a later pass (PS3.15 Retain Modified Dates),
+        # so the longitudinal time axis survives instead of being blanked.
+        if preserve_dates and ds[tag].VR in ("DA", "DT"):
+            continue
         registry[action](ds, tag, mapper)
         touched.append(_format_touch(tag, action.value))
     return touched
@@ -146,6 +155,7 @@ def _recurse_into_sequences(
     registry: ActionRegistry,
     *,
     keep_private: bool = False,
+    preserve_dates: bool = False,
 ) -> list[str]:
     touched: list[str] = []
     for elem in ds:
@@ -153,7 +163,10 @@ def _recurse_into_sequences(
             continue
         for item in elem.value:
             touched.extend(
-                _scrub_dataset(item, mapper, keep, registry, keep_private=keep_private)
+                _scrub_dataset(
+                    item, mapper, keep, registry,
+                    keep_private=keep_private, preserve_dates=preserve_dates,
+                )
             )
     return touched
 
@@ -165,17 +178,22 @@ def _scrub_dataset(
     registry: ActionRegistry = DEFAULT_REGISTRY,
     *,
     keep_private: bool = False,
+    preserve_dates: bool = False,
 ) -> list[str]:
     """Apply known PHI_TAGS actions, sweep private + unknown person-names, and
     recurse into nested Sequence items. Deny-by-default: anything private or any
-    PN VR is removed/blanked unless explicitly kept."""
-    touched = _apply_point_actions(ds, keep_tags, mapper, registry)
+    PN VR is removed/blanked unless explicitly kept. With ``preserve_dates`` the
+    DA/DT elements are left in place for the later per-patient date-shift pass."""
+    touched = _apply_point_actions(ds, keep_tags, mapper, registry, preserve_dates=preserve_dates)
     touched.extend(_scrub_curve_overlay_ranges(ds, keep_tags))
     if not keep_private:
         touched.extend(_strip_private_elements(ds, keep_tags))
     touched.extend(_scrub_unknown_person_names(ds, keep_tags))
     touched.extend(
-        _recurse_into_sequences(ds, keep_tags, mapper, registry, keep_private=keep_private)
+        _recurse_into_sequences(
+            ds, keep_tags, mapper, registry,
+            keep_private=keep_private, preserve_dates=preserve_dates,
+        )
     )
     return touched
 
@@ -244,11 +262,21 @@ def anonymize_file(
     allow_sr: bool = False,
     scrub_sr: bool = False,
     sr_profile: str = "default",
+    date_shift: bool = False,
+    date_shift_max_days: int = 365,
 ) -> AuditRecord:
     """Anonymize a single DICOM file and return its audit record."""
     keep = keep_tags if keep_tags is not None else frozenset()
+    if date_shift and not mapper.salt:
+        raise ValueError(
+            "date_shift requires a salt: the per-patient offset must be "
+            "deterministic so a patient's studies stay aligned across files/runs."
+        )
     ds = dcmread(src)
     original_sop = str(ds.SOPInstanceUID) if hasattr(ds, "SOPInstanceUID") else None
+    # Capture the patient key BEFORE scrubbing rewrites PatientID/PatientName,
+    # so the date offset is keyed to the real patient (deterministic per person).
+    patient_key = str(getattr(ds, "PatientID", "") or getattr(ds, "PatientName", "") or "")
 
     # Fail-closed risk detection on the ORIGINAL object (before scrubbing removes
     # the modality/body-part signals). These are channels a header de-identifier
@@ -264,10 +292,20 @@ def anonymize_file(
         allow_sr=allow_sr,
     )
 
-    touched = _scrub_dataset(ds, mapper, keep, registry, keep_private=keep_private)
+    touched = _scrub_dataset(
+        ds, mapper, keep, registry, keep_private=keep_private, preserve_dates=date_shift,
+    )
     _maintain_file_meta_consistency(ds, original_sop, mapper)
     touched.extend(_scrub_file_meta(ds))
 
+    # Per-patient date shift (PS3.15 Retain Modified Dates). Runs AFTER the scrub
+    # left the DA/DT elements in place (preserve_dates) and moves every calendar
+    # date by one deterministic per-patient offset, preserving inter-study gaps.
+    if date_shift:
+        from dcm_anon.dateshift import shift_dates
+        offset = mapper.date_offset_days(patient_key, date_shift_max_days)
+        touched.extend(shift_dates(ds, offset))
+
     # SR content-tree pass (free-text / person-names / dates / UIDREFs inside
     # ContentSequence) on the SAME dataset + SAME UID map. Opt-in via --scrub-sr;
     # its actions use a distinct vocabulary so they ride a separate audit field,
@@ -283,9 +321,19 @@ def anonymize_file(
     # action log; they are verifiable on the output object and in the manifest.
     # Claim Clean Structured Content (113104) ONLY when the SR pass actually ran.
     from dcm_anon._version import __version__ as _tool_version
-    from dcm_anon.provenance import CLEAN_STRUCTURED_CONTENT, write_deid_provenance
-    sr_codes = [CLEAN_STRUCTURED_CONTENT] if sr_touched else None
-    write_deid_provenance(ds, _tool_version, keep_private=keep_private, extra_codes=sr_codes)
+    from dcm_anon.provenance import (
+        CLEAN_STRUCTURED_CONTENT,
+        RETAIN_MODIFIED_DATES,
+        write_deid_provenance,
+    )
+    extra_codes = []
+    if sr_touched:
+        extra_codes.append(CLEAN_STRUCTURED_CONTENT)
+    if date_shift:
+        extra_codes.append(RETAIN_MODIFIED_DATES)
+    write_deid_provenance(
+        ds, _tool_version, keep_private=keep_private, extra_codes=extra_codes or None,
+    )
 
     output_path: str | None
     if dry_run:
@@ -329,6 +377,8 @@ def anonymize_path(
     allow_sr: bool = False,
     scrub_sr: bool = False,
     sr_profile: str = "default",
+    date_shift: bool = False,
+    date_shift_max_days: int = 365,
     progress_cb: ProgressCallback | None = None,
     config: AnonymizationConfig | None = None,
 ) -> AuditSummary:
@@ -349,6 +399,8 @@ def anonymize_path(
         allow_sr=allow_sr,
         scrub_sr=scrub_sr,
         sr_profile=sr_profile,
+        date_shift=date_shift,
+        date_shift_max_days=date_shift_max_days,
         progress_cb=progress_cb,
     )
 
@@ -373,6 +425,8 @@ def anonymize_path(
                 allow_sr=cfg.allow_sr,
                 scrub_sr=cfg.scrub_sr,
                 sr_profile=cfg.sr_profile,
+                date_shift=cfg.date_shift,
+                date_shift_max_days=cfg.date_shift_max_days,
             ))
         except Exception as exc:
             err = ProcessingError(
diff --git a/dcm_anon/safety.py b/dcm_anon/safety.py
index 235fff2..f4895bb 100644
--- a/dcm_anon/safety.py
+++ b/dcm_anon/safety.py
@@ -19,6 +19,7 @@
 """
 from __future__ import annotations
 
+import unicodedata
 from typing import Final
 
 from pydicom.dataset import Dataset
@@ -46,11 +47,54 @@
 )
 
 # Modalities that produce volumetric data from which a face can be reconstructed.
 _FACE_MODALITIES: Final[frozenset[str]] = frozenset({"CT", "MR", "PT", "NM"})
+
+# Cranio-facial terms across the languages a European cohort actually labels with
+# (EN/ES/FR/DE/IT/PT), accent-stripped so "CRÁNEO"/"CRÂNE"/"SCHÄDEL" all match.
+# Substring matching, so stems ("CRANI", "CEREBR") cover their inflections.
 _FACE_KEYWORDS: Final[tuple[str, ...]] = (
-    "HEAD", "BRAIN", "SKULL", "FACE", "SINUS", "ORBIT", "TMJ", "IAC", "CRANI", "NEURO",
+    # English
+    "HEAD", "BRAIN", "SKULL", "CRANI", "FACE", "FACIAL", "SINUS", "ORBIT", "TMJ",
+    "IAC", "NEURO", "CEREBR", "PITUITARY", "SELLA", "MAXILL", "MANDIBL",
+    "NASOPHARYN", "PAROTID", "TEMPORAL BONE", "PETROUS", "ENCEPHAL",
+    # Spanish
+    "CABEZA", "CRANE", "CRANEO", "CEREBRO", "ENCEFAL", "CARA", "SENO", "ORBITA",
+    "MAXILAR", "MANDIBUL", "HIPOFIS", "FACIAL",
+    # French
+    "TETE", "CERVEAU", "VISAGE", "ORBITE", "MAXILLAIRE", "FACE",
+    # German
+    "KOPF", "SCHADEL", "GEHIRN", "HIRN", "GESICHT", "NEBENHOHLE", "KIEFER",
+    # Italian
+    "TESTA", "CRANIO", "CERVELLO", "VISO", "SENI PARANASALI",
+    # Portuguese
+    "CABECA", "ROSTO", "ENCEFALO",
+)
+
+# Body regions that DEMONSTRABLY do not reconstruct a face. Their presence is the
+# positive evidence that lets a face-capable modality clear the gate. Anything
+# not on this list (empty, coded, or an unrecognised description) does NOT clear
+# it — the gate fails closed, the way a head scan with a blank description must.
+# Neck/cervical are deliberately absent: they routinely include the lower face.
+_NON_FACE_BODYPARTS: Final[tuple[str, ...]] = (
+    "CHEST", "THORAX", "LUNG", "PULMON", "POUMON", "TORAX",
+    "ABDOMEN", "ABDOMINAL", "PELVIS", "PELVIC", "PELVI",
+    "HEART", "CARDIAC", "CARDIO", "CORONARY", "AORTA", "CORAZON", "COEUR",
+    "LIVER", "HIGADO", "FOIE", "LEBER", "KIDNEY", "RENAL", "RINON", "REIN",
+    "PROSTATE", "BLADDER", "VEJIGA", "BREAST", "MAMA", "MAMMO", "SEIN",
+    "COLON", "BOWEL", "RECTUM", "SPLEEN", "PANCREAS",
+    "KNEE", "RODILLA", "GENOU", "KNIE", "ANKLE", "TOBILLO", "FOOT", "PIE", "PIED",
+    "HAND", "MANO", "MAIN", "WRIST", "MUNECA", "ELBOW", "CODO", "SHOULDER",
+    "HOMBRO", "EPAULE", "HIP", "CADERA", "HANCHE", "FEMUR", "TIBIA", "FIBULA",
+    "HUMERUS", "FOREARM", "EXTREMITY", "EXTREMIT", "LEG", "PIERNA", "ARM", "BRAZO",
+    "LSPINE", "TSPINE", "LUMBAR", "THORACIC SPINE", "DORSAL", "SACRUM", "COCCYX",
 )
 
 
+def _normalize(text: str) -> str:
+    """Upper-case and strip diacritics so accented labels match ASCII keywords."""
+    decomposed = unicodedata.normalize("NFKD", text)
+    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()
+
+
 def _attr(ds: Dataset, name: str) -> str:
     return str(getattr(ds, name, "") or "")
 
@@ -68,13 +112,28 @@ def has_burned_in_risk(ds: Dataset) -> bool:
 
 
 def has_face_risk(ds: Dataset) -> bool:
+    """Fail-closed recognizable-face gate for volumetric modalities.
+
+    A header de-identifier cannot remove the face a head CT/MR/PET reconstructs,
+    so the dangerous error here is the FALSE NEGATIVE: passing a cranial study
+    whose description is blank, coded, or in a language the keyword list missed.
+    The old "modality + English keyword" rule did exactly that. This version
+    fails closed instead: on a face-capable modality the gate fires UNLESS there
+    is positive evidence (an accent-normalised, multilingual body-part match)
+    that the study is of a non-cranial region. Ambiguity resolves to risk.
+    """
     if _attr(ds, "Modality").upper() not in _FACE_MODALITIES:
         return False
-    haystack = " ".join(
+    haystack = _normalize(" ".join(
         _attr(ds, a)
         for a in ("BodyPartExamined", "ProtocolName", "StudyDescription", "SeriesDescription")
-    ).upper()
-    return any(keyword in haystack for keyword in _FACE_KEYWORDS)
+    ))
+    if any(keyword in haystack for keyword in _FACE_KEYWORDS):
+        return True
+    if any(region in haystack for region in _NON_FACE_BODYPARTS):
+        return False
+    # Face-capable modality with no positive non-cranial evidence: fail closed.
+    return True
 
 
 def has_encapsulated_document(ds: Dataset) -> bool:
diff --git a/dcm_anon/uid_mapper.py b/dcm_anon/uid_mapper.py
index 69cda81..82bc853 100644
--- a/dcm_anon/uid_mapper.py
+++ b/dcm_anon/uid_mapper.py
@@ -49,6 +49,23 @@ def pseudonym(self, original: str) -> str | None:
             return None
         return self._hmac_hex(original)[:_PSEUDONYM_HEX_LEN].upper()
 
+    def date_offset_days(self, patient_key: str, max_days: int) -> int:
+        """Deterministic per-patient date shift in ``[-max_days, +max_days]``.
+
+        Derived from HMAC-SHA256(key=salt, msg=patient_key) so the SAME patient
+        gets the SAME offset across every file and every run (the property a
+        longitudinal cohort needs), while remaining unrecoverable without the
+        salt. Requires a salt — a non-deterministic shift would desynchronise a
+        patient's studies and is therefore rejected by the caller.
+        """
+        if not self.salt:
+            raise ValueError("date_offset_days requires a salt (deterministic shift)")
+        if max_days <= 0:
+            raise ValueError(f"date-shift window must be positive (got {max_days})")
+        span = 2 * max_days + 1
+        digest_int = int(self._hmac_hex(patient_key)[:_HEX_BYTES_FOR_UID], 16)
+        return (digest_int % span) - max_days
+
     def size(self) -> int:
         return len(self._mapping)
 
diff --git a/dcm_anon/verify_output.py b/dcm_anon/verify_output.py
index 1432e12..f3be5e9 100644
--- a/dcm_anon/verify_output.py
+++ b/dcm_anon/verify_output.py
@@ -65,12 +65,15 @@
 )
 
 
-# Acceptable cleaned values — emitted by dcm-anon for Z (zero/blank) and D
-# (dummy placeholder) actions. Anything else is treated as a residual.
+# Acceptable cleaned values — restricted to the EXACT placeholders dcm-anon
+# emits for its Z (zero/blank) and D actions (see phi_table.PLACEHOLDERS) plus
+# structurally-empty values. Free-text words like "ANONYMOUS" / "REMOVED" were
+# removed deliberately: the tool never writes them, so accepting them could only
+# mask a real residual value that happens to read that way. A tighter accept-set
+# means the independent scan errs toward flagging, not toward a false green.
 _CLEAN_PLACEHOLDERS: Final = frozenset(
     {
-        "", " ", "ANON", "ANONYMOUS", "0", "0000", "00000000", "19000101",
-        "000000.000000", "REMOVED",
+        "", " ", "ANON", "0", "0000", "00000000", "19000101", "000000.000000",
     }
 )
 
@@ -222,6 +225,8 @@ def _value_is_clean(raw_value: object, label: str) -> tuple[bool, str]:
 def _scan_metadata(
     ds: Dataset,
     file_path: Path,
+    *,
+    dates_retained: bool = False,
 ) -> list[Residual]:
     """Scan a dataset against the independent PHI tag list.
 
@@ -229,9 +234,16 @@ def _scan_metadata(
     sequences (e.g. ``RequestAttributesSequence``, ``OriginalAttributesSequence``)
     is detected. Without this, a manifest could report ``passed=True`` while
     PHI persists in nested datasets.
+
+    When ``dates_retained`` is set, date-category tags ("(C)") are intentionally
+    present (shifted by the Retain-Modified-Dates option) and are NOT counted as
+    residuals; every non-date identifier is still checked, so independence over
+    names / IDs / institutions is unaffected.
     """
     findings: list[Residual] = []
     for group, element, label, hipaa in INDEPENDENT_PHI_TAGS:
+        if dates_retained and hipaa.startswith("(C)"):
+            continue
         tag = (group, element)
         if tag not in ds:
             continue
@@ -252,7 +264,7 @@ def _scan_metadata(
             continue
         for item in elem.value:
             if isinstance(item, Dataset):
-                findings.extend(_scan_metadata(item, file_path))
+                findings.extend(_scan_metadata(item, file_path, dates_retained=dates_retained))
     return findings
 
 
@@ -331,6 +343,7 @@ def scan_outputs(
     sample_size: int | None = None,
     pixel_ocr: bool = False,
     strict_ocr: bool = True,
+    dates_retained: bool = False,
 ) -> VerificationResult:
     """Independently scan a directory of anonymized DICOMs for PHI residuals.
 
@@ -385,7 +398,7 @@ def scan_outputs(
         except Exception:
             continue
         files_scanned += 1
-        residuals.extend(_scan_metadata(ds, path))
+        residuals.extend(_scan_metadata(ds, path, dates_retained=dates_retained))
         if pixel_ocr and pixel_available:
             try:
                 residuals.extend(_scan_pixels(ds, path, pytesseract_mod))
diff --git a/docs/index.html b/docs/index.html
index e9aa862..6387db5 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -38,7 +38,7 @@
-v0.6.1 · MIT · open source · DOI 10.5281/zenodo.20267651
+v0.7.0 · MIT · open source · DOI 10.5281/zenodo.20267651

DICOM anonymization with an audit trail your DPO can verify.

@@ -53,8 +53,8 @@

DICOM anonymization with an audit trail your DPO can verify.

Compliance manifest

Every PS3.15 action (X / Z / U / D) that runs on your study is mapped to the literal text of the regulation that authorizes it — GDPR Art. 4(5), HIPAA Safe Harbor §164.514(b)(2), EU AI Act Art. 10. Re-verified against EUR-Lex / eCFR / gdpr-info.eu on 2026-05-13. SHA-256 chain over the audit log + manifest so an auditor can verify integrity from the JSON alone.

-v0.6.1
-Package restructure, AI-slop cleanup, PyPI rename to dcm-anon. No behavioural change to anonymisation. Changelog.
+v0.7.0
+Deterministic per-patient date shifting for longitudinal cohorts (--date-shift, PS3.15 Retain Modified Dates), a genuinely fail-closed multilingual face gate, and a tightened independent verifier. Changelog.
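The decision order behind the fail-closed face gate named in this release note can be sketched as below. This is an illustration only: the function name `face_risk`, its two-string signature, and the shortened keyword tuples are assumptions for the example, and the lists in `dcm_anon/safety.py` are much longer.

```python
import unicodedata

# Illustrative subsets only (assumption); the real lists cover six languages.
FACE_MODALITIES = frozenset({"CT", "MR", "PT", "NM"})
FACE_TERMS = ("HEAD", "BRAIN", "CRANE", "SCHADEL", "TETE")
NON_FACE_TERMS = ("CHEST", "ABDOMEN", "PELVIS", "KNEE")


def normalize(text: str) -> str:
    """Upper-case and strip diacritics so accented labels match ASCII terms."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()


def face_risk(modality: str, description: str) -> bool:
    """Return True when the study must be treated as possibly showing a face."""
    if modality.upper() not in FACE_MODALITIES:
        return False  # modality cannot reconstruct a face
    haystack = normalize(description)
    if any(term in haystack for term in FACE_TERMS):
        return True   # explicit cranial evidence
    if any(term in haystack for term in NON_FACE_TERMS):
        return False  # positive evidence of a non-cranial region
    return True       # blank, coded, or unrecognised: fail closed


print(face_risk("CT", "Cráneo sin contraste"))
print(face_risk("MR", ""))
print(face_risk("CT", "Abdomen"))
```

The third branch is the behavioural change: a description that matches nothing resolves to risk, so the only way to clear the gate without a waiver is a recognised non-cranial body part.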

v0.3.5 highlights