diff --git a/CHANGELOG.md b/CHANGELOG.md index d2c17f7..fb40819 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,61 @@ Changelog for dcm-anon. Format follows [Keep a Changelog](https://keepachangelog --- +## [0.7.0] - 2026-06-07 + +Minor release. Closes the largest functional gap for the declared audience +(longitudinal multi-hospital cohorts) and hardens two safety-critical paths that +a previous keyword-only heuristic could slip through. Guiding principle is +unchanged: every easy-to-verify claim must be exactly true, because it is what +earns trust in the hard-to-verify ones. + +### Added +- **Deterministic per-patient date shifting (`--date-shift`).** PS3.15 Retain + Longitudinal Temporal Information — Modified Dates Option (CID 7050 code + 113107). Every DA/DT value (recursing into sequences) is moved by a single + per-patient offset derived by HMAC-SHA256 from `--salt`, so a patient's + studies stay exactly the same number of days apart while the absolute calendar + position is hidden — the property a longitudinal cohort needs and that date + *removal* destroyed. Requires `--salt`; window is `--date-shift-max-days` + (default ±365). Provenance code 113107 is written so the choice is auditable. + Explicitly **not** HIPAA Safe Harbor (which mandates date removal); for use + under a GDPR/pseudonymous regime only. New module `dcm_anon/dateshift.py`; + public API `shift_dates`. +- The independent verifier learns about retained dates (`dates_retained=`): when + `--date-shift` ran it no longer flags the intentionally-shifted dates, while + still catching every non-date identifier — independence over names / IDs / + institutions is unchanged. +- `tests/test_dateshift.py` and new safety-gate / verifier cases. 
+ +### Changed +- **Recognizable-face gate is now genuinely fail-closed.** It previously fired + only when an *English* keyword (HEAD/BRAIN/…) matched the description, so a + cranial CT/MR with a blank, coded, or non-English description passed — a false + negative in a safety gate. It now fires on any face-capable modality + (CT/MR/PET/NM) **unless** there is positive evidence of a non-cranial body + part (an accent-normalised, multilingual EN/ES/FR/DE/IT/PT match). Ambiguity + resolves to risk. Clear it as before with `--accept-face-risk`. +- **Independent verifier accept-set tightened.** Free-text words the tool never + emits (`ANONYMOUS`, `REMOVED`) were dropped from the "clean value" set: they + could only ever mask a real residual that happened to read that way. The set + is now restricted to the exact placeholders `dcm_anon` writes plus + structurally-empty values, so the scan errs toward flagging, not toward a + false green. + +### Fixed +- **README test count is now exact and CI-enforced.** The stale "197 tests" + claim is replaced with the real test-function count, and + `tests/test_version_coherence.py` fails the build if the README number ever + drifts from the suite again. The honesty pitch starts with the easy number. +- README opening now states plainly what the tool does and does **not** buy + (it does the technical de-identification and the auditable evidence of it; it + does not establish your Art. 9(2) lawful basis — no tool can), removing the + apparent contradiction between the CNIL framing and the disclaimer. +- Landing page no longer quotes firm prices for a managed tier that is not yet + purchasable; pricing will be published when the tier and its DPA ship. + +--- + ## [0.6.1] - 2026-06-01 Patch release. 
Compliance-citation correctness, independent-verifier test diff --git a/CITATION.cff b/CITATION.cff index 154140a..76ce674 100644 --- a/CITATION.cff +++ b/CITATION.cff @@ -10,15 +10,12 @@ authors: repository-code: "https://github.com/Ces107/dcm-anon" url: "https://github.com/Ces107/dcm-anon" license: MIT -version: "0.4.0" -date-released: "2026-05-19" +version: "0.7.0" +date-released: "2026-06-07" identifiers: - type: doi value: "10.5281/zenodo.20267651" description: "Concept DOI (always resolves to latest version)" - - type: doi - value: "10.5281/zenodo.20282264" - description: "Version-specific DOI for v0.4.0" keywords: - DICOM - medical-imaging diff --git a/README.md b/README.md index 5570b91..13d333f 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,9 @@ Deny-by-default DICOM de-identifier (PS3.15 Basic Profile + Structured Report sc [](pyproject.toml) [](dcm_anon/manifest.py) -`dcm-anon` is a **deny-by-default** DICOM de-identifier: it removes every private/vendor tag and blanks any unrecognised person-name by default (not just a fixed allowlist), scrubs the Structured Report content tree, writes the PS3.15 provenance attributes the standard mandates, and **fails closed** — refusing to certify clean the channels a header tool cannot clear (burned-in pixels, recognisable faces in head CT/MR, encapsulated PDF/CDA). Every run emits a verbatim-cited, hash-chained compliance manifest verifiable by an independent post-run scan. It is built around the gap CNIL fined 800,000 EUR for in 2024 — a false anonymisation claim — so it never claims more than it did. 
+`dcm-anon` is a **deny-by-default** DICOM de-identifier: it removes every private/vendor tag and blanks any unrecognised person-name by default (not just a fixed allowlist), scrubs the Structured Report content tree, writes the PS3.15 provenance attributes the standard mandates, and **fails closed** — refusing to certify clean the channels a header tool cannot clear (burned-in pixels, recognisable faces in head CT/MR, encapsulated PDF/CDA). Every run emits a verbatim-cited, hash-chained compliance manifest verifiable by an independent post-run scan. + +**What this does and does not buy you.** The CNIL/Cegedim €800K decision had two halves: a *technical* de-identification step (which was fine) and an *upstream legal* gap — no Art. 9(2) lawful basis, with the pseudonymisation treated as if it removed that obligation. `dcm-anon` is an engineering tool: it does the technical step correctly and produces the **auditable, machine-verifiable evidence** that it did — which action ran on which tag, under which clause, with an independent residual scan and a tamper-evident hash chain. It does **not** establish your Art. 9(2) lawful basis or write your DPIA; that is the controller's job and no tool can do it for you. What it does is make that division of responsibility explicit in the output (a `pseudonymous`-not-`anonymous` label plus an Art. 9 disclosure) so the legal gap is surfaced, not papered over. If you are looking for software that makes a lawful basis unnecessary, no such software exists — and a tool that claimed to be it would be making exactly the false-assurance error that drew the fine. > **Self-hostable companion.** [**dcm-anon-vault**](https://github.com/Ces107/dcm-anon-vault) wraps the same engine as a multi-user REST API with persisted SHA-256 audit retention and deterministic UID re-mapping for longitudinal cohort linkage. 
It ships with a Dockerfile, `docker-compose.yml` and `fly.toml`, so you can deploy it yourself on Fly.io or your own VPS today (see [`docs/deploy.md`](https://github.com/Ces107/dcm-anon-vault/blob/main/docs/deploy.md)). A managed hosted tier with a DPA on file for EU hospital procurement is in preparation — register for [early access](https://ces107.github.io/dcm-anon/#early-access). @@ -99,6 +101,10 @@ dcm-anon /data/study_0001 /data/anon/study_0001 # Deterministic UIDs — same salt + same source = same output every run dcm-anon /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024 +# Longitudinal cohort: shift dates per-patient instead of removing them +# (preserves inter-study intervals; requires --salt; GDPR/pseudonymous, NOT Safe Harbor) +dcm-anon /data/study_0001 /data/anon/study_0001 --salt cohort-A-2024 --date-shift + # Preview without writing files (audit log still emitted) dcm-anon /data/study_0001 /data/anon/study_0001 --dry-run @@ -334,7 +340,14 @@ exit code 3) rather than pretending: - **Recognisable faces are not removed.** A head CT/MR/PET reconstructs an identifiable face (Schwarz, NEJM 2019); metadata de-identification is insufficient. dcm-anon flags and fails closed — deface externally - (pydeface / mri_reface), then pass `--accept-face-risk`. + (pydeface / mri_reface), then pass `--accept-face-risk`. The gate is + genuinely fail-closed: on a face-capable modality (CT/MR/PET/NM) it fires + unless there is positive evidence of a **non-cranial** body part (an + accent-normalised, multilingual EN/ES/FR/DE/IT/PT match). A head scan with a + blank, coded, or non-English description is treated as a face, not waved + through — the false negative the previous keyword-only rule allowed. The + conservative cost (a bodyless abdominal CT also fails closed) is cleared with + `--accept-face-risk`. - **Encapsulated PDF/CDA interiors are not cleaned.** The opaque byte stream is not inspected; the object is quarantined unless `--allow-encapsulated`. 
- **SR free-text names are best-effort.** `--scrub-sr` redacts regex/blacklist @@ -344,8 +357,17 @@ exit code 3) rather than pretending: reversible by the salt-holder (a secret key, HMAC-keyed); without a salt, distinct patients collapse to one pseudonym (cohort separation lost). See `SECURITY.md`. -- **Dates are removed, not shifted.** A consistent per-patient date-shift option - (PS3.15 Retain Longitudinal Modified Dates) is not yet implemented. +- **Dates: removed by default, or shifted with `--date-shift`.** By default + every date is removed. For longitudinal cohorts, `--date-shift` (PS3.15 Retain + Longitudinal Modified Dates, CID 7050 code 113107) instead moves every date by + a single deterministic per-patient offset derived from `--salt`, so a + patient's studies stay the same number of days apart while their absolute + calendar position is hidden. It **requires `--salt`** (the offset must be + reproducible across files and runs) and is **not HIPAA Safe Harbor**, which + mandates date removal — use it only under a GDPR/pseudonymous regime. The + output's independent verifier is told dates were intentionally retained, so it + no longer flags them while still catching every non-date identifier; the + provenance code 113107 is written so the choice is auditable. - **Tag table is pinned to a DICOM edition.** Tags introduced in a later edition are caught only by the deny-by-default identifier-VR sweep, not by name. - **No DICOMDIR update, no DICOM network (C-STORE/Q-R), no GUI.** Regenerate the @@ -359,13 +381,16 @@ exit code 3) rather than pretending: pytest tests/ -v --cov=dcm_anon --cov-report=term-missing ``` -197 tests, coverage ≥80% gated in CI. Suite covers: deny-by-default private + +217 test functions (run `pytest tests/ -q`; the exact count is enforced against +this README by `tests/test_version_coherence.py`, so the number cannot silently +drift). Coverage ≥80% gated in CI. 
Suite covers: deny-by-default private + unknown-PN removal (adversarial leak fixtures), file-meta AE scrub, multi-valued UID remap, PS3.15 provenance attributes, fail-closed safety gates (burned-in / -face / encapsulated / SR), SR content-tree scrubbing, false-green verification -paths (vacuous pass, dependency failure, dry-run), HMAC UID + per-patient -pseudonyms, manifest SHA-chain integrity and tamper detection, and the public -completeness proof. +multilingual fail-closed face / encapsulated / SR), deterministic per-patient +date-shift (interval preservation), SR content-tree scrubbing, false-green +verification paths (vacuous pass, dependency failure, dry-run), HMAC UID + +per-patient pseudonyms, manifest SHA-chain integrity and tamper detection, and +the public completeness proof. ## Completeness proof (run it yourself) diff --git a/dcm_anon/__init__.py b/dcm_anon/__init__.py index a2fab83..3c7f635 100644 --- a/dcm_anon/__init__.py +++ b/dcm_anon/__init__.py @@ -13,6 +13,7 @@ render_markdown_report, ) from dcm_anon.cli import main +from dcm_anon.dateshift import shift_dates from dcm_anon.manifest import ( ComplianceManifest, build_manifest, @@ -57,5 +58,6 @@ "render_markdown", "render_markdown_report", "scan_outputs", + "shift_dates", "verify_manifest", ] diff --git a/dcm_anon/_version.py b/dcm_anon/_version.py index 0b88d28..0c305dc 100644 --- a/dcm_anon/_version.py +++ b/dcm_anon/_version.py @@ -7,4 +7,4 @@ installs and silently mislabelled which code produced a compliance manifest. """ -__version__ = "0.6.1" +__version__ = "0.7.0" diff --git a/dcm_anon/cli.py b/dcm_anon/cli.py index cb7c9ea..50582b3 100644 --- a/dcm_anon/cli.py +++ b/dcm_anon/cli.py @@ -70,6 +70,18 @@ def build_arg_parser(version: str) -> argparse.ArgumentParser: help="Retain ALL private (odd-group) tags. NOT recommended: " "PS3.15 mandates their removal and vendors store PHI there. " "Default removes every private element.") + # Longitudinal date handling. 
Default removes dates; --date-shift retains the + # time axis by moving every date a deterministic per-patient offset instead. + parser.add_argument("--date-shift", action="store_true", + help="Shift every date by a deterministic per-patient offset " + "instead of removing it (PS3.15 Retain Modified Dates, " + "113107). Preserves inter-study intervals for longitudinal " + "cohorts. REQUIRES --salt. NOT HIPAA Safe Harbor (which " + "mandates date removal) — use only under a GDPR/pseudonymous " + "regime.") + parser.add_argument("--date-shift-max-days", type=int, default=365, metavar="N", + help="Half-width of the per-patient date-shift window in days " + "(offset drawn from [-N, +N]; default 365).") # Fail-closed waivers — the tool refuses to certify clean when it detects a # channel it cannot clear. Each flag is an explicit, logged acknowledgement. parser.add_argument("--allow-burned-in", action="store_true", @@ -247,6 +259,17 @@ def _run_verify_mode(args: argparse.Namespace) -> int: def _validate_anonymize_args(args: argparse.Namespace) -> str | None: if args.src is None or args.dst is None: return "src and dst are required unless --verify-manifest is used" + if args.date_shift and not args.salt: + return ("--date-shift requires --salt: the per-patient date offset must be " + "deterministic so a patient's studies stay aligned across runs") + if args.date_shift and args.date_shift_max_days <= 0: + return "--date-shift-max-days must be a positive integer" + if args.date_shift and args.manifest_mode == "hipaa": + return ("--date-shift retains (shifted) dates, which HIPAA Safe Harbor " + "forbids (§164.514(b)(2)(i)(C) requires removing dates more " + "specific than the year). Emitting a Safe-Harbor manifest over " + "retained dates would be a false compliance claim. 
Use " + "--manifest-mode gdpr, or drop --date-shift for HIPAA output.") return None @@ -274,6 +297,8 @@ def _run_anonymize_mode(args: argparse.Namespace) -> int: allow_sr=args.allow_sr, scrub_sr=args.scrub_sr, sr_profile=args.sr_profile, + date_shift=args.date_shift, + date_shift_max_days=args.date_shift_max_days, progress_cb=_build_progress_cb(total, quiet=args.quiet), ) @@ -302,6 +327,7 @@ def _run_anonymize_mode(args: argparse.Namespace) -> int: sample_size=args.verify_output_sample, pixel_ocr=args.verify_output_pixel_ocr, strict_ocr=not args.no_strict_ocr, + dates_retained=args.date_shift, ) manifest_obj = build_manifest( summary, diff --git a/dcm_anon/dateshift.py b/dcm_anon/dateshift.py new file mode 100644 index 0000000..eb3013e --- /dev/null +++ b/dcm_anon/dateshift.py @@ -0,0 +1,107 @@ +"""Consistent per-patient date shifting (PS3.15 Retain Longitudinal Temporal +Information — Modified Dates Option, CID 7050 code 113107). + +The default profile *removes* dates, which destroys the time axis a longitudinal +or multi-visit cohort is built on. This module instead **shifts** every DA/DT +value by a single, deterministic, per-patient offset (derived by +:meth:`UIDMapper.date_offset_days` from the salt), so that: + +- intervals between a patient's studies are preserved exactly (visit 2 minus + visit 1 is unchanged), and +- the absolute calendar position is moved by an amount no one can recover + without the salt. + +This is a GDPR-oriented pseudonymisation control. It is deliberately NOT applied +under the default (date-removing) profile, and it is incompatible with HIPAA +Safe Harbor, which mandates removal of dates more specific than the year +(§164.514(b)(2)(i)(C)) — the CLI/manifest say so rather than over-claiming. + +The shift operates by VR (DA, DT), recursing into sequences, so it reaches +date elements the flat tag table does not enumerate (deny-by-default for dates, +the same posture the rest of the pipeline takes). 
+""" +from __future__ import annotations + +from datetime import date, timedelta + +from pydicom.dataset import Dataset + +# VRs that carry a calendar date we can shift. TM (time-of-day) is NOT shifted: +# a wall-clock time is not on the longitudinal axis and shifting it buys nothing. +_DATE_VR = "DA" +_DATETIME_VR = "DT" + + +def _shift_da(value: str, offset_days: int) -> str | None: + """Shift a DICOM DA value (``YYYYMMDD``). Return the shifted string, or + ``None`` if the value is empty / not a parseable date (left untouched).""" + text = value.strip() + if len(text) < 8 or not text[:8].isdigit(): + return None + try: + base = date(int(text[:4]), int(text[4:6]), int(text[6:8])) + except ValueError: + return None + shifted = base + timedelta(days=offset_days) + return f"{shifted.year:04d}{shifted.month:02d}{shifted.day:02d}" + + +def _shift_dt(value: str, offset_days: int) -> str | None: + """Shift the date portion of a DICOM DT value (``YYYYMMDD`` prefix plus an + optional ``HHMMSS.FFFFFF&ZZXX`` tail). The time/timezone tail is preserved + verbatim; only the calendar date moves.""" + text = value.strip() + shifted_date = _shift_da(text[:8], offset_days) + if shifted_date is None: + return None + return shifted_date + text[8:] + + +def _shift_element_value(value: object, vr: str, offset_days: int) -> object | None: + """Shift a single element value (handling multi-valued DA/DT). 
Return the new + value, or ``None`` if nothing parseable was found (caller leaves it as-is).""" + shifter = _shift_da if vr == _DATE_VR else _shift_dt + + if isinstance(value, (list, tuple)): + out: list[str] = [] + changed = False + for member in value: + new = shifter(str(member), offset_days) + if new is None: + out.append(str(member)) + else: + out.append(new) + changed = True + return out if changed else None + + return shifter(str(value), offset_days) + + +def shift_dates(ds: Dataset, offset_days: int) -> list[str]: + """Shift every DA/DT element in *ds* (recursing into sequences) by + *offset_days*. In-place. Returns audit entries in the + ``"GGGG,EEEE:SHIFT(VR)"`` form used by ``pipeline._format_touch``. + + A zero offset still rewrites the values to themselves and is logged, so the + audit shows the option ran (a per-patient offset of 0 is a legitimate draw). + """ + touched: list[str] = [] + for elem in ds: + if elem.VR == "SQ" and elem.value: + for item in elem.value: + if isinstance(item, Dataset): + touched.extend(shift_dates(item, offset_days)) + continue + if elem.VR not in (_DATE_VR, _DATETIME_VR): + continue + if elem.value in (None, ""): + continue + new_value = _shift_element_value(elem.value, elem.VR, offset_days) + if new_value is None: + continue + elem.value = new_value + touched.append(f"{elem.tag.group:04X},{elem.tag.element:04X}:SHIFT({elem.VR})") + return touched + + +__all__ = ["shift_dates"] diff --git a/dcm_anon/pipeline.py b/dcm_anon/pipeline.py index d858695..3742227 100644 --- a/dcm_anon/pipeline.py +++ b/dcm_anon/pipeline.py @@ -42,6 +42,8 @@ class AnonymizationConfig: allow_sr: bool = False scrub_sr: bool = False sr_profile: str = "default" + date_shift: bool = False + date_shift_max_days: int = 365 progress_cb: ProgressCallback | None = None registry: ActionRegistry = field(default_factory=lambda: DEFAULT_REGISTRY) @@ -73,11 +75,18 @@ def _apply_point_actions( keep: TagSet, mapper: UIDMapper, registry: ActionRegistry, + *, + 
preserve_dates: bool = False, ) -> list[str]: touched: list[str] = [] for tag, action in PHI_TAGS.items(): if tag in keep or tag not in ds: continue + # With date-shift active, DA/DT elements are retained here and shifted + # by a per-patient offset in a later pass (PS3.15 Retain Modified Dates), + # so the longitudinal time axis survives instead of being blanked. + if preserve_dates and ds[tag].VR in ("DA", "DT"): + continue registry[action](ds, tag, mapper) touched.append(_format_touch(tag, action.value)) return touched @@ -146,6 +155,7 @@ def _recurse_into_sequences( registry: ActionRegistry, *, keep_private: bool = False, + preserve_dates: bool = False, ) -> list[str]: touched: list[str] = [] for elem in ds: @@ -153,7 +163,10 @@ def _recurse_into_sequences( continue for item in elem.value: touched.extend( - _scrub_dataset(item, mapper, keep, registry, keep_private=keep_private) + _scrub_dataset( + item, mapper, keep, registry, + keep_private=keep_private, preserve_dates=preserve_dates, + ) ) return touched @@ -165,17 +178,22 @@ def _scrub_dataset( registry: ActionRegistry = DEFAULT_REGISTRY, *, keep_private: bool = False, + preserve_dates: bool = False, ) -> list[str]: """Apply known PHI_TAGS actions, sweep private + unknown person-names, and recurse into nested Sequence items. Deny-by-default: anything private or any - PN VR is removed/blanked unless explicitly kept.""" - touched = _apply_point_actions(ds, keep_tags, mapper, registry) + PN VR is removed/blanked unless explicitly kept. 
With ``preserve_dates`` the + DA/DT elements are left in place for the later per-patient date-shift pass.""" + touched = _apply_point_actions(ds, keep_tags, mapper, registry, preserve_dates=preserve_dates) touched.extend(_scrub_curve_overlay_ranges(ds, keep_tags)) if not keep_private: touched.extend(_strip_private_elements(ds, keep_tags)) touched.extend(_scrub_unknown_person_names(ds, keep_tags)) touched.extend( - _recurse_into_sequences(ds, keep_tags, mapper, registry, keep_private=keep_private) + _recurse_into_sequences( + ds, keep_tags, mapper, registry, + keep_private=keep_private, preserve_dates=preserve_dates, + ) ) return touched @@ -244,11 +262,21 @@ def anonymize_file( allow_sr: bool = False, scrub_sr: bool = False, sr_profile: str = "default", + date_shift: bool = False, + date_shift_max_days: int = 365, ) -> AuditRecord: """Anonymize a single DICOM file and return its audit record.""" keep = keep_tags if keep_tags is not None else frozenset() + if date_shift and not mapper.salt: + raise ValueError( + "date_shift requires a salt: the per-patient offset must be " + "deterministic so a patient's studies stay aligned across files/runs." + ) ds = dcmread(src) original_sop = str(ds.SOPInstanceUID) if hasattr(ds, "SOPInstanceUID") else None + # Capture the patient key BEFORE scrubbing rewrites PatientID/PatientName, + # so the date offset is keyed to the real patient (deterministic per person). + patient_key = str(getattr(ds, "PatientID", "") or getattr(ds, "PatientName", "") or "") # Fail-closed risk detection on the ORIGINAL object (before scrubbing removes # the modality/body-part signals). 
These are channels a header de-identifier @@ -264,10 +292,20 @@ def anonymize_file( allow_sr=allow_sr, ) - touched = _scrub_dataset(ds, mapper, keep, registry, keep_private=keep_private) + touched = _scrub_dataset( + ds, mapper, keep, registry, keep_private=keep_private, preserve_dates=date_shift, + ) _maintain_file_meta_consistency(ds, original_sop, mapper) touched.extend(_scrub_file_meta(ds)) + # Per-patient date shift (PS3.15 Retain Modified Dates). Runs AFTER the scrub + # left the DA/DT elements in place (preserve_dates) and moves every calendar + # date by one deterministic per-patient offset, preserving inter-study gaps. + if date_shift: + from dcm_anon.dateshift import shift_dates + offset = mapper.date_offset_days(patient_key, date_shift_max_days) + touched.extend(shift_dates(ds, offset)) + # SR content-tree pass (free-text / person-names / dates / UIDREFs inside # ContentSequence) on the SAME dataset + SAME UID map. Opt-in via --scrub-sr; # its actions use a distinct vocabulary so they ride a separate audit field, @@ -283,9 +321,19 @@ def anonymize_file( # action log; they are verifiable on the output object and in the manifest. # Claim Clean Structured Content (113104) ONLY when the SR pass actually ran. 
from dcm_anon._version import __version__ as _tool_version - from dcm_anon.provenance import CLEAN_STRUCTURED_CONTENT, write_deid_provenance - sr_codes = [CLEAN_STRUCTURED_CONTENT] if sr_touched else None - write_deid_provenance(ds, _tool_version, keep_private=keep_private, extra_codes=sr_codes) + from dcm_anon.provenance import ( + CLEAN_STRUCTURED_CONTENT, + RETAIN_MODIFIED_DATES, + write_deid_provenance, + ) + extra_codes = [] + if sr_touched: + extra_codes.append(CLEAN_STRUCTURED_CONTENT) + if date_shift: + extra_codes.append(RETAIN_MODIFIED_DATES) + write_deid_provenance( + ds, _tool_version, keep_private=keep_private, extra_codes=extra_codes or None, + ) output_path: str | None if dry_run: @@ -329,6 +377,8 @@ def anonymize_path( allow_sr: bool = False, scrub_sr: bool = False, sr_profile: str = "default", + date_shift: bool = False, + date_shift_max_days: int = 365, progress_cb: ProgressCallback | None = None, config: AnonymizationConfig | None = None, ) -> AuditSummary: @@ -349,6 +399,8 @@ def anonymize_path( allow_sr=allow_sr, scrub_sr=scrub_sr, sr_profile=sr_profile, + date_shift=date_shift, + date_shift_max_days=date_shift_max_days, progress_cb=progress_cb, ) @@ -373,6 +425,8 @@ def anonymize_path( allow_sr=cfg.allow_sr, scrub_sr=cfg.scrub_sr, sr_profile=cfg.sr_profile, + date_shift=cfg.date_shift, + date_shift_max_days=cfg.date_shift_max_days, )) except Exception as exc: err = ProcessingError( diff --git a/dcm_anon/safety.py b/dcm_anon/safety.py index 235fff2..f4895bb 100644 --- a/dcm_anon/safety.py +++ b/dcm_anon/safety.py @@ -19,6 +19,7 @@ """ from __future__ import annotations +import unicodedata from typing import Final from pydicom.dataset import Dataset @@ -46,11 +47,54 @@ ) # Modalities that produce volumetric data from which a face can be reconstructed. 
_FACE_MODALITIES: Final[frozenset[str]] = frozenset({"CT", "MR", "PT", "NM"}) + +# Cranio-facial terms across the languages a European cohort actually labels with +# (EN/ES/FR/DE/IT/PT), accent-stripped so "CRÁNEO"/"CRÂNE"/"SCHÄDEL" all match. +# Substring matching, so stems ("CRANI", "CEREBR") cover their inflections. _FACE_KEYWORDS: Final[tuple[str, ...]] = ( - "HEAD", "BRAIN", "SKULL", "FACE", "SINUS", "ORBIT", "TMJ", "IAC", "CRANI", "NEURO", + # English + "HEAD", "BRAIN", "SKULL", "CRANI", "FACE", "FACIAL", "SINUS", "ORBIT", "TMJ", + "IAC", "NEURO", "CEREBR", "PITUITARY", "SELLA", "MAXILL", "MANDIBL", + "NASOPHARYN", "PAROTID", "TEMPORAL BONE", "PETROUS", "ENCEPHAL", + # Spanish + "CABEZA", "CRANE", "CRANEO", "CEREBRO", "ENCEFAL", "CARA", "SENO", "ORBITA", + "MAXILAR", "MANDIBUL", "HIPOFIS", "FACIAL", + # French + "TETE", "CERVEAU", "VISAGE", "ORBITE", "MAXILLAIRE", "FACE", + # German + "KOPF", "SCHADEL", "GEHIRN", "HIRN", "GESICHT", "NEBENHOHLE", "KIEFER", + # Italian + "TESTA", "CRANIO", "CERVELLO", "VISO", "SENI PARANASALI", + # Portuguese + "CABECA", "ROSTO", "ENCEFALO", +) + +# Body regions that DEMONSTRABLY do not reconstruct a face. Their presence is the +# positive evidence that lets a face-capable modality clear the gate. Anything +# not on this list (empty, coded, or an unrecognised description) does NOT clear +# it — the gate fails closed, the way a head scan with a blank description must. +# Neck/cervical are deliberately absent: they routinely include the lower face. 
+_NON_FACE_BODYPARTS: Final[tuple[str, ...]] = ( + "CHEST", "THORAX", "LUNG", "PULMON", "POUMON", "TORAX", + "ABDOMEN", "ABDOMINAL", "PELVIS", "PELVIC", "PELVI", + "HEART", "CARDIAC", "CARDIO", "CORONARY", "AORTA", "CORAZON", "COEUR", + "LIVER", "HIGADO", "FOIE", "LEBER", "KIDNEY", "RENAL", "RINON", "REIN", + "PROSTATE", "BLADDER", "VEJIGA", "BREAST", "MAMA", "MAMMO", "SEIN", + "COLON", "BOWEL", "RECTUM", "SPLEEN", "PANCREAS", + "KNEE", "RODILLA", "GENOU", "KNIE", "ANKLE", "TOBILLO", "FOOT", "PIE", "PIED", + "HAND", "MANO", "MAIN", "WRIST", "MUNECA", "ELBOW", "CODO", "SHOULDER", + "HOMBRO", "EPAULE", "HIP", "CADERA", "HANCHE", "FEMUR", "TIBIA", "FIBULA", + "HUMERUS", "FOREARM", "EXTREMITY", "EXTREMIT", "LEG", "PIERNA", "ARM", "BRAZO", + "LSPINE", "TSPINE", "LUMBAR", "THORACIC SPINE", "DORSAL", "SACRUM", "COCCYX", ) +def _normalize(text: str) -> str: + """Upper-case and strip diacritics so accented labels match ASCII keywords.""" + decomposed = unicodedata.normalize("NFKD", text) + return "".join(c for c in decomposed if not unicodedata.combining(c)).upper() + + def _attr(ds: Dataset, name: str) -> str: return str(getattr(ds, name, "") or "") @@ -68,13 +112,28 @@ def has_burned_in_risk(ds: Dataset) -> bool: def has_face_risk(ds: Dataset) -> bool: + """Fail-closed recognizable-face gate for volumetric modalities. + + A header de-identifier cannot remove the face a head CT/MR/PET reconstructs, + so the dangerous error here is the FALSE NEGATIVE: passing a cranial study + whose description is blank, coded, or in a language the keyword list missed. + The old "modality + English keyword" rule did exactly that. This version + fails closed instead: on a face-capable modality the gate fires UNLESS there + is positive evidence (an accent-normalised, multilingual body-part match) + that the study is of a non-cranial region. Ambiguity resolves to risk. 
+ """ if _attr(ds, "Modality").upper() not in _FACE_MODALITIES: return False - haystack = " ".join( + haystack = _normalize(" ".join( _attr(ds, a) for a in ("BodyPartExamined", "ProtocolName", "StudyDescription", "SeriesDescription") - ).upper() - return any(keyword in haystack for keyword in _FACE_KEYWORDS) + )) + if any(keyword in haystack for keyword in _FACE_KEYWORDS): + return True + if any(region in haystack for region in _NON_FACE_BODYPARTS): + return False + # Face-capable modality with no positive non-cranial evidence: fail closed. + return True def has_encapsulated_document(ds: Dataset) -> bool: diff --git a/dcm_anon/uid_mapper.py b/dcm_anon/uid_mapper.py index 69cda81..82bc853 100644 --- a/dcm_anon/uid_mapper.py +++ b/dcm_anon/uid_mapper.py @@ -49,6 +49,23 @@ def pseudonym(self, original: str) -> str | None: return None return self._hmac_hex(original)[:_PSEUDONYM_HEX_LEN].upper() + def date_offset_days(self, patient_key: str, max_days: int) -> int: + """Deterministic per-patient date shift in ``[-max_days, +max_days]``. + + Derived from HMAC-SHA256(key=salt, msg=patient_key) so the SAME patient + gets the SAME offset across every file and every run (the property a + longitudinal cohort needs), while remaining unrecoverable without the + salt. Requires a salt — a non-deterministic shift would desynchronise a + patient's studies and is therefore rejected by the caller. 
+ """ + if not self.salt: + raise ValueError("date_offset_days requires a salt (deterministic shift)") + if max_days <= 0: + raise ValueError(f"date-shift window must be positive (got {max_days})") + span = 2 * max_days + 1 + digest_int = int(self._hmac_hex(patient_key)[:_HEX_BYTES_FOR_UID], 16) + return (digest_int % span) - max_days + def size(self) -> int: return len(self._mapping) diff --git a/dcm_anon/verify_output.py b/dcm_anon/verify_output.py index 1432e12..f3be5e9 100644 --- a/dcm_anon/verify_output.py +++ b/dcm_anon/verify_output.py @@ -65,12 +65,15 @@ ) -# Acceptable cleaned values — emitted by dcm-anon for Z (zero/blank) and D -# (dummy placeholder) actions. Anything else is treated as a residual. +# Acceptable cleaned values — restricted to the EXACT placeholders dcm-anon +# emits for its Z (zero/blank) and D actions (see phi_table.PLACEHOLDERS) plus +# structurally-empty values. Free-text words like "ANONYMOUS" / "REMOVED" were +# removed deliberately: the tool never writes them, so accepting them could only +# mask a real residual value that happens to read that way. A tighter accept-set +# means the independent scan errs toward flagging, not toward a false green. _CLEAN_PLACEHOLDERS: Final = frozenset( { - "", " ", "ANON", "ANONYMOUS", "0", "0000", "00000000", "19000101", - "000000.000000", "REMOVED", + "", " ", "ANON", "0", "0000", "00000000", "19000101", "000000.000000", } ) @@ -222,6 +225,8 @@ def _value_is_clean(raw_value: object, label: str) -> tuple[bool, str]: def _scan_metadata( ds: Dataset, file_path: Path, + *, + dates_retained: bool = False, ) -> list[Residual]: """Scan a dataset against the independent PHI tag list. @@ -229,9 +234,16 @@ def _scan_metadata( sequences (e.g. ``RequestAttributesSequence``, ``OriginalAttributesSequence``) is detected. Without this, a manifest could report ``passed=True`` while PHI persists in nested datasets. 
+ + When ``dates_retained`` is set, date-category tags ("(C)") are intentionally + present (shifted by the Retain-Modified-Dates option) and are NOT counted as + residuals; every non-date identifier is still checked, so independence over + names / IDs / institutions is unaffected. """ findings: list[Residual] = [] for group, element, label, hipaa in INDEPENDENT_PHI_TAGS: + if dates_retained and hipaa.startswith("(C)"): + continue tag = (group, element) if tag not in ds: continue @@ -252,7 +264,7 @@ def _scan_metadata( continue for item in elem.value: if isinstance(item, Dataset): - findings.extend(_scan_metadata(item, file_path)) + findings.extend(_scan_metadata(item, file_path, dates_retained=dates_retained)) return findings @@ -331,6 +343,7 @@ def scan_outputs( sample_size: int | None = None, pixel_ocr: bool = False, strict_ocr: bool = True, + dates_retained: bool = False, ) -> VerificationResult: """Independently scan a directory of anonymized DICOMs for PHI residuals. @@ -385,7 +398,7 @@ def scan_outputs( except Exception: continue files_scanned += 1 - residuals.extend(_scan_metadata(ds, path)) + residuals.extend(_scan_metadata(ds, path, dates_retained=dates_retained)) if pixel_ocr and pixel_available: try: residuals.extend(_scan_pixels(ds, path, pytesseract_mod)) diff --git a/docs/index.html b/docs/index.html index e9aa862..6387db5 100644 --- a/docs/index.html +++ b/docs/index.html @@ -38,7 +38,7 @@
Every PS3.15 action (X / Z / U / D) that runs on your study is mapped to the literal text of the regulation that authorizes it — GDPR Art. 4(5), HIPAA Safe Harbor §164.514(b)(2), EU AI Act Art. 10. Re-verified against EUR-Lex / eCFR / gdpr-info.eu on 2026-05-13. SHA-256 chain over the audit log + manifest so an auditor can verify integrity from the JSON alone.
-Package restructure, AI-slop cleanup, PyPI rename to dcm-anon. No behavioural change to anonymisation. Changelog.
+Deterministic per-patient date shifting for longitudinal cohorts (--date-shift, PS3.15 Retain Modified Dates), a genuinely fail-closed multilingual face gate, and a tightened independent verifier. Changelog.
-The OSS CLI is and will stay free. A hosted service (dcm-anon-vault) is in preparation for teams that want hosted deployment, multi-user API keys, DICOM SR content scanning, and retained SHA-256 audit logs. Tiers: Free 50 files/mo, Pro €99/mo, Enterprise from €1,200/mo (pricing.md). No card required at this stage.
+The OSS CLI is and will stay free, and it is the complete, working product today — install it and use it. A hosted service (dcm-anon-vault, self-hostable now via Docker/Fly.io) is in preparation as a managed tier for teams that want hosted deployment, multi-user API keys, DICOM SR content scanning, and retained SHA-256 audit logs. It is not purchasable yet and there is nothing to pay for — paid pricing will be published only when the managed tier and its DPA actually ship. This page is for gauging interest and shaping that roadmap, not for taking your money.
Drop me a line with a one-paragraph context: what you're trying to anonymize, what regulatory regime you're under, and what the gap is today. I read every one and reply within a week.
diff --git a/tests/test_dateshift.py b/tests/test_dateshift.py new file mode 100644 index 0000000..4ab5e47 --- /dev/null +++ b/tests/test_dateshift.py @@ -0,0 +1,149 @@ +"""Deterministic per-patient date shifting (PS3.15 Retain Modified Dates). + +The headline guarantee for longitudinal cohorts: a patient's studies are all +moved by the SAME offset, so inter-study intervals survive while the absolute +calendar position is hidden. These tests pin that property plus the fail-safes +(salt required, provenance stamped, verifier not tripped by retained dates). +""" +from __future__ import annotations + +from datetime import date +from pathlib import Path + +import pytest +from pydicom import dcmread + +from dcm_anon import UIDMapper, anonymize_file, main +from dcm_anon.dateshift import shift_dates +from dcm_anon.pipeline import AnonymizationConfig, anonymize_path +from tests.conftest import _make_synthetic_dcm + +_SALT = "cohort-A-secret" + + +def _variant(path: Path, **attrs: object) -> Path: + _make_synthetic_dcm(path) + ds = dcmread(path) + for key, value in attrs.items(): + setattr(ds, key, value) + ds.save_as(path, enforce_file_format=True) + return path + + +def _anon(src: Path, dst: Path, **kwargs: object) -> Path: + anonymize_file(src, dst, UIDMapper(salt=_SALT), date_shift=True, **kwargs) # type: ignore[arg-type] + return dst + + +class TestDeterministicOffset: + def test_offset_is_stable_per_patient(self) -> None: + m = UIDMapper(salt=_SALT) + assert m.date_offset_days("MRN-1", 365) == m.date_offset_days("MRN-1", 365) + + def test_offset_within_window(self) -> None: + m = UIDMapper(salt=_SALT) + for key in ("MRN-1", "MRN-2", "MRN-3", "abc", "999"): + assert -30 <= m.date_offset_days(key, 30) <= 30 + + def test_distinct_patients_generally_differ(self) -> None: + m = UIDMapper(salt=_SALT) + offsets = {m.date_offset_days(f"MRN-{i}", 365) for i in range(20)} + assert len(offsets) > 1 + + def test_offset_requires_salt(self) -> None: + with 
pytest.raises(ValueError): + UIDMapper().date_offset_days("MRN-1", 365) + + +class TestShiftedDatesInOutput: + def test_dates_shifted_not_removed(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm", PatientID="MRN-1", StudyDate="20231105") + out = _anon(src, tmp_path / "out.dcm") + ds = dcmread(out) + assert "StudyDate" in ds and ds.StudyDate # present, not blanked + m = UIDMapper(salt=_SALT) + offset = m.date_offset_days("MRN-1", 365) + expected = date(2023, 11, 5).toordinal() + offset + assert date(*_ymd(str(ds.StudyDate))).toordinal() == expected + + def test_interval_between_studies_preserved(self, tmp_path: Path) -> None: + a = _variant(tmp_path / "a.dcm", PatientID="MRN-7", StudyDate="20230101") + b = _variant(tmp_path / "b.dcm", PatientID="MRN-7", StudyDate="20230131") # +30d + oa = dcmread(_anon(a, tmp_path / "oa.dcm")) + ob = dcmread(_anon(b, tmp_path / "ob.dcm")) + delta = date(*_ymd(str(ob.StudyDate))).toordinal() - date(*_ymd(str(oa.StudyDate))).toordinal() + assert delta == 30 + + def test_datetime_time_tail_preserved(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm", PatientID="MRN-1", + AcquisitionDateTime="20231105131415.000000") + ds = dcmread(_anon(src, tmp_path / "out.dcm")) + assert str(ds.AcquisitionDateTime)[8:].startswith("131415") + + def test_provenance_code_113107_written(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm", PatientID="MRN-1") + ds = dcmread(_anon(src, tmp_path / "out.dcm")) + codes = {item.CodeValue for item in ds.DeidentificationMethodCodeSequence} + assert "113107" in codes + + +class TestFailSafes: + def test_anonymize_file_requires_salt(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm") + with pytest.raises(ValueError): + anonymize_file(src, tmp_path / "out.dcm", UIDMapper(), date_shift=True) + + def test_cli_date_shift_without_salt_is_usage_error(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm") + rc = main([str(src), str(tmp_path / 
"out"), "--date-shift", "--quiet"]) + assert rc == 2 + + def test_cli_date_shift_with_salt_succeeds(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm", StudyDescription="Chest CT") + rc = main([str(src), str(tmp_path / "out"), + "--date-shift", "--salt", _SALT, "--quiet"]) + assert rc == 0 + + def test_cli_date_shift_with_hipaa_manifest_is_rejected(self, tmp_path: Path) -> None: + # Retaining dates under a Safe-Harbor manifest would be a false claim. + src = _variant(tmp_path / "in.dcm", StudyDescription="Chest CT") + rc = main([str(src), str(tmp_path / "out"), "--date-shift", "--salt", _SALT, + "--manifest-mode", "hipaa", "--quiet"]) + assert rc == 2 + + def test_cli_date_shift_with_gdpr_manifest_is_allowed(self, tmp_path: Path) -> None: + src = _variant(tmp_path / "in.dcm", StudyDescription="Chest CT") + rc = main([str(src), str(tmp_path / "out"), "--date-shift", "--salt", _SALT, + "--manifest-mode", "gdpr", "--verify-output", "--quiet"]) + assert rc == 0 + + +class TestPipelineInteg: + def test_path_level_shift_consistent_across_directory(self, tmp_path: Path) -> None: + src = tmp_path / "in" + src.mkdir() + _variant(src / "v1.dcm", PatientID="MRN-9", StudyDate="20200101") + _variant(src / "v2.dcm", PatientID="MRN-9", StudyDate="20200301") + cfg = AnonymizationConfig(salt=_SALT, date_shift=True) + anonymize_path(src, tmp_path / "out", config=cfg) + d1 = dcmread(tmp_path / "out" / "v1.dcm") + d2 = dcmread(tmp_path / "out" / "v2.dcm") + # 2020 is a leap year: Jan 1 -> Mar 1 is 60 days. 
+ delta = date(*_ymd(str(d2.StudyDate))).toordinal() - date(*_ymd(str(d1.StudyDate))).toordinal() + assert delta == 60 + + def test_nested_sequence_dates_are_shifted(self, tmp_path: Path) -> None: + offset = 10 + from pydicom.dataset import Dataset + from pydicom.sequence import Sequence + ds = Dataset() + item = Dataset() + item.StudyDate = "20230101" + ds.RequestAttributesSequence = Sequence([item]) + touched = shift_dates(ds, offset) + assert str(ds.RequestAttributesSequence[0].StudyDate) == "20230111" + assert any("SHIFT(DA)" in t for t in touched) + + +def _ymd(da: str) -> tuple[int, int, int]: + return int(da[:4]), int(da[4:6]), int(da[6:8]) diff --git a/tests/test_safety_gates.py b/tests/test_safety_gates.py index d2e462b..89c3ed0 100644 --- a/tests/test_safety_gates.py +++ b/tests/test_safety_gates.py @@ -44,6 +44,30 @@ def test_head_mr_triggers_face_risk(self, tmp_path: Path) -> None: src = _variant(tmp_path, Modality="MR", StudyDescription="Brain MRI without contrast") assert RISK_FACE in _risks(src) + def test_face_modality_with_blank_description_fails_closed(self, tmp_path: Path) -> None: + # The false-negative the old keyword gate let through: a head MR whose + # description is empty. With no positive non-cranial evidence the gate + # must fail closed. + src = _variant(tmp_path, Modality="MR", StudyDescription="", + SeriesDescription="", ProtocolName="", BodyPartExamined="") + assert RISK_FACE in _risks(src) + + def test_face_modality_with_coded_protocol_fails_closed(self, tmp_path: Path) -> None: + src = _variant(tmp_path, Modality="CT", StudyDescription="", + SeriesDescription="", ProtocolName="PROT-4471", BodyPartExamined="") + assert RISK_FACE in _risks(src) + + def test_spanish_cranial_description_triggers_face_risk(self, tmp_path: Path) -> None: + # Accent-stripped multilingual match: "Cráneo" / "Cerebro" must fire. 
+ src = _variant(tmp_path, Modality="MR", StudyDescription="", + SeriesDescription="RM de cráneo y cerebro", BodyPartExamined="") + assert RISK_FACE in _risks(src) + + def test_non_cranial_bodypart_clears_face_gate(self, tmp_path: Path) -> None: + src = _variant(tmp_path, Modality="CT", StudyDescription="", + SeriesDescription="", ProtocolName="", BodyPartExamined="ABDOMEN") + assert RISK_FACE not in _risks(src) + def test_encapsulated_pdf_triggers_risk(self, tmp_path: Path) -> None: src = _variant(tmp_path, SOPClassUID=_ENCAPSULATED_PDF_SOP) assert RISK_ENCAPSULATED in _risks(src) diff --git a/tests/test_verify_output.py b/tests/test_verify_output.py index 8c26409..d426812 100644 --- a/tests/test_verify_output.py +++ b/tests/test_verify_output.py @@ -82,6 +82,32 @@ def _write_clean_with_dirty_sequence(path: Path) -> Path: return path +def _write_shifted_dates_with_dirty_name(path: Path) -> Path: + """A retained-dates output: real (shifted) StudyDate present, plus a PHI name + that MUST still be caught — date retention must not blind the scan to names.""" + ds = _base_dataset(path) + ds.StudyDate = "20240312" # intentionally retained (shifted) date + ds.PatientName = "DOE^JOHN" # genuine residual PHI + ds.save_as(path, enforce_file_format=True) + return path + + +def test_retained_dates_not_flagged_but_names_still_are(tmp_path: Path) -> None: + _write_shifted_dates_with_dirty_name(tmp_path / "a.dcm") + res = scan_outputs(tmp_path, dates_retained=True) + cats = {r.hipaa_category for r in res.residuals} + # Shifted dates are intentional under Retain-Modified-Dates: not residuals. + assert not any(c.startswith("(C)") for c in cats) + # The name is real PHI and independence over names is preserved. 
+ assert any(c.startswith("(A)") for c in cats) + + +def test_dates_flagged_by_default(tmp_path: Path) -> None: + _write_shifted_dates_with_dirty_name(tmp_path / "a.dcm") + res = scan_outputs(tmp_path) # dates_retained defaults to False + assert any(r.hipaa_category.startswith("(C)") for r in res.residuals) + + class _FakeTesseract: """Stand-in for pytesseract: a usable version probe + scripted OCR text.""" @@ -141,13 +167,24 @@ def test_value_excerpt_is_truncated(tmp_path: Path) -> None: # --------------------------------------------------------------------------- # # Cleanliness helpers # --------------------------------------------------------------------------- # -@pytest.mark.parametrize("placeholder", ["", " ", "ANON", "anonymous", "19000101", "REMOVED"]) +@pytest.mark.parametrize("placeholder", ["", " ", "ANON", "anon", "19000101", "0"]) def test_placeholders_are_clean(placeholder: str) -> None: clean, excerpt = vo._value_is_clean(placeholder, "PatientName") assert clean assert excerpt == "" +@pytest.mark.parametrize("residual", ["ANONYMOUS", "REMOVED", "Smith^John"]) +def test_freetext_words_are_flagged_not_silently_cleaned(residual: str) -> None: + """The accept-set is restricted to the exact placeholders the tool emits. 
+ Words it never writes (ANONYMOUS / REMOVED) must be FLAGGED, not waved + through by a loose string match — that is the false-green the independent + scan exists to prevent.""" + clean, excerpt = vo._value_is_clean(residual, "PatientName") + assert not clean + assert excerpt == residual + + def test_none_value_is_clean() -> None: assert vo._value_is_clean(None, "PatientName") == (True, "") diff --git a/tests/test_version_coherence.py b/tests/test_version_coherence.py index eec6c97..d16e7b0 100644 --- a/tests/test_version_coherence.py +++ b/tests/test_version_coherence.py @@ -7,6 +7,7 @@ """ from __future__ import annotations +import ast import re from pathlib import Path @@ -15,6 +16,36 @@ ROOT = Path(__file__).resolve().parent.parent +def _count_test_functions() -> int: + """Count every ``def test_*`` (module-level or method) across tests/. + + Static AST count — no pytest run — so it is the same number a reviewer gets + from ``grep -rc 'def test_'`` and cannot be gamed by import side effects. + """ + total = 0 + for path in sorted((ROOT / "tests").glob("test_*.py")): + tree = ast.parse(path.read_text(encoding="utf-8")) + for node in ast.walk(tree): + if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name.startswith("test_"): + total += 1 + return total + + +def test_readme_test_count_matches_suite() -> None: + """The README's advertised test-function count must equal the real count. + + The original '197 tests' claim drifted from the actual suite; this guard + makes the easy-to-verify number provably honest, since the credibility of + every hard-to-verify claim rests on the easy ones being right. 
+ """ + readme = (ROOT / "README.md").read_text(encoding="utf-8") + match = re.search(r"(\d+) test functions", readme) + assert match is not None, "README must state 'N test functions'" + claimed = int(match.group(1)) + actual = _count_test_functions() + assert claimed == actual, f"README claims {claimed} test functions, suite has {actual}" + + def test_changelog_latest_versioned_heading_matches() -> None: text = (ROOT / "CHANGELOG.md").read_text(encoding="utf-8") match = re.search(r"^## \[(\d+\.\d+\.\d+)\]", text, re.MULTILINE) @@ -24,6 +55,17 @@ def test_changelog_latest_versioned_heading_matches() -> None: ) +def test_citation_cff_version_matches() -> None: + """CITATION.cff is what a citing paper copies; a stale version there is the + exact provenance drift this tool sells against. Pin it to _version.""" + text = (ROOT / "CITATION.cff").read_text(encoding="utf-8") + match = re.search(r'^version:\s*"?(\d+\.\d+\.\d+)"?', text, re.MULTILINE) + assert match is not None, "no 'version:' field in CITATION.cff" + assert match.group(1) == __version__, ( + f"CITATION.cff version {match.group(1)!r} != _version {__version__!r}" + ) + + def test_landing_page_shows_current_version() -> None: html = (ROOT / "docs" / "index.html").read_text(encoding="utf-8") assert f"v{__version__}" in html, (