DetectionForge

Detection-as-code pipeline with measured precision/recall against real OTRF attack captures.

by Aadarsh Kadam · github.com/aadarshkadam067

What this is

DetectionForge is a detection-as-code pipeline that treats SIEM rules like software: version-controlled, automatically tested against real OTRF attack captures, and auto-converted to three SIEM backends. Every rule ships with measured precision and recall against a 1,648-event corpus (1,164 process-creation events + 484 registry events) drawn from OTRF Security-Datasets ZIPs, with no synthetic events and no hand-crafted fixtures. The pipeline ships 20 rules covering 21 ATT&CK techniques across 8 tactics, with mean precision 0.997 and 60/60 multi-SIEM conversion success across Splunk SPL, Elastic EQL, and Microsoft Sentinel KQL. (The 21st technique is T1027 Obfuscated Files or Information: the T1059.001 PowerShell -EncodedCommand rule legitimately covers both T1059.001 and T1027.)

Each rule's precision figure is classified as earned (14 rules — the benign corpus contains same-shape events the rule correctly excludes) or structural-absence (6 rules — precision = 1.000 because the corpus contains no events the rule could match; shipped under the T1218 LOLBIN-cluster rationale). The classification is in dist/data/dashboard_meta.json per rule. Three more techniques are documented as Phase 5 deferrals rather than shipped with structurally guaranteed measurements (see Phase 5 backlog below). The same standard applies everywhere: a 1.000 number is never shown without its provenance.

Dashboard

The dashboard presents the detection corpus, the precision measurements, and the documented gaps as a single static site. It is served by any static HTTP server pointed at the dist/ directory: no build step, no backend, and four JSON contract files read at runtime.

Overview — headline metrics. The earned-vs-structural-absence split renders at the same prominence as the mean-precision figure; the honesty meter in the sidebar surfaces the 14/6 split, conversion tally, and FP count.

Rules — sortable, filterable, expandable table of all 20 rules. T1059.001's 0.933 precision and single false positive surface inline (highlighted), not rounded away. Filter by classification or logsource; expand any row for the Sigma source and the three converted backend queries.

Coverage — ATT&CK tactic-column grid. 26 direct technique cells and 16 parent-rollup cells (dashed). Every non-rollup cell carries its classification chip; T1059.001 and T1027 render with an FP background because the underlying rule has one.

Trends — single-snapshot state. first_snapshot_date equals built_date, so each chart renders the current value as a horizontal level with a single marker. Successive forge build runs append points and the sparklines widen automatically.

Gaps — three deferred techniques with class (Class 1 / Class 3), prerequisite type, and reason from dashboard_meta.json. The structural-absence inventory continues below this fold. Named, not hidden.

Verify in 5 minutes

git clone https://github.com/aadarshkadam067/DetectionForge.git
cd DetectionForge
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev,convert]'
python scripts/update_attack_data.py   # generates forge/data/attack_techniques.json
forge run                               # lint → test → convert → score → build

Or run the full pipeline in Docker:

docker compose -f docker/docker-compose.yml run --rm forge

Expected forge test summary:

20 rules — mean precision 0.997  mean recall 1.000  mean F1 0.998

Expected forge convert summary:

60/60 conversions succeeded — success rate 100.0% (threshold 95%)

ATT&CK coverage layer is written to reports/layer.json and is loadable directly into the public Navigator at https://mitre-attack.github.io/attack-navigator/ via Open Existing Layer → Upload from Local.

View the dashboard locally

The static measurement dashboard at dist/ consumes the JSON contract files in dist/data/. forge run must complete first — the JSON files are not committed (they're build output), so a fresh clone serves an empty dashboard until the pipeline has populated dist/data/. Then:

forge run                              # populates dist/data/*.json
cd dist && python3 -m http.server 8000 # any static server works
# open http://127.0.0.1:8000/

If you open the dashboard before running the pipeline, it renders an explicit error card pointing at this same command — by design, so the failure mode is legible instead of a silent empty UI.
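The serve order above can be checked before starting a server. The following is a minimal sketch, assuming the four contract file names listed in the Architecture section; the helper name `missing_contract_files` and its default argument are illustrative and not part of the forge CLI.

```python
from pathlib import Path

# File names taken from the Architecture section of this README.
# The helper itself is hypothetical, not part of the forge package.
CONTRACT_FILES = (
    "dashboard_meta.json",
    "results.json",
    "conversion_matrix.json",
    "layer.json",
)


def missing_contract_files(data_dir: str = "dist/data") -> list[str]:
    """Return the contract files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in CONTRACT_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_contract_files()
    if missing:
        print("run `forge run` first; missing:", ", ".join(missing))
    else:
        print("dist/data is populated; dist/ can be served")
```

On a fresh clone this would list all four files, which matches the state in which the dashboard shows its error card.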

The dashboard uses in-browser Babel transformation, which produces one console warning at page load. This is intentional — it preserves the no-build-step deployment property documented in PDR §9: a production-deployable static asset with zero toolchain dependency. The dashboard ships as plain HTML/CSS/JSX read directly from disk.

Architecture

┌──────────────────┐   ┌──────────────────┐   ┌──────────────────────────┐
│  rules/*.yml     │   │  data/attack/    │   │  data/benign/...         │
│  (Sigma rules)   │   │  (TP fixtures)   │   │  data/registry_baseline  │
└────────┬─────────┘   └────────┬─────────┘   └────────────┬─────────────┘
         │                      │                          │
         └──────────┬───────────┴────────────┬─────────────┘
                    ▼                        ▼
            ┌───────────────┐        ┌───────────────┐
            │  forge lint   │        │  forge test   │
            │  (Stage 1)    │        │  (Stage 2)    │
            └───────┬───────┘        └───────┬───────┘
                    │                        │
                    ▼                        ▼
            ┌───────────────┐        ┌───────────────┐
            │ forge convert │        │ forge score   │
            │  (Stage 3)    │        │  (Stage 4)    │
            └───────┬───────┘        └───────┬───────┘
                    │                        │
                    ▼                        ▼
        ┌─────────────────────────┐  ┌────────────────────────┐
        │ reports/converted/<r>/  │  │ reports/results.json   │
        │   splunk.spl            │  │ reports/layer.json     │
        │   elastic.eql           │  │ reports/conv_matrix... │
        │   sentinel.kql          │  │                        │
        └────────────┬────────────┘  └────────────┬───────────┘
                     │                            │
                     └─────────────┬──────────────┘
                                   ▼
                          ┌────────────────┐
                          │  forge build   │
                          │  (Stage 5)     │
                          └────────┬───────┘
                                   ▼
                          ┌────────────────────────────────┐
                          │  dist/data/                    │
                          │    dashboard_meta.json         │
                          │    results.json                │
                          │    conversion_matrix.json      │
                          │    layer.json                  │
                          └────────────────────────────────┘

Stage	Command	Inputs	Outputs
1. Lint	`forge lint`	`rules/**/*.yml`	pass/fail with errors
2. Test	`forge test`	rules + fixtures + baselines	`reports/results.json`
3. Convert	`forge convert`	rules	`reports/converted/<slug>/{splunk.spl, elastic.eql, sentinel.kql}` + `reports/conversion_matrix.json`
4. Score	`forge score`	`reports/results.json`	`reports/layer.json` (ATT&CK Navigator v4.5)
5. Build	`forge build`	all reports + rule YAMLs + `_meta.yml`	`dist/data/{dashboard_meta, results, conversion_matrix, layer}.json`

Note on Stage 5. forge build produces JSON data artifacts only — no HTML. The dashboard at dist/ is committed source (HTML/CSS/JSX, no build step) that consumes those JSON files at runtime. See PDR §9 and journey Entry 4.9 for the decision record and the rebuild rationale.
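The stage ordering in the table can be sketched as a small sequential runner. This is an illustration only: it assumes each `forge` subcommand signals failure through a non-zero exit code, and `run_stages` and `shell_runner` are hypothetical names. The project's own `forge run` is the real entry point.

```python
import subprocess
from typing import Callable, Sequence

# Stage order as listed in the stage table.
STAGES: tuple[tuple[str, ...], ...] = (
    ("forge", "lint"),
    ("forge", "test"),
    ("forge", "convert"),
    ("forge", "score"),
    ("forge", "build"),
)


def run_stages(run: Callable[[Sequence[str]], int]) -> list[str]:
    """Run each stage in order and stop at the first non-zero exit code.

    Returns the names of the stages that completed successfully.
    """
    completed: list[str] = []
    for cmd in STAGES:
        if run(cmd) != 0:
            break
        completed.append(cmd[1])
    return completed


def shell_runner(cmd: Sequence[str]) -> int:
    """Execute a command in a subprocess and return its exit code."""
    return subprocess.run(list(cmd)).returncode
```

Calling `run_stages(shell_runner)` inside the activated virtualenv would invoke the five commands in order. Passing a different callable makes the ordering logic testable without the CLI installed.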

Measurements

Precision / Recall Table (all 20 rules)

The Class column distinguishes earned precision (the corpus contained same-shape events the rule correctly excluded — discrimination is real) from struct-abs (the corpus contained no events the rule could match — precision = 1.000 reflects absence, not discrimination; shipped under the T1218 LOLBIN-cluster rationale where applicable). See journey Entry 4.7 for the per-rule classification audit.

Technique	Tactic	Logsource	P	R	F1	TP	FP	Class
T1003.001	Cred Access	process_creation	1.000	1.000	1.000	2	0	earned
T1003.002	Cred Access	process_creation	1.000	1.000	1.000	1	0	struct-abs
T1021.006	Lateral Movement	process_creation	1.000	1.000	1.000	3	0	earned
T1033	Discovery	process_creation	1.000	1.000	1.000	2	0	earned
T1053.005	Persistence	process_creation	1.000	1.000	1.000	2	0	earned
T1055.001	Defense Evasion	process_creation	1.000	1.000	1.000	1	0	earned
T1059.001	Execution	process_creation	0.933	1.000	0.965	14	1	earned
T1059.005	Execution	process_creation	1.000	1.000	1.000	1	0	struct-abs
T1087.001	Discovery	process_creation	1.000	1.000	1.000	8	0	earned
T1105	C2	process_creation	1.000	1.000	1.000	1	0	earned
T1136.001	Persistence	process_creation	1.000	1.000	1.000	4	0	earned
T1218.001	Defense Evasion	process_creation	1.000	1.000	1.000	1	0	struct-abs
T1218.004	Defense Evasion	process_creation	1.000	1.000	1.000	1	0	struct-abs
T1218.005	Defense Evasion	process_creation	1.000	1.000	1.000	2	0	struct-abs
T1218.010	Defense Evasion	process_creation	1.000	1.000	1.000	1	0	earned
T1218.013	Defense Evasion	process_creation	1.000	1.000	1.000	1	0	struct-abs
T1220	Defense Evasion	process_creation	1.000	1.000	1.000	1	0	earned
T1547.001	Persistence	registry_event	1.000	1.000	1.000	3	0	earned
T1548.002	Priv Esc	process_creation	1.000	1.000	1.000	1	0	earned
T1562.004	Defense Evasion	process_creation	1.000	1.000	1.000	4	0	earned

Aggregate: 20 rules · mean precision 0.997 · mean recall 1.000 · mean F1 0.998 · 14 earned / 6 structural-absence.
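The aggregate line follows from the TP and FP columns by ordinary arithmetic. The sketch below recomputes it from the counts in the table, using the standard definitions of precision, recall, and F1. It is not the harness code, and FN is taken as 0 for every rule because the table reports recall 1.000 throughout.

```python
from statistics import mean

# (TP, FP) per rule, copied from the table. FN is 0 for every rule
# because recall is 1.000 throughout.
COUNTS = {
    "T1003.001": (2, 0), "T1003.002": (1, 0), "T1021.006": (3, 0),
    "T1033": (2, 0), "T1053.005": (2, 0), "T1055.001": (1, 0),
    "T1059.001": (14, 1), "T1059.005": (1, 0), "T1087.001": (8, 0),
    "T1105": (1, 0), "T1136.001": (4, 0), "T1218.001": (1, 0),
    "T1218.004": (1, 0), "T1218.005": (2, 0), "T1218.010": (1, 0),
    "T1218.013": (1, 0), "T1220": (1, 0), "T1547.001": (3, 0),
    "T1548.002": (1, 0), "T1562.004": (4, 0),
}


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); assumes the rule matched at least one event."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); assumes at least one positive fixture event."""
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


per_rule = {t: (precision(tp, fp), recall(tp, 0)) for t, (tp, fp) in COUNTS.items()}
mean_precision = mean(p for p, _ in per_rule.values())
mean_recall = mean(r for _, r in per_rule.values())
mean_f1 = mean(f1(p, r) for p, r in per_rule.values())

print(f"{len(COUNTS)} rules: mean precision {mean_precision:.3f}, "
      f"mean recall {mean_recall:.3f}, mean F1 {mean_f1:.3f}")
```

The single false positive on T1059.001 (14 / 15 = 0.933) is what moves the mean precision from 1.000 to 0.997.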

Corpus: 1,648 events across two logsources —

1,164-event process-creation baseline (Sysmon EID 1) drawn from 121 OTRF atomic + compound Windows ZIPs, deduplicated on (Image, CommandLine, ParentImage, Computer) and filtered via the structural attack-event exclusion (ADR-005).
484-event registry baseline (Sysmon EID 12/13) drawn from 4 OTRF source ZIPs (empire_wmi_local_event_subscriptions_elevated_user, empire_schtasks_creation_execution_elevated_user, covenant_dcom_iertutil_dll_hijack, empire_dcom_shellwindows_stager), with explicit exclusion of \Microsoft\Windows\CurrentVersion\Run\ and \Microsoft\Windows\CurrentVersion\RunOnce\ paths (the labeling axiom — exclude exactly what the rule detects). Built in Phase 4 Day 3 via the ADR-006 per-logsource baseline routing.

48 structural-filter signatures total (46 process_creation + 3 registry_event, with overlap accounted for). All 20 rules clear their expected.precision threshold. The lone non-1.000 figure (T1059.001 at 0.933) is a documented cross-technique FP, held to the same standard as the T1087.001 Before/After case below. Details in journey Entries 4.3 / 4.6 / 4.7.

Multi-SIEM Conversion Matrix

All 20 rules × 3 backends = 60/60 conversions succeeded (100%). Backends: pySigma-backend-splunk 2.1.0, pySigma-backend-elasticsearch 2.0.2, pySigma-backend-kusto 1.0.1. Two logsource categories convert cleanly:

Logsource	Rules	Conversions
`process_creation`	19	57/57
`registry_event`	1 (T1547.001)	3/3
Total	20	60/60
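The totals and the 95% threshold quoted in the expected `forge convert` summary reduce to a short tally. A minimal sketch, assuming the per-logsource counts from the table; the function names are illustrative, not the pipeline's own.

```python
# (succeeded, attempted) per logsource category, from the table.
MATRIX = {
    "process_creation": (57, 57),
    "registry_event": (3, 3),
}


def success_rate(matrix: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Return (succeeded, attempted, rate) summed over all categories."""
    ok = sum(succeeded for succeeded, _ in matrix.values())
    total = sum(attempted for _, attempted in matrix.values())
    return ok, total, ok / total


def meets_threshold(matrix: dict[str, tuple[int, int]], threshold: float = 0.95) -> bool:
    """True when the overall conversion success rate reaches the threshold."""
    return success_rate(matrix)[2] >= threshold


ok, total, rate = success_rate(MATRIX)
print(f"{ok}/{total} conversions succeeded, success rate {rate:.1%}")
```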

Per-rule per-backend status and the actual emitted query strings are in reports/conversion_matrix.json and reports/converted/<rule>/{splunk.spl, elastic.eql, sentinel.kql}. Trailing-backslash semantics on the registry logsource verified correct across all three backends — \CurrentVersion\Run\ does not match \RunTime or \RunOnce on the \Run\ condition. See docs/measurements/phase4-day3-registry-harness.md for the verification.
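The trailing-backslash behaviour can be illustrated with a plain substring check. The sketch below assumes case-insensitive contains-matching. The registry paths are made-up examples, not corpus events.

```python
def contains_ci(value: str, needle: str) -> bool:
    """Case-insensitive substring match (assumed matching semantics)."""
    return needle.lower() in value.lower()


# Escaped backslashes: a raw string literal cannot end in a backslash.
WITH_SLASH = "\\CurrentVersion\\Run\\"
WITHOUT_SLASH = "\\CurrentVersion\\Run"

# Hypothetical TargetObject values for illustration.
PATHS = {
    "run": "HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run\\Updater",
    "runonce": "HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\RunOnce\\Setup",
    "runtime": "HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\RunTime\\Config",
}

for label, path in PATHS.items():
    print(label, contains_ci(path, WITH_SLASH), contains_ci(path, WITHOUT_SLASH))
```

With the trailing backslash only the `Run\` key matches. Without it, all three paths match, which is the over-match the verification rules out.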

T1087.001 — Before/After Defect Fix (methodology evidence)

The harness caught a rule defect during corpus expansion: T1087.001's CommandLine|contains: user condition fired on net user /add commands from T1136.001 (Create Account), producing two cross-technique false positives. The defect was logged at detection time (Phase 3 Day 2), deliberately retained in the measurement, and fixed in a later commit — providing a reproducible before/after delta.

State	Precision	Recall	F1	FPs	Defect
Before fix (Day 2)	0.800	1.000	0.889	2	`user` keyword matched `net user /add`
After fix (Day 4)	1.000	1.000	1.000	0	`filter_account_creation: CommandLine|contains: ' /add '`

The fix is a 4-line YAML change; the harness measures the improvement. This is the iterative-improvement loop the project was built to demonstrate. Full chain of evidence in docs/measurements/phase3-day4-t1087-fix.md.
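The defect and its fix can be reproduced in a few lines. This sketch models only the `user` keyword condition and the `' /add '` filter named above, not the full rule, and it assumes case-insensitive matching. The command lines are hypothetical examples rather than corpus events.

```python
def fires_before(cmdline: str) -> bool:
    """Pre-fix condition: CommandLine|contains: user."""
    return "user" in cmdline.lower()


def fires_after(cmdline: str) -> bool:
    """Post-fix: same selection, minus the ' /add ' account-creation filter."""
    lowered = cmdline.lower()
    return "user" in lowered and " /add " not in lowered


# Hypothetical command lines for illustration only.
DISCOVERY = ["net user", "net user /domain"]
ACCOUNT_CREATION = [
    "net user /add tmpacct Passw0rd1",
    "net user /add svc_x Passw0rd1 /y",
]

fp_before = sum(fires_before(c) for c in ACCOUNT_CREATION)
fp_after = sum(fires_after(c) for c in ACCOUNT_CREATION)
tp_after = sum(fires_after(c) for c in DISCOVERY)

# Precision using the counts reported in the table: 8 TPs, 2 FPs before the fix.
precision_before = 8 / (8 + 2)
precision_after = 8 / (8 + 0)
print(fp_before, fp_after, tp_after, precision_before, precision_after)
```

The filter removes the account-creation matches while leaving the discovery commands detected, which is why recall stays at 1.000 across the fix.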

Methodology

Corpus design

All positive fixtures (TP events) come from real OTRF Security-Datasets captures, never from synthetic or hand-authored events. Dataset selection is logged with an accept/reject decision and reason in data/captures/_decisions.md — the evidentiary record for "how did you choose your test data?"

The 1,164-event process-creation baseline was assembled from 121 OTRF atomic + compound Windows ZIPs (Phase 4 Day 1 corpus build), deduplicated on (Image, CommandLine, ParentImage, Computer) using scripts/rebuild_baseline.py. A structural attack-event filter (ADR-005) then removed events whose hash(Image, CommandLine, ParentImage) matched any positive fixture. As new attack fixtures land, the filter is re-run; the baseline shrinks idempotently as cross-dataset attack-shape leaks are caught.
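The two steps described above, deduplication and the structural attack-event filter, can be sketched as follows. This is a simplified illustration using the field tuples named in the text. It is not the code in scripts/rebuild_baseline.py, and the function names are assumptions.

```python
from typing import Iterable

DEDUP_KEY = ("Image", "CommandLine", "ParentImage", "Computer")
ATTACK_SIGNATURE = ("Image", "CommandLine", "ParentImage")


def signature(event: dict, fields: tuple[str, ...]) -> tuple[str, ...]:
    """Build a hashable identity from the chosen fields (missing -> '')."""
    return tuple(str(event.get(f, "")) for f in fields)


def dedupe(events: Iterable[dict]) -> list[dict]:
    """Keep the first event for each (Image, CommandLine, ParentImage, Computer)."""
    seen: set[tuple[str, ...]] = set()
    unique = []
    for event in events:
        key = signature(event, DEDUP_KEY)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def structural_filter(baseline: Iterable[dict], positives: Iterable[dict]) -> list[dict]:
    """Drop baseline events whose attack signature matches any positive fixture."""
    attack = {signature(e, ATTACK_SIGNATURE) for e in positives}
    return [e for e in baseline if signature(e, ATTACK_SIGNATURE) not in attack]
```

Because the filter only removes events, running it again with the same fixtures returns the same baseline. That is the idempotence property the paragraph relies on when new attack fixtures land.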

The 484-event registry baseline (Sysmon EID 12/13) was built in Phase 4 Day 3 by scripts/extract_registry_baseline.py from four OTRF source ZIPs. The extraction excludes \Microsoft\Windows\CurrentVersion\Run\ and \Microsoft\Windows\CurrentVersion\RunOnce\ paths up front (the labeling axiom — the rule's detection target is exactly what is excluded; nothing more, nothing less). The generalized structural filter then re-applies post-fixture using (TargetObject, Details, Image) as the identity signature. See ADR-006 for the per-logsource routing decision and docs/measurements/phase4-day3-registry-harness.md for the build record.
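The up-front exclusion and the registry identity signature might look like the following. This is a sketch under stated assumptions, not the contents of scripts/extract_registry_baseline.py: the function names are illustrative, matching is assumed case-insensitive, and events are assumed to be dicts with Sysmon field names.

```python
REGISTRY_EVENT_IDS = {12, 13}

# The two path fragments excluded up front (the labeling axiom).
EXCLUDED_PATHS = (
    "\\Microsoft\\Windows\\CurrentVersion\\Run\\",
    "\\Microsoft\\Windows\\CurrentVersion\\RunOnce\\",
)


def is_baseline_candidate(event: dict) -> bool:
    """True for EID 12/13 events that do not touch an excluded path."""
    if event.get("EventID") not in REGISTRY_EVENT_IDS:
        return False
    target = str(event.get("TargetObject", "")).lower()
    return not any(path.lower() in target for path in EXCLUDED_PATHS)


def registry_signature(event: dict) -> tuple[str, str, str]:
    """Identity used by the structural filter: (TargetObject, Details, Image)."""
    return (
        str(event.get("TargetObject", "")),
        str(event.get("Details", "")),
        str(event.get("Image", "")),
    )
```

Excluding exactly the two paths the rule detects keeps every other registry write in the baseline, so any false positive the rule produces on unrelated keys would still be measured.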

Precision is only meaningful for rules whose baseline contains events of the same logsource the rule selects on. The harness enforces this via ADR-006: forge/__init__.py defines BASELINE_MAP (a category → file registry) and REGISTERED_LOGSOURCE_CATEGORIES (its keyset). At test time, rules with no explicit fixtures.negative are auto-routed by their logsource.category. At lint time, a rule with no negative fixture and no registered category is rejected before it can be measured (see tests/test_lint.py for the three guard cases). Multi-logsource expansion is now "add a baseline file + add a dispatch-table entry" — no other changes.
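The routing and the lint-time guard can be pictured with a small dispatch table. Only the names `BASELINE_MAP` and `REGISTERED_LOGSOURCE_CATEGORIES` come from the text. The file paths, the `resolve_negative_fixture` helper, and the parsed-rule dict shape are assumptions made for illustration.

```python
# Category -> baseline file. Both paths are assumed for this sketch;
# the real registry lives in forge/__init__.py.
BASELINE_MAP = {
    "process_creation": "data/benign/workstation_baseline.json",
    "registry_event": "data/benign/registry_baseline.json",  # hypothetical path
}
REGISTERED_LOGSOURCE_CATEGORIES = frozenset(BASELINE_MAP)


def resolve_negative_fixture(rule: dict) -> str:
    """Pick the benign baseline for a rule, or reject it as unmeasurable.

    An explicit forge.fixtures.negative wins. Otherwise the rule is routed
    by logsource.category, and an unregistered category is an error.
    """
    explicit = rule.get("forge", {}).get("fixtures", {}).get("negative")
    if explicit:
        return explicit
    category = rule.get("logsource", {}).get("category")
    if category not in REGISTERED_LOGSOURCE_CATEGORIES:
        raise ValueError(
            f"no negative fixture and no registered baseline for {category!r}"
        )
    return BASELINE_MAP[category]
```

Under this shape, supporting a new logsource is one new entry in the mapping plus the baseline file it points to, which is the expansion path the paragraph describes.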

Note on legacy figures. The pre-Phase-4 README reported a 1,481-event baseline. That figure was a raw concatenation count that included 419 cross-dataset duplicate events; the correct unique-event count for the pre-Phase-4 corpus was approximately 1,051. The Phase 4 rebuild corrects this and adds the registry baseline. See Entry 4.3 in the project journey for the accounting.

Rule format

Rules are authored in Sigma format with a forge: extension block for fixture paths, expected metrics, and multi-SIEM flags. All rules are Pydantic v2 validated on lint.

# Example — rules/windows/credential_access/T1003_001_lsass_dump_comsvcs.yml
detection:
  selection:
    EventID: 1
    Image|endswith: '\rundll32.exe'
    CommandLine|contains|all:
      - 'comsvcs'
      - 'MiniDump'
  condition: selection

forge:
  fixtures:
    positive: data/attack/T1003.001/events.json
    negative: data/benign/workstation_baseline.json
  expected:
    precision: 0.95
    recall: 1.00
  multi_siem: true

The test harness (forge/test_harness.py) evaluates rules against fixture events using a hand-rolled field-matcher supporting four modifiers: |endswith, |contains, |contains|all, |startswith. pySigma is used only for forge convert (query-string generation), not for evaluation — see ADR-002 for the design decision and spike results.
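A field-matcher of this kind is small enough to sketch in full. The version below is an illustration, not the code in forge/test_harness.py. It supports the four modifiers named above plus plain equality, treats list values as OR (AND for `|contains|all`), and assumes case-insensitive comparison.

```python
from typing import Any

SUPPORTED = {"", "endswith", "contains", "contains|all", "startswith"}


def match_field(event: dict[str, Any], key: str, expected: Any) -> bool:
    """Evaluate one 'Field|modifier: value' pair against an event."""
    field, _, modifier = key.partition("|")
    if modifier not in SUPPORTED:
        raise ValueError(f"unsupported modifier: {modifier!r}")
    if field not in event:
        return False
    actual = str(event[field]).lower()
    values = expected if isinstance(expected, list) else [expected]
    wanted = [str(v).lower() for v in values]
    if modifier == "contains|all":
        return all(v in actual for v in wanted)
    tests = {
        "": lambda v: actual == v,
        "endswith": actual.endswith,
        "contains": lambda v: v in actual,
        "startswith": actual.startswith,
    }
    return any(tests[modifier](v) for v in wanted)


def match_selection(event: dict[str, Any], selection: dict[str, Any]) -> bool:
    """A selection matches when every field condition in it matches."""
    return all(match_field(event, k, v) for k, v in selection.items())


# The selection from the T1003.001 example rule above.
SELECTION = {
    "EventID": 1,
    "Image|endswith": "\\rundll32.exe",
    "CommandLine|contains|all": ["comsvcs", "MiniDump"],
}
```

Keeping evaluation this narrow means a rule using any other modifier fails loudly instead of being silently mis-evaluated.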

Multi-SIEM conversion

forge convert compiles each Sigma rule to query strings for all three backends via pySigma. The Sigma modifier set used across all current rules maps cleanly to each target language:

Sigma modifier	SPL	EQL	KQL
`|endswith`	`field="*value"`	`field:"*value"`	`field endswith "value"`
`|contains`	`field="*value*"`	`field:"*value*"`	`field contains "value"`
`|contains|all`	Repeated field clauses (AND)	`(f:"*a*" and f:"*b*")`	`(f contains "a" and f contains "b")`
`|startswith`	`field="value*"`	`field:"value*"`	`field startswith "value"`
`not filter`	`NOT (field IN (...))`	`not (field like~ (...))`	`not((field endswith "..."))`

No translation losses detected. All 60 conversions succeeded. Example converted queries for T1003.001 are in docs/examples/T1003_001_converted_queries.md.

Known limitations & Phase 5 backlog

Documented deferrals (three techniques, three prerequisite types)

Three techniques are documented as deferrals rather than shipped with structurally guaranteed precision figures. Each has a named prerequisite for unblocking.

Technique	Class	Prerequisite type	Reason
T1037.001 Logon Script (UserInitMprLogonScript)	Class 3	Corpus only	484-event registry baseline contains 0 events touching `\Environment\` paths — precision would be structurally guaranteed. Unblocking is corpus expansion, not harness work.
T1546.003 WMI Event Subscription	Class 1	Harness + corpus	Empire's WMI persistence goes through the WMI namespace API (Sysmon EID 19/20/21), not registry writes. Unblocking requires a new `wmi_event` baseline and dispatch entry.
T1110.003 Password Spraying	Class 1	Harness + corpus + aggregation	Rule selects on EID 4625 (logon failure), no baseline in current harness. Unblocking requires a logon-event baseline, dispatch entry, and aggregation logic to count failures per source within a window.

The deferral classes used above (Class 1 logsource-mismatch, Class 3 corpus-realism) are defined in journey Entry 4.5. T1037.001 is the most readily unblockable because the harness already handles registry_event. Phase 5 Day 1 sequencing: tackle T1037.001 as a corpus expansion, then take on the harness-side work for the other two.
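The aggregation prerequisite for T1110.003 (counting logon failures per source within a window) could take roughly the following shape. This is a forward-looking sketch only, since no such logic exists in the current harness. The function name, the input tuple layout, and both thresholds are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable


def noisy_sources(
    failures: Iterable[tuple[float, str]],
    window_seconds: float = 300.0,  # hypothetical window
    min_failures: int = 10,         # hypothetical threshold
) -> set[str]:
    """Flag sources with at least min_failures EID 4625 events in any window.

    failures is an iterable of (timestamp_seconds, source) pairs.
    """
    recent: dict[str, deque[float]] = defaultdict(deque)
    flagged: set[str] = set()
    for ts, source in sorted(failures):
        window = recent[source]
        window.append(ts)
        # Drop failures that have fallen out of the sliding window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= min_failures:
            flagged.add(source)
    return flagged
```

A per-event matcher cannot express this, which is why the deferral lists aggregation as a separate prerequisite on top of the baseline and dispatch entry.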

Principle: a rule's precision number is only meaningful when the benign baseline contains events of the same logsource the rule selects on, in sufficient population to plausibly false-positive. Shipping a rule whose precision would be structurally guaranteed contradicts the standard applied to every other rule in the project.

Single-event fixture caveats

Sixteen rules have a positive fixture of three events or fewer, and ten of those have a single event (see the TP column in the Precision / Recall Table). Recall is measured as binary against those events: it is 1.000 in every case, but that does not mean all real-world variants are covered. See per-rule _meta.yml in data/attack/<technique>/ for the OTRF source dataset and the variant captured.

Dashboard / presentation layer

forge build (Stage 5) writes the four JSON contract files to dist/data/. A static measurement dashboard at dist/ (React via CDN, no build step) consumes those files at runtime and presents five views: Overview, Rules, Coverage, Trends, Gaps. The earned-vs-structural-absence classification ships next to every precision number — the project's honesty contract, surfaced in code. See View the dashboard locally above for the serve order.

ATT&CK coverage also ships via reports/layer.json, which loads directly into the official MITRE Navigator at https://mitre-attack.github.io/attack-navigator/ (Open Existing Layer → Upload from Local) for anyone who prefers the canonical viewer to the project dashboard.

The original UI deferral and the rebuild rationale are recorded in PDR §9 and journey Entry 4.9.

Acknowledgments

OTRF Security-Datasets — all positive fixture events and the benign baseline corpus come from OTRF's open-source attack capture library. Dataset selection, event counts, and accept/reject decisions are logged in data/captures/_decisions.md.

Open Threat Research Foundation (OTRF). Security-Datasets. https://securitydatasets.com. MIT License.

MITRE ATT&CK — technique and tactic mappings throughout this project follow the MITRE ATT&CK framework.

MITRE Corporation. ATT&CK®. https://attack.mitre.org. ATT&CK® content is licensed under CC BY 4.0.

SigmaHQ — rule format and pySigma conversion backends.

SigmaHQ. Sigma — Generic Signature Format for SIEM Systems. https://sigmahq.io. Apache License 2.0.

License

MIT — see LICENSE.
