Detection-as-code pipeline with measured precision/recall against real OTRF attack captures.
by Aadarsh Kadam · github.com/aadarshkadam067
DetectionForge is a detection-as-code pipeline that treats SIEM rules like software: version-controlled, automatically tested against real OTRF attack captures, and auto-converted to three SIEM backends. Every rule ships with measured precision and recall against a 1,648-event corpus (1,164 process-creation events + 484 registry events) drawn from OTRF Security-Datasets ZIPs, with no synthetic events and no hand-crafted fixtures. The pipeline ships 20 rules covering 21 ATT&CK techniques across 8 tactics, with mean precision 0.997 and 60/60 multi-SIEM conversion success across Splunk SPL, Elastic EQL, and Microsoft Sentinel KQL. (The 21st technique is T1027 Obfuscated Files or Information: the T1059.001 PowerShell -EncodedCommand rule legitimately covers both T1059.001 and T1027.)
Each rule's precision figure is classified as earned (14 rules: the benign corpus contains same-shape events the rule correctly excludes) or structural-absence (6 rules: precision = 1.000 because the corpus contains no events the rule could match; shipped under the T1218 LOLBIN-cluster rationale). The classification is recorded per rule in dist/data/dashboard_meta.json. Three more techniques are documented as Phase 5 deferrals rather than shipped with structurally guaranteed measurements (see Phase 5 backlog below). The same standard applies everywhere: a 1.000 number is never shown without its provenance.
The dashboard presents the detection corpus, the precision measurements, and the documented gaps as a single static site. It is served by any HTTP server over the dist/ directory: no build step, no backend, four JSON contract files read at runtime.
Overview: headline metrics. The earned-vs-structural-absence split renders at the same prominence as the mean-precision figure; the honesty meter in the sidebar surfaces the 14/6 split, conversion tally, and FP count.
Rules: sortable, filterable, expandable table of all 20 rules. T1059.001's 0.933 precision and single false positive surface inline (highlighted), not rounded away. Filter by classification or logsource; expand any row for the Sigma source and the three converted backend queries.
Coverage: ATT&CK tactic-column grid. 26 direct technique cells and 16 parent-rollup cells (dashed). Every non-rollup cell carries its classification chip; T1059.001 and T1027 render with an FP background because the underlying rule has one.
Trends: single-snapshot state. first_snapshot_date equals built_date, so each chart renders the current value as a horizontal level with a single marker. Successive forge build runs append points and the sparklines widen automatically.
Gaps: three deferred techniques with class (Class 1 / Class 3), prerequisite type, and reason from dashboard_meta.json. The structural-absence inventory continues below this fold. Named, not hidden.
```bash
git clone https://github.com/aadarshkadam067/DetectionForge.git
cd DetectionForge
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev,convert]'
python scripts/update_attack_data.py   # generates forge/data/attack_techniques.json
forge run                              # lint → test → convert → score → build
```

Or run the full pipeline in Docker:

```bash
docker compose -f docker/docker-compose.yml run --rm forge
```

Expected forge test summary:

```
20 rules - mean precision 0.997  mean recall 1.000  mean F1 0.998
```

Expected forge convert summary:

```
60/60 conversions succeeded - success rate 100.0% (threshold 95%)
```
The ATT&CK coverage layer is written to reports/layer.json and is loadable directly into the public Navigator at https://mitre-attack.github.io/attack-navigator/ via Open Existing Layer → Upload from Local.
The static measurement dashboard at dist/ consumes the JSON contract files
in dist/data/. forge run must complete first: the JSON files are
not committed (they're build output), so a fresh clone serves an empty
dashboard until the pipeline has populated dist/data/. Then:

```bash
forge run                                # populates dist/data/*.json
cd dist && python3 -m http.server 8000   # any static server works
# open http://127.0.0.1:8000/
```

If you open the dashboard before running the pipeline, it renders an explicit error card pointing at this same command. This is by design, so the failure mode is legible instead of a silent empty UI.
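The serve order above can also be checked before starting a server. The following is a minimal sketch, not part of the project: the four file names come from this README, while `missing_contract_files` is a hypothetical helper written for illustration.

```python
import json
from pathlib import Path

# File names are taken from the README; the helper itself is hypothetical.
CONTRACT_FILES = (
    "dashboard_meta.json",
    "results.json",
    "conversion_matrix.json",
    "layer.json",
)


def missing_contract_files(data_dir):
    """Return the contract files that are absent or do not parse as JSON."""
    problems = []
    for name in CONTRACT_FILES:
        path = Path(data_dir) / name
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):  # missing file, unreadable file, or bad JSON
            problems.append(name)
    return problems


if __name__ == "__main__":
    missing = missing_contract_files("dist/data")
    if missing:
        print("Run `forge run` first; missing or invalid:", ", ".join(missing))
    else:
        print("dist/data is populated; serve dist/ with any static server.")
```

A check like this only confirms that the files exist and parse; it says nothing about whether their contents match what the dashboard expects.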
The dashboard uses in-browser Babel transformation, which produces one console warning at page load. This is intentional: it preserves the no-build-step deployment property documented in PDR §9, a production-deployable static asset with zero toolchain dependency. The dashboard ships as plain HTML/CSS/JSX read directly from disk.
```
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────────────┐
│  rules/*.yml     │   │  data/attack/    │   │  data/benign/...         │
│  (Sigma rules)   │   │  (TP fixtures)   │   │  data/registry_baseline  │
└────────┬─────────┘   └────────┬─────────┘   └────────────┬─────────────┘
         │                      │                          │
         └───┬──────────────────┴──────────┬───────────────┘
             ▼                             ▼
     ┌───────────────┐             ┌───────────────┐
     │  forge lint   │             │  forge test   │
     │  (Stage 1)    │             │  (Stage 2)    │
     └───────┬───────┘             └───────┬───────┘
             │                             │
             ▼                             ▼
     ┌───────────────┐             ┌───────────────┐
     │ forge convert │             │  forge score  │
     │  (Stage 3)    │             │  (Stage 4)    │
     └───────┬───────┘             └───────┬───────┘
             │                             │
             ▼                             ▼
┌─────────────────────────┐   ┌────────────────────────┐
│ reports/converted/<r>/  │   │ reports/results.json   │
│   splunk.spl            │   │ reports/layer.json     │
│   elastic.eql           │   │ reports/conv_matrix... │
│   sentinel.kql          │   │                        │
└────────────┬────────────┘   └────────────┬───────────┘
             │                             │
             └──────────────┬──────────────┘
                            ▼
                   ┌────────────────┐
                   │  forge build   │
                   │  (Stage 5)     │
                   └────────┬───────┘
                            ▼
            ┌────────────────────────────────┐
            │  dist/data/                    │
            │    dashboard_meta.json         │
            │    results.json                │
            │    conversion_matrix.json      │
            │    layer.json                  │
            └────────────────────────────────┘
```
| Stage | Command | Inputs | Outputs |
|---|---|---|---|
| 1. Lint | `forge lint` | `rules/**/*.yml` | pass/fail with errors |
| 2. Test | `forge test` | rules + fixtures + baselines | `reports/results.json` |
| 3. Convert | `forge convert` | rules | `reports/converted/<slug>/{splunk.spl, elastic.eql, sentinel.kql}` + `reports/conversion_matrix.json` |
| 4. Score | `forge score` | `reports/results.json` | `reports/layer.json` (ATT&CK Navigator v4.5) |
| 5. Build | `forge build` | all reports + rule YAMLs + `_meta.yml` | `dist/data/{dashboard_meta, results, conversion_matrix, layer}.json` |
Note on Stage 5. forge build produces JSON data artifacts only, no HTML. The dashboard at dist/ is committed source (HTML/CSS/JSX, no build step) that consumes those JSON files at runtime. See PDR §9 and journey Entry 4.9 for the decision record and the rebuild rationale.
The Class column distinguishes earned precision (the corpus contained same-shape events the rule correctly excluded, so discrimination is real) from struct-abs (the corpus contained no events the rule could match, so precision = 1.000 reflects absence, not discrimination; shipped under the T1218 LOLBIN-cluster rationale where applicable). See journey Entry 4.7 for the per-rule classification audit.
| Technique | Tactic | Logsource | P | R | F1 | TP | FP | Class |
|---|---|---|---|---|---|---|---|---|
| T1003.001 | Cred Access | process_creation | 1.000 | 1.000 | 1.000 | 2 | 0 | earned |
| T1003.002 | Cred Access | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | struct-abs |
| T1021.006 | Lateral Movement | process_creation | 1.000 | 1.000 | 1.000 | 3 | 0 | earned |
| T1033 | Discovery | process_creation | 1.000 | 1.000 | 1.000 | 2 | 0 | earned |
| T1053.005 | Persistence | process_creation | 1.000 | 1.000 | 1.000 | 2 | 0 | earned |
| T1055.001 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | earned |
| T1059.001 | Execution | process_creation | 0.933 | 1.000 | 0.965 | 14 | 1 | earned |
| T1059.005 | Execution | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | struct-abs |
| T1087.001 | Discovery | process_creation | 1.000 | 1.000 | 1.000 | 8 | 0 | earned |
| T1105 | C2 | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | earned |
| T1136.001 | Persistence | process_creation | 1.000 | 1.000 | 1.000 | 4 | 0 | earned |
| T1218.001 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | struct-abs |
| T1218.004 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | struct-abs |
| T1218.005 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 2 | 0 | struct-abs |
| T1218.010 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | earned |
| T1218.013 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | struct-abs |
| T1220 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | earned |
| T1547.001 | Persistence | registry_event | 1.000 | 1.000 | 1.000 | 3 | 0 | earned |
| T1548.002 | Priv Esc | process_creation | 1.000 | 1.000 | 1.000 | 1 | 0 | earned |
| T1562.004 | Defense Evasion | process_creation | 1.000 | 1.000 | 1.000 | 4 | 0 | earned |
Aggregate: 20 rules · mean precision 0.997 · mean recall 1.000 · mean F1 0.998 · 14 earned / 6 structural-absence.
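The aggregate figures can be re-derived from the TP and FP columns of the table. The sketch below copies those counts and assumes zero false negatives for every rule (which is what recall 1.000 implies); the variable and function names are illustrative, not the harness's own.

```python
from statistics import mean

# (TP, FP) per rule, copied from the results table above.
RESULTS = {
    "T1003.001": (2, 0), "T1003.002": (1, 0), "T1021.006": (3, 0),
    "T1033": (2, 0), "T1053.005": (2, 0), "T1055.001": (1, 0),
    "T1059.001": (14, 1), "T1059.005": (1, 0), "T1087.001": (8, 0),
    "T1105": (1, 0), "T1136.001": (4, 0), "T1218.001": (1, 0),
    "T1218.004": (1, 0), "T1218.005": (2, 0), "T1218.010": (1, 0),
    "T1218.013": (1, 0), "T1220": (1, 0), "T1547.001": (3, 0),
    "T1548.002": (1, 0), "T1562.004": (4, 0),
}


def prf(tp, fp, fn=0):
    """Precision, recall and F1 from raw counts (fn=0 is assumed here)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


scores = {tid: prf(tp, fp) for tid, (tp, fp) in RESULTS.items()}
mean_p = mean(p for p, _, _ in scores.values())
mean_r = mean(r for _, r, _ in scores.values())
mean_f1 = mean(f for _, _, f in scores.values())

print(f"{len(scores)} rules  precision {mean_p:.3f}  recall {mean_r:.3f}  F1 {mean_f1:.3f}")
```

Because 19 of the 20 rules sit at 1.000, the mean is driven almost entirely by T1059.001's 14/15, which is why the per-rule table matters more than the aggregate.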
Corpus: 1,648 events across two logsources:

- 1,164-event process-creation baseline (Sysmon EID 1) drawn from 121 OTRF atomic + compound Windows ZIPs, deduplicated on (Image, CommandLine, ParentImage, Computer) and filtered via the structural attack-event exclusion (ADR-005).
- 484-event registry baseline (Sysmon EID 12/13) drawn from 4 OTRF source ZIPs (empire_wmi_local_event_subscriptions_elevated_user, empire_schtasks_creation_execution_elevated_user, covenant_dcom_iertutil_dll_hijack, empire_dcom_shellwindows_stager), with explicit exclusion of \Microsoft\Windows\CurrentVersion\Run\ and \Microsoft\Windows\CurrentVersion\RunOnce\ paths (the labeling axiom: exclude exactly what the rule detects). Built in Phase 4 Day 3 via the ADR-006 per-logsource baseline routing.

48 structural-filter signatures total (46 process_creation + 3 registry_event,
with overlap accounted for). All 20 rules clear their expected.precision
threshold. The lone non-1.000 figure (T1059.001 at 0.933) is a documented
cross-technique FP; the treatment described in the T1087.001 Before/After
section below is applied to it symmetrically. Details in journey
Entries 4.3 / 4.6 / 4.7.
All 20 rules × 3 backends = 60/60 conversions succeeded (100%). Backends: pySigma-backend-splunk 2.1.0, pySigma-backend-elasticsearch 2.0.2, pySigma-backend-kusto 1.0.1. Two logsource categories convert cleanly:
| Logsource | Rules | Conversions |
|---|---|---|
| process_creation | 19 | 57/57 |
| registry_event | 1 (T1547.001) | 3/3 |
| Total | 20 | 60/60 |
Per-rule per-backend status and the actual emitted query strings are in
reports/conversion_matrix.json and reports/converted/<rule>/{splunk.spl, elastic.eql, sentinel.kql}.
Trailing-backslash semantics on the registry logsource were verified correct across
all three backends: the \Run\ condition (\CurrentVersion\Run\) does not match
\RunTime or \RunOnce. See docs/measurements/phase4-day3-registry-harness.md
for the verification.
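The trailing-backslash behaviour is easy to see in isolation. This is a sketch under two assumptions: that the rule uses a contains-style, case-insensitive match, and that the registry paths below (invented for illustration) resemble real TargetObject values.

```python
RUN_NEEDLE = "\\CurrentVersion\\Run\\"      # with trailing backslash
LOOSE_NEEDLE = "\\CurrentVersion\\Run"      # without it


def contains(value, needle):
    """Case-insensitive substring test."""
    return needle.lower() in value.lower()


# Hypothetical registry paths; only the key names matter here.
samples = {
    r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run\Updater": True,
    r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce\Setup": False,
    r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunTime\Cache": False,
}

for path, expected in samples.items():
    # With the trailing backslash, only the \Run\ key itself matches.
    assert contains(path, RUN_NEEDLE) is expected
    # Without it, all three paths would match.
    assert contains(path, LOOSE_NEEDLE)
```

The second assertion shows why the trailing backslash is worth verifying per backend: dropping it silently widens the match to sibling keys.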
The harness caught a rule defect during corpus expansion: T1087.001's CommandLine|contains: user condition fired on net user /add commands from T1136.001 (Create Account), producing two cross-technique false positives. The defect was logged at detection time (Phase 3 Day 2), deliberately retained in the measurement, and fixed in a later commit, providing a reproducible before/after delta.
| State | Precision | Recall | F1 | FPs | Defect |
|---|---|---|---|---|---|
| Before fix (Day 2) | 0.800 | 1.000 | 0.889 | 2 | user keyword matched net user /add |
| After fix (Day 4) | 1.000 | 1.000 | 1.000 | 0 | filter_account_creation: CommandLine\|contains: ' /add ' |
The fix is a 4-line YAML change; the harness measures the improvement. This is the iterative-improvement loop the project was built to demonstrate. Full chain of evidence in docs/measurements/phase3-day4-t1087-fix.md.
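The before/after delta can be reproduced in miniature. In this sketch the two predicates paraphrase the rule logic described above (a `user` keyword, then the same keyword with an ` /add ` exclusion); the command lines are invented, and only their counts (8 true positives, 2 cross-technique events) mirror the table.

```python
def before_fix(cmd):
    """Original condition: CommandLine contains 'user'."""
    return "user" in cmd.lower()


def after_fix(cmd):
    """Fixed condition: same selection, minus commands containing ' /add '."""
    lowered = cmd.lower()
    return "user" in lowered and " /add " not in lowered


def precision(rule, attack_cmds, other_cmds):
    tp = sum(rule(c) for c in attack_cmds)
    fp = sum(rule(c) for c in other_cmds)
    return tp / (tp + fp)


# Hypothetical account-discovery command lines (8 events).
discovery = [
    "net user", "net user /domain", "net user Administrator", "net.exe user",
    "net1 user", "cmd /c net user", "net user guest", "net users",
]
# Hypothetical T1136.001 account-creation command lines (2 events).
creation = [
    "net user /add backdoor Passw0rd!",
    "net user /add svc_tmp Passw0rd! /domain",
]

print(f"before: {precision(before_fix, discovery, creation):.3f}")
print(f"after:  {precision(after_fix, discovery, creation):.3f}")
```

Note that the ` /add ` filter is space-delimited on both sides, so a command line ending in `/add` with no trailing space would not be excluded by this predicate.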
All positive fixtures (TP events) come from real OTRF Security-Datasets captures, never from synthetic or hand-authored events. Dataset selection is logged with an accept/reject decision and reason in data/captures/_decisions.md, the evidentiary record for "how did you choose your test data?"
The 1,164-event process-creation baseline was assembled from 121 OTRF
atomic + compound Windows ZIPs (Phase 4 Day 1 corpus build), deduplicated on
(Image, CommandLine, ParentImage, Computer) using
scripts/rebuild_baseline.py. A structural attack-event filter
(ADR-005) then
removed events whose hash(Image, CommandLine, ParentImage) matched any
positive fixture. As new attack fixtures land, the filter is re-run; the
baseline shrinks idempotently as cross-dataset attack-shape leaks are caught.
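The two steps described above (deduplicate, then drop attack-shaped events) might look roughly like this. It is a sketch, not the contents of scripts/rebuild_baseline.py: the function names are hypothetical, and a plain tuple stands in for the hash the text mentions.

```python
DEDUP_KEYS = ("Image", "CommandLine", "ParentImage", "Computer")
ATTACK_KEYS = ("Image", "CommandLine", "ParentImage")


def signature(event, keys):
    """Identity of an event under the given fields (tuple used in place of a hash)."""
    return tuple(event.get(k) for k in keys)


def dedupe(events):
    """Keep the first event for each (Image, CommandLine, ParentImage, Computer)."""
    seen, unique = set(), []
    for event in events:
        sig = signature(event, DEDUP_KEYS)
        if sig not in seen:
            seen.add(sig)
            unique.append(event)
    return unique


def structural_filter(baseline, positive_fixtures):
    """Remove baseline events whose attack signature matches any positive fixture."""
    attack_sigs = {signature(e, ATTACK_KEYS) for e in positive_fixtures}
    return [e for e in baseline if signature(e, ATTACK_KEYS) not in attack_sigs]


# Tiny hypothetical corpus to show the effect.
raw = [
    {"Image": "a.exe", "CommandLine": "a", "ParentImage": "p.exe", "Computer": "WS1"},
    {"Image": "a.exe", "CommandLine": "a", "ParentImage": "p.exe", "Computer": "WS1"},
    {"Image": "evil.exe", "CommandLine": "x", "ParentImage": "p.exe", "Computer": "WS2"},
]
fixtures = [{"Image": "evil.exe", "CommandLine": "x", "ParentImage": "p.exe"}]

baseline = structural_filter(dedupe(raw), fixtures)
```

Re-running `structural_filter` on its own output with the same fixtures changes nothing, which is the idempotence the paragraph refers to; adding fixtures can only shrink the baseline.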
The 484-event registry baseline (Sysmon EID 12/13) was built in Phase 4
Day 3 by scripts/extract_registry_baseline.py from four OTRF source ZIPs.
The extraction excludes \Microsoft\Windows\CurrentVersion\Run\ and
\Microsoft\Windows\CurrentVersion\RunOnce\ paths up front (the labeling
axiom: the rule's detection target is exactly what is excluded, nothing more,
nothing less). The generalized structural filter is then re-applied post-fixture
using (TargetObject, Details, Image) as the identity signature.
See ADR-006 for the
per-logsource routing decision and docs/measurements/phase4-day3-registry-harness.md for the build record.
Precision is only meaningful for rules whose baseline contains events of the
same logsource the rule selects on. The harness enforces this via ADR-006:
forge/__init__.py defines BASELINE_MAP (a category → file registry) and
REGISTERED_LOGSOURCE_CATEGORIES (its keyset). At test time, rules with no
explicit fixtures.negative are auto-routed by their logsource.category.
At lint time, a rule with no negative fixture and no registered category is
rejected before it can be measured (see tests/test_lint.py for the three
guard cases). Multi-logsource expansion is now "add a baseline file + add a
dispatch-table entry", with no other changes.
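A dispatch table of this kind could be sketched as follows. `BASELINE_MAP` and `REGISTERED_LOGSOURCE_CATEGORIES` are names from the text; the file paths are assumptions (the registry path in particular is hypothetical), and `resolve_negative_fixture` is an illustrative helper rather than the harness's actual function.

```python
# Category -> baseline file. Paths are assumed for illustration.
BASELINE_MAP = {
    "process_creation": "data/benign/workstation_baseline.json",
    "registry_event": "data/benign/registry_baseline.json",  # hypothetical path
}
REGISTERED_LOGSOURCE_CATEGORIES = frozenset(BASELINE_MAP)


def resolve_negative_fixture(rule):
    """Pick the benign baseline for a parsed rule (a dict loaded from YAML).

    An explicit forge.fixtures.negative wins; otherwise the rule is routed by
    logsource.category, and unregistered categories are rejected.
    """
    explicit = rule.get("forge", {}).get("fixtures", {}).get("negative")
    if explicit:
        return explicit
    category = rule.get("logsource", {}).get("category")
    if category not in REGISTERED_LOGSOURCE_CATEGORIES:
        raise ValueError(
            f"no negative fixture and no registered baseline for category {category!r}"
        )
    return BASELINE_MAP[category]
```

Under this shape, supporting a new logsource means adding one baseline file and one dictionary entry, and the rejection path is what keeps an unmeasurable rule from producing a precision number at all.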
Note on legacy figures. The pre-Phase-4 README reported a 1,481-event baseline. That figure was a raw concatenation count that included 419 cross-dataset duplicate events; the correct unique-event count for the pre-Phase-4 corpus was approximately 1,051. The Phase 4 rebuild corrects this and adds the registry baseline. See Entry 4.3 in the project journey for the accounting.
Rules are authored in Sigma format with a forge: extension block for fixture paths, expected metrics, and multi-SIEM flags. All rules are Pydantic v2 validated on lint.
```yaml
# Example: rules/windows/credential_access/T1003_001_lsass_dump_comsvcs.yml
detection:
  selection:
    EventID: 1
    Image|endswith: '\rundll32.exe'
    CommandLine|contains|all:
      - 'comsvcs'
      - 'MiniDump'
  condition: selection
forge:
  fixtures:
    positive: data/attack/T1003.001/events.json
    negative: data/benign/workstation_baseline.json
  expected:
    precision: 0.95
    recall: 1.00
  multi_siem: true
```

The test harness (forge/test_harness.py) evaluates rules against fixture events using a hand-rolled field-matcher supporting four modifiers: |endswith, |contains, |contains|all, |startswith. pySigma is used only for forge convert (query-string generation), not for evaluation; see ADR-002 for the design decision and spike results.
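A field matcher covering those four modifiers can be quite small. The sketch below is an independent illustration, not the code in forge/test_harness.py; the function names are hypothetical, and it assumes case-insensitive comparison and any-of semantics for value lists unless `|all` is given.

```python
def match_field(event, key, expected):
    """Evaluate one `Field|modifier` entry of a selection against an event dict."""
    field, *modifiers = key.split("|")
    if field not in event:
        return False
    actual = str(event[field]).lower()
    values = expected if isinstance(expected, list) else [expected]
    values = [str(v).lower() for v in values]

    op = modifiers[0] if modifiers else "equals"
    tests = {
        "equals": lambda v: actual == v,
        "endswith": lambda v: actual.endswith(v),
        "contains": lambda v: v in actual,
        "startswith": lambda v: actual.startswith(v),
    }
    if op not in tests:
        raise ValueError(f"unsupported modifier: {op}")
    # `|all` requires every listed value to match; the default is any-of.
    combine = all if modifiers[1:] == ["all"] else any
    return combine(tests[op](v) for v in values)


def match_selection(event, selection):
    """A selection matches when every field condition in it matches (logical AND)."""
    return all(match_field(event, key, value) for key, value in selection.items())


# The selection from the example rule above, as a Python dict.
selection = {
    "EventID": 1,
    "Image|endswith": "\\rundll32.exe",
    "CommandLine|contains|all": ["comsvcs", "MiniDump"],
}
# Hypothetical event shaped like a Sysmon EID 1 record.
event = {
    "EventID": 1,
    "Image": "C:\\Windows\\System32\\rundll32.exe",
    "CommandLine": "rundll32.exe C:\\Windows\\System32\\comsvcs.dll, MiniDump 624 C:\\temp\\out.dmp full",
}
matched = match_selection(event, selection)
```

Keeping the evaluator this narrow means a rule that uses an unsupported modifier fails loudly instead of being silently mis-evaluated.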
forge convert compiles each Sigma rule to query strings for all three backends via pySigma. The Sigma modifier set used across all current rules maps cleanly to each target language:
| Sigma modifier | SPL | EQL | KQL |
|---|---|---|---|
| `\|endswith` | `field="*value"` | `field:"*value"` | `field endswith "value"` |
| `\|contains` | `field="*value*"` | `field:"*value*"` | `field contains "value"` |
| `\|contains\|all` | Repeated field clauses (AND) | `(f:"*a*" and f:"*b*")` | `(f contains "a" and f contains "b")` |
| `\|startswith` | `field="value*"` | `field:"value*"` | `field startswith "value"` |
| `not filter` | `NOT (field IN (...))` | `not (field like~ (...))` | `not((field endswith "..."))` |
No translation losses detected. All 60 conversions succeeded. Example converted queries for T1003.001 are in docs/examples/T1003_001_converted_queries.md.
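To make the KQL column of the table concrete, here is a toy renderer that reproduces those string shapes. It is purely illustrative and is not how pySigma works internally: `to_kql` is a hypothetical function, it does no escaping of quotes or backslashes, and it ignores everything a real backend handles (table selection, field mapping, negation).

```python
KQL_OPERATORS = {
    "endswith": "endswith",
    "contains": "contains",
    "startswith": "startswith",
}


def to_kql(field, modifier, values):
    """Render one Sigma field condition in the KQL shape shown in the table."""
    base, _, suffix = modifier.partition("|")
    operator = KQL_OPERATORS[base]
    clauses = [f'{field} {operator} "{value}"' for value in values]
    if len(clauses) == 1:
        return clauses[0]
    # `|all` joins with `and`; a plain value list is any-of, so `or`.
    joiner = " and " if suffix == "all" else " or "
    return "(" + joiner.join(clauses) + ")"


print(to_kql("field", "endswith", ["value"]))
print(to_kql("f", "contains|all", ["a", "b"]))
```

For real query generation the project relies on the pySigma backends listed above; this sketch only shows why the four modifiers map one-to-one onto KQL string operators.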
Three techniques are documented as deferrals rather than shipped with structurally guaranteed precision figures. Each has a named prerequisite for unblocking.
| Technique | Class | Prerequisite type | Reason |
|---|---|---|---|
| T1037.001 Logon Script (UserInitMprLogonScript) | Class 3 | Corpus only | 484-event registry baseline contains 0 events touching \Environment\ paths, so precision would be structurally guaranteed. Unblocking is corpus expansion, not harness work. |
| T1546.003 WMI Event Subscription | Class 1 | Harness + corpus | Empire's WMI persistence goes through the WMI namespace API (Sysmon EID 19/20/21), not registry writes. Unblocking requires a new wmi_event baseline and dispatch entry. |
| T1110.003 Password Spraying | Class 1 | Harness + corpus + aggregation | Rule selects on EID 4625 (logon failure), no baseline in current harness. Unblocking requires a logon-event baseline, dispatch entry, and aggregation logic to count failures per source within a window. |
The deferral classes used above (Class 1 logsource-mismatch, Class 3
corpus-realism) are defined in journey Entry 4.5.
T1037.001 is the most readily unblockable because the harness already handles
registry_event. Phase 5 Day 1 sequencing: tackle T1037.001 as a corpus
expansion, then take on the harness-side work for the other two.
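The aggregation prerequisite named for T1110.003 (counting failures per source within a window) is the one piece the current per-event matcher cannot express. One possible shape for it is sketched below; nothing here exists in the project, and the event field names (`source`, `timestamp`), the threshold, and the window length are all assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def sources_over_threshold(events, threshold=5, window=timedelta(minutes=1)):
    """Return sources with at least `threshold` EID 4625 failures inside `window`."""
    failures = sorted(
        (e for e in events if e.get("EventID") == 4625),
        key=lambda e: e["timestamp"],
    )
    recent = defaultdict(deque)  # source -> timestamps still inside the window
    flagged = set()
    for event in failures:
        stamps = recent[event["source"]]
        stamps.append(event["timestamp"])
        # Drop failures that have aged out of the sliding window.
        while event["timestamp"] - stamps[0] > window:
            stamps.popleft()
        if len(stamps) >= threshold:
            flagged.add(event["source"])
    return flagged


# Hypothetical events: one fast burst, one slow trickle.
start = datetime(2024, 1, 1, 12, 0, 0)
burst = [
    {"EventID": 4625, "source": "10.0.0.5", "timestamp": start + timedelta(seconds=10 * i)}
    for i in range(5)
]
trickle = [
    {"EventID": 4625, "source": "10.0.0.9", "timestamp": start + timedelta(minutes=15 * i)}
    for i in range(5)
]
flagged = sources_over_threshold(burst + trickle)
```

A real password-spraying rule would likely also count distinct target accounts per source; that refinement is left out here to keep the window logic visible.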
Principle: a rule's precision number is only meaningful when the benign baseline contains events of the same logsource the rule selects on, in sufficient population to plausibly false-positive. Shipping a rule whose precision would be structurally guaranteed contradicts the standard applied to every other rule in the project.
Sixteen of the 20 rules have a positive fixture of three events or fewer (see
the TP column in the results table). Recall is measured as binary against those
events: it is 1.000 in every case but does not mean all real-world variants are
covered. See per-rule _meta.yml in data/attack/<technique>/ for the OTRF source
dataset and the variant captured.
forge build (Stage 5) writes the four JSON contract files to dist/data/.
A static measurement dashboard at dist/ (React via CDN, no build step)
consumes those files at runtime and presents five views: Overview, Rules,
Coverage, Trends, Gaps. The earned-vs-structural-absence classification ships
next to every precision number: the project's honesty contract, surfaced
in code. See View the dashboard locally above for the serve order.
ATT&CK coverage also ships via reports/layer.json, which loads directly
into the official MITRE Navigator at
https://mitre-attack.github.io/attack-navigator/
(Open Existing Layer → Upload from Local) for anyone who prefers the
canonical viewer to the project dashboard.
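For readers unfamiliar with Navigator layers, the file is a JSON document listing technique IDs with optional scores and comments. The sketch below builds a minimal layer of that general shape; it is not the output of forge score, the scoring scheme (precision scaled to 0-100) is an assumption made for illustration, and `build_layer` is a hypothetical name.

```python
import json


def build_layer(results, name="DetectionForge coverage"):
    """Build a minimal Navigator-style layer dict from per-technique results."""
    return {
        "name": name,
        "versions": {"layer": "4.5"},
        "domain": "enterprise-attack",
        "techniques": [
            {
                "techniqueID": technique_id,
                # Assumed scoring: precision as a 0-100 integer.
                "score": round(result["precision"] * 100),
                "comment": f"precision {result['precision']:.3f}",
            }
            for technique_id, result in sorted(results.items())
        ],
    }


# Two rows from the results table, used as sample input.
sample = {
    "T1059.001": {"precision": 0.933},
    "T1003.001": {"precision": 1.0},
}
layer = build_layer(sample)
layer_json = json.dumps(layer, indent=2)
```

Navigator may ask to upgrade or fill in missing version fields when loading a layer this sparse; the layer emitted by the pipeline is the one to use in practice.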
The original UI deferral and the rebuild rationale are recorded in PDR §9 and journey Entry 4.9.
OTRF Security-Datasets: all positive fixture events and the benign baseline corpus come from OTRF's open-source attack capture library. Dataset selection, event counts, and accept/reject decisions are logged in data/captures/_decisions.md.
Open Threat Research Foundation (OTRF). Security-Datasets. https://securitydatasets.com. MIT License.
MITRE ATT&CK: technique and tactic mappings throughout this project follow the MITRE ATT&CK framework.
MITRE Corporation. ATT&CK®. https://attack.mitre.org. ATT&CK® content is licensed under CC BY 4.0.
SigmaHQ: rule format and pySigma conversion backends.
SigmaHQ. Sigma: Generic Signature Format for SIEM Systems. https://sigmahq.io. Apache License 2.0.
MIT โ see LICENSE.
Copyright (c) 2026 Aadarsh Kadam