
feat(crashtracker): Collect and use panic callstack in crash report #1918

Draft
gleocadie wants to merge 2 commits into main from gleocadie/use-panic-callstack-instead

Conversation

@gleocadie
Contributor

@gleocadie gleocadie commented Apr 24, 2026

What does this PR do?

Captures the callstack directly in the panic hook (using the backtrace crate) and stores it alongside the panic message. When the signal handler fires after a panic, the pre-captured callstack is used instead of unwinding from the signal handler's ucontext. A new CrashKindData::Panic variant distinguishes panic-originated crashes from signal-based crashes and unhandled exceptions, and the error kind is set to ErrorKind::Panic accordingly.

Key changes:

  • Added PANIC_CALLSTACK atomic global to store the callstack captured in the panic hook
  • Introduced capture_panic_callstack() which uses backtrace::trace_unsynchronized to walk the stack and collect ip, sp, symbol address, and module base address per frame
  • Refactored CrashKindData to add a Panic variant carrying both the stacktrace and message, and moved message_ptr into the per-variant data where it belongs
  • Simplified Collector::spawn signature by passing CrashKindData directly instead of individual raw pointers
  • The signal handler now checks if a panic message was stored: if so, it builds a CrashKindData::Panic; otherwise it falls back to CrashKindData::UnixSignal
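
The variant selection described in the last bullet can be sketched as follows. This is a simplified illustration, not the PR's code: the field layout of `CrashKindData`, the `RawFrame` type, the `crash_kind_for_signal` and `error_kind` helpers, and the kind strings other than "Panic" are assumptions, and owned `String`/`Vec` values stand in for the raw pointers the real signal handler works with.

```rust
/// Hypothetical, simplified frame record; the PR's frames also carry
/// symbol and module base addresses.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawFrame {
    pub ip: usize,
    pub sp: usize,
}

/// Simplified stand-in for the refactored enum: the message lives in the
/// variants that can have one, and `UnixSignal` carries none.
#[derive(Debug, PartialEq, Eq)]
pub enum CrashKindData {
    Panic { stacktrace: Vec<RawFrame>, message: String },
    UnhandledException { stacktrace: Vec<RawFrame>, message: String },
    UnixSignal { signum: i32 },
}

/// Chooses the variant the way the description states: a stored panic
/// message means the crash originated from a panic, otherwise it is
/// treated as a plain signal crash.
pub fn crash_kind_for_signal(
    panic_message: Option<String>,
    panic_callstack: Option<Vec<RawFrame>>,
    signum: i32,
) -> CrashKindData {
    match panic_message {
        Some(message) => CrashKindData::Panic {
            // An empty stack is used here if the hook stored no frames.
            stacktrace: panic_callstack.unwrap_or_default(),
            message,
        },
        None => CrashKindData::UnixSignal { signum },
    }
}

/// Maps each variant to an error kind label (labels other than "Panic"
/// are assumed for illustration).
pub fn error_kind(data: &CrashKindData) -> &'static str {
    match data {
        CrashKindData::Panic { .. } => "Panic",
        CrashKindData::UnhandledException { .. } => "UnhandledException",
        CrashKindData::UnixSignal { .. } => "UnixSignal",
    }
}
```

Keeping the message inside the variants means a `UnixSignal` value cannot accidentally carry a stale panic message.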

Motivation

Previously, for panics in binaries compiled with panic = "abort", the crash report callstack was captured from the signal handler after SIGABRT was raised. This had several problems:

  1. Polluted stack: the abort machinery frames (__GI_raise, rust_panic_abort, etc.) appear on top of the actual panic site
  2. Unreliable in unwind mode: if panic = "unwind" is used and the panic crosses an FFI boundary, the stack may be partially or fully destroyed by the time the signal handler fires
  3. Broken on Alpine/musl: musl's signal trampoline lacks DWARF unwind info (missing FDE), making signal-handler-based unwinding fragile. While the existing libunwind-seeded-from-ucontext approach mitigates this for signal crashes, capturing in the panic hook avoids the problem entirely since there is no signal trampoline to cross
  4. Signal handler may never fire: in unwind mode, if a catch_unwind boundary exists or the thread simply terminates, no signal is raised and the callstack is lost entirely

Capturing in the panic hook gives us the cleanest possible callstack: the stack is fully intact, we're in normal Rust context (safe to allocate, use libunwind, etc.), and it works identically in both unwind and abort modes.
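
A minimal sketch of capturing at panic time, using only the standard library. It is not the PR's implementation: `std::backtrace::Backtrace` stands in for the `backtrace` crate's frame walk, a `Mutex` stands in for the atomic global, and `CapturedPanic`, `install_capture_hook`, and `take_last_panic` are hypothetical names. It shows that the hook sees the message and the intact stack even when a `catch_unwind` boundary later swallows the panic and no signal is raised.

```rust
use std::backtrace::Backtrace;
use std::panic;
use std::sync::Mutex;

/// Hypothetical record of what the hook observed.
pub struct CapturedPanic {
    pub message: String,
    pub backtrace: Backtrace,
}

/// Stand-in storage; the PR describes an atomic global instead.
static LAST_PANIC: Mutex<Option<CapturedPanic>> = Mutex::new(None);

/// Installs a hook that records the panic message and a backtrace,
/// then chains to the previously installed hook.
pub fn install_capture_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Panic payloads are usually `&str` or `String`.
        let message = if let Some(s) = info.payload().downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = info.payload().downcast_ref::<String>() {
            s.clone()
        } else {
            "<non-string panic payload>".to_string()
        };
        // The hook runs on the panicking thread before unwinding or
        // aborting, so the stack is still intact at this point.
        let backtrace = Backtrace::force_capture();
        if let Ok(mut slot) = LAST_PANIC.lock() {
            *slot = Some(CapturedPanic { message, backtrace });
        }
        previous(info);
    }));
}

/// Removes and returns the most recently captured panic, if any.
pub fn take_last_panic() -> Option<CapturedPanic> {
    LAST_PANIC.lock().ok().and_then(|mut slot| slot.take())
}
```

Chaining to the previous hook keeps the default stderr output and any hook installed by the host application.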

Additional Notes

  • The backtrace crate is added as a new dependency. The hook uses trace_unsynchronized to avoid locking overhead, since each hook invocation walks only the stack of the thread that is panicking.
  • The message_ptr field was moved from being a separate parameter in emit_crashreport into the CrashKindData variants (Panic and UnhandledException), since UnixSignal crashes don't carry a panic message. This makes the data flow more explicit.
  • Symbol resolution is not performed in the panic hook — only raw addresses (ip, sp, symbol_address, module_base_address) are captured. The receiver's blazesym pass handles symbolication as before.
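
To illustrate the last bullet, here is a sketch of a raw frame record holding only the four addresses listed above. The struct layout, the use of `Option` for addresses that may be unavailable, and the `module_offset` and `to_hex_line` helpers are assumptions for illustration; how the receiver actually feeds these values to blazesym is not shown in this PR description.

```rust
/// Hypothetical raw frame: addresses only, no symbol names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RawFrame {
    pub ip: usize,
    pub sp: usize,
    pub symbol_address: Option<usize>,
    pub module_base_address: Option<usize>,
}

impl RawFrame {
    /// Offset of the instruction pointer inside its module, when the
    /// module base is known. A symbolizer running in another process
    /// typically needs a module-relative address like this, because
    /// absolute addresses depend on where the module was loaded.
    pub fn module_offset(&self) -> Option<usize> {
        self.module_base_address
            .and_then(|base| self.ip.checked_sub(base))
    }

    /// Renders the frame without any symbol lookup.
    pub fn to_hex_line(&self) -> String {
        format!("ip={:#x} sp={:#x}", self.ip, self.sp)
    }
}
```

Deferring name lookup keeps the work done inside the crashing process small; the addresses are enough for a later symbolication pass.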

How to test the change?

  • Run the existing crash tracker binary tests: cargo test -p bin_tests -- crashtracker
  • Verify that the "panic" crash type test now asserts error.kind == "Panic"
  • Unit tests for PANIC_CALLSTACK atomic storage/retrieval/replacement are included
  • Test on Alpine (musl) to confirm that panic callstacks are captured correctly without depending on signal-handler unwinding
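
The storage, retrieval, and replacement behaviour that the unit tests are said to cover can be sketched with an `AtomicPtr` slot. This is an assumption about the shape of PANIC_CALLSTACK, not the PR's code: `RawFrame`, `store_panic_callstack`, and `take_panic_callstack` are hypothetical names, and the real signal-handler path may avoid freeing memory, since deallocation is not async-signal-safe.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical, simplified frame record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawFrame {
    pub ip: usize,
    pub sp: usize,
}

/// Null means "no callstack stored".
static PANIC_CALLSTACK: AtomicPtr<Vec<RawFrame>> = AtomicPtr::new(ptr::null_mut());

/// Stores a callstack, dropping any previously stored one.
pub fn store_panic_callstack(frames: Vec<RawFrame>) {
    let new_ptr = Box::into_raw(Box::new(frames));
    let old_ptr = PANIC_CALLSTACK.swap(new_ptr, Ordering::AcqRel);
    if !old_ptr.is_null() {
        // SAFETY: every non-null pointer in the slot came from
        // `Box::into_raw`, and `swap` hands it to exactly one caller.
        drop(unsafe { Box::from_raw(old_ptr) });
    }
}

/// Takes ownership of the stored callstack, leaving the slot empty.
pub fn take_panic_callstack() -> Option<Vec<RawFrame>> {
    let raw = PANIC_CALLSTACK.swap(ptr::null_mut(), Ordering::AcqRel);
    if raw.is_null() {
        None
    } else {
        // SAFETY: same ownership argument as in `store_panic_callstack`.
        Some(*unsafe { Box::from_raw(raw) })
    }
}
```

Using `swap` for both operations means two panicking threads cannot both end up owning the same allocation.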

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

📚 Documentation Check Results

⚠️ 1055 documentation warning(s) found

📦 libdd-crashtracker - 1055 warning(s)


Updated: 2026-04-24 09:53:07 UTC | Commit: e9de96e | missing-docs job results

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

🔒 Cargo Deny Results

⚠️ 4 issue(s) found, showing only errors (advisories, bans, sources)

📦 libdd-crashtracker - 4 error(s)

error[unsound]: Rand is unsound with a custom logger using `rand::rng()`
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:148:1
    │
148 │ rand 0.8.5 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ unsound advisory detected
    │
    ├ ID: RUSTSEC-2026-0097
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0097
    ├ It has been reported (by @lopopolo) that the `rand` library is [unsound](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library) (i.e. that safe code using the public API can cause Undefined Behaviour) when all the following conditions are met:
      
      - The `log` and `thread_rng` features are enabled
      - A [custom logger](https://docs.rs/log/latest/log/#implementing-a-logger) is defined
      - The custom logger accesses `rand::rng()` (previously `rand::thread_rng()`) and calls any `TryRng` (previously `RngCore`) methods on `ThreadRng`
      - The `ThreadRng` (attempts to) reseed while called from the custom logger (this happens every 64 kB of generated data)
      - Trace-level logging is enabled or warn-level logging is enabled and the random source (the `getrandom` crate) is unable to provide a new seed
      
      `TryRng` (previously `RngCore`) methods for `ThreadRng` use `unsafe` code to cast `*mut BlockRng<ReseedingCore>` to `&mut BlockRng<ReseedingCore>`. When all the above conditions are met this results in an aliased mutable reference, violating the Stacked Borrows rules. Miri is able to detect this violation in sample code. Since construction of [aliased mutable references is Undefined Behaviour](https://doc.rust-lang.org/stable/nomicon/references.html), the behaviour of optimized builds is hard to predict.
    ├ Announcement: https://github.com/rust-random/rand/pull/1763
    ├ Solution: Upgrade to >=0.10.1 OR <0.10.0, >=0.9.3 OR <0.9.0, >=0.8.6 (try `cargo update -p rand`)
    ├ rand v0.8.5
      ├── libdd-common v3.0.2
      │   ├── (build) libdd-crashtracker v1.0.0
      │   ├── libdd-shared-runtime v0.1.0
      │   │   └── libdd-telemetry v4.0.0
      │   │       └── libdd-crashtracker v1.0.0 (*)
      │   └── libdd-telemetry v4.0.0 (*)
      └── libdd-crashtracker v1.0.0 (*)

error[vulnerability]: Name constraints for URI names were incorrectly accepted
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:162:1
    │
162 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
    │
    ├ ID: RUSTSEC-2026-0098
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0098
    ├ Name constraints for URI names were ignored and therefore accepted.
      
      Note this library does not provide an API for asserting URI names, and URI name constraints are otherwise not implemented.  URI name constraints are now rejected unconditionally.
      
      Since name constraints are restrictions on otherwise properly-issued certificates, this bug is reachable only after signature verification and requires misissuance to exploit.
      
      This vulnerability is identified as [GHSA-965h-392x-2mh5](https://github.com/rustls/webpki/security/advisories/GHSA-965h-392x-2mh5). Thank you to @1seal for the report.
    ├ Solution: Upgrade to >=0.103.12, <0.104.0-alpha.1 OR >=0.104.0-alpha.6 (try `cargo update -p rustls-webpki`)
    ├ rustls-webpki v0.103.10
      └── rustls v0.23.37
          ├── hyper-rustls v0.27.7
          │   └── libdd-common v3.0.2
          │       ├── (build) libdd-crashtracker v1.0.0
          │       ├── libdd-shared-runtime v0.1.0
          │       │   └── libdd-telemetry v4.0.0
          │       │       └── libdd-crashtracker v1.0.0 (*)
          │       └── libdd-telemetry v4.0.0 (*)
          ├── libdd-common v3.0.2 (*)
          └── tokio-rustls v0.26.0
              ├── hyper-rustls v0.27.7 (*)
              └── libdd-common v3.0.2 (*)

error[vulnerability]: Name constraints were accepted for certificates asserting a wildcard name
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:162:1
    │
162 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
    │
    ├ ID: RUSTSEC-2026-0099
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0099
    ├ Permitted subtree name constraints for DNS names were accepted for certificates asserting a wildcard name.
      
      This was incorrect because, given a name constraint of `accept.example.com`, `*.example.com` could feasibly allow a name of `reject.example.com` which is outside the constraint.
      This is very similar to [CVE-2025-61727](https://go.dev/issue/76442).
      
      Since name constraints are restrictions on otherwise properly-issued certificates, this bug is reachable only after signature verification and requires misissuance to exploit.
      
      This vulnerability is identified as [GHSA-xgp8-3hg3-c2mh](https://github.com/rustls/webpki/security/advisories/GHSA-xgp8-3hg3-c2mh). Thank you to @1seal for the report.
    ├ Solution: Upgrade to >=0.103.12, <0.104.0-alpha.1 OR >=0.104.0-alpha.6 (try `cargo update -p rustls-webpki`)
    ├ rustls-webpki v0.103.10
      └── rustls v0.23.37
          ├── hyper-rustls v0.27.7
          │   └── libdd-common v3.0.2
          │       ├── (build) libdd-crashtracker v1.0.0
          │       ├── libdd-shared-runtime v0.1.0
          │       │   └── libdd-telemetry v4.0.0
          │       │       └── libdd-crashtracker v1.0.0 (*)
          │       └── libdd-telemetry v4.0.0 (*)
          ├── libdd-common v3.0.2 (*)
          └── tokio-rustls v0.26.0
              ├── hyper-rustls v0.27.7 (*)
              └── libdd-common v3.0.2 (*)

error[vulnerability]: Reachable panic in certificate revocation list parsing
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:162:1
    │
162 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
    │
    ├ ID: RUSTSEC-2026-0104
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0104
    ├ A panic was reachable when parsing certificate revocation lists via [`BorrowedCertRevocationList::from_der`]
      or [`OwnedCertRevocationList::from_der`].  This was the result of mishandling a syntactically valid empty
      `BIT STRING` appearing in the `onlySomeReasons` element of a `IssuingDistributionPoint` CRL extension.
      
      This panic is reachable prior to a CRL's signature being verified.
      
      Applications that do not use CRLs are not affected.
      
      Thank you to @tynus3 for the report.
    ├ Solution: Upgrade to >=0.103.13, <0.104.0-alpha.1 OR >=0.104.0-alpha.7 (try `cargo update -p rustls-webpki`)
    ├ rustls-webpki v0.103.10
      └── rustls v0.23.37
          ├── hyper-rustls v0.27.7
          │   └── libdd-common v3.0.2
          │       ├── (build) libdd-crashtracker v1.0.0
          │       ├── libdd-shared-runtime v0.1.0
          │       │   └── libdd-telemetry v4.0.0
          │       │       └── libdd-crashtracker v1.0.0 (*)
          │       └── libdd-telemetry v4.0.0 (*)
          ├── libdd-common v3.0.2 (*)
          └── tokio-rustls v0.26.0
              ├── hyper-rustls v0.27.7 (*)
              └── libdd-common v3.0.2 (*)

advisories FAILED, bans ok, sources ok

Updated: 2026-04-24 09:54:29 UTC | Commit: e9de96e | dependency-check job results

@github-actions
Contributor

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/gleocadie/use-panic-callstack-instead

Summary by Rule

Rule Base Branch PR Branch Change

Annotation Counts by File

File Base Branch PR Branch Change

Annotation Stats by Crate

Crate Base Branch PR Branch Change
clippy-annotation-reporter 5 5 No change (0%)
datadog-ffe-ffi 1 1 No change (0%)
datadog-ipc 21 21 No change (0%)
datadog-live-debugger 6 6 No change (0%)
datadog-live-debugger-ffi 10 10 No change (0%)
datadog-profiling-replayer 4 4 No change (0%)
datadog-remote-config 3 3 No change (0%)
datadog-sidecar 56 56 No change (0%)
libdd-common 10 10 No change (0%)
libdd-common-ffi 12 12 No change (0%)
libdd-data-pipeline 5 5 No change (0%)
libdd-ddsketch 2 2 No change (0%)
libdd-dogstatsd-client 1 1 No change (0%)
libdd-profiling 13 13 No change (0%)
libdd-telemetry 19 19 No change (0%)
libdd-tinybytes 4 4 No change (0%)
libdd-trace-normalization 2 2 No change (0%)
libdd-trace-obfuscation 8 8 No change (0%)
libdd-trace-stats 1 1 No change (0%)
libdd-trace-utils 15 15 No change (0%)
Total 198 198 No change (0%)

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

@datadog-prod-us1-4

datadog-prod-us1-4 Bot commented Apr 24, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 44.14%
Overall Coverage: 71.75% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: d73cb3a

@codecov-commenter

codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 44.14414% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.75%. Comparing base (cf8c1cf) to head (d73cb3a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1918   +/-   ##
=======================================
  Coverage   71.75%   71.75%           
=======================================
  Files         434      434           
  Lines       69954    70026   +72     
=======================================
+ Hits        50192    50244   +52     
- Misses      19762    19782   +20     
Components Coverage Δ
libdd-crashtracker 66.15% <44.14%> (+0.19%) ⬆️
libdd-crashtracker-ffi 35.36% <ø> (+1.27%) ⬆️
libdd-alloc 98.77% <ø> (ø)
libdd-data-pipeline 85.65% <ø> (ø)
libdd-data-pipeline-ffi 70.70% <ø> (ø)
libdd-common 79.41% <ø> (ø)
libdd-common-ffi 73.87% <ø> (ø)
libdd-telemetry 68.06% <ø> (ø)
libdd-telemetry-ffi 19.37% <ø> (ø)
libdd-dogstatsd-client 82.64% <ø> (ø)
datadog-ipc 76.16% <ø> (-0.15%) ⬇️
libdd-profiling 81.61% <ø> (ø)
libdd-profiling-ffi 64.36% <ø> (ø)
datadog-sidecar 29.16% <ø> (+0.04%) ⬆️
datdog-sidecar-ffi 7.42% <ø> (+0.22%) ⬆️
spawn-worker 54.69% <ø> (ø)
libdd-tinybytes 93.16% <ø> (ø)
libdd-trace-normalization 81.71% <ø> (ø)
libdd-trace-obfuscation 87.26% <ø> (ø)
libdd-trace-protobuf 68.25% <ø> (ø)
libdd-trace-utils 89.27% <ø> (ø)
datadog-tracer-flare 86.88% <ø> (ø)
libdd-log 74.69% <ø> (ø)

@dd-octo-sts
Contributor

dd-octo-sts Bot commented Apr 24, 2026

Artifact Size Benchmark Report

aarch64-alpine-linux-musl
Artifact Baseline Commit Change
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.so 7.63 MB 7.63 MB 0% (0 B) 👌
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.a 83.31 MB 83.33 MB +.02% (+17.75 KB) 🔍
aarch64-unknown-linux-gnu
Artifact Baseline Commit Change
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.10 MB 10.17 MB +.63% (+65.35 KB) 🔍
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a 99.42 MB 99.44 MB +.02% (+25.99 KB) 🔍
libdatadog-x64-windows
Artifact Baseline Commit Change
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll 25.19 MB 25.19 MB -.01% (-4.00 KB) 💪
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib 79.90 KB 79.90 KB 0% (0 B) 👌
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb 184.45 MB 184.45 MB 0% (0 B) 👌
/libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib 918.63 MB 918.65 MB +0% (+22.09 KB) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll 7.89 MB 7.89 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib 79.90 KB 79.90 KB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb 23.67 MB 23.67 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib 46.19 MB 46.19 MB -0% (-104 B) 👌
libdatadog-x86-windows
Artifact Baseline Commit Change
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll 21.66 MB 21.66 MB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib 81.14 KB 81.14 KB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb 188.55 MB 188.56 MB +0% (+8.00 KB) 👌
/libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib 903.86 MB 903.89 MB +0% (+21.95 KB) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll 6.13 MB 6.13 MB -.01% (-1.00 KB) 💪
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib 81.14 KB 81.14 KB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb 25.36 MB 25.36 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib 43.68 MB 43.68 MB +0% (+48 B) 👌
x86_64-alpine-linux-musl
Artifact Baseline Commit Change
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.a 74.28 MB 74.30 MB +.02% (+19.24 KB) 🔍
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.so 8.55 MB 8.55 MB 0% (0 B) 👌
x86_64-unknown-linux-gnu
Artifact Baseline Commit Change
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a 91.78 MB 91.80 MB +.01% (+16.52 KB) 🔍
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.20 MB 10.20 MB +.04% (+4.96 KB) 🔍
