
Add linux_psi integration #3020

Closed
victoroseghale wants to merge 39 commits into DataDog:master from victoroseghale:master

@victoroseghale victoroseghale commented May 29, 2026

What does this PR do?

Adds a new integration linux_psi to monitor Linux kernel PSI (Pressure Stall Information).

Motivation

PSI (kernel 4.20+) is the canonical signal for "how much time is this system stalling on contention" - far more useful than CPU% for diagnosing "everything is slow" incidents. The existing linux_proc_extras integration does not cover PSI. No other integration in core or extras does either.

Per-cgroup PSI extends the same signal to the noisy-neighbor question: which workload is generating pressure on this host? This is the answer container-heavy fleets need but cannot get from host-wide metrics alone.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo
  • If this PR includes a log pipeline, please add a description describing the remappers and processors.

Additional Information

Ships at v1.1.0 with:

  • Host-wide PSI from /proc/pressure/{cpu,memory,io} - 24 metrics covering the some/full x avg10/avg60/avg300/total matrix per resource. Honors the Agent's procfs_path config for containerized agents.
  • Opt-in per-cgroup PSI from /sys/fs/cgroup/.../{cpu,memory,io}.pressure - 24 additional metrics under the system.pressure.cgroup.* namespace, tagged with cgroup_path and cgroup_root so users can drill into k8s pods or systemd services. Requires cgroup v2.
  • Kernel version surfaced via set_metadata for fleet-wide audits ("which hosts can emit cpu.full?").
  • 8 instance config options: tags, service, min_collection_interval, resources (subset of cpu/memory/io), cgroup_roots, cgroupfs_path, cgroup_max_depth, cgroup_max_count (cardinality caps).
  • 2 service checks: linux_psi.can_read (host) and linux_psi.cgroup.can_read (cgroup, only when cgroup_roots is configured).
  • 5 recommended monitors: high cpu_some, severe cpu_full, severe memory_full (paging-grade), high io_some, severe io_full.
  • 30-widget operator-question-organized dashboard with 7 collapsible groups (status, live trends, fleet view, per-host investigation, per-cgroup view, capacity & trends, reference).
  • 6 dashboard screenshots in tile.media[] captured from a real multi-host deployment showing the stress-workload phases clearly visible across each panel.
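
The 24-metric figure follows from the resource x kind x key matrix. A minimal sketch that enumerates the names, assuming the `system.pressure.<resource>.<kind>.<key>` pattern quoted in the commit messages (the helper `metric_names` is hypothetical, not part of the check):

```python
from itertools import product

RESOURCES = ("cpu", "memory", "io")
KINDS = ("some", "full")
KEYS = ("avg10", "avg60", "avg300", "total")


def metric_names(namespace="system.pressure"):
    """Return every metric name in the resource x kind x key matrix."""
    return [
        f"{namespace}.{resource}.{kind}.{key}"
        for resource, kind, key in product(RESOURCES, KINDS, KEYS)
    ]


host = metric_names()
cgroup = metric_names("system.pressure.cgroup")
print(len(host), len(cgroup))  # 24 24
```

On kernels older than 5.13 the four `cpu.full` names never emit, so a host reports at most 20 of the host-wide set.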

victoralfred and others added 30 commits May 28, 2026 14:37
Adds the empty Python package skeleton for the new linux_psi check:
pyproject.toml, hatch.toml, version file at 1.0.0, and the
datadog_checks/linux_psi namespace stub.

No check logic yet - that lands in the next commit.
Reads /proc/pressure/{cpu,memory,io} (Linux kernel 4.20+) and emits up to
24 metrics per host - some/full x avg10/avg60/avg300 (gauges) and total
(monotonic_count) for each resource.
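
For orientation, each pressure file holds one `some` line and, where supported, one `full` line of `key=value` fields. The parser below is a sketch of that parsing step under those assumptions; `parse_psi` is an illustrative name and not necessarily how check.py is structured:

```python
GAUGE_KEYS = ("avg10", "avg60", "avg300")


def parse_psi(text):
    """Parse PSI file content into {kind: {key: value}}, skipping anything malformed."""
    parsed = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("some", "full"):
            continue  # blank line or unknown line type
        values = {}
        for field in fields[1:]:
            key, sep, raw = field.partition("=")
            if not sep:
                continue  # field without an equals sign
            try:
                if key in GAUGE_KEYS:
                    values[key] = float(raw)  # percentages, submitted as gauges
                elif key == "total":
                    values[key] = int(raw)  # cumulative counter, monotonic_count
            except ValueError:
                continue  # unparseable number
        parsed[fields[0]] = values
    return parsed


sample = "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
print(parse_psi(sample))
```

A file without a `full` line simply yields no `full` key, which matches the "those metrics do not emit" behavior described for pre-5.13 CPU pressure.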

Handles three real-world conditions gracefully:
  - PSI not enabled (kernel < 4.20 or no psi=1 boot param) -> WARNING
    service check, no metrics
  - pre-5.13 kernels lack the 'full' line for cpu -> those 4 metrics
    simply do not emit
  - permission denied -> CRITICAL service check with the offending path

Honors the Agent's procfs_path config so container deployments that
mount the host /proc at /host/proc work without code changes.
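
The three conditions above map onto service check states roughly as follows. This is a sketch only: the status constants and `read_pressure_files` are hypothetical stand-ins, and the real check reports through the Agent's service check API rather than returning a tuple.

```python
import os

OK, WARNING, CRITICAL = 0, 1, 2  # stand-ins for the Agent's status constants


def read_pressure_files(pressure_dir, resources=("cpu", "memory", "io")):
    """Return (status, message, {resource: file content}) for a pressure directory."""
    contents, denied = {}, []
    for resource in resources:
        path = os.path.join(pressure_dir, resource)
        try:
            with open(path) as handle:
                contents[resource] = handle.read()
        except FileNotFoundError:
            continue  # PSI not enabled, or this resource is absent
        except PermissionError:
            denied.append(path)
        except OSError:
            continue  # other read errors are soft failures
    if denied:
        return CRITICAL, "permission denied: " + ", ".join(denied), contents
    if not contents:
        return WARNING, "no PSI files readable under " + pressure_dir, contents
    return OK, "", contents
```

For a containerized Agent the same function would be pointed at the mounted path, for example `read_pressure_files("/host/proc/pressure")`.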
Seven unit tests exercising every code path against fixture /proc/pressure
files: happy path, pre-5.13 kernel (no cpu.full line), missing pressure
directory (kernel < 4.20), one resource file missing, permission denied,
malformed input lines, and the monotonic_count typing of the total field.

One integration test reads the real /proc/pressure/* on the host. Skipped
on non-Linux and on Linux kernels without PSI enabled.

Six fixture files cover the cases above including a deliberately malformed
io fixture to verify parser resilience.
Tile manifest with Linux-only classifier and AI/ML / OS & System
categories. Twenty-four metric rows in metadata.csv. One service check
(linux_psi.can_read) with OK / WARNING / CRITICAL states documented.

Overview dashboard has six widgets - per-resource pressure timeseries
(cpu/memory/io) plus a top-list of hosts ranked by memory and io
pressure. Three recommended monitors:

  - cpu_pressure_some_high     - 5m avg300 > 50%, warning at 30%
  - memory_pressure_full_critical - 1m avg10 > 10%, pages on critical
  - io_pressure_some_high      - 5m avg300 > 70%, warning at 40%
User-facing README with Overview, Setup (host and containerized),
Configuration, Data Collected, Troubleshooting, and Support sections.
Documents the kernel requirement (4.20+), the psi=1 boot parameter,
and the procfs_path config for containerized agents.

Initial 1.0.0 changelog entry summarizing what shipped.
Updates outside the integration directory required for the test
runner and review routing to find linux_psi:

  - .codecov.yml: new Linux_PSI project entry and linux_psi flag
  - .github/CODEOWNERS: route /linux_psi/ to @voseghale plus
    @DataDog/ecosystems-review
  - .github/workflows/test-all.yml: include the linux_psi check in
    the test-all job

These are the entries ddev validate ci --sync generated automatically.
Add four new unit tests covering branches that the initial suite did not
exercise:

  - test_procfs_path_override: confirms the check honors the Agent's
    procfs_path config for containerized deployments
  - test_os_error_is_soft_failed: a generic OSError (EIO mid-read) on one
    file is logged but the other resources still emit and the service
    check stays OK
  - test_all_files_missing_yields_warning: empty /proc/pressure directory
    yields WARNING, not OK
  - test_multi_file_permission_denied_is_critical: when some files succeed
    and others permission-deny, the service check is CRITICAL and the
    message identifies the offending path

Extend the malformed io fixture to include a field with no equals sign
and an unknown field name so the existing test_malformed_line_is_ignored
exercises those branches.

Coverage on check.py: 84% -> 97%. The single remaining miss is a
defensive return that is unreachable because an earlier blank-line check
short-circuits the only path that could lead to it.
Soften the memory_full monitor so it does not page on transient pressure
spikes that are normal during page-cache reclaim:

  - threshold raised from 10% to 20%
  - window broadened from avg10/last_1m to avg60/last_5m
  - renotify cadence dropped from every 5m to every 30m

Add two missing 'full' (all-tasks-stalled) monitors. The previous set
only covered 'some' for cpu and io but 'full' is the more severe state
and deserves its own alert with appropriate threshold:

  - cpu_pressure_full_critical: avg60/5m > 10% (kernel 5.13+ only)
  - io_pressure_full_critical:  avg60/5m > 15%

Strip the placeholder @ALL and @PagerDuty handles from the templates -
notification routing belongs to the install-time configuration, not the
shipped template. Add an explicit comment in each monitor's message
explaining this.

Total recommended monitors: 5 (was 3).
Add a markdown note widget at the top of the dashboard explaining what
'some' vs 'full' mean, what the avg10/60/300 windows represent, and the
kernel version requirements. Users opening the dashboard for the first
time now understand the metrics without needing to click out to docs.

Also wire the existing host template variable through to all six
timeseries / toplist queries (was a hardcoded '*', now correctly filters
by the selected host).
Add an explicit Compatibility section listing the kernel version
requirements (4.20+ for the core integration, 5.13+ for the cpu.full
metrics), the minimum Agent version (7.53), and the psi=1 boot
parameter requirement.

In the Support section, state the license (BSD-3-Clause matching the
parent repo) and the contribution flow (issue first for non-trivial
changes) so contributors and users know the maintenance model up
front.
Add two new troubleshooting entries covering issues users hit most:

  - 'yaml: cannot unmarshal !!map into string' when tags is written
    as 'tags: - env: prod' (a list of maps) instead of 'tags:
    - env:prod' (a list of strings). Shows the wrong vs correct shape
    and the Python one-liner for linting conf.yaml standalone before
    restarting the agent.
  - 'check loads but no metrics in Datadog' explaining how to use
    'datadog-agent check linux_psi' to confirm the integration is
    emitting locally and distinguish a Datadog-account/network issue
    from an integration issue.

Both write-ups come from real first-install experience. The tag-shape
mistake in particular is the classic Datadog config bug every team
hits exactly once; documenting it here saves the next user a
restart-debug cycle.
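
The tag-shape mistake is easy to see once the YAML is parsed. The sketch below does not reproduce the README's one-liner (it is not shown in this PR page); it only illustrates the two parsed shapes with a hypothetical `find_bad_tags` helper:

```python
def find_bad_tags(tags):
    """Return the entries of `tags` that are not plain strings."""
    if not isinstance(tags, list):
        return [tags]
    return [tag for tag in tags if not isinstance(tag, str)]


# "- env: prod" (space after the colon) parses to a list of maps.
wrong = [{"env": "prod"}]
# "- env:prod" (no space) parses to a list of strings.
right = ["env:prod"]

print(find_bad_tags(wrong))  # [{'env': 'prod'}]
print(find_bad_tags(right))  # []
```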
Adds an optional 'resources' list to the instance config. Users can
restrict collection to a subset of cpu, memory, io to handle hosts
where one resource's pressure is permanently elevated for legitimate
reasons (e.g., a database host with sustained high I/O) or where a
specific /proc/pressure file is masked by container security policy.

Unknown resource names raise ConfigurationError at check startup with
the offending values surfaced in the error message rather than
degrading silently.

Why this is better than the existing metric_patterns workaround:
metric_patterns suppresses the *metrics* but the check still opens
and reads the file. That means the linux_psi.can_read service check
still degrades if a 'noisy' resource has permission issues. The
resources option skips the file entirely so the service check stays
clean.

Three new unit tests:

  - test_resources_config_filters_collection
  - test_resources_config_rejects_unknown (ConfigurationError path)
  - test_resources_config_preserves_order_and_dedups

Coverage now 98% on check.py (was 97%). Total tests: 15.
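
A sketch of the validation behavior those three tests describe (filter, reject unknown names, keep order while deduplicating). `resolve_resources` here is illustrative, and `ConfigurationError` is redefined locally so the snippet runs without the Agent's base package:

```python
VALID_RESOURCES = ("cpu", "memory", "io")


class ConfigurationError(Exception):
    """Local stand-in for the Agent's ConfigurationError."""


def resolve_resources(configured=None):
    """Validate the `resources` option and return an ordered, de-duplicated list."""
    if configured is None:
        return list(VALID_RESOURCES)
    if isinstance(configured, str) or not isinstance(configured, (list, tuple)):
        raise ConfigurationError("resources must be a list, got %r" % (configured,))
    unknown = [name for name in configured if name not in VALID_RESOURCES]
    if unknown:
        raise ConfigurationError("unknown resources: " + ", ".join(map(repr, unknown)))
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(configured))


print(resolve_resources(["io", "cpu", "io"]))  # ['io', 'cpu']
```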
Read /proc/sys/kernel/osrelease, parse the major.minor.patch prefix
with a defensive regex, and submit it via set_metadata with the
semver scheme. The kernel version then appears in the tile's
Integration metadata column, making fleet-wide audits possible:

  - 'how many hosts are on a kernel that lacks cpu.full PSI (< 5.13)?'
  - 'are all my hosts on a kernel that supports the psi=1 boot param
    (4.20+)?'

Refactor the procfs path resolution: _set_paths now stores
self._proc_root, which both pressure_dir and the kernel-version
metadata read use. This means the existing procfs_path config (for
containerized agents that mount /proc at /host/proc) automatically
covers the kernel version read as well.

Six new test cases:

  - test_kernel_version_metadata_parses: parametrized over four
    distro-specific osrelease strings (Ubuntu, plain, sticky-patch
    with +, Debian)
  - test_kernel_version_metadata_missing_file_is_silent: graceful
    no-op when osrelease cannot be opened
  - test_kernel_version_metadata_garbled_is_silent: graceful no-op
    on unparseable osrelease content

Total tests: 21 (was 15).
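
The osrelease parsing can be sketched as below. The regex and helper names are assumptions made for illustration; the commit only states that a defensive regex extracts the major.minor.patch prefix.

```python
import re

KERNEL_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_kernel_version(osrelease):
    """Return 'major.minor.patch' from an osrelease string, or None if unparseable."""
    match = KERNEL_VERSION_RE.match(osrelease.strip())
    if match is None:
        return None
    return ".".join(match.groups())


def supports_cpu_full(version):
    """True when the kernel is 5.13 or newer, the stated floor for cpu.full."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (5, 13)


print(parse_kernel_version("5.15.0-91-generic\n"))  # 5.15.0
print(parse_kernel_version("not-a-kernel"))         # None
```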
Three vertical bars in rising height (green / amber / red) representing
the three PSI resources CPU / memory / I/O at increasing pressure
levels. Communicates the integration's purpose at a glance in the tile
catalog and stays recognizable at small sizes.

No gradients, no text, no external font dependencies. Works on both
light and dark tile backgrounds. ~25 lines of SVG total.
The v1 dashboard is the floor - every metric has a place, every
resource is represented. The dashboard that actually answers
operational questions is one design pass beyond that, and writing the
design down now means whoever picks this up later (including the
original author six months from now) does not have to re-derive it.

DASHBOARD_DESIGN.md covers:

  - Design philosophy: organize panels by the question an operator
    wants answered, not by the metric being shown
  - Six question-driven panels for v1.1 (stacked three-resource view,
    severe-contention indicator, pressure-budget heatmap, pressure vs
    utilization quadrant, week-over-week delta with deploy overlay,
    pressure SLO board)
  - The composite pressure score (computed at query time, no extra
    storage cost, single highest-density visualization)
  - Candidates I considered and ruled out, with the reasoning
  - Recommended shipping order so each panel goes out independently
  - Validation checklist before merging any panel into the main JSON
  - Links to the relevant Datadog function and widget docs

This is the kind of doc that turns 'whatever the original author
intended is lost' into 'here is exactly why we shipped what we
shipped and what would come next.'
Add opt-in collection of per-cgroup pressure stall data from
/sys/fs/cgroup/.../<resource>.pressure. Disabled by default; enable
by setting the cgroup_roots config option to a list of slices to
walk (e.g., system.slice, kubepods.slice).

Per-cgroup metrics live in their own namespace
'system.pressure.cgroup.<resource>.<kind>.<key>' so they do not
co-mingle with host-level 'system.pressure.<resource>.<kind>.<key>'
in aggregate queries (avg:metric{*} would otherwise blend host-level
and per-cgroup values into one meaningless average). Each per-cgroup
metric carries cgroup_path and cgroup_root tags so users can drill in
by namespace/service/pod, and so the Agent tagger can enrich them
downstream with k8s metadata when present.

Cardinality is bounded by three caps:

  - cgroup_max_depth (default 2) limits subdirectory recursion
  - cgroup_max_count (default 200) limits total cgroups per check run
  - cgroup_roots is opt-in so users only walk what they want

cgroup PSI requires cgroup v2 (the unified hierarchy). The check
detects v1 hosts by the absence of cgroup.controllers at the
cgroupfs root and emits a WARNING service check
(linux_psi.cgroup.can_read) with a clear message instead of
silently failing.
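
The v1/v2 detection described here comes down to a single file test. A minimal sketch, with `is_cgroup_v2` as a hypothetical helper name:

```python
import os


def is_cgroup_v2(cgroupfs_path="/sys/fs/cgroup"):
    """cgroup v2 exposes cgroup.controllers at the root of the unified hierarchy."""
    return os.path.isfile(os.path.join(cgroupfs_path, "cgroup.controllers"))


print(is_cgroup_v2())  # result depends on the host this runs on
```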

Refactor _emit_line and _read_one to take namespace + tags
parameters so they can serve both the host path (unchanged metric
names) and the new cgroup path (system.pressure.cgroup.* with
cgroup_path tags). Existing 21 tests pass unchanged.
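
A sketch of a walk bounded by depth and count, for readers who want to see how the two caps interact. It assumes the configured root counts as depth 0, which the commit message does not specify, and `walk_cgroups` is an illustrative name:

```python
import os


def walk_cgroups(root, max_depth=2, max_count=200):
    """Return (cgroup directories in walk order, whether max_count cut the walk short)."""
    found = []
    stack = [(root, 0)]
    while stack:
        path, depth = stack.pop()
        if len(found) >= max_count:
            return found, True  # more cgroups exist than the cap allows
        found.append(path)
        if depth >= max_depth:
            continue  # do not descend further
        try:
            with os.scandir(path) as entries:
                children = sorted(
                    entry.path for entry in entries if entry.is_dir(follow_symlinks=False)
                )
        except OSError:
            continue  # unreadable directory: skip it, keep walking siblings
        # push in reverse so children are visited in sorted order
        stack.extend((child, depth + 1) for child in reversed(children))
    return found, False
```

Calling `walk_cgroups("/sys/fs/cgroup/system.slice")` with the defaults would cover the slice, its services, and one level below them, stopping at 200 directories.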
…ice check

Adds metadata.csv rows for the 24 system.pressure.cgroup.* metrics
emitted by the new cgroup PSI collection path. Each row notes that
the metric is per-cgroup and lists cgroup_path as a sample tag so
the Datadog Metrics Summary UI shows it in the tag dimension column.

Adds the linux_psi.cgroup.can_read service check entry with both
states documented (OK when cgroup v2 hierarchy is readable, WARNING
when the host is on cgroup v1 or the cgroupfs path is missing).

Total metrics: 48 (was 24). Total service checks: 2 (was 1).
Spec adds four instance options for the cgroup feature:
  - cgroup_roots: opt-in list of slices to walk (default: empty,
    feature disabled). When set, must be a YAML list.
  - cgroupfs_path: cgroup filesystem root (default /sys/fs/cgroup)
  - cgroup_max_depth: subdirectory recursion limit (default 2)
  - cgroup_max_count: cardinality cap per check run (default 200)

Regenerated config_models and conf.yaml.example from the updated
spec. Documented each option with a paragraph explaining when to
tune it.

Five new test cases:

  - test_cgroup_disabled_by_default: no cgroup walking when
    cgroup_roots is empty, no cgroup service check emitted
  - test_cgroup_collects_metrics: full walk of a fake systemd-style
    slice (system.slice + 2 services), verifies the per-cgroup
    metrics are emitted with the correct cgroup_path and cgroup_root
    tags
  - test_cgroup_v1_host_warns: a cgroupfs path that lacks the
    cgroup.controllers marker (i.e., cgroup v1) yields a clear
    WARNING service check, host-level metrics keep flowing
  - test_cgroup_max_count_caps_cardinality: 5 cgroups under a
    max_count of 2 emits at most 2 cgroup paths worth of metrics
  - test_cgroup_roots_rejects_non_list: a scalar value for
    cgroup_roots raises ConfigurationError at check startup

Total tests: 26 (was 21).
README updates:
  - Configuration table gains four new options (cgroup_roots,
    cgroupfs_path, cgroup_max_depth, cgroup_max_count) with the
    when-to-tune-them rationale documented inline
  - Data Collected section explains the new system.pressure.cgroup.*
    namespace and the cgroup_path / cgroup_root tags
  - Service Checks table lists linux_psi.cgroup.can_read
  - Compatibility matrix gains the cgroup v2 (kernel 5.2+) row
  - Two new troubleshooting entries for the cgroup-v2-not-detected
    case and the cgroup_max_count warning

DASHBOARD_DESIGN updated to note that per-cgroup PSI is now available
(it was previously listed as deferred). A v1.2 dashboard pass should
add the per-cgroup breakdowns, with the canonical 'which workload is
stressing this host' panel (top-list of cgroups by
cgroup.cpu.some.avg300) called out as the highest-value addition.

Version bump 1.0.0 -> 1.1.0 per SemVer (new opt-in feature,
backwards-compatible). CHANGELOG entry documents the added and
changed bits.
A misconfigured conf.yaml could previously make the check walk and
emit metrics for directories outside the cgroupfs root. Three attack
vectors are now closed at the config and walk layers:

  - cgroup_roots entries containing parent-directory references (..)
    are rejected at check startup with a ConfigurationError
  - cgroup_roots entries that are absolute paths are rejected at
    check startup
  - cgroup_roots entries that resolve outside cgroupfs_path via a
    symlink (e.g., a 'sneaky.slice' symlink pointing at /etc) are
    skipped at walk time with a warning log; no metrics emit from
    the escape target

Eight new tests cover the rejection and skip paths, including a
symlink-escape test that confirmed the vulnerability before the fix
landed.

The _is_within_cgroupfs helper uses os.path.realpath on both the
candidate root and the configured cgroupfs_path then verifies the
candidate is at or below the base. This catches any future symlink
games on intermediate path components, not just the final segment.

Total tests: 34 (was 26).
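
The two layers of defense can be sketched as a config-time check plus a walk-time containment test. Names and messages below are illustrative, and `ConfigurationError` is a local stand-in so the snippet runs standalone:

```python
import os


class ConfigurationError(Exception):
    """Local stand-in for the Agent's ConfigurationError."""


def validate_root_entry(entry):
    """Config layer: reject absolute paths and parent-directory references."""
    if os.path.isabs(entry):
        raise ConfigurationError("cgroup_roots entry must be relative: %r" % entry)
    if ".." in entry.split("/"):
        raise ConfigurationError("cgroup_roots entry must not contain '..': %r" % entry)
    return entry


def is_within_cgroupfs(cgroupfs_path, candidate):
    """Walk layer: resolve symlinks on both sides, then require candidate under base."""
    try:
        real_base = os.path.realpath(cgroupfs_path)
        real_candidate = os.path.realpath(candidate)
        return os.path.commonpath([real_base, real_candidate]) == real_base
    except (OSError, ValueError):
        return False
```

Comparing resolved paths with `os.path.commonpath` rather than a string prefix avoids treating a sibling such as `/sys/fs/cgroup2` as being inside `/sys/fs/cgroup`.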
Previously, when the cgroup_max_count cap was reached while walking
the first root in cgroup_roots, the inner for-loop broke but the
outer for-loop continued to the next root. That root would immediately
re-check the cap, break again, and log the cap-hit warning a second
time. With N roots configured the warning fired N times.

Fix: hoist the cap-hit state into a flag the outer loop checks. One
warning per check run regardless of how many roots are configured;
also avoids the wasted os.path.isdir / _is_within_cgroupfs work on
the unreached roots.

Test: test_cgroup_max_count_breaks_across_multiple_roots sets up two
roots with multiple cgroups each, max_count=1, and asserts the
warning is logged exactly once (the test failed with count 2 before
the fix).
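
The shape of the fix, reduced to its control flow. This sketch takes pre-walked cgroup lists per root for brevity; the function name and log message are hypothetical:

```python
import logging

log = logging.getLogger("linux_psi_sketch")


def collect_cgroups(cgroups_by_root, max_count):
    """Collect up to max_count cgroups across all roots, warning at most once."""
    collected = []
    cap_hit = False
    for root, cgroups in cgroups_by_root.items():
        if cap_hit:
            break  # outer loop honors the flag, so later roots are never touched
        for cgroup in cgroups:
            if len(collected) >= max_count:
                cap_hit = True
                break
            collected.append(cgroup)
    if cap_hit:
        log.warning("cgroup_max_count=%d reached; remaining cgroups skipped", max_count)
    return collected
```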
Datadog silently truncates tag values longer than 200 chars at the
backend. For deeply-nested k8s pod cgroup paths such as
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uuid>.slice/cri-containerd-<container-id>.scope
the path can easily exceed 200 chars and the resulting tag is either
truncated arbitrarily or rejected.

Truncate in-band with a visible '...truncated' sentinel so:
  - the truncation is reproducible and auditable in logs / dashboards
  - the sentinel makes it obvious which tags were affected so the
    operator can address it (e.g., lower cgroup_max_depth so paths
    stay shorter, or normalize the cgroup naming upstream)
  - we never emit a tag the backend would silently mangle
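
A sketch of in-band truncation with the sentinel. It assumes the 200-character budget applies to the whole `key:value` tag, which is one reading of the commit message; `build_cgroup_tag` is a hypothetical helper:

```python
MAX_TAG_LENGTH = 200
SENTINEL = "...truncated"


def build_cgroup_tag(cgroup_path, key="cgroup_path"):
    """Build a tag, cutting it to MAX_TAG_LENGTH with a visible sentinel if needed."""
    tag = f"{key}:{cgroup_path}"
    if len(tag) <= MAX_TAG_LENGTH:
        return tag
    return tag[: MAX_TAG_LENGTH - len(SENTINEL)] + SENTINEL


long_tag = build_cgroup_tag("kubepods.slice/" + "x" * 300)
print(len(long_tag), long_tag.endswith(SENTINEL))  # 200 True
```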

Test: builds a 200-character single-segment cgroup name (under the
255-char filename limit on most filesystems but enough to push the
full cgroup_path tag over 200 chars). Asserts the emitted tag is
<=200 chars and ends with the sentinel.
Four targeted tests for previously-uncovered defensive paths:

  - test_cgroupfs_path_missing_entirely_warns: cgroupfs_path set to a
    nonexistent directory yields a WARNING service check (distinct
    from the existing 'exists but not v2' case)
  - test_cgroup_root_in_config_does_not_exist_on_disk: configuring
    kubepods.slice on a host without it logs at debug and continues
    cleanly; service check stays OK because the rest of the feature
    is working
  - test_walker_skips_scandir_permission_error: when os.scandir
    raises PermissionError on a subdirectory, the walker skips it
    and continues emitting for sibling cgroups
  - test_emit_cgroup_handles_pressure_file_oserror: an EIO read on
    one cgroup's pressure file is logged and skipped; other cgroups
    and resources still emit

The remaining 7% of uncovered lines in check.py are:
  - the OSError handler for os.path.realpath in _is_within_cgroupfs
    (requires deeper filesystem mocking)
  - the real_path == real_base edge case (root cgroup itself)
  - default-parameter paths in _emit_line that are only reachable
    via private API calls the test suite has no reason to exercise

These are all defensive code paths with no realistic trigger; the
93% measured coverage represents effectively-complete coverage of
the code that will actually run in production.

Total tests: 40 (was 36).
Replaces the 7-widget 'metric inventory' v1 dashboard with 30 widgets
organized into 7 collapsible groups (items 2-8 below, preceded by an
About header), each group scoped to a different operator question.
The dashboard now reads top-to-bottom from urgency to investigation:

  1. About header - one paragraph orientation + legend
  2. Status (pink) - 4 KPI tiles with red/yellow/green conditional
     formatting answering 'is anything on fire?' in under 5 seconds:
     worst memory.full, worst cpu.some, worst io.some, composite score
  3. Live trends (blue) - stacked 3-resource chart answering 'where
     is the wedge?' plus per-resource panels with warning/critical
     y-markers and a composite-by-host trendline
  4. Fleet view (purple) - top-N hosts for CPU/memory/IO + cumulative
     stall (last 24h) leaderboard + memory pressure distribution
     histogram. The 'who's worst?' answer.
  5. Per-host investigation (green) - some-vs-full overlay per
     resource + stall-rate-per-second derived from total counter;
     uses the host template variable
  6. Per-cgroup view (yellow) - top cgroups by CPU/memory + per-
     cgroup timeseries; populated only when cgroup_roots is configured
     (clear note widget says so). The noisy-neighbor view.
  7. Capacity & trends (orange) - this-week-vs-last-week timeseries
     + pressure-budget leaderboard. The quarterly-review view.
  8. Reference (gray) - all-windows per-resource (avg10/60/300) for
     custom-query authoring. Not for at-a-glance use.

Five template variables (host, env, service, cgroup_path, cgroup_root)
so the same dashboard serves fleet-wide and per-host views without
duplication.

DASHBOARD_DESIGN.md updated to reflect the v1.1 work has shipped and
to lay out the v1.2 roadmap (pressure-vs-utilization scatter,
anomaly bands, SLO widget, deploy-event overlay, forecast).

The visual hierarchy is deliberate: on-call sees groups 1+2 without
scrolling and gets their answer; capacity planner expands group 7;
performance engineer expands groups 5+6 during regression
investigation. Nobody scrolls past noise they don't need.
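
The stall-rate-per-second panel derives a rate from the cumulative `total` counter, which the kernel reports in microseconds. The arithmetic can be sketched as below; `stall_fraction` is an illustrative helper, not a function from the integration or the dashboard:

```python
def stall_fraction(total_before_us, total_after_us, elapsed_seconds):
    """Fraction of wall-clock time spent stalled between two samples of `total`."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    stalled_us = total_after_us - total_before_us
    return (stalled_us / elapsed_seconds) / 1_000_000


# 500,000 us of stall accumulated over a 10 s interval is 5% of that interval.
print(stall_fraction(1_000_000, 1_500_000, 10.0))
```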
Two real bugs surfaced when the dashboard was imported and tested
against live data:

  - Cumulative stall toplist used 'sum:' across hosts of an as_rate()
    counter, which double-counts when there are multiple host
    series. Switched to 'avg:' (correct per-host rate) and dropped
    the microsecond unit format which was misleading on a derived
    per-second rate.

  - Distribution widget rendered 'No data' because it lacked the
    required 'response_format: scalar' and explicit aggregator on
    the query. Datadog requires both for distribution widgets even
    though most other widget types are forgiving about defaults.

Caught during the import-into-account validation step that the
DASHBOARD_DESIGN.md doc calls out as mandatory before merging any
dashboard JSON change. JSON parses; all 40 unit tests still pass.
Six PNGs captured from a live Datadog account after running the
20-minute stress workload across two real hosts (hpaserver and
voseghale-HP). Each image corresponds to one of the dashboard's
seven groups and is captioned with the operator question it answers:

  - dashboard_overview.png - hero shot with the four KPI tiles
    showing red/yellow/green conditional formatting in action
  - dashboard_fleet_view.png - multi-host top-lists plus the
    cumulative stall-time leaderboard (proves it works across
    a real fleet, not just one host)
  - dashboard_per_host_investigation.png - per-host pressure
    spikes from the stress run clearly visible as plateaus on
    each chart, with the derived stall-rate-per-second view
  - dashboard_per_cgroup.png - top systemd-service cgroups by
    CPU and memory pressure with real cgroup_path tag values
    (proves the cgroup feature works end-to-end)
  - dashboard_capacity_trends.png - this-week-vs-last-week
    timeseries with the pressure budget leaderboard
  - dashboard_reference.png - all-three-windows per resource
    plus a Datadog auto-detected anomaly marker

Registered in manifest.json tile.media[] so the Datadog tile catalog
listing carousels through them when a user views the integration.
@victoroseghale victoroseghale requested review from a team as code owners May 29, 2026 00:02
@datadog-official
Contributor

datadog-official Bot commented May 29, 2026

Pipelines

⚠️ Warnings

🚦 3 Pipeline jobs failed

PR | test / test (linux, ubuntu-22.04, linux_psi, LinuxPSI (py3.13), py3.13) / LinuxPSI (py3.13)-py3.13

AttributeError: 'LinuxPSICheck' object has no attribute '_resolve_resources' in datadog_checks/linux_psi/check.py:65

Validate repository | run / Validate

This job is unlikely to succeed on retry. Please review your pipeline configuration. CI configuration is not in sync; integration 'linux_psi' should be 'linuxpsi'.

PR | test / check

Commit SHA: c1127c2


Copilot AI left a comment


Pull request overview

This PR introduces a new linux_psi integration to collect Linux PSI (Pressure Stall Information) from /proc/pressure/* (host-level) and optionally per-cgroup PSI from cgroup v2 (*.pressure), and ships the accompanying assets (metadata, config spec, dashboard, monitor templates), tests, and CI/codecov wiring.

Changes:

  • Added LinuxPSICheck implementation to emit PSI gauges (avg10/60/300) and monotonic counts (total), plus kernel-version metadata and opt-in cgroup v2 walking.
  • Added unit + integration tests and fixture PSI files to validate parsing, error handling, config behavior, and cgroup enumeration/cardinality caps.
  • Added integration packaging + assets (README, config spec, service checks, dashboard, monitors), and registered the integration in CI, CODEOWNERS, and Codecov.

Reviewed changes

Copilot reviewed 38 out of 46 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| linux_psi/datadog_checks/linux_psi/check.py | Implements PSI collection (host + optional per-cgroup), parsing, service checks, kernel metadata, and cgroup traversal/cardinality logic. |
| linux_psi/datadog_checks/linux_psi/__init__.py | Exposes LinuxPSICheck and version. |
| linux_psi/datadog_checks/linux_psi/__about__.py | Defines integration version. |
| linux_psi/datadog_checks/linux_psi/data/conf.yaml.example | Documents instance configuration options (resources + cgroup options). |
| linux_psi/assets/configuration/spec.yaml | Declarative configuration spec used to generate config models and docs. |
| linux_psi/datadog_checks/linux_psi/config_models/__init__.py | Generated config model mixin wiring. |
| linux_psi/datadog_checks/linux_psi/config_models/instance.py | Generated Pydantic instance config model. |
| linux_psi/datadog_checks/linux_psi/config_models/shared.py | Generated Pydantic shared config model. |
| linux_psi/datadog_checks/linux_psi/config_models/defaults.py | Generated defaults for instance options. |
| linux_psi/datadog_checks/linux_psi/config_models/validators.py | Placeholder for custom validators/transformers. |
| linux_psi/tests/test_unit.py | Unit tests using fixture directories to validate parsing, error handling, and cgroup behavior without requiring Linux. |
| linux_psi/tests/test_integration.py | Integration test that reads real /proc/pressure/* when available on Linux. |
| linux_psi/tests/conftest.py | Test fixtures for shared instance config and dd_environment. |
| linux_psi/tests/common.py | Test helper utilities for fixture paths/reads. |
| linux_psi/tests/__init__.py | Marks test package. |
| linux_psi/tests/fixtures/pressure_cpu_normal | PSI fixture content for CPU (some/full). |
| linux_psi/tests/fixtures/pressure_cpu_no_full | PSI fixture content for CPU on kernels lacking full. |
| linux_psi/tests/fixtures/pressure_memory_normal | PSI fixture content for memory (normal). |
| linux_psi/tests/fixtures/pressure_memory_stressed | PSI fixture content for memory (stressed). |
| linux_psi/tests/fixtures/pressure_io_normal | PSI fixture content for IO (normal). |
| linux_psi/tests/fixtures/pressure_io_malformed | PSI fixture content to validate malformed parsing behavior. |
| linux_psi/README.md | User-facing integration documentation (setup, config, troubleshooting, compatibility). |
| linux_psi/metadata.csv | Metric metadata for host-level and cgroup-level PSI metrics. |
| linux_psi/manifest.json | Integration manifest including tile metadata and assets registration. |
| linux_psi/assets/service_checks.json | Service-check metadata for linux_psi.can_read and linux_psi.cgroup.can_read. |
| linux_psi/assets/dashboards/linux_psi_overview.json | Shipped dashboard for PSI operational views and drilldowns. |
| linux_psi/assets/monitors/cpu_pressure_some_high.json | Recommended monitor template for high CPU some pressure. |
| linux_psi/assets/monitors/cpu_pressure_full_critical.json | Recommended monitor template for severe CPU full pressure. |
| linux_psi/assets/monitors/memory_pressure_full_critical.json | Recommended monitor template for severe memory full pressure. |
| linux_psi/assets/monitors/io_pressure_some_high.json | Recommended monitor template for high I/O some pressure. |
| linux_psi/assets/monitors/io_pressure_full_critical.json | Recommended monitor template for severe I/O full pressure. |
| linux_psi/assets/logo.svg | Integration logo asset. |
| linux_psi/DASHBOARD_DESIGN.md | Design rationale and maintenance notes for the shipped dashboard JSON. |
| linux_psi/CHANGELOG.md | Integration changelog entries for releases/features. |
| linux_psi/pyproject.toml | Packaging metadata and dependency declarations. |
| linux_psi/hatch.toml | Hatch environment configuration for the integration. |
| .github/workflows/test-all.yml | Adds CI job to run linux_psi tests on Linux. |
| .github/CODEOWNERS | Adds CODEOWNERS entry for the new integration directory. |
| .codecov.yml | Adds Codecov flag + project status entries for linux_psi. |


Comment thread linux_psi/datadog_checks/linux_psi/check.py Outdated
Comment thread linux_psi/datadog_checks/linux_psi/check.py
Comment thread linux_psi/datadog_checks/linux_psi/check.py
Comment thread linux_psi/datadog_checks/linux_psi/check.py
Comment thread linux_psi/datadog_checks/linux_psi/check.py
Comment thread linux_psi/tests/conftest.py
Comment thread linux_psi/assets/dashboards/linux_psi_overview.json
Comment thread linux_psi/assets/dashboards/linux_psi_overview.json
victoroseghale and others added 2 commits May 29, 2026 04:32
Fix - Validate the type up-front and raise a clear ConfigurationError when it's not a list/tuple of strings.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


3 participants