Add linux_psi integration by victoroseghale · Pull Request #3020 · DataDog/integrations-extras

victoroseghale · 2026-05-29T00:02:42Z

What does this PR do?

Adds a new integration linux_psi to monitor Linux kernel PSI (Pressure Stall Information).

Motivation

PSI (kernel 4.20+) is the canonical signal for "how much time is this system stalling on contention" - far more useful than CPU% for diagnosing "everything is slow" incidents. The existing linux_proc_extras integration does not cover PSI. No other integration in core or extras does either.

Per-cgroup PSI extends the same signal to the noisy-neighbor question: which workload is generating pressure on this host? This is the answer container-heavy fleets need but cannot get from host-wide metrics alone.

Review checklist

PR has a meaningful title or PR has the no-changelog label attached
Feature or bugfix has tests
Git history is clean
If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo
If this PR includes a log pipeline, please add a description describing the remappers and processors.

Additional Information

Ships at v1.1.0 with:

Host-wide PSI from /proc/pressure/{cpu,memory,io} - 24 metrics covering the some/full x avg10/avg60/avg300/total matrix per resource. Honors the Agent's procfs_path config for containerized agents.
Opt-in per-cgroup PSI from /sys/fs/cgroup/.../{cpu,memory,io}.pressure - 24 additional metrics under the system.pressure.cgroup.* namespace, tagged with cgroup_path and cgroup_root so users can drill into k8s pods or systemd services. Requires cgroup v2.
Kernel version surfaced via set_metadata for fleet-wide audits ("which hosts can emit cpu.full?").
8 instance config options: tags, service, min_collection_interval, resources (subset of cpu/memory/io), cgroup_roots, cgroupfs_path, cgroup_max_depth, cgroup_max_count (cardinality caps).
2 service checks: linux_psi.can_read (host) and linux_psi.cgroup.can_read (cgroup, only when cgroup_roots is configured).
5 recommended monitors: high cpu_some, severe cpu_full, severe memory_full (paging-grade), high io_some, severe io_full.
30-widget operator-question-organized dashboard with 7 collapsible groups (status, live trends, fleet view, per-host investigation, per-cgroup view, capacity & trends, reference).
6 dashboard screenshots in tile.media[] captured from a real multi-host deployment showing the stress-workload phases clearly visible across each panel.

Adds the empty Python package skeleton for the new linux_psi check: pyproject.toml, hatch.toml, version file at 1.0.0, and the datadog_checks/linux_psi namespace stub. No check logic yet - that lands in the next commit.

Reads /proc/pressure/{cpu,memory,io} (Linux kernel 4.20+) and emits up to 24 metrics per host - some/full x avg10/avg60/avg300 (gauges) and total (monotonic_count) for each resource.

Handles three real-world conditions gracefully:
- PSI not enabled (kernel < 4.20 or no psi=1 boot param) -> WARNING service check, no metrics
- pre-5.13 kernels lack the 'full' line for cpu -> those 4 metrics simply do not emit
- permission denied -> CRITICAL service check with the offending path

Honors the Agent's procfs_path config so container deployments that mount the host /proc at /host/proc work without code changes.
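For reviewers unfamiliar with the file format, here is a minimal sketch of parsing one /proc/pressure file. It is not the code in check.py; the function names parse_psi_line and parse_psi_text are hypothetical, and the skip-on-malformed behavior is an assumption modeled on the description above.

```python
from typing import Dict, Optional, Tuple

GAUGE_KEYS = ("avg10", "avg60", "avg300")
KNOWN_KINDS = ("some", "full")


def parse_psi_line(line: str) -> Optional[Tuple[str, Dict[str, float]]]:
    """Parse one PSI line such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'.

    Returns (kind, fields), or None for blank lines and unknown kinds.
    Assumption: malformed or unknown fields are skipped rather than raised.
    """
    parts = line.split()
    if not parts or parts[0] not in KNOWN_KINDS:
        return None
    fields: Dict[str, float] = {}
    for token in parts[1:]:
        key, sep, raw = token.partition("=")
        if not sep:
            continue  # field with no equals sign
        if key not in GAUGE_KEYS and key != "total":
            continue  # unknown field name
        try:
            fields[key] = float(raw)
        except ValueError:
            continue  # value is not numeric
    return parts[0], fields


def parse_psi_text(text: str) -> Dict[str, Dict[str, float]]:
    """Parse a whole pressure file into {kind: {key: value}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        parsed = parse_psi_line(line)
        if parsed is not None:
            kind, fields = parsed
            result[kind] = fields
    return result
```

On a pre-5.13 kernel the cpu file has no 'full' line, so the returned dict simply has no "full" key and nothing is emitted for it.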

Seven unit tests exercising every code path against fixture /proc/pressure files: happy path, pre-5.13 kernel (no cpu.full line), missing pressure directory (kernel < 4.20), one resource file missing, permission denied, malformed input lines, and the monotonic_count typing of the total field. One integration test reads the real /proc/pressure/* on the host. Skipped on non-Linux and on Linux kernels without PSI enabled. Six fixture files cover the cases above including a deliberately malformed io fixture to verify parser resilience.

Tile manifest with Linux-only classifier and AI/ML / OS & System categories. Twenty-four metric rows in metadata.csv. One service check (linux_psi.can_read) with OK / WARNING / CRITICAL states documented. Overview dashboard has six widgets - per-resource pressure timeseries (cpu/memory/io) plus a top-list of hosts ranked by memory and io pressure.

Three recommended monitors:
- cpu_pressure_some_high - 5m avg300 > 50%, warning at 30%
- memory_pressure_full_critical - 1m avg10 > 10%, pages on critical
- io_pressure_some_high - 5m avg300 > 70%, warning at 40%

User-facing README with Overview, Setup (host and containerized), Configuration, Data Collected, Troubleshooting, and Support sections. Documents the kernel requirement (4.20+), the psi=1 boot parameter, and the procfs_path config for containerized agents. Initial 1.0.0 changelog entry summarizing what shipped.

Updates outside the integration directory required for the test runner and review routing to find linux_psi:
- .codecov.yml: new Linux_PSI project entry and linux_psi flag
- .github/CODEOWNERS: route /linux_psi/ to @voseghale plus @DataDog/ecosystems-review
- .github/workflows/test-all.yml: include the linux_psi check in the test-all job

These are the entries ddev validate ci --sync generated automatically.

Add four new unit tests covering branches that the initial suite did not exercise:
- test_procfs_path_override: confirms the check honors the Agent's procfs_path config for containerized deployments
- test_os_error_is_soft_failed: a generic OSError (EIO mid-read) on one file is logged but the other resources still emit and the service check stays OK
- test_all_files_missing_yields_warning: empty /proc/pressure directory yields WARNING, not OK
- test_multi_file_permission_denied_is_critical: when some files succeed and others permission-deny, the service check is CRITICAL and the message identifies the offending path

Extend the malformed io fixture to include a field with no equals sign and an unknown field name so the existing test_malformed_line_is_ignored exercises those branches.

Coverage on check.py: 84% -> 97%. The single remaining miss is a defensive return that is unreachable because an earlier blank-line check short-circuits the only path that could lead to it.

Soften the memory_full monitor so it does not page on transient pressure spikes that are normal during page-cache reclaim:
- threshold raised from 10% to 20%
- window broadened from avg10/last_1m to avg60/last_5m
- renotify cadence dropped from every 5m to every 30m

Add two missing 'full' (all-tasks-stalled) monitors. The previous set only covered 'some' for cpu and io, but 'full' is the more severe state and deserves its own alert with an appropriate threshold:
- cpu_pressure_full_critical: avg60/5m > 10% (kernel 5.13+ only)
- io_pressure_full_critical: avg60/5m > 15%

Strip the placeholder @ALL and @PagerDuty handles from the templates - notification routing belongs to the install-time configuration, not the shipped template. Add an explicit comment in each monitor's message explaining this.

Total recommended monitors: 5 (was 3).

Add a markdown note widget at the top of the dashboard explaining what 'some' vs 'full' mean, what the avg10/60/300 windows represent, and the kernel version requirements. Users opening the dashboard for the first time now understand the metrics without needing to click out to docs. Also wire the existing host template variable through to all six timeseries / toplist queries (was a hardcoded '*', now correctly filters by the selected host).

Add an explicit Compatibility section listing the kernel version requirements (4.20+ for the core integration, 5.13+ for the cpu.full metrics), the minimum Agent version (7.53), and the psi=1 boot parameter requirement. In the Support section, state the license (BSD-3-Clause matching the parent repo) and the contribution flow (issue first for non-trivial changes) so contributors and users know the maintenance model up front.

Add two new troubleshooting entries covering issues users hit most:
- 'yaml: cannot unmarshal !!map into string' when tags is written as 'tags: - env: prod' (a list of maps) instead of 'tags: - env:prod' (a list of strings). Shows the wrong vs correct shape and the Python one-liner for linting conf.yaml standalone before restarting the agent.
- 'check loads but no metrics in Datadog' explaining how to use 'datadog-agent check linux_psi' to confirm the integration is emitting locally and distinguish a Datadog-account/network issue from an integration issue.

Both write-ups come from real first-install experience. The tag-shape mistake in particular is the classic Datadog config bug every team hits exactly once; documenting it here saves the next user a restart-debug cycle.
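To make the tag-shape mistake concrete: YAML reads `- env: prod` (space after the colon) as a one-key map and `- env:prod` as a plain string. The sketch below is not the one-liner from the README; invalid_tags is a hypothetical helper that works on an already-parsed tags value and returns the entries that are not strings.

```python
from typing import Any, List


def invalid_tags(tags: Any) -> List[Any]:
    """Return the entries of a parsed 'tags' value that are not plain strings.

    Hypothetical helper for illustration; it expects the value as it comes
    out of a YAML parser, not the raw conf.yaml text.
    """
    if not isinstance(tags, list):
        return [tags]
    return [tag for tag in tags if not isinstance(tag, str)]


# 'tags: - env: prod' parses to a list holding a map: flagged.
wrong_shape = [{"env": "prod"}]
# 'tags: - env:prod' parses to a list holding a string: accepted.
right_shape = ["env:prod"]
```

An empty result means the shape is the list of strings the Agent expects.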

Adds an optional 'resources' list to the instance config. Users can restrict collection to a subset of cpu, memory, io to handle hosts where one resource's pressure is permanently elevated for legitimate reasons (e.g., a database host with sustained high I/O) or where a specific /proc/pressure file is masked by container security policy. Unknown resource names raise ConfigurationError at check startup with the offending values surfaced in the error message rather than degrading silently.

Why this is better than the existing metric_patterns workaround: metric_patterns suppresses the *metrics* but the check still opens and reads the file. That means the linux_psi.can_read service check still degrades if a 'noisy' resource has permission issues. The resources option skips the file entirely so the service check stays clean.

Three new unit tests:
- test_resources_config_filters_collection
- test_resources_config_rejects_unknown (ConfigurationError path)
- test_resources_config_preserves_order_and_dedups

Coverage now 98% on check.py (was 97%). Total tests: 15.
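The validation described above could look roughly like the following sketch. The name resolve_resources is hypothetical, the ConfigurationError class is a local stand-in so the snippet runs outside the Agent, and defaulting to all three resources when the option is unset is an assumption.

```python
from typing import List, Optional, Sequence

VALID_RESOURCES = ("cpu", "memory", "io")


class ConfigurationError(Exception):
    """Local stand-in for the Agent's ConfigurationError (assumption)."""


def resolve_resources(configured: Optional[Sequence[str]]) -> List[str]:
    """Validate the 'resources' option, dedup it, and keep the user's order."""
    if configured is None:
        # Assumption: an unset option means "collect everything".
        return list(VALID_RESOURCES)
    if not isinstance(configured, (list, tuple)):
        raise ConfigurationError("resources must be a list of strings")
    unknown = [name for name in configured if name not in VALID_RESOURCES]
    if unknown:
        raise ConfigurationError(
            f"unknown resources {unknown!r}; valid values are {list(VALID_RESOURCES)!r}"
        )
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(configured))
```

Failing at startup, rather than ignoring the bad name, keeps a typo such as "disk" from silently disabling collection.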

Read /proc/sys/kernel/osrelease, parse the major.minor.patch prefix with a defensive regex, and submit it via set_metadata with the semver scheme. The kernel version then appears in the tile's Integration metadata column, making fleet-wide audits possible:
- 'how many hosts are on a kernel that lacks cpu.full PSI (< 5.13)?'
- 'are all my hosts on a kernel that supports the psi=1 boot param (4.20+)?'

Refactor the procfs path resolution: _set_paths now stores self._proc_root, which both pressure_dir and the kernel-version metadata read use. This means the existing procfs_path config (for containerized agents that mount /proc at /host/proc) automatically covers the kernel version read as well.

Six new test cases:
- test_kernel_version_metadata_parses: parametrized over four distro-specific osrelease strings (Ubuntu, plain, sticky-patch with +, Debian)
- test_kernel_version_metadata_missing_file_is_silent: graceful no-op when osrelease cannot be opened
- test_kernel_version_metadata_garbled_is_silent: graceful no-op on unparseable osrelease content

Total tests: 21 (was 15).
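A minimal sketch of the osrelease parsing, for illustration only. The regex and the names parse_kernel_version and supports_cpu_full are hypothetical and not taken from check.py; treating a missing patch component as 0 is an assumption.

```python
import re
from typing import Optional, Tuple

# Leading major.minor with an optional .patch; distro suffixes are ignored.
_KERNEL_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_kernel_version(osrelease: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch), or None when the string is unparseable."""
    match = _KERNEL_RE.match(osrelease.strip())
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def supports_cpu_full(version: Tuple[int, int, int]) -> bool:
    """The 'full' line for cpu is described above as requiring kernel 5.13+."""
    return version >= (5, 13, 0)
```

Returning None instead of raising matches the "graceful no-op" behavior the tests above describe for garbled input.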

Three vertical bars in rising height (green / amber / red) representing the three PSI resources CPU / memory / I/O at increasing pressure levels. Communicates the integration's purpose at a glance in the tile catalog and stays recognizable at small sizes. No gradients, no text, no external font dependencies. Works on both light and dark tile backgrounds. ~25 lines of SVG total.

The v1 dashboard is the floor - every metric has a place, every resource is represented. The dashboard that actually answers operational questions is one design pass beyond that, and writing the design down now means whoever picks this up later (including the original author six months from now) does not have to re-derive it.

DASHBOARD_DESIGN.md covers:
- Design philosophy: organize panels by the question an operator wants answered, not by the metric being shown
- Six question-driven panels for v1.1 (stacked three-resource view, severe-contention indicator, pressure-budget heatmap, pressure vs utilization quadrant, week-over-week delta with deploy overlay, pressure SLO board)
- The composite pressure score (computed at query time, no extra storage cost, single highest-density visualization)
- Candidates I considered and ruled out, with the reasoning
- Recommended shipping order so each panel goes out independently
- Validation checklist before merging any panel into the main JSON
- Links to the relevant Datadog function and widget docs

This is the kind of doc that turns 'whatever the original author intended is lost' into 'here is exactly why we shipped what we shipped and what would come next.'

Add opt-in collection of per-cgroup pressure stall data from /sys/fs/cgroup/.../<resource>.pressure. Disabled by default; enable by setting the cgroup_roots config option to a list of slices to walk (e.g., system.slice, kubepods.slice).

Per-cgroup metrics live in their own namespace 'system.pressure.cgroup.<resource>.<kind>.<key>' so they do not co-mingle with host-level 'system.pressure.<resource>.<kind>.<key>' in aggregate queries (avg:metric{*} would otherwise sum unrelated values). Each per-cgroup metric carries cgroup_path and cgroup_root tags so users can drill in by namespace/service/pod, and so the Agent tagger can enrich them downstream with k8s metadata when present.

Cardinality is bounded by three caps:
- cgroup_max_depth (default 2) limits subdirectory recursion
- cgroup_max_count (default 200) limits total cgroups per check run
- cgroup_roots is opt-in so users only walk what they want

cgroup PSI requires cgroup v2 (the unified hierarchy). The check detects v1 hosts by the absence of cgroup.controllers at the cgroupfs root and emits a WARNING service check (linux_psi.cgroup.can_read) with a clear message instead of silently failing.

Refactor _emit_line and _read_one to take namespace + tags parameters so they can serve both the host path (unchanged metric names) and the new cgroup path (system.pressure.cgroup.* with cgroup_path tags). Existing 21 tests pass unchanged.
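The v2 detection and the depth/count caps can be sketched as below. This is an illustration under stated assumptions, not the walker in check.py: is_cgroup_v2, walk_cgroups and demo_walk are hypothetical names, the root is counted as depth 0, and the root itself counts toward cgroup_max_count.

```python
import os
import tempfile
from typing import List


def is_cgroup_v2(cgroupfs_path: str) -> bool:
    """cgroup v2 exposes a cgroup.controllers file at the hierarchy root."""
    return os.path.isfile(os.path.join(cgroupfs_path, "cgroup.controllers"))


def walk_cgroups(root: str, max_depth: int, max_count: int) -> List[str]:
    """Depth-first walk bounded by recursion depth and total cgroup count."""
    found: List[str] = []
    stack = [(root, 0)]
    while stack and len(found) < max_count:
        path, depth = stack.pop()
        found.append(path)
        if depth >= max_depth:
            continue  # do not descend below the depth cap
        try:
            with os.scandir(path) as entries:
                children = sorted(
                    e.path for e in entries if e.is_dir(follow_symlinks=False)
                )
        except OSError:
            continue  # unreadable directory: skip it, keep walking siblings
        # Push in reverse so children are visited in sorted order.
        stack.extend((child, depth + 1) for child in reversed(children))
    return found


def demo_walk() -> List[str]:
    """Walk a throwaway systemd-style tree with max_depth=1."""
    with tempfile.TemporaryDirectory() as tmp:
        for rel in ("system.slice/a.service/deep", "system.slice/b.service"):
            os.makedirs(os.path.join(tmp, rel))
        found = walk_cgroups(os.path.join(tmp, "system.slice"), 1, 200)
        return [os.path.relpath(p, tmp).replace(os.sep, "/") for p in found]
```

With max_depth=1 the demo visits system.slice and its two services but never the nested "deep" directory.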

…ice check Adds metadata.csv rows for the 24 system.pressure.cgroup.* metrics emitted by the new cgroup PSI collection path. Each row notes that the metric is per-cgroup and lists cgroup_path as a sample tag so the Datadog Metrics Summary UI shows it in the tag dimension column. Adds the linux_psi.cgroup.can_read service check entry with both states documented (OK when cgroup v2 hierarchy is readable, WARNING when the host is on cgroup v1 or the cgroupfs path is missing). Total metrics: 48 (was 24). Total service checks: 2 (was 1).
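As a quick sanity check on the 24 + 24 = 48 arithmetic, the metric matrix can be enumerated directly. The name templates below follow the 'system.pressure.<resource>.<kind>.<key>' pattern quoted in the commit messages; a later commit in this PR renames the host namespace, so treat the exact prefixes as an assumption.

```python
from itertools import product
from typing import List

RESOURCES = ("cpu", "memory", "io")
KINDS = ("some", "full")
KEYS = ("avg10", "avg60", "avg300", "total")


def metric_names(namespace: str) -> List[str]:
    """Enumerate <namespace>.<resource>.<kind>.<key> for the full matrix."""
    return [
        f"{namespace}.{resource}.{kind}.{key}"
        for resource, kind, key in product(RESOURCES, KINDS, KEYS)
    ]


host_metrics = metric_names("system.pressure")
cgroup_metrics = metric_names("system.pressure.cgroup")
```

Each list has 3 x 2 x 4 = 24 names; on kernels older than 5.13 the four cpu 'full' entries never emit even though they appear in metadata.csv.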

Spec adds four instance options for the cgroup feature:
- cgroup_roots: opt-in list of slices to walk (default: empty, feature disabled). When set, must be a YAML list.
- cgroupfs_path: cgroup filesystem root (default /sys/fs/cgroup)
- cgroup_max_depth: subdirectory recursion limit (default 2)
- cgroup_max_count: cardinality cap per check run (default 200)

Regenerated config_models and conf.yaml.example from the updated spec. Documented each option with a paragraph explaining when to tune it.

Five new test cases:
- test_cgroup_disabled_by_default: no cgroup walking when cgroup_roots is empty, no cgroup service check emitted
- test_cgroup_collects_metrics: full walk of a fake systemd-style slice (system.slice + 2 services), verifies the per-cgroup metrics are emitted with the correct cgroup_path and cgroup_root tags
- test_cgroup_v1_host_warns: a cgroupfs path that lacks the cgroup.controllers marker (i.e., cgroup v1) yields a clear WARNING service check, host-level metrics keep flowing
- test_cgroup_max_count_caps_cardinality: 5 cgroups under a max_count of 2 emits at most 2 cgroup paths worth of metrics
- test_cgroup_roots_rejects_non_list: a scalar value for cgroup_roots raises ConfigurationError at check startup

Total tests: 26 (was 21).

README updates:
- Configuration table gains four new options (cgroup_roots, cgroupfs_path, cgroup_max_depth, cgroup_max_count) with the when-to-tune-them rationale documented inline
- Data Collected section explains the new system.pressure.cgroup.* namespace and the cgroup_path / cgroup_root tags
- Service Checks table lists linux_psi.cgroup.can_read
- Compatibility matrix gains the cgroup v2 (kernel 5.2+) row
- Two new troubleshooting entries for the cgroup-v2-not-detected case and the cgroup_max_count warning

DASHBOARD_DESIGN updated to note that per-cgroup PSI is now available (it was previously listed as deferred). A v1.2 dashboard pass should add the per-cgroup breakdowns, with the canonical 'which workload is stressing this host' panel (top-list of cgroups by cgroup.cpu.some.avg300) called out as the highest-value addition.

Version bump 1.0.0 -> 1.1.0 per SemVer (new opt-in feature, backwards-compatible). CHANGELOG entry documents the added and changed bits.

A misconfigured conf.yaml could previously make the check walk and emit metrics for directories outside the cgroupfs root. Three attack vectors are now closed at the config and walk layers:
- cgroup_roots entries containing parent-directory references (..) are rejected at check startup with a ConfigurationError
- cgroup_roots entries that are absolute paths are rejected at check startup
- cgroup_roots entries that resolve outside cgroupfs_path via a symlink (e.g., a 'sneaky.slice' symlink pointing at /etc) are skipped at walk time with a warning log; no metrics emit from the escape target

Eight new tests cover the rejection and skip paths, including a symlink-escape test that confirmed the vulnerability before the fix landed.

The _is_within_cgroupfs helper uses os.path.realpath on both the candidate root and the configured cgroupfs_path, then verifies the candidate is at or below the base. This catches any future symlink games on intermediate path components, not just the final segment.

Total tests: 34 (was 26).
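The containment check can be sketched as follows. This is not the body of _is_within_cgroupfs; validate_root_entry, is_within and demo_symlink_escape are hypothetical names, ValueError stands in for ConfigurationError, and comparing with os.path.commonpath is one way (an assumption) to express "at or below the base".

```python
import os
import tempfile
from typing import Tuple


def validate_root_entry(entry: str) -> None:
    """Reject entries that escape the cgroupfs root by construction."""
    if os.path.isabs(entry):
        raise ValueError(f"cgroup_roots entry must be relative: {entry!r}")
    if ".." in entry.replace("\\", "/").split("/"):
        raise ValueError(f"cgroup_roots entry must not contain '..': {entry!r}")


def is_within(base: str, candidate: str) -> bool:
    """True when candidate, with symlinks resolved, is at or below base."""
    try:
        real_base = os.path.realpath(base)
        real_path = os.path.realpath(candidate)
        return os.path.commonpath([real_base, real_path]) == real_base
    except (OSError, ValueError):
        return False  # unresolvable paths are treated as outside


def demo_symlink_escape() -> Tuple[bool, bool]:
    """One honest slice and one symlink pointing outside the base."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "cgroup")
        outside = os.path.join(tmp, "etc")
        os.makedirs(os.path.join(base, "system.slice"))
        os.makedirs(outside)
        os.symlink(outside, os.path.join(base, "sneaky.slice"))
        return (
            is_within(base, os.path.join(base, "system.slice")),
            is_within(base, os.path.join(base, "sneaky.slice")),
        )
```

Resolving both sides before comparing is what makes the check robust to a symlinked cgroupfs_path as well as a symlinked root.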

Previously, when the cgroup_max_count cap was reached while walking the first root in cgroup_roots, the inner for-loop broke but the outer for-loop continued to the next root. That root would immediately re-check the cap, break again, and log the cap-hit warning a second time. With N roots configured the warning fired N times.

Fix: hoist the cap-hit state into a flag the outer loop checks. One warning per check run regardless of how many roots are configured; also avoids the wasted os.path.isdir / _is_within_cgroupfs work on the unreached roots.

Test: test_cgroup_max_count_breaks_across_multiple_roots sets up two roots with multiple cgroups each, max_count=1, and asserts the warning is logged exactly once (the test failed with count 2 before the fix).
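The shape of the fix, reduced to its loop structure. This is an illustrative sketch, not the patched code: collect_capped is a hypothetical function, the roots are passed as an in-memory dict instead of being walked on disk, and the warn callable stands in for the check's logger.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("linux_psi_sketch")


def collect_capped(
    roots: Dict[str, List[str]],
    max_count: int,
    warn: Callable[[str], None] = log.warning,
) -> List[str]:
    """Collect cgroup paths across roots, warning once when the cap is hit."""
    collected: List[str] = []
    cap_hit = False
    for root, cgroups in roots.items():
        if cap_hit:
            break  # the flag stops the outer loop, so later roots are untouched
        for cgroup in cgroups:
            if len(collected) >= max_count:
                warn(f"cgroup_max_count ({max_count}) reached; skipping the rest")
                cap_hit = True
                break
            collected.append(f"{root}/{cgroup}")
    return collected
```

Without the flag, each remaining root would re-enter the inner loop, hit the same length check, and emit the same warning again.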

Datadog truncates tag values longer than 200 chars at the backend silently. For deeply-nested k8s pod cgroup paths (kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uuid>.slice/cri-containerd-<container-id>.scope) the path can easily exceed 200 chars and the resulting tag becomes either truncated arbitrarily or rejected.

Truncate in-band with a visible '...truncated' sentinel so:
- the truncation is reproducible and auditable in logs / dashboards
- the sentinel makes it obvious which tags were affected so the operator can address it (e.g., increase cgroup_max_depth scope or normalize the cgroup naming upstream)
- we never emit a tag the backend would silently mangle

Test: builds a 200-character single-segment cgroup name (under the 255-char filename limit on most filesystems but enough to push the full cgroup_path tag over 200 chars). Asserts the emitted tag is <=200 chars and ends with the sentinel.
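In-band truncation with a sentinel is a few lines. The sketch below is illustrative; truncate_tag_value is a hypothetical name, and applying the 200-character limit to the tag value (rather than to the whole "cgroup_path:<value>" string) is an assumption.

```python
MAX_TAG_VALUE_LEN = 200
SENTINEL = "...truncated"


def truncate_tag_value(
    value: str, limit: int = MAX_TAG_VALUE_LEN, sentinel: str = SENTINEL
) -> str:
    """Cut an over-long value so the result, sentinel included, fits in limit."""
    if len(value) <= limit:
        return value
    # Reserve room for the sentinel so the final length is exactly `limit`.
    return value[: limit - len(sentinel)] + sentinel


long_path = "kubepods.slice/" + "x" * 200
short_path = "system.slice/sshd.service"
```

Because the cut point depends only on the input, the same cgroup path always yields the same tag, which keeps the series stable across check runs.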

Four targeted tests for previously-uncovered defensive paths:
- test_cgroupfs_path_missing_entirely_warns: cgroupfs_path set to a nonexistent directory yields a WARNING service check (distinct from the existing 'exists but not v2' case)
- test_cgroup_root_in_config_does_not_exist_on_disk: configuring kubepods.slice on a host without it logs at debug and continues cleanly; service check stays OK because the rest of the feature is working
- test_walker_skips_scandir_permission_error: when os.scandir raises PermissionError on a subdirectory, the walker skips it and continues emitting for sibling cgroups
- test_emit_cgroup_handles_pressure_file_oserror: an EIO read on one cgroup's pressure file is logged and skipped; other cgroups and resources still emit

The remaining 7% of uncovered lines in check.py are:
- the OSError handler for os.path.realpath in _is_within_cgroupfs (requires deeper filesystem mocking)
- the real_path == real_base edge case (root cgroup itself)
- default-parameter paths in _emit_line that are only reachable via private API calls the test suite has no reason to exercise

These are all defensive code paths with no realistic trigger; the 93% measured coverage represents effectively-complete coverage of the code that will actually run in production.

Total tests: 40 (was 36).

Replaces the 7-widget 'metric inventory' v1 dashboard with 30 widgets organized into 7 collapsible groups, each scoped to a different operator question. The dashboard now reads top-to-bottom from urgency to investigation:
1. About header - one paragraph orientation + legend
2. Status (pink) - 4 KPI tiles with red/yellow/green conditional formatting answering 'is anything on fire?' in under 5 seconds: worst memory.full, worst cpu.some, worst io.some, composite score
3. Live trends (blue) - stacked 3-resource chart answering 'where is the wedge?' plus per-resource panels with warning/critical y-markers and a composite-by-host trendline
4. Fleet view (purple) - top-N hosts for CPU/memory/IO + cumulative stall (last 24h) leaderboard + memory pressure distribution histogram. The 'who's worst?' answer.
5. Per-host investigation (green) - some-vs-full overlay per resource + stall-rate-per-second derived from total counter; uses the host template variable
6. Per-cgroup view (yellow) - top cgroups by CPU/memory + per-cgroup timeseries; populated only when cgroup_roots is configured (clear note widget says so). The noisy-neighbor view.
7. Capacity & trends (orange) - this-week-vs-last-week timeseries + pressure-budget leaderboard. The quarterly-review view.
8. Reference (gray) - all-windows per-resource (avg10/60/300) for custom-query authoring. Not for at-a-glance use.

Five template variables (host, env, service, cgroup_path, cgroup_root) so the same dashboard serves fleet-wide and per-host views without duplication.

DASHBOARD_DESIGN.md updated to reflect the v1.1 work has shipped and to lay out the v1.2 roadmap (pressure-vs-utilization scatter, anomaly bands, SLO widget, deploy-event overlay, forecast).

The visual hierarchy is deliberate: on-call sees groups 1+2 without scrolling and gets their answer; capacity planner expands group 7; performance engineer expands groups 5+6 during regression investigation. Nobody scrolls past noise they don't need.

Two real bugs surfaced when the dashboard was imported and tested against live data:
- Cumulative stall toplist used 'sum:' across hosts of an as_rate() counter, which double-counts when there are multiple host series. Switched to 'avg:' (correct per-host rate) and dropped the microsecond unit format which was misleading on a derived per-second rate.
- Distribution widget rendered 'No data' because it lacked the required 'response_format: scalar' and explicit aggregator on the query. Datadog requires both for distribution widgets even though most other widget types are forgiving about defaults.

Caught during the import-into-account validation step that the DASHBOARD_DESIGN.md doc calls out as mandatory before merging any dashboard JSON change. JSON parses; all 40 unit tests still pass.

Six PNGs captured from a live Datadog account after running the 20-minute stress workload across two real hosts (hpaserver and voseghale-HP). Each image corresponds to one of the dashboard's seven groups and is captioned with the operator question it answers:
- dashboard_overview.png - hero shot with the four KPI tiles showing red/yellow/green conditional formatting in action
- dashboard_fleet_view.png - multi-host top-lists plus the cumulative stall-time leaderboard (proves it works across a real fleet, not just one host)
- dashboard_per_host_investigation.png - per-host pressure spikes from the stress run clearly visible as plateaus on each chart, with the derived stall-rate-per-second view
- dashboard_per_cgroup.png - top systemd-service cgroups by CPU and memory pressure with real cgroup_path tag values (proves the cgroup feature works end-to-end)
- dashboard_capacity_trends.png - this-week-vs-last-week timeseries with the pressure budget leaderboard
- dashboard_reference.png - all-three-windows per resource plus a Datadog auto-detected anomaly marker

Registered in manifest.json tile.media[] so the Datadog tile catalog listing carousels through them when a user views the integration.

datadog-official · 2026-05-29T00:06:24Z

⚠️ Warnings

🚦 3 Pipeline jobs failed

PR | test / test (linux, ubuntu-22.04, linux_psi, LinuxPSI (py3.13), py3.13) / LinuxPSI (py3.13)-py3.13

AttributeError: 'LinuxPSICheck' object has no attribute '_resolve_resources' in datadog_checks/linux_psi/check.py:65

Validate repository | run / Validate

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
CI configuration is not in sync; integration 'linux_psi' should be 'linuxpsi'.

PR | test / check

Commit SHA: c1127c2

Copilot

Pull request overview

This PR introduces a new linux_psi integration to collect Linux PSI (Pressure Stall Information) from /proc/pressure/* (host-level) and optionally per-cgroup PSI from cgroup v2 (*.pressure), and ships the accompanying assets (metadata, config spec, dashboard, monitor templates), tests, and CI/codecov wiring.

Changes:

Added LinuxPSICheck implementation to emit PSI gauges (avg10/60/300) and monotonic counts (total), plus kernel-version metadata and opt-in cgroup v2 walking.
Added unit + integration tests and fixture PSI files to validate parsing, error handling, config behavior, and cgroup enumeration/cardinality caps.
Added integration packaging + assets (README, config spec, service checks, dashboard, monitors), and registered the integration in CI, CODEOWNERS, and Codecov.

Reviewed changes

Copilot reviewed 38 out of 46 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
linux_psi/datadog_checks/linux_psi/check.py	Implements PSI collection (host + optional per-cgroup), parsing, service checks, kernel metadata, and cgroup traversal/cardinality logic.
linux_psi/datadog_checks/linux_psi/__init__.py	Exposes `LinuxPSICheck` and version.
linux_psi/datadog_checks/linux_psi/__about__.py	Defines integration version.
linux_psi/datadog_checks/linux_psi/data/conf.yaml.example	Documents instance configuration options (resources + cgroup options).
linux_psi/assets/configuration/spec.yaml	Declarative configuration spec used to generate config models and docs.
linux_psi/datadog_checks/linux_psi/config_models/__init__.py	Generated config model mixin wiring.
linux_psi/datadog_checks/linux_psi/config_models/instance.py	Generated Pydantic instance config model.
linux_psi/datadog_checks/linux_psi/config_models/shared.py	Generated Pydantic shared config model.
linux_psi/datadog_checks/linux_psi/config_models/defaults.py	Generated defaults for instance options.
linux_psi/datadog_checks/linux_psi/config_models/validators.py	Placeholder for custom validators/transformers.
linux_psi/tests/test_unit.py	Unit tests using fixture directories to validate parsing, error handling, and cgroup behavior without requiring Linux.
linux_psi/tests/test_integration.py	Integration test that reads real `/proc/pressure/*` when available on Linux.
linux_psi/tests/conftest.py	Test fixtures for shared instance config and dd_environment.
linux_psi/tests/common.py	Test helper utilities for fixture paths/reads.
linux_psi/tests/__init__.py	Marks test package.
linux_psi/tests/fixtures/pressure_cpu_normal	PSI fixture content for CPU (some/full).
linux_psi/tests/fixtures/pressure_cpu_no_full	PSI fixture content for CPU on kernels lacking `full`.
linux_psi/tests/fixtures/pressure_memory_normal	PSI fixture content for memory (normal).
linux_psi/tests/fixtures/pressure_memory_stressed	PSI fixture content for memory (stressed).
linux_psi/tests/fixtures/pressure_io_normal	PSI fixture content for IO (normal).
linux_psi/tests/fixtures/pressure_io_malformed	PSI fixture content to validate malformed parsing behavior.
linux_psi/README.md	User-facing integration documentation (setup, config, troubleshooting, compatibility).
linux_psi/metadata.csv	Metric metadata for host-level and cgroup-level PSI metrics.
linux_psi/manifest.json	Integration manifest including tile metadata and assets registration.
linux_psi/assets/service_checks.json	Service-check metadata for `linux_psi.can_read` and `linux_psi.cgroup.can_read`.
linux_psi/assets/dashboards/linux_psi_overview.json	Shipped dashboard for PSI operational views and drilldowns.
linux_psi/assets/monitors/cpu_pressure_some_high.json	Recommended monitor template for high CPU `some` pressure.
linux_psi/assets/monitors/cpu_pressure_full_critical.json	Recommended monitor template for severe CPU `full` pressure.
linux_psi/assets/monitors/memory_pressure_full_critical.json	Recommended monitor template for severe memory `full` pressure.
linux_psi/assets/monitors/io_pressure_some_high.json	Recommended monitor template for high I/O `some` pressure.
linux_psi/assets/monitors/io_pressure_full_critical.json	Recommended monitor template for severe I/O `full` pressure.
linux_psi/assets/logo.svg	Integration logo asset.
linux_psi/DASHBOARD_DESIGN.md	Design rationale and maintenance notes for the shipped dashboard JSON.
linux_psi/CHANGELOG.md	Integration changelog entries for releases/features.
linux_psi/pyproject.toml	Packaging metadata and dependency declarations.
linux_psi/hatch.toml	Hatch environment configuration for the integration.
.github/workflows/test-all.yml	Adds CI job to run linux_psi tests on Linux.
.github/CODEOWNERS	Adds CODEOWNERS entry for the new integration directory.
.codecov.yml	Adds Codecov flag + project status entries for linux_psi.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

victoralfred and others added 30 commits May 28, 2026 14:37

linux_psi: scaffold integration package

893a67c

Adds the empty Python package skeleton for the new linux_psi check: pyproject.toml, hatch.toml, version file at 1.0.0, and the datadog_checks/linux_psi namespace stub. No check logic yet - that lands in the next commit.

Merge branch 'DataDog:master' into master

464a79b

Fix email and owner details

93a8d1e

Merge branch 'DataDog:master' into master

c1dca30

Fix metric naming namespace

1ed444c

victoroseghale requested review from a team as code owners May 29, 2026 00:02

victoroseghale added 7 commits May 29, 2026 02:16

Fix - change the HOST_NAMESPACE namespace from system.pressure to psi…

ee86cc7

….system.pressure

Fix broken run job 26610089690 originating from unused host level sc_…

ad7fa17

…names test variable.

Fix pytest ScopeMisMatch by lifting instance fixture to session scope…

7ac5ca6

…. It's safe here because test mutates instance in place; the tests that customize it used dict-spread {**instance, ...}, which produces a fresh dict

Fix code owner validation failure

b66a1b6

Fix app_id and owner fields in manifest.json; update source_type_name…

96cc82a

… and author details

Fix broken manifest json format

5f09dbf

Fix manifest app_id

0874249

victoroseghale requested a review from Copilot May 29, 2026 02:25

Copilot started reviewing on behalf of victoroseghale May 29, 2026 02:25

Copilot AI reviewed May 29, 2026


victoroseghale and others added 2 commits May 29, 2026 04:32

Potential fix for pull request finding

a5fbe17

Fix - Validate the type up-front and raise a clear ConfigurationError when it's not a list/tuple of strings. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix format error

c1127c2

victoroseghale requested a review from Copilot May 29, 2026 02:37

Copilot started reviewing on behalf of victoroseghale May 29, 2026 02:37

victoroseghale closed this May 29, 2026

Copilot AI reviewed May 29, 2026


victoroseghale wants to merge 39 commits into DataDog:master from victoroseghale:master

3 participants
