Add linux_psi integration #3020
Closed
victoroseghale wants to merge 39 commits into
Adds the empty Python package skeleton for the new linux_psi check: pyproject.toml, hatch.toml, version file at 1.0.0, and the datadog_checks/linux_psi namespace stub. No check logic yet - that lands in the next commit.
Reads /proc/pressure/{cpu,memory,io} (Linux kernel 4.20+) and emits up to
24 metrics per host - some/full x avg10/avg60/avg300 (gauges) and total
(monotonic_count) for each resource.
Handles three real-world conditions gracefully:
- PSI not enabled (kernel < 4.20 or no psi=1 boot param) -> WARNING
service check, no metrics
- pre-5.13 kernels lack the 'full' line for cpu -> those 4 metrics
simply do not emit
- permission denied -> CRITICAL service check with the offending path
Honors the Agent's procfs_path config so container deployments that
mount the host /proc at /host/proc work without code changes.
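The parsing described above can be sketched roughly as follows. This is a minimal illustration, not the check's actual code: the function name parse_psi and the returned dict shape are assumptions made for the example. It relies only on the documented PSI line format (a some or full prefix followed by avg10=, avg60=, avg300= and total= fields) and skips anything it does not recognise.

```python
from typing import Dict, Tuple

KINDS = ("some", "full")
KNOWN_KEYS = ("avg10", "avg60", "avg300", "total")


def parse_psi(text: str) -> Dict[Tuple[str, str], float]:
    """Parse the contents of one /proc/pressure/<resource> file.

    Hypothetical helper: returns {(kind, key): value} and silently skips
    lines or fields that do not match the expected shape.
    """
    values: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in KINDS:
            continue  # blank line or unrecognised line type
        kind = parts[0]
        for field in parts[1:]:
            key, sep, raw = field.partition("=")
            if not sep or key not in KNOWN_KEYS:
                continue  # field without '=' or unknown field name
            try:
                values[(kind, key)] = float(raw)
            except ValueError:
                continue  # non-numeric value
    return values


sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=789\n"
)
parsed = parse_psi(sample)
```

On a pre-5.13 cpu file the full line is absent, so the four ("full", ...) entries are never added and nothing is emitted for them.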
Seven unit tests exercise every code path against fixture /proc/pressure files: happy path, pre-5.13 kernel (no cpu.full line), missing pressure directory (kernel < 4.20), one resource file missing, permission denied, malformed input lines, and the monotonic_count typing of the total field. One integration test reads the real /proc/pressure/* on the host; it is skipped on non-Linux and on Linux kernels without PSI enabled. Six fixture files cover the cases above, including a deliberately malformed io fixture to verify parser resilience.
Tile manifest with Linux-only classifier and AI/ML / OS & System categories. Twenty-four metric rows in metadata.csv. One service check (linux_psi.can_read) with OK / WARNING / CRITICAL states documented. Overview dashboard has six widgets - per-resource pressure timeseries (cpu/memory/io) plus a top-list of hosts ranked by memory and io pressure.
Three recommended monitors:
- cpu_pressure_some_high - 5m avg300 > 50%, warning at 30%
- memory_pressure_full_critical - 1m avg10 > 10%, pages on critical
- io_pressure_some_high - 5m avg300 > 70%, warning at 40%
User-facing README with Overview, Setup (host and containerized), Configuration, Data Collected, Troubleshooting, and Support sections. Documents the kernel requirement (4.20+), the psi=1 boot parameter, and the procfs_path config for containerized agents. Initial 1.0.0 changelog entry summarizing what shipped.
Updates outside the integration directory required for the test
runner and review routing to find linux_psi:
- .codecov.yml: new Linux_PSI project entry and linux_psi flag
- .github/CODEOWNERS: route /linux_psi/ to @voseghale plus
@DataDog/ecosystems-review
- .github/workflows/test-all.yml: include the linux_psi check in
the test-all job
These are the entries ddev validate ci --sync generated automatically.
Add four new unit tests covering branches that the initial suite did not
exercise:
- test_procfs_path_override: confirms the check honors the Agent's
procfs_path config for containerized deployments
- test_os_error_is_soft_failed: a generic OSError (EIO mid-read) on one
file is logged but the other resources still emit and the service
check stays OK
- test_all_files_missing_yields_warning: empty /proc/pressure directory
yields WARNING, not OK
- test_multi_file_permission_denied_is_critical: when some files succeed
and others permission-deny, the service check is CRITICAL and the
message identifies the offending path
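The service-check behaviour these tests pin down can be summarised in a small decision function. This is a sketch under stated assumptions: the Status enum and the outcome labels ("ok", "missing", "denied", "oserror") are hypothetical names for illustration, and cases the commit messages do not describe (for example, every file failing with a generic OSError) are resolved arbitrarily here.

```python
from enum import IntEnum
from typing import Iterable


class Status(IntEnum):
    """Hypothetical stand-in for the Agent's service check states."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def overall_status(outcomes: Iterable[str]) -> Status:
    """Fold per-file read outcomes into one service check status."""
    outcomes = list(outcomes)
    if "denied" in outcomes:
        # Any permission-denied file wins, even if other files were readable.
        return Status.CRITICAL
    if not outcomes or all(o == "missing" for o in outcomes):
        # Nothing under /proc/pressure at all: PSI is not available.
        return Status.WARNING
    # Generic OSErrors are soft failures: logged, but the status stays OK.
    return Status.OK
```

For example, overall_status(["ok", "oserror", "ok"]) stays OK, while overall_status(["ok", "denied", "ok"]) is CRITICAL.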
Extend the malformed io fixture to include a field with no equals sign
and an unknown field name so the existing test_malformed_line_is_ignored
exercises those branches.
Coverage on check.py: 84% -> 97%. The single remaining miss is a
defensive return that is unreachable because an earlier blank-line check
short-circuits the only path that could lead to it.
Soften the memory_full monitor so it does not page on transient pressure spikes that are normal during page-cache reclaim:
- threshold raised from 10% to 20%
- window broadened from avg10/last_1m to avg60/last_5m
- renotify cadence dropped from every 5m to every 30m
Add two missing 'full' (all-tasks-stalled) monitors. The previous set only covered 'some' for cpu and io but 'full' is the more severe state and deserves its own alert with appropriate threshold:
- cpu_pressure_full_critical: avg60/5m > 10% (kernel 5.13+ only)
- io_pressure_full_critical: avg60/5m > 15%
Strip the placeholder @ALL and @PagerDuty handles from the templates - notification routing belongs to the install-time configuration, not the shipped template. Add an explicit comment in each monitor's message explaining this.
Total recommended monitors: 5 (was 3).
Add a markdown note widget at the top of the dashboard explaining what 'some' vs 'full' mean, what the avg10/60/300 windows represent, and the kernel version requirements. Users opening the dashboard for the first time now understand the metrics without needing to click out to docs. Also wire the existing host template variable through to all six timeseries / toplist queries (was a hardcoded '*', now correctly filters by the selected host).
Add an explicit Compatibility section listing the kernel version requirements (4.20+ for the core integration, 5.13+ for the cpu.full metrics), the minimum Agent version (7.53), and the psi=1 boot parameter requirement. In the Support section, state the license (BSD-3-Clause matching the parent repo) and the contribution flow (issue first for non-trivial changes) so contributors and users know the maintenance model up front.
Add two new troubleshooting entries covering issues users hit most:
- 'yaml: cannot unmarshal !!map into string' when tags is written
as 'tags: - env: prod' (a list of maps) instead of 'tags:
- env:prod' (a list of strings). Shows the wrong vs correct shape
and the Python one-liner for linting conf.yaml standalone before
restarting the agent.
- 'check loads but no metrics in Datadog' explaining how to use
'datadog-agent check linux_psi' to confirm the integration is
emitting locally and distinguish a Datadog-account/network issue
from an integration issue.
Both write-ups come from real first-install experience. The tag-shape
mistake in particular is the classic Datadog config bug every team
hits exactly once; documenting it here saves the next user a
restart-debug cycle.
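To make the tag-shape mistake concrete, here is a small sketch that works on the already-parsed value of tags. It is not the one-liner from the README; the helper name malformed_tags is an assumption for this example. A YAML item written as 'env: prod' parses to a mapping, whereas 'env:prod' (no space) parses to a plain string.

```python
from typing import Any, List


def malformed_tags(tags: Any) -> List[Any]:
    """Return the entries of `tags` that are not plain 'key:value' strings."""
    if not isinstance(tags, list):
        return [tags]  # tags must be a list in the first place
    return [t for t in tags if not isinstance(t, str)]


# What 'tags:\n  - env: prod' looks like after YAML parsing (list of maps).
wrong = [{"env": "prod"}]
# What 'tags:\n  - env:prod' looks like after YAML parsing (list of strings).
right = ["env:prod"]

bad_entries = malformed_tags(wrong)  # the offending mapping
ok_entries = malformed_tags(right)   # empty list: nothing to fix
```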
Adds an optional 'resources' list to the instance config. Users can restrict collection to a subset of cpu, memory, io to handle hosts where one resource's pressure is permanently elevated for legitimate reasons (e.g., a database host with sustained high I/O) or where a specific /proc/pressure file is masked by container security policy.
Unknown resource names raise ConfigurationError at check startup with the offending values surfaced in the error message rather than degrading silently.
Why this is better than the existing metric_patterns workaround: metric_patterns suppresses the *metrics* but the check still opens and reads the file. That means the linux_psi.can_read service check still degrades if a 'noisy' resource has permission issues. The resources option skips the file entirely so the service check stays clean.
Three new unit tests:
- test_resources_config_filters_collection
- test_resources_config_rejects_unknown (ConfigurationError path)
- test_resources_config_preserves_order_and_dedups
Coverage now 98% on check.py (was 97%). Total tests: 15.
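A rough sketch of the validation the three tests describe (filtering, rejecting unknown names, preserving order while de-duplicating). The names normalize_resources and ConfigError are hypothetical; ConfigError stands in for the Agent's ConfigurationError so the example runs without the Agent installed, and treating an empty list as "collect everything" is an assumption.

```python
from typing import Iterable, List

ALLOWED = ("cpu", "memory", "io")


class ConfigError(ValueError):
    """Hypothetical stand-in for the Agent's ConfigurationError."""


def normalize_resources(configured: Iterable[str] = ()) -> List[str]:
    """Validate the 'resources' option and return it de-duplicated, in order."""
    requested = list(configured) or list(ALLOWED)  # empty means "all" (assumed)
    unknown = [r for r in requested if r not in ALLOWED]
    if unknown:
        # Fail loudly at startup, naming the offending values.
        raise ConfigError(f"unknown resources: {unknown}; allowed: {list(ALLOWED)}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(requested))
```

With this shape, a resource left out of the list is never opened, which is what keeps the service check independent of files the user chose to skip.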
Read /proc/sys/kernel/osrelease, parse the major.minor.patch prefix
with a defensive regex, and submit it via set_metadata with the
semver scheme. The kernel version then appears in the tile's
Integration metadata column, making fleet-wide audits possible:
- 'how many hosts are on a kernel that lacks cpu.full PSI (< 5.13)?'
- 'are all my hosts on a kernel that supports the psi=1 boot param
(4.20+)?'
Refactor the procfs path resolution: _set_paths now stores
self._proc_root, which both pressure_dir and the kernel-version
metadata read use. This means the existing procfs_path config (for
containerized agents that mount /proc at /host/proc) automatically
covers the kernel version read as well.
Six new test cases:
- test_kernel_version_metadata_parses: parametrized over four
distro-specific osrelease strings (Ubuntu, plain, sticky-patch
with +, Debian)
- test_kernel_version_metadata_missing_file_is_silent: graceful
no-op when osrelease cannot be opened
- test_kernel_version_metadata_garbled_is_silent: graceful no-op
on unparseable osrelease content
Total tests: 21 (was 15).
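A defensive parse of the osrelease string might look like the sketch below. The regex and the function name parse_kernel_version are assumptions for illustration, not the check's actual pattern; padding a missing patch component with 0 is also an assumption.

```python
import re
from typing import Optional

# Leading major.minor[.patch]; anything after it (e.g. "-91-generic", "+") is ignored.
_KERNEL_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_kernel_version(osrelease: str) -> Optional[str]:
    """Return 'major.minor.patch' from an osrelease string, or None if unparseable."""
    match = _KERNEL_RE.match(osrelease.strip())
    if match is None:
        return None  # garbled content: caller treats this as a silent no-op
    major, minor, patch = match.groups()
    return f"{major}.{minor}.{patch or '0'}"
```

Returning None rather than raising mirrors the "graceful no-op" behaviour the tests describe for missing or garbled files.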
Three vertical bars in rising height (green / amber / red) representing the three PSI resources CPU / memory / I/O at increasing pressure levels. Communicates the integration's purpose at a glance in the tile catalog and stays recognizable at small sizes. No gradients, no text, no external font dependencies. Works on both light and dark tile backgrounds. ~25 lines of SVG total.
The v1 dashboard is the floor - every metric has a place, every
resource is represented. The dashboard that actually answers
operational questions is one design pass beyond that, and writing the
design down now means whoever picks this up later (including the
original author six months from now) does not have to re-derive it.
DASHBOARD_DESIGN.md covers:
- Design philosophy: organize panels by the question an operator
wants answered, not by the metric being shown
- Six question-driven panels for v1.1 (stacked three-resource view,
severe-contention indicator, pressure-budget heatmap, pressure vs
utilization quadrant, week-over-week delta with deploy overlay,
pressure SLO board)
- The composite pressure score (computed at query time, no extra
storage cost, single highest-density visualization)
- Candidates I considered and ruled out, with the reasoning
- Recommended shipping order so each panel goes out independently
- Validation checklist before merging any panel into the main JSON
- Links to the relevant Datadog function and widget docs
This is the kind of doc that turns 'whatever the original author
intended is lost' into 'here is exactly why we shipped what we
shipped and what would come next.'
Add opt-in collection of per-cgroup pressure stall data from
/sys/fs/cgroup/.../<resource>.pressure. Disabled by default; enable
by setting the cgroup_roots config option to a list of slices to
walk (e.g., system.slice, kubepods.slice).
Per-cgroup metrics live in their own namespace
'system.pressure.cgroup.<resource>.<kind>.<key>' so they do not
co-mingle with host-level 'system.pressure.<resource>.<kind>.<key>'
in aggregate queries (avg:metric{*} would otherwise blend unrelated
host and cgroup values). Each per-cgroup metric carries cgroup_path and cgroup_root
tags so users can drill in by namespace/service/pod, and so the
Agent tagger can enrich them downstream with k8s metadata when
present.
Cardinality is bounded by three caps:
- cgroup_max_depth (default 2) limits subdirectory recursion
- cgroup_max_count (default 200) limits total cgroups per check run
- cgroup_roots is opt-in so users only walk what they want
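The depth and count caps can be sketched as a bounded breadth-first walk. This is an illustration only: walk_cgroups is a hypothetical name, and the depth convention (the configured root counts as depth 0) is an assumption rather than the check's documented semantics.

```python
import os
from typing import List


def walk_cgroups(root: str, max_depth: int = 2, max_count: int = 200) -> List[str]:
    """Return at most max_count cgroup directories under root, breadth-first."""
    found: List[str] = []
    queue = [(root, 0)]
    while queue and len(found) < max_count:
        path, depth = queue.pop(0)
        found.append(path)
        if depth >= max_depth:
            continue  # do not descend below the depth cap
        try:
            with os.scandir(path) as entries:
                children = sorted(
                    e.path for e in entries if e.is_dir(follow_symlinks=False)
                )
        except OSError:
            continue  # unreadable directory: skip it, keep walking siblings
        queue.extend((child, depth + 1) for child in children)
    return found
```

Because every cgroup contributes up to 24 series, the count cap bounds the number of tagged series a single check run can produce at roughly 24 times max_count.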
cgroup PSI requires cgroup v2 (the unified hierarchy). The check
detects v1 hosts by the absence of cgroup.controllers at the
cgroupfs root and emits a WARNING service check
(linux_psi.cgroup.can_read) with a clear message instead of
silently failing.
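The v1-versus-v2 detection described here reduces to a single file-existence test. A minimal sketch, with is_cgroup_v2 as a hypothetical helper name:

```python
import os


def is_cgroup_v2(cgroupfs_path: str = "/sys/fs/cgroup") -> bool:
    """True when the unified (v2) hierarchy is mounted at cgroupfs_path.

    cgroup.controllers exists at the root of a cgroup v2 mount; on a v1
    host, or when the path does not exist, the file is absent.
    """
    return os.path.isfile(os.path.join(cgroupfs_path, "cgroup.controllers"))
```

A False result is the point at which the check would emit the WARNING on linux_psi.cgroup.can_read while host-level collection continues.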
Refactor _emit_line and _read_one to take namespace + tags
parameters so they can serve both the host path (unchanged metric
names) and the new cgroup path (system.pressure.cgroup.* with
cgroup_path tags). Existing 21 tests pass unchanged.
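The namespace and tags parameters can be pictured with a small name-building helper. This is a sketch of the idea only; build_metric and its return shape are hypothetical and do not reflect the real _emit_line signature.

```python
from typing import List, Optional, Tuple

HOST_NAMESPACE = "system.pressure"
CGROUP_NAMESPACE = "system.pressure.cgroup"


def build_metric(
    resource: str,
    kind: str,
    key: str,
    namespace: str = HOST_NAMESPACE,
    tags: Optional[List[str]] = None,
) -> Tuple[str, str, List[str]]:
    """Return (metric name, submission type, tags) for one PSI field."""
    # 'total' is a cumulative counter; the avg* fields are point-in-time gauges.
    metric_type = "monotonic_count" if key == "total" else "gauge"
    return f"{namespace}.{resource}.{kind}.{key}", metric_type, list(tags or [])


host_metric = build_metric("cpu", "some", "avg10")
cgroup_metric = build_metric(
    "memory",
    "full",
    "total",
    namespace=CGROUP_NAMESPACE,
    tags=["cgroup_path:system.slice/sshd.service", "cgroup_root:system.slice"],
)
```

Defaulting to the host namespace with no tags is what lets the existing host path keep its metric names unchanged.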
…ice check
Adds metadata.csv rows for the 24 system.pressure.cgroup.* metrics emitted by the new cgroup PSI collection path. Each row notes that the metric is per-cgroup and lists cgroup_path as a sample tag so the Datadog Metrics Summary UI shows it in the tag dimension column.
Adds the linux_psi.cgroup.can_read service check entry with both states documented (OK when cgroup v2 hierarchy is readable, WARNING when the host is on cgroup v1 or the cgroupfs path is missing).
Total metrics: 48 (was 24). Total service checks: 2 (was 1).
Spec adds four instance options for the cgroup feature:
- cgroup_roots: opt-in list of slices to walk (default: empty,
feature disabled). When set, must be a YAML list.
- cgroupfs_path: cgroup filesystem root (default /sys/fs/cgroup)
- cgroup_max_depth: subdirectory recursion limit (default 2)
- cgroup_max_count: cardinality cap per check run (default 200)
Regenerated config_models and conf.yaml.example from the updated
spec. Documented each option with a paragraph explaining when to
tune it.
Five new test cases:
- test_cgroup_disabled_by_default: no cgroup walking when
cgroup_roots is empty, no cgroup service check emitted
- test_cgroup_collects_metrics: full walk of a fake systemd-style
slice (system.slice + 2 services), verifies the per-cgroup
metrics are emitted with the correct cgroup_path and cgroup_root
tags
- test_cgroup_v1_host_warns: a cgroupfs path that lacks the
cgroup.controllers marker (i.e., cgroup v1) yields a clear
WARNING service check, host-level metrics keep flowing
- test_cgroup_max_count_caps_cardinality: 5 cgroups under a
max_count of 2 emits at most 2 cgroup paths worth of metrics
- test_cgroup_roots_rejects_non_list: a scalar value for
cgroup_roots raises ConfigurationError at check startup
Total tests: 26 (was 21).
README updates:
- Configuration table gains four new options (cgroup_roots,
cgroupfs_path, cgroup_max_depth, cgroup_max_count) with the
when-to-tune-them rationale documented inline
- Data Collected section explains the new system.pressure.cgroup.*
namespace and the cgroup_path / cgroup_root tags
- Service Checks table lists linux_psi.cgroup.can_read
- Compatibility matrix gains the cgroup v2 (kernel 5.2+) row
- Two new troubleshooting entries for the cgroup-v2-not-detected
case and the cgroup_max_count warning
DASHBOARD_DESIGN updated to note that per-cgroup PSI is now available
(it was previously listed as deferred). A v1.2 dashboard pass should
add the per-cgroup breakdowns, with the canonical 'which workload is
stressing this host' panel (top-list of cgroups by cgroup.cpu.some.
avg300) called out as the highest-value addition.
Version bump 1.0.0 -> 1.1.0 per SemVer (new opt-in feature,
backwards-compatible). CHANGELOG entry documents the added and
changed bits.
A misconfigured conf.yaml could previously make the check walk and
emit metrics for directories outside the cgroupfs root. Three attack
vectors are now closed at the config and walk layers:
- cgroup_roots entries containing parent-directory references (..)
are rejected at check startup with a ConfigurationError
- cgroup_roots entries that are absolute paths are rejected at
check startup
- cgroup_roots entries that resolve outside cgroupfs_path via a
symlink (e.g., a 'sneaky.slice' symlink pointing at /etc) are
skipped at walk time with a warning log; no metrics emit from
the escape target
Eight new tests cover the rejection and skip paths, including a
symlink-escape test that confirmed the vulnerability before the fix
landed.
The _is_within_cgroupfs helper uses os.path.realpath on both the
candidate root and the configured cgroupfs_path then verifies the
candidate is at or below the base. This catches any future symlink
games on intermediate path components, not just the final segment.
Total tests: 34 (was 26).
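The two layers of protection (config-time rejection and walk-time containment) can be sketched as below. The names validate_root_entry, is_within and ConfigError are hypothetical; ConfigError stands in for the Agent's ConfigurationError, and the use of os.path.commonpath for the "at or below the base" test is one possible way to do it, not necessarily what _is_within_cgroupfs does.

```python
import os


class ConfigError(ValueError):
    """Hypothetical stand-in for the Agent's ConfigurationError."""


def validate_root_entry(entry: str) -> str:
    """Reject cgroup_roots entries that are absolute or contain '..'."""
    if os.path.isabs(entry):
        raise ConfigError(f"cgroup_roots entry must be relative: {entry!r}")
    if ".." in entry.replace("\\", "/").split("/"):
        raise ConfigError(f"cgroup_roots entry must not contain '..': {entry!r}")
    return entry


def is_within(base: str, candidate: str) -> bool:
    """True if candidate, after resolving symlinks, is at or below base."""
    real_base = os.path.realpath(base)
    real_candidate = os.path.realpath(candidate)
    try:
        # commonpath compares whole path components, so '/a/bc' is not under '/a/b'.
        return os.path.commonpath([real_base, real_candidate]) == real_base
    except ValueError:
        return False  # e.g. paths on different drives
```

Resolving both sides with realpath before comparing is what makes a symlinked intermediate component fail the containment test, not just a symlinked final segment.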
Previously, when the cgroup_max_count cap was reached while walking the first root in cgroup_roots, the inner for-loop broke but the outer for-loop continued to the next root. That root would immediately re-check the cap, break again, and log the cap-hit warning a second time. With N roots configured the warning fired N times.
Fix: hoist the cap-hit state into a flag the outer loop checks. One warning per check run regardless of how many roots are configured; also avoids the wasted os.path.isdir / _is_within_cgroupfs work on the unreached roots.
Test: test_cgroup_max_count_breaks_across_multiple_roots sets up two roots with multiple cgroups each, max_count=1, and asserts the warning is logged exactly once (the test failed with count 2 before the fix).
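The shape of the fix can be illustrated with a stripped-down loop. This sketch replaces the real filesystem walk with a plain dict of root to cgroup names and counts warnings instead of logging them; collect_with_cap is a hypothetical name.

```python
from typing import Dict, List, Tuple


def collect_with_cap(roots: Dict[str, List[str]], max_count: int) -> Tuple[List[str], int]:
    """Collect cgroups across roots, stopping everything once the cap is hit."""
    collected: List[str] = []
    warnings = 0
    cap_hit = False
    for _root, cgroups in roots.items():
        if cap_hit:
            break  # the hoisted flag stops the outer loop as well
        for cgroup in cgroups:
            if len(collected) >= max_count:
                cap_hit = True
                warnings += 1  # stands in for the single cap-hit log line
                break
            collected.append(cgroup)
    return collected, warnings


result = collect_with_cap(
    {"system.slice": ["a", "b"], "kubepods.slice": ["c", "d"]}, max_count=1
)
```

Without the cap_hit check at the top of the outer loop, the second root would re-enter the inner loop, hit the same condition and add a second warning.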
Datadog truncates tag values longer than 200 chars at the backend
silently. For deeply-nested k8s pod cgroup paths (kubepods.slice/
kubepods-burstable.slice/kubepods-burstable-pod<uuid>.slice/cri-
containerd-<container-id>.scope) the path can easily exceed 200 chars
and the resulting tag becomes either truncated arbitrarily or rejected.
Truncate in-band with a visible '...truncated' sentinel so:
- the truncation is reproducible and auditable in logs / dashboards
- the sentinel makes it obvious which tags were affected so the
operator can address it (e.g., adjust cgroup_max_depth
or normalize the cgroup naming upstream)
- we never emit a tag the backend would silently mangle
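In-band truncation with a sentinel is a few lines of string handling. The sketch below assumes the 200-character limit applies to the whole key:value tag string and that the limit is longer than the sentinel; truncate_tag is a hypothetical helper name.

```python
TAG_LIMIT = 200
SENTINEL = "...truncated"


def truncate_tag(tag: str, limit: int = TAG_LIMIT, sentinel: str = SENTINEL) -> str:
    """Shorten an over-long tag to `limit` chars, ending in a visible sentinel."""
    if len(tag) <= limit:
        return tag
    # Reserve room for the sentinel so the result is exactly `limit` long.
    return tag[: limit - len(sentinel)] + sentinel


long_tag = "cgroup_path:" + "x" * 200
short_tag = truncate_tag(long_tag)
```

Since the cut is deterministic, the same cgroup path always maps to the same truncated tag, which keeps series stable across check runs.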
Test: builds a 200-character single-segment cgroup name (under the
255-char filename limit on most filesystems but enough to push the
full cgroup_path tag over 200 chars). Asserts the emitted tag is
<=200 chars and ends with the sentinel.
Four targeted tests for previously-uncovered defensive paths:
- test_cgroupfs_path_missing_entirely_warns: cgroupfs_path set to a
nonexistent directory yields a WARNING service check (distinct
from the existing 'exists but not v2' case)
- test_cgroup_root_in_config_does_not_exist_on_disk: configuring
kubepods.slice on a host without it logs at debug and continues
cleanly; service check stays OK because the rest of the feature
is working
- test_walker_skips_scandir_permission_error: when os.scandir
raises PermissionError on a subdirectory, the walker skips it
and continues emitting for sibling cgroups
- test_emit_cgroup_handles_pressure_file_oserror: an EIO read on
one cgroup's pressure file is logged and skipped; other cgroups
and resources still emit
The remaining 7% of uncovered lines in check.py are:
- the OSError handler for os.path.realpath in _is_within_cgroupfs
(requires deeper filesystem mocking)
- the real_path == real_base edge case (root cgroup itself)
- default-parameter paths in _emit_line that are only reachable
via private API calls the test suite has no reason to exercise
These are all defensive code paths with no realistic trigger; the
93% measured coverage represents effectively-complete coverage of
the code that will actually run in production.
Total tests: 40 (was 36).
Replaces the 7-widget 'metric inventory' v1 dashboard with 30 widgets
organized into 7 collapsible groups, each scoped to a different
operator question. The dashboard now reads top-to-bottom from urgency
to investigation:
1. About header - one paragraph orientation + legend
2. Status (pink) - 4 KPI tiles with red/yellow/green conditional
formatting answering 'is anything on fire?' in under 5 seconds:
worst memory.full, worst cpu.some, worst io.some, composite score
3. Live trends (blue) - stacked 3-resource chart answering 'where
is the wedge?' plus per-resource panels with warning/critical
y-markers and a composite-by-host trendline
4. Fleet view (purple) - top-N hosts for CPU/memory/IO + cumulative
stall (last 24h) leaderboard + memory pressure distribution
histogram. The 'who's worst?' answer.
5. Per-host investigation (green) - some-vs-full overlay per
resource + stall-rate-per-second derived from total counter;
uses the host template variable
6. Per-cgroup view (yellow) - top cgroups by CPU/memory + per-
cgroup timeseries; populated only when cgroup_roots is configured
(clear note widget says so). The noisy-neighbor view.
7. Capacity & trends (orange) - this-week-vs-last-week timeseries
+ pressure-budget leaderboard. The quarterly-review view.
8. Reference (gray) - all-windows per-resource (avg10/60/300) for
custom-query authoring. Not for at-a-glance use.
Five template variables (host, env, service, cgroup_path, cgroup_root)
so the same dashboard serves fleet-wide and per-host views without
duplication.
DASHBOARD_DESIGN.md updated to reflect the v1.1 work has shipped and
to lay out the v1.2 roadmap (pressure-vs-utilization scatter,
anomaly bands, SLO widget, deploy-event overlay, forecast).
The visual hierarchy is deliberate: on-call sees groups 1+2 without
scrolling and gets their answer; capacity planner expands group 7;
performance engineer expands groups 5+6 during regression
investigation. Nobody scrolls past noise they don't need.
Two real bugs surfaced when the dashboard was imported and tested
against live data:
- Cumulative stall toplist used 'sum:' across hosts of an as_rate()
counter, which double-counts when there are multiple host
series. Switched to 'avg:' (correct per-host rate) and dropped
the microsecond unit format which was misleading on a derived
per-second rate.
- Distribution widget rendered 'No data' because it lacked the
required 'response_format: scalar' and explicit aggregator on
the query. Datadog requires both for distribution widgets even
though most other widget types are forgiving about defaults.
Caught during the import-into-account validation step that the
DASHBOARD_DESIGN.md doc calls out as mandatory before merging any
dashboard JSON change. JSON parses; all 40 unit tests still pass.
Six PNGs captured from a live Datadog account after running the
20-minute stress workload across two real hosts (hpaserver and
voseghale-HP). Each image corresponds to one of the dashboard's
seven groups and is captioned with the operator question it answers:
- dashboard_overview.png - hero shot with the four KPI tiles
showing red/yellow/green conditional formatting in action
- dashboard_fleet_view.png - multi-host top-lists plus the
cumulative stall-time leaderboard (proves it works across
a real fleet, not just one host)
- dashboard_per_host_investigation.png - per-host pressure
spikes from the stress run clearly visible as plateaus on
each chart, with the derived stall-rate-per-second view
- dashboard_per_cgroup.png - top systemd-service cgroups by
CPU and memory pressure with real cgroup_path tag values
(proves the cgroup feature works end-to-end)
- dashboard_capacity_trends.png - this-week-vs-last-week
timeseries with the pressure budget leaderboard
- dashboard_reference.png - all-three-windows per resource
plus a Datadog auto-detected anomaly marker
Registered in manifest.json tile.media[] so the Datadog tile catalog
listing carousels through them when a user views the integration.
…names test variable.
…. it's safe here because the test mutates instance in place; the tests that customize it use dict-spread {**instance, ...}, which produces a fresh dict
… and author details
Pull request overview
This PR introduces a new linux_psi integration to collect Linux PSI (Pressure Stall Information) from /proc/pressure/* (host-level) and optionally per-cgroup PSI from cgroup v2 (*.pressure), and ships the accompanying assets (metadata, config spec, dashboard, monitor templates), tests, and CI/codecov wiring.
Changes:
- Added LinuxPSICheck implementation to emit PSI gauges (avg10/60/300) and monotonic counts (total), plus kernel-version metadata and opt-in cgroup v2 walking.
- Added unit + integration tests and fixture PSI files to validate parsing, error handling, config behavior, and cgroup enumeration/cardinality caps.
- Added integration packaging + assets (README, config spec, service checks, dashboard, monitors), and registered the integration in CI, CODEOWNERS, and Codecov.
Reviewed changes
Copilot reviewed 38 out of 46 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| linux_psi/datadog_checks/linux_psi/check.py | Implements PSI collection (host + optional per-cgroup), parsing, service checks, kernel metadata, and cgroup traversal/cardinality logic. |
| linux_psi/datadog_checks/linux_psi/__init__.py | Exposes LinuxPSICheck and version. |
| linux_psi/datadog_checks/linux_psi/__about__.py | Defines integration version. |
| linux_psi/datadog_checks/linux_psi/data/conf.yaml.example | Documents instance configuration options (resources + cgroup options). |
| linux_psi/assets/configuration/spec.yaml | Declarative configuration spec used to generate config models and docs. |
| linux_psi/datadog_checks/linux_psi/config_models/__init__.py | Generated config model mixin wiring. |
| linux_psi/datadog_checks/linux_psi/config_models/instance.py | Generated Pydantic instance config model. |
| linux_psi/datadog_checks/linux_psi/config_models/shared.py | Generated Pydantic shared config model. |
| linux_psi/datadog_checks/linux_psi/config_models/defaults.py | Generated defaults for instance options. |
| linux_psi/datadog_checks/linux_psi/config_models/validators.py | Placeholder for custom validators/transformers. |
| linux_psi/tests/test_unit.py | Unit tests using fixture directories to validate parsing, error handling, and cgroup behavior without requiring Linux. |
| linux_psi/tests/test_integration.py | Integration test that reads real /proc/pressure/* when available on Linux. |
| linux_psi/tests/conftest.py | Test fixtures for shared instance config and dd_environment. |
| linux_psi/tests/common.py | Test helper utilities for fixture paths/reads. |
| linux_psi/tests/__init__.py | Marks test package. |
| linux_psi/tests/fixtures/pressure_cpu_normal | PSI fixture content for CPU (some/full). |
| linux_psi/tests/fixtures/pressure_cpu_no_full | PSI fixture content for CPU on kernels lacking full. |
| linux_psi/tests/fixtures/pressure_memory_normal | PSI fixture content for memory (normal). |
| linux_psi/tests/fixtures/pressure_memory_stressed | PSI fixture content for memory (stressed). |
| linux_psi/tests/fixtures/pressure_io_normal | PSI fixture content for IO (normal). |
| linux_psi/tests/fixtures/pressure_io_malformed | PSI fixture content to validate malformed parsing behavior. |
| linux_psi/README.md | User-facing integration documentation (setup, config, troubleshooting, compatibility). |
| linux_psi/metadata.csv | Metric metadata for host-level and cgroup-level PSI metrics. |
| linux_psi/manifest.json | Integration manifest including tile metadata and assets registration. |
| linux_psi/assets/service_checks.json | Service-check metadata for linux_psi.can_read and linux_psi.cgroup.can_read. |
| linux_psi/assets/dashboards/linux_psi_overview.json | Shipped dashboard for PSI operational views and drilldowns. |
| linux_psi/assets/monitors/cpu_pressure_some_high.json | Recommended monitor template for high CPU some pressure. |
| linux_psi/assets/monitors/cpu_pressure_full_critical.json | Recommended monitor template for severe CPU full pressure. |
| linux_psi/assets/monitors/memory_pressure_full_critical.json | Recommended monitor template for severe memory full pressure. |
| linux_psi/assets/monitors/io_pressure_some_high.json | Recommended monitor template for high I/O some pressure. |
| linux_psi/assets/monitors/io_pressure_full_critical.json | Recommended monitor template for severe I/O full pressure. |
| linux_psi/assets/logo.svg | Integration logo asset. |
| linux_psi/DASHBOARD_DESIGN.md | Design rationale and maintenance notes for the shipped dashboard JSON. |
| linux_psi/CHANGELOG.md | Integration changelog entries for releases/features. |
| linux_psi/pyproject.toml | Packaging metadata and dependency declarations. |
| linux_psi/hatch.toml | Hatch environment configuration for the integration. |
| .github/workflows/test-all.yml | Adds CI job to run linux_psi tests on Linux. |
| .github/CODEOWNERS | Adds CODEOWNERS entry for the new integration directory. |
| .codecov.yml | Adds Codecov flag + project status entries for linux_psi. |
Fix - Validate the type up-front and raise a clear ConfigurationError when it's not a list/tuple of strings. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
What does this PR do?
Adds a new integration linux_psi to monitor Linux kernel PSI (Pressure Stall Information).
Motivation
PSI (kernel 4.20+) is the canonical signal for "how much time is this system stalling on contention" - far more useful than CPU% for diagnosing "everything is slow" incidents. The existing linux_proc_extras integration does not cover PSI. No other integration in core or extras does either.
Per-cgroup PSI extends the same signal to the noisy-neighbor question: which workload is generating pressure on this host? This is the answer container-heavy fleets need but cannot get from host-wide metrics alone.
Review checklist
no-changelog label attached
Additional Information
Ships at v1.1.0 with: