From e1bb667eab41e4b17bb67a5381cb639d711f358e Mon Sep 17 00:00:00 2001
From: Santiago Mola
Date: Wed, 22 Apr 2026 14:56:47 +0200
Subject: [PATCH 1/3] plan

---
 PLAN.md | 560 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 560 insertions(+)
 create mode 100644 PLAN.md

diff --git a/PLAN.md b/PLAN.md
new file mode 100644
index 00000000000..c5ab510fb03
--- /dev/null
+++ b/PLAN.md
@@ -0,0 +1,560 @@
# Wait Conditions — Incremental Reintroduction Plan

## 1. Context

### What PR #1292 tried to do

[PR #1292](https://github.com/DataDog/system-tests/pull/1292) ("Add wait conditions to check when setup is ready"), closed unmerged in June 2024, replaced the fixed post-setup sleep (`library_interface_timeout`, `agent_interface_timeout`, `backend_interface_timeout`) with **precondition checks**: instead of sleeping N seconds and hoping the tracer has flushed all data, the scenario actively polls until it observes that the expected telemetry has arrived. When the condition is met, it proceeds immediately; otherwise it falls back to a deadline.

The PR introduced one monolithic module `utils/wait_conditions.py` with six built-in conditions, plus a global registry for test-defined conditions:

1. **`_wait_for_test_requests`** (PHP only): wait until every `rid` ever issued by the weblog has been seen in the tracer. Needed for multi-worker weblogs where a single watermark request is not enough.
2. **`_wait_for_request`** (watermark): send one final `GET /` and wait until its trace arrives at the agent proxy. Assumption: all previous traces arrive before or together with the watermark. Returns the index `n` of the watermark message, used later as a read-after-write barrier. When a sampling rate is set, issues a new watermark request each poll cycle (because any single request may be dropped).
3. **`_wait_for_request_in_agent`**: same idea, one level deeper — wait until the watermark request is also visible on the agent interface, not just the tracer's.
4. **`_wait_for_telemetry`**: after the watermark index, wait for 2 more telemetry heartbeats. Two heartbeats after the watermark means anything batched into telemetry before the watermark has had time to be flushed. Special-cased to short-circuit when telemetry isn't emitted at all (Ruby only sends some messages but not heartbeats).
5. **`_wait_for_remote_config`**: if `proxy_state.mock_remote_config_backend` is set, wait until `len(MOCKED_RESPONSES) + 2` config requests have been seen — enough to guarantee every mocked RC response was delivered.
6. **`_wait_for_otel_request`**: OTel-specific watermark based on `/basic/trace`.
7. **Custom conditions**: tests can call `wait_conditions.add(timeout, fn)` to register arbitrary boolean callables (used by DSM and profiling).

It also made collateral changes:

- Replaced the three per-interface timeouts with a single `post_setup_timeout`.
- Added `post_setup=True` to `weblog.request()` to allow a watermark request outside the setup phase (normally forbidden).
- Added `weblog.get_all_seen_rids()` and `interfaces.library.get_all_rids()`.
- Tightened `DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS` to 1s for many RC scenarios so the watermark converges faster.
- Removed per-interface `wait()` methods and `_init_rid_to_library_trace_ids` on the backend interface.
- Added `cpuset_cpus="0"` when starting containers (unrelated cgroup tweak).
- Enabled DSM/profiling wait conditions in the corresponding tests.
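
Taken together, the PR's model was a module-level registry of boolean callables that the post-setup phase polled until each one passed or its deadline expired. The following is a rough, hypothetical sketch of that shape (heavily simplified, not code from the branch); it is only here to make the criticisms in the next section concrete. `_WAIT_CONDITIONS` and the `add(timeout, fn)` signature are mentioned elsewhere in this plan; `wait_for_all` and the rest are illustrative names.

```python
# Hypothetical reconstruction of PR #1292's approach, heavily simplified.
import time
from typing import Callable

_WAIT_CONDITIONS: list[tuple[float, Callable[[], bool]]] = []  # module-level, mutable


def add(timeout: float, condition: Callable[[], bool]) -> None:
    """Built-in and test-defined conditions all land in the same global list."""
    _WAIT_CONDITIONS.append((timeout, condition))


def wait_for_all(poll_interval: float = 0.5) -> None:
    """Poll each registered condition in order until it passes or its deadline expires."""
    for timeout, condition in _WAIT_CONDITIONS:
        deadline = time.monotonic() + timeout
        while not condition():  # re-evaluated on every iteration: busy polling
            if time.monotonic() >= deadline:
                break  # give up and fall back to the deadline
            time.sleep(poll_interval)
```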

### Why it didn't merge

The PR's own description lists open TODOs — "Move most logic to `wait_conditions.py`", `/make_distant_call` compatibility, documentation of _why_ the conditions actually work — and there are no review comments, no CI runs recorded, and the branch was closed a year after the last commit. Reading the diff with 2.5 years of hindsight, the likely blockers were:

- **Big-bang semantics change**: 11 files modified, the wait model replaced globally, behavior redefined for every scenario at once. Hard to bisect if a single library starts flaking.
- **Global mutable state**: `_WAIT_CONDITIONS` is a module-level list. Any cross-scenario pollution (e.g. a condition added during one test and not drained) silently affects the next scenario in the same process.
- **Polling under the GIL**: `while True: list(interfaces.library.get_data())` materializes the entire `_data_list` on every iteration. Today we already have an **event-driven** `Interface.wait_for(predicate, timeout)` that wakes up from `_append_data` — polling would be a regression.
- **Assumptions not yet justified**: "2 heartbeats after watermark is enough", "sampling rate + new request per iteration is correct", "agent-side watermark subsumes backend wait". None of these were documented or tested in isolation.
- **PHP fallback silently broken**: `_wait_for_test_requests` loops with no `time.sleep`, burning CPU while contending with the ingest path.
- **No wait for E2E backend**: despite the checkbox, the PR actually *removed* `backend_interface_timeout` without replacing its single known use case (APM tracing E2E, which waits for backend lookups by trace ID).
- **Changed test contract**: `test_dsm.py` now reads hashes in `setup_*` and stores them on `self`, then asserts on `self.r.status_code == 200` too. That is the right direction but it's a separate, large refactor entangled with the wait refactor.

### What has changed in the codebase since April 2024

The tree has moved on:

- `utils/_context/_scenarios.py` has been split into a package (`utils/_context/_scenarios/*.py`). The per-interface timeouts still exist in `endtoend.py:195-205, 284-327, 466-470` and in every scenario declaration in `__init__.py` and `debugger.py`.
- `Interface.wait_for(predicate, timeout)` is **already implemented** (`utils/interfaces/_core.py:199-222`) and used by several tests: `_remote_config.py`, `tests/ffe/test_dynamic_evaluation.py`, `tests/debugger/utils.py`, `tests/stats/test_stats.py`, `tests/test_sampling_rate_capping.py`, `tests/remote_config/test_remote_configuration.py`, and more.
- The interface base class owns a `threading.Event` and dispatches it on every `_append_data`, so predicate-based waits wake up as soon as the matching payload arrives — no polling required.
- Replay mode is well established: `load_data_from_logs()` is the codepath used for replay, and `post_setup()` branches on `self.replay`.
- `force_interface_timout_to_zero` exists to skip waiting when no test was collected.
- `_wait_for_app_readiness()` exists as a pre-setup warmup and uses `interfaces.library.ready.wait(40)` — a different `ready` event we can reuse.
- `containers.py` has moved, `_fix_host_pwd_in_volumes` is gone from the path the PR touched, and `cpuset_cpus="0"` is **not something we want** — it would pin every container to one CPU.

### Goals

The underlying idea is still valuable: **replace opaque sleeps with explicit, observable conditions**.
That makes CI faster when everything works, more deterministic when it doesn't, and gives us a single place to encode "what does 'setup is finished' really mean for scenario X".

This plan aims to deliver the same end state as PR #1292, but:

1. **Strictly additive** at every step — existing `interface.wait(timeout)` keeps working until the last step removes it.
2. **Event-driven** — reuse the existing `Interface.wait_for` and its `threading.Event`, do not reintroduce the PR's busy polling.
3. **Scoped per scenario class** — no module-level global registry; the conditions live on the scenario instance and are drained per run.
4. **One library, one RFC flag at a time** — the risky behavior changes (telemetry-heartbeat barrier, sampling-mode watermark) land behind a scenario-level opt-in before becoming the default.
5. **Documented assumptions** — every condition answers the question "why is this sufficient?" in a comment next to the code.
6. **Leave out the unrelated changes** — no `cpuset_cpus`, no `stop(timeout)` change, no container log tweaks, no `DD_REMOTE_CONFIG_POLL_INTERVAL` overrides unless we can show they are actually needed for the wait to converge.

### Non-goals

- Rewriting the DSM or profiling tests. If the existing fixed timeout is enough for them today, leave them alone; the wait-condition API will be available if they opt in later.
- Changing the pre-setup warmup / app-readiness path.
- Changing behavior in replay mode.
- Removing `tracer_sampling_rate` or `proxy_state` plumbing.
- Removing `_wait_for_app_readiness`.

---

## 2. Design

### 2.1 Module layout

```
utils/
  wait_conditions/
    __init__.py        # public API: WaitConditions, Condition
    _watermark.py      # one watermark request + per-interface observation
    _telemetry.py      # "N heartbeats after watermark" barrier
    _remote_config.py  # RC-count barrier, driven by proxy_state
    _otel.py           # /basic/trace watermark
    README.md          # why each condition is sufficient
```

A package, not a single module. Each condition is isolated and independently testable. No module-level mutable state.

### 2.2 Public API

```python
# utils/wait_conditions/__init__.py

from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One thing we are waiting for.

    `predicate` is called with no arguments; it should return True when the
    condition is met. Implementations should be cheap and idempotent.
    `description` is used in log lines and error messages.
    """

    predicate: Callable[[], bool]
    description: str
    timeout: float  # seconds


class WaitConditions:
    """Collects and runs a set of wait conditions for one scenario run.

    Instances are owned by a scenario; nothing is shared across scenarios or
    across processes.
    """

    def __init__(self) -> None: ...

    def add(self, condition: Condition) -> None: ...

    def run(self, *, deadline: float) -> list[Condition]:
        """Run every registered condition in parallel (thread-per-condition
        on top of the existing interface events). Returns the list of
        conditions that were not met before the deadline.
        """
```

Key differences from PR #1292's `_WAIT_CONDITIONS` global:

- **Instance-scoped.** The scenario holds one `WaitConditions` for the current run. After `post_setup` returns, it is dropped. There is no risk of leaking a condition from one scenario into the next.
- **Concurrent.** Every condition is waited on in its own thread, bounded by the scenario-level deadline.
  Sequential waits made sense in the original PR because the conditions were ordered (watermark first, then telemetry_after_watermark). Here we keep ordering where it matters (telemetry references the watermark index) by composing conditions in code rather than by relying on a list order.
- **No polling in the hot loop.** Each built-in condition builds on `Interface.wait_for(predicate, timeout)`, which already wakes up on `_append_data`. Custom conditions passed as raw callables still poll, but at a much lower rate and only when the user opts in.

### 2.3 Scenario integration

The scenario's `_wait_and_stop_containers` currently does:

```python
self._wait_interface(interfaces.library, library_interface_timeout)
...
self._wait_interface(interfaces.agent, agent_interface_timeout)
...
self._wait_interface(interfaces.backend, backend_interface_timeout)
```

We keep that shape. We add an optional earlier step:

```python
self._wait_for_setup_conditions(deadline=time.time() + self.post_setup_timeout)
# then the existing per-interface waits, with their timeouts clamped to the
# remaining budget
```

The existing interface-level timeouts become a **fallback**, not the primary mechanism. We do **not** delete them in the first steps.

### 2.4 Why each condition is sufficient

This section is the one that was missing from PR #1292. It goes into `utils/wait_conditions/README.md`.

- **Watermark trace**: We issue `GET /` after setup requests are done and wait until the tracer has flushed a trace containing the resulting `rid`. The trace pipeline is FIFO per worker process (this is a property of the agent writer, not of the library under test), so any span generated **before** the watermark request by the same worker has already been sent to the agent proxy by the time we observe the watermark. This is why multi-worker weblogs (php-fpm, Apache mod_php, passenger on Ruby if we ever add it) need the full-rid pre-pass.
- **Watermark in agent**: Observing the watermark on the tracer proxy only proves the tracer flushed. The agent may still be batching before sending to the backend proxy. So we wait for the watermark span to appear on the agent interface too. This subsumes what `_wait_interface(interfaces.agent, ...)` was buying us in the current code, for trace data.
- **Telemetry heartbeat barrier**: Telemetry is emitted on a wall-clock timer, not per-request. So the watermark trace does not tell us anything about telemetry that was due but not yet flushed. The "heartbeat" message is the one message that is guaranteed to be periodic regardless of request activity. Two heartbeats after we observed the watermark is enough to guarantee that: (a) at least one full heartbeat period has elapsed since the last request, (b) any batched telemetry generated by any pre-watermark request has had one full period to be flushed, plus one more period as safety. One is too few because heartbeat #1 could have been in-flight when the watermark was observed. Libraries that never send heartbeats (Ruby) are detected by the absence of heartbeat or app-started messages and short-circuit this wait.
- **Remote config count**: The mock RC backend is deterministic — it has exactly `len(MOCKED_RESPONSES)` payloads to deliver. Once we've seen `len(MOCKED_RESPONSES) + 2` config requests from the library, the library has ack'd every payload and has made two additional polls confirming "no new update".
  This is the classic "two consecutive no-ops prove we're caught up" pattern. The `+2` is not arbitrary: one poll could race with the last mocked response being applied; two polls after the last mocked response are necessary to observe the applied ack.
- **OTel watermark**: Same as trace watermark, but on `/basic/trace` via the OTel collector pipeline, observed on `interfaces.open_telemetry`.
- **Backend wait** (APM tracing E2E only): Today this uses `backend_interface_timeout=5`. For the backend there is no cheap watermark — the backend does not offer a "last-seen-rid" API we can observe. The existing fixed sleep is the right shape here; wait conditions should **not** try to replace it. We keep `backend_interface_timeout` as is and, when we retire `library_interface_timeout` and `agent_interface_timeout`, leave `backend_interface_timeout` untouched.

### 2.5 What we do not reintroduce from PR #1292

- **Global `_WAIT_CONDITIONS` list** — replaced by the per-scenario `WaitConditions` instance.
- **Busy polling** — replaced by `Interface.wait_for` + one bounded thread per condition.
- **`cpuset_cpus="0"`** — unrelated, actively harmful on multi-CPU runners.
- **`container.stop(timeout=10)`** — unrelated.
- **Bulk `DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS=1`** — reintroduced only on scenarios where step 6 empirically shows the watermark doesn't converge within the old timeout.
- **`DD_TELEMETRY_METRICS_INTERVAL=2`** — same as above, empirical only.
- **`get_rid_from_request` accepting a raw string** — small footgun, unrelated to wait logic.
- **Removal of `_init_rid_to_library_trace_ids`** from the backend interface — we will keep the `wait()` path alive on `_backend.py`.

---

## 3. Plan — one PR per step

Each step is small enough to review and run CI in isolation. Each step **leaves the tree green and the behavior unchanged for anything not explicitly opted in**.

### Step 0 — Baseline measurements

Before any change, record current `post_setup` durations per scenario from a main-branch CI run (dashboard or pytest log scrape). This is the number we are trying to beat. Commit nothing; file an internal note.

**Deliverable**: a list of "scenario → median post_setup wait time on main" numbers we can use as the before/after baseline.

### Step 1 — Skeleton `wait_conditions` package, no behavior change

- Create `utils/wait_conditions/__init__.py` with `WaitConditions` and `Condition` as described in §2.2.
- `WaitConditions.run()` is a no-op in this step (returns `[]`).
- No other file changes, no scenario integration.
- Unit tests in `utils/wait_conditions/test_wait_conditions.py`:
  - Adding and iterating conditions.
  - `run()` with trivially-true / trivially-false predicates and a 100 ms deadline.
  - Thread safety: `add()` concurrent with `run()`.

**Acceptance**: `pytest utils/wait_conditions` passes in CI; default scenario still runs identically.

### Step 2 — Watermark condition on the tracer interface

- Add `_watermark.py` with `make_tracer_watermark(*, library_name, tracer_sampling_rate, weblog, interfaces) -> Condition`.
- Implementation uses `interfaces.library.wait_for(predicate, timeout)` — event-driven, no polling.
- Issue the watermark request via `weblog.get("/", post_setup=True)` — so we add `post_setup=True` support to `_weblog.py` here, exactly the change from PR #1292 (minimal, no other surface change).
- For sampling scenarios, the predicate retries the watermark request every `_ITER_SLEEP_TIME` (see §2.6 below) by scheduling a helper that calls `wait_for` in a loop, giving up at the deadline.
- The scenario **does not yet use this**; it only becomes testable via a new `EndToEndScenario.wait_conditions_enabled = False` flag that is off everywhere.
- Add `tests/internal/test_wait_conditions_watermark.py` as an integration test using the lightest existing scenario (`DEFAULT` on one library, e.g. python) — enable the flag locally in the test only.

**Acceptance**: with the flag on, setup completes faster than baseline on python DEFAULT; with the flag off (everywhere else), zero behavior change.

### Step 3 — Watermark condition on the agent interface

- Add `make_agent_watermark(...)` in `_watermark.py`. Predicate reads from `interfaces.agent.get_spans(request=rid)`.
- Runs after the tracer watermark (composition: the agent predicate only starts checking once the tracer `rid` is known).
- No scenario change yet; same flag as step 2.

**Acceptance**: agent-side wait converges on python DEFAULT with the flag on; no change otherwise.

### Step 4 — Telemetry heartbeat barrier

- Add `_telemetry.py` with `make_telemetry_barrier(*, interfaces, watermark_index) -> Condition`.
- "Watermark index" is the `n` returned in §2.4. We expose it from the tracer-watermark condition via a shared mutable container (`dict` or small dataclass) that step 2 stored the index into.
- Short-circuit: if no `app-heartbeat` or `app-started` has ever been seen, `Condition.predicate` returns True immediately.

**Acceptance**: on python DEFAULT and node DEFAULT, the telemetry barrier is reached; on ruby DEFAULT, the short-circuit fires and setup does not stall.

### Step 5 — Opt-in at scenario level on DEFAULT only

- Add `EndToEndScenario.__init__` parameter `use_wait_conditions: bool = False`.
- When True, `_wait_and_stop_containers` builds a `WaitConditions` with the three conditions from steps 2–4, runs them with `deadline=time.time() + self.library_interface_timeout + self.agent_interface_timeout`, and **then still calls** the existing `_wait_interface(...)` waits with a timeout clamped to `max(0, deadline - time.time())`. If the wait conditions all succeeded, the legacy waits are cheap no-ops.
- Turn the flag on for `scenarios.default` only.
- `backend_interface_timeout` remains untouched, so APM_TRACING_E2E is not affected.

**Acceptance**: full default-scenario CI passes across every library and weblog. Compare `post_setup` times to the Step 0 baseline; expected drop on libraries with short actual flush times (nodejs, go) and roughly equal on ones that were already tight (java, python-at-25s).

### Step 6 — Remote-config condition

- Add `_remote_config.py` with `make_rc_barrier(*, proxy_state, interfaces) -> Condition | None`. Returns `None` if there is no `mock_remote_config_backend` — caller filters out `None`.
- Opt in on one RC scenario: `REMOTE_CONFIG_MOCKED_BACKEND_ASM_FEATURES`.
- If the wait reliably converges, opt in on the other RC scenarios in a follow-up (one per PR is overkill; they all share the same shape).
- Only now consider whether to override `DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS`. Decide based on measured convergence time, not by blanket-copying PR #1292.

**Acceptance**: RC scenarios finish setup faster than their current `library_interface_timeout=100`.
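
For illustration, a minimal sketch of what the Step 6 factory could look like on top of the §2.2 `Condition` dataclass. The plan's real factory would take `proxy_state` and `interfaces`; here the dependencies are reduced to a plain count and a counting callable so the sketch stays self-contained, and those parameter names are illustrative only:

```python
# Sketch only — `Condition` is the dataclass proposed in §2.2; the parameters
# stand in for whatever proxy_state / the library interface actually expose.
from collections.abc import Callable

from utils.wait_conditions import Condition


def make_rc_barrier(
    *,
    mocked_response_count: int,
    count_config_requests: Callable[[], int],
    timeout: float = 30.0,
) -> Condition | None:
    """Barrier: every mocked RC payload delivered, plus two 'no new update' polls (§2.4)."""
    if mocked_response_count == 0:
        return None  # no mock RC backend configured — the caller filters out None
    expected = mocked_response_count + 2
    return Condition(
        predicate=lambda: count_config_requests() >= expected,
        description=f"at least {expected} remote config requests observed",
        timeout=timeout,
    )
```

A real implementation would derive the count from the mocked RC responses and count the config requests seen on the library interface, as §2.4 describes.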

### Step 7 — OTel watermark

- Add `_otel.py` with `make_otel_watermark(...)`.
- Opt in on `OpenTelemetryScenario` via `use_wait_conditions=True`.
- `backend_interface_timeout` still applies for the OTel collector → backend path that OTel scenarios use.

**Acceptance**: OTel integrations scenario still green; setup faster.

### Step 8 — Custom condition support, port DSM and profiling

- Expose `scenario.wait_conditions.add(Condition(...))` to tests via a new context helper `utils.wait_conditions.add_for_current_scenario(...)` — the helper retrieves the current scenario (same pattern the existing `context` uses) rather than relying on a global list.
- Port `tests/integrations/test_dsm.py::DsmHelper.wait_for_hashes` to register conditions through this helper.
  - Keep the hash-on-`self` refactor — it's good on its own merit — but split it out into its own PR if the diff is noisy.
- Port `tests/test_profiling.py::_common_setup` to register a profiling condition the same way.

**Acceptance**: DSM and PROFILING scenarios still green; observable speedup only on passing runs.

### Step 9 — Opt in across all EndToEnd scenarios

- Flip `use_wait_conditions=True` to be the default in `EndToEndScenario.__init__`. Scenarios that needed the old behavior (if any surfaced in Step 5) remain pinned to `False` with an inline `# TODO: opt in` comment and a JIRA link.
- Keep `library_interface_timeout`, `agent_interface_timeout`, and `backend_interface_timeout` and their role as fallback.

**Acceptance**: full CI green across all scenarios. Dashboard-level post-setup time compared against Step 0.

### Step 10 — Retire the fallbacks (optional, only if Steps 5–9 are stable for at least two weeks)

- Remove `library_interface_timeout` and `agent_interface_timeout` parameters from `EndToEndScenario`. They become `post_setup_timeout` (scenario-level deadline for wait-conditions).
- `backend_interface_timeout` stays — it has no watermark.
- `interfaces.library.wait(timeout)` and `interfaces.agent.wait(timeout)` are deleted. `interfaces.backend.wait(timeout)` stays.
- Delete the per-library timeout table in `endtoend.py:312-327`.
- Update `docs/edit/flushing.md`, `docs/internals/end-to-end-life-cycle.md`.

**Acceptance**: diff is purely mechanical; CI green; dashboards unchanged.

### 2.6 Polling interval

The PR used `_ITER_SLEEP_TIME = 0.5`. For everything that goes through `Interface.wait_for`, there is no polling — the event is set by `_append_data`. The only place a loop is needed is the sampling-with-retries watermark (step 2). There, `1.0s` matches the minimum request interval the ingest path tolerates (comment in the original code). Use `1.0s` and note the justification.

---

## 4. Test strategy

- **Unit**: `utils/wait_conditions/test_*.py` — pure-Python, no docker, mock the interfaces.
- **Integration** per step: one minimal scenario, one library, flag on, behavior verified in isolation. Run in `run-default-scenario` only at first.
- **Full**: once a step flips a default, the `run-default-scenario` label is removed on that PR so every scenario runs.
- **Measurement**: grep pytest output for the `[post_setup]` log lines we emit from `WaitConditions.run()` and compare medians per (scenario, library, weblog) to the Step 0 baseline.

---

## 5. Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| A library's trace pipeline is **not** FIFO end-to-end (e.g. async buffer with reorder) and watermark arrives before an earlier trace. | Step 5 is gated by CI on every library; any flake during the stabilization window pins that library back to legacy waits via `use_wait_conditions=False` until investigated. |
| Telemetry "2 heartbeats" assumption breaks in a library we forgot. | Short-circuit keyed on heartbeat/app-started presence (same as PR #1292). Add a warning log when the short-circuit fires so we notice silent failures. |
| `post_setup=True` request path diverges from normal `weblog.get` in a subtle way and tests start seeing it as a real request. | Keep the "don't append to `responses`" gate. Add an assertion in the weblog that a `post_setup=True` request is only allowed from `utils/wait_conditions/**`. |
| Thread-per-condition explodes under high scenario count. | Bounded: at most 5 conditions per scenario; threads use `Interface.wait_for`, which itself doesn't spin. Worst case is <10 threads per scenario, short-lived. |
| Pollution across tests within a scenario via the per-scenario `WaitConditions`. | `WaitConditions` instance is created in `post_setup` and dropped after `run()`. Conditions registered by a test apply only to that scenario's post-setup, which is exactly what we want. |
| Someone lands a new scenario without opting in after Step 9. | Step 9 flips the default, so new scenarios get wait conditions automatically. Legacy fallbacks still there as safety net until Step 10. |
| Step 10 removes `library_interface_timeout` and some forgotten external tool parses it. | Step 10 is gated on a two-week green window and a repo-wide grep. |

---

## 6. What PR #1292 got right and we should keep

- Watermark request idea — good and cheap.
- `post_setup=True` flag on `weblog.request()` — minimal and correct.
- Telemetry barrier as a separate step from trace barrier — correct.
- Short-circuit for libraries that don't emit telemetry — correct.
- `get_all_rids()` / `get_all_seen_rids()` — useful for the PHP case; scope them to the PHP path only in our version.
- Custom wait conditions as an extension point — correct, but re-scoped to a per-scenario collection.

## 7. What PR #1292 got wrong and we should not repeat

- One big module with global state.
- Busy polling instead of using `Interface.wait_for`.
- Merging unrelated container/cgroup changes.
- Deleting `backend_interface` wait semantics without a replacement.
- Global `DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS` overrides without measurement.
- Entangling the DSM test refactor with the wait refactor.
- No documentation of _why_ each condition is sufficient.
- No incremental opt-in flag.

---

## 8. Open questions to resolve before Step 1

1. Is `Interface.wait_for` robust enough that we can lean on it as the primary mechanism? (Look at the existing call sites in `tests/debugger`, `tests/stats`, `tests/remote_config` to confirm.)
2. Is there a library where `GET /` is not routable (some weblog variants may not define `/`)? If so, the watermark endpoint needs to be `/watermark` and every weblog needs a trivial handler. If that is too invasive, we fall back to issuing a request whose path already exists in every weblog — candidates: `/healthcheck` if present, otherwise re-use a known path from the setup phase by replaying it (harder, don't do this in v1).
3. Do we need a kill-switch env var (`SYSTEM_TESTS_WAIT_CONDITIONS=0`) to force-disable the new path on a CI run without a code change? Yes, add it in Step 5.
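
To make question 3 and the Step 5 integration concrete, here is a small sketch of how the kill switch could gate the opt-in and how the legacy waits get clamped to the leftover budget. It is written as free functions taking the scenario as an argument only to keep the sketch short; in the plan this logic lives on `EndToEndScenario`, and everything except the names quoted from Step 5 and question 3 is illustrative:

```python
# Sketch, assuming the Step 5 attributes (use_wait_conditions, wait_conditions,
# library_interface_timeout, agent_interface_timeout) and the question 3 env var.
import logging
import os
import time

from utils import interfaces  # the proxy interfaces the plan refers to (import path may differ)

logger = logging.getLogger(__name__)


def wait_conditions_enabled(scenario) -> bool:
    """The env kill switch wins over the per-scenario opt-in."""
    if os.environ.get("SYSTEM_TESTS_WAIT_CONDITIONS") == "0":
        return False
    return getattr(scenario, "use_wait_conditions", False)


def post_setup_wait(scenario) -> None:
    """Run wait conditions first, then the legacy waits clamped to the remaining budget."""
    deadline = time.time() + scenario.library_interface_timeout + scenario.agent_interface_timeout
    if wait_conditions_enabled(scenario):
        for condition in scenario.wait_conditions.run(deadline=deadline):
            logger.warning("wait condition not met: %s", condition.description)
    # Legacy fallback: cheap no-ops when every condition already succeeded.
    scenario._wait_interface(interfaces.library, max(0.0, deadline - time.time()))
    scenario._wait_interface(interfaces.agent, max(0.0, deadline - time.time()))
```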

From 837b7265b5d41a15edde50d5056a853a731e8b1f Mon Sep 17 00:00:00 2001
From: Santiago Mola
Date: Wed, 22 Apr 2026 16:09:14 +0200
Subject: [PATCH 2/3] plan

---
 PLAN.md | 91 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 50 insertions(+), 41 deletions(-)

diff --git a/PLAN.md b/PLAN.md
index c5ab510fb03..99fac38ecd8 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -1,4 +1,4 @@
-# Wait Conditions — Incremental Reintroduction Plan
+# Wait Conditions
 
 ## 1. Context
 
@@ -155,24 +155,22 @@ This plan aims to deliver the same end state as PR #1292, but:
 
 ### 2.1 Module layout
 
+We start with one file:
+
 ```
 utils/
-  wait_conditions/
-    __init__.py        # public API: WaitConditions, Condition
-    _watermark.py      # one watermark request + per-interface observation
-    _telemetry.py      # "N heartbeats after watermark" barrier
-    _remote_config.py  # RC-count barrier, driven by proxy_state
-    _otel.py           # /basic/trace watermark
-    README.md          # why each condition is sufficient
+  wait_conditions.py   # public API + built-in conditions
 ```
 
-A package, not a single module. Each condition is isolated and
-independently testable. No module-level mutable state.
+That keeps the first PRs small and easy to review. If the file becomes
+too crowded once the built-in conditions land, we can split it later
+without changing the rollout strategy. The important constraint is **no
+module-level mutable state**.
 
 ### 2.2 Public API
 
 ```python
-# utils/wait_conditions/__init__.py
+# utils/wait_conditions.py
 
 @dataclass(frozen=True)
 class Condition:
     """One thing we are waiting for.
@@ -245,8 +243,9 @@ primary mechanism. We do **not** delete them in the first steps.
 
 ### 2.4 Why each condition is sufficient
 
-This section is the one that was missing from PR #1292. It goes into
-`utils/wait_conditions/README.md`.
+This section is the one that was missing from PR #1292. In our version,
+the rationale lives next to the implementation as docstrings and
+comments in `utils/wait_conditions.py`, not in a separate README.
 
 - **Watermark trace**: We issue `GET /` after setup requests are done and
   wait until the tracer has flushed a trace containing the resulting `rid`.
@@ -320,33 +319,40 @@ Each step is small enough to review and run CI in isolation. Each step
 **leaves the tree green and the behavior unchanged for anything not
 explicitly opted in**.
 
-### Step 0 — Baseline measurements
+### Step 0 — Map what actually depends on the legacy wait
+
+Before introducing any new wait-condition code, make one exploratory
+change: force `library_interface_timeout=0`, push it to CI, and collect
+the failures.
 
-Before any change, record current `post_setup` durations per scenario
-from a main-branch CI run (dashboard or pytest log scrape). This is the
-number we are trying to beat. Commit nothing; file an internal note.
+The goal is not to keep that change. The goal is to learn which
+scenarios, libraries, and tests genuinely depend on the current fixed
+library wait, so we can prioritize the first built-in conditions around
+real failures instead of guessing.
 
-**Deliverable**: a list of "scenario → median post_setup wait time on
-main" numbers we can use as the before/after baseline.
+**Deliverable**: a short failure map such as "scenario/library → what
+arrived too late" (library traces, agent traces, telemetry, remote
+config, OTel, or something unexpected).
-### Step 1 — Skeleton `wait_conditions` package, no behavior change
+### Step 1 — Skeleton `wait_conditions` module, no behavior change
 
-- Create `utils/wait_conditions/__init__.py` with `WaitConditions` and
+- Create `utils/wait_conditions.py` with `WaitConditions` and
   `Condition` as described in §2.2.
 - `WaitConditions.run()` is a no-op in this step (returns `[]`).
 - No other file changes, no scenario integration.
-- Unit tests in `utils/wait_conditions/test_wait_conditions.py`:
+- Unit tests in a small dedicated test file (for example
+  `tests/internal/test_wait_conditions.py`):
   - Adding and iterating conditions.
   - `run()` with trivially-true / trivially-false predicates and a 100 ms
     deadline.
   - Thread safety: `add()` concurrent with `run()`.
 
-**Acceptance**: `pytest utils/wait_conditions` passes in CI; default
-scenario still runs identically.
+**Acceptance**: the dedicated wait-conditions unit tests pass in CI;
+default scenario still runs identically.
 
 ### Step 2 — Watermark condition on the tracer interface
 
-- Add `_watermark.py` with `make_tracer_watermark(*, library_name,
+- Add `make_tracer_watermark(*, library_name,
   tracer_sampling_rate, weblog, interfaces) -> Condition`.
 - Implementation uses `interfaces.library.wait_for(predicate,
   timeout)` — event-driven, no polling.
@@ -364,13 +370,13 @@ scenario still runs identically.
   test using the lightest existing scenario (`DEFAULT` on one library,
   e.g. python) — enable the flag locally in the test only.
 
-**Acceptance**: with the flag on, setup completes faster than baseline
-on python DEFAULT; with the flag off (everywhere else), zero behavior
-change.
+**Acceptance**: with the flag on, setup converges on python DEFAULT;
+with the flag off (everywhere else), zero behavior change.
 
 ### Step 3 — Watermark condition on the agent interface
 
-- Add `make_agent_watermark(...)` in `_watermark.py`. Predicate reads
+- Add `make_agent_watermark(...)` in `utils/wait_conditions.py`.
+  Predicate reads
   from `interfaces.agent.get_spans(request=rid)`.
 - Runs after the tracer watermark (composition: the agent predicate only
   starts checking once tracer `rid` is known).
@@ -381,7 +387,7 @@ flag on; no change otherwise.
 
 ### Step 4 — Telemetry heartbeat barrier
 
-- Add `_telemetry.py` with `make_telemetry_barrier(*, interfaces,
+- Add `make_telemetry_barrier(*, interfaces,
   watermark_index) -> Condition`.
 - "Watermark index" is the `n` returned in §2.4. We expose it from the
   tracer-watermark condition via a shared mutable container (`dict` or
   small dataclass) that step 2 stored the index into.
@@ -408,13 +414,13 @@ stall.
   not affected.
 
 **Acceptance**: full default-scenario CI passes across every library and
-weblog. Compare `post_setup` times to Step 0 baseline; expected drop on
-libraries with short actual flush times (nodejs, go) and roughly equal
-on ones that were already tight (java, python-at-25s).
+weblog. Correctness is the gate. If the logs make it easy to compare
+post-setup duration before vs after on DEFAULT, note it, but do not make
+that a requirement for this step.
 
 ### Step 6 — Remote-config condition
 
-- Add `_remote_config.py` with `make_rc_barrier(*, proxy_state,
+- Add `make_rc_barrier(*, proxy_state,
   interfaces) -> Condition | None`. Returns `None` if there is no
   `mock_remote_config_backend` — caller filters out `None`.
 - Opt in on one RC scenario:
   `REMOTE_CONFIG_MOCKED_BACKEND_ASM_FEATURES`.
 - If the wait reliably converges, opt in on the other RC scenarios in a
   follow-up (one per PR is overkill; they all share the same shape).
 - Only now consider whether to override `DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS`.
   Decide based on measured convergence time, not by blanket-copying PR
   #1292.
@@ -430,7 +436,7 @@ on ones that were already tight (java, python-at-25s).
 
 ### Step 7 — OTel watermark
 
-- Add `_otel.py` with `make_otel_watermark(...)`.
+- Add `make_otel_watermark(...)` in `utils/wait_conditions.py`.
 - Opt in on `OpenTelemetryScenario` via `use_wait_conditions=True`.
 - `backend_interface_timeout` still applies for the OTel collector →
   backend path that OTel scenarios use.
@@ -463,7 +469,8 @@ speedup only on passing runs.
   `backend_interface_timeout` and their role as fallback.
 
 **Acceptance**: full CI green across all scenarios. Dashboard-level
-post-setup time compared against Step 0.
+correctness is the main gate; any timing improvement is a bonus, not a
+hard requirement.
 
 ### Step 10 — Retire the fallbacks (optional, only if Steps 5–9 are stable for at least two weeks)
@@ -491,16 +498,18 @@ original code). Use `1.0s` and note the justification.
 
 ## 4. Test strategy
 
-- **Unit**: `utils/wait_conditions/test_*.py` — pure-Python, no docker,
-  mock the interfaces.
+- **Unit**: one small pure-Python test file for `utils/wait_conditions.py`
+  — no docker, mock the interfaces.
 - **Integration** per step: one minimal scenario, one library, flag
   on, behavior verified in isolation. Run in `run-default-scenario` only
   at first.
 - **Full**: once a step flips a default, the `run-default-scenario`
   label is removed on that PR so every scenario runs.
+- **Exploration first**: Step 0 gives us the initial failure map that
+  tells us which waits matter most.
 - **Measurement**: grep pytest output for the `[post_setup]` log lines
-  we emit from `WaitConditions.run()` and compare medians per
-  (scenario, library, weblog) to Step 0 baseline.
+  we emit from `WaitConditions.run()` when useful, but correctness and
+  failure reduction come before timing analysis in the first rollout.
 
 ---
 
 ## 5. Risks and mitigations
@@ -531,7 +540,7 @@ original code). Use `1.0s` and note the justification.
 
 ## 7. What PR #1292 got wrong and we should not repeat
 
-- One big module with global state.
+- Module-level global state.
 - Busy polling instead of using `Interface.wait_for`.
 - Merging unrelated container/cgroup changes.
 - Deleting `backend_interface` wait semantics without a replacement.

From 939d9c674cf287e0f7a56660a2067fab73b6388f Mon Sep 17 00:00:00 2001
From: Santiago Mola
Date: Wed, 22 Apr 2026 16:13:57 +0200
Subject: [PATCH 3/3] chore: force legacy library wait to zero

---
 utils/_context/_scenarios/endtoend.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/utils/_context/_scenarios/endtoend.py b/utils/_context/_scenarios/endtoend.py
index 3b276ebf344..4c9dc2c5433 100644
--- a/utils/_context/_scenarios/endtoend.py
+++ b/utils/_context/_scenarios/endtoend.py
@@ -435,9 +435,10 @@ def _wait_and_stop_containers(self, *, force_interface_timout_to_zero: bool):
             interfaces.open_telemetry.check_deserialization_errors()
 
         else:
-            self._wait_interface(
-                interfaces.library, 0 if force_interface_timout_to_zero else self.library_interface_timeout
-            )
+            # Step 0 of PLAN.md is intentionally exploratory: force the legacy
+            # post-setup library sleep to zero so CI shows which
+            # scenario/library pairs still depend on it.
+            self._wait_interface(interfaces.library, 0)
 
             self.weblog_infra.stop()
             interfaces.library.check_deserialization_errors()