fix: use median instead of mean for logprob error for stability in nightlies #1722

terrykong · 2026-01-06T08:28:36Z

What does this PR do ?

Also contains a change to just clean up ckpt dirs from the nightlies

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

New Features
- Added median calculation function for metric evaluation, enabling more robust statistical analysis.
- Integrated median into metric check evaluation context for test validation.
Tests
- Comprehensive unit tests for median function covering multiple scenarios and edge cases.
Chores
- Updated metric validation checks across test suites to use median instead of mean.
- Added checkpoint cleanup steps in test workflows.
- Updated project configuration in test suites.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…ghtlies Signed-off-by: Terry Kong <[email protected]>

coderabbitai · 2026-01-06T08:33:33Z

📝 Walkthrough

Walkthrough

Introduces a median() function to compute median values over specified 1-indexed ranges of dictionary values, exposing it in the metrics evaluation context. Updates 70+ test suite scripts to use median instead of mean for error metric thresholds and adds checkpoint directory cleanup (rm -rf "$CKPT_DIR") after successful metrics validation across all modified test suites.

Changes

Cohort / File(s)	Summary
Core Implementation `tests/check_metrics.py`, `tests/unit/test_check_metrics.py`	Added `median(value, range_start=1, range_end=0)` function to compute median over 1-indexed range with resume offset adjustment. Exposed `median` in `evaluate_check()` local context. Added comprehensive test suite covering odd/even counts, outliers, range boundaries, and integration with `evaluate_check`.
DPO Test Suites `tests/test_suites/llm/dapo-qwen2.5-7b.sh`, `tests/test_suites/llm/dpo-llama3.1-8b-instruct-.sh` (5 files), `tests/test_suites/llm/dpo-mistral-nemo-instruct-.sh`	Replaced `mean(data["train/token_mult_prob_error"])` with `median(...)` in metric checks; added `rm -rf "$CKPT_DIR"` cleanup step post-metrics validation.
GRPO Test Suites (Main) `tests/test_suites/llm/grpo-*.sh` (27 files)	Replaced `mean()` with `median()` for `train/token_mult_prob_error` metric evaluation; consistently added checkpoint cleanup step after successful metrics checks. Some variants also updated wandb project from `nemo-rl-distillation` to `nemo-rl`.
GRPO Test Suites (Disabled) `tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled`, `tests/test_suites/llm/grpo-math-llama-nemotron-super-49b-v.5-4n8g-fsdp2tp8.sh.disabled`	Replaced `mean()` with `median()` in metric checks; no cleanup steps in disabled test suites.
Distillation Test Suites `tests/test_suites/llm/distillation-qwen3-32b-to-*.sh` (6 files)	Updated wandb project name from `nemo-rl-distillation` to `nemo-rl`; added `rm -rf "$CKPT_DIR"` cleanup after successful metrics checks.
Performance Test Suites `tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh`, `tests/test_suites/llm/performance/grpo-.sh` (25 files), `tests/test_suites/llm/performance/grpo-qwen3-.sh` (8 files)	Replaced `mean(...)` with `median(...)` for `train/token_mult_prob_error` metric checks; added checkpoint cleanup step post-metrics validation across all performance test variants.
SFT Test Suites `tests/test_suites/llm/sft-*.sh` (12 files)	Added `rm -rf "$CKPT_DIR"` cleanup step after successful metrics validation; no metric function changes (SFT tests use different metrics than GRPO/DPO).
VLM Test Suites `tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-*.sh` (2 files)	Added checkpoint cleanup step inside finalization block after metrics checks pass.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA-NeMo/RL#1322: Adds performance test scripts that invoke tests/check_metrics.py with metric checks; directly uses the new median functionality introduced in this PR.
NVIDIA-NeMo/RL#1497: Updates performance test configurations and scripts that call tests/check_metrics.py; targets the same metrics-checking pipeline affected by median function addition.

Suggested labels

Run CICD

Suggested reviewers

yfw
yuki-97

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Test Results For Major Changes	⚠️ Warning	PR changes metrics computation across 60+ test scripts from mean to median without including test results or numerical analysis demonstrating no regression in the description.	Update PR description with test results, numerical analysis comparing median vs. mean, stability evidence, validation of thresholds, and confirmation of no regression across modified test suites.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: replacing mean with median for logprob error metrics to improve stability in nightly runs.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🤖 Fix all issues with AI Agents

In @tests/check_metrics.py:
- Around line 100-130: The median function can call statistics.median on an
empty list (vals) which raises StatisticsError; update the median(value,
range_start=1, range_end=0) function to check if vals is empty before calling
statistics.median and return a sensible default (e.g., None or float('nan')) or
raise a controlled error; locate the vals list and the final return line in the
median function and add the empty-list guard right before the return so the
function handles ranges that yield no values gracefully.

In @tests/test_suites/llm/dapo-qwen2.5-7b.sh:
- Around line 42-43: Validate CKPT_DIR before running the cleanup: ensure the
CKPT_DIR variable is set and non-empty, confirm it points to an existing
directory (e.g., test [ -n "$CKPT_DIR" ] and [ -d "$CKPT_DIR" ]), and guard
against dangerous paths like "/" or "." (reject if "$CKPT_DIR" == "/" or empty)
before executing rm -rf "$CKPT_DIR"; update the block that currently runs rm -rf
"$CKPT_DIR" to perform these checks and only remove when they pass.

In
@tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:
- Around line 41-43: Before running rm -rf "$CKPT_DIR", validate CKPT_DIR:
ensure the variable is set and non-empty, verify it points to an existing
directory (not "/" or other critical root paths), and optionally confirm it
matches an expected safe prefix (e.g., a known tmp/checkpoints base) or prompt
for confirmation; if any check fails, log an error and abort instead of
performing rm -rf.
- Line 22: The wandb project name was changed from "nemo-rl-distillation" to
"nemo-rl" in the configuration key logger.wandb.project; verify this is
intentional. If it was accidental, revert logger.wandb.project back to
"nemo-rl-distillation"; if intentional, update the PR description and any
downstream references (CI configs, dashboards, experiment tracking docs) to
reflect the new project name so metric tracking and organization are not
disrupted.

In
@tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-seqpack.v1.sh:
- Around line 41-43: The cleanup line unconditionally runs rm -rf "$CKPT_DIR"
which is dangerous if CKPT_DIR is empty or unset; add a safety guard before
deletion that verifies CKPT_DIR is set and non-empty (and ideally not "/" or
"/"-like paths) and only then run the removal; reference CKPT_DIR and the rm -rf
call so you replace that direct removal with a conditional check that aborts or
logs an error when CKPT_DIR is missing or unsafe.

In @tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.sh:
- Around line 43-45: Add a safety check before the unconditional rm -rf by
validating the CKPT_DIR variable and its value: ensure CKPT_DIR is set and
non-empty, reject dangerous values such as "/" or "." and verify the path exists
(e.g., [[ -n "$CKPT_DIR" ]] && [[ "$CKPT_DIR" != "/" ]] && [[ "$CKPT_DIR" != "."
]] && [ -e "$CKPT_DIR" ]), then only perform rm -rf "$CKPT_DIR"; if validation
fails, log an error and skip deletion.

In @tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh:
- Around line 44-45: The cleanup uses rm -rf "$CKPT_DIR" with no safety checks;
add validation to ensure CKPT_DIR is set, non-empty, points to an existing
directory, and is not a root or obvious dangerous path (e.g., "/" or empty
string) before running rm -rf. Implement checks around the CKPT_DIR variable
(validate non-empty, [ -d "$CKPT_DIR" ], resolve with realpath and ensure it’s
not "/" and optionally ensure it matches an expected prefix or pattern), log or
exit with an error if validation fails, and only perform rm -rf "$CKPT_DIR"
after those guards pass.

In @tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.sh:
- Around line 43-45: The cleanup line unconditionally runs rm -rf "$CKPT_DIR",
which is dangerous if CKPT_DIR is unset or empty; update the script to validate
CKPT_DIR before deletion by checking it is non-empty, points to an existing
directory, and is not a root or critical path (e.g., not "/" or empty), and only
then perform the removal; adjust the block around the rm invocation that
references $CKPT_DIR to bail out with an error if the checks fail.

In @tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh:
- Around line 42-44: The cleanup line unconditionally runs rm -rf "$CKPT_DIR",
which is unsafe if CKPT_DIR is empty or malicious; update the script to validate
CKPT_DIR before deletion: ensure the variable is set and non-empty, optionally
resolve and verify it is under an expected base directory (e.g., using realpath)
and not "/" or "/"-like paths, and only then perform rm -rf "$CKPT_DIR";
reference the CKPT_DIR variable and the cleanup block in the test_suites/llm
script when adding these guards.

In @tests/test_suites/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.sh:
- Around line 41-43: The cleanup line using rm -rf "$CKPT_DIR" is unsafe if
CKPT_DIR is empty or points to root; before removal, verify CKPT_DIR is
non-empty, not equal to "/" (or other critical paths), and refers to an existing
directory, then perform the removal; update the script around the rm -rf
invocation to perform these guards (check CKPT_DIR variable, check it’s not "/"
and that the path exists) and only then run the delete command.

In
@tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh:
- Around line 41-43: The cleanup currently runs rm -rf "$CKPT_DIR" without
validating CKPT_DIR; change it to first verify CKPT_DIR is set/non-empty and
points to a safe directory (e.g., check [ -n "$CKPT_DIR" ] and that it is not
"/" and optionally that it exists and is a directory) before performing the
removal; update the block around the rm -rf invocation (the CKPT_DIR variable
usage) to only execute removal when those validations pass and log or fail
otherwise.

In @tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh:
- Around line 69-70: The cleanup command unconditionally removes "$CKPT_DIR"
even if the second metrics check failed or if MAX_STEPS was not reached; update
the script so cleanup only runs when all relevant checks pass: either move the
conversion, eval, second metrics check and rm -rf "$CKPT_DIR" inside the
existing if [[ $(jq ...) -ge $MAX_STEPS ]] block, or add an explicit guard that
verifies the first metrics check succeeded and the second check returned success
before executing rm -rf "$CKPT_DIR"; use the existing jq MAX_STEPS check and the
variable/results from the second metrics check to gate the CKPT_DIR removal.

In @tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh:
- Around line 63-64: The cleanup uses rm -rf "$CKPT_DIR" without validating
CKPT_DIR; add a safety check before that destructive command to ensure CKPT_DIR
is set and non-empty and not equal to "/" (and optionally confirm it exists and
is under the expected parent directory) and only then run the rm -rf
"$CKPT_DIR"; locate the rm -rf invocation and wrap it with the validation logic
using the CKPT_DIR variable to prevent accidental deletion.

In @tests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh:
- Around line 41-42: The cleanup line blindly runs rm -rf "$CKPT_DIR"; add a
safety guard that ensures CKPT_DIR is non-empty and points to an existing
directory (e.g., test that CKPT_DIR is not empty and [ -d "$CKPT_DIR" ]), and
also reject dangerous values like "/" or empty string before calling rm -rf;
update the script around the CKPT_DIR cleanup to perform these checks and only
run rm -rf "$CKPT_DIR" when the conditions pass.

In @tests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh:
- Around line 39-41: Add a safety guard before the unconditional rm -rf by
validating CKPT_DIR: ensure the variable is set and non-empty, verify it points
to an existing directory (not "/" or empty string), and optionally check it
matches an expected path pattern or prefix before performing rm -rf "$CKPT_DIR";
update the cleanup block around the rm -rf command (reference CKPT_DIR) to
perform these checks and only delete when all validations pass.

In @tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh:
- Around line 40-41: Add a safety validation before the cleanup that ensures the
CKPT_DIR variable is set and non-empty and points to an existing directory
before calling rm -rf; locate the cleanup block where CKPT_DIR is used (the rm
-rf "$CKPT_DIR" line) and add a guard that checks CKPT_DIR is not empty and that
[ -d "$CKPT_DIR" ] (and optionally not equal to "/" or "."), only then perform
the removal, otherwise log/exit to avoid accidental deletion.

In
@tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled:
- Around line 37-40: Add the missing checkpoint cleanup after the metrics
validation block by removing the checkpoint directory variable (CKPT_DIR) when
the validation passes; locate the section containing the validation expression
'median(data["train/token_mult_prob_error"]) < 1.1' /
"data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1" and append the same
cleanup command used in the other test suites (rm -rf "$CKPT_DIR") so the script
follows the PR pattern and cleans up checkpoints on success.

In
@tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh:
- Around line 42-43: The cleanup currently runs rm -rf "$CKPT_DIR"
unconditionally which is dangerous if CKPT_DIR is unset/empty or points to a
root path; before removing, validate CKPT_DIR: ensure it is set and non-empty,
ensure it is an existing directory (not a file), guard against dangerous values
like "/" or an unexpected top-level path (optionally check it matches an
expected prefix), and only then run the removal; update the cleanup around the
rm -rf invocation (the CKPT_DIR variable and the rm -rf "$CKPT_DIR" line) to
perform these checks and fail-safe if validation does not pass.

In @tests/test_suites/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.sh:
- Around line 40-41: The deletion line using rm -rf "$CKPT_DIR" must be guarded:
ensure the CKPT_DIR variable is non-empty and points to an existing directory
before removing; update the script around that deletion to check [[ -n
"$CKPT_DIR" ]] and [[ -d "$CKPT_DIR" ]], quote the variable when used, and
invoke rm with the -- separator (e.g., rm -rf -- "$CKPT_DIR") only after the
checks pass; otherwise, log or exit with a clear error instead of running rm.

In @tests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh:
- Around line 40-41: Before running the dangerous rm command, verify CKPT_DIR is
non-empty, points to an existing directory and is not root ("/") or "/"-like;
replace the bare rm -rf "$CKPT_DIR" with a guarded sequence such as checking [[
-n "$CKPT_DIR" ]] && [ -d "$CKPT_DIR" ] && [ "$CKPT_DIR" != "/" ] (and
optionally ensure it is not "." or "$PWD") then call rm -rf -- "$CKPT_DIR";
reference the CKPT_DIR variable and the rm -rf invocation when making the
change.

In @tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.sh:
- Around line 41-42: The cleanup step unconditionally runs rm -rf "$CKPT_DIR"
which is dangerous if CKPT_DIR is unset/empty or points to "/" — add a safety
guard before removal: ensure CKPT_DIR is non-empty and not "/" (and ideally not
"."), then invoke rm with a safe form (use -- to avoid treating leading dashes);
reference CKPT_DIR and the rm -rf call and only proceed to delete when those
checks pass.

In @tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.sh:
- Around line 42-43: Add a safety check before the destructive rm -rf operation:
verify CKPT_DIR is non-empty and not '/' (and optionally not '.'), and only then
run rm -rf -- "$CKPT_DIR"; otherwise print an error and exit non-zero.
Specifically, replace the raw rm -rf "$CKPT_DIR" with a guard like checking [[
-n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]] (and "$CKPT_DIR" != ".") before
executing rm -rf -- "$CKPT_DIR" to prevent accidental deletion when CKPT_DIR is
unset or invalid.

In @tests/test_suites/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.sh:
- Around line 36-41: The cleanup command rm -rf "$CKPT_DIR" runs
unconditionally; make it conditional on the metrics-check succeeding by either
adding set -e near the top (after sourcing common.env) so the script will exit
on any failure, or check the exit status of the uv run invocation for
tests/check_metrics.py (e.g., run the command and only execute rm -rf
"$CKPT_DIR" if its exit code is 0) so checkpoints are only removed after a
successful run.

In @tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron.sh:
- Around line 36-43: The cleanup runs unconditionally and may delete checkpoints
even when tests/check_metrics.py fails; guard the rm -rf "$CKPT_DIR" so it only
runs on success by checking the exit status of the metrics command (or enable
shell error exit with set -e at the top) and only perform the rm when
tests/check_metrics.py returns zero; reference the invocation of
tests/check_metrics.py and the rm -rf "$CKPT_DIR" in your change and ensure
failed runs preserve $CKPT_DIR for debugging.

In @tests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh:
- Around line 41-42: Before running the destructive rm -rf, validate CKPT_DIR:
ensure the variable is set and non-empty, confirm it points to an existing
directory and is not a dangerous path like "/" or empty, and bail out with an
error if the checks fail; then proceed to remove the directory. Specifically
update the rm -rf "$CKPT_DIR" site to perform these checks on CKPT_DIR and only
call rm -rf when all validations pass.

In @tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh:
- Around line 46-47: The script currently removes the checkpoint directory
unconditionally with rm -rf "$CKPT_DIR"; change this so cleanup only runs after
metrics validation succeeds by making the rm conditional on the exit status of
the metrics check (e.g., run check_metrics.py and only execute rm -rf
"$CKPT_DIR" if that command exits zero or chain with &&), or explicitly test the
return code ($?) and skip deletion on failure so artifacts remain for debugging;
reference the check_metrics.py invocation and the CKPT_DIR / rm -rf "$CKPT_DIR"
lines to locate where to apply the change.

In @tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh:
- Around line 45-47: The rm -rf "$CKPT_DIR" is unsafe if CKPT_DIR is empty or
misset; before deletion validate that CKPT_DIR is set and non-empty, that it
points to an existing directory and is not a dangerous path (e.g., "/" or "." or
empty string), and only then perform the removal; update the cleanup block to
perform these checks around the CKPT_DIR variable and skip/abort deletion with
an error if validation fails.

In @tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh:
- Around line 42-47: The cleanup currently runs unconditionally after the
metrics check; change it so rm -rf "$CKPT_DIR" only runs when the metrics
validation succeeds by capturing the exit status of the uv run
tests/check_metrics.py invocation (the command starting with "uv run
tests/check_metrics.py $JSON_METRICS ...") and conditionally executing the
removal on a zero exit code; if non-zero, print or log a failure message and
skip deleting "$CKPT_DIR" so artifacts are preserved for debugging.

In @tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh:
- Around line 46-47: Add a safety check before running rm -rf "$CKPT_DIR":
verify the CKPT_DIR variable is set and non-empty, ensure it is not "/" or the
root directory, and optionally confirm it matches an expected base path or
pattern (e.g., contains a known workspace prefix) before performing deletion; if
the check fails, log an error and skip removal. Use the CKPT_DIR variable and
the rm -rf invocation locations in the script to implement these guards.

In @tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh:
- Around line 46-47: The cleanup step unconditionally runs rm -rf "$CKPT_DIR"
which can be dangerous if CKPT_DIR is unset or malformed; before removing,
validate that CKPT_DIR is non-empty, not "/", and points to an existing
directory (and optionally matches an expected path prefix), then perform the
removal using a safe rm invocation; update the script around the CKPT_DIR
removal to check these conditions (referencing the CKPT_DIR variable and the rm
-rf invocation) and only call rm when the validations pass.

In @tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh:
- Around line 40-41: The script currently runs rm -rf "$CKPT_DIR" which is
dangerous if CKPT_DIR is empty, unset, or points to a critical path; before
deletion, validate CKPT_DIR: ensure the variable is set and non-empty, resolve
it to an absolute path (realpath or equivalent), verify it exists and is not "/"
or the filesystem root, and optionally assert it is under an expected safe base
directory or matches a known prefix; only proceed to run rm -rf "$CKPT_DIR"
after these checks pass (refer to CKPT_DIR and the rm -rf invocation to locate
the deletion line).

In @tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh:
- Around line 40-41: Validate the CKPT_DIR variable before running rm -rf:
ensure CKPT_DIR is set and non-empty and not a dangerous path (e.g., "/" or
empty) and optionally verify it points to an expected directory (exists and is a
directory) before executing the cleanup; update the cleanup block around the rm
-rf "$CKPT_DIR" invocation in the script to perform these checks and only remove
the directory when they pass.

In @tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh:
- Around line 41-42: The script currently always removes the checkpoint
directory (CKPT_DIR) after running check_metrics.py, which can delete artifacts
needed for debugging if the metrics check fails; update the flow so the cleanup
only runs when the metrics validation succeeds by making the rm -rf "$CKPT_DIR"
conditional on the exit status of check_metrics.py (for example, run the metrics
command and then only execute the CKPT_DIR removal when it succeeds using an
exit-code check or shell && chaining), ensuring that CKPT_DIR is preserved on
failure for investigation.

In @tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh:
- Around line 41-42: Add a safety guard before the unconditional rm -rf call to
ensure CKPT_DIR is non-empty and points to an existing directory; specifically,
check that the CKPT_DIR variable is set (non-empty) and that the path is a
directory, then perform the removal (use the -- delimiter when invoking rm to
handle paths beginning with a dash) to avoid accidental deletion when CKPT_DIR
is unset or empty.

In @tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh:
- Around line 40-42: The cleanup step uses rm -rf on CKPT_DIR without
validation; ensure CKPT_DIR is set, non-empty, not "/" and matches the expected
safe path/prefix before removal, and use the safe rm invocation that protects
against args starting with a dash; update the script around the rm -rf
"$CKPT_DIR" line to perform these checks and only proceed to remove when they
pass, otherwise log an error and skip deletion.

In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh:
- Around line 40-41: The cleanup currently removes "$CKPT_DIR" unconditionally;
change it to run only when the metrics validation succeeds by chaining or
checking the exit status of the metrics check (e.g., run check_metrics.py and
only execute rm -rf "$CKPT_DIR" if that command exits zero). Locate the
invocation of check_metrics.py and replace the unconditional rm -rf "$CKPT_DIR"
with a conditional/&& chain or an if-check that tests the exit code of
check_metrics.py before deleting the checkpoint directory.

In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh:
- Around line 40-41: The cleanup uses an unguarded rm -rf on $CKPT_DIR; add a
safety check before that command to ensure CKPT_DIR is non-empty, not "/", and
points to an existing directory (e.g., test -n "$CKPT_DIR" && [ "$CKPT_DIR" !=
"/" ] && [ -d "$CKPT_DIR" ]), and only then run the removal (use -- with rm).
Update the invocation around the rm -rf "$CKPT_DIR" line to perform these
validations and bail out or log an error if the checks fail.

In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh:
- Around line 39-41: The cleanup uses rm -rf "$CKPT_DIR" which can delete
unintended targets if CKPT_DIR is empty or unset; before removal, validate that
CKPT_DIR is defined and non-empty and optionally ensure it points to an expected
directory pattern (e.g., contains a specific prefix or is under a known base
path) and only then perform the removal; update the script surrounding the rm
-rf invocation (the CKPT_DIR variable usage) to check for non-empty value and
safe path conditions and abort with a clear error if the check fails.

In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh:
- Around line 40-41: The cleanup uses rm -rf "$CKPT_DIR" without safety checks;
ensure CKPT_DIR is non-empty and points to an existing directory before removal
by adding a guard that verifies the variable is set/non-empty and that the path
is a directory (e.g., check both non-empty and -d) and only then call rm -rf on
CKPT_DIR; refer to the CKPT_DIR variable and the rm -rf "$CKPT_DIR" statement
when making the change.

In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh:
- Around line 40-41: Before running the destructive rm -rf on CKPT_DIR, validate
that the CKPT_DIR variable is set, non-empty, not equal to "/" (root), and
points to an existing directory; only then perform the removal and use the safe
rm invocation (with --) to avoid treating paths beginning with - as options.
Update the cleanup block around rm -rf "$CKPT_DIR" to perform these checks and
skip or log an error if the validation fails.

In @tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh:
- Around line 40-41: The checkpoint cleanup (rm -rf "$CKPT_DIR") runs
unconditionally and can delete artifacts when check_metrics.py fails; change the
flow so the cleanup only runs on successful metrics validation by checking the
exit status of the metrics step (e.g., run or inspect check_metrics.py exit code
or use conditional chaining) and move or guard the rm -rf "$CKPT_DIR" behind
that success check (reference the check_metrics.py invocation and CKPT_DIR
cleanup line).

In @tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh:
- Around line 36-41: The script currently always removes the checkpoint
directory (rm -rf "$CKPT_DIR") even if the metrics validation command (uv run
tests/check_metrics.py $JSON_METRICS ...) fails; change the flow so that the
cleanup runs only on successful validation by checking the exit status of the
metrics command (or enable strict error handling like set -e) and only executing
rm -rf "$CKPT_DIR" when tests/check_metrics.py exits successfully, otherwise
preserve the directory and propagate a non-zero exit code for post-mortem
debugging.

In @tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh:
- Around line 36-41: The script currently always removes the checkpoint
directory after running the metrics check; change it so checkpoint cleanup runs
only if the metrics validation succeeds by testing the exit status of the check
command (the call to tests/check_metrics.py or the "uv run
tests/check_metrics.py $JSON_METRICS ..." invocation) and only executing the rm
-rf "$CKPT_DIR" when that command exits successfully (e.g., use an if on its
exit code or chain with &&) so failed validations preserve CKPT_DIR for
debugging.

In @tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh:
- Around line 42-43: The cleanup step uses rm -rf "$CKPT_DIR" without
validation; add a safety check before that line to ensure CKPT_DIR is set,
non-empty, not "/" or empty string, points to an existing directory, and
(optionally) resides under an expected base/path prefix; only proceed to remove
when these checks pass and otherwise log an error and exit. Reference the
CKPT_DIR variable and the rm -rf line in the script and implement the validation
logic immediately before calling rm to prevent accidental deletion of critical
paths.

In @tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh:
- Around line 42-44: Add a safety guard before the destructive rm -rf call:
validate CKPT_DIR is set and non-empty, is not "/" (or other critical paths),
and is a directory (e.g., check [ -n "$CKPT_DIR" ] && [ "$CKPT_DIR" != "/" ] &&
[ -d "$CKPT_DIR" ]) and only then run rm -rf -- "$CKPT_DIR"; this prevents
accidental deletion when CKPT_DIR is empty, unset, or misconfigured and uses --
to handle paths beginning with -.

In @tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp2.sh:
- Around line 44-45: The cleanup call rm -rf "$CKPT_DIR" is unsafe as-is; add
validation before deletion: check CKPT_DIR is non-empty, is an existing
directory, and not "/" (and ideally not "$HOME" or root paths), optionally
assert it matches an expected prefix (e.g., /tmp or a project-specific base); if
any check fails, log an error and exit without running rm. Use a safe removal
invocation rm -rf -- "$CKPT_DIR" after validations and include informative
logging (e.g., echo or printf) so failures are visible.

In @tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh:
- Around line 38-40: The cleanup call using rm -rf "$CKPT_DIR" is unsafe if
CKPT_DIR is empty or points to a critical path; add a defensive validation
before deletion: verify CKPT_DIR is set and non-empty, ensure it is not "/" (or
other forbidden roots), and optionally confirm it exists and matches the
expected checkpoint directory pattern; if the check fails, log an error and
abort instead of calling rm -rf. Use the CKPT_DIR variable and the cleanup block
around rm -rf "$CKPT_DIR" to implement these checks.

In @tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh:
- Around line 38-40: Before running the dangerous removal, ensure CKPT_DIR is
set, non-empty, and not the root directory; replace the bare rm -rf "$CKPT_DIR"
with a guarded check that verifies CKPT_DIR is defined and not "/" (and
optionally matches an expected base path or prefix), log or error out if the
check fails, and only then perform the recursive delete for CKPT_DIR.

In @tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh:
- Around line 43-45: Add a guard before the existing rm -rf "$CKPT_DIR" to
ensure CKPT_DIR is set, non-empty, and not a dangerous path (e.g., "/" or "")
before deleting; update the script around the rm invocation (the CKPT_DIR
variable and cleanup block) to first check that CKPT_DIR is defined and not
equal to "/" (and optionally that it exists and is a directory) and only then
run the removal, otherwise log/echo a warning and skip deletion.

In @tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh:
- Around line 44-45: The cleanup line using rm -rf "$CKPT_DIR" is unsafe; add
validation before removal: ensure the CKPT_DIR variable is set and non-empty,
verify it points to an existing directory (and not "/" or other critical paths),
optionally resolve it with realpath to prevent symlink tricks, and only then
perform rm -rf; if validation fails, print an error and exit non‑zero. Reference
the CKPT_DIR variable and the rm -rf command when implementing the checks.

In
@tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh:
- Around line 38-40: The cleanup step uses rm -rf "$CKPT_DIR" without
validation; add a safety guard that verifies CKPT_DIR is set and non-empty (and
optionally not "/" or other critical paths) before running the removal. Update
the script around the rm -rf invocation (the CKPT_DIR variable and the cleanup
block) to check e.g. that CKPT_DIR != "" and not equal to "/" (or perform a
pattern check) and only then execute rm -rf "$CKPT_DIR"; if the check fails,
emit a clear error/warning and skip deletion.

🟠 Major comments (34)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh-38-40 (1)
38-40: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command could be dangerous if $CKPT_DIR is unset, empty, or misconfigured. While the variable is sourced from common.env, defensive programming suggests adding a validation check before deletion.
🔎 Proposed safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh-43-45 (1)
43-45: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command is potentially dangerous if CKPT_DIR is unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.
🔎 Proposed fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.sh-41-43 (1)
41-43: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command is potentially dangerous if CKPT_DIR is unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.
🔎 Proposed fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh-38-40 (1)
38-40: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command is potentially dangerous if CKPT_DIR is unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.
🔎 Proposed fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-seqpack.v1.sh-41-43 (1)
41-43: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command is potentially dangerous if CKPT_DIR is unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.
🔎 Proposed fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh-38-40 (1)
38-40: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command is potentially dangerous if CKPT_DIR is unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.
🔎 Proposed fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/dapo-qwen2.5-7b.sh-42-43 (1)
42-43: Add validation before removing checkpoint directory.

While the cleanup is appropriately placed (only on success), the rm -rf command should validate that $CKPT_DIR is set and non-empty to prevent accidental deletion if the variable is malformed.
🔎 Proposed safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh-40-41 (1)
40-41: Add validation before removing checkpoint directory.

While the cleanup is appropriately placed (only on success), the rm -rf command should validate that $CKPT_DIR is set and non-empty to prevent accidental deletion if the variable is malformed.
🔎 Proposed safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh-46-47 (1)
46-47: Add validation before removing checkpoint directory.

While the cleanup is appropriately placed (only on success), the rm -rf command should validate that $CKPT_DIR is set and non-empty to prevent accidental deletion if the variable is malformed.
🔎 Proposed safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh-40-41 (1)
40-41: Add validation before removing checkpoint directory.

While the cleanup is appropriately placed (only on success), the rm -rf command should validate that $CKPT_DIR is set and non-empty to prevent accidental deletion if the variable is malformed.
🔎 Proposed safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh-41-43 (1)
41-43: Add validation before removing checkpoint directory.

While the cleanup is appropriately placed (only on success), the rm -rf command should validate that $CKPT_DIR is set and non-empty to prevent accidental deletion if the variable is malformed.
🔎 Proposed safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh-40-41 (1)
40-41: Conditionally execute cleanup only after successful metrics validation.

The checkpoint cleanup runs unconditionally after check_metrics.py. If the metrics check fails, the checkpoint is still deleted, removing debugging artifacts.
🔎 Recommended fix: chain cleanup to check success
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'median(data["train/token_mult_prob_error"]) < 1.1' \
-        'data["train/token_mult_prob_error"]["10"] < 1.1'
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if uv run tests/check_metrics.py $JSON_METRICS \
+        'median(data["train/token_mult_prob_error"]) < 1.1' \
+        'data["train/token_mult_prob_error"]["10"] < 1.1'; then
+        # Clean up checkpoint directory after successful run to save space.
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh-46-47 (1)
46-47: Conditionally execute cleanup only after successful metrics validation.

The checkpoint cleanup runs regardless of whether the metrics check succeeds. If check_metrics.py fails, the checkpoint is still deleted, removing artifacts needed for debugging.
🔎 Recommended fix: chain cleanup to check success
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'median(data["train/token_mult_prob_error"]) < 1.1' \
-        'data["train/token_mult_prob_error"]["10"] < 1.1'
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if uv run tests/check_metrics.py $JSON_METRICS \
+        'median(data["train/token_mult_prob_error"]) < 1.1' \
+        'data["train/token_mult_prob_error"]["10"] < 1.1'; then
+        # Clean up checkpoint directory after successful run to save space.
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh-41-42 (1)
41-42: Conditionally execute cleanup only after successful metrics validation.

The checkpoint cleanup runs unconditionally after check_metrics.py. If the metrics check fails but bash continues execution, the checkpoint is deleted, losing debugging artifacts.
🔎 Recommended fix: chain cleanup to check success
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'median(data["train/token_mult_prob_error"]) < 1.1' \
-        'data["train/token_mult_prob_error"]["10"] < 1.1'
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if uv run tests/check_metrics.py $JSON_METRICS \
+        'median(data["train/token_mult_prob_error"]) < 1.1' \
+        'data["train/token_mult_prob_error"]["10"] < 1.1'; then
+        # Clean up checkpoint directory after successful run to save space.
+        rm -rf "$CKPT_DIR"
+    fi
Committable suggestion skipped: line range outside the PR's diff.
tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh-40-41 (1)
40-41: Conditionally execute cleanup only after successful metrics validation.

The checkpoint cleanup currently runs unconditionally after check_metrics.py. If the metrics check fails but the script continues (bash default behavior without set -e), the checkpoint is still deleted, removing valuable debugging artifacts.
🔎 Recommended fix: chain cleanup to check success
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'median(data["train/token_mult_prob_error"]) < 1.1' \
-        'data["train/token_mult_prob_error"]["10"] < 1.1'
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if uv run tests/check_metrics.py $JSON_METRICS \
+        'median(data["train/token_mult_prob_error"]) < 1.1' \
+        'data["train/token_mult_prob_error"]["10"] < 1.1'; then
+        # Clean up checkpoint directory after successful run to save space.
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh-69-70 (1)
69-70: Conditionally execute cleanup and verify placement.

Two concerns with the cleanup placement:

Unconditional execution: The cleanup runs after the second metrics check (line 66-67) without verifying that check succeeded. If the check fails, the checkpoint is still deleted.

Outside MAX_STEPS conditional: The cleanup at line 69-70 is outside the if [[ $(jq ...) -ge $MAX_STEPS ]] block that guards the first metrics check. If MAX_STEPS is not reached, the first check is skipped, but conversion/eval/second check/cleanup all run. This may be intentional, but seems inconsistent with the pattern in other test suites where cleanup only occurs when MAX_STEPS is reached.
🔎 Recommended fix: make cleanup conditional on all checks passing
 uv run tests/check_metrics.py ${RUN_LOG}-24k-metric.json \
-  'data["score"] >= 0.2396'
-
-# Clean up checkpoint directory after successful run to save space.
-rm -rf "$CKPT_DIR"
+  'data["score"] >= 0.2396' && \
+    rm -rf "$CKPT_DIR"  # Clean up checkpoint directory after successful run to save space.
If the intent is to also guard against running when MAX_STEPS is not reached, consider moving the conversion/eval/cleanup inside the MAX_STEPS conditional or adding an explicit check.
tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh-63-64 (1)
63-64: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command is dangerous if $CKPT_DIR is unset or empty. Shell best practices require validating the variable before destructive operations.
🔎 Suggested fix with safety check
 
 # Clean up checkpoint directory after successful run to save space.
-rm -rf "$CKPT_DIR"
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+    rm -rf "$CKPT_DIR"
+fi
 
tests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)
41-42: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command is dangerous if $CKPT_DIR is unset or empty. Shell best practices require validating the variable before destructive operations.
🔎 Suggested fix with safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
 fi
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command is dangerous if $CKPT_DIR is unset or empty. Shell best practices require validating the variable before destructive operations.
🔎 Suggested fix with safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
 fi
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command is dangerous if $CKPT_DIR is unset or empty. Shell best practices require validating the variable before destructive operations.
🔎 Suggested fix with safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
 fi
tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.sh-42-43 (1)
42-43: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command is dangerous if $CKPT_DIR is unset or empty. Shell best practices require validating the variable before destructive operations.
🔎 Suggested fix with safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
 fi
tests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh-39-41 (1)
39-41: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.
🔎 Recommended safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh-45-47 (1)
45-47: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.
🔎 Recommended safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh-39-41 (1)
39-41: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.
🔎 Recommended safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh-42-44 (1)
42-44: Add safety check before rm -rf to prevent accidental deletion.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.
🔎 Recommended safety check
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)
41-42: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command could be dangerous if CKPT_DIR is unset or empty.
🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh-40-41 (1)
40-41: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command could be dangerous if CKPT_DIR is unset or empty.
🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh-40-41 (1)
40-41: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command could be dangerous if CKPT_DIR is unset or empty.
🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh-42-43 (1)
42-43: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command could be dangerous if CKPT_DIR is unset or empty. Consider adding a validation check to prevent accidental deletion.
🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh-46-47 (1)
46-47: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command could be dangerous if CKPT_DIR is unset or empty. Consider adding a validation check to prevent accidental deletion.
🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.sh-40-41 (1)
40-41: Add safety guard before rm -rf to prevent unintended deletion.

If $CKPT_DIR is unset or empty, rm -rf could behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.
🔎 Proposed fix
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh-41-42 (1)
41-42: Add safety guard before rm -rf to prevent unintended deletion.

If $CKPT_DIR is unset or empty, rm -rf could behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.
🔎 Proposed fix
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh-40-41 (1)
40-41: Add safety guard before rm -rf to prevent unintended deletion.

If $CKPT_DIR is unset or empty, rm -rf could behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.
🔎 Proposed fix
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh-41-42 (1)
41-42: Add safety guard before rm -rf to prevent unintended deletion.

If $CKPT_DIR is unset or empty, rm -rf could behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.
🔎 Proposed fix
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
Committable suggestion skipped: line range outside the PR's diff.

🟡 Minor comments (3)

tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh-22-22 (1)

22-22: Verify that the wandb project name change is intentional.

The wandb project name has changed from nemo-rl-distillation to nemo-rl. This change is not mentioned in the PR objectives and could affect metric tracking and organization in Weights & Biases. Please confirm this is an intentional change and not an accidental modification.
tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled-37-40 (1)
37-40: Missing cleanup step - inconsistent with PR pattern.

Unlike other test suites in this PR that add checkpoint cleanup (rm -rf "$CKPT_DIR") after successful metrics validation, this file is missing that addition. For consistency, consider adding the same cleanup step after line 38, even though the file is currently disabled.
🔎 Suggested addition for consistency
     uv run tests/check_metrics.py $JSON_METRICS \
         'median(data["train/token_mult_prob_error"]) < 1.1' \
         "data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1"
+
+    # Clean up checkpoint directory after successful run to save space.
+    rm -rf "$CKPT_DIR"
 fi
tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh-47-49 (1)

47-49: Remove unnecessary checkpoint cleanup or configure it to use the actual checkpoint directory.

The rm -rf "$CKPT_DIR" command removes an empty directory that is never populated. The config file specifies checkpoint_dir: results/grpo_dapomath17k_dsv3_megatron (line 30 of dapo-deepseek-v3-64n8g.yaml), but the test script does not pass this to the training entrypoint, so checkpoints are stored in the config path instead. The CKPT_DIR variable ($EXP_DIR/ckpts) is created by common.env but remains unused.

Either remove the cleanup step or configure the training call to use CKPT_DIR by adding a checkpoint override to the command line arguments.

tests/check_metrics.py

tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh

tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.sh

tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh

tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.sh

tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh

tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh

tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp2.sh

tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh

...st_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled

Signed-off-by: Terry Kong <[email protected]>

…ghtlies (#1722) Signed-off-by: Terry Kong <[email protected]> Signed-off-by: NeMo Bot <[email protected]>

fix: use median instead of mean for logprob error for stability in ni…

9ac5b53

…ghtlies Signed-off-by: Terry Kong <[email protected]>

terrykong requested review from yfw and yuki-97 January 6, 2026 08:28

terrykong requested a review from a team as a code owner January 6, 2026 08:28

terrykong added CI:L1 Run doctests, unit tests, and functional tests r0.5.0 labels Jan 6, 2026

terrykong temporarily deployed to nemo-ci January 6, 2026 08:28 — with GitHub Actions Inactive

coderabbitai bot reviewed Jan 6, 2026

View reviewed changes

terrykong mentioned this pull request Jan 6, 2026

Test fixes #1718

Draft

terrykong temporarily deployed to nemo-ci January 6, 2026 09:19 — with GitHub Actions Inactive

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 6, 2026

yuki-97 temporarily deployed to nemo-ci January 6, 2026 10:54 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci January 6, 2026 11:24 — with GitHub Actions Inactive

yuki-97 reviewed Jan 6, 2026

View reviewed changes

...st_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled Show resolved Hide resolved

yuki-97 had a problem deploying to nemo-ci January 6, 2026 17:44 — with GitHub Actions Error

fix

c80ecf9

Signed-off-by: Terry Kong <[email protected]>

terrykong enabled auto-merge (squash) January 6, 2026 20:22

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 6, 2026

terrykong temporarily deployed to nemo-ci January 6, 2026 20:22 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci January 6, 2026 20:41 — with GitHub Actions Inactive

yuki-97 approved these changes Jan 7, 2026

View reviewed changes

terrykong temporarily deployed to nemo-ci January 7, 2026 02:01 — with GitHub Actions Inactive

terrykong merged commit 57c834c into main Jan 7, 2026
55 of 58 checks passed

terrykong deleted the test-fix-5 branch January 7, 2026 03:32

chtruong814 pushed a commit that referenced this pull request Jan 7, 2026

fix: use median instead of mean for logprob error for stability in ni…

7c7c9f8

…ghtlies (#1722) Signed-off-by: Terry Kong <[email protected]> Signed-off-by: NeMo Bot <[email protected]>

coderabbitai bot mentioned this pull request Jan 7, 2026

cp: fix: use median instead of mean for logprob error for stability in nightlies (1722) into r0.5.0 #1731

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: use median instead of mean for logprob error for stability in nightlies #1722

fix: use median instead of mean for logprob error for stability in nightlies #1722

Uh oh!

terrykong commented Jan 6, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 6, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: use median instead of mean for logprob error for stability in nightlies #1722

fix: use median instead of mean for logprob error for stability in nightlies #1722

Uh oh!

Conversation

terrykong commented Jan 6, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

terrykong commented Jan 6, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 6, 2026 •

edited

Loading