Add standalone script to analyze savings and lambda dependency #28

rbx merged 8 commits into FairRootGroup:master from
Conversation
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

Adds a new lambda-sweep analysis tool (analyze_lambda_occupancy.py) that runs train.py in evaluation mode across lambda values and aggregates per-episode occupancy/savings metrics. Also extends per-episode core usage instrumentation (baseline.py, environment.py, metrics_tracker.py), enhances train.py evaluation logging with occupancy in cores/nodes, and improves workload log analysis with quantile-based burst suggestions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Analyzer as analyze_lambda_occupancy.py
    participant Trainer as train.py
    participant Env as environment.py
    participant Baseline as baseline.py
    participant Metrics as metrics_tracker.py
    User->>Analyzer: start lambda sweep CLI
    Analyzer->>Analyzer: build lambda schedule (explicit or adaptive)
    loop for each lambda
        Analyzer->>Trainer: spawn train.py --eval --lambda=X
        Trainer->>Env: run episode loop
        Env->>Baseline: baseline_step(...)
        Baseline-->>Env: return (cost, baseline_next_empty_slot, next_job_id, baseline_num_used_nodes, baseline_num_used_cores)
        Env->>Metrics: append episode_used_cores / baseline variants
        Metrics-->>Env: updated per-episode lists
        Trainer->>Trainer: emit evaluation stdout with occupancy/savings/wait lines
        Trainer-->>Analyzer: stdout captured
        Analyzer->>Analyzer: parse episode metrics + wait summary, compute LambdaRunStats
    end
    Analyzer->>Analyzer: aggregate, fit trends, write CSV/JSON and plot
    Analyzer->>User: produce summary files and visualizations
```
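The "stdout captured" step above implies the analyzer parses occupancy values back out of train.py's episode lines. A minimal regex sketch of that step; the sample line below is an assumption modeled on the occupancy logging discussed later in this review, not captured output:

```python
import re

# Hypothetical evaluation line in the shape train.py's episode summary emits.
line = (
    "  Episode 3: MaxQueue=17, "
    "Agent Occupancy (Cores)=42.10%, Baseline Occupancy (Cores)=55.30% "
    "Agent Occupancy (Nodes)=40.00%, Baseline Occupancy (Nodes)=52.00%"
)

# Capture who (Agent/Baseline), the unit (Cores/Nodes), and the percentage.
pattern = re.compile(r"(Agent|Baseline) Occupancy \((Cores|Nodes)\)=([\d.]+)%")
metrics = {f"{who.lower()}_{unit.lower()}": float(val)
           for who, unit, val in pattern.findall(line)}
print(metrics["agent_nodes"])  # 40.0
```

The real script additionally parses cumulative savings and the wait summary before building a LambdaRunStats record.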
```mermaid
sequenceDiagram
    participant Trainer as train.py
    participant Env as environment.py
    participant Baseline as baseline.py
    participant Metrics as metrics_tracker.py
    Trainer->>Env: begin episode
    Env->>Metrics: reset_episode_metrics()
    Metrics-->>Env: initialized episode_used_cores and baseline lists
    loop per step
        Env->>Baseline: baseline_step(...)
        Baseline-->>Env: return baseline_num_used_nodes, baseline_num_used_cores (part of 6-tuple)
        Env->>Metrics: append episode_used_cores, episode_baseline_used_nodes, episode_baseline_used_cores
    end
    Trainer->>Metrics: query episode aggregates
    Metrics-->>Trainer: provide per-episode core/node usage
    Trainer->>Trainer: compute/log occupancy in cores and nodes
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/baseline.py (2)
54-55: ⚠️ Potential issue | 🟡 Minor — Docstring is outdated.

The docstring states the function returns a 4-tuple, but it now returns 6 values including `num_used_nodes` and `num_used_cores`.

📝 Proposed fix

```diff
 Returns:
-    Tuple of (baseline_cost, baseline_cost_off, updated baseline_next_empty_slot, updated next_job_id)
+    Tuple of (baseline_cost, baseline_cost_off, updated baseline_next_empty_slot, updated next_job_id, num_used_nodes, num_used_cores)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/baseline.py` around lines 54 - 55, The docstring in the baseline calculation function is outdated: it still documents a 4-tuple return but the function now returns six values (baseline_cost, baseline_cost_off, baseline_next_empty_slot, next_job_id, num_used_nodes, num_used_cores). Update the Returns section of the docstring in src/baseline.py where the function (e.g., the baseline calculation function) is defined to list all six returned values in the correct order and briefly describe each (including num_used_nodes and num_used_cores) so the docstring matches the actual return tuple.
33: ⚠️ Potential issue | 🟡 Minor — Return type annotation is inconsistent with actual return.

The function now returns a 6-tuple `(baseline_cost, baseline_cost_off, baseline_next_empty_slot, next_job_id, num_used_nodes, num_used_cores)` but the type hint still shows `tuple[float, float, int, int]`.

🔧 Proposed fix

```diff
-) -> tuple[float, float, int, int]:
+) -> tuple[float, float, int, int, int, int]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/baseline.py` at line 33, The return type annotation is out of sync: the function returns six values (baseline_cost, baseline_cost_off, baseline_next_empty_slot, next_job_id, num_used_nodes, num_used_cores) but the signature currently declares tuple[float, float, int, int]; update the function's return type annotation to reflect six elements (e.g., tuple[float, float, int, int, int, int]) in the function definition in src/baseline.py so callers and type checkers see the correct shape.
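Taken together, the two findings above suggest a signature along these lines. This is a sketch only: the real parameter list of `baseline_step` is not shown in this review, so the parameters and placeholder body here are hypothetical stand-ins illustrating the corrected 6-tuple contract:

```python
def baseline_step(jobs: list, baseline_next_empty_slot: int, next_job_id: int,
                  ) -> tuple[float, float, int, int, int, int]:
    """Advance the baseline scheduler by one step.

    Returns:
        Tuple of (baseline_cost, baseline_cost_off,
        updated baseline_next_empty_slot, updated next_job_id,
        num_used_nodes, num_used_cores).
    """
    # Placeholder body; the real implementation lives in src/baseline.py.
    baseline_cost, baseline_cost_off = 0.0, 0.0
    num_used_nodes, num_used_cores = 0, 0
    return (baseline_cost, baseline_cost_off,
            baseline_next_empty_slot, next_job_id,
            num_used_nodes, num_used_cores)
```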
🧹 Nitpick comments (7)
analyze_lambda_occupancy.py (3)
374-374: Dynamically attached method relies on implementation detail.
`evaluate.cache_keys` is attached at line 714, but `select_lambda_schedule` calls it at line 374. This coupling is fragile: if someone calls `select_lambda_schedule` with a different callable, it will fail with `AttributeError`. Consider either:
- Passing the cache explicitly as a parameter
- Documenting the expected interface in the function signature using a Protocol
♻️ Alternative: pass cache explicitly
```diff
 def select_lambda_schedule(
     args: argparse.Namespace,
     evaluate: Callable[[int], LambdaRunStats],
+    cache: dict[int, LambdaRunStats],
 ) -> list[int]:
```

Then replace `evaluate.cache_keys()` with `cache.keys()` throughout the function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@analyze_lambda_occupancy.py` at line 374, select_lambda_schedule currently depends on a dynamically attached method evaluate.cache_keys which is fragile; change the function to accept the cache explicitly (e.g., add a parameter named cache) and replace all uses of evaluate.cache_keys() with cache.keys(), or alternatively declare an explicit Protocol type describing a keys() method and annotate the cache parameter (or evaluate) in select_lambda_schedule's signature so callers must provide a compatible object; update all call sites to pass the cache object and adjust any imports/typing accordingly.
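The Protocol alternative mentioned above could look roughly like this (a sketch; `LambdaRunStats` and the surrounding signature are simplified away, and `already_evaluated` is a hypothetical helper standing in for the schedule logic):

```python
from typing import Protocol


class LambdaCache(Protocol):
    """Anything exposing the evaluated-lambda keys, e.g. a plain dict."""
    def keys(self): ...


def already_evaluated(cache: LambdaCache) -> set[int]:
    # A dict[int, LambdaRunStats] satisfies this Protocol structurally,
    # so no dynamic attribute attachment is needed.
    return set(cache.keys())


print(already_evaluated({10: "stats-a", 25: "stats-b"}))
```

Passing the cache explicitly this way makes the dependency visible in the signature instead of relying on an attribute bolted onto the callable.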
498: Add `strict=True` to `zip()` to catch length mismatches.

If `x`, `y`, and `point_colors` have different lengths, silent truncation could cause subtle bugs. Adding `strict=True` will raise an error if the iterables have different lengths.

♻️ Proposed fix

```diff
-    for i, (xi, yi, c) in enumerate(zip(x, y, point_colors)):
+    for i, (xi, yi, c) in enumerate(zip(x, y, point_colors, strict=True)):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@analyze_lambda_occupancy.py` at line 498, The loop over x, y, and point_colors silently truncates if lengths differ; update the enumerate(zip(x, y, point_colors)) call in analyze_lambda_occupancy.py (the loop that destructures into xi, yi, c) to use strict=True (i.e., zip(x, y, point_colors, strict=True)) so a ValueError is raised on length mismatch and the issue surfaces immediately.
268-283: Subprocess invocation is safe, but consider adding a timeout.

The command is built from controlled `argparse` inputs, so injection risk is low. However, `subprocess.run` without a timeout could hang indefinitely if `train.py` stalls.

⏱️ Proposed fix: add timeout

```diff
 completed = subprocess.run(
     command,
     cwd=str(project_root),
     capture_output=True,
     text=True,
     check=False,
+    timeout=3600,  # 1 hour timeout per lambda evaluation
 )
```

You'll need to handle `subprocess.TimeoutExpired` in the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@analyze_lambda_occupancy.py` around lines 268 - 283, The subprocess.run call that executes `command` (producing `completed`) needs a timeout to avoid hangs: add a sensible timeout argument to the `subprocess.run(...)` invocation and catch `subprocess.TimeoutExpired` around that call; on timeout, include context (the `lambda_value`, the command, and a snippet via `os_tail`) and raise or propagate a `RuntimeError` similar to the existing non-zero exit handling so the caller (which already expects errors) can handle it; keep existing behavior for printing (`args.echo_train_output`) and returncode handling intact.

src/metrics_tracker.py (1)
82-88: Inconsistent type annotations.

Type annotations were removed from some episode lists (lines 82-88) while other lists in the same method retain them (lines 90-95). This creates inconsistency. Consider either adding type annotations to all new fields or documenting why they were intentionally omitted.
♻️ Proposed fix for consistency
```diff
 # Time series data for plotting (episode)
-self.episode_on_nodes = []
-self.episode_used_nodes = []
-self.episode_used_cores = []
-self.episode_baseline_used_nodes = []
-self.episode_baseline_used_cores = []
-self.episode_job_queue_sizes = []
-self.episode_price_stats = []
+self.episode_on_nodes: list[int] = []
+self.episode_used_nodes: list[int] = []
+self.episode_used_cores: list[int] = []
+self.episode_baseline_used_nodes: list[int] = []
+self.episode_baseline_used_cores: list[int] = []
+self.episode_job_queue_sizes: list[int] = []
+self.episode_price_stats: list[float] = []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/metrics_tracker.py` around lines 82 - 88, The new episode list attributes (episode_on_nodes, episode_used_nodes, episode_used_cores, episode_baseline_used_nodes, episode_baseline_used_cores, episode_job_queue_sizes, episode_price_stats) lack type annotations and should be annotated consistently with the other episode fields in the class (e.g., the annotated lists declared on lines 90-95); update each of these attributes in the MetricsTracker class (or the enclosing init method) to include the same list type hints used elsewhere (matching types like List[int], List[float], or List[List[int]] as appropriate) so all episode fields use consistent type annotations.

train.py (1)
227-229: Consider extracting occupancy calculations for readability.

The inline expressions are correct but dense. Extracting them into local variables before the print statement would improve readability.
✨ Optional refactor for readability
```diff
+agent_occ_cores = env.metrics.episode_used_cores[-1] * 100 / (CORES_PER_NODE * MAX_NODES) if env.metrics.episode_used_cores else 0
+baseline_occ_cores = env.metrics.episode_baseline_used_cores[-1] * 100 / (CORES_PER_NODE * MAX_NODES) if env.metrics.episode_baseline_used_cores else 0
+agent_occ_nodes = env.metrics.episode_used_nodes[-1] * 100 / MAX_NODES if env.metrics.episode_used_nodes else 0
+baseline_occ_nodes = env.metrics.episode_baseline_used_nodes[-1] * 100 / MAX_NODES if env.metrics.episode_baseline_used_nodes else 0
 print(f"  Episode {episode + 1}: "
       ...
       f"MaxQueue={env.metrics.max_queue_size_reached}, "
-      f"Agent Occupancy (Cores)={env.metrics.episode_used_cores[-1]*100/(CORES_PER_NODE*MAX_NODES) if env.metrics.episode_used_cores else 0 :.2f}%, Baseline Occupancy (Cores)={env.metrics.episode_baseline_used_cores[-1]*100/(CORES_PER_NODE*MAX_NODES) if env.metrics.episode_baseline_used_cores else 0 :.2f}% "
-      f"Agent Occupancy (Nodes)={env.metrics.episode_used_nodes[-1]*100/MAX_NODES if env.metrics.episode_used_nodes else 0 :.2f}%, Baseline Occupancy (Nodes)={env.metrics.episode_baseline_used_nodes[-1]*100/MAX_NODES if env.metrics.episode_baseline_used_nodes else 0 :.2f}% "
-      )
+      f"Agent Occupancy (Cores)={agent_occ_cores:.2f}%, Baseline Occupancy (Cores)={baseline_occ_cores:.2f}% "
+      f"Agent Occupancy (Nodes)={agent_occ_nodes:.2f}%, Baseline Occupancy (Nodes)={baseline_occ_nodes:.2f}%")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@train.py` around lines 227 - 229, Extract the dense inline occupancy calculations into clearly named local variables before the print call: read max_queue = env.metrics.max_queue_size_reached, compute agent_cores_pct and baseline_cores_pct using env.metrics.episode_used_cores and env.metrics.episode_baseline_used_cores with the existing empty-list guards and CORES_PER_NODE and MAX_NODES, and compute agent_nodes_pct and baseline_nodes_pct from env.metrics.episode_used_nodes and env.metrics.episode_baseline_used_nodes with the existing guards and MAX_NODES; then use those variables in the f-string instead of the long inline expressions to improve readability in train.py where the print/log is emitted.

data/workload_statistics/analyze_workload_logs.py (2)
591-596: Validate `--burst-event-quantile` at parse time.

Right now invalid values are silently clamped downstream. Rejecting out-of-range input (`< 0` or `> 1`) in argparse avoids hidden misconfiguration.
🔧 Suggested parser hardening

```diff
+def parse_quantile(raw: str) -> float:
+    try:
+        q = float(raw)
+    except ValueError as exc:
+        raise argparse.ArgumentTypeError(f"Invalid quantile '{raw}'") from exc
+    if not (0.0 <= q <= 1.0):
+        raise argparse.ArgumentTypeError("Quantile must be in [0,1]")
+    return q
+
 ...
-    parser.add_argument(
-        "--burst-event-quantile",
-        type=float,
+    parser.add_argument(
+        "--burst-event-quantile",
+        type=parse_quantile,
         default=0.75,
         help="Quantile of positive baseline-adjusted hourly counts used to isolate burst hours (default: 0.75).",
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data/workload_statistics/analyze_workload_logs.py` around lines 591 - 596, The --burst-event-quantile parser entry currently allows out-of-range floats; validate it at parse time by adding an argparse-level check: implement and use a small validator (e.g., validate_quantile) as the type or call a post-parse check for the argument from parser.add_argument("--burst-event-quantile", ...), ensuring values are between 0.0 and 1.0 and raising argparse.ArgumentTypeError (or exiting with parser.error) for invalid input so bad configs are rejected rather than silently clamped downstream.
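The validator from the suggested fix, runnable in isolation (the flag name matches the proposal; the parser here is a throwaway for demonstration):

```python
import argparse


def parse_quantile(raw: str) -> float:
    """argparse `type=` callable that rejects values outside [0, 1]."""
    try:
        q = float(raw)
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"Invalid quantile '{raw}'") from exc
    if not (0.0 <= q <= 1.0):
        raise argparse.ArgumentTypeError("Quantile must be in [0,1]")
    return q


parser = argparse.ArgumentParser()
parser.add_argument("--burst-event-quantile", type=parse_quantile, default=0.75)

args = parser.parse_args(["--burst-event-quantile", "0.9"])
print(args.burst_event_quantile)  # 0.9
# Passing "1.5" instead would make argparse exit with an
# "invalid parse_quantile value" usage error rather than clamping silently.
```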
306-319: Narrow the exception scope in row parsing.

Catching `Exception` here can hide coding errors and skew `skipped_rows` silently. Limit this to expected parse/data errors.
🧭 Suggested exception narrowing

```diff
-    except Exception:
+    except (KeyError, ValueError, TypeError):
         skipped += 1
         return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data/workload_statistics/analyze_workload_logs.py` around lines 306 - 319, The broad except Exception in the consume function should be narrowed to only expected parse/data errors to avoid hiding bugs; replace the generic except with a specific catch for parsing and missing-key issues (e.g., except (KeyError, ValueError, TypeError): skipped += 1; return) while letting other exceptions propagate so real coding errors surface; reference the consume function and the calls/operations parse_submit_hour, parse_duration_hours, float(...) on row[submit_col], row[duration_col], row[nodes_col], row[cores_col], and the cores_is_total/node_count branch when making the change.
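A small illustration of why the narrower tuple matters (hypothetical row-parsing helper, not the real `consume` function):

```python
def parse_row(row: dict):
    """Return the duration in hours, or None for malformed rows."""
    try:
        return float(row["duration"])
    except (KeyError, ValueError, TypeError):
        # Expected data problems: missing column, non-numeric text, None value.
        return None


assert parse_row({"duration": "2.5"}) == 2.5
assert parse_row({"duration": "n/a"}) is None  # ValueError
assert parse_row({}) is None                   # KeyError
assert parse_row({"duration": None}) is None   # TypeError
# A genuine bug (e.g. an AttributeError from a typo) would now propagate
# instead of being silently counted as a skipped row.
```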
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@analyze_lambda_occupancy.py`:
- Line 717: The list comprehension using an ambiguous loop variable `l`
(stats_ordered = [cache[l] for l in sorted(selected_lambdas)]) should use a
clearer name; change `l` to a descriptive identifier such as `lambda_name` or
`lambda_id` and update the comprehension to stats_ordered = [cache[lambda_name]
for lambda_name in sorted(selected_lambdas)]; also scan the surrounding scope
for any other uses of `l` related to selected_lambdas and rename them
consistently to avoid confusion with the digit `1`.
In `@data/workload_statistics/analyze_workload_logs.py`:
- Around line 683-701: The generated suggested_flags block emits "nan" for unset
integer stats; update the code that constructs the f-string (the block using
fmt_suggested_int and stats[...] keys like suggested_wg_burst_small_jobs_min,
suggested_wg_burst_heavy_jobs_min, etc.) to omit any integer flag whose value is
not finite: check each numeric value with math.isfinite (or numpy.isfinite)
before appending its "--wg-..." fragment and only call fmt_suggested_int when
finite; for probabilities you can keep the current formatting but for all
"*_min", "*_max", "*_nodes*", "*_cores*" fields skip adding the flag when the
stat is non-finite (alternatively substitute a safe integer default), ensuring
the produced suggested_flags string is CLI-valid.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- analyze_lambda_occupancy.py
- data/workload_statistics/analyze_workload_logs.py
- data/workload_statistics/workload_logs.txt
- src/baseline.py
- src/environment.py
- src/metrics_tracker.py
- test/test_inspect_workloadgen.py
- train.py
```diff
     "suggested_flags: "
     f"--wg-burst-small-prob {stats['suggested_wg_burst_small_prob']:.4f} "
-    f"--wg-burst-heavy-prob {stats['suggested_wg_burst_heavy_prob']:.4f}"
+    f"--wg-burst-heavy-prob {stats['suggested_wg_burst_heavy_prob']:.4f} "
+    f"--wg-burst-small-jobs-min {fmt_suggested_int(stats['suggested_wg_burst_small_jobs_min'])} "
+    f"--wg-burst-small-jobs-max {fmt_suggested_int(stats['suggested_wg_burst_small_jobs_max'])} "
+    f"--wg-burst-small-duration-min {fmt_suggested_int(stats['suggested_wg_burst_small_duration_min'])} "
+    f"--wg-burst-small-duration-max {fmt_suggested_int(stats['suggested_wg_burst_small_duration_max'])} "
+    f"--wg-burst-small-nodes-min {fmt_suggested_int(stats['suggested_wg_burst_small_nodes_min'])} "
+    f"--wg-burst-small-nodes-max {fmt_suggested_int(stats['suggested_wg_burst_small_nodes_max'])} "
+    f"--wg-burst-small-cores-min {fmt_suggested_int(stats['suggested_wg_burst_small_cores_min'])} "
+    f"--wg-burst-small-cores-max {fmt_suggested_int(stats['suggested_wg_burst_small_cores_max'])} "
+    f"--wg-burst-heavy-jobs-min {fmt_suggested_int(stats['suggested_wg_burst_heavy_jobs_min'])} "
+    f"--wg-burst-heavy-jobs-max {fmt_suggested_int(stats['suggested_wg_burst_heavy_jobs_max'])} "
+    f"--wg-burst-heavy-duration-min {fmt_suggested_int(stats['suggested_wg_burst_heavy_duration_min'])} "
+    f"--wg-burst-heavy-duration-max {fmt_suggested_int(stats['suggested_wg_burst_heavy_duration_max'])} "
+    f"--wg-burst-heavy-nodes-min {fmt_suggested_int(stats['suggested_wg_burst_heavy_nodes_min'])} "
+    f"--wg-burst-heavy-nodes-max {fmt_suggested_int(stats['suggested_wg_burst_heavy_nodes_max'])} "
+    f"--wg-burst-heavy-cores-min {fmt_suggested_int(stats['suggested_wg_burst_heavy_cores_min'])} "
+    f"--wg-burst-heavy-cores-max {fmt_suggested_int(stats['suggested_wg_burst_heavy_cores_max'])}"
```
Do not emit nan in suggested CLI flags.
Lines 686-701 can print flags like --wg-burst-heavy-jobs-min nan. Copy-pasting this into training commands will break integer flag parsing. Omit unset integer flags (or substitute defaults) when values are non-finite.
💡 Suggested fix (omit non-finite integer flags)
```diff
+def _maybe_int_flag(name: str, value: float) -> str:
+    if not math.isfinite(value):
+        return ""
+    return f"{name} {fmt_suggested_int(value)}"
+
 ...
-    print(
-        "suggested_flags: "
-        f"--wg-burst-small-prob {stats['suggested_wg_burst_small_prob']:.4f} "
-        f"--wg-burst-heavy-prob {stats['suggested_wg_burst_heavy_prob']:.4f} "
-        f"--wg-burst-small-jobs-min {fmt_suggested_int(stats['suggested_wg_burst_small_jobs_min'])} "
-        ...
-        f"--wg-burst-heavy-cores-max {fmt_suggested_int(stats['suggested_wg_burst_heavy_cores_max'])}"
-    )
+    flag_parts = [
+        f"--wg-burst-small-prob {stats['suggested_wg_burst_small_prob']:.4f}",
+        f"--wg-burst-heavy-prob {stats['suggested_wg_burst_heavy_prob']:.4f}",
+        _maybe_int_flag("--wg-burst-small-jobs-min", stats["suggested_wg_burst_small_jobs_min"]),
+        _maybe_int_flag("--wg-burst-small-jobs-max", stats["suggested_wg_burst_small_jobs_max"]),
+        _maybe_int_flag("--wg-burst-small-duration-min", stats["suggested_wg_burst_small_duration_min"]),
+        _maybe_int_flag("--wg-burst-small-duration-max", stats["suggested_wg_burst_small_duration_max"]),
+        _maybe_int_flag("--wg-burst-small-nodes-min", stats["suggested_wg_burst_small_nodes_min"]),
+        _maybe_int_flag("--wg-burst-small-nodes-max", stats["suggested_wg_burst_small_nodes_max"]),
+        _maybe_int_flag("--wg-burst-small-cores-min", stats["suggested_wg_burst_small_cores_min"]),
+        _maybe_int_flag("--wg-burst-small-cores-max", stats["suggested_wg_burst_small_cores_max"]),
+        _maybe_int_flag("--wg-burst-heavy-jobs-min", stats["suggested_wg_burst_heavy_jobs_min"]),
+        _maybe_int_flag("--wg-burst-heavy-jobs-max", stats["suggested_wg_burst_heavy_jobs_max"]),
+        _maybe_int_flag("--wg-burst-heavy-duration-min", stats["suggested_wg_burst_heavy_duration_min"]),
+        _maybe_int_flag("--wg-burst-heavy-duration-max", stats["suggested_wg_burst_heavy_duration_max"]),
+        _maybe_int_flag("--wg-burst-heavy-nodes-min", stats["suggested_wg_burst_heavy_nodes_min"]),
+        _maybe_int_flag("--wg-burst-heavy-nodes-max", stats["suggested_wg_burst_heavy_nodes_max"]),
+        _maybe_int_flag("--wg-burst-heavy-cores-min", stats["suggested_wg_burst_heavy_cores_min"]),
+        _maybe_int_flag("--wg-burst-heavy-cores-max", stats["suggested_wg_burst_heavy_cores_max"]),
+    ]
+    print("suggested_flags: " + " ".join(p for p in flag_parts if p))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@data/workload_statistics/analyze_workload_logs.py` around lines 683 - 701,
The generated suggested_flags block emits "nan" for unset integer stats; update
the code that constructs the f-string (the block using fmt_suggested_int and
stats[...] keys like suggested_wg_burst_small_jobs_min,
suggested_wg_burst_heavy_jobs_min, etc.) to omit any integer flag whose value is
not finite: check each numeric value with math.isfinite (or numpy.isfinite)
before appending its "--wg-..." fragment and only call fmt_suggested_int when
finite; for probabilities you can keep the current formatting but for all
"*_min", "*_max", "*_nodes*", "*_cores*" fields skip adding the flag when the
stat is non-finite (alternatively substitute a safe integer default), ensuring
the produced suggested_flags string is CLI-valid.
…aluation mode as capacity/occupancy of cluster.
…endlines

- add `analyze_lambda_occupancy.py` to run `train.py` in evaluation mode across Poisson lambdas
- default run matches the provided training/evaluation flags (12-month evaluation, model/session/workload defaults)
- parse per-episode `Agent Occupancy (Nodes)` and cumulative savings from `train.py` output
- convert cumulative savings to per-episode savings and compute mean/std per lambda
- implement adaptive lambda selection to cover occupancy range with minimal points, with `--lambdas` override
- fit 3rd-order polynomial curves for:
  - lambda vs occupancy
  - occupancy vs savings
  - occupancy vs savings_off
- export artifacts: `summary.csv`, `summary.json`, `trendlines.png`, plus per-lambda logs
- add optional `--echo-train-output` and headless plotting support via matplotlib `Agg`

Add: extra plots, showing completion rate vs lambda and effective savings
Add: Fit now toggled through parser
Co-authored-by: Alexey Rybalchenko <alexryba@gmail.com>
Co-authored-by: Alexey Rybalchenko <alexryba@gmail.com>
Co-authored-by: Alexey Rybalchenko <alexryba@gmail.com>
Add: Fit now toggled through parser
No description provided.
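The 3rd-order polynomial trend fitting mentioned in the commit messages can be sketched with NumPy. The data below is synthetic; the real script fits lambda vs occupancy and occupancy vs savings from parsed `train.py` runs:

```python
import numpy as np

# Synthetic stand-ins for parsed per-lambda results.
occupancy = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0])  # percent
savings = np.array([62.0, 55.0, 44.0, 30.0, 18.0, 9.0])     # percent

# Fit a 3rd-order polynomial trend (occupancy vs savings), as the
# analysis script does for its trendlines.png curves.
coeffs = np.polyfit(occupancy, savings, deg=3)
trend = np.poly1d(coeffs)

# Evaluate the fitted curve on a dense grid, e.g. for plotting.
grid = np.linspace(occupancy.min(), occupancy.max(), 200)
fitted = trend(grid)
print(round(float(trend(40.0)), 1))
```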