Refactor Axolotl to pass a generation error callback#4927

Open
mgrange1998 wants to merge 8 commits into facebook:main from mgrange1998:export-D89751541

@mgrange1998
Contributor

Summary:
Adds an `on_generation_error` callback parameter to `generate_candidate_trials` and a corresponding `on_generation_error` function in Axolotl utils. This allows callers like Axolotl to format error messages (including paste upload with full traceback) without the Orchestrator needing to know about paste infrastructure.

The `generate_candidate_trials` return type changes from a 3-tuple to a 4-tuple, adding a `cannot_generate_reason` string that the callback populates when generation fails. The existing candidate trial accounting is also moved from `generate_candidate_trials` into `compute_n_to_generate`, so the `n` parameter now represents exactly the number of new trials to generate.
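
To make the callback contract concrete, here is a minimal sketch of how an error callback could feed the fourth tuple element. It is not the code in this PR: the callback signature (exception in, string out), the `generate_one` parameter, and both callback implementations are assumptions made for illustration, and the paste upload is not modeled at all.

```python
import traceback
from typing import Callable, List, Optional, Tuple

# Assumed callback shape: takes the exception, returns a reason string.
OnGenerationError = Callable[[Exception], str]


def default_on_generation_error(err: Exception) -> str:
    """Hypothetical default: a one-line reason with no infrastructure needs."""
    return f"{type(err).__name__}: {err}"


def traceback_on_generation_error(err: Exception) -> str:
    """Hypothetical caller-side callback that includes the full traceback."""
    tb = "".join(traceback.format_exception(type(err), err, err.__traceback__))
    return f"Generation failed: {err}\n{tb}"


def generate_candidate_trials(
    n: int,
    generate_one: Callable[[], str],  # stand-in for the real generation call
    on_generation_error: OnGenerationError = default_on_generation_error,
) -> Tuple[List[str], List[str], Optional[Exception], Optional[str]]:
    """Return (existing_candidates, new_trials, error, cannot_generate_reason)."""
    existing_candidates: List[str] = []  # placeholder; lookup not modeled here
    new_trials: List[str] = []
    try:
        for _ in range(n):
            new_trials.append(generate_one())
    except Exception as err:
        # The callback, not this function, decides how the error is formatted.
        return existing_candidates, new_trials, err, on_generation_error(err)
    return existing_candidates, new_trials, None, None
```

The point of the indirection is that the generating function never imports anything about how errors are presented; a caller that wants richer output passes its own callback.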

Differential Revision: D89751541

Matthew Grange and others added 8 commits February 19, 2026 13:00
…tch_utils

Summary: Renames the `max_parallelism` parameter to `max_concurrency` across GenerationStep, GenerationNode, and the generation strategy dispatch utilities. Adds backward-compatible deprecated `max_parallelism` parameters with deprecation warnings where the public API is affected (`choose_generation_strategy`). Internal variable names (`sobol_parallelism`, `bo_parallelism`) are renamed to `sobol_concurrency`, `bo_concurrency` for consistency.

Differential Revision: D92457714
Summary: Renames the `parallelism` parameter to `concurrency` in `Client.run_trials()` and adds backward-compatible deprecated `max_parallelism` parameters in `AxClient.create_experiment()` and `AxClient.get_max_parallelism()` → `get_max_concurrency()`. Both include deprecation warnings guiding callers to use the new parameter names, with validation that old and new parameters are not specified simultaneously.
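
The deprecation pattern described here (warn on the old name, reject both names together) might look roughly like the sketch below. The helper `resolve_concurrency` is hypothetical and is not part of Ax; only the parameter names `concurrency` and `parallelism` come from the summary.

```python
import warnings
from typing import Optional


def resolve_concurrency(
    concurrency: Optional[int] = None,
    parallelism: Optional[int] = None,  # deprecated spelling
) -> Optional[int]:
    """Hypothetical helper: map the deprecated argument onto the new one."""
    if parallelism is not None:
        if concurrency is not None:
            # Old and new names must not be specified simultaneously.
            raise ValueError(
                "Specify only `concurrency`; `parallelism` is deprecated."
            )
        warnings.warn(
            "`parallelism` is deprecated; use `concurrency` instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return parallelism
    return concurrency
```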

Differential Revision: D93771849
…Settings

Summary: Renames `num_parallel_jobs` to `num_concurrent_jobs` in `BenchmarkExecutionSettings` and all nightly benchmark configurations. Also updates the docstring in `BenchmarkMethod` to reference "pending trials" instead of "parallelism". This is a mechanical rename with no behavioral change.

Differential Revision: D93771883
…ants, and telemetry

Summary: Updates remaining references from "parallelism" to "concurrency" across orchestration, telemetry, early stopping, and other modules. This covers docstrings, comments, constant names (`MAX_PENDING_TRIALS` → `MAX_CONCURRENT_TRIALS`, `DUMMY_MAX_PENDING_TRIALS` → `DUMMY_MAX_CONCURRENT_TRIALS`), telemetry field names, and variable names in test files. No behavioral changes — purely a terminology alignment.

Differential Revision: D93771906
…tDesign.concurrency_limit`

Summary: As titled, adding a simple `ExperimentDesign` object. Putting it into properties for serialization for now, so as to not do duplicate work ahead of the storage refactor implementation (and also in case we change things while working on this stack).

Differential Revision: D89770462
Summary: Migrates all references from `experiment._properties[Keys.EXPERIMENT_TOTAL_CONCURRENT_ARMS]` to `experiment.design.concurrency_limit`, completing the transition to the `ExperimentDesign` dataclass introduced in the prior diff. This affects generation node input constructors (including `ALL_N` and `REPEAT_N`), the Axolotl updater, and associated tests. Also cleans up the `no-commit` code in `generation_node_input_constructors.py` to use the new `concurrency_limit` field with a fallback to a default of 10.
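
As a rough illustration of the fallback behavior mentioned above, the sketch below models a design object with an optional `concurrency_limit` and a default of 10. The dataclass shown is a simplified stand-in, not the real `ExperimentDesign`; the constant name and the helper `effective_concurrency_limit` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_CONCURRENCY_LIMIT = 10  # fallback value named in the summary


@dataclass
class ExperimentDesign:
    """Simplified stand-in; the real class may carry other fields."""

    concurrency_limit: Optional[int] = None


def effective_concurrency_limit(design: ExperimentDesign) -> int:
    """Hypothetical helper: use the configured limit, else the default."""
    if design.concurrency_limit is not None:
        return design.concurrency_limit
    return DEFAULT_CONCURRENCY_LIMIT
```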

Differential Revision: D89772029
Summary:
## Changes

Consolidates `generate_candidates` and `_prepare_trials` into a unified API:

- Renames `generate_candidates` → `generate_candidate_trials` and changes its return type to a 3-tuple `(existing_candidates, new_trials, error)`, incorporating the existing-candidate-trial logic that was previously in `_prepare_trials`.
- Extracts the capacity/limit calculation from `_prepare_trials` into a new `compute_n_to_generate` method, which the Orchestrator's main loop now calls before `generate_candidate_trials`.
- Renames `should_generate_candidates_for_pts` → `should_generate_candidate_trials_for_pts` and adds a "not enough data" check that validates metrics have at least 1 day of data before allowing generation.
- Adds two new test methods for the "not enough data" and "missing metrics + not enough data" scenarios.
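
One way to picture the "not enough data" check from the list above is the sketch below. It assumes that "at least 1 day of data" means each metric's earliest observation is at least one day old; that interpretation, the function name, and the `first_seen` mapping are all hypothetical and may differ from the actual implementation.

```python
from datetime import datetime, timedelta
from typing import Dict, List

MIN_DATA_AGE = timedelta(days=1)  # assumed reading of "at least 1 day of data"


def metrics_without_enough_data(
    first_seen: Dict[str, datetime], now: datetime
) -> List[str]:
    """Return metric names whose earliest data point is under one day old."""
    return sorted(
        name for name, ts in first_seen.items() if now - ts < MIN_DATA_AGE
    )


def should_generate(first_seen: Dict[str, datetime], now: datetime) -> bool:
    """Allow generation only when every metric has enough data."""
    return not metrics_without_enough_data(first_seen, now)
```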

## Devmate session

How doing this with Devmate went:

1. First we ask Devmate to analyse the difference between the methods; it does remarkably well: {F1984363089} {F1984363089} {F1984363089}

2. Next a tangent: I renamed `generate_candidates` with a more precise name (`generate_candidate_trials`), since that is the method we will keep between the two, and it might as well have a better name. Asked Devmate to apply the changes throughout fbcode.
 {F1984363157} {F1984363170}

3. Now for the hard part: get `generate_candidate_trials` to match the behavior of `_prepare_trials`, without me writing any of the code: {F1984363323} {F1984363333}
^ Pretty good for starters! I give corrections, see above; it applies them well: {F1984363346}
Then with one more small correction, we have a very solid plan: {F1984363398}, which Devmate implements: {F1984363406} {F1984363458}. I think it did really well!

Differential Revision: D89750211
Summary:
Adds an `on_generation_error` callback parameter to `generate_candidate_trials` and a corresponding `on_generation_error` function in Axolotl utils. This allows callers like Axolotl to format error messages (including paste upload with full traceback) without the Orchestrator needing to know about paste infrastructure.

The `generate_candidate_trials` return type changes from a 3-tuple to a 4-tuple, adding a `cannot_generate_reason` string that the callback populates when generation fails. The existing candidate trial accounting is also moved from `generate_candidate_trials` into `compute_n_to_generate`, so the `n` parameter now represents exactly the number of new trials to generate.
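
The accounting move can be sketched as follows: once existing candidates are subtracted up front, the resulting `n` is purely the count of new trials. The signature below is an assumption for illustration (the real `compute_n_to_generate` is a method whose parameters are not shown in this summary), and `capacity` and `n_existing_candidates` are hypothetical names.

```python
def compute_n_to_generate(capacity: int, n_existing_candidates: int) -> int:
    """Hypothetical sketch: how many new trials to request.

    Existing candidate trials already fill part of the capacity, so they
    are subtracted here rather than inside the generation call.
    """
    if capacity < 0 or n_existing_candidates < 0:
        raise ValueError("counts must be non-negative")
    return max(capacity - n_existing_candidates, 0)
```

With this split, the value passed as `n` needs no further adjustment by the callee.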

Differential Revision: D89751541
@meta-cla meta-cla bot added the "CLA Signed" label (Do not delete this pull request or issue due to inactivity.) Feb 20, 2026
@meta-codesync
meta-codesync bot commented Feb 20, 2026

@mgrange1998 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89751541.

Labels

CLA Signed Do not delete this pull request or issue due to inactivity. fb-exported meta-exported
