Commit 828a26f

Merge branch 'microsoft:main' into feat/dict-code-support
2 parents cafedaa + f00a538 commit 828a26f

9 files changed: +112 -92 lines changed

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ docs = {file = ["requirements/docs.txt"]}
 lint = {file = ["requirements/lint.txt"]}
 package = {file = ["requirements/package.txt"]}
 test = {file = ["requirements/test.txt"]}
+torch = {file = ["requirements/torch.txt"]} # some agent algorithms need torch. pip install rdagent[torch]
 
 [tool.setuptools_scm]
 local_scheme = "no-local-version"

rdagent/app/data_science/conf.py

Lines changed: 5 additions & 1 deletion
@@ -145,7 +145,9 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):
     coder_longer_timeout_multiplier_upper: int = 3
     runner_longer_timeout_multiplier_upper: int = 2
     coder_timeout_increase_stage: float = 0.3
-    runner_timeout_increase_stage: float = 0.15
+    runner_timeout_increase_stage: float = 0.3
+    runner_timeout_increase_stage_patience: int = 2
+    """Number of failures tolerated before escalating to next timeout level (stage width). Every 'patience' failures, timeout increases by 'runner_timeout_increase_stage'"""
     show_hard_limit: bool = True
 
     #### enable runner code change summary
@@ -174,6 +176,8 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):
     #### Task Generate related
     fix_seed_and_data_split: bool = False
 
+    ensemble_time_upper_bound: bool = False
+
 
 DS_RD_SETTING = DataScienceBasePropSetting()
 
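Note on `runner_timeout_increase_stage_patience`: the docstring added above says the timeout rises by one stage for every `patience` failures. The code that consumes these settings is not part of this diff, so the following is only a minimal sketch of that rule. The helper name `runner_timeout`, the additive multiplier, and the cap taken from `runner_longer_timeout_multiplier_upper` are assumptions for illustration, not the repository's implementation.

```python
def runner_timeout(
    base_timeout: float,
    n_failures: int,
    stage: float = 0.3,             # runner_timeout_increase_stage
    patience: int = 2,              # runner_timeout_increase_stage_patience
    multiplier_upper: float = 2.0,  # runner_longer_timeout_multiplier_upper (assumed cap)
) -> float:
    """Hypothetical helper: escalate the timeout one stage per `patience` failures."""
    # Integer division gives the number of completed "patience windows".
    level = n_failures // patience
    # Assumption: each level adds `stage` to the multiplier, capped at the upper bound.
    multiplier = min(1.0 + level * stage, multiplier_upper)
    return base_timeout * multiplier


if __name__ == "__main__":
    for failures in range(0, 9):
        print(failures, runner_timeout(3600, failures))
```

Under these assumptions the first failure leaves the timeout unchanged, and the second moves it to 1.3 times the base value.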

rdagent/scenarios/data_science/dev/runner/prompts.yaml

Lines changed: 4 additions & 2 deletions
@@ -34,16 +34,18 @@ DSCoSTEER_eval:
 For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.
 You should also notice other resources utilization hyper-parameters.
 For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable.
-
+For example, prioritize adjustments to batch size and number of epochs. If further tuning is needed, consider parameters with significant impact on performance such as learning rate and the number of model folds. For CV competitions, also consider image size (imgsize), and for NLP competitions, consider maximum sequence length (maxlen), as these can have a substantial impact on results.
 ## Evaluation Guidelines
 1. The code execution time or resource utilization suggest that there is room for improvement in the hyperparameters.
 2. The code must apply early stopping strategy already (in order to prevent overfitting).
 3. Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
 4. Only include the suggestions in your response without leak any time limit information because the user might over-fit the model to the time limit.
 5. Never make your judgment only based on the time spent, you should also consider the code and the stdout.
+
 If the code satisfy the requirements:
 - Set "hyperparameter_tuning_decision" to true.
-- In "hyperparameter_tuning_suggestion", provide a clear, specific, and actionable suggestion. Begin with a concrete observation, then state a direct action to take. Do not use vague language, options, or uncertainty (avoid words like "A or B"). For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still decreasing and early stopping was not activated. Only small portion of the allowed time was used. [Suggestion] Increase epochs to 100 to avoid underfitting and further improve model performance."
+- In "hyperparameter_tuning_suggestion", provide a clear, specific, and actionable suggestion. Begin with a concrete observation, then state a direct action to take. Do not use vague language, options, or uncertainty (avoid words like "A or B"). For example: "[Observation] Training stopped due to early stopping while the validation loss was still decreasing. This suggests the patience parameter may be too small.
+[Suggestion] Increase the early stopping patience to allow more training epochs before stopping, which can further improve model performance."
 If the code does not satisfy the requirements:
 - Set "hyperparameter_tuning_decision" to false.
 - Set "hyperparameter_tuning_suggestion" to an empty string.

rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 13 additions & 19 deletions
@@ -539,28 +539,31 @@ hypothesis_select:
 - **Time Limit Guidance**
 {% if time_max < 0 %}
 - Initial Case: runtime info unavailable, keep most hypotheses if component is Ensemble.
-- Remove only those clearly excessive (e.g., > {{ full_time }} hours) or overly complex.
 {% elif time_max >= full_time * 0.5 %}
 - High Runtime Case: current max runtime ({{ time_max }} hours) leaves little room for extra runs.
 - Avoid high-fold or heavy ensembles.
 - Maximum recommended folds: {{ (full_time // time_max) | int }}
-- Remove hypotheses clearly excessive (> {{ full_time }} hours)
 {% else %}
 - Low Runtime Case: current max runtime ({{ time_max }} hours) is far from the time limit.
 - Prefer hypotheses with runtimes ≤ {{ full_time }} hours.
 - Hypotheses slightly above {{ time_max }} hours can be retained only with strong justification.
 {% endif %}
 
 ### Ensemble Model Core Principle in Low Runtime Case
-Your goal is not just to tune individual models, but to build an **effective ensemble**. Make design decisions that lead to **strong overall ensemble performance**, not just strong base models.
-Please note: you are operating under a time budget dedicated to ensemble training of {{res_time}} seconds, and the maximum allowed time is {{full_time}} seconds.
-
-Please take the remaining {{res_time}} seconds to carefully consider and design the most reasonable and optimal ensemble models based on your current progress.
+Your goal is not just to tune individual models, but to build an **effective ensemble**. Make design decisions that lead to **strong overall ensemble performance**, not just strong base models.
+These are examples:
+
+Example 1:
 Assume training a single model takes about 1 hour. For example, if you have roughly twice that time left, you can try training multiple models with different random seeds or data splits to reuse time effectively.
 If you have more time, you might consider training a multi-fold ensemble. Use your judgment to decide how many folds or seeds fit within your remaining time budget.
+
+Example 2:
+Assume training a single fold of a model takes at most {{ time_max }} hours. Within your remaining time budget, prioritize training multiple folds of the same model rather than trying many different models.
+For instance, if you have roughly 2 × {{ time_max }} hours left, you could train 2 folds of the same model with different data splits or random seeds.
+If more time is available, you might consider increasing the number of folds further. Use your judgment to decide how many folds fit within the remaining time budget while respecting the time_max constraint for a single fold.
 
 ### 2. Training-Time Resource Allocation
-- You may use **multiple folds** if justified, but you must **ensure the full pipeline completes within runtime limits**.
+- You may use **multiple folds** if justified, but you must **ensure the full pipeline completes within remaining time budget**.
 - Avoid reducing base model quality just to save time. For example:
 - Freezing large parts of the model (e.g., embeddings)
 - Using only embedding-level regression instead of full modeling
@@ -702,19 +705,10 @@ task_gen:
 10. File Handling & DataFrame Generation: Generate a pandas DataFrame with columns [“id”, “path”, “fold”].
 - id: a unique identifier for each sample.
 - path: the file path of the corresponding sample.
-- split: indicates the assignment of each sample for data splitting. Two modes are supported:
-- K-Fold (optional): assign integers 0, 1, …, K-1 for each fold.
-- Train/Test Split (optional): assign "train" or "test" for each sample according to the split ratio (e.g., 8:2).
-- Ensure reproducibility: the DataFrame must be generated exactly the same way every time the script runs, e.g., by fixing the random seed 42.
-Data Splitting: use this DataFrame to perform dataset splitting, selecting samples for training and testing based on the fold column.
-11. Random Seed for Model Training:
-- If training neural networks, ensure the initial weights and all random operations use a fixed seed of 42 (e.g., torch.manual_seed(42), numpy.random.seed(42), random.seed(42)).
-- If training machine learning models such as LightGBM, XGBoost, or scikit-learn estimators, absolutely ensure the random seed is fixed (e.g., `random_state=42`) to guarantee reproducibility.
-- This is mandatory: all aspects of the experiment must be fully reproducible and aligned, including dataset splits and random seeds;
-- For multi-fold training, use out-of-fold (OOF) predictions as validation scores and save them as an oof file.
-12. Hypothesis Handling: At the initial stage, multiple hypotheses may be proposed simultaneously. If some hypotheses overlap, select the most promising one for implementation and ignore redundant overlapping hypotheses. Each implemented hypothesis should remain an independent task.
-Ensure reproducibility: the DataFrame must be generated exactly the same way every time the script runs, regardless of system or runtime conditions (e.g., by fixing the random seed).
+
+11. Hypothesis Handling: At the initial stage, multiple hypotheses may be proposed simultaneously. If some hypotheses overlap, select the most promising one for implementation and ignore redundant overlapping hypotheses. Each implemented hypothesis should remain an independent task.
 {% endif %}
+
 ## Package Declaration
 At the end of your design, **you MUST** provide a key `packages` in the final JSON output.
 It should be an **array of PyPI package names** (strings) that you expect to `import` in the forthcoming implementation.
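
Note on the Time Limit Guidance branches in `hypothesis_select`: the template picks one of three cases from `time_max` and `full_time`, and in the high-runtime case it recommends at most `(full_time // time_max) | int` folds. The branching can be sketched in plain Python as below. The function names `time_limit_case` and `folds_within_budget` are hypothetical and exist only for this illustration; the second helper is an assumed reading of Example 2, where the remaining budget divided by `time_max` bounds the number of folds.

```python
from typing import Optional, Tuple


def time_limit_case(time_max: float, full_time: float) -> Tuple[str, Optional[int]]:
    """Mirror the three Jinja branches; return (case name, max recommended folds)."""
    if time_max < 0:
        # Initial Case: no runtime information is available yet.
        return ("initial", None)
    if time_max >= full_time * 0.5:
        # High Runtime Case: same arithmetic as {{ (full_time // time_max) | int }}.
        return ("high_runtime", int(full_time // time_max))
    # Low Runtime Case: the template gives no fold cap here.
    return ("low_runtime", None)


def folds_within_budget(remaining_hours: float, time_max: float) -> int:
    """Assumed reading of Example 2: whole folds that fit in the remaining budget."""
    if time_max <= 0:
        return 0
    return int(remaining_hours // time_max)


if __name__ == "__main__":
    print(time_limit_case(8.0, 12.0))
    print(folds_within_budget(5.0, 2.0))
```

With a 12-hour limit and an 8-hour longest run, the sketch lands in the high-runtime case with a cap of one fold.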

rdagent/scenarios/data_science/proposal/exp_gen/proposal.py

Lines changed: 35 additions & 52 deletions
@@ -606,7 +606,7 @@ def identify_problem(
             all_problems[problem_name] = fb_problems[problem_name]
         return all_problems
 
-    @wait_retry(retry_n=5)
+    @wait_retry(retry_n=10)
     def hypothesis_gen(
         self,
         component_desc: str,
@@ -920,77 +920,56 @@ def select_hypothesis(
         )
         return index_to_pick_pool_list[reproducible_int]
 
-    # BEGIN: for support llm-based hypothesis selection -----
-    def _cosine_similarity_matrix_numpy(self, A, B):
-        dot_products = np.matmul(A, B.T)
-        A_norms = np.linalg.norm(A, axis=1, keepdims=True)
-        B_norms = np.linalg.norm(B, axis=1, keepdims=True).T
-        return dot_products / (A_norms * B_norms)
-
-    def _gumbel_softmax_hard_sample(self, logits, tau=1.0, n_samples=1):
-
-        gumbel_noise = -np.log(-np.log(np.random.uniform(size=logits.shape) + 1e-20) + 1e-20)
-        y = (logits + gumbel_noise) / tau
-        # softmax
-        y_soft = np.exp(y - np.max(y, axis=1, keepdims=True))
-        y_soft = y_soft / np.sum(y_soft, axis=1, keepdims=True)
+    def _cosine_similarity_matrix_torch(self, A, B):
+        import torch
 
-        sampled_indices = []
-        for i in range(y_soft.shape[0]):
-            choices = np.arange(y_soft.shape[1])
-            idx = np.random.choice(choices, size=n_samples, replace=False, p=y_soft[i])
-            sampled_indices.append(idx)
-        sampled_indices = np.unique(np.concatenate(sampled_indices))
-        return sampled_indices.tolist()
+        dot_products = torch.matmul(A, B.T)
+        A_norms = torch.norm(A, dim=1, keepdim=True)
+        B_norms = torch.norm(B, dim=1, keepdim=True).T
+        return dot_products / (A_norms * B_norms)
 
-    def _prob_dis(
+    def _prob_dis_torch(
         self,
         current_sota_score_in_current_trace,
         extra_hypo_l: list[tuple[DSHypothesis, float]],
         hypothesis_candidates,
         competition,
         path_length,
     ):
-        # TODO: typing
+        import torch
+
         history_hypo_str, history_scores = [], []
         for hypo, score in extra_hypo_l:
             history_hypo_str.append(hypo.hypothesis)
             history_scores.append(score)
 
         target_texts = [v["hypothesis"] for v in hypothesis_candidates.values()]
-        target_embs = np.array(APIBackend().create_embedding(target_texts), dtype=np.float32)
+        target_embs = torch.tensor(APIBackend().create_embedding(target_texts), dtype=torch.float32)
 
         if not history_hypo_str:
             return []
-        history_embs = np.array(APIBackend().create_embedding(history_hypo_str), dtype=np.float32)
-        # TODO: Here is an example to help understand the code:(Please check the correctness of the comment
-        # history_embs: numpy.ndarray of shape (N, D) where N is the number of historical hypotheses
-        # and D is the embedding dimension returned by APIBackend().create_embedding.
-        # It contains vector representations of each hypothesis string in history_hypo_str,
-        # used for computing similarity with target embeddings.
-        # Example: if history_hypo_str = ["Try RandomForest with 200 estimators", "Use LightGBM with early stopping"]
-        # and embedding dimension D=3, history_embs might be:
-        # array([[ 0.123, -0.456, 0.789],
-        # [ 0.234, 0.567, -0.890]], dtype=float32)
-        sim_matrix = self._cosine_similarity_matrix_numpy(target_embs, history_embs)
-        candidate_scores = np.full((len(target_texts), 1), current_sota_score_in_current_trace, dtype=np.float32)
-        history_scores = np.array(history_scores, dtype=np.float32).reshape(1, -1)
+        history_embs = torch.tensor(APIBackend().create_embedding(history_hypo_str), dtype=torch.float32)
+        sim_matrix = self._cosine_similarity_matrix_torch(target_embs, history_embs)
+        candidate_scores = [current_sota_score_in_current_trace for i in range(len(target_texts))]
+        candidate_scores = torch.tensor(candidate_scores, dtype=torch.float32).unsqueeze(1)
+        history_scores = torch.tensor(history_scores, dtype=torch.float32).unsqueeze(0)
         bigger_is_better = get_metric_direction(competition)
         if bigger_is_better:
             score_diff_matrix = history_scores - candidate_scores
         else:
             score_diff_matrix = candidate_scores - history_scores
         alpha, beta = 1.0, 1.0
-        if current_sota_score_in_current_trace == -1: # FIXME: less magic number;
+        if current_sota_score_in_current_trace == -1:
             alpha, beta = 1.0, 0
         gamma = math.log(2) / 30
-        logits = alpha * sim_matrix * math.exp(-gamma * path_length) + beta * np.tanh(score_diff_matrix)
-        logits_max = np.max(logits, axis=1, keepdims=True)
-        exp_logits = np.exp(logits - logits_max)
-        probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
-        num_candidates = probs.shape[-1]
+        logits = alpha * sim_matrix * math.exp(-gamma * path_length) + beta * torch.tanh(score_diff_matrix)
+        logits = torch.clamp(logits, min=-2, max=2)
+        probs = torch.softmax(logits, dim=1)
+
+        num_candidates = probs.size(-1)
         n_samples = min(2, num_candidates)
-        flat_indices = self._gumbel_softmax_hard_sample(np.log(probs + 1e-20), tau=0.01, n_samples=n_samples)
+        sampled_indices = torch.multinomial(probs, num_samples=n_samples).squeeze(1)
+        flat_indices = sampled_indices.flatten().unique().tolist()
         if bigger_is_better:
             best_idx = history_scores[0].argmax().item()
             best_entry = (history_hypo_str[best_idx], history_scores[0, best_idx])
@@ -1097,14 +1076,18 @@ def hypothesis_select_with_llm(
             if getattr(tr[1], "decision", False)
         ]
         time_max = max(time_list_success) / 3600
-        # sota_flag = (hasattr(trace, "sota_exp_to_submit") and trace.sota_exp_to_submit is not None)----> V10 CODE VERSION
-        bvs = BestValidSelector() # ----> V14 CODE VERSION
-        sota_exp = bvs.get_sota_exp_to_submit(trace) # ----> V14 CODE VERSION
-        sota_flag = sota_exp is not None and sota_exp.result is not None # ----> V14 CODE VERSION
+        sota_flag = (
+            hasattr(trace, "sota_exp_to_submit") and trace.sota_exp_to_submit is not None
+        ) # ----> V10 CODE VERSION
+        # bvs = BestValidSelector() # ----> V14 CODE VERSION
+        # sota_exp = bvs.get_sota_exp_to_submit(trace) # ----> V14 CODE VERSION
+        # sota_flag = sota_exp is not None and sota_exp.result is not None # ----> V14 CODE VERSION
 
         if sota_flag:
-            current_sota_score = sota_exp.result.loc["ensemble"].iloc[0].round(3) # ----> V14 CODE VERSION
-            # trace.sota_exp_to_submit.result.loc["ensemble"].iloc[0].round(3) ----> V10 CODE VERSION
+            # current_sota_score = sota_exp.result.loc["ensemble"].iloc[0].round(3) # ----> V14 CODE VERSION
+            current_sota_score = (
+                trace.sota_exp_to_submit.result.loc["ensemble"].iloc[0].round(3)
+            ) # ----> V10 CODE VERSION
         else:
             current_sota_score = -1
 
@@ -1120,7 +1103,7 @@ def hypothesis_select_with_llm(
         extra_hypo_l = self._llm_select_extra_hypo(trace)
         if len(extra_hypo_l) > 0:
             # TODO:
-            selected_extra_hypo_l = self._prob_dis(
+            selected_extra_hypo_l = self._prob_dis_torch(
                 current_sota_score_in_current_trace,
                 extra_hypo_l,
                 hypothesis_candidates,
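
Note on `_prob_dis_torch`: for each candidate, the new code builds logits from cosine similarity (decayed by `path_length`, with `gamma = ln(2) / 30`, so the similarity weight halves every 30 steps) plus `tanh` of the score difference, clamps them to [-2, 2], and applies a row-wise softmax before sampling. The sketch below reproduces that arithmetic for a single candidate row in plain Python, standing in for the torch calls so it runs without extra dependencies. The function names and the `has_sota` flag (a stand-in for the `current_sota_score_in_current_trace == -1` check) are illustrative assumptions, and the sampling step via `torch.multinomial` is left out.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (one entry of the similarity matrix)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_decay(path_length: float) -> float:
    """exp(-gamma * path_length) with gamma = ln(2) / 30, as in the diff."""
    gamma = math.log(2) / 30
    return math.exp(-gamma * path_length)


def selection_probs(
    sim_row: Sequence[float],
    score_diffs: Sequence[float],
    path_length: float,
    has_sota: bool = True,
) -> List[float]:
    """Softmax over clamped logits for one candidate against all history entries."""
    # Without a SOTA score the diff sets beta to 0, so only similarity counts.
    alpha, beta = (1.0, 1.0) if has_sota else (1.0, 0.0)
    decay = similarity_decay(path_length)
    logits = [alpha * s * decay + beta * math.tanh(d) for s, d in zip(sim_row, score_diffs)]
    # Same bounds as torch.clamp(logits, min=-2, max=2).
    logits = [max(-2.0, min(2.0, x)) for x in logits]
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    sims = [cosine_similarity([1.0, 0.0], [1.0, 0.0]), cosine_similarity([1.0, 0.0], [0.0, 1.0])]
    print(selection_probs(sims, [0.2, -0.1], path_length=10))
```

Clamping to [-2, 2] bounds the ratio between the largest and smallest probability in a row at e^4, which keeps the subsequent sampling from collapsing onto a single history entry.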
