Add full Reward Conditioning support in 3.0_beta #309

Aditya-Gupta26 wants to merge 4 commits into `3.0_beta`
Conversation
Greptile Summary

This PR adds full reward conditioning support (GigaFlow-style) to the driving environment, introducing a velocity reward, overspeed penalty, comfort violation penalty, timestep penalty, and reverse penalty, gated behind the `reward_conditioning` flag.
Confidence Score: 2/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[c_step: Per-agent reward loop] --> B{collision_state > 0?}
    B -->|VEHICLE_COLLISION| C{reward_conditioning?}
    C -->|Yes| D[reward = coef_collision - 0.1 * speed]
    C -->|No| E[reward = env.reward_vehicle_collision]
    B -->|OFFROAD| F{reward_conditioning?}
    F -->|Yes| G[reward += coef_offroad]
    F -->|No| H["reward = env.reward_offroad ⚠️ uses = not +="]
    B -->|No collision| I[Goal reward logic]
    I --> J[Log metrics: lane_center, comfort, velocity, speed]
    J --> K{reward_conditioning?}
    K -->|Yes| L[Lane align reward]
    K -->|Yes| M[Lane center reward]
    K -->|Yes| N[Comfort penalty]
    K -->|Yes| O[Velocity reward]
    K -->|Yes| P[Timestep penalty]
    K -->|Yes| Q[Reverse penalty]
    K -->|Yes| R[Overspeed penalty]
    K -->|No| S["No lane/new rewards ⚠️ regression"]
```

Last reviewed commit: bad8905
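The collision branches in the flowchart can be sketched as a standalone function. This is a hedged illustration only: the function name `collision_reward` and the stand-in base penalties (`-1.0f`, `-0.5f`) are made up for the example and are not the actual values in drive.h.

```c
#include <assert.h>
#include <math.h>

typedef enum { NO_COLLISION = 0, VEHICLE_COLLISION, OFFROAD } CollisionState;

/* Sketch of the per-agent collision branches shown in the flowchart.
 * prev_reward mirrors whatever env->rewards[i] already holds this timestep. */
float collision_reward(CollisionState state, int reward_conditioning,
                       float coef_collision, float coef_offroad,
                       float speed, float prev_reward) {
    if (state == VEHICLE_COLLISION) {
        /* conditioning: coefficient minus a speed-scaled term */
        return reward_conditioning ? coef_collision - 0.1f * speed
                                   : -1.0f; /* stand-in for env->reward_vehicle_collision */
    }
    if (state == OFFROAD) {
        /* conditioning branch accumulates onto the existing reward;
         * the non-conditioning branch in the PR overwrites it (see the
         * '=' vs '+=' issue flagged in the review) */
        return reward_conditioning ? prev_reward + coef_offroad
                                   : -0.5f; /* stand-in for env->reward_offroad */
    }
    return prev_reward; /* goal/lane logic handled elsewhere */
}
```
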
pufferlib/ocean/drive/drive.h
Outdated
```c
env->rewards[i] = env->reward_offroad_collision;
env->logs[i].episode_return = env->reward_offroad_collision;
```
Offroad reward uses = instead of +=
In the non-conditioning (else) branch, env->rewards[i] and env->logs[i].episode_return are assigned with = instead of +=. The base code (before this PR) used += here. This means any rewards accumulated earlier in the timestep (e.g., goal rewards, jerk penalties) will be silently overwritten when an offroad collision is detected with reward_conditioning off.
Compare with the conditioning branch (lines 2575-2576) which correctly uses +=.
```diff
- env->rewards[i] = env->reward_offroad_collision;
- env->logs[i].episode_return = env->reward_offroad_collision;
+ env->rewards[i] += env->reward_offroad_collision;
+ env->logs[i].episode_return += env->reward_offroad_collision;
```
Yes, it's correct. The issue is this line:

```c
env->rewards[i] = env->reward_offroad_collision;
```

Should be:

```c
env->rewards[i] += env->reward_offroad_collision;
```

Same with `episode_return`. The `=` operator overwrites any previous rewards accumulated in the timestep, while the vehicle collision branch (and conditioning branches) correctly use `+=` to add to the existing reward.
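A minimal sketch of the overwrite-vs-accumulate difference, using a plain accumulator in place of `env->rewards[i]` (the helper names and reward values are made up for illustration):

```c
#include <assert.h>

/* Buggy pattern: '=' silently discards rewards accumulated earlier
 * in the same timestep (e.g. a goal reward or jerk penalty). */
float step_reward_buggy(float goal_reward, float offroad_reward) {
    float reward = 0.0f;
    reward += goal_reward;   /* earlier in the timestep */
    reward = offroad_reward; /* '=' overwrites the goal reward */
    return reward;
}

/* Fixed pattern: '+=' preserves the accumulated reward. */
float step_reward_fixed(float goal_reward, float offroad_reward) {
    float reward = 0.0f;
    reward += goal_reward;
    reward += offroad_reward; /* adds to the existing reward */
    return reward;
}
```
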
```c
if (env->reward_conditioning) {
    // Add lane rewards
    // Get lane angle metric: cos(θ_f) where θ_f = heading diff from lane
    float cos_theta = agent->metrics_array[LANE_ANGLE_IDX];
    float theta_f = acosf(fminf(fmaxf(cos_theta, -1.0f), 1.0f)); // Get |θ_f| from cos
    // env->logs[i].lane_heading_aligned_rate += (cos_theta >= LANE_ALIGN_COS_THRESHOLD) ? 1.0f : 0.0f;
    // Commented because this is already catered for by us

    // Rl-align (GIGAFLOW): min(cos,0) + vel_align*min(cos*v,0) + 0.0025*(1-|θ|/(π/2))
    float against_lane_penalty = fminf(cos_theta, 0.0f); // negative when >90 degrees off
    float vel_aligned_penalty =
        agent->reward_coefs[REWARD_COEF_VEL_ALIGN] * fminf(cos_theta * speed_magnitude, 0.0f);
    float alignment_bonus = 0.0025f * (1.0f - theta_f / (M_PI / 2.0f));

    float lane_align_reward = agent->reward_coefs[REWARD_COEF_LANE_ALIGN] * env->dt *
                              (against_lane_penalty + vel_aligned_penalty + alignment_bonus);

    env->rewards[i] += lane_align_reward;
    env->logs[i].episode_return += lane_align_reward;

    // Rl-center (GIGAFLOW): -α * dt * (|x_f - bias| - 0.05/exp(|x_f - bias| - 0.5))
    float adjusted_dist = fabsf(lane_center_distance - agent->reward_coefs[REWARD_COEF_CENTER_BIAS]);
    float exp_decay = 0.05f / expf(adjusted_dist - 0.5f);
    float lane_center_reward = agent->reward_coefs[REWARD_COEF_LANE_CENTER] * env->dt *
                               ((cos_theta > 0.5f) * adjusted_dist - exp_decay);

    env->rewards[i] += lane_center_reward;
    // env->logs[i].lane_center_rate += fabsf(lane_center_distance) < 0.5f ? 1.0f : 0.0f;
    // Commented for now (need to add this to add_log)
    env->logs[i].episode_return += lane_center_reward;

    // OTHER REWARDS (from GigaFlow)

    // Red light violation reward (GIGAFLOW) - OMITTED

    // Comfort reward (GIGAFLOW)
    float comfort_penalty = agent->reward_coefs[REWARD_COEF_COMFORT] * comfort_violations;
    env->rewards[i] += comfort_penalty;
    env->logs[i].episode_return += comfort_penalty;

    // Further support TO BE ADDED for all other types of reward conditioning
    // Velocity reward (GIGAFLOW)
    float velocity_reward = agent->reward_coefs[REWARD_COEF_VELOCITY] * env->dt * velocity_progress;
    env->rewards[i] += velocity_reward;
    env->logs[i].episode_return += velocity_reward;

    // Timestep reward (GIGAFLOW)
    float accel = sqrtf(agent->a_long * agent->a_long + agent->a_lat * agent->a_lat);
    // Only penalize when moving (v > 0) or accelerating (a > 0)
    if (speed_magnitude > 0.01f || accel > 0.01f) {
        float timestep_penalty = agent->reward_coefs[REWARD_COEF_TIMESTEP] * env->dt;
        env->rewards[i] += timestep_penalty;
        env->logs[i].episode_return += timestep_penalty;
    }

    // Reverse reward (GIGAFLOW)
    if (signed_speed < -0.01f) {
        float reverse_penalty = agent->reward_coefs[REWARD_COEF_REVERSE] * env->dt;
        env->rewards[i] += reverse_penalty;
        env->logs[i].episode_return += reverse_penalty;
    }

    // Over speed reward (GIGAFLOW++)
    float speed_reward = agent->reward_coefs[REWARD_COEF_OVERSPEED] * agent->metrics_array[SPEED_LIMIT_IDX];
    env->rewards[i] += speed_reward;
    env->logs[i].episode_return += speed_reward;
}
```
Lane rewards dropped when reward_conditioning is off
Previously, lane alignment and lane center rewards were always computed and applied unconditionally. This PR moves them inside the if (env->reward_conditioning) block, meaning when reward_conditioning is disabled, agents no longer receive any lane-level rewards at all.
If this is intentional (i.e., the non-conditioning path is expected to not have lane rewards), it would be helpful to document that somewhere. If not, the lane align and lane center reward logic should be moved back outside the conditioning block as before, with only the new rewards (comfort, velocity, timestep, reverse, overspeed) guarded by the flag.
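One way the suggested split could look, as a hedged sketch: lane rewards stay unconditional (matching the pre-PR behavior) and only the new GigaFlow terms are gated. The helper name `accumulate_rewards` and its flattened arguments are hypothetical, not drive.h's API.

```c
#include <assert.h>
#include <math.h>

/* Sketch: apply lane rewards unconditionally; gate only the new terms
 * (comfort, velocity, timestep, reverse, overspeed) behind the flag. */
void accumulate_rewards(float *reward, int reward_conditioning,
                        float lane_align_reward, float lane_center_reward,
                        float new_reward_terms) {
    /* always applied, as before this PR */
    *reward += lane_align_reward;
    *reward += lane_center_reward;
    if (reward_conditioning) {
        /* sum of the newly introduced GigaFlow terms */
        *reward += new_reward_terms;
    }
}
```
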
Additional Comments (1)

This function returns 0 for all non-lane reward coefficients (line 496), including the newly added ones. Since this PR implements full reward conditioning, these new coefficients should now be observable by the agent. Otherwise the agent's policy cannot condition on reward coefficients it cannot see, undermining the purpose of reward conditioning for the new reward types. Consider updating this filter to also allow the new reward coefficient indices through.
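A hedged sketch of what the filter update could look like. The enum ordering and the function name `observed_reward_coef` are assumptions for illustration, not the actual code at line 496.

```c
#include <assert.h>

/* Illustrative coefficient indices; the real enum lives in drive.h. */
enum {
    REWARD_COEF_LANE_ALIGN = 0, REWARD_COEF_LANE_CENTER, REWARD_COEF_CENTER_BIAS,
    REWARD_COEF_VEL_ALIGN, REWARD_COEF_COMFORT, REWARD_COEF_VELOCITY,
    REWARD_COEF_TIMESTEP, REWARD_COEF_REVERSE, REWARD_COEF_OVERSPEED
};

/* Pass the coefficient through to the observation if the agent is
 * allowed to see it; zero it out otherwise. */
float observed_reward_coef(int idx, float value) {
    switch (idx) {
        case REWARD_COEF_LANE_ALIGN:
        case REWARD_COEF_LANE_CENTER:
        case REWARD_COEF_CENTER_BIAS:
        case REWARD_COEF_VEL_ALIGN:
        case REWARD_COEF_COMFORT:   /* newly allowed through */
        case REWARD_COEF_VELOCITY:  /* newly allowed through */
        case REWARD_COEF_TIMESTEP:  /* newly allowed through */
        case REWARD_COEF_REVERSE:   /* newly allowed through */
        case REWARD_COEF_OVERSPEED: /* newly allowed through */
            return value;
        default:
            return 0.0f; /* everything else stays hidden */
    }
}
```
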
pufferlib/ocean/drive/drive.h
Outdated
```c
if (env->dynamics_model == JERK) {
    const float COMFORT_JERK_THRESHOLD = 5.0f; // m/s³
```
Constants should be declared in one place; otherwise it's hard to figure out where they live.
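A sketch of the reviewer's suggestion, assuming a dedicated constants section near the top of drive.h rather than a declaration inside the dynamics branch. The guard name is illustrative, and only the jerk threshold value comes from the diff.

```c
/* Illustrative constants section; in practice this would sit near the
 * top of drive.h so all tunables are discoverable in one place. */
#ifndef DRIVE_CONSTANTS_H
#define DRIVE_CONSTANTS_H

#define COMFORT_JERK_THRESHOLD 5.0f /* m/s³, from the JERK dynamics branch */

#endif /* DRIVE_CONSTANTS_H */
```
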
Force-pushed from f9de64e to 5e18e9f.
This PR implements reward conditioning (as implemented in GigaFlow). We previously had lane-level rewards, and now also have:

- Velocity reward
- Overspeed penalty
- Comfort violation penalty
- Timestep penalty
- Reverse penalty

We omit the stop line penalty for now (not supported yet).

It also adds some metrics to the training logs.