
Add full Reward Conditioning support in 3.0_beta #309

Open
Aditya-Gupta26 wants to merge 4 commits into 3.0_beta from aditya/full_reward_conditioning
Conversation

@Aditya-Gupta26

This PR implements reward conditioning (as implemented in GigaFlow). Previously we had only lane-level rewards; we now also have:

  • Velocity reward
  • Overspeed penalty
  • Comfort violation penalty
  • Timestep penalty
  • Reverse penalty

We omit the stop line penalty for now (not yet supported).

It also adds some metrics to the training logs.
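The new terms listed above combine roughly as follows (an illustrative sketch only, not the drive.h implementation; every name here is hypothetical): each term is a per-agent coefficient times a per-step feature, scaled by dt where the term is time-integrated.

```c
/* Illustrative sketch of how the new reward terms combine per step.
 * All names are hypothetical; the real code lives in drive.h. */
typedef struct {
    float velocity, overspeed, comfort, timestep, reverse;
} RewardCoefs;

static float conditioned_reward(const RewardCoefs *c, float dt,
                                float velocity_progress, float overspeed_amount,
                                int comfort_violations, int moving, int reversing) {
    float r = 0.0f;
    r += c->velocity * dt * velocity_progress;     /* velocity reward */
    r += c->overspeed * overspeed_amount;          /* overspeed penalty */
    r += c->comfort * (float)comfort_violations;   /* comfort violation penalty */
    if (moving)    r += c->timestep * dt;          /* timestep penalty */
    if (reversing) r += c->reverse * dt;           /* reverse penalty */
    return r;
}
```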

@greptile-apps

greptile-apps bot commented Feb 21, 2026

Greptile Summary

This PR adds full reward conditioning support (GigaFlow-style) to the driving environment, introducing velocity reward, overspeed penalty, comfort violation penalty, timestep penalty, and reverse penalty, gated behind the reward_conditioning flag. It also adds corresponding metrics to the training logs and refactors the speed limit from a local variable to a global SPEED_LIMIT constant.

  • Offroad reward regression: The non-conditioning path for offroad collisions now uses = instead of +=, overwriting any previously accumulated rewards in the timestep
  • Lane rewards no longer applied without conditioning: Lane alignment and lane center rewards were previously applied unconditionally but are now gated behind reward_conditioning, changing behavior for users with conditioning disabled
  • Observation gap for new reward coefficients: normalize_reward_coef still returns 0 for all non-lane reward types (per a TODO comment marked "TO BE REMOVED ONCE ALL REWARD CONDITIONING IS IMPLEMENTED"), so the agent cannot observe the new reward coefficients (comfort, velocity, timestep, reverse, overspeed) even though they now affect the reward signal
  • New metrics (lane_center_rate, comfort_violation_count, velocity_progress_sum, avg_speed_per_agent) are correctly propagated through the logging pipeline

Confidence Score: 2/5

  • This PR introduces a reward overwrite bug and a behavioral regression in the non-conditioning path that should be addressed before merging.
  • Score of 2 reflects: (1) a clear = vs += bug in the offroad reward path that silently overwrites accumulated rewards, (2) lane rewards being dropped for users with reward_conditioning off which is a behavioral regression from the base branch, and (3) the observation normalization function not being updated to expose new reward coefficients to the agent, undermining the purpose of conditioning.
  • Pay close attention to pufferlib/ocean/drive/drive.h — the reward computation section (lines 2547-2716) and the normalize_reward_coef function (lines 490-519).

Important Files Changed

Filename Overview
pufferlib/ocean/drive/drive.h Adds reward conditioning support (velocity, comfort, overspeed, timestep, reverse rewards), new metrics computation, and logging. Contains a regression where offroad reward uses = instead of += in the non-conditioning path, moves previously unconditional lane rewards inside the conditioning gate, and does not update normalize_reward_coef to expose new reward coefficients in observations.
pufferlib/ocean/drive/binding.c Adds four new metric fields (lane_center_rate, comfort_violation_count, velocity_progress_sum, avg_speed_per_agent) to the Python logging dictionary. Straightforward and correct additions.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[c_step: Per-agent reward loop] --> B{collision_state > 0?}
    B -->|VEHICLE_COLLISION| C{reward_conditioning?}
    C -->|Yes| D[reward = coef_collision - 0.1 * speed]
    C -->|No| E[reward = env.reward_vehicle_collision]
    B -->|OFFROAD| F{reward_conditioning?}
    F -->|Yes| G[reward += coef_offroad]
    F -->|No| H["reward = env.reward_offroad ⚠️ uses = not +="]
    B -->|No collision| I[Goal reward logic]
    I --> J[Log metrics: lane_center, comfort, velocity, speed]
    J --> K{reward_conditioning?}
    K -->|Yes| L[Lane align reward]
    K -->|Yes| M[Lane center reward]
    K -->|Yes| N[Comfort penalty]
    K -->|Yes| O[Velocity reward]
    K -->|Yes| P[Timestep penalty]
    K -->|Yes| Q[Reverse penalty]
    K -->|Yes| R[Overspeed penalty]
    K -->|No| S["No lane/new rewards ⚠️ regression"]

Last reviewed commit: bad8905


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 3 comments


Comment on lines 2578 to 2579
env->rewards[i] = env->reward_offroad_collision;
env->logs[i].episode_return = env->reward_offroad_collision;

Offroad reward uses = instead of +=

In the non-conditioning (else) branch, env->rewards[i] and env->logs[i].episode_return are assigned with = instead of +=. The base code (before this PR) used += here. This means any rewards accumulated earlier in the timestep (e.g., goal rewards, jerk penalties) will be silently overwritten when an offroad collision is detected with reward_conditioning off.

Compare with the conditioning branch (lines 2575-2576) which correctly uses +=.

Suggested change
- env->rewards[i] = env->reward_offroad_collision;
- env->logs[i].episode_return = env->reward_offroad_collision;
+ env->rewards[i] += env->reward_offroad_collision;
+ env->logs[i].episode_return += env->reward_offroad_collision;


this suggestion seems correct?


Yes, it's correct. The issue is this line:

env->rewards[i] = env->reward_offroad_collision;

Should be:

env->rewards[i] += env->reward_offroad_collision;

Same with episode_return. The = operator overwrites any previous rewards accumulated in the timestep, while the vehicle collision branch (and conditioning branches) correctly use += to add to the existing reward.
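A minimal standalone illustration of the difference (illustrative only, not the actual drive.h code): the same offroad penalty applied with `=` versus `+=` to a reward slot that already holds earlier terms from this timestep.

```c
/* Illustrative only: '=' vs '+=' on an already-accumulated reward. */
static float offroad_with_assign(float accumulated, float offroad_penalty) {
    float reward = accumulated;
    reward = offroad_penalty;    /* '=' silently discards earlier rewards */
    return reward;
}

static float offroad_with_accumulate(float accumulated, float offroad_penalty) {
    float reward = accumulated;
    reward += offroad_penalty;   /* '+=' preserves earlier rewards */
    return reward;
}
```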

Author


Yeah, updated.

Comment on lines +2647 to +2716
if (env->reward_conditioning) {
    // Add lane rewards
    // Get lane angle metric: cos(θ_f) where θ_f = heading diff from lane
    float cos_theta = agent->metrics_array[LANE_ANGLE_IDX];
    float theta_f = acosf(fminf(fmaxf(cos_theta, -1.0f), 1.0f)); // Get |θ_f| from cos
    // env->logs[i].lane_heading_aligned_rate += (cos_theta >= LANE_ALIGN_COS_THRESHOLD) ? 1.0f : 0.0f; <-
    // Commented because this is already catered for by us

    // Rl-align (GIGAFLOW): min(cos,0) + vel_align*min(cos*v,0) + 0.0025*(1-|θ|/(π/2))
    float against_lane_penalty = fminf(cos_theta, 0.0f); // negative when >90 degrees off
    float vel_aligned_penalty =
        agent->reward_coefs[REWARD_COEF_VEL_ALIGN] * fminf(cos_theta * speed_magnitude, 0.0f);
    float alignment_bonus = 0.0025f * (1.0f - theta_f / (M_PI / 2.0f));

    float lane_align_reward = agent->reward_coefs[REWARD_COEF_LANE_ALIGN] * env->dt *
                              (against_lane_penalty + vel_aligned_penalty + alignment_bonus);

    env->rewards[i] += lane_align_reward;
    env->logs[i].episode_return += lane_align_reward;

    // Rl-center (GIGAFLOW): -α * dt * (|x_f - bias| - 0.05/exp(|x_f - bias| - 0.5))
    float adjusted_dist = fabsf(lane_center_distance - agent->reward_coefs[REWARD_COEF_CENTER_BIAS]);
    float exp_decay = 0.05f / expf(adjusted_dist - 0.5f);
    float lane_center_reward = agent->reward_coefs[REWARD_COEF_LANE_CENTER] * env->dt *
                               ((cos_theta > 0.5f) * adjusted_dist - exp_decay);

    env->rewards[i] += lane_center_reward;
    // env->logs[i].lane_center_rate += fabsf(lane_center_distance) < 0.5f ? 1.0f : 0.0f; <- Commented for now
    // (need to add this to add_log)
    env->logs[i].episode_return += lane_center_reward;

    // OTHER REWARDS (From Gigaflow)

    // Red light violation reward (GIGAFLOW) - OMITTED

    // Comfort reward (GIGAFLOW)
    float comfort_penalty = agent->reward_coefs[REWARD_COEF_COMFORT] * comfort_violations;

    env->rewards[i] += comfort_penalty;
    env->logs[i].episode_return += comfort_penalty;

    // Further support TO BE ADDED for all other types of reward conditioning
    // Velocity reward (GIGAFLOW)
    float velocity_reward = agent->reward_coefs[REWARD_COEF_VELOCITY] * env->dt * velocity_progress;

    env->rewards[i] += velocity_reward;
    env->logs[i].episode_return += velocity_reward;

    // Timestep reward (GIGAFLOW)
    float accel = sqrtf(agent->a_long * agent->a_long + agent->a_lat * agent->a_lat);
    // Only penalize when moving (v > 0) or accelerating (a > 0)
    if (speed_magnitude > 0.01f || accel > 0.01f) {
        float timestep_penalty = agent->reward_coefs[REWARD_COEF_TIMESTEP] * env->dt;

        env->rewards[i] += timestep_penalty;
        env->logs[i].episode_return += timestep_penalty;
    }

    // Reverse reward (GIGAFLOW)
    if (signed_speed < -0.01f) {
        float reverse_penalty = agent->reward_coefs[REWARD_COEF_REVERSE] * env->dt;

        env->rewards[i] += reverse_penalty;
        env->logs[i].episode_return += reverse_penalty;
    }

    // Over speed reward (GIGAFLOW++)
    float speed_reward = agent->reward_coefs[REWARD_COEF_OVERSPEED] * agent->metrics_array[SPEED_LIMIT_IDX];

    env->rewards[i] += speed_reward;
    env->logs[i].episode_return += speed_reward;
}

Lane rewards dropped when reward_conditioning is off

Previously, lane alignment and lane center rewards were always computed and applied unconditionally. This PR moves them inside the if (env->reward_conditioning) block, meaning when reward_conditioning is disabled, agents no longer receive any lane-level rewards at all.

If this is intentional (i.e., the non-conditioning path is expected to not have lane rewards), it would be helpful to document that somewhere. If not, the lane align and lane center reward logic should be moved back outside the conditioning block as before, with only the new rewards (comfort, velocity, timestep, reverse, overspeed) guarded by the flag.
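The second option could be structured roughly as follows (a sketch under assumptions, not the actual drive.h code; the helper functions are hypothetical stand-ins for the reward computations, here returning fixed values so the gating is visible):

```c
/* Sketch: lane rewards stay unconditional as before this PR; only the
 * new GigaFlow-style terms sit behind the flag. Helpers are hypothetical. */
static float lane_reward_terms(void)  { return 0.10f; }  /* align + center */
static float new_gigaflow_terms(void) { return -0.02f; } /* comfort, velocity, ... */

static float step_reward(int reward_conditioning) {
    float reward = 0.0f;
    reward += lane_reward_terms();       /* always applied */
    if (reward_conditioning)
        reward += new_gigaflow_terms();  /* gated by the flag */
    return reward;
}
```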

@greptile-apps

greptile-apps bot commented Feb 21, 2026

Additional Comments (1)

pufferlib/ocean/drive/drive.h
normalize_reward_coef zeroes new reward types in observations

This function returns 0 for all non-lane reward coefficients (line 496), including the newly added REWARD_COEF_COMFORT, REWARD_COEF_VELOCITY, REWARD_COEF_TIMESTEP, REWARD_COEF_REVERSE, and REWARD_COEF_OVERSPEED. The comment on line 493 says "TO BE REMOVED ONCE ALL REWARD CONDITIONING IS IMPLEMENTED."

Since this PR implements full reward conditioning, these new coefficients should now be observable by the agent. Otherwise the agent's policy cannot condition on reward coefficients it cannot see, undermining the purpose of reward conditioning for the new reward types.

Consider updating this filter to also allow the new reward coefficient indices through.


if (env->dynamics_model == JERK) {

const float COMFORT_JERK_THRESHOLD = 5.0f; // m/s³


Constants should be declared in one place, otherwise it's hard to figure out where they live
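For instance, the tunables could be gathered in a single shared header along these lines (the file name is hypothetical and the SPEED_LIMIT value is illustrative only; COMFORT_JERK_THRESHOLD matches the value in the snippet above):

```c
/* drive_constants.h (hypothetical): one home for tunable constants
 * instead of defining them inline at the point of use. */
#define COMFORT_JERK_THRESHOLD  5.0f   /* m/s^3, comfort violation cutoff */
#define SPEED_LIMIT            20.0f   /* m/s, illustrative value only */
```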

Author


Done.

@Aditya-Gupta26 Aditya-Gupta26 force-pushed the aditya/full_reward_conditioning branch from f9de64e to 5e18e9f Compare February 23, 2026 23:54