
Add full Reward Conditioning support in 3.0_beta #309

Open
Aditya-Gupta26 wants to merge 4 commits into 3.0_beta from aditya/full_reward_conditioning
Conversation

@Aditya-Gupta26

This PR implements reward conditioning (as implemented in GigaFlow). Previously we had only lane-level rewards; we now also have:

  • Velocity reward
  • Overspeed penalty
  • Comfort violation penalty
  • Timestep penalty
  • Reverse penalty

We omit the stop line penalty for now (not yet supported).

It also adds some metrics to the training logs.
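The new terms listed above combine roughly as follows (an illustrative sketch only, not the drive.h implementation; every name here is hypothetical): each term is a per-agent coefficient times a per-step feature, scaled by dt where the term is time-integrated.

```c
/* Illustrative sketch of how the new reward terms combine per step.
 * All names are hypothetical; the real code lives in drive.h. */
typedef struct {
    float velocity, overspeed, comfort, timestep, reverse;
} RewardCoefs;

static float conditioned_reward(const RewardCoefs *c, float dt,
                                float velocity_progress, float overspeed_amount,
                                int comfort_violations, int moving, int reversing) {
    float r = 0.0f;
    r += c->velocity * dt * velocity_progress;     /* velocity reward */
    r += c->overspeed * overspeed_amount;          /* overspeed penalty */
    r += c->comfort * (float)comfort_violations;   /* comfort violation penalty */
    if (moving)    r += c->timestep * dt;          /* timestep penalty */
    if (reversing) r += c->reverse * dt;           /* reverse penalty */
    return r;
}
```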

@greptile-apps

greptile-apps bot commented Feb 21, 2026

Greptile Summary

This PR adds full reward conditioning support (GigaFlow-style) to the driving environment, introducing velocity reward, overspeed penalty, comfort violation penalty, timestep penalty, and reverse penalty, gated behind the reward_conditioning flag. It also adds corresponding metrics to the training logs and refactors the speed limit from a local variable to a global SPEED_LIMIT constant.

  • Offroad reward regression: The non-conditioning path for offroad collisions now uses = instead of +=, overwriting any previously accumulated rewards in the timestep
  • Lane rewards no longer applied without conditioning: Lane alignment and lane center rewards were previously applied unconditionally but are now gated behind reward_conditioning, changing behavior for users with conditioning disabled
  • Observation gap for new reward coefficients: normalize_reward_coef still returns 0 for all non-lane reward types (per a TODO comment marked "TO BE REMOVED ONCE ALL REWARD CONDITIONING IS IMPLEMENTED"), so the agent cannot observe the new reward coefficients (comfort, velocity, timestep, reverse, overspeed) even though they now affect the reward signal
  • New metrics (lane_center_rate, comfort_violation_count, velocity_progress_sum, avg_speed_per_agent) are correctly propagated through the logging pipeline

Confidence Score: 2/5

  • This PR introduces a reward overwrite bug and a behavioral regression in the non-conditioning path that should be addressed before merging.
  • Score of 2 reflects: (1) a clear = vs += bug in the offroad reward path that silently overwrites accumulated rewards, (2) lane rewards being dropped for users with reward_conditioning off which is a behavioral regression from the base branch, and (3) the observation normalization function not being updated to expose new reward coefficients to the agent, undermining the purpose of conditioning.
  • Pay close attention to pufferlib/ocean/drive/drive.h — the reward computation section (lines 2547-2716) and the normalize_reward_coef function (lines 490-519).

Important Files Changed

Filename Overview
pufferlib/ocean/drive/drive.h Adds reward conditioning support (velocity, comfort, overspeed, timestep, reverse rewards), new metrics computation, and logging. Contains a regression where offroad reward uses = instead of += in the non-conditioning path, moves previously unconditional lane rewards inside the conditioning gate, and does not update normalize_reward_coef to expose new reward coefficients in observations.
pufferlib/ocean/drive/binding.c Adds four new metric fields (lane_center_rate, comfort_violation_count, velocity_progress_sum, avg_speed_per_agent) to the Python logging dictionary. Straightforward and correct additions.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[c_step: Per-agent reward loop] --> B{collision_state > 0?}
    B -->|VEHICLE_COLLISION| C{reward_conditioning?}
    C -->|Yes| D[reward = coef_collision - 0.1 * speed]
    C -->|No| E[reward = env.reward_vehicle_collision]
    B -->|OFFROAD| F{reward_conditioning?}
    F -->|Yes| G[reward += coef_offroad]
    F -->|No| H["reward = env.reward_offroad ⚠️ uses = not +="]
    B -->|No collision| I[Goal reward logic]
    I --> J[Log metrics: lane_center, comfort, velocity, speed]
    J --> K{reward_conditioning?}
    K -->|Yes| L[Lane align reward]
    K -->|Yes| M[Lane center reward]
    K -->|Yes| N[Comfort penalty]
    K -->|Yes| O[Velocity reward]
    K -->|Yes| P[Timestep penalty]
    K -->|Yes| Q[Reverse penalty]
    K -->|Yes| R[Overspeed penalty]
    K -->|No| S["No lane/new rewards ⚠️ regression"]

Last reviewed commit: bad8905


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 3 comments


Comment on lines 2578 to 2579
env->rewards[i] = env->reward_offroad_collision;
env->logs[i].episode_return = env->reward_offroad_collision;

Offroad reward uses = instead of +=

In the non-conditioning (else) branch, env->rewards[i] and env->logs[i].episode_return are assigned with = instead of +=. The base code (before this PR) used += here. This means any rewards accumulated earlier in the timestep (e.g., goal rewards, jerk penalties) will be silently overwritten when an offroad collision is detected with reward_conditioning off.

Compare with the conditioning branch (lines 2575-2576) which correctly uses +=.

Suggested change
- env->rewards[i] = env->reward_offroad_collision;
- env->logs[i].episode_return = env->reward_offroad_collision;
+ env->rewards[i] += env->reward_offroad_collision;
+ env->logs[i].episode_return += env->reward_offroad_collision;


this suggestion seems correct?


Yes, it's correct. The issue is this line:

env->rewards[i] = env->reward_offroad_collision;

Should be:

env->rewards[i] += env->reward_offroad_collision;

Same with episode_return. The = operator overwrites any previous rewards accumulated in the timestep, while the vehicle collision branch (and conditioning branches) correctly use += to add to the existing reward.
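A minimal standalone illustration of the difference (illustrative only, not the actual drive.h code): the same offroad penalty applied with `=` versus `+=` to a reward slot that already holds earlier terms from this timestep.

```c
/* Illustrative only: '=' vs '+=' on an already-accumulated reward. */
static float offroad_with_assign(float accumulated, float offroad_penalty) {
    float reward = accumulated;
    reward = offroad_penalty;    /* '=' silently discards earlier rewards */
    return reward;
}

static float offroad_with_accumulate(float accumulated, float offroad_penalty) {
    float reward = accumulated;
    reward += offroad_penalty;   /* '+=' preserves earlier rewards */
    return reward;
}
```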

Author


Yeah, updated.

Comment on lines +2647 to +2716
if (env->reward_conditioning) {
    // Add lane rewards
    // Get lane angle metric: cos(θ_f) where θ_f = heading diff from lane
    float cos_theta = agent->metrics_array[LANE_ANGLE_IDX];
    float theta_f = acosf(fminf(fmaxf(cos_theta, -1.0f), 1.0f)); // Get |θ_f| from cos
    // env->logs[i].lane_heading_aligned_rate += (cos_theta >= LANE_ALIGN_COS_THRESHOLD) ? 1.0f : 0.0f; <-
    // Commented because this is already catered for by us

    // Rl-align (GIGAFLOW): min(cos,0) + vel_align*min(cos*v,0) + 0.0025*(1-|θ|/(π/2))
    float against_lane_penalty = fminf(cos_theta, 0.0f); // negative when >90 degrees off
    float vel_aligned_penalty =
        agent->reward_coefs[REWARD_COEF_VEL_ALIGN] * fminf(cos_theta * speed_magnitude, 0.0f);
    float alignment_bonus = 0.0025f * (1.0f - theta_f / (M_PI / 2.0f));

    float lane_align_reward = agent->reward_coefs[REWARD_COEF_LANE_ALIGN] * env->dt *
                              (against_lane_penalty + vel_aligned_penalty + alignment_bonus);

    env->rewards[i] += lane_align_reward;
    env->logs[i].episode_return += lane_align_reward;

    // Rl-center (GIGAFLOW): -α * dt * (|x_f - bias| - 0.05/exp(|x_f - bias| - 0.5))
    float adjusted_dist = fabsf(lane_center_distance - agent->reward_coefs[REWARD_COEF_CENTER_BIAS]);
    float exp_decay = 0.05f / expf(adjusted_dist - 0.5f);
    float lane_center_reward = agent->reward_coefs[REWARD_COEF_LANE_CENTER] * env->dt *
                               ((cos_theta > 0.5f) * adjusted_dist - exp_decay);

    env->rewards[i] += lane_center_reward;
    // env->logs[i].lane_center_rate += fabsf(lane_center_distance) < 0.5f ? 1.0f : 0.0f; <- Commented for now
    // (need to add this to add_log)
    env->logs[i].episode_return += lane_center_reward;

    // OTHER REWARDS (From Gigaflow)

    // Red light violation reward (GIGAFLOW) - OMITTED

    // Comfort reward (GIGAFLOW)
    float comfort_penalty = agent->reward_coefs[REWARD_COEF_COMFORT] * comfort_violations;

    env->rewards[i] += comfort_penalty;
    env->logs[i].episode_return += comfort_penalty;

    // Further support TO BE ADDED for all other types of reward conditioning
    // Velocity reward (GIGAFLOW)
    float velocity_reward = agent->reward_coefs[REWARD_COEF_VELOCITY] * env->dt * velocity_progress;

    env->rewards[i] += velocity_reward;
    env->logs[i].episode_return += velocity_reward;

    // Timestep reward (GIGAFLOW)
    float accel = sqrtf(agent->a_long * agent->a_long + agent->a_lat * agent->a_lat);
    // Only penalize when moving (v > 0) or accelerating (a > 0)
    if (speed_magnitude > 0.01f || accel > 0.01f) {
        float timestep_penalty = agent->reward_coefs[REWARD_COEF_TIMESTEP] * env->dt;

        env->rewards[i] += timestep_penalty;
        env->logs[i].episode_return += timestep_penalty;
    }

    // Reverse reward (GIGAFLOW)
    if (signed_speed < -0.01f) {
        float reverse_penalty = agent->reward_coefs[REWARD_COEF_REVERSE] * env->dt;

        env->rewards[i] += reverse_penalty;
        env->logs[i].episode_return += reverse_penalty;
    }

    // Over speed reward (GIGAFLOW++)
    float speed_reward = agent->reward_coefs[REWARD_COEF_OVERSPEED] * agent->metrics_array[SPEED_LIMIT_IDX];

    env->rewards[i] += speed_reward;
    env->logs[i].episode_return += speed_reward;
}

Lane rewards dropped when reward_conditioning is off

Previously, lane alignment and lane center rewards were always computed and applied unconditionally. This PR moves them inside the if (env->reward_conditioning) block, meaning when reward_conditioning is disabled, agents no longer receive any lane-level rewards at all.

If this is intentional (i.e., the non-conditioning path is expected to not have lane rewards), it would be helpful to document that somewhere. If not, the lane align and lane center reward logic should be moved back outside the conditioning block as before, with only the new rewards (comfort, velocity, timestep, reverse, overspeed) guarded by the flag.
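The second option could be structured roughly as follows (a sketch under assumptions, not the actual drive.h code; the helper functions are hypothetical stand-ins for the reward computations, here returning fixed values so the gating is visible):

```c
/* Sketch: lane rewards stay unconditional as before this PR; only the
 * new GigaFlow-style terms sit behind the flag. Helpers are hypothetical. */
static float lane_reward_terms(void)  { return 0.10f; }  /* align + center */
static float new_gigaflow_terms(void) { return -0.02f; } /* comfort, velocity, ... */

static float step_reward(int reward_conditioning) {
    float reward = 0.0f;
    reward += lane_reward_terms();       /* always applied */
    if (reward_conditioning)
        reward += new_gigaflow_terms();  /* gated by the flag */
    return reward;
}
```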

@greptile-apps

greptile-apps bot commented Feb 21, 2026

Additional Comments (1)

pufferlib/ocean/drive/drive.h
normalize_reward_coef zeroes new reward types in observations

This function returns 0 for all non-lane reward coefficients (line 496), including the newly added REWARD_COEF_COMFORT, REWARD_COEF_VELOCITY, REWARD_COEF_TIMESTEP, REWARD_COEF_REVERSE, and REWARD_COEF_OVERSPEED. The comment on line 493 says "TO BE REMOVED ONCE ALL REWARD CONDITIONING IS IMPLEMENTED."

Since this PR implements full reward conditioning, these new coefficients should now be observable by the agent. Otherwise the agent's policy cannot condition on reward coefficients it cannot see, undermining the purpose of reward conditioning for the new reward types.

Consider updating this filter to also allow the new reward coefficient indices through.


if (env->dynamics_model == JERK) {

const float COMFORT_JERK_THRESHOLD = 5.0f; // m/s³


Constants should be declared in one place, otherwise it's hard to figure out where they live
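For instance, the tunables could be gathered in a single shared header along these lines (the file name is hypothetical and the SPEED_LIMIT value is illustrative only; COMFORT_JERK_THRESHOLD matches the value in the snippet above):

```c
/* drive_constants.h (hypothetical): one home for tunable constants
 * instead of defining them inline at the point of use. */
#define COMFORT_JERK_THRESHOLD  5.0f   /* m/s^3, comfort violation cutoff */
#define SPEED_LIMIT            20.0f   /* m/s, illustrative value only */
```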

Author


Done.

@Aditya-Gupta26 Aditya-Gupta26 force-pushed the aditya/full_reward_conditioning branch from f9de64e to 5e18e9f Compare February 23, 2026 23:54