Fix refer model with RL #1660

Open
hhaAndroid wants to merge 1 commit into InternLM:main from hhaAndroid:fix_ref_model_rl

Conversation

@hhaAndroid
Collaborator

No description provided.

@hhaAndroid
Collaborator Author

@claude review

@hhaAndroid changed the title from "fix refer model with RL" to "Fix refer model with RL" on Apr 8, 2026

Comment on lines +393 to +396

    loss_ctx = self.logprob_cfg.build(data={"shifted_labels": shifted_labels})
    assert loss_ctx is not None
    ref_output = self._ref_model(seq_ctx=seq_ctx, loss_ctx={"lm": loss_ctx})
    ref_logprobs_list.append(ref_output["loss"])
Contributor

Claude: Good fix. This now mirrors the pattern used in compute_actor_logprobs (lines 379–383 via _engine.forward_only), which also delegates logprob computation to LogProbContext. This brings two improvements:

  1. Consistency — ref model logprob computation now follows the same pipeline as actor logprob computation.
  2. Memory efficiency — When LogProbConfig uses mode="chunk", the logprobs are computed in vocabulary-sized chunks instead of materializing the full (batch, seq_len, vocab_size) logits tensor. The old code forced full materialization regardless of config.
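
To make the memory argument concrete, here is a minimal, self-contained sketch of chunked log-prob gathering. All names here are illustrative and not the actual LogProbContext/LogProbConfig implementation; the review describes chunking in vocabulary-sized pieces, while this sketch chunks along the sequence dimension, but the peak-memory idea is the same: only a small slab of the log-softmax is live at any time instead of the full (seq_len, vocab_size) tensor.

```python
# Illustrative sketch (NOT the InternLM/xtuner implementation) of why
# chunked log-prob gathering reduces peak memory vs. full materialization.
import numpy as np

def gather_logprobs_full(logits, labels):
    # Materializes the full (seq_len, vocab) log-softmax before gathering
    # the label positions. Peak memory scales with seq_len * vocab.
    m = logits.max(axis=-1, keepdims=True)  # stabilize the exp
    logsumexp = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    logprobs = logits - logsumexp
    return logprobs[np.arange(len(labels)), labels]

def gather_logprobs_chunked(logits, labels, chunk=4):
    # Processes the sequence a chunk at a time, so only a (chunk, vocab)
    # slab of log-softmax exists at once; the result is identical.
    out = []
    for start in range(0, len(labels), chunk):
        sl = slice(start, start + chunk)
        out.append(gather_logprobs_full(logits[sl], labels[sl]))
    return np.concatenate(out)

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 32))
labels = rng.integers(0, 32, size=10)
assert np.allclose(gather_logprobs_full(logits, labels),
                   gather_logprobs_chunked(logits, labels))
```

Delegating this choice to a config (as the PR does via LogProbConfig) means the ref-model path picks up the chunked variant for free whenever mode="chunk" is set.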

Nit: the local variable loss_ctx on line 393 shares its name with the loss_ctx= keyword argument on line 395, where it appears as the value in the dict literal ({"lm": loss_ctx}). Consider renaming the local to ref_loss_ctx or logprob_ctx for clarity — this would also match the naming intent better, since it is a LogProbContext, not a generic loss context.

@claude
Contributor

claude bot commented Apr 8, 2026

Claude: ## Summary

Clean fix that aligns compute_ref_logprobs with the existing compute_actor_logprobs pattern by delegating logprob computation to LogProbContext instead of manually calling gather_logprobs on full logits. This is semantically equivalent, but gains chunked-mode memory efficiency and reduces code duplication.

Issues

Nit

  • [worker.py:393] Minor naming: local loss_ctx could be ref_loss_ctx or logprob_ctx for clarity vs. the dict key.

Process

  • PR body is empty — per CLAUDE.md, bug fix PRs should include Root Cause, Fix, and Test Plan sections. Even for a straightforward change like this, documenting why the old code was wrong (e.g. OOM on large vocabs? inconsistency with actor path?) helps future readers.
  • No regression test — CLAUDE.md requires bug fix PRs to include a test that reproduces the original bug.

Verdict

APPROVE — the code change itself is correct and well-motivated.

@hhaAndroid
Collaborator Author

@claude please approve it

Contributor

@claude claude bot left a comment

Code change is correct — aligns compute_ref_logprobs with the existing compute_actor_logprobs pattern by delegating logprob computation to LogProbContext.

@claude
Contributor

claude bot commented Apr 8, 2026

Claude: Approved the PR as requested. The code change correctly aligns compute_ref_logprobs with the existing compute_actor_logprobs pattern.
