
fix: retry fallback for channel affinity failures#4053

Open
rty813 wants to merge 2 commits into QuantumNous:main from rty813:fix/affinity-retry-fallback

Conversation


@rty813 rty813 commented Apr 2, 2026

⚠️ PR Warning

Please note: provide a concise, human-written summary. PRs stuffed with AI filler, with incoherent logic, or that ignore the template may be ignored or closed outright.


💡 Pre-submission

Major feature change? Please open an Issue first to discuss it and avoid wasted effort.

📝 Description

After affinity successfully pins a channel, if that channel later fails or is disabled, the original logic never retries - even when "skip retry on failure" is not checked - as shown below:
image

This PR makes the following changes:

  1. middleware/distributor.go - add retry fallback logic to the distributor middleware
  • When the preferred channel selected by channel affinity is unavailable (disabled, missing from cache, or SetupContextForSelectedChannel fails), instead of erroring out immediately:
    • Call ClearMatchedChannelAffinity to remove the stale affinity cache entry
    • Check ShouldSkipRetryForMatchedChannelAffinityRule to decide whether retrying is allowed
    • If retrying is allowed, enter the new setupDistributedChannel function, which loops over other eligible channels (up to RetryTimes attempts)
  • New setupDistributedChannel function: encapsulates the full "preferred channel setup failed, so retry with randomly selected fallback channels" flow
  2. service/channel_affinity.go - two new service functions
  • ShouldSkipRetryForMatchedChannelAffinityRule: reads the affinity rule meta to decide whether retries should be skipped
  • ClearMatchedChannelAffinity: removes the affinity cache entry matched by the current request
  3. service/channel_affinity_template_test.go - added tests
  • Unit tests for the newly extracted ShouldSkipRetryForMatchedChannelAffinityRule and ClearMatchedChannelAffinity

🚀 Type of change

  • 🐛 Bug fix
  • ✨ New feature - major features should be discussed in an Issue first
  • ⚡ Performance / refactor
  • 📝 Documentation

🔗 Related Issue

✅ Checklist

  • Human-written: I wrote this description myself and removed redundant raw AI output.
  • Understanding: I fully understand how these changes work and their potential impact.
  • Focused scope: this PR contains no code changes unrelated to the task at hand.
  • Local verification: tests or manual verification passed locally.
  • Security and compliance: no sensitive credentials in the code; project code conventions followed.

📸 Proof of Work

# go test ./service ./middleware ./controller
go: downloading github.com/stripe/stripe-go/v81 v81.4.0
go: downloading github.com/thanhpk/randstr v1.0.6
go: downloading github.com/waffo-com/waffo-go v1.3.1
go: downloading github.com/stretchr/testify v1.11.1
go: downloading modernc.org/sqlite v1.40.1
go: downloading github.com/aws/aws-sdk-go-v2 v1.41.2
go: downloading github.com/aws/aws-sdk-go-v2/credentials v1.19.10
go: downloading github.com/aws/smithy-go v1.24.2
go: downloading github.com/aws/aws-sdk-go-v2/service/bedrockruntime v1.50.0
go: downloading github.com/klauspost/compress v1.18.0
go: downloading github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2
go: downloading github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
go: downloading github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.5
go: downloading github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.18
go: downloading github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.18
ok      github.com/QuantumNous/new-api/service  0.095s
?       github.com/QuantumNous/new-api/middleware       [no test files]
ok      github.com/QuantumNous/new-api/controller       0.056s

Effect after the change:
image

Summary by CodeRabbit

  • Improvements

    • Improved channel distribution with more reliable fallback selection and controlled retry behavior when initial setup fails
    • Clearer tracking and handling of preferred-channel affinity, including automatic clearing of stale affinity when selections are invalid or disabled
    • Better auto-group selection flow and more predictable affinity usage reporting
  • Tests

    • Added tests covering retry-skip behavior and clearing of matched channel affinity


coderabbitai bot commented Apr 2, 2026

Walkthrough

Refines channel-affinity handling and fallback: adds a dedicated setup helper, avoids redundant context setup, clears stale affinity when appropriate, and retries fallback channel selection up to configured attempts. Also adds/adjusts service helpers for reading and clearing matched affinity and expands tests.

Changes

  • Distributor channel setup & affinity logic (middleware/distributor.go): Introduced channelContextReady and affinityPreferredSelected. Added setupDistributedChannel(...) to call SetupContextForSelectedChannel, optionally clear stale affinity, and retry random fallback channels up to common.RetryTimes. Updated Distribute() to use this helper and abort on returned API errors.
  • Channel affinity service functions (service/channel_affinity.go): Modified ShouldSkipRetryAfterChannelAffinityFailure to return false when the gin context flag is absent or invalid. Added ClearMatchedChannelAffinity(*gin.Context) bool to compute and delete the matched affinity cache key via cache.DeleteMany and report deletion success.
  • Channel affinity tests (service/channel_affinity_template_test.go): Updated test cases for skip-retry semantics and added TestClearMatchedChannelAffinity, which writes a cache key and asserts ClearMatchedChannelAffinity removes it. Adjusted expectations for scenarios where no skip flag exists.
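The clearing behavior summarized above can be illustrated with a self-contained sketch. The in-memory map, deleteMany helper, and the key format string are all stand-ins invented for illustration; the real code uses the project's hybrid cache and its own key derivation.

```go
package main

import "fmt"

// in-memory stand-in for the hybrid cache used by the service package
var cache = map[string]string{}

// deleteMany mimics a DeleteMany-style API: it reports, per key, whether
// the key existed and was removed.
func deleteMany(keys ...string) map[string]bool {
	deleted := map[string]bool{}
	for _, k := range keys {
		_, ok := cache[k]
		if ok {
			delete(cache, k)
		}
		deleted[k] = ok
	}
	return deleted
}

// clearMatchedChannelAffinity sketches ClearMatchedChannelAffinity: compute
// the cache key for the matched rule and delete it, reporting whether an
// entry was actually removed. The key format here is hypothetical.
func clearMatchedChannelAffinity(ruleID, tokenKey string) bool {
	cacheKey := fmt.Sprintf("channel_affinity:%s:%s", ruleID, tokenKey)
	return deleteMany(cacheKey)[cacheKey]
}

func main() {
	cache["channel_affinity:rule-1:tok-a"] = "42" // pinned channel id
	fmt.Println(clearMatchedChannelAffinity("rule-1", "tok-a")) // true: entry removed
	fmt.Println(clearMatchedChannelAffinity("rule-1", "tok-a")) // false: already gone
}
```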

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Distributor
    participant Cache
    participant ChannelModel as Channel Model
    participant Setup as SetupContextForSelectedChannel
    participant Fallback as Fallback Selector

    Client->>Distributor: Request arrives (with affinity metadata)
    Distributor->>Cache: Lookup matched preferred channel
    Cache-->>Distributor: Return preferred channel (or nil)
    Distributor->>ChannelModel: Select initial channel (preferred or chosen)
    Distributor->>Setup: setupDistributedChannel(channel, ...)
    alt Setup success
        Setup-->>Distributor: Context ready
        Distributor-->>Client: Proceed
    else Setup fails
        Setup-->>Distributor: Error
        Distributor->>Cache: Maybe ClearMatchedChannelAffinity()
        alt Allow retry
            Distributor->>Fallback: Select random fallback (avoid seen IDs)
            Fallback-->>Distributor: Fallback channel
            Distributor->>Setup: Retry setup with fallback (repeat up to RetryTimes)
            alt Any retry succeeds
                Setup-->>Distributor: Context ready
                Distributor-->>Client: Proceed
            else All retries fail
                Distributor-->>Client: Return final API error
            end
        else Skip retry
            Distributor-->>Client: Return API error
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #3333: Modifies channel-affinity handling in both middleware/distributor.go and service/channel_affinity.go, overlapping with skip-retry and preferred-channel logic changes.
  • PR #3413: Addresses stale channel-affinity blocking fallback for Codex/Claude; closely related to the intent of clearing stale affinity before fallback.

Poem

🐰 Hops and whiskers, sniff the trail,

stale keys cleared so fallbacks prevail.
I nudge the cache, a tiny drum,
retries hop in till progress comes,
carrot-coded fixes — hurray, yum!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title directly addresses the main change: retry fallback logic for channel affinity failures, the core objective of the PR.
  • Linked Issues Check (✅ Passed): the changes implement the requirements from issue #3414: clearing stale affinity cache, preventing SkipRetryOnFailure from blocking fallback on unavailable preferred channels, and adding setupDistributedChannel to retry with other eligible channels.
  • Out of Scope Changes Check (✅ Passed): all changes are scoped to the affinity retry fallback fix; no unrelated modifications or unnecessary refactoring.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
middleware/distributor.go (1)

205-238: Consider early exit when all eligible channels exhausted.

The retry loop continues iterating until RetryTimes even when skipping already-tried channels (lines 225-227). If CacheGetRandomSatisfiedChannel repeatedly returns channels that have already been tried, this wastes iterations.

Consider tracking when no new channels are available:

♻️ Suggested improvement
 	triedChannelIDs := make(map[int]struct{}, common.RetryTimes+1)
 	if channel != nil && channel.Id > 0 {
 		triedChannelIDs[channel.Id] = struct{}{}
 	}

 	retryParam := &service.RetryParam{
 		Ctx:        c,
 		ModelName:  modelName,
 		TokenGroup: tokenGroup,
 		Retry:      common.GetPointer(0),
 	}
 	lastErr := setupErr
+	consecutiveSkips := 0
 	for ; retryParam.GetRetry() <= common.RetryTimes; retryParam.IncreaseRetry() {
 		fallbackChannel, selectGroup, err := service.CacheGetRandomSatisfiedChannel(retryParam)
 		if err != nil {
 			return nil, types.NewError(err, types.ErrorCodeGetChannelFailed, types.ErrOptionWithSkipRetry())
 		}
 		if fallbackChannel == nil {
 			break
 		}
 		if _, exists := triedChannelIDs[fallbackChannel.Id]; exists {
+			consecutiveSkips++
+			if consecutiveSkips > len(triedChannelIDs)+1 {
+				break // likely exhausted all eligible channels
+			}
 			continue
 		}
+		consecutiveSkips = 0
 		triedChannelIDs[fallbackChannel.Id] = struct{}{}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@middleware/distributor.go` around lines 205 - 238, The loop can waste
iterations if CacheGetRandomSatisfiedChannel repeatedly returns already-tried
channels; update the logic around triedChannelIDs and the continue branch
(symbols: triedChannelIDs, CacheGetRandomSatisfiedChannel,
retryParam.GetRetry(), common.RetryTimes) to detect when no new channels remain
and break early: add a small attemptsWithoutNew counter that increments when you
hit the "if _, exists := triedChannelIDs[fallbackChannel.Id]; exists { continue
}" case and resets when a new channel is accepted, and break the loop (return
lastErr) once attemptsWithoutNew exceeds a threshold (e.g., number of tried IDs
or common.RetryTimes) or when the returned selectGroup exposes a total/eligible
channel count and len(triedChannelIDs) >= that count; this prevents wasted
retries and returns the last setup error when exhausted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@middleware/distributor.go`:
- Around line 198-203: The current logic returns nil and setupErr only if
service.ShouldSkipRetryForMatchedChannelAffinityRule(c) is true after clearing
affinity; instead, remove the skip-retry check from this failure path so
transient SetupContextForSelectedChannel failures still allow normal
retry/fallback. Specifically, in the block that runs when
affinityPreferredSelected and SetupContextForSelectedChannel produced setupErr,
keep the call to service.ClearMatchedChannelAffinity(c) but do not call or
branch on service.ShouldSkipRetryForMatchedChannelAffinityRule(c); simply return
nil, setupErr so the retry flow can handle transient errors.
- Around line 109-114: The current branch treats a disabled preferred channel as
a stale affinity but then immediately honors
service.ShouldSkipRetryForMatchedChannelAffinityRule and aborts, which prevents
fallback; change the logic so that when preferred.Status !=
common.ChannelStatusEnabled you call service.ClearMatchedChannelAffinity(c) and
do NOT consult ShouldSkipRetryForMatchedChannelAffinityRule here—only evaluate
skip-retry after a channel has been successfully selected and an actual upstream
call fails (e.g., after SetupContextForSelectedChannel and an upstream error).
In practice remove or move the ShouldSkipRetryForMatchedChannelAffinityRule
check from the disabled-channel branch (and the abortWithOpenAiMessage call) so
stale affinities allow normal fallback to other channels while preserving
skip-retry behavior for genuine upstream failures.

In `@service/channel_affinity.go`:
- Around line 633-649: Clear up the cache key format mismatch by making all
cache operations use the raw, unprefixed key (cacheKeySuffix) consistently: in
ClearMatchedChannelAffinity (and any calls using cache.DeleteMany) pass
cacheKeySuffix instead of cacheKey, so DeleteMany, Get
(cache.Get(cacheKeySuffix)), and SetWithTTL (replace cache.SetWithTTL(cacheKey,
...) to use cache.SetWithTTL(cacheKeySuffix, ...)) all use the same identifier;
locate getChannelAffinityContext to obtain cacheKeySuffix, and update usages
around ClearMatchedChannelAffinity, getChannelAffinityCache, cache.DeleteMany,
cache.Get, and cache.SetWithTTL so the key format is uniform.

---

Nitpick comments:
In `@middleware/distributor.go`:
- Around line 205-238: The loop can waste iterations if
CacheGetRandomSatisfiedChannel repeatedly returns already-tried channels; update
the logic around triedChannelIDs and the continue branch (symbols:
triedChannelIDs, CacheGetRandomSatisfiedChannel, retryParam.GetRetry(),
common.RetryTimes) to detect when no new channels remain and break early: add a
small attemptsWithoutNew counter that increments when you hit the "if _, exists
:= triedChannelIDs[fallbackChannel.Id]; exists { continue }" case and resets
when a new channel is accepted, and break the loop (return lastErr) once
attemptsWithoutNew exceeds a threshold (e.g., number of tried IDs or
common.RetryTimes) or when the returned selectGroup exposes a total/eligible
channel count and len(triedChannelIDs) >= that count; this prevents wasted
retries and returns the last setup error when exhausted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a987c207-8f87-4809-b0ec-02ce6046bbc8

📥 Commits

Reviewing files that changed from the base of the PR and between 41cd051 and ebe4f44.

📒 Files selected for processing (3)
  • middleware/distributor.go
  • service/channel_affinity.go
  • service/channel_affinity_template_test.go


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
service/channel_affinity.go (1)

622-638: Return value is ambiguous for callers checking deletion success.

Per the DeleteMany implementation in pkg/cachex/hybrid_cache.go:263-265, deleted[cacheKey] is false both when deletion fails AND when the key didn't exist (e.g., already expired). Since callers in distributor.go don't use the return value for control flow, this is not a functional issue, but consider documenting this behavior or returning a sentinel value if future callers need to distinguish these cases.
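The suggested disambiguation can be sketched as a standalone snippet: return (bool, error) so a backend failure is distinguishable from an absent or expired key. The store map, cacheDown flag, and deleteKey helper are invented stand-ins for the hybrid cache, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

var store = map[string]string{}
var cacheDown = false // simulates a cache backend outage

// deleteKey stands in for a single-key DeleteMany call; it can fail.
func deleteKey(key string) (existed bool, err error) {
	if cacheDown {
		return false, errors.New("cache backend unavailable")
	}
	_, existed = store[key]
	delete(store, key)
	return existed, nil
}

// clearAffinity returns (deleted, err) so callers can tell a cache error
// apart from a key that was simply absent or already expired.
func clearAffinity(key string) (bool, error) {
	existed, err := deleteKey(key)
	if err != nil {
		return false, err // cache error: caller may log or retry
	}
	return existed, nil // (false, nil) means the key was absent/expired
}

func main() {
	store["affinity:k"] = "7"
	ok, err := clearAffinity("affinity:k")
	fmt.Println(ok, err) // deleted: true <nil>
	ok, err = clearAffinity("affinity:k")
	fmt.Println(ok, err) // absent: false <nil>
	cacheDown = true
	ok, err = clearAffinity("affinity:k")
	fmt.Println(ok, err != nil) // backend error: false true
}
```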

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/channel_affinity.go` around lines 622 - 638,
ClearMatchedChannelAffinity currently returns a single bool that is ambiguous
because cache.DeleteMany's deleted[cacheKey] is false both when deletion failed
and when the key did not exist; change the function to return (bool, error) (or
document the ambiguity) so callers can distinguish a cache error from a
non-existent key: update ClearMatchedChannelAffinity signature to return
(deleted bool, err error), propagate the underlying error from cache.DeleteMany
when err != nil, and return (false, nil) when DeleteMany succeeds but
deleted[cacheKey] is false (meaning key was absent/expired); update all callers
(e.g., in distributor.go) to handle the new return values and add a short
comment in getChannelAffinityContext/getChannelAffinityCache usage explaining
the semantics.
middleware/distributor.go (1)

210-231: Retry budget consumed by duplicate channel IDs.

When GetRandomSatisfiedChannel returns a channel already in triedChannelIDs, the continue still consumes a retry iteration. If the channel pool is small or randomization repeatedly returns the same channel, this exhausts retries without actually trying distinct channels.

Consider not incrementing the retry counter on duplicate hits:

♻️ Suggested approach
 	lastErr := setupErr
-	for ; retryParam.GetRetry() <= common.RetryTimes; retryParam.IncreaseRetry() {
+	for retryParam.GetRetry() <= common.RetryTimes {
 		fallbackChannel, selectGroup, err := service.CacheGetRandomSatisfiedChannel(retryParam)
 		if err != nil {
 			return nil, types.NewError(err, types.ErrorCodeGetChannelFailed, types.ErrOptionWithSkipRetry())
 		}
 		if fallbackChannel == nil {
 			break
 		}
 		if _, exists := triedChannelIDs[fallbackChannel.Id]; exists {
 			continue
 		}
 		triedChannelIDs[fallbackChannel.Id] = struct{}{}
+		retryParam.IncreaseRetry()
 
 		setupErr = SetupContextForSelectedChannel(c, fallbackChannel, modelName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@middleware/distributor.go` around lines 210 - 231, The loop consumes a retry
on duplicate channel hits because the for-loop increments via
retryParam.IncreaseRetry() on every continue; change behavior so duplicates
don't count against the retry budget by preventing the retry counter from
advancing when a duplicate is encountered. Concretely, modify the logic around
service.CacheGetRandomSatisfiedChannel and the triedChannelIDs check: when a
fallbackChannel.Id is already in triedChannelIDs, do not let
retryParam.IncreaseRetry() run (e.g., convert the for-loop to manual
incrementing or call a retryParam.DecreaseRetry()/undo method before continue),
ensuring only actual channel setup attempts (where you call
SetupContextForSelectedChannel and possibly service.MarkChannelAffinityUsed)
increment the retry counter; keep references to retryParam.GetRetry(),
retryParam.IncreaseRetry(), service.CacheGetRandomSatisfiedChannel,
triedChannelIDs, and SetupContextForSelectedChannel so the change targets the
right spots.
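The budget-preserving behavior this comment asks for can be modeled in isolation: only genuine setup attempts consume the retry budget, while duplicate channel IDs are skipped for free. The sequence slice below simulates the random selector's return order; all names are illustrative stand-ins, not the project's API.

```go
package main

import "fmt"

// pickWithBudget sketches the suggested change: duplicates returned by the
// random selector do not advance the retry counter; only real setup
// attempts do.
func pickWithBudget(sequence []int, retryTimes int, setup func(int) bool) (int, int) {
	tried := map[int]struct{}{}
	attempts := 0
	for _, id := range sequence {
		if attempts > retryTimes {
			break // retry budget exhausted
		}
		if _, seen := tried[id]; seen {
			continue // duplicate: do not spend a retry
		}
		tried[id] = struct{}{}
		attempts++ // budget spent only on a real attempt
		if setup(id) {
			return id, attempts
		}
	}
	return -1, attempts
}

func main() {
	// Selector keeps returning channel 1 before finally yielding channel 2.
	seq := []int{1, 1, 1, 1, 2}
	id, attempts := pickWithBudget(seq, 3, func(id int) bool { return id == 2 })
	fmt.Println(id, attempts) // channel 2 found after only 2 real attempts
}
```

With the original for-loop increment, the three duplicate hits on channel 1 would have burned the whole budget before channel 2 was ever tried.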
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@middleware/distributor.go`:
- Around line 210-231: The loop consumes a retry on duplicate channel hits
because the for-loop increments via retryParam.IncreaseRetry() on every
continue; change behavior so duplicates don't count against the retry budget by
preventing the retry counter from advancing when a duplicate is encountered.
Concretely, modify the logic around service.CacheGetRandomSatisfiedChannel and
the triedChannelIDs check: when a fallbackChannel.Id is already in
triedChannelIDs, do not let retryParam.IncreaseRetry() run (e.g., convert the
for-loop to manual incrementing or call a retryParam.DecreaseRetry()/undo method
before continue), ensuring only actual channel setup attempts (where you call
SetupContextForSelectedChannel and possibly service.MarkChannelAffinityUsed)
increment the retry counter; keep references to retryParam.GetRetry(),
retryParam.IncreaseRetry(), service.CacheGetRandomSatisfiedChannel,
triedChannelIDs, and SetupContextForSelectedChannel so the change targets the
right spots.

In `@service/channel_affinity.go`:
- Around line 622-638: ClearMatchedChannelAffinity currently returns a single
bool that is ambiguous because cache.DeleteMany's deleted[cacheKey] is false
both when deletion failed and when the key did not exist; change the function to
return (bool, error) (or document the ambiguity) so callers can distinguish a
cache error from a non-existent key: update ClearMatchedChannelAffinity
signature to return (deleted bool, err error), propagate the underlying error
from cache.DeleteMany when err != nil, and return (false, nil) when DeleteMany
succeeds but deleted[cacheKey] is false (meaning key was absent/expired); update
all callers (e.g., in distributor.go) to handle the new return values and add a
short comment in getChannelAffinityContext/getChannelAffinityCache usage
explaining the semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 905cad35-923e-4510-97d3-71b606ec7541

📥 Commits

Reviewing files that changed from the base of the PR and between ebe4f44 and 36bd17b.

📒 Files selected for processing (3)
  • middleware/distributor.go
  • service/channel_affinity.go
  • service/channel_affinity_template_test.go


Development

Successfully merging this pull request may close these issues.

Codex/Claude: a stale channel affinity pointing at a dead channel blocks normal fallback

1 participant