fix: retry fallback for channel affinity failures #4053
rty813 wants to merge 2 commits into QuantumNous:main from
Conversation
Walkthrough
Refines channel-affinity handling and fallback: adds a dedicated setup helper, avoids redundant context setup, clears stale affinity when appropriate, and retries fallback channel selection up to the configured number of attempts. Also adds and adjusts service helpers for reading and clearing matched affinity, and expands tests.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Distributor
    participant Cache
    participant ChannelModel as Channel Model
    participant Setup as SetupContextForSelectedChannel
    participant Fallback as Fallback Selector

    Client->>Distributor: Request arrives (with affinity metadata)
    Distributor->>Cache: Lookup matched preferred channel
    Cache-->>Distributor: Return preferred channel (or nil)
    Distributor->>ChannelModel: Select initial channel (preferred or chosen)
    Distributor->>Setup: setupDistributedChannel(channel, ...)
    alt Setup success
        Setup-->>Distributor: Context ready
        Distributor-->>Client: Proceed
    else Setup fails
        Setup-->>Distributor: Error
        Distributor->>Cache: Maybe ClearMatchedChannelAffinity()
        alt Allow retry
            Distributor->>Fallback: Select random fallback (avoid seen IDs)
            Fallback-->>Distributor: Fallback channel
            Distributor->>Setup: Retry setup with fallback (repeat up to RetryTimes)
            alt Any retry succeeds
                Setup-->>Distributor: Context ready
                Distributor-->>Client: Proceed
            else All retries fail
                Distributor-->>Client: Return final API error
            end
        else Skip retry
            Distributor-->>Client: Return API error
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
middleware/distributor.go (1)
205-238: Consider an early exit when all eligible channels are exhausted.

The retry loop continues iterating until `RetryTimes` even when skipping already-tried channels (lines 225-227). If `CacheGetRandomSatisfiedChannel` repeatedly returns channels that have already been tried, this wastes iterations. Consider tracking when no new channels are available:
♻️ Suggested improvement
```diff
 triedChannelIDs := make(map[int]struct{}, common.RetryTimes+1)
 if channel != nil && channel.Id > 0 {
 	triedChannelIDs[channel.Id] = struct{}{}
 }
 retryParam := &service.RetryParam{
 	Ctx:        c,
 	ModelName:  modelName,
 	TokenGroup: tokenGroup,
 	Retry:      common.GetPointer(0),
 }
 lastErr := setupErr
+consecutiveSkips := 0
 for ; retryParam.GetRetry() <= common.RetryTimes; retryParam.IncreaseRetry() {
 	fallbackChannel, selectGroup, err := service.CacheGetRandomSatisfiedChannel(retryParam)
 	if err != nil {
 		return nil, types.NewError(err, types.ErrorCodeGetChannelFailed, types.ErrOptionWithSkipRetry())
 	}
 	if fallbackChannel == nil {
 		break
 	}
 	if _, exists := triedChannelIDs[fallbackChannel.Id]; exists {
+		consecutiveSkips++
+		if consecutiveSkips > len(triedChannelIDs)+1 {
+			break // likely exhausted all eligible channels
+		}
 		continue
 	}
+	consecutiveSkips = 0
 	triedChannelIDs[fallbackChannel.Id] = struct{}{}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@middleware/distributor.go` around lines 205 - 238, The loop can waste iterations if CacheGetRandomSatisfiedChannel repeatedly returns already-tried channels; update the logic around triedChannelIDs and the continue branch (symbols: triedChannelIDs, CacheGetRandomSatisfiedChannel, retryParam.GetRetry(), common.RetryTimes) to detect when no new channels remain and break early: add a small attemptsWithoutNew counter that increments when you hit the "if _, exists := triedChannelIDs[fallbackChannel.Id]; exists { continue }" case and resets when a new channel is accepted, and break the loop (return lastErr) once attemptsWithoutNew exceeds a threshold (e.g., number of tried IDs or common.RetryTimes) or when the returned selectGroup exposes a total/eligible channel count and len(triedChannelIDs) >= that count; this prevents wasted retries and returns the last setup error when exhausted.
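The early-exit pattern the comment describes can be sketched as a small, self-contained Go program. Every name here (`exhaustedAttempts`, `candidates`, `tried`, `maxRetries`) is a hypothetical stand-in for the PR's `CacheGetRandomSatisfiedChannel`, `triedChannelIDs`, and `common.RetryTimes`, not the project's actual code:

```go
package main

import "fmt"

// exhaustedAttempts counts genuine setup attempts while skipping
// already-tried channel IDs, and breaks early once consecutive skips
// suggest the eligible pool is exhausted.
func exhaustedAttempts(candidates []int, tried map[int]struct{}, maxRetries int) int {
	attempts := 0
	consecutiveSkips := 0
	for retry := 0; retry <= maxRetries; retry++ {
		id := candidates[retry%len(candidates)] // stand-in for the random picker
		if _, seen := tried[id]; seen {
			consecutiveSkips++
			if consecutiveSkips > len(tried)+1 {
				break // likely all eligible channels have been tried
			}
			continue
		}
		consecutiveSkips = 0
		tried[id] = struct{}{}
		attempts++ // a real SetupContextForSelectedChannel call would go here
	}
	return attempts
}

func main() {
	// The picker keeps returning channel 7, which was already tried:
	// the loop exits after a few skips instead of burning the full budget.
	fmt.Println(exhaustedAttempts([]int{7, 7, 7, 7, 7, 7}, map[int]struct{}{7: {}}, 5)) // 0
}
```

With a five-retry budget and only duplicate picks, the loop stops after three skips rather than six, which is the wasted-iteration savings the comment is after.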
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@middleware/distributor.go`:
- Around line 198-203: The current logic returns nil and setupErr only if
service.ShouldSkipRetryForMatchedChannelAffinityRule(c) is true after clearing
affinity; instead, remove the skip-retry check from this failure path so
transient SetupContextForSelectedChannel failures still allow normal
retry/fallback. Specifically, in the block that runs when
affinityPreferredSelected and SetupContextForSelectedChannel produced setupErr,
keep the call to service.ClearMatchedChannelAffinity(c) but do not call or
branch on service.ShouldSkipRetryForMatchedChannelAffinityRule(c); simply return
nil, setupErr so the retry flow can handle transient errors.
- Around line 109-114: The current branch treats a disabled preferred channel as
a stale affinity but then immediately honors
service.ShouldSkipRetryForMatchedChannelAffinityRule and aborts, which prevents
fallback; change the logic so that when preferred.Status !=
common.ChannelStatusEnabled you call service.ClearMatchedChannelAffinity(c) and
do NOT consult ShouldSkipRetryForMatchedChannelAffinityRule here—only evaluate
skip-retry after a channel has been successfully selected and an actual upstream
call fails (e.g., after SetupContextForSelectedChannel and an upstream error).
In practice remove or move the ShouldSkipRetryForMatchedChannelAffinityRule
check from the disabled-channel branch (and the abortWithOpenAiMessage call) so
stale affinities allow normal fallback to other channels while preserving
skip-retry behavior for genuine upstream failures.
In `@service/channel_affinity.go`:
- Around line 633-649: Clear up the cache key format mismatch by making all
cache operations use the raw, unprefixed key (cacheKeySuffix) consistently: in
ClearMatchedChannelAffinity (and any calls using cache.DeleteMany) pass
cacheKeySuffix instead of cacheKey, so DeleteMany, Get
(cache.Get(cacheKeySuffix)), and SetWithTTL (replace cache.SetWithTTL(cacheKey,
...) to use cache.SetWithTTL(cacheKeySuffix, ...)) all use the same identifier;
locate getChannelAffinityContext to obtain cacheKeySuffix, and update usages
around ClearMatchedChannelAffinity, getChannelAffinityCache, cache.DeleteMany,
cache.Get, and cache.SetWithTTL so the key format is uniform.
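The "derive the key once, pass the same form everywhere" fix can be illustrated with a toy sketch; `affinityKey`, its prefix, and the map-backed store are illustrative stand-ins, not the project's real key format or cache:

```go
package main

import "fmt"

// affinityKey builds the cache key once so that every operation
// (set, get, delete) sees the same identifier. The prefix and parts
// here are hypothetical, not the project's actual format.
func affinityKey(tokenGroup, modelName string) string {
	return "channel_affinity:" + tokenGroup + ":" + modelName
}

func main() {
	key := affinityKey("default", "gpt-4")
	store := map[string]int{} // stand-in for the hybrid cache

	store[key] = 42    // SetWithTTL would use the derived key...
	v := store[key]    // ...Get uses the same key...
	delete(store, key) // ...and DeleteMany too, so deletes can't silently miss
	_, stillThere := store[key]

	fmt.Println(v, stillThere) // 42 false
}
```

The bug class being avoided is writing under one form (prefixed) and deleting under another (suffix only), which leaves stale affinity entries behind.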
---
Nitpick comments:
In `@middleware/distributor.go`:
- Around line 205-238: The loop can waste iterations if
CacheGetRandomSatisfiedChannel repeatedly returns already-tried channels; update
the logic around triedChannelIDs and the continue branch (symbols:
triedChannelIDs, CacheGetRandomSatisfiedChannel, retryParam.GetRetry(),
common.RetryTimes) to detect when no new channels remain and break early: add a
small attemptsWithoutNew counter that increments when you hit the "if _, exists
:= triedChannelIDs[fallbackChannel.Id]; exists { continue }" case and resets
when a new channel is accepted, and break the loop (return lastErr) once
attemptsWithoutNew exceeds a threshold (e.g., number of tried IDs or
common.RetryTimes) or when the returned selectGroup exposes a total/eligible
channel count and len(triedChannelIDs) >= that count; this prevents wasted
retries and returns the last setup error when exhausted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a987c207-8f87-4809-b0ec-02ce6046bbc8
📒 Files selected for processing (3)
middleware/distributor.go, service/channel_affinity.go, service/channel_affinity_template_test.go
🧹 Nitpick comments (2)
service/channel_affinity.go (1)
622-638: Return value is ambiguous for callers checking deletion success.

Per the `DeleteMany` implementation in `pkg/cachex/hybrid_cache.go:263-265`, `deleted[cacheKey]` is `false` both when deletion fails AND when the key didn't exist (e.g., already expired). Since callers in `distributor.go` don't use the return value for control flow, this is not a functional issue, but consider documenting this behavior or returning a sentinel value if future callers need to distinguish these cases.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/channel_affinity.go` around lines 622 - 638, ClearMatchedChannelAffinity currently returns a single bool that is ambiguous because cache.DeleteMany's deleted[cacheKey] is false both when deletion failed and when the key did not exist; change the function to return (bool, error) (or document the ambiguity) so callers can distinguish a cache error from a non-existent key: update ClearMatchedChannelAffinity signature to return (deleted bool, err error), propagate the underlying error from cache.DeleteMany when err != nil, and return (false, nil) when DeleteMany succeeds but deleted[cacheKey] is false (meaning key was absent/expired); update all callers (e.g., in distributor.go) to handle the new return values and add a short comment in getChannelAffinityContext/getChannelAffinityCache usage explaining the semantics.
middleware/distributor.go (1)
210-231: Retry budget consumed by duplicate channel IDs.

When `GetRandomSatisfiedChannel` returns a channel already in `triedChannelIDs`, the `continue` still consumes a retry iteration. If the channel pool is small or randomization repeatedly returns the same channel, this exhausts retries without actually trying distinct channels. Consider not incrementing the retry counter on duplicate hits:
♻️ Suggested approach
```diff
 lastErr := setupErr
-for ; retryParam.GetRetry() <= common.RetryTimes; retryParam.IncreaseRetry() {
+for retryParam.GetRetry() <= common.RetryTimes {
 	fallbackChannel, selectGroup, err := service.CacheGetRandomSatisfiedChannel(retryParam)
 	if err != nil {
 		return nil, types.NewError(err, types.ErrorCodeGetChannelFailed, types.ErrOptionWithSkipRetry())
 	}
 	if fallbackChannel == nil {
 		break
 	}
 	if _, exists := triedChannelIDs[fallbackChannel.Id]; exists {
 		continue
 	}
 	triedChannelIDs[fallbackChannel.Id] = struct{}{}
+	retryParam.IncreaseRetry()
 	setupErr = SetupContextForSelectedChannel(c, fallbackChannel, modelName)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@middleware/distributor.go` around lines 210 - 231, The loop consumes a retry on duplicate channel hits because the for-loop increments via retryParam.IncreaseRetry() on every continue; change behavior so duplicates don't count against the retry budget by preventing the retry counter from advancing when a duplicate is encountered. Concretely, modify the logic around service.CacheGetRandomSatisfiedChannel and the triedChannelIDs check: when a fallbackChannel.Id is already in triedChannelIDs, do not let retryParam.IncreaseRetry() run (e.g., convert the for-loop to manual incrementing or call a retryParam.DecreaseRetry()/undo method before continue), ensuring only actual channel setup attempts (where you call SetupContextForSelectedChannel and possibly service.MarkChannelAffinityUsed) increment the retry counter; keep references to retryParam.GetRetry(), retryParam.IncreaseRetry(), service.CacheGetRandomSatisfiedChannel, triedChannelIDs, and SetupContextForSelectedChannel so the change targets the right spots.
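The alternative the comment suggests, advancing the retry counter only on genuine attempts so duplicates don't eat the budget, can be sketched as a self-contained Go program. `attemptedIDs`, `picks`, and `maxRetries` are hypothetical stand-ins for the picker, its results, and `common.RetryTimes`:

```go
package main

import "fmt"

// attemptedIDs returns the distinct channel IDs actually attempted.
// Duplicate picks hit `continue` without advancing the retry counter,
// so the budget is spent only on real setup attempts.
func attemptedIDs(picks []int, maxRetries int) []int {
	tried := map[int]struct{}{}
	var attempted []int
	retry := 0
	i := 0
	for retry <= maxRetries && i < len(picks) {
		id := picks[i]
		i++
		if _, seen := tried[id]; seen {
			continue // duplicate: no retry consumed
		}
		tried[id] = struct{}{}
		retry++ // only a genuine attempt advances the budget
		attempted = append(attempted, id)
	}
	return attempted
}

func main() {
	// Five picks with two duplicates: all three distinct channels
	// still get tried within a budget of three attempts.
	fmt.Println(attemptedIDs([]int{1, 1, 2, 2, 3}, 2)) // [1 2 3]
}
```

Under the original loop shape, the same pick sequence would have consumed retries on the duplicate hits and never reached channel 3.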
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@middleware/distributor.go`:
- Around line 210-231: The loop consumes a retry on duplicate channel hits
because the for-loop increments via retryParam.IncreaseRetry() on every
continue; change behavior so duplicates don't count against the retry budget by
preventing the retry counter from advancing when a duplicate is encountered.
Concretely, modify the logic around service.CacheGetRandomSatisfiedChannel and
the triedChannelIDs check: when a fallbackChannel.Id is already in
triedChannelIDs, do not let retryParam.IncreaseRetry() run (e.g., convert the
for-loop to manual incrementing or call a retryParam.DecreaseRetry()/undo method
before continue), ensuring only actual channel setup attempts (where you call
SetupContextForSelectedChannel and possibly service.MarkChannelAffinityUsed)
increment the retry counter; keep references to retryParam.GetRetry(),
retryParam.IncreaseRetry(), service.CacheGetRandomSatisfiedChannel,
triedChannelIDs, and SetupContextForSelectedChannel so the change targets the
right spots.
In `@service/channel_affinity.go`:
- Around line 622-638: ClearMatchedChannelAffinity currently returns a single
bool that is ambiguous because cache.DeleteMany's deleted[cacheKey] is false
both when deletion failed and when the key did not exist; change the function to
return (bool, error) (or document the ambiguity) so callers can distinguish a
cache error from a non-existent key: update ClearMatchedChannelAffinity
signature to return (deleted bool, err error), propagate the underlying error
from cache.DeleteMany when err != nil, and return (false, nil) when DeleteMany
succeeds but deleted[cacheKey] is false (meaning key was absent/expired); update
all callers (e.g., in distributor.go) to handle the new return values and add a
short comment in getChannelAffinityContext/getChannelAffinityCache usage
explaining the semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 905cad35-923e-4510-97d3-71b606ec7541
📒 Files selected for processing (3)
middleware/distributor.go, service/channel_affinity.go, service/channel_affinity_template_test.go
💡 Pre-submission communication tips
📝 Description
After affinity successfully binds a request to a channel, if that channel then fails a request or is disabled, the original logic would not retry. It would not retry even when the "no retry on failure" option was left unchecked, as shown below:

This change mainly makes the following modifications:
🚀 Type of change
🔗 Related Issue
✅ Checklist
📸 Proof of Work
Behavior after the change:

Summary by CodeRabbit
Improvements
Tests