feat(monitor): add Prometheus metrics #4077
Walkthrough: This pull request adds comprehensive Prometheus monitoring and observability across the application. It introduces a new `monitor/` package.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (6)
monitor/metrics.go (2)
11-14: Potential data race on `enabled` flag.

The `enabled` variable is written inside `initOnce.Do()` and read by `Enabled()` without synchronization. While `sync.Once` guarantees the write happens-before any subsequent `Do()` call returns, concurrent readers calling `Enabled()` before `InitMetrics()` completes could observe a stale `false` value, which is likely the intended behavior. However, if `Enabled()` is called concurrently during `InitMetrics()` execution, there's a potential race. Consider using `sync/atomic` for the flag to ensure memory visibility:

🔧 Proposed fix using atomic

```diff
 import (
 	"sync"
+	"sync/atomic"

 	"github.com/prometheus/client_golang/prometheus"
 )

 const namespace = "newapi"

 var (
 	initOnce sync.Once
-	enabled  bool
+	enabled  atomic.Bool
 )
```

Then update `Enabled()`:

```diff
 func Enabled() bool {
-	return enabled
+	return enabled.Load()
 }
```

And in `InitMetrics()`:

```diff
-	enabled = true
+	enabled.Store(true)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/metrics.go` around lines 11 - 14, The package has a potential data race on the package-level enabled flag: change the `enabled` variable to an atomic type (e.g., uint32) and use atomic operations so reads/writes are synchronized; update `Enabled()` to return atomic.LoadUint32(&enabled) == 1 and set the flag in `InitMetrics()` with atomic.StoreUint32(&enabled, 1) (or 0 on failure) instead of unsynchronized reads/writes, leaving `initOnce` usage intact to ensure one-time initialization.
158-185: High cardinality metrics may impact Prometheus performance.

Tier 3 metrics use `user_id` and `token_id` labels, which can create unbounded cardinality (O(users × tokens × models)). While the design decision to use only counters (no histograms) is appropriate, consider:

- Adding documentation about expected cardinality and Prometheus resource requirements
- Providing a configuration option to disable Tier 3 metrics independently
- Implementing metric expiration/cleanup for inactive users/tokens
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/metrics.go` around lines 158 - 185, Tier 3 counters (TokenRequestsTotal, TokenTokensUsedTotal, TokenQuotaConsumedTotal) create potentially unbounded label cardinality (user_id, token_id, model) which can hurt Prometheus; add a runtime toggle to enable/disable these Tier 3 metrics (e.g., a config flag or env var read during metrics init), document expected cardinality and resource guidance for running with this flag enabled, and implement a cleanup/expiration strategy for these CounterVecs (e.g., periodically calling DeleteLabelValues or using a short-lived registry) when tokens/users become inactive so labels are pruned; update the metrics initialization code to respect the new config flag and add brief docs mentioning Prometheus requirements and recommended limits.

monitor/db.go (2)
78-87: Background goroutine lacks shutdown mechanism.

The `StartDBPoolCollector` goroutine runs indefinitely with no way to stop it. This can cause issues during graceful shutdown or in tests. Consider accepting a context:

🔧 Proposed enhancement for graceful shutdown

```diff
-func StartDBPoolCollector(sqlDB *sql.DB) {
+func StartDBPoolCollector(ctx context.Context, sqlDB *sql.DB) {
 	if sqlDB == nil || !Enabled() {
 		return
 	}
 	go func() {
 		ticker := time.NewTicker(15 * time.Second)
 		defer ticker.Stop()
-		for range ticker.C {
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-ticker.C:
+			}
 			stats := sqlDB.Stats()
 			DBOpenConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
 			DBOpenConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
 			DBOpenConnections.WithLabelValues("idle").Set(float64(stats.Idle))
 		}
 	}()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/db.go` around lines 78 - 87, The background goroutine that polls sqlDB.Stats() (currently started in StartDBPoolCollector) runs forever and needs a shutdown mechanism; modify StartDBPoolCollector to accept a context.Context, replace the for range ticker.C loop with a select that listens for ctx.Done() and the ticker.C case, and on ctx cancellation stop the ticker and return; update references to sqlDB.Stats(), DBOpenConnections and the ticker usage within that function accordingly and ensure callers pass a cancellable or app context so the collector can be cleanly stopped.
36-52: Callback registration errors are silently ignored.

The `db.Callback().*.Register()` calls return errors that are discarded with `_`. If registration fails (e.g., duplicate callback name), metrics will silently stop working. Consider logging registration failures:

🔧 Proposed fix to log registration errors

```diff
+import "log"
+
 func registerCallback(db *gorm.DB, operation string) {
 	beforeName := "prometheus:before_" + operation
 	afterName := "prometheus:after_" + operation
 	// ... callback definitions ...
 	switch operation {
 	case "create":
-		_ = db.Callback().Create().Before("*").Register(beforeName, before)
-		_ = db.Callback().Create().After("*").Register(afterName, after)
+		if err := db.Callback().Create().Before("*").Register(beforeName, before); err != nil {
+			log.Printf("Failed to register DB callback %s: %v", beforeName, err)
+		}
+		if err := db.Callback().Create().After("*").Register(afterName, after); err != nil {
+			log.Printf("Failed to register DB callback %s: %v", afterName, err)
+		}
 	// ... similar for other cases ...
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/db.go` around lines 36 - 52, The callback Register calls currently discard returned errors; update each registration (the db.Callback().Create()/Query()/Update()/Delete()/Raw().Before("*").Register(beforeName, before) and .After("*").Register(afterName, after) calls) to capture the error into a variable and log it if non-nil (include operation, beforeName/afterName and the error in the log message) instead of using `_ =`. Apply this change for both the Before and After registrations in all switch cases so failures (e.g., duplicate callback names) are surfaced.

monitor/server.go (1)
32-34: Consider adding graceful shutdown support.

The server currently has no shutdown mechanism. If the main application shuts down, the metrics server goroutine will be abandoned. Consider accepting a context for graceful shutdown:

🔧 Proposed enhancement for graceful shutdown

```diff
+import "context"
+
-func StartMetricsServer() {
+func StartMetricsServer(ctx context.Context) {
 	if !Enabled() {
 		return
 	}
 	// ... port setup ...
 	server := &http.Server{
 		Addr:    addr,
 		Handler: mux,
 	}
+	go func() {
+		<-ctx.Done()
+		server.Shutdown(context.Background())
+	}()
+
 	log.Printf("Prometheus metrics server listening on %s", addr)
-	if err := http.ListenAndServe(addr, mux); err != nil {
+	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 		log.Printf("Prometheus metrics server error: %v", err)
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/server.go` around lines 32 - 34, Replace the direct call to http.ListenAndServe with an http.Server instance and add graceful shutdown using a context: create server := &http.Server{Addr: addr, Handler: mux}, run server.ListenAndServe in a goroutine, and in the main routine wait for the provided context to be cancelled (ctx.Done()) then call server.Shutdown(shutdownCtx) with a short timeout context to allow in-flight requests to finish; update any function signature that currently starts the server to accept a context.Context so you can invoke server.Shutdown when the app is stopping (referencing http.ListenAndServe, server.ListenAndServe, and server.Shutdown).

model/main.go (1)
199-201: Consider instrumenting LOG_DB for consistency.

The main DB is instrumented with Prometheus metrics, but `InitLogDB()` (lines 218-253) does not register callbacks or pool collectors for `LOG_DB` when it uses a separate database connection. If observability into log database operations is desired, consider adding similar instrumentation there.

The current placement after connection pool configuration is appropriate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model/main.go` around lines 199 - 201, InitLogDB currently creates a separate log GORM connection but doesn't register Prometheus instrumentation; after you configure the log DB's connection pool in InitLogDB(), call monitor.RegisterDBCallbacks(logDB) and monitor.StartDBPoolCollector(logSqlDB) (where logDB is the GORM *gorm.DB returned/assigned in InitLogDB and logSqlDB is the underlying *sql.DB obtained via logDB.DB()) so the LOG_DB is instrumented the same way as the main DB.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 14efaaab-95e5-4c48-8953-7aee20d115ed
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (12)
controller/relay.go, go.mod, main.go, model/main.go, monitor/db.go, monitor/metrics.go, monitor/middleware.go, monitor/redis_hook.go, monitor/relay.go, monitor/server.go, monitor/token.go, service/text_quota.go
```go
		FirstTokenTime: relayInfo.FirstResponseTime,
		IsStream:       relayInfo.IsStream,
	})
	monitor.RecordTokenRequest(relayInfo.UserId, relayInfo.TokenId)
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if there's any cardinality limiting or Tier 3 disable flag
rg -n "tier.?3|token.*metric|user.*metric" --type go -i
```

Repository: QuantumNous/new-api
Length of output: 192
🏁 Script executed:
```shell
# 1. Find and examine the RecordTokenRequest function
rg -n "RecordTokenRequest" --type go -A 10
```

Repository: QuantumNous/new-api
Length of output: 2091
🏁 Script executed:
```shell
# 2. Examine the monitor/metrics.go file around Tier 3 metrics
rg -n "Tier 3|User/Token-Level" --type go -B 5 -A 15 monitor/
```

Repository: QuantumNous/new-api
Length of output: 1896
🏁 Script executed:
```shell
# 3. Check for any cardinality limiting or configuration flags
rg -n "cardinality|disable.*tier|tier.*disable|maxSeries|maxLabels" --type go -i
```

Repository: QuantumNous/new-api
Length of output: 318
🏁 Script executed:
```shell
# 4. Verify lines 235 and 262 in controller/relay.go
sed -n '230,240p;258,268p' controller/relay.go
```

Repository: QuantumNous/new-api
Length of output: 829
Address high cardinality Prometheus metrics with user_id and token_id labels.
RecordTokenRequest(relayInfo.UserId, relayInfo.TokenId) at lines 235 and 262 creates metrics with unbounded cardinality—each unique user and token combination generates a new metric series. With thousands of users and tokens, this can exhaust Prometheus memory, degrade query performance, and increase storage costs significantly.
The Tier 3 metrics (TokenRequestsTotal, TokenTokensUsedTotal, TokenQuotaConsumedTotal) are explicitly documented as high-cardinality but have no safeguards. Add one of the following:

- A configuration flag to disable Tier 3 metrics independently
- Clear operational documentation warning about cardinality limits for deployments
- Bucketed ranges or sampling instead of exact IDs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@controller/relay.go` at line 235, The Tier 3 metric calls are creating
unbounded cardinality via RecordTokenRequest(relayInfo.UserId,
relayInfo.TokenId) (and related metrics TokenRequestsTotal,
TokenTokensUsedTotal, TokenQuotaConsumedTotal) — add a runtime config flag
(e.g., EnableTier3Metrics or DisableHighCardinalityMetrics) in the monitor or
config package and check it before invoking monitor.RecordTokenRequest and the
other Tier 3 metric emitters so these metrics can be disabled per deployment;
alternatively implement low-cardinality bucketing/sampling inside
RecordTokenRequest (e.g., hash the userId/tokenId and map to fixed N buckets or
sample 1/N requests) and update the metric names/labels accordingly to avoid
user_id/token_id labels being emitted for every unique id.
```go
	DBWaitTotal = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "db_wait_total",
			Help:      "Total number of waits for a database connection.",
		},
	)
	DBWaitDuration = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "db_wait_duration_seconds_total",
			Help:      "Total time spent waiting for a database connection in seconds.",
		},
	)
```
DBWaitTotal and DBWaitDuration are registered but never recorded.
These counters are defined and registered, but StartDBPoolCollector in monitor/db.go only records DBOpenConnections. The sql.DBStats struct provides WaitCount and WaitDuration fields that should populate these metrics.
🔧 Proposed fix in monitor/db.go StartDBPoolCollector
```diff
 for range ticker.C {
 	stats := sqlDB.Stats()
 	DBOpenConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
 	DBOpenConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
 	DBOpenConnections.WithLabelValues("idle").Set(float64(stats.Idle))
+	DBWaitTotal.Add(float64(stats.WaitCount))
+	DBWaitDuration.Add(stats.WaitDuration.Seconds())
 }
```

Note: Since these are cumulative counters from sql.DBStats, you'll need to track the previous values and add only the delta to avoid overcounting:
```go
var lastWaitCount int64
var lastWaitDuration time.Duration
for range ticker.C {
	stats := sqlDB.Stats()
	// ... existing gauge updates ...
	if stats.WaitCount > lastWaitCount {
		DBWaitTotal.Add(float64(stats.WaitCount - lastWaitCount))
		lastWaitCount = stats.WaitCount
	}
	if stats.WaitDuration > lastWaitDuration {
		DBWaitDuration.Add((stats.WaitDuration - lastWaitDuration).Seconds())
		lastWaitDuration = stats.WaitDuration
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@monitor/metrics.go` around lines 87 - 100, DBWaitTotal and DBWaitDuration are
never updated; modify StartDBPoolCollector to read stats := sqlDB.Stats() each
tick and compute deltas against stored previous values (e.g., lastWaitCount
int64 and lastWaitDuration time.Duration) and call
DBWaitTotal.Add(float64(deltaWaitCount)) and
DBWaitDuration.Add(deltaWaitDuration.Seconds()) only when the current values
exceed the previous ones, then update the stored previous variables so counters
reflect incremental WaitCount and WaitDuration from sql.DBStats rather than
repeatedly adding the cumulative totals.
```go
	HTTPActiveConnections.Inc()
	start := time.Now()

	c.Next()

	HTTPActiveConnections.Dec()
```
HTTPActiveConnections gauge may become inaccurate if handler panics.
If a downstream handler panics, HTTPActiveConnections.Dec() will be skipped, causing the gauge to drift upward over time. Use defer to ensure the decrement always executes:
🔧 Proposed fix using defer
```diff
 func PrometheusMiddleware() gin.HandlerFunc {
 	return func(c *gin.Context) {
 		if !Enabled() {
 			c.Next()
 			return
 		}
 		HTTPActiveConnections.Inc()
+		defer HTTPActiveConnections.Dec()
 		start := time.Now()
 		c.Next()
-		HTTPActiveConnections.Dec()
-
 		path := c.FullPath()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
		HTTPActiveConnections.Inc()
		defer HTTPActiveConnections.Dec()
		start := time.Now()
		c.Next()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@monitor/middleware.go` around lines 18 - 23, HTTPActiveConnections may not be
decremented if a downstream handler panics because Dec() is called after
c.Next(); fix by making the decrement run in a defer immediately after increment
so it always executes: add defer HTTPActiveConnections.Dec() right after
HTTPActiveConnections.Inc(), and if you compute request duration using the start
variable, move that duration/reporting logic into a deferred closure that
captures start (or perform duration calculation inside the same defer) so
c.Next() can still run and any panics won't leave the gauge inflated; reference
HTTPActiveConnections, c.Next(), and the start variable.
Summary
Adds comprehensive Prometheus metrics to the API gateway with a three-tiered approach to control label cardinality. Metrics are served on a dedicated internal port (`/metrics`), separate from the main API server.

Metric Tiers
Tier 1 — System-Level (Low Cardinality)
- `newapi_http_requests_total` (labels: method, path, status_code)
- `newapi_http_request_duration_seconds` (labels: method, path, status_code)
- `newapi_http_active_connections`
- `newapi_redis_commands_total` (labels: command, status)
- `newapi_redis_command_duration_seconds` (labels: command)
- `newapi_db_queries_total` (labels: operation, status)
- `newapi_db_query_duration_seconds` (labels: operation)
- `newapi_db_open_connections` (labels: state)
- `newapi_db_wait_total`
- `newapi_db_wait_duration_seconds_total`

Tier 2 — Relay/Channel-Level (Moderate Cardinality)
- `newapi_relay_requests_total` (labels: channel_id, channel_type, model, status_code, relay_mode)
- `newapi_relay_request_duration_seconds` (labels: channel_id, channel_type, model, relay_mode)
- `newapi_relay_first_token_seconds` (labels: channel_id, channel_type, model)
- `newapi_relay_tokens_used_total` (labels: channel_id, channel_type, model, direction)
- `newapi_relay_errors_total` (labels: channel_id, channel_type, error_type)
- `newapi_relay_retries_total` (labels: channel_id, channel_type)

Tier 3 — User/Token-Level (High Cardinality, Counters Only)
- `newapi_token_requests_total` (labels: user_id, token_id)
- `newapi_token_tokens_used_total` (labels: user_id, token_id, model, direction)
- `newapi_token_quota_consumed_total` (labels: user_id, token_id)

Architecture
New `monitor/` package with 7 files:

- `metrics.go`: `InitMetrics()`, `Enabled()`
- `server.go`: `/metrics` endpoint
- `middleware.go`
- `redis_hook.go`: `Hook` for command-level Redis metrics
- `db.go`
- `relay.go`
- `token.go`

Integration Points
- `main.go`: Initialize metrics early in startup, register Gin middleware, attach Redis hook after client init
- `model/main.go`: Register GORM callbacks and start DB pool stats collector after DB init
- `controller/relay.go`: Record relay request/retry/error metrics on success and failure paths
- `service/text_quota.go`: Record token usage and quota consumption after billing calculation

Configuration
- `PROMETHEUS_ENABLED` (default: `true`): set to `"false"` to fully disable (zero overhead)
- `PROMETHEUS_PORT` (default: `9090`)

Design Decisions
- Uses `c.FullPath()` (Gin route patterns) to avoid cardinality explosion from path parameters
- `monitor/` uses stdlib `log` for its own logging instead of importing `common`
- `Enabled()` guard on every call path