feat(monitor): add Prometheus metrics #4077
Walkthrough: This pull request adds comprehensive Prometheus monitoring and observability across the application. It introduces a new `monitor/` package.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (6)
monitor/metrics.go (2)
11-14: Potential data race on `enabled` flag.

The `enabled` variable is written inside `initOnce.Do()` and read by `Enabled()` without synchronization. While `sync.Once` guarantees the write happens-before any subsequent `Do()` call returns, concurrent readers calling `Enabled()` before `InitMetrics()` completes could observe a stale `false` value, which is likely the intended behavior. However, if `Enabled()` is called concurrently during `InitMetrics()` execution, there's a potential race. Consider using `sync/atomic` for the flag to ensure memory visibility:

🔧 Proposed fix using atomic

```diff
 import (
 	"sync"
+	"sync/atomic"

 	"github.com/prometheus/client_golang/prometheus"
 )

 const namespace = "newapi"

 var (
 	initOnce sync.Once
-	enabled  bool
+	enabled  atomic.Bool
 )
```

Then update `Enabled()`:

```diff
 func Enabled() bool {
-	return enabled
+	return enabled.Load()
 }
```

And in `InitMetrics()`:

```diff
-	enabled = true
+	enabled.Store(true)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/metrics.go` around lines 11 - 14, The package has a potential data race on the package-level enabled flag: change the `enabled` variable to an atomic type (e.g., uint32) and use atomic operations so reads/writes are synchronized; update `Enabled()` to return atomic.LoadUint32(&enabled) == 1 and set the flag in `InitMetrics()` with atomic.StoreUint32(&enabled, 1) (or 0 on failure) instead of unsynchronized reads/writes, leaving `initOnce` usage intact to ensure one-time initialization.
158-185: High cardinality metrics may impact Prometheus performance.

Tier 3 metrics use `user_id` and `token_id` labels, which can create unbounded cardinality (O(users × tokens × models)). While the design decision to use only counters (no histograms) is appropriate, consider:

- Adding documentation about expected cardinality and Prometheus resource requirements
- Providing a configuration option to disable Tier 3 metrics independently
- Implementing metric expiration/cleanup for inactive users/tokens
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/metrics.go` around lines 158 - 185, Tier 3 counters (TokenRequestsTotal, TokenTokensUsedTotal, TokenQuotaConsumedTotal) create potentially unbounded label cardinality (user_id, token_id, model) which can hurt Prometheus; add a runtime toggle to enable/disable these Tier 3 metrics (e.g., a config flag or env var read during metrics init), document expected cardinality and resource guidance for running with this flag enabled, and implement a cleanup/expiration strategy for these CounterVecs (e.g., periodically calling DeleteLabelValues or using a short-lived registry) when tokens/users become inactive so labels are pruned; update the metrics initialization code to respect the new config flag and add brief docs mentioning Prometheus requirements and recommended limits.

monitor/db.go (2)
78-87: Background goroutine lacks shutdown mechanism.

The `StartDBPoolCollector` goroutine runs indefinitely with no way to stop it. This can cause issues during graceful shutdown or in tests. Consider accepting a context:

🔧 Proposed enhancement for graceful shutdown

```diff
-func StartDBPoolCollector(sqlDB *sql.DB) {
+func StartDBPoolCollector(ctx context.Context, sqlDB *sql.DB) {
 	if sqlDB == nil || !Enabled() {
 		return
 	}
 	go func() {
 		ticker := time.NewTicker(15 * time.Second)
 		defer ticker.Stop()
-		for range ticker.C {
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-ticker.C:
+			}
 			stats := sqlDB.Stats()
 			DBOpenConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
 			DBOpenConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
 			DBOpenConnections.WithLabelValues("idle").Set(float64(stats.Idle))
 		}
 	}()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/db.go` around lines 78 - 87, The background goroutine that polls sqlDB.Stats() (currently started in StartDBPoolCollector) runs forever and needs a shutdown mechanism; modify StartDBPoolCollector to accept a context.Context, replace the for range ticker.C loop with a select that listens for ctx.Done() and the ticker.C case, and on ctx cancellation stop the ticker and return; update references to sqlDB.Stats(), DBOpenConnections and the ticker usage within that function accordingly and ensure callers pass a cancellable or app context so the collector can be cleanly stopped.
36-52: Callback registration errors are silently ignored.

The `db.Callback().*.Register()` calls return errors that are discarded with `_`. If registration fails (e.g., duplicate callback name), metrics will silently stop working. Consider logging registration failures:

🔧 Proposed fix to log registration errors

```diff
+import "log"
+
 func registerCallback(db *gorm.DB, operation string) {
 	beforeName := "prometheus:before_" + operation
 	afterName := "prometheus:after_" + operation
 	// ... callback definitions ...
 	switch operation {
 	case "create":
-		_ = db.Callback().Create().Before("*").Register(beforeName, before)
-		_ = db.Callback().Create().After("*").Register(afterName, after)
+		if err := db.Callback().Create().Before("*").Register(beforeName, before); err != nil {
+			log.Printf("Failed to register DB callback %s: %v", beforeName, err)
+		}
+		if err := db.Callback().Create().After("*").Register(afterName, after); err != nil {
+			log.Printf("Failed to register DB callback %s: %v", afterName, err)
+		}
 	// ... similar for other cases ...
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/db.go` around lines 36 - 52, The callback Register calls currently discard returned errors; update each registration (the db.Callback().Create()/Query()/Update()/Delete()/Raw().Before("*").Register(beforeName, before) and .After("*").Register(afterName, after) calls) to capture the error into a variable and log it if non-nil (include operation, beforeName/afterName and the error in the log message) instead of using `_ =`. Apply this change for both the Before and After registrations in all switch cases so failures (e.g., duplicate callback names) are surfaced.

monitor/server.go (1)
32-34: Consider adding graceful shutdown support.

The server currently has no shutdown mechanism. If the main application shuts down, the metrics server goroutine will be abandoned. Consider accepting a context for graceful shutdown:

🔧 Proposed enhancement for graceful shutdown

```diff
+import "context"
+
-func StartMetricsServer() {
+func StartMetricsServer(ctx context.Context) {
 	if !Enabled() {
 		return
 	}
 	// ... port setup ...
 	server := &http.Server{
 		Addr:    addr,
 		Handler: mux,
 	}
+	go func() {
+		<-ctx.Done()
+		server.Shutdown(context.Background())
+	}()
+
 	log.Printf("Prometheus metrics server listening on %s", addr)
-	if err := http.ListenAndServe(addr, mux); err != nil {
+	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 		log.Printf("Prometheus metrics server error: %v", err)
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitor/server.go` around lines 32 - 34, Replace the direct call to http.ListenAndServe with an http.Server instance and add graceful shutdown using a context: create server := &http.Server{Addr: addr, Handler: mux}, run server.ListenAndServe in a goroutine, and in the main routine wait for the provided context to be cancelled (ctx.Done()) then call server.Shutdown(shutdownCtx) with a short timeout context to allow in-flight requests to finish; update any function signature that currently starts the server to accept a context.Context so you can invoke server.Shutdown when the app is stopping (referencing http.ListenAndServe, server.ListenAndServe, and server.Shutdown).

model/main.go (1)
199-201: Consider instrumenting LOG_DB for consistency.

The main DB is instrumented with Prometheus metrics, but `InitLogDB()` (lines 218-253) does not register callbacks or pool collectors for `LOG_DB` when it uses a separate database connection. If observability into log database operations is desired, consider adding similar instrumentation there.

The current placement after connection pool configuration is appropriate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model/main.go` around lines 199 - 201, InitLogDB currently creates a separate log GORM connection but doesn't register Prometheus instrumentation; after you configure the log DB's connection pool in InitLogDB(), call monitor.RegisterDBCallbacks(logDB) and monitor.StartDBPoolCollector(logSqlDB) (where logDB is the GORM *gorm.DB returned/assigned in InitLogDB and logSqlDB is the underlying *sql.DB obtained via logDB.DB()) so the LOG_DB is instrumented the same way as the main DB.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 14efaaab-95e5-4c48-8953-7aee20d115ed
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (12)
controller/relay.go, go.mod, main.go, model/main.go, monitor/db.go, monitor/metrics.go, monitor/middleware.go, monitor/redis_hook.go, monitor/relay.go, monitor/server.go, monitor/token.go, service/text_quota.go
```go
		FirstTokenTime: relayInfo.FirstResponseTime,
		IsStream:       relayInfo.IsStream,
	})
	monitor.RecordTokenRequest(relayInfo.UserId, relayInfo.TokenId)
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if there's any cardinality limiting or Tier 3 disable flag
rg -n "tier.?3|token.*metric|user.*metric" --type go -i
```

Repository: QuantumNous/new-api
Length of output: 192
🏁 Script executed:
```shell
# 1. Find and examine the RecordTokenRequest function
rg -n "RecordTokenRequest" --type go -A 10
```

Repository: QuantumNous/new-api
Length of output: 2091
🏁 Script executed:
```shell
# 2. Examine the monitor/metrics.go file around Tier 3 metrics
rg -n "Tier 3|User/Token-Level" --type go -B 5 -A 15 monitor/
```

Repository: QuantumNous/new-api
Length of output: 1896
🏁 Script executed:
```shell
# 3. Check for any cardinality limiting or configuration flags
rg -n "cardinality|disable.*tier|tier.*disable|maxSeries|maxLabels" --type go -i
```

Repository: QuantumNous/new-api
Length of output: 318
🏁 Script executed:
```shell
# 4. Verify lines 235 and 262 in controller/relay.go
sed -n '230,240p;258,268p' controller/relay.go
```

Repository: QuantumNous/new-api
Length of output: 829
Address high cardinality Prometheus metrics with user_id and token_id labels.
RecordTokenRequest(relayInfo.UserId, relayInfo.TokenId) at lines 235 and 262 creates metrics with unbounded cardinality—each unique user and token combination generates a new metric series. With thousands of users and tokens, this can exhaust Prometheus memory, degrade query performance, and increase storage costs significantly.
The Tier 3 metrics (TokenRequestsTotal, TokenTokensUsedTotal, TokenQuotaConsumedTotal) are explicitly documented as high-cardinality but have no safeguards. Add one of the following:

- A configuration flag to disable Tier 3 metrics independently
- Clear operational documentation warning about cardinality limits for deployments
- Bucketed ranges or sampling instead of exact IDs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@controller/relay.go` at line 235, The Tier 3 metric calls are creating
unbounded cardinality via RecordTokenRequest(relayInfo.UserId,
relayInfo.TokenId) (and related metrics TokenRequestsTotal,
TokenTokensUsedTotal, TokenQuotaConsumedTotal) — add a runtime config flag
(e.g., EnableTier3Metrics or DisableHighCardinalityMetrics) in the monitor or
config package and check it before invoking monitor.RecordTokenRequest and the
other Tier 3 metric emitters so these metrics can be disabled per deployment;
alternatively implement low-cardinality bucketing/sampling inside
RecordTokenRequest (e.g., hash the userId/tokenId and map to fixed N buckets or
sample 1/N requests) and update the metric names/labels accordingly to avoid
user_id/token_id labels being emitted for every unique id.
```go
	DBWaitTotal = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "db_wait_total",
			Help:      "Total number of waits for a database connection.",
		},
	)
	DBWaitDuration = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "db_wait_duration_seconds_total",
			Help:      "Total time spent waiting for a database connection in seconds.",
		},
	)
```
DBWaitTotal and DBWaitDuration are registered but never recorded.
These counters are defined and registered, but StartDBPoolCollector in monitor/db.go only records DBOpenConnections. The sql.DBStats struct provides WaitCount and WaitDuration fields that should populate these metrics.
🔧 Proposed fix in monitor/db.go StartDBPoolCollector
```diff
 for range ticker.C {
 	stats := sqlDB.Stats()
 	DBOpenConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
 	DBOpenConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
 	DBOpenConnections.WithLabelValues("idle").Set(float64(stats.Idle))
+	DBWaitTotal.Add(float64(stats.WaitCount))
+	DBWaitDuration.Add(stats.WaitDuration.Seconds())
 }
```

Note: Since these are cumulative counters from sql.DBStats, you'll need to track the previous values and add only the delta to avoid overcounting:
```go
var lastWaitCount int64
var lastWaitDuration time.Duration
for range ticker.C {
	stats := sqlDB.Stats()
	// ... existing gauge updates ...
	if stats.WaitCount > lastWaitCount {
		DBWaitTotal.Add(float64(stats.WaitCount - lastWaitCount))
		lastWaitCount = stats.WaitCount
	}
	if stats.WaitDuration > lastWaitDuration {
		DBWaitDuration.Add((stats.WaitDuration - lastWaitDuration).Seconds())
		lastWaitDuration = stats.WaitDuration
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@monitor/metrics.go` around lines 87 - 100, DBWaitTotal and DBWaitDuration are
never updated; modify StartDBPoolCollector to read stats := sqlDB.Stats() each
tick and compute deltas against stored previous values (e.g., lastWaitCount
int64 and lastWaitDuration time.Duration) and call
DBWaitTotal.Add(float64(deltaWaitCount)) and
DBWaitDuration.Add(deltaWaitDuration.Seconds()) only when the current values
exceed the previous ones, then update the stored previous variables so counters
reflect incremental WaitCount and WaitDuration from sql.DBStats rather than
repeatedly adding the cumulative totals.
```go
	HTTPActiveConnections.Inc()
	start := time.Now()

	c.Next()

	HTTPActiveConnections.Dec()
```
HTTPActiveConnections gauge may become inaccurate if handler panics.
If a downstream handler panics, HTTPActiveConnections.Dec() will be skipped, causing the gauge to drift upward over time. Use defer to ensure the decrement always executes:
🔧 Proposed fix using defer
```diff
 func PrometheusMiddleware() gin.HandlerFunc {
 	return func(c *gin.Context) {
 		if !Enabled() {
 			c.Next()
 			return
 		}
 		HTTPActiveConnections.Inc()
+		defer HTTPActiveConnections.Dec()
 		start := time.Now()
 		c.Next()
-		HTTPActiveConnections.Dec()
-
 		path := c.FullPath()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
		HTTPActiveConnections.Inc()
		defer HTTPActiveConnections.Dec()
		start := time.Now()
		c.Next()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@monitor/middleware.go` around lines 18 - 23, HTTPActiveConnections may not be
decremented if a downstream handler panics because Dec() is called after
c.Next(); fix by making the decrement run in a defer immediately after increment
so it always executes: add defer HTTPActiveConnections.Dec() right after
HTTPActiveConnections.Inc(), and if you compute request duration using the start
variable, move that duration/reporting logic into a deferred closure that
captures start (or perform duration calculation inside the same defer) so
c.Next() can still run and any panics won't leave the gauge inflated; reference
HTTPActiveConnections, c.Next(), and the start variable.
Summary
Adds comprehensive Prometheus metrics to the API gateway with a three-tiered approach to control label cardinality. Metrics are served on a dedicated internal port (`/metrics`), separate from the main API server.

Metric Tiers
Tier 1 — System-Level (Low Cardinality)
- `newapi_http_requests_total` (labels: method, path, status_code)
- `newapi_http_request_duration_seconds` (labels: method, path, status_code)
- `newapi_http_active_connections`
- `newapi_redis_commands_total` (labels: command, status)
- `newapi_redis_command_duration_seconds` (labels: command)
- `newapi_db_queries_total` (labels: operation, status)
- `newapi_db_query_duration_seconds` (labels: operation)
- `newapi_db_open_connections` (labels: state)
- `newapi_db_wait_total`
- `newapi_db_wait_duration_seconds_total`

Tier 2 — Relay/Channel-Level (Moderate Cardinality)
- `newapi_relay_requests_total` (labels: channel_id, channel_type, model, status_code, relay_mode)
- `newapi_relay_request_duration_seconds` (labels: channel_id, channel_type, model, relay_mode)
- `newapi_relay_first_token_seconds` (labels: channel_id, channel_type, model)
- `newapi_relay_tokens_used_total` (labels: channel_id, channel_type, model, direction)
- `newapi_relay_errors_total` (labels: channel_id, channel_type, error_type)
- `newapi_relay_retries_total` (labels: channel_id, channel_type)

Tier 3 — User/Token-Level (High Cardinality, Counters Only)
- `newapi_token_requests_total` (labels: user_id, token_id)
- `newapi_token_tokens_used_total` (labels: user_id, token_id, model, direction)
- `newapi_token_quota_consumed_total` (labels: user_id, token_id)

Architecture
New `monitor/` package with 7 files:

- `metrics.go`: `InitMetrics()`, `Enabled()`
- `server.go`: `/metrics` endpoint
- `middleware.go`
- `redis_hook.go`: `Hook` for command-level Redis metrics
- `db.go`
- `relay.go`
- `token.go`

Integration Points
- `main.go`: Initialize metrics early in startup, register Gin middleware, attach Redis hook after client init
- `model/main.go`: Register GORM callbacks and start DB pool stats collector after DB init
- `controller/relay.go`: Record relay request/retry/error metrics on success and failure paths
- `service/text_quota.go`: Record token usage and quota consumption after billing calculation

Configuration
- `PROMETHEUS_ENABLED` (default: `true`): set to `"false"` to fully disable (zero overhead)
- `PROMETHEUS_PORT` (default: `9090`)

Design Decisions
- Uses `c.FullPath()` (Gin route patterns) to avoid cardinality explosion from path parameters
- `monitor/` uses stdlib `log` for its own logging instead of importing `common`
- `Enabled()` guard on every call path