
feat(monitor): add Prometheus metrics #4077

Open

wzxjohn wants to merge 1 commit into QuantumNous:main from wzxjohn:dev

Conversation

wzxjohn (Contributor) commented Apr 3, 2026

Summary

Adds comprehensive Prometheus metrics to the API gateway with a three-tiered approach to control label cardinality. Metrics are served on a dedicated internal port (/metrics), separate from the main API server.

Metric Tiers

Tier 1 — System-Level (Low Cardinality)

| Metric | Type | Labels |
|---|---|---|
| `newapi_http_requests_total` | Counter | method, path, status_code |
| `newapi_http_request_duration_seconds` | Histogram | method, path, status_code |
| `newapi_http_active_connections` | Gauge | (none) |
| `newapi_redis_commands_total` | Counter | command, status |
| `newapi_redis_command_duration_seconds` | Histogram | command |
| `newapi_db_queries_total` | Counter | operation, status |
| `newapi_db_query_duration_seconds` | Histogram | operation |
| `newapi_db_open_connections` | Gauge | state |
| `newapi_db_wait_total` | Counter | (none) |
| `newapi_db_wait_duration_seconds_total` | Counter | (none) |

Tier 2 — Relay/Channel-Level (Moderate Cardinality)

| Metric | Type | Labels |
|---|---|---|
| `newapi_relay_requests_total` | Counter | channel_id, channel_type, model, status_code, relay_mode |
| `newapi_relay_request_duration_seconds` | Histogram | channel_id, channel_type, model, relay_mode |
| `newapi_relay_first_token_seconds` | Histogram | channel_id, channel_type, model |
| `newapi_relay_tokens_used_total` | Counter | channel_id, channel_type, model, direction |
| `newapi_relay_errors_total` | Counter | channel_id, channel_type, error_type |
| `newapi_relay_retries_total` | Counter | channel_id, channel_type |

Tier 3 — User/Token-Level (High Cardinality, Counters Only)

| Metric | Type | Labels |
|---|---|---|
| `newapi_token_requests_total` | Counter | user_id, token_id |
| `newapi_token_tokens_used_total` | Counter | user_id, token_id, model, direction |
| `newapi_token_quota_consumed_total` | Counter | user_id, token_id |

Architecture

New monitor/ package with 7 files:

| File | Responsibility |
|---|---|
| `metrics.go` | All Prometheus metric definitions, `InitMetrics()`, `Enabled()` |
| `server.go` | Dedicated HTTP server for the `/metrics` endpoint |
| `middleware.go` | Gin middleware for HTTP request metrics |
| `redis_hook.go` | go-redis/v8 Hook for command-level Redis metrics |
| `db.go` | GORM callbacks for query metrics + background DB pool stats collector |
| `relay.go` | Relay/channel metric recording helpers |
| `token.go` | User/token metric recording helpers |

Integration Points

  • main.go: Initialize metrics early in startup, register Gin middleware, attach Redis hook after client init
  • model/main.go: Register GORM callbacks and start DB pool stats collector after DB init
  • controller/relay.go: Record relay request/retry/error metrics on success and failure paths
  • service/text_quota.go: Record token usage and quota consumption after billing calculation

Configuration

| Env Var | Default | Description |
|---|---|---|
| `PROMETHEUS_ENABLED` | true | Set to "false" to fully disable (zero overhead) |
| `PROMETHEUS_PORT` | 9090 | Port for the dedicated metrics HTTP server |

Design Decisions

  • Dedicated port: Metrics endpoint is isolated from the public API, avoiding accidental exposure
  • Tiered cardinality: Tier 3 (user/token) uses counters only (no histograms) to limit storage impact at scale
  • Path normalization: HTTP middleware uses c.FullPath() (Gin route patterns) to avoid cardinality explosion from path parameters
  • No import cycles: monitor/ uses stdlib log for its own logging instead of importing common
  • Fire-and-forget: All metric recording is non-blocking; Enabled() guard on every call path

Summary by CodeRabbit

Release Notes

  • New Features
    • Added comprehensive Prometheus metrics integration for enhanced system monitoring and observability
    • Integrated automatic tracking of HTTP requests, database queries, and Redis operations
    • Introduced relay request and token usage metrics for API performance insights
    • Metrics server endpoint now available for accessing real-time application performance and health data

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 3, 2026

Walkthrough

This pull request adds comprehensive Prometheus monitoring and observability across the application. It introduces a new monitor package with metrics for HTTP requests, Redis commands, database operations, relay requests, and token usage. Monitoring is integrated via middleware, database callbacks, Redis hooks, and explicit metric recording calls throughout the codebase, with optional startup of a dedicated metrics HTTP server.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Monitor Package**<br>`monitor/metrics.go`, `monitor/middleware.go`, `monitor/db.go`, `monitor/redis_hook.go`, `monitor/relay.go`, `monitor/token.go`, `monitor/server.go` | Introduces a comprehensive Prometheus instrumentation framework with metric definitions (counters, histograms, gauges), Gin middleware for HTTP tracking, GORM callback registration for database query metrics, Redis hook registration for command metrics, relay request/retry/token recording functions, token usage tracking, and an optional metrics HTTP server on a configurable port. |
| **Application Integration**<br>`main.go`, `model/main.go` | Wires Prometheus middleware into Gin, initializes the metrics system during startup, conditionally registers Redis and database hooks based on configuration, and starts the metrics server in a background goroutine when enabled. |
| **Relay Monitoring**<br>`controller/relay.go` | Adds metric recording calls on successful relay (including status/duration/first-token time and token usage) and on failure; records retry events during retry loop iterations. |
| **Token Quota Tracking**<br>`service/text_quota.go` | Records relay and token usage metrics when quota is consumed, capturing channel, model, and token counts. |
| **Dependency Update**<br>`go.mod` | Promotes `github.com/prometheus/client_golang` from indirect to direct dependency. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • creamlike1024

Poem

🐰 Metrics now hop through every flow,
From Redis hops to DB glow,
Each relay jump and token spent,
We track with Prometheus intent—
Observability complete! 📊

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 63.16%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat(monitor): add Prometheus metrics' directly and clearly summarizes the primary change: addition of Prometheus metrics monitoring functionality across the codebase. |


coderabbitai bot left a comment
Actionable comments posted: 3

🧹 Nitpick comments (6)
monitor/metrics.go (2)

11-14: Potential data race on enabled flag.

The enabled variable is written inside initOnce.Do() and read by Enabled() without synchronization. While sync.Once guarantees the write happens-before any subsequent Do() call returns, concurrent readers calling Enabled() before InitMetrics() completes could observe a stale false value, which is likely the intended behavior. However, if Enabled() is called concurrently during InitMetrics() execution, there's a potential race.

Consider using sync/atomic for the flag to ensure memory visibility:

🔧 Proposed fix using atomic

```diff
 import (
 	"sync"
+	"sync/atomic"

 	"github.com/prometheus/client_golang/prometheus"
 )

 const namespace = "newapi"

 var (
 	initOnce sync.Once
-	enabled  bool
+	enabled  atomic.Bool
 )
```

Then update Enabled():

```diff
 func Enabled() bool {
-	return enabled
+	return enabled.Load()
 }
```

And in InitMetrics():

```diff
-		enabled = true
+		enabled.Store(true)
```

158-185: High cardinality metrics may impact Prometheus performance.

Tier 3 metrics use user_id and token_id labels, which can create unbounded cardinality (O(users × tokens × models)). While the design decision to use only counters (no histograms) is appropriate, consider:

  1. Adding documentation about expected cardinality and Prometheus resource requirements
  2. Providing a configuration option to disable Tier 3 metrics independently
  3. Implementing metric expiration/cleanup for inactive users/tokens
monitor/db.go (2)

78-87: Background goroutine lacks shutdown mechanism.

The StartDBPoolCollector goroutine runs indefinitely with no way to stop it. This can cause issues during graceful shutdown or in tests. Consider accepting a context:

🔧 Proposed enhancement for graceful shutdown

```diff
-func StartDBPoolCollector(sqlDB *sql.DB) {
+func StartDBPoolCollector(ctx context.Context, sqlDB *sql.DB) {
 	if sqlDB == nil || !Enabled() {
 		return
 	}

 	go func() {
 		ticker := time.NewTicker(15 * time.Second)
 		defer ticker.Stop()
-		for range ticker.C {
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-ticker.C:
+			}
 			stats := sqlDB.Stats()
 			DBOpenConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
 			DBOpenConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
 			DBOpenConnections.WithLabelValues("idle").Set(float64(stats.Idle))
 		}
 	}()
 }
```

36-52: Callback registration errors are silently ignored.

The db.Callback().*.Register() calls return errors that are discarded with _. If registration fails (e.g., duplicate callback name), metrics will silently stop working. Consider logging registration failures:

🔧 Proposed fix to log registration errors

```diff
+import "log"

 func registerCallback(db *gorm.DB, operation string) {
 	beforeName := "prometheus:before_" + operation
 	afterName := "prometheus:after_" + operation
 	// ... callback definitions ...

 	switch operation {
 	case "create":
-		_ = db.Callback().Create().Before("*").Register(beforeName, before)
-		_ = db.Callback().Create().After("*").Register(afterName, after)
+		if err := db.Callback().Create().Before("*").Register(beforeName, before); err != nil {
+			log.Printf("Failed to register DB callback %s: %v", beforeName, err)
+		}
+		if err := db.Callback().Create().After("*").Register(afterName, after); err != nil {
+			log.Printf("Failed to register DB callback %s: %v", afterName, err)
+		}
 	// ... similar for other cases ...
 	}
 }
```
monitor/server.go (1)

32-34: Consider adding graceful shutdown support.

The server currently has no shutdown mechanism. If the main application shuts down, the metrics server goroutine will be abandoned. Consider accepting a context for graceful shutdown:

🔧 Proposed enhancement for graceful shutdown

```diff
+import "context"

-func StartMetricsServer() {
+func StartMetricsServer(ctx context.Context) {
 	if !Enabled() {
 		return
 	}
 	// ... port setup ...

 	server := &http.Server{
 		Addr:    addr,
 		Handler: mux,
 	}

+	go func() {
+		<-ctx.Done()
+		server.Shutdown(context.Background())
+	}()
+
 	log.Printf("Prometheus metrics server listening on %s", addr)
-	if err := http.ListenAndServe(addr, mux); err != nil {
+	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 		log.Printf("Prometheus metrics server error: %v", err)
 	}
 }
```
model/main.go (1)

199-201: Consider instrumenting LOG_DB for consistency.

The main DB is instrumented with Prometheus metrics, but InitLogDB() (lines 218-253) does not register callbacks or pool collectors for LOG_DB when it uses a separate database connection. If observability into log database operations is desired, consider adding similar instrumentation there.

The current placement after connection pool configuration is appropriate.

📥 Commits

Reviewing files that changed from the base of the PR and between c9611c4 and 8d8cf0a.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (12)
  • controller/relay.go
  • go.mod
  • main.go
  • model/main.go
  • monitor/db.go
  • monitor/metrics.go
  • monitor/middleware.go
  • monitor/redis_hook.go
  • monitor/relay.go
  • monitor/server.go
  • monitor/token.go
  • service/text_quota.go

Comment in controller/relay.go:

```go
		FirstTokenTime: relayInfo.FirstResponseTime,
		IsStream:       relayInfo.IsStream,
	})
	monitor.RecordTokenRequest(relayInfo.UserId, relayInfo.TokenId)
```
⚠️ Potential issue | 🟠 Major



Address high cardinality Prometheus metrics with user_id and token_id labels.

RecordTokenRequest(relayInfo.UserId, relayInfo.TokenId) at lines 235 and 262 creates metrics with unbounded cardinality—each unique user and token combination generates a new metric series. With thousands of users and tokens, this can exhaust Prometheus memory, degrade query performance, and increase storage costs significantly.

The Tier 3 metrics (TokenRequestsTotal, TokenTokensUsedTotal, TokenQuotaConsumedTotal) are explicitly documented as high-cardinality but have no safeguards. Add one of the following:

  1. A configuration flag to disable Tier 3 metrics independently
  2. Clear operational documentation warning about cardinality limits for deployments
  3. Bucketed ranges or sampling instead of exact IDs

Comment on lines +87 to +100 of monitor/metrics.go:

```go
	DBWaitTotal = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "db_wait_total",
			Help:      "Total number of waits for a database connection.",
		},
	)
	DBWaitDuration = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "db_wait_duration_seconds_total",
			Help:      "Total time spent waiting for a database connection in seconds.",
		},
	)
```

⚠️ Potential issue | 🟡 Minor

DBWaitTotal and DBWaitDuration are registered but never recorded.

These counters are defined and registered, but StartDBPoolCollector in monitor/db.go only records DBOpenConnections. The sql.DBStats struct provides WaitCount and WaitDuration fields that should populate these metrics.

🔧 Proposed fix in monitor/db.go StartDBPoolCollector

```diff
 for range ticker.C {
 	stats := sqlDB.Stats()
 	DBOpenConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
 	DBOpenConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
 	DBOpenConnections.WithLabelValues("idle").Set(float64(stats.Idle))
+	DBWaitTotal.Add(float64(stats.WaitCount))
+	DBWaitDuration.Add(stats.WaitDuration.Seconds())
 }
```

Note: Since these are cumulative counters from sql.DBStats, you'll need to track the previous values and add only the delta to avoid overcounting:

```go
var lastWaitCount int64
var lastWaitDuration time.Duration

for range ticker.C {
	stats := sqlDB.Stats()
	// ... existing gauge updates ...

	if stats.WaitCount > lastWaitCount {
		DBWaitTotal.Add(float64(stats.WaitCount - lastWaitCount))
		lastWaitCount = stats.WaitCount
	}
	if stats.WaitDuration > lastWaitDuration {
		DBWaitDuration.Add((stats.WaitDuration - lastWaitDuration).Seconds())
		lastWaitDuration = stats.WaitDuration
	}
}
```

Comment on lines +18 to +23 of monitor/middleware.go:

```go
		HTTPActiveConnections.Inc()
		start := time.Now()

		c.Next()

		HTTPActiveConnections.Dec()
```

⚠️ Potential issue | 🟡 Minor

HTTPActiveConnections gauge may become inaccurate if handler panics.

If a downstream handler panics, HTTPActiveConnections.Dec() will be skipped, causing the gauge to drift upward over time. Use defer to ensure the decrement always executes:

🔧 Proposed fix using defer

```diff
 func PrometheusMiddleware() gin.HandlerFunc {
 	return func(c *gin.Context) {
 		if !Enabled() {
 			c.Next()
 			return
 		}

 		HTTPActiveConnections.Inc()
+		defer HTTPActiveConnections.Dec()
 		start := time.Now()

 		c.Next()

-		HTTPActiveConnections.Dec()
-
 		path := c.FullPath()
```
