Description
Summary
After upgrading PMM Client from 3.3.0 to 3.6.0 (affected versions: 3.3.1 - 3.6.0), mongodb_exporter (v0.45.0) fails with "Cannot connect to MongoDB: context deadline exceeded" every second on the high-resolution (HR) scrape job. MongoDB is fully operational and reachable — the error is caused by the scrape context being exhausted before any MongoDB operation can execute.
Severity
Major — MongoDB monitoring is completely broken for the HR scrape job on all hosts running PMM Client 3.6.0. The exporter reports mongodb_up = 0 on every HR scrape despite having ESTABLISHED TCP connections to MongoDB.
Affected Versions
- PMM Server: 3.6.0
- PMM Client: 3.6.0 (pmm-client 3.6.0-7.noble)
- mongodb_exporter: v0.45.0 (commit bea2924e, build 2026-02-03)
- MongoDB: 8.0.17 (Percona Server for MongoDB), replica set mode
- OS: Ubuntu 24.04.1 LTS (Noble)
Previously working: PMM Client 3.3.0 (confirmed).
Root Cause
Three behaviors combine to produce a zero-second effective timeout:
1. PMM Server sets scrape_timeout equal to scrape_interval for the HR job
For a default HR resolution of 1s (or even 2s in some configurations), the generated vmagent scrape config is:
# High-resolution job
scrape_interval: 1s
scrape_timeout: 1s
This causes vmagent to send the HTTP header X-Prometheus-Scrape-Timeout-Seconds: 1 to the exporter.
2. mongodb_exporter v0.45.0 enforces web.timeout-offset >= 1
In main.go:49:
TimeoutOffset int `name:"web.timeout-offset" help:"Offset to subtract from the request timeout in seconds" default:"1"`
In main.go:121-124:
if opts.TimeoutOffset <= 0 {
	logger.Warn("Timeout offset needs to be greater than \"0\", falling back to \"1\".")
	opts.TimeoutOffset = 1
}
The minimum enforced value is 1 second. It cannot be set to 0.
3. The exporter subtracts the offset from the scrape timeout to create the context
In exporter.go:313-317:
seconds -= float64(e.opts.TimeoutOffset)
ctx, cancel := context.WithTimeout(r.Context(), time.Duration(seconds*float64(time.Second)))
Result
effective_timeout = scrape_timeout(1s) - web.timeout-offset(1s) = 0s
The context expires instantly, so every client.Ping(ctx, nil) in getClient() returns "context deadline exceeded". The error repeats every second, matching scrape_interval: 1s.
Additional factor: hardcoded 1s DialTimeout in PMM managed
In managed/services/agents/mongodb.go, the MongoDB URI is built with:
exporter.DSN(service, models.DSNParams{DialTimeout: time.Second, ...}, ...)
This produces connectTimeoutMS=1000&serverSelectionTimeoutMS=1000 in the MONGODB_URI environment variable, further restricting any connection attempt.
Evidence
Proof that the exporter works with sufficient timeout
On the affected host, querying the same running exporter with different timeout headers:
| X-Prometheus-Scrape-Timeout-Seconds | Effective (minus offset=1) | mongodb_up | MongoDB metrics |
|---|---|---|---|
| 1 | 0s | 0 | None |
| 2 | 1s | 1 | Full metrics ✅ |
| 5 | 4s | 1 | Full metrics ✅ |
| 10 | 9s | 1 | Full metrics ✅ |
# Fails (0s effective timeout):
curl -u pmm:<agent_id> -H 'X-Prometheus-Scrape-Timeout-Seconds: 1' \
'http://127.0.0.1:42002/metrics?collect[]=diagnosticdata' | grep mongodb_up
# mongodb_up{cluster_role=""} 0
# Works (1s effective timeout):
curl -u pmm:<agent_id> -H 'X-Prometheus-Scrape-Timeout-Seconds: 2' \
'http://127.0.0.1:42002/metrics?collect[]=diagnosticdata' | grep mongodb_up
# mongodb_up{cluster_role="",cl_id="...",rs_nm="rs0",rs_state="1"} 1
Proof that MongoDB is fully operational
$ mongosh 'mongodb://pmm:<password>@MONGODB_HOSTNAME:27017/?ssl=true' --eval 'db.runCommand({ping:1})'
{ ok: 1 }
Exporter has established TCP connections but still reports errors
$ ss -tnp | grep mongodb_exporte
ESTAB 0 0 MONGODB_IP_ADDRESS:7852 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=7))
ESTAB 0 0 MONGODB_IP_ADDRESS:7858 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=8))
ESTAB 0 0 MONGODB_IP_ADDRESS:7860 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=9))
Error pattern — every 1 second, matching scrape_interval
Feb 25 18:18:06 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
Feb 25 18:18:07 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
Feb 25 18:18:08 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
...repeats every second indefinitely...
The LR scrape job (scrape_timeout: 27s) works fine
The low-resolution job with scrape_timeout: 27s (effective = 26s) operates correctly. Only the HR job is affected.
Reproducer
A docker-compose.yml and setup.sh script are provided. Steps:
docker compose up -d
# Wait ~2-3 minutes for PMM server to initialize (health checks will gate pmm-client)
./setup.sh
The setup.sh script automatically:
- Waits for all containers to be healthy
- Initializes MongoDB replica set and waits for PRIMARY
- Creates a pmm monitoring user with authentication verification
- Sets PMM HR metrics resolution to 1s (to trigger the bug)
- Registers MongoDB with PMM
- Waits 30s for errors to appear
- Demonstrates the bug by scraping with 1s vs 10s timeout headers
Expected: mongodb_exporter reports metrics successfully on all scrape jobs.
Actual: mongodb_exporter logs "Cannot connect to MongoDB: context deadline exceeded" every second on the HR job.
Verification commands inside the container
# Show mongodb_up = 0 with 1s timeout (simulating HR scrape):
docker compose exec pmm-client bash -c '
  PORT=$(pmm-admin list | grep mongodb_exporter | awk "{print \$NF}")
  AGENT_ID=$(pmm-admin list | grep mongodb_exporter | awk "{print \$4}")
  curl -s -u pmm:$AGENT_ID -H "X-Prometheus-Scrape-Timeout-Seconds: 1" \
    http://127.0.0.1:$PORT/metrics | grep mongodb_up
'
# Show mongodb_up = 1 with 10s timeout (sufficient):
docker compose exec pmm-client bash -c '
  PORT=$(pmm-admin list | grep mongodb_exporter | awk "{print \$NF}")
  AGENT_ID=$(pmm-admin list | grep mongodb_exporter | awk "{print \$4}")
  curl -s -u pmm:$AGENT_ID -H "X-Prometheus-Scrape-Timeout-Seconds: 10" \
    http://127.0.0.1:$PORT/metrics | grep mongodb_up
'
Suggested Fixes
Option A (Recommended): Ensure scrape_timeout > web.timeout-offset
In PMM Server (managed/), when generating the vmagent scrape config, ensure:
scrape_timeout = max(scrape_interval, web.timeout-offset + minimum_operation_time)
For HR=1s, set scrape_timeout to at least 2s (giving 1s effective).
Option B: Allow web.timeout-offset = 0 in mongodb_exporter
Remove the minimum enforcement in main.go:121-124. Allow users (and PMM) to pass --web.timeout-offset=0 so the full scrape_timeout is available for MongoDB operations.
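One possible shape for the relaxed check in Option B is sketched below. normalizeOffset and defaultTimeoutOffset are hypothetical names, not code from mongodb_exporter; the sketch assumes that negative values should still fall back to the default while 0 is accepted as-is.

```go
package main

import "fmt"

// defaultTimeoutOffset matches the documented default of web.timeout-offset.
const defaultTimeoutOffset = 1

// normalizeOffset is a hypothetical replacement for the check quoted from
// main.go: it rejects only negative offsets and lets 0 through, so the full
// scrape timeout can be used. fellBack tells the caller to log a warning.
func normalizeOffset(offset int) (value int, fellBack bool) {
	if offset < 0 {
		return defaultTimeoutOffset, true
	}
	return offset, false
}

func main() {
	for _, requested := range []int{-1, 0, 1} {
		used, fellBack := normalizeOffset(requested)
		fmt.Printf("requested=%d used=%d fellBack=%v\n", requested, used, fellBack)
	}
}
```

Under this variant, --web.timeout-offset=0 with scrape_timeout: 1s would leave a 1s effective timeout instead of 0s.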
Option C: Increase default HR for MongoDB services
Set a minimum HR interval of 5s for MongoDB exporters, separate from the global HR setting. MongoDB TLS+auth connections need more headroom than simple node_exporter scrapes.
Option D: Increase DialTimeout in managed/services/agents/mongodb.go
Change DialTimeout: time.Second to DialTimeout: 5 * time.Second (or make it configurable). The current 1s connectTimeoutMS in the URI is too aggressive for TLS connections, especially in environments with network latency.
Workaround
Increase the global HR metrics resolution to ≥ 5s:
Via UI: PMM → Configuration → Settings → Advanced Settings → Metrics Resolution
Via API:
curl -k -u admin:<password> -X PUT \
'https://<pmm-server>/v1/server/settings' \
-H 'Content-Type: application/json' \
-d '{"metrics_resolutions": {"hr": "5s", "mr": "5s", "lr": "30s"}}'
Then restart pmm-agent on affected hosts:
sudo systemctl restart pmm-agent
Note: This changes resolution globally for ALL monitored services.
Expected Results
mongodb_exporter reports metrics successfully on all scrape jobs.
Actual Results
mongodb_exporter logs "Cannot connect to MongoDB: context deadline exceeded" every second on the HR job.
Version
PMM Client from 3.3.1 to 3.6.0, PMM Server from 3.3.1 to 3.6.0
Environment Details
- PMM Server 3.6.0 deployed on Kubernetes (Helm chart)
- PMM Client 3.6.0 on bare-metal Ubuntu 24.04
- MongoDB 8.0.17 (Percona Server) with TLS (requireTLS) and replica set (rs0)
- Connection via hostname over TLS to external IP (same host)
Steps to reproduce
setup.sh
##
## Reproducer: mongodb_exporter "Cannot connect to MongoDB: context deadline exceeded"
##
## PMM 3.6.0 (mongodb_exporter v0.45.0) — HR scrape_timeout=1s combined with
## web.timeout-offset=1 yields 0s effective timeout for every high-resolution scrape.
##
## Usage:
##   docker compose up -d
##   # Wait ~2-3 minutes for PMM server to initialize
##   ./setup.sh
##   # The script will wait for errors and verify the bug automatically
##
services:
  pmm-server:
    image: percona/pmm-server:3.6.0
    container_name: pmm-server
    hostname: pmm-server
    ports:
      - "8443:8443"
      - "8080:8080"
    environment:
      PMM_ADMIN_PASSWORD: admin
      PMM_ENABLE_UPDATES: "false"
    volumes:
      - pmm-data:/srv
    healthcheck:
      test: ["CMD", "curl", "-sSf", "http://localhost:8080/v1/server/readyz"]
      interval: 10s
      timeout: 5s
      retries: 40
      start_period: 90s
  mongodb:
    image: percona/percona-server-mongodb:7.0
    container_name: mongodb
    hostname: mongodb
    command: >
      --replSet rs0
      --bind_ip_all
      --port 27017
    volumes:
      - mongo-data:/data/db
    healthcheck:
      test: ["CMD", "mongosh", "--quiet", "--eval", "db.runCommand({ping:1}).ok"]
      interval: 5s
      timeout: 3s
      retries: 10
  pmm-client:
    image: percona/pmm-client:3.6.0
    container_name: pmm-client
    hostname: pmm-client
    depends_on:
      pmm-server:
        condition: service_healthy
      mongodb:
        condition: service_healthy
    environment:
      PMM_AGENT_SERVER_ADDRESS: pmm-server:8443
      PMM_AGENT_SERVER_USERNAME: admin
      PMM_AGENT_SERVER_PASSWORD: admin
      PMM_AGENT_SERVER_INSECURE_TLS: "1"
      PMM_AGENT_SETUP: "1"
      PMM_AGENT_CONFIG_FILE: /usr/local/percona/pmm/config/pmm-agent.yaml
volumes:
  pmm-data:
  mongo-data: