mongodb_exporter v0.45.0: "Cannot connect to MongoDB: context deadline exceeded" on every high-resolution scrape #5093

Description

@MemberIT

Summary

After upgrading PMM Client from 3.3.0 to 3.6.0 (affected versions: 3.3.1 - 3.6.0), mongodb_exporter (v0.45.0) fails with "Cannot connect to MongoDB: context deadline exceeded" every second on the high-resolution (HR) scrape job. MongoDB is fully operational and reachable — the error is caused by the scrape context being exhausted before any MongoDB operation can execute.

Severity

Major — MongoDB monitoring is completely broken for the HR scrape job on all hosts running PMM Client 3.6.0. The exporter reports mongodb_up = 0 on every HR scrape despite having ESTABLISHED TCP connections to MongoDB.

Affected Versions

  • PMM Server: 3.6.0
  • PMM Client: 3.6.0 (pmm-client 3.6.0-7.noble)
  • mongodb_exporter: v0.45.0 (commit bea2924e, build 2026-02-03)
  • MongoDB: 8.0.17 (Percona Server for MongoDB), replica set mode
  • OS: Ubuntu 24.04.1 LTS (Noble)

Previously working: PMM Client 3.3.0 (confirmed).

Root Cause

Three components interact to produce a zero-second effective timeout:

1. PMM Server sets scrape_timeout equal to scrape_interval for the HR job

With the HR resolution set to 1s (or even 2s in some configurations), the generated vmagent scrape config is:

# High-resolution job
scrape_interval: 1s
scrape_timeout: 1s

This causes vmagent to send the HTTP header X-Prometheus-Scrape-Timeout-Seconds: 1 to the exporter.

2. mongodb_exporter v0.45.0 enforces web.timeout-offset >= 1

In main.go:49:

TimeoutOffset int `name:"web.timeout-offset" help:"Offset to subtract from the request timeout in seconds" default:"1"`

In main.go:121-124:

if opts.TimeoutOffset <= 0 {
    logger.Warn("Timeout offset needs to be greater than \"0\", falling back to \"1\".")
    opts.TimeoutOffset = 1
}

The minimum enforced value is 1 second. It cannot be set to 0.

3. The exporter subtracts the offset from the scrape timeout to create the context

In exporter.go:313-317:

seconds -= float64(e.opts.TimeoutOffset)
ctx, cancel := context.WithTimeout(r.Context(), time.Duration(seconds*float64(time.Second)))

Result

effective_timeout = scrape_timeout(1s) - web.timeout-offset(1s) = 0s

The context expires instantly, so every client.Ping(ctx, nil) in getClient() returns "context deadline exceeded". This repeats every second, matching scrape_interval: 1s.

Additional factor: hardcoded 1s DialTimeout in PMM managed

In managed/services/agents/mongodb.go, the MongoDB URI is built with:

exporter.DSN(service, models.DSNParams{DialTimeout: time.Second, ...}, ...)

This produces connectTimeoutMS=1000&serverSelectionTimeoutMS=1000 in the MONGODB_URI environment variable, further restricting any connection attempt.

Evidence

Proof that the exporter works with sufficient timeout

On the affected host, querying the same running exporter with different timeout headers:

| X-Prometheus-Scrape-Timeout-Seconds | Effective (minus offset=1) | mongodb_up | MongoDB metrics |
|---|---|---|---|
| 1 | 0s | 0 | None |
| 2 | 1s | 1 | Full metrics ✅ |
| 5 | 4s | 1 | Full metrics ✅ |
| 10 | 9s | 1 | Full metrics ✅ |

# Fails (0s effective timeout):
curl -u pmm:<agent_id> -H 'X-Prometheus-Scrape-Timeout-Seconds: 1' \
  'http://127.0.0.1:42002/metrics?collect[]=diagnosticdata' | grep mongodb_up
# mongodb_up{cluster_role=""} 0

# Works (1s effective timeout):
curl -u pmm:<agent_id> -H 'X-Prometheus-Scrape-Timeout-Seconds: 2' \
  'http://127.0.0.1:42002/metrics?collect[]=diagnosticdata' | grep mongodb_up
# mongodb_up{cluster_role="",cl_id="...",rs_nm="rs0",rs_state="1"} 1

Proof that MongoDB is fully operational

$ mongosh 'mongodb://pmm:<password>@MONGODB_HOSTNAME:27017/?ssl=true' --eval 'db.runCommand({ping:1})'
{ ok: 1 }

Exporter has established TCP connections but still reports errors

$ ss -tnp | grep mongodb_exporte
ESTAB 0 0 MONGODB_IP_ADDRESS:7852 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=7))
ESTAB 0 0 MONGODB_IP_ADDRESS:7858 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=8))
ESTAB 0 0 MONGODB_IP_ADDRESS:7860 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=9))

Error pattern — every 1 second, matching scrape_interval

Feb 25 18:18:06 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
Feb 25 18:18:07 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
Feb 25 18:18:08 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
...repeats every second indefinitely...

The LR scrape job (scrape_timeout: 27s) works fine

The low-resolution job with scrape_timeout: 27s (effective = 26s) operates correctly. Only the HR job is affected.

Reproducer

A docker-compose.yml and setup.sh script are provided. Steps:

docker compose up -d
# Wait ~2-3 minutes for PMM server to initialize (health checks will gate pmm-client)
./setup.sh

The setup.sh script automatically:

  1. Waits for all containers to be healthy
  2. Initializes MongoDB replica set and waits for PRIMARY
  3. Creates a pmm monitoring user with authentication verification
  4. Sets PMM HR metrics resolution to 1s (to trigger the bug)
  5. Registers MongoDB with PMM
  6. Waits 30s for errors to appear
  7. Demonstrates the bug by scraping with 1s vs 10s timeout headers

Expected: mongodb_exporter reports metrics successfully on all scrape jobs.
Actual: mongodb_exporter logs "Cannot connect to MongoDB: context deadline exceeded" every second on the HR job.

Verification commands inside the container

# Show mongodb_up = 0 with 1s timeout (simulating HR scrape):
docker compose exec pmm-client bash -c '
  PORT=$(pmm-admin list | grep mongodb_exporter | awk "{print \$NF}")
  AGENT_ID=$(pmm-admin list | grep mongodb_exporter | awk "{print \$4}")
  curl -s -u pmm:$AGENT_ID -H "X-Prometheus-Scrape-Timeout-Seconds: 1" \
    http://127.0.0.1:$PORT/metrics | grep mongodb_up
'

# Show mongodb_up = 1 with 10s timeout (sufficient):
docker compose exec pmm-client bash -c '
  PORT=$(pmm-admin list | grep mongodb_exporter | awk "{print \$NF}")
  AGENT_ID=$(pmm-admin list | grep mongodb_exporter | awk "{print \$4}")
  curl -s -u pmm:$AGENT_ID -H "X-Prometheus-Scrape-Timeout-Seconds: 10" \
    http://127.0.0.1:$PORT/metrics | grep mongodb_up
'

Suggested Fixes

Option A (Recommended): Ensure scrape_timeout > web.timeout-offset

In PMM Server (managed/), when generating the vmagent scrape config, ensure:

scrape_timeout = max(scrape_interval, web.timeout-offset + minimum_operation_time)

For HR=1s, set scrape_timeout to at least 2s (giving 1s effective).

Option B: Allow web.timeout-offset = 0 in mongodb_exporter

Remove the minimum enforcement in main.go:121-124. Allow users (and PMM) to pass --web.timeout-offset=0 so the full scrape_timeout is available for MongoDB operations.
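
One possible shape for the relaxed check is sketched below. normalizeOffset is a hypothetical function, not the exporter's code; it accepts zero and only falls back to the default of 1 for negative values.

```go
package main

import "fmt"

// normalizeOffset is a hypothetical replacement for the check quoted from
// main.go. Zero is accepted; only negative values fall back to the default.
// The second return value reports whether the fallback was applied.
func normalizeOffset(offset int) (int, bool) {
	if offset < 0 {
		return 1, true
	}
	return offset, false
}

func main() {
	for _, requested := range []int{-1, 0, 1, 3} {
		got, fellBack := normalizeOffset(requested)
		fmt.Printf("requested=%d used=%d fallback=%t\n", requested, got, fellBack)
	}
}
```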

Option C: Increase default HR for MongoDB services

Set a minimum HR interval of 5s for MongoDB exporters, separate from the global HR setting. MongoDB TLS+auth connections need more headroom than simple node_exporter scrapes.

Option D: Increase DialTimeout in managed/services/agents/mongodb.go

Change DialTimeout: time.Second to DialTimeout: 5 * time.Second (or make it configurable). The current 1s connectTimeoutMS in the URI is too aggressive for TLS connections, especially in environments with network latency.

Workaround

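Raise the global HR metrics resolution to 5s or more (per the header test in the Evidence section, any scrape timeout of 2s or more leaves a non-zero effective timeout):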

Via UI: PMM → Configuration → Settings → Advanced Settings → Metrics Resolution

Via API:

curl -k -u admin:<password> -X PUT \
  'https://<pmm-server>/v1/server/settings' \
  -H 'Content-Type: application/json' \
  -d '{"metrics_resolutions": {"hr": "5s", "mr": "5s", "lr": "30s"}}'

Then restart pmm-agent on affected hosts:

sudo systemctl restart pmm-agent

Note: This changes resolution globally for ALL monitored services.

Expected Results

mongodb_exporter reports metrics successfully on all scrape jobs.

Actual Results

mongodb_exporter logs "Cannot connect to MongoDB: context deadline exceeded" every second on the HR job.

Version

PMM Client from 3.3.1 to 3.6.0, PMM Server from 3.3.1 to 3.6.0

Environment Details

  • PMM Server 3.6.0 deployed on Kubernetes (Helm chart)
  • PMM Client 3.6.0 on bare-metal Ubuntu 24.04
  • MongoDB 8.0.17 (Percona Server) with TLS (requireTLS) and replica set (rs0)
  • Connection via hostname over TLS to external IP (same host)

Steps to reproduce

docker-compose.yml (run setup.sh after the stack is up)

##
## Reproducer: mongodb_exporter "Cannot connect to MongoDB: context deadline exceeded"
##
## PMM 3.6.0 (mongodb_exporter v0.45.0) — HR scrape_timeout=1s combined with
## web.timeout-offset=1 yields 0s effective timeout for every high-resolution scrape.
##
## Usage:
##   docker compose up -d
##   # Wait ~2-3 minutes for PMM server to initialize
##   ./setup.sh
##   # The script will wait for errors and verify the bug automatically
##

services:
  pmm-server:
    image: percona/pmm-server:3.6.0
    container_name: pmm-server
    hostname: pmm-server
    ports:
      - "8443:8443"
      - "8080:8080"
    environment:
      PMM_ADMIN_PASSWORD: admin
      PMM_ENABLE_UPDATES: "false"
    volumes:
      - pmm-data:/srv
    healthcheck:
      test: ["CMD", "curl", "-sSf", "http://localhost:8080/v1/server/readyz"]
      interval: 10s
      timeout: 5s
      retries: 40
      start_period: 90s

  mongodb:
    image: percona/percona-server-mongodb:7.0
    container_name: mongodb
    hostname: mongodb
    command: >
      --replSet rs0
      --bind_ip_all
      --port 27017
    volumes:
      - mongo-data:/data/db
    healthcheck:
      test: ["CMD", "mongosh", "--quiet", "--eval", "db.runCommand({ping:1}).ok"]
      interval: 5s
      timeout: 3s
      retries: 10

  pmm-client:
    image: percona/pmm-client:3.6.0
    container_name: pmm-client
    hostname: pmm-client
    depends_on:
      pmm-server:
        condition: service_healthy
      mongodb:
        condition: service_healthy
    environment:
      PMM_AGENT_SERVER_ADDRESS: pmm-server:8443
      PMM_AGENT_SERVER_USERNAME: admin
      PMM_AGENT_SERVER_PASSWORD: admin
      PMM_AGENT_SERVER_INSECURE_TLS: "1"
      PMM_AGENT_SETUP: "1"
      PMM_AGENT_CONFIG_FILE: /usr/local/percona/pmm/config/pmm-agent.yaml

volumes:
  pmm-data:
  mongo-data:
