mongodb_exporter v0.45.0: "Cannot connect to MongoDB: context deadline exceeded" on every high-resolution scrape

### Description

## Summary

After upgrading PMM Client from 3.3.0 to 3.6.0 (affected versions: 3.3.1 to 3.6.0), `mongodb_exporter` (v0.45.0) fails with `"Cannot connect to MongoDB: context deadline exceeded"` every second on the high-resolution (HR) scrape job. MongoDB is fully operational and reachable; the error occurs because the scrape context expires before any MongoDB operation can execute.

## Severity

**Major** — MongoDB monitoring is completely broken for the HR scrape job on all hosts running PMM Client 3.6.0. The exporter reports `mongodb_up = 0` on every HR scrape despite having ESTABLISHED TCP connections to MongoDB.

## Affected Versions

- **PMM Server**: 3.6.0
- **PMM Client**: 3.6.0 (pmm-client 3.6.0-7.noble)
- **mongodb_exporter**: v0.45.0 (commit bea2924e, build 2026-02-03)
- **MongoDB**: 8.0.17 (Percona Server for MongoDB), replica set mode
- **OS**: Ubuntu 24.04.1 LTS (Noble)

**Previously working**: PMM Client 3.3.0 (confirmed).

## Root Cause

Three components interact to produce a **zero-second effective timeout**:

### 1. PMM Server sets `scrape_timeout` equal to `scrape_interval` for the HR job

With the HR resolution set to 1s, the generated vmagent scrape config is:

```yaml
# High-resolution job
scrape_interval: 1s
scrape_timeout: 1s
```

This causes vmagent to send the HTTP header `X-Prometheus-Scrape-Timeout-Seconds: 1` to the exporter.

### 2. mongodb_exporter v0.45.0 enforces `web.timeout-offset >= 1`

In [`main.go:49`](https://github.com/percona/mongodb_exporter/blob/main/main.go#L49):

```go
TimeoutOffset int `name:"web.timeout-offset" help:"Offset to subtract from the request timeout in seconds" default:"1"`
```

In [`main.go:121-124`](https://github.com/percona/mongodb_exporter/blob/main/main.go#L121):

```go
if opts.TimeoutOffset <= 0 {
    logger.Warn("Timeout offset needs to be greater than \"0\", falling back to \"1\".")
    opts.TimeoutOffset = 1
}
```

The minimum enforced value is **1 second**. It cannot be set to 0.

### 3. The exporter subtracts the offset from the scrape timeout to create the context

In [`exporter.go:313-317`](https://github.com/percona/mongodb_exporter/blob/main/exporter/exporter.go#L313):

```go
seconds -= float64(e.opts.TimeoutOffset)
ctx, cancel := context.WithTimeout(r.Context(), time.Duration(seconds*float64(time.Second)))
```

### Result

```
effective_timeout = scrape_timeout(1s) - web.timeout-offset(1s) = 0s
```

The context expires **instantly**, so every `client.Ping(ctx, nil)` in `getClient()` returns `"context deadline exceeded"`. The error repeats every second, matching `scrape_interval: 1s`.
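
The arithmetic above can be checked in isolation. The following is a minimal sketch, not the exporter's actual code: `effectiveTimeout` and `expiredOnCreation` are hypothetical helpers that mirror the quoted subtraction and then build a context from the result with `context.WithTimeout`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// effectiveTimeout mirrors the subtraction quoted above: the scrape timeout
// announced by the scraper minus web.timeout-offset, both in seconds.
// Hypothetical helper, not taken from mongodb_exporter.
func effectiveTimeout(scrapeTimeoutSeconds float64, timeoutOffset int) time.Duration {
	seconds := scrapeTimeoutSeconds - float64(timeoutOffset)
	return time.Duration(seconds * float64(time.Second))
}

// expiredOnCreation reports whether a context built with the effective
// timeout is already past its deadline before any work can start.
func expiredOnCreation(scrapeTimeoutSeconds float64, timeoutOffset int) bool {
	ctx, cancel := context.WithTimeout(context.Background(),
		effectiveTimeout(scrapeTimeoutSeconds, timeoutOffset))
	defer cancel()
	return errors.Is(ctx.Err(), context.DeadlineExceeded)
}

func main() {
	// Header values taken from the evidence table in this report; offset is 1.
	for _, header := range []float64{1, 2, 5, 10} {
		fmt.Printf("header=%vs effective=%v expired=%t\n",
			header, effectiveTimeout(header, 1), expiredOnCreation(header, 1))
	}
}
```

With a header value of 1 and an offset of 1, the context is already done when it is returned, which is consistent with `Ping` failing without any network activity.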

### Additional factor: hardcoded 1s DialTimeout in PMM managed

In [`managed/services/agents/mongodb.go`](https://github.com/percona/pmm/blob/main/managed/services/agents/mongodb.go), the MongoDB URI is built with:

```go
exporter.DSN(service, models.DSNParams{DialTimeout: time.Second, ...}, ...)
```

This produces `connectTimeoutMS=1000&serverSelectionTimeoutMS=1000` in the `MONGODB_URI` environment variable, further restricting any connection attempt.
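
To illustrate how a dial timeout ends up as those two URI options, here is a rough sketch. `timeoutQuery` is a hypothetical helper written for this report; it is not PMM's DSN builder and only assumes that both options are derived from the same duration, as the URI above suggests.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// timeoutQuery is a hypothetical helper (not PMM's actual DSN code) that
// renders one dial timeout as the two MongoDB URI options named above.
func timeoutQuery(dialTimeout time.Duration) string {
	ms := strconv.FormatInt(dialTimeout.Milliseconds(), 10)
	q := url.Values{}
	q.Set("connectTimeoutMS", ms)
	q.Set("serverSelectionTimeoutMS", ms)
	return q.Encode() // Encode emits keys in sorted order
}

func main() {
	fmt.Println(timeoutQuery(time.Second))     // current hardcoded value
	fmt.Println(timeoutQuery(5 * time.Second)) // value proposed in Option D
}
```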

## Evidence

### Proof that the exporter works with sufficient timeout

On the affected host, querying the **same running exporter** with different timeout headers:

| `X-Prometheus-Scrape-Timeout-Seconds` | Effective (minus offset=1) | `mongodb_up` | MongoDB metrics |
| ------------------------------------- | -------------------------- | ------------ | --------------- |
| 1                                     | **0s**                     | `0`          | None            |
| 2                                     | 1s                         | `1`          | Full metrics ✅ |
| 5                                     | 4s                         | `1`          | Full metrics ✅ |
| 10                                    | 9s                         | `1`          | Full metrics ✅ |

```bash
# Fails (0s effective timeout):
curl -u pmm:<agent_id> -H 'X-Prometheus-Scrape-Timeout-Seconds: 1' \
  'http://127.0.0.1:42002/metrics?collect[]=diagnosticdata' | grep mongodb_up
# mongodb_up{cluster_role=""} 0

# Works (1s effective timeout):
curl -u pmm:<agent_id> -H 'X-Prometheus-Scrape-Timeout-Seconds: 2' \
  'http://127.0.0.1:42002/metrics?collect[]=diagnosticdata' | grep mongodb_up
# mongodb_up{cluster_role="",cl_id="...",rs_nm="rs0",rs_state="1"} 1
```

### Proof that MongoDB is fully operational

```bash
$ mongosh 'mongodb://pmm:<password>@MONGODB_HOSTNAME:27017/?ssl=true' --eval 'db.runCommand({ping:1})'
{ ok: 1 }
```

### Exporter has established TCP connections but still reports errors

```
$ ss -tnp | grep mongodb_exporte
ESTAB 0 0 MONGODB_IP_ADDRESS:7852 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=7))
ESTAB 0 0 MONGODB_IP_ADDRESS:7858 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=8))
ESTAB 0 0 MONGODB_IP_ADDRESS:7860 MONGODB_IP_ADDRESS:27017 users:(("mongodb_exporte",pid=81600,fd=9))
```

### Error pattern — every 1 second, matching scrape_interval

```
Feb 25 18:18:06 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
Feb 25 18:18:07 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
Feb 25 18:18:08 pmm-agent[80434]: level=error msg="Cannot connect to MongoDB" error="context deadline exceeded"
...repeats every second indefinitely...
```

### The LR scrape job (scrape_timeout: 27s) works fine

The low-resolution job with `scrape_timeout: 27s` (effective = 26s) operates correctly. Only the HR job is affected.

## Reproducer

A `docker-compose.yml` and `setup.sh` script are provided. Steps:

```bash
docker compose up -d
# Wait ~2-3 minutes for PMM server to initialize (health checks will gate pmm-client)
./setup.sh
```

The `setup.sh` script automatically:

1. Waits for all containers to be healthy
2. Initializes MongoDB replica set and waits for PRIMARY
3. Creates a `pmm` monitoring user with authentication verification
4. Sets PMM HR metrics resolution to 1s (to trigger the bug)
5. Registers MongoDB with PMM
6. Waits 30s for errors to appear
7. Demonstrates the bug by scraping with 1s vs 10s timeout headers

**Expected**: `mongodb_exporter` reports metrics successfully on all scrape jobs.
**Actual**: `mongodb_exporter` logs `"Cannot connect to MongoDB: context deadline exceeded"` every second on the HR job.

### Verification commands inside the container

```bash
# Show mongodb_up = 0 with 1s timeout (simulating HR scrape):
docker compose exec pmm-client bash -c '
  PORT=$(pmm-admin list | grep mongodb_exporter | awk "{print \$NF}")
  AGENT_ID=$(pmm-admin list | grep mongodb_exporter | awk "{print \$4}")
  curl -s -u pmm:$AGENT_ID -H "X-Prometheus-Scrape-Timeout-Seconds: 1" \
    http://127.0.0.1:$PORT/metrics | grep mongodb_up
'

# Show mongodb_up = 1 with 10s timeout (sufficient):
docker compose exec pmm-client bash -c '
  PORT=$(pmm-admin list | grep mongodb_exporter | awk "{print \$NF}")
  AGENT_ID=$(pmm-admin list | grep mongodb_exporter | awk "{print \$4}")
  curl -s -u pmm:$AGENT_ID -H "X-Prometheus-Scrape-Timeout-Seconds: 10" \
    http://127.0.0.1:$PORT/metrics | grep mongodb_up
'
```

## Suggested Fixes

### Option A (Recommended): Ensure scrape_timeout > web.timeout-offset

In PMM Server (`managed/`), when generating the vmagent scrape config, ensure:

```
scrape_timeout = max(scrape_interval, web.timeout-offset + minimum_operation_time)
```

For HR=1s, set `scrape_timeout` to at least 2s (giving 1s effective).
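
A minimal sketch of that rule follows. `safeScrapeTimeout` and the 1s `minOp` value are assumptions made for illustration; they are not names or defaults from the PMM code base.

```go
package main

import (
	"fmt"
	"time"
)

// safeScrapeTimeout is a hypothetical helper illustrating Option A. It returns
// the larger of the scrape interval and offset+minOperation, so the exporter
// always keeps at least minOperation after subtracting its offset.
func safeScrapeTimeout(interval, offset, minOperation time.Duration) time.Duration {
	floor := offset + minOperation
	if interval > floor {
		return interval
	}
	return floor
}

func main() {
	offset := time.Second // web.timeout-offset default
	minOp := time.Second  // assumed minimum time for one MongoDB operation
	for _, interval := range []time.Duration{time.Second, 5 * time.Second} {
		timeout := safeScrapeTimeout(interval, offset, minOp)
		fmt.Printf("interval=%v scrape_timeout=%v effective=%v\n",
			interval, timeout, timeout-offset)
	}
}
```

Whether vmagent accepts a `scrape_timeout` larger than `scrape_interval` has not been verified here and would need checking before adopting this approach.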

### Option B: Allow web.timeout-offset = 0 in mongodb_exporter

Remove the minimum enforcement in `main.go:121-124`. Allow users (and PMM) to pass `--web.timeout-offset=0` so the full scrape_timeout is available for MongoDB operations.
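
One possible shape for the relaxed check is sketched below. `normalizeTimeoutOffset` is a hypothetical function, not the exporter's code; it assumes negative values should still fall back to the default of 1 while 0 is accepted.

```go
package main

import "fmt"

// normalizeTimeoutOffset is a hypothetical replacement for the check quoted
// from main.go. It accepts 0 and falls back to 1 only for negative values.
// The second return value reports whether the fallback was applied.
func normalizeTimeoutOffset(offset int) (int, bool) {
	if offset < 0 {
		return 1, true
	}
	return offset, false
}

func main() {
	for _, requested := range []int{-1, 0, 1, 3} {
		got, fellBack := normalizeTimeoutOffset(requested)
		fmt.Printf("requested=%d used=%d fallback=%t\n", requested, got, fellBack)
	}
}
```

With an offset of 0, a 1s scrape timeout would leave the full 1s for MongoDB operations.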

### Option C: Increase default HR for MongoDB services

Set a minimum HR interval of 5s for MongoDB exporters, separate from the global HR setting. MongoDB TLS+auth connections need more headroom than simple node_exporter scrapes.

### Option D: Increase DialTimeout in managed/services/agents/mongodb.go

Change `DialTimeout: time.Second` to `DialTimeout: 5 * time.Second` (or make it configurable). The current 1s `connectTimeoutMS` in the URI is too aggressive for TLS connections, especially in environments with network latency.

## Workaround

Increase the global HR metrics resolution to ≥ 5s:

**Via UI**: PMM → Configuration → Settings → Advanced Settings → Metrics Resolution

**Via API**:

```bash
curl -k -u admin:<password> -X PUT \
  'https://<pmm-server>/v1/server/settings' \
  -H 'Content-Type: application/json' \
  -d '{"metrics_resolutions": {"hr": "5s", "mr": "5s", "lr": "30s"}}'
```

Then restart pmm-agent on affected hosts:

```bash
sudo systemctl restart pmm-agent
```

**Note**: This changes resolution globally for ALL monitored services.

### Expected Results

`mongodb_exporter` reports metrics successfully on all scrape jobs.

### Actual Results

`mongodb_exporter` logs `"Cannot connect to MongoDB: context deadline exceeded"` every second on the HR job.

### Version

PMM Client from 3.3.1 to 3.6.0, PMM Server from 3.3.1 to 3.6.0

## Environment Details

- PMM Server 3.6.0 deployed on Kubernetes (Helm chart)
- PMM Client 3.6.0 on bare-metal Ubuntu 24.04
- MongoDB 8.0.17 (Percona Server) with TLS (requireTLS) and replica set (rs0)
- Connection via hostname over TLS to external IP (same host)

### Steps to reproduce

[setup.sh](https://github.com/user-attachments/files/25610804/setup.sh)

`docker-compose.yml`:

```yaml
##
## Reproducer: mongodb_exporter "Cannot connect to MongoDB: context deadline exceeded"
##
## PMM 3.6.0 (mongodb_exporter v0.45.0) — HR scrape_timeout=1s combined with
## web.timeout-offset=1 yields 0s effective timeout for every high-resolution scrape.
##
## Usage:
##   docker compose up -d
##   # Wait ~2-3 minutes for PMM server to initialize
##   ./setup.sh
##   # The script will wait for errors and verify the bug automatically
##

services:
  pmm-server:
    image: percona/pmm-server:3.6.0
    container_name: pmm-server
    hostname: pmm-server
    ports:
      - "8443:8443"
      - "8080:8080"
    environment:
      PMM_ADMIN_PASSWORD: admin
      PMM_ENABLE_UPDATES: "false"
    volumes:
      - pmm-data:/srv
    healthcheck:
      test: ["CMD", "curl", "-sSf", "http://localhost:8080/v1/server/readyz"]
      interval: 10s
      timeout: 5s
      retries: 40
      start_period: 90s

  mongodb:
    image: percona/percona-server-mongodb:7.0
    container_name: mongodb
    hostname: mongodb
    command: >
      --replSet rs0
      --bind_ip_all
      --port 27017
    volumes:
      - mongo-data:/data/db
    healthcheck:
      test: ["CMD", "mongosh", "--quiet", "--eval", "db.runCommand({ping:1}).ok"]
      interval: 5s
      timeout: 3s
      retries: 10

  pmm-client:
    image: percona/pmm-client:3.6.0
    container_name: pmm-client
    hostname: pmm-client
    depends_on:
      pmm-server:
        condition: service_healthy
      mongodb:
        condition: service_healthy
    environment:
      PMM_AGENT_SERVER_ADDRESS: pmm-server:8443
      PMM_AGENT_SERVER_USERNAME: admin
      PMM_AGENT_SERVER_PASSWORD: admin
      PMM_AGENT_SERVER_INSECURE_TLS: "1"
      PMM_AGENT_SETUP: "1"
      PMM_AGENT_CONFIG_FILE: /usr/local/percona/pmm/config/pmm-agent.yaml

volumes:
  pmm-data:
  mongo-data:
```

### Relevant logs

```Shell

```

### Code of Conduct

- [x] I agree to follow Percona Community Code of Conduct
