listCodexModels spawns new codex app-server on every request, causing intermittent 120s timeouts #806

@creeep123

Problem

GET /api/sessions/{id}/codex-models intermittently returns 500 after 120 seconds, which gets wrapped as a 502 Bad Gateway by Cloudflare Tunnel.

Root Cause

In cli/src/modules/common/codexModels.ts, listCodexModels() creates a new CodexAppServerClient on every call, which spawns a fresh codex app-server child process:

export async function listCodexModels(includeHidden: boolean = false): Promise<CodexModelSummary[]> {
    const client = new CodexAppServerClient();
    try {
        await client.connect();        // spawns "codex app-server" process
        await client.initialize(...);   // 30s timeout
        const response = await client.listModels({ includeHidden }); // 30s timeout
        ...
    } finally {
        await client.disconnect();     // kills the process
    }
}

The hub-side RPC timeout is MODEL_LIST_RPC_TIMEOUT_MS = 120_000 (120s). When the spawned codex app-server is slow to respond (e.g., OpenAI token refresh stalls), the RPC times out at 120s and returns 500.

Evidence

From hub.log, out of ~168 codex-models requests:

  • 163 (97%) succeeded in 1-5 seconds
  • 5 (3%) timed out at exactly 120s → 500

From runner.log, the codex app-server processes spawned for model listing consistently exit within ~1 second:

[09:21:39.748] List Codex models request
[09:21:39.748] [CodexAppServer] Connected
[09:21:40.147] Codex app-server exited (code=0, signal=null)
[09:21:40.157] [CodexAppServer] Disconnected

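The quick exit (about 400 ms after connecting in the excerpt above) suggests that the app-server sometimes fails silently, exiting before it completes listModels. When that happens, nothing reports the failure to the hub, which then waits out the full 120s RPC timeout.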
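One way to surface such an early exit right away is to race the pending request against the child process's exit. The following is only a sketch under assumptions: `failOnEarlyExit` and `EarlyExitError` are hypothetical names, and `exited` stands for a promise that the client would resolve with the exit code when the spawned process terminates (how CodexAppServerClient exposes that is not shown in this issue).

```typescript
// Hypothetical error type: the child exited before any response arrived.
class EarlyExitError extends Error {
    constructor(readonly code: number | null) {
        super(`app-server exited (code=${code}) before responding`);
        this.name = "EarlyExitError";
    }
}

// Settle with the request's result, or reject as soon as the process exits,
// whichever happens first. `exited` is assumed to resolve with the exit code.
function failOnEarlyExit<T>(
    request: Promise<T>,
    exited: Promise<number | null>,
): Promise<T> {
    const exitFailure = exited.then((code): never => {
        throw new EarlyExitError(code);
    });
    // Promise.race subscribes to both promises, so a late rejection of
    // `exitFailure` (after the request has already won) stays handled.
    return Promise.race([request, exitFailure]);
}
```

With a wrapper like this around the listModels call, a process that dies after ~400 ms would produce an error in ~400 ms instead of leaving the hub to wait for its own timeout.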

Environment

  • hapi: 0.19.0
  • codex-cli: 0.136.0
  • OS: Ubuntu 24.04 (Linux x86_64)
  • codex auth mode: chatgpt (OAuth token with refresh)

Suggested Fix

  1. Cache model list on the runner/machine level (models rarely change, cache for 5-10 minutes)
  2. Reuse a persistent app-server instead of spawn-per-request
  3. Reduce RPC timeout — 120s is excessive for a model list that normally takes <5s
  4. Add faster failure detection — if the app-server exits early, return an error immediately instead of waiting for the full RPC timeout
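Suggestions 1 and 3 could be combined roughly as follows. This is a minimal sketch, not the project's implementation: `cached` and `withTimeout` are hypothetical helpers, and the TTL and timeout values are placeholders to be tuned.

```typescript
type Fetcher<T> = () => Promise<T>;

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    });
    // Clear the timer either way so it does not keep the event loop alive.
    return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Wrap `fetch` so that a successful result is reused for `ttlMs`, and
// concurrent callers share a single in-flight call. Failures are not cached.
function cached<T>(
    fetch: Fetcher<T>,
    ttlMs: number,
    timeoutMs: number,
    now: () => number = Date.now,
): Fetcher<T> {
    let value: { data: T; expiresAt: number } | undefined;
    let inFlight: Promise<T> | undefined;

    return () => {
        if (value && now() < value.expiresAt) return Promise.resolve(value.data);
        if (inFlight) return inFlight;

        inFlight = withTimeout(fetch(), timeoutMs)
            .then((data) => {
                value = { data, expiresAt: now() + ttlMs };
                return data;
            })
            .finally(() => {
                inFlight = undefined;
            });
        return inFlight;
    };
}
```

A caller might then write something like `const listVisibleModels = cached(() => listCodexModels(false), 5 * 60_000, 10_000)`, keeping a separate cache for `includeHidden = true`, since the two calls can return different lists. With this in place, most requests would never spawn an app-server at all, and a stalled spawn would fail after the shorter timeout rather than after 120s.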
