pg0 info/running reports a postgres instance as healthy with no real connectivity check, masking zombie processes after shared-memory loss #24

@swarthyplacebo

Submitted with the assistance of Sally Sonnet

Summary

pg0 info and pg0 list (and the running property on the Python Pg0 class) report a postgres instance as "running" based only on process-alive and port-listening state; there is no real connectivity check (e.g. SELECT 1). As a result, a postgres server that is still visible in ps and bound to its port, but unable to serve any query, is reported as healthy indefinitely.
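To illustrate the gap, here is a minimal Python sketch of a port-only liveness check. It is not pg0's actual implementation (which I have not read); port_is_listening and demo are hypothetical names. The listener below never speaks the Postgres protocol, yet the check passes, because the kernel completes the TCP handshake for any listening socket.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Liveness-style check: True as soon as a TCP connection is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def demo():
    """Return (result while a dummy listener is open, result after it closes)."""
    # A bare listener standing in for a server that cannot serve queries.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    host, port = srv.getsockname()
    try:
        while_open = port_is_listening(host, port)
    finally:
        srv.close()
    after_close = port_is_listening(host, port)
    return while_open, after_close
```

A check of this shape cannot tell a healthy server from one that has lost its shared memory; only a real query can.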

How this happens in practice

On Linux hosts where systemd-logind's default RemoveIPC=yes is in effect, a user's shared-memory segments are reaped when that user's last login session ends (if the user has no loginctl linger enabled). An embedded pg0 postgres instance running as that user can therefore lose its shared memory while the OS process itself keeps running. Any subsequent real connection attempt fails:

FATAL: could not open shared memory segment "/PostgreSQL.NNNNNNNNNN": No such file or directory

But pg0 info --name <instance> and pg0 list continue to report the instance as running with a valid-looking connection URI, because the check never actually opens a connection — it appears to only check that the process exists and the port is listening.
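For tooling that wants to recognize this specific failure mode, a small sketch that matches the error text quoted above may be enough. The function name shared_memory_lost is hypothetical, and the pattern is based only on the single message observed in this report, so other PostgreSQL versions or locales may word it differently.

```python
import re

# Pattern derived from the one error message seen in this report.
_SHM_LOST = re.compile(
    r'could not open shared memory segment "(/PostgreSQL\.[^"]+)": '
    r"No such file or directory"
)


def shared_memory_lost(error_text):
    """Return the missing segment name if the text shows this failure, else None."""
    match = _SHM_LOST.search(error_text)
    return match.group(1) if match else None
```

Calling it on the stderr of a failed psql attempt lets a wrapper distinguish "instance needs a restart" from unrelated connection errors.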

Why this matters

For any application that calls pg0.info().running (or the CLI equivalent) to decide whether to skip startup/use an existing instance — as hindsight-api's embedded-Postgres manager does — a zombie instance like this is invisible. The application happily reuses (or tries to reuse) a connection to an instance that can never actually serve a query, and the resulting failure surfaces much later, in application-level code, with no indication that pg0 itself already "knew" the instance was unhealthy.

Verification performed

  • Reproduced directly: removed a healthy instance's shared memory out from under it (via the systemd RemoveIPC interaction above), confirmed the process was still alive (ps), confirmed pg0 list still reported (running), and confirmed a direct psql connection failed with the shared-memory error shown above.
  • After restarting the instance with pg0 stop + pg0 start (getting a fresh shared-memory segment), the same instance correctly served real queries again.

Suggested fix

Have pg0 info/pg0 list/the running property perform a lightweight real connectivity check (e.g. attempt a trivial query via the bundled psql, or open a raw libpq connection) rather than relying solely on process-alive + port-listening state.
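A rough sketch of what such a probe could look like from the Python side, assuming a psql binary is reachable and the instance's connection URI is known. The function name can_serve_queries, the psql argument (a path to the binary), and the injectable run parameter are all illustrative and not part of pg0.

```python
import os
import subprocess


def can_serve_queries(uri, psql="psql", timeout=5.0, run=subprocess.run):
    """Return True only if a trivial query round-trips through the server."""
    # PGCONNECT_TIMEOUT bounds the libpq connection attempt (whole seconds).
    env = dict(os.environ, PGCONNECT_TIMEOUT=str(max(1, int(timeout))))
    try:
        result = run(
            # -X: skip psqlrc, -w: never prompt for a password,
            # -A -t: unaligned, tuples only, so stdout is just "1".
            [psql, "-X", "-w", "-A", "-t", "-d", uri, "-c", "SELECT 1"],
            capture_output=True,
            text=True,
            timeout=timeout + 1,  # hard upper bound on the whole probe
            env=env,
        )
    except (OSError, subprocess.TimeoutExpired):
        # psql missing, not executable, or the probe hung.
        return False
    return result.returncode == 0 and result.stdout.strip() == "1"
```

A caller could then treat an instance as usable only when both conditions hold, for example reported_running and can_serve_queries(uri), where reported_running is whatever pg0 reports today. Keeping the probe behind a short timeout should keep pg0 list responsive even when an instance hangs.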

Environment

  • pg0-embedded 0.14.2 (Python SDK), pg0 CLI 0.14.2
  • PostgreSQL 18.1.0 (bundled)
  • Host: Debian 12 bookworm, x86_64
