Skip to content

fix: prevent orphaned BlueZ discovery sessions in continuous mode#81

Open
KristianP26 wants to merge 12 commits intomainfrom
fix/bluez-orphaned-discovery
Open

fix: prevent orphaned BlueZ discovery sessions in continuous mode#81
KristianP26 wants to merge 12 commits intomainfrom
fix/bluez-orphaned-discovery

Conversation

@KristianP26
Copy link
Copy Markdown
Owner

@KristianP26 KristianP26 commented Mar 5, 2026

Summary

Fixes #80: BlueZ discovery sessions become orphaned in continuous mode, causing persistent "Operation already in progress" errors that survive container restarts.

Root cause: Each scan cycle creates and destroys a D-Bus connection. When destroy() closes the socket immediately after stopDiscovery(), BlueZ may not finish session teardown, leaving an orphaned discovery session owned by a dead client. No new client can stop it, and all recovery tiers fail.

Two targeted fixes in handler-node-ble.ts:

  1. Quiesce delay before destroy: add a 500ms delay between stopDiscovery() and destroy() in the finally block, giving BlueZ time to clean up the session before the D-Bus socket closes. This prevents orphaned sessions in normal continuous mode operation.

  2. Fresh D-Bus connection in Tier 4 btmgmt recovery: after the kernel-level adapter reset, destroy the stale D-Bus connection and create a completely new one. Previously the code reused the invalidated connection, which always failed with "Operation already in progress" because the old proxies pointed to a destroyed /org/bluez/hci0 object.

Test plan

  • npx tsc --noEmit passes
  • npm run lint passes
  • npm test passes (1097/1097)
  • CI (typecheck-and-test on Node 20/22/24)
  • Verify on Docker with continuous mode: restart container while scale is powered on, confirm discovery recovers on next cycle

Two changes to handler-node-ble.ts:

1. Add a quiesce delay between stopDiscovery() and destroy() in the
   finally block. Without this, bluetoothd may not finish session
   teardown before the D-Bus socket closes, leaving an orphaned
   discovery that blocks future scans with "Operation already in
   progress".

2. In Tier 4 btmgmt recovery, destroy the stale D-Bus connection and
   create a fresh one. The kernel-level adapter reset causes BlueZ to
   recreate /org/bluez/hci0, invalidating the existing D-Bus proxies.
   Previously the code reused the stale connection, which always failed.

Closes #80
Copilot AI review requested due to automatic review settings March 5, 2026 11:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses BlueZ discovery sessions getting orphaned in continuous mode by ensuring discovery teardown completes before D-Bus disconnect, and by recreating the D-Bus connection after kernel-level adapter resets so subsequent discovery attempts don’t reuse stale proxies.

Changes:

  • Add a mutable DbusConnection wrapper so Tier 4 (btmgmt) recovery can replace the D-Bus connection and proxies in-place.
  • Introduce a post-stopDiscovery() quiesce delay before destroying the D-Bus connection to prevent orphaned discovery sessions.
  • Update README license badge URL to force-refresh.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/ble/handler-node-ble.ts Adds D-Bus connection refresh logic during btmgmt recovery and delays D-Bus destroy after stopping discovery to avoid orphaned BlueZ sessions.
README.md Updates the license badge URL query param (cache bust).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/ble/handler-node-ble.ts Outdated
Comment on lines +540 to +551
// D-Bus socket. Without this delay, bluetoothd may not finish session
// teardown, leaving an orphaned discovery that blocks future scans.
await sleep(POST_DISCOVERY_QUIESCE_MS);
conn.destroy();
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await sleep(POST_DISCOVERY_QUIESCE_MS) runs even when btAdapter is null (e.g., if defaultAdapter() fails), adding an unconditional 500ms delay to error paths. Consider only sleeping when an adapter exists and/or when stopDiscovery() was actually attempted (or when discovery was started).

Copilot uses AI. Check for mistakes.
Comment thread src/ble/handler-node-ble.ts Outdated
Comment on lines +569 to +573
const conn: DbusConnection = { bluetooth: null!, destroy: () => {} };
try {
({ bluetooth, destroy } = NodeBle.createBluetooth());
const created = NodeBle.createBluetooth();
conn.bluetooth = created.bluetooth;
conn.destroy = created.destroy;
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: conn uses bluetooth: null! even though a valid bluetooth is always assigned immediately after createBluetooth() succeeds. Initializing conn from the created object would avoid the non-null assertion and reduce the chance of accidental runtime use before assignment.

Copilot uses AI. Check for mistakes.
Comment thread src/ble/handler-node-ble.ts Outdated
Comment on lines +631 to +639
}
destroy();
await sleep(POST_DISCOVERY_QUIESCE_MS);
conn.destroy();
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await sleep(POST_DISCOVERY_QUIESCE_MS) in scanDevices runs even when discovery was never started / btAdapter is null, which unnecessarily delays failures and short scans. Consider guarding the sleep behind if (btAdapter) (or a discoveryStarted flag) so only real discovery teardowns pay the quiesce delay.

Copilot uses AI. Check for mistakes.
Comment thread src/ble/handler-node-ble.ts Outdated
Comment on lines +382 to +386
const conn: DbusConnection = { bluetooth: null!, destroy: () => {} };
try {
({ bluetooth, destroy } = NodeBle.createBluetooth());
const created = NodeBle.createBluetooth();
conn.bluetooth = created.bluetooth;
conn.destroy = created.destroy;
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conn is initialized with bluetooth: null!, but NodeBle.createBluetooth() is called immediately after and the function throws on failure. Consider initializing conn from created directly (or using a let conn: DbusConnection assigned in the try) to avoid the non-null assertion and make the initialization safer/clearer.

Copilot uses AI. Check for mistakes.
…sessions (#80)

Each scan cycle previously created and destroyed a D-Bus connection.
BlueZ tracks discovery sessions per D-Bus client, and destroying the
connection while bluetoothd is still cleaning up leaves an orphaned
session that blocks all future scans with "Operation already in progress".

The fix keeps a single D-Bus connection alive across scan cycles so the
same client owns the discovery session. stopDiscovery() always works
because the owning client is still alive. The connection is automatically
reset if it becomes stale (e.g. after bluetoothd restart or btmgmt reset).

scanDevices() (one-shot operations like diagnose/scan) keeps its own
short-lived connection to avoid interfering with continuous mode.
…ers (#80)

Keep both the D-Bus connection and adapter proxy alive across scan cycles,
and leave discovery running between idle timeouts instead of stopping and
restarting it every cycle. This minimizes the BlueZ state transitions that
trigger the well-known Discovering property desync bug (bluez/bluez#807,
bluez/bluer#47) where StartDiscovery returns "Operation already in progress"
while StopDiscovery returns "No discovery started".

Recovery tiers 5 (rfkill block/unblock) and 6 (systemctl restart bluetooth)
provide deeper resets when the adapter-level tiers fail. Docker docs updated
to mount /dev/rfkill for RF-level recovery access.
After GATT connect/disconnect on Broadcom adapters (RPi), BlueZ can
enter a zombie state where Discovering=true but the HCI controller is
no longer scanning. The persistent D-Bus connection survives but
discovery never finds new devices.

Reset the D-Bus connection and adapter proxy after every GATT operation
(successful or failed). This guarantees a clean discovery state for the
next scan cycle. For idle cycles with no connection, discovery is kept
running to avoid the start/stop desync bug.
)

D-Bus reset alone is not enough to recover from the zombie discovery
state on Broadcom adapters. After 2-3 GATT cycles, bluetoothd keeps
reporting Discovering=true while the HCI controller no longer runs LE
scan, so a fresh D-Bus client + fresh startDiscovery() succeed on paper
but never sees any advertisements. btmgmt power off/on hits the
controller directly and clears the zombie, restoring scans for the next
cycle. Only CAP_NET_ADMIN is needed so it works regardless of container
USER.

Also:
- docker-compose.example.yml maps /dev/rfkill so Tier 5 recovery is
  available to existing users without re-reading the docs.
- Troubleshooting page gains a dedicated BlueZ zombie section with the
  NOBLE_DRIVER=stoprocent escape hatch for users whose BlueZ keeps
  getting stuck despite the recovery tiers.

Refs bluez/bluez#807, bluez/bluer#47.
…scovery

# Conflicts:
#	CHANGELOG.md
#	docs/changelog.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Docker container BLE errors

2 participants