
Add state backup/restore command for droplet migration#202

Open
benvinegar wants to merge 4 commits into main from feat/state-backup-restore

Conversation


@benvinegar benvinegar commented Mar 15, 2026

Summary

  • add a new baudbot state command with backup and restore subcommands
  • include persistent runtime state in archives: memory, todos, settings, extensions, skills, and subagents
  • include secrets by default (~/.config/.env, ~/.pi/agent/auth.json) with --exclude-secrets option for shareable archives
  • add safe zip extraction (path traversal checks) and restore-time ownership/permission normalization
  • document migration flow in README.md and docs/operations.md
  • add shell test coverage for state round-trip behavior and CLI root requirement
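
For context, the path-traversal guard described in the summary could look roughly like the sketch below. This is illustrative only — the helper name `extract_zip_archive_safe` matches the PR, but the body here is an assumption, not the PR's actual code:

```python
import os
import zipfile


def extract_zip_archive_safe(archive_path, dest_dir):
    """Extract a zip archive, rejecting entries that would escape dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            name = info.filename
            # Reject absolute paths and null bytes outright.
            if name.startswith(("/", "\\")) or "\x00" in name:
                raise ValueError(f"unsafe zip entry: {name!r}")
            # Reject any parent-directory component.
            if ".." in name.split("/"):
                raise ValueError(f"unsafe zip entry: {name!r}")
            # Containment check as a final safety net.
            target = os.path.realpath(os.path.join(dest_root, name))
            if not target.startswith(dest_root + os.sep):
                raise ValueError(f"zip entry escapes destination: {name!r}")
            zf.extract(info, dest_root)
```

The layered checks mirror the review's description: explicit rejection of absolute paths, null bytes, and `..` components, with a `startswith` containment check as the last line of defense.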

Validation

  • bash bin/state.test.sh
  • bash bin/baudbot.test.sh
  • npm run lint:shell
  • npm run test:shell
  • npm test

Real droplet validation (Arch Linux)

  • provisioned two temporary DigitalOcean droplets with bin/ci/droplet.sh
  • seeded memory/todos/custom extension+skill and a secret marker on old droplet
  • ran sudo baudbot state backup /tmp/baudbot-state.zip on old droplet
  • copied archive to new droplet and ran sudo baudbot state restore /tmp/baudbot-state.zip
  • verified restored memory/todos/custom files + secret marker on new droplet
  • verified sensitive file permissions remained locked (600)
  • verified archive checksum consistency across hosts
  • destroyed both droplets + ephemeral SSH keys after test
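
The checksum-consistency step above can be reproduced with a small helper like the one below (a sketch — the PR doesn't specify the exact tooling used; `sha256sum` on each host would work equally well):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large archives aren't loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Run on both droplets after copying the archive; the digests must match.
```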


greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR adds a baudbot state backup/restore command that packages persistent agent state (memory, todos, extensions, skills, subagents, settings, and optionally secrets) into a zip archive to facilitate droplet migrations. The implementation is well-structured — it follows existing register_command conventions, includes safe zip extraction with path-traversal protection, post-restore ownership/permission normalization, and shell test coverage for the round-trip and --exclude-secrets flows.

Key observations:

  • The zip extraction guard in extract_zip_archive_safe correctly rejects absolute paths, null bytes, and .. components, and uses a startswith containment check as a final safety net.
  • write_metadata_file interpolates $host_name, $BAUDBOT_AGENT_USER, and $BAUDBOT_AGENT_HOME directly into a JSON heredoc without escaping, which could produce malformed JSON in unusual environments and cause a hard Python traceback on restore.
  • cmd_restore uses cp -a to merge the archive into the agent home rather than replacing it — any pre-existing files not in the archive are silently kept, which can leave the destination in an inconsistent state without any warning to the operator.
  • test_state_requires_root in baudbot.test.sh captures command output to the hardcoded path /tmp/baudbot-state.out instead of the per-test temp directory, making it fragile under concurrent test execution.

Confidence Score: 3/5

  • Safe to merge with minor improvements — no data-loss or security regressions in the happy path, but the metadata JSON generation and merge-on-restore behavior should be addressed before this is used in production migrations.
  • The core backup/restore logic is functionally correct and the zip path-traversal protection is solid. The two style-level issues (unescaped JSON metadata values, silent merge semantics on restore) are unlikely to trigger in normal deployments but could cause confusing failures or unexpected state in edge cases. The test isolation bug is a real fragility but doesn't affect production behavior.
  • Files to revisit: bin/state.sh (metadata JSON escaping, merge-on-restore semantics) and bin/baudbot.test.sh (hardcoded temp file path in test_state_requires_root)

Important Files Changed

Filename Overview
bin/state.sh New state backup/restore script. Path-traversal protection in extract_zip_archive_safe looks solid. Two concerns: metadata JSON values are shell-interpolated without escaping (could produce malformed JSON for unusual paths/hostnames), and cmd_restore silently merges into an existing agent home rather than replacing it, which can leave stale files.
bin/baudbot.test.sh New test_state_requires_root test uses a hardcoded /tmp/baudbot-state.out path for capturing output instead of the per-test temp dir already set up in the same function, causing a potential concurrency collision and leaving a stale file on failure.
bin/state.test.sh New shell test file covering round-trip backup/restore and --exclude-secrets. Uses isolated temp dirs and proper traps for cleanup. Tests look comprehensive and well-structured.
bin/baudbot Registers the new state command with root required. Change is minimal and follows the existing pattern exactly.
bin/test.sh Adds state.test.sh to the shell test runner. One-liner addition, no issues.
test/shell-scripts.test.mjs Adds Vitest coverage for the state backup/restore shell test suite. Follows existing pattern, no issues.
docs/operations.md Documents the droplet migration workflow with clear examples and appropriate security notes. No issues.
README.md Adds a concise migration snippet to the README. No issues.

Sequence Diagram

```mermaid
sequenceDiagram
    actor Operator
    participant CLI as baudbot CLI
    participant state.sh
    participant Python as Python3
    participant FS as Agent Home (~/.pi / ~/.config)

    Note over Operator,FS: Backup flow (old droplet)
    Operator->>CLI: sudo baudbot state backup /tmp/state.zip
    CLI->>state.sh: exec (requires root)
    state.sh->>state.sh: resolve_archive_path()
    state.sh->>FS: copy_path_if_present() for each STATE_PATH
    state.sh->>state.sh: write_metadata_file() → metadata.json
    state.sh->>Python: create_zip_archive(baudbot-state/ → state.zip)
    state.sh->>FS: chmod 600 state.zip
    state.sh-->>Operator: ✓ state backup created

    Note over Operator,FS: Restore flow (new droplet)
    Operator->>CLI: sudo baudbot stop
    Operator->>CLI: sudo baudbot state restore /tmp/state.zip
    CLI->>state.sh: exec (requires root)
    state.sh->>state.sh: service_running() → must be stopped
    state.sh->>Python: extract_zip_archive_safe() (path-traversal checked)
    state.sh->>Python: validate metadata.json format == "baudbot-state-v1"
    state.sh->>FS: cp -a payload/. → BAUDBOT_AGENT_HOME/ (merge)
    state.sh->>FS: restore_ownership_if_root() chown -R
    state.sh->>FS: restore_secure_permissions() chmod 600
    state.sh-->>Operator: ✓ state restored
    Operator->>CLI: sudo baudbot start
```
Last reviewed commit: 45e3840

Comment on lines +154 to +159
if PATH="$fakebin:$PATH" BAUDBOT_ROOT="$REPO_ROOT" bash "$CLI" state backup >/tmp/baudbot-state.out 2>&1; then
return 1
fi

out="$(cat /tmp/baudbot-state.out)"
rm -f /tmp/baudbot-state.out

Hardcoded temp file path causes test isolation failure

The test declares local out (line 136) but then redirects command output to the hardcoded global path /tmp/baudbot-state.out instead of using the already-available $tmp directory. If two test processes run concurrently (or a previous run left the file behind), they will clobber each other's output and the grep on line 160 could read stale data from a prior run.

Suggested change
if PATH="$fakebin:$PATH" BAUDBOT_ROOT="$REPO_ROOT" bash "$CLI" state backup >/tmp/baudbot-state.out 2>&1; then
return 1
fi
out="$(cat /tmp/baudbot-state.out)"
rm -f /tmp/baudbot-state.out
if PATH="$fakebin:$PATH" BAUDBOT_ROOT="$REPO_ROOT" bash "$CLI" state backup >"$tmp/state.out" 2>&1; then
return 1
fi
out="$(cat "$tmp/state.out")"

Member Author


Fixed in commit c0ccf59. I changed the test to write/read output via the per-test temp directory instead of a shared /tmp file. I applied the same isolation cleanup pattern to the nearby debug/broker root-requirement tests as well.

Responded by baudbot-dev-agent using openai/gpt-5.

bin/state.sh Outdated
Comment on lines +120 to +129
cat > "$metadata_file" <<EOF
{
"format": "$STATE_FORMAT",
"created_at": "$now",
"host": "$host_name",
"agent_user": "$BAUDBOT_AGENT_USER",
"agent_home": "$BAUDBOT_AGENT_HOME",
"include_secrets": $([ "$include_secrets" = "1" ] && echo true || echo false)
}
EOF

Shell variables interpolated into JSON without escaping

$host_name, $BAUDBOT_AGENT_USER, and $BAUDBOT_AGENT_HOME are embedded directly into the JSON heredoc. While typical Linux hostnames and usernames don't contain JSON-special characters, $BAUDBOT_AGENT_HOME is a file-system path and could (in unusual environments) contain a backslash or other character that would produce malformed JSON. When cmd_restore later calls json.load() on this file, a parse error will cause a hard failure with a Python traceback rather than a clear diagnostic message.

Consider using Python to write the metadata instead, or at minimum printf-escaping the values:

python3 - "$metadata_file" "$STATE_FORMAT" "$now" "$host_name" "$BAUDBOT_AGENT_USER" "$BAUDBOT_AGENT_HOME" "$include_secrets" <<'PY'
import json, sys
path, fmt, created, host, user, home, secrets = sys.argv[1:]
with open(path, "w") as f:
    json.dump({"format": fmt, "created_at": created, "host": host,
               "agent_user": user, "agent_home": home,
               "include_secrets": secrets == "1"}, f, indent=2)
    f.write("\n")
PY

Member Author


Agreed and fixed in commit c0ccf59. Metadata JSON is now written via Python json.dump instead of shell heredoc interpolation, so values are correctly escaped before restore-time parsing.

Responded by baudbot-dev-agent using openai/gpt-5.

Comment on lines +374 to +375
mkdir -p "$BAUDBOT_AGENT_HOME"
cp -a "$payload_root/." "$BAUDBOT_AGENT_HOME/"


Restore merges into agent home rather than replacing it

cp -a "$payload_root/." "$BAUDBOT_AGENT_HOME/" merges the archive contents on top of whatever already exists in the agent home. Any files present on the new droplet that are not in the archive (e.g., a fresh settings.json written by deploy) are silently kept alongside the restored files. This can lead to an inconsistent state that is difficult to diagnose.

The docs describe the operation as a "restore" which implies replacement. If merge semantics are intentional, at least surface a warning to the operator:

  # Warn if the agent home already has content
  if [ -n "$(ls -A "$BAUDBOT_AGENT_HOME" 2>/dev/null)" ]; then
    bb_warn "agent home is not empty; existing files not in the archive will be preserved"
  fi
  cp -a "$payload_root/." "$BAUDBOT_AGENT_HOME/"

Member Author


Good point. Merge semantics are intentional so restore only overlays managed state paths, but I added an explicit warning when the agent home is non-empty before copying archive contents (commit c0ccf59). That makes the behavior visible to operators.

Responded by baudbot-dev-agent using openai/gpt-5.

@benvinegar
Member Author

Correction to my previous note (shell ate inline code formatting):

Addressed all review comments and pushed fixes:

  • preserved zip entry mode bits on restore + added executable-permission regression check
  • switched metadata JSON writing to Python json.dump to avoid shell interpolation issues
  • added explicit warning for merge-style restore into non-empty agent home
  • fixed test isolation in bin/baudbot.test.sh (no shared /tmp output files)

Also fixed CI breakages discovered in integration runs:

  • added varlock compatibility symlink from ~/.varlock/bin/varlock to ~/.config/varlock/bin/varlock during setup
  • updated inference smoke model to anthropic/claude-haiku-4-5

PR checks are now green, including arch + ubuntu integration.

Comment on lines +157 to +159
os.makedirs(archive_dir, exist_ok=True)

with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zip_file:


Bug: The state restore process creates directories with default permissions (e.g., 755) instead of the required secure permissions (700), exposing sensitive data to other group users.
Severity: CRITICAL

Suggested Fix

Modify the backup process to explicitly write directory entries to the zip archive, preserving their original permissions. Alternatively, update the restore_secure_permissions function to recursively apply the correct 700 permissions to the ~/.pi directory and its critical subdirectories after the archive is extracted.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: bin/state.sh#L157-L159

Potential issue: The state restore process creates directories with incorrect
permissions. The backup function `create_zip_archive` does not save directory metadata.
During restore, `extract_zip_archive_safe` creates directories using `os.makedirs()`,
which applies default permissions (e.g., `755`) based on the system's umask. The
subsequent `restore_secure_permissions` function only corrects permissions for specific
files, not directories. As a result, security-critical directories like `~/.pi` and
`~/.pi/agent` are left with world-readable permissions instead of the required `700`,
allowing other users in the same group to access sensitive agent data.
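
A minimal sketch of the second suggested fix — recursively normalizing directory and file modes after extraction. This is illustrative only and is not the PR's actual `restore_secure_permissions` implementation; the 700/owner-only policy is an assumption based on the finding above:

```python
import os
import stat


def restore_secure_permissions_recursive(root):
    """Tighten a restored state tree: directories to 700, files to owner-only bits."""
    os.chmod(root, 0o700)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            os.chmod(os.path.join(dirpath, name), 0o700)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # don't follow symlinks when tightening modes
            mode = stat.S_IMODE(os.stat(path).st_mode)
            # Strip group/other bits while preserving the owner's bits,
            # so executables stay executable for the agent user.
            os.chmod(path, mode & 0o700)
```

Run against `BAUDBOT_AGENT_HOME` after `extract_zip_archive_safe`, this would close the gap where `os.makedirs` leaves umask-default directory modes.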

