
[WIP] nvme_driver: WORKAROUND: don't defer queue create to after restore#3149

Open
mattkur wants to merge 2 commits into microsoft:release/1.7.2511 from mattkur:no-lazy-ioq-restore

Conversation

Contributor

@mattkur mattkur commented Mar 28, 2026

No description provided.

@mattkur mattkur requested review from a team as code owners March 28, 2026 14:09
Copilot AI review requested due to automatic review settings March 28, 2026 14:09
@github-actions github-actions bot added the release_1.7.2511 Targets the release/1.7.2511 branch. label Mar 28, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adjusts NVMe driver save/restore behavior to avoid deferring I/O queue creation until after restore (workaround for restore-time queue handling issues).

Changes:

  • Restores all saved I/O queues eagerly (previous eager-restore filter is disabled).
  • Alters drain-after-restore state assignment for restored queues (empty queues now get SelfDrained).
  • Disables the “prototype IO queue” (proto_io) restore path by commenting it out.

Comment on lines +961 to 964

```rust
// .filter(|q| {
//     q.queue_data.qid == 1 || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
// })
.flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
```

Copilot AI Mar 28, 2026


The queue-restore selection filter has been commented out, leaving dead/commented code in the restore path. Please remove the commented-out code and update the surrounding restore-step comments to reflect the new behavior (restoring all I/O queues eagerly), or gate the behavior behind an explicit flag/config so this doesn’t ship as commented-out logic.
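A minimal sketch of the flag-gating alternative the review suggests, using hypothetical names (`RestorePolicy`, `eager_restore_all`, `SavedQueue`) that are not the driver's actual types; it only illustrates keeping both behaviors selectable instead of leaving one commented out:

```rust
// Hypothetical sketch: gate the eager-restore workaround behind an explicit
// policy flag instead of commenting out the filter. Names are illustrative.
#[derive(Clone, Copy)]
struct RestorePolicy {
    /// When true, restore every saved I/O queue eagerly; when false,
    /// restore only QID 1 and queues with pending commands.
    eager_restore_all: bool,
}

#[derive(Clone, Debug)]
struct SavedQueue {
    qid: u16,
    pending_cmds: Vec<u32>,
}

fn queues_to_restore(saved: &[SavedQueue], policy: RestorePolicy) -> Vec<SavedQueue> {
    saved
        .iter()
        .filter(|q| policy.eager_restore_all || q.qid == 1 || !q.pending_cmds.is_empty())
        .cloned()
        .collect()
}

fn main() {
    let saved = vec![
        SavedQueue { qid: 1, pending_cmds: vec![] },
        SavedQueue { qid: 2, pending_cmds: vec![7] },
        SavedQueue { qid: 3, pending_cmds: vec![] },
    ];
    // Workaround behavior: every saved queue comes back eagerly.
    assert_eq!(
        queues_to_restore(&saved, RestorePolicy { eager_restore_all: true }).len(),
        3
    );
    // Original behavior: empty queues other than QID 1 are deferred.
    assert_eq!(
        queues_to_restore(&saved, RestorePolicy { eager_restore_all: false }).len(),
        2
    );
    println!("ok");
}
```

An explicit flag also documents intent at the call site, which the commented-out filter does not.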

Comment on lines +1029 to 1048

```rust
//     );
//     max_seen_qid = max_seen_qid.max(q.queue_data.qid);
//     let mem_block = restored_memory
//         .iter()
//         .find(|mem| {
//             mem.len() == q.queue_data.mem_len && q.queue_data.base_pfn == mem.pfns()[0]
//         })
//         .expect("unable to find restored mem block")
//         .to_owned();
//     (
//         q.cpu,
//         ProtoIoQueue {
//             save_state: q.clone(),
//             mem: mem_block,
//             drain_after_restore,
//         },
//     )
// })
// .collect();
```


Copilot AI Mar 28, 2026


The prototype-queue restore block is entirely commented out. As a result, drain_after_restore_for_proto_queues (created above) becomes unused and the restore logic for proto_io is effectively disabled, but the surrounding code still carries the proto-queue scaffolding. Please either fully remove the proto-queue mechanism (and delete the now-unused precreation logic/variables), or re-enable it with the intended behavior—leaving it commented risks warnings and makes the restore semantics hard to reason about.

Suggested change: delete the commented-out block above in its entirety.

Comment on lines 957 to 964

```diff
 worker.io = saved_state
     .worker_data
     .io
     .iter()
-    .filter(|q| {
-        q.queue_data.qid == 1 || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
-    })
+    // .filter(|q| {
+    //     q.queue_data.qid == 1 || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
+    // })
     .flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
```

Copilot AI Mar 28, 2026


This change alters restore semantics (empty queues are now restored eagerly instead of being deferred via proto_io), but there’s no unit/integration test covering save/restore behavior to catch regressions like a queue getting stuck in drain-after-restore. Please add a test in nvme_driver/src/tests.rs that saves state with a mix of empty/non-empty IO queues and verifies restore produces usable issuers/queues (and does not hang waiting for drain completion).
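The suggested test could follow a save/restore round-trip pattern. This is only a shape sketch with stand-in types (`QueueState`, `SavedState`, `save`, `restore` are hypothetical, not the driver's real `nvme_driver` API); the real test would exercise actual issuers and the drain-after-restore machinery:

```rust
// Hypothetical shape for the suggested save/restore test. The real driver's
// saved-state and queue types differ; this sketches only the assertion pattern:
// save a mix of empty and non-empty queues, restore, and verify all come back.
#[derive(Clone, PartialEq, Debug)]
struct QueueState {
    qid: u16,
    pending: Vec<u32>,
}

#[derive(Clone)]
struct SavedState {
    io: Vec<QueueState>,
}

fn save(queues: &[QueueState]) -> SavedState {
    SavedState { io: queues.to_vec() }
}

/// Eager restore under the workaround: every saved queue comes back immediately.
fn restore(saved: &SavedState) -> Vec<QueueState> {
    saved.io.clone()
}

fn main() {
    // Mix of empty and non-empty I/O queues, as the review suggests exercising.
    let queues = vec![
        QueueState { qid: 1, pending: vec![] },
        QueueState { qid: 2, pending: vec![42] },
        QueueState { qid: 3, pending: vec![] },
    ];
    let restored = restore(&save(&queues));
    // All queues must come back usable, including the empty ones.
    assert_eq!(restored, queues);
    println!("restore round-trip ok");
}
```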

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +992 to 1000

```diff
 if !q.queue_data.handler_data.pending_cmds.commands.is_empty() {
     drain_after_restore_template.new_draining()
 } else if q.queue_data.qid == 1 {
     drain_after_restore_for_qid1
         .take()
         .expect("only QID 1 should be empty in eager restore")
         .unwrap_or_else(|| drain_after_restore_template.new_self_drained())
 } else {
-    drain_after_restore_template.new_draining()
+    drain_after_restore_template.new_self_drained()
 },
```

Copilot AI Mar 28, 2026


DrainAfterRestoreBuilder::new_self_drained() is used for empty restored queues. If the last Draining queue finishes and signals before a given empty queue is restored, that queue’s waiter will be created after the uncached signal and can remain stuck in SelfDrained forever (no new guest IO processed on that queue). Use DrainAfterRestoreBuilder::new_for_new_queue() (or pre-create/register all SelfDrained waiters before starting any draining queues) for empty restored queues to avoid missing the drain-complete signal when draining finishes early.
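The hazard here is the classic lost-wakeup ordering problem: a one-shot notification fired before a waiter registers is never seen by that waiter. A latching primitive avoids it, because late registrants observe the already-set state. A minimal illustration using `std::sync` (types here are illustrative, not the driver's `DrainAfterRestoreBuilder`):

```rust
// Illustrative sketch of the fix's underlying pattern: a *latching*
// drain-complete flag that late waiters can still observe. A fire-once
// notification without the latched bool would strand any waiter created
// after the signal.
use std::sync::{Arc, Condvar, Mutex};

struct DrainComplete {
    done: Mutex<bool>,
    cv: Condvar,
}

impl DrainComplete {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            done: Mutex::new(false),
            cv: Condvar::new(),
        })
    }

    /// Called when the last draining queue finishes: latch and wake everyone.
    fn signal(&self) {
        *self.done.lock().unwrap() = true;
        self.cv.notify_all();
    }

    /// Safe even for waiters created after signal(): the flag latches,
    /// so wait() returns immediately instead of blocking forever.
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cv.wait(done).unwrap();
        }
    }
}

fn main() {
    let drain = DrainComplete::new();
    // Draining finishes *before* an "empty queue" registers its waiter...
    drain.signal();
    // ...but because the flag latches, the late waiter still sees completion.
    drain.wait();
    println!("drain observed");
}
```

Pre-creating all `SelfDrained` waiters before starting any draining queue achieves the same guarantee by construction: no waiter can be created after the signal.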

