[WIP] nvme_driver: WORKAROUND: don't defer queue create to after restore #3149

mattkur wants to merge 2 commits into microsoft:release/1.7.2511
Conversation
Pull request overview
Adjusts NVMe driver save/restore behavior to avoid deferring I/O queue creation until after restore (workaround for restore-time queue handling issues).
Changes:
- Restores all saved I/O queues eagerly (the previous eager-restore filter is disabled).
- Changes the drain-after-restore state assigned to restored queues (empty queues now get SelfDrained).
- Disables the "prototype IO queue" (proto_io) restore path by commenting it out.
```rust
// .filter(|q| {
//     q.queue_data.qid == 1 || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
// })
.flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
```
The queue-restore selection filter has been commented out, leaving dead/commented code in the restore path. Please remove the commented-out code and update the surrounding restore-step comments to reflect the new behavior (restoring all I/O queues eagerly), or gate the behavior behind an explicit flag/config so this doesn’t ship as commented-out logic.
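One way to avoid shipping commented-out logic, as the review suggests, is to gate the restore selection behind an explicit policy value. The sketch below is illustrative only: `SavedQueue`, `RestorePolicy`, and `queues_to_restore` are hypothetical stand-ins, not the real nvme_driver types.

```rust
// Hedged sketch: make eager vs. filtered restore an explicit, named choice
// instead of a commented-out filter. All names here are illustrative.

#[derive(Clone)]
struct SavedQueue {
    qid: u16,
    pending_cmds: usize,
}

#[derive(Clone, Copy)]
enum RestorePolicy {
    /// Restore every saved I/O queue immediately (this PR's workaround).
    Eager,
    /// Restore only QID 1 and queues with pending commands; defer the rest.
    Deferred,
}

fn queues_to_restore(saved: &[SavedQueue], policy: RestorePolicy) -> Vec<SavedQueue> {
    saved
        .iter()
        .filter(|q| match policy {
            RestorePolicy::Eager => true,
            RestorePolicy::Deferred => q.qid == 1 || q.pending_cmds > 0,
        })
        .cloned()
        .collect()
}

fn main() {
    let saved = vec![
        SavedQueue { qid: 1, pending_cmds: 0 },
        SavedQueue { qid: 2, pending_cmds: 0 },
        SavedQueue { qid: 3, pending_cmds: 4 },
    ];
    // Eager restore keeps all three; deferred keeps QID 1 and the queue
    // with pending commands.
    assert_eq!(queues_to_restore(&saved, RestorePolicy::Eager).len(), 3);
    assert_eq!(queues_to_restore(&saved, RestorePolicy::Deferred).len(), 2);
    println!("ok");
}
```

With a policy enum, the workaround reads as a deliberate branch that can later be removed in one place, rather than dead code the next reader has to interpret.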
```rust
// );
// max_seen_qid = max_seen_qid.max(q.queue_data.qid);
// let mem_block = restored_memory
//     .iter()
//     .find(|mem| {
//         mem.len() == q.queue_data.mem_len && q.queue_data.base_pfn == mem.pfns()[0]
//     })
//     .expect("unable to find restored mem block")
//     .to_owned();
// (
//     q.cpu,
//     ProtoIoQueue {
//         save_state: q.clone(),
//         mem: mem_block,
//         drain_after_restore,
//     },
// )
// })
// .collect();
```
The prototype-queue restore block is entirely commented out. As a result, drain_after_restore_for_proto_queues (created above) becomes unused and the restore logic for proto_io is effectively disabled, but the surrounding code still carries the proto-queue scaffolding. Please either fully remove the proto-queue mechanism (and delete the now-unused precreation logic/variables), or re-enable it with the intended behavior—leaving it commented risks warnings and makes the restore semantics hard to reason about.
```diff
 worker.io = saved_state
     .worker_data
     .io
     .iter()
-    .filter(|q| {
-        q.queue_data.qid == 1 || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
-    })
+    // .filter(|q| {
+    //     q.queue_data.qid == 1 || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
+    // })
     .flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
```
This change alters restore semantics (empty queues are now restored eagerly instead of being deferred via proto_io), but there’s no unit/integration test covering save/restore behavior to catch regressions like a queue getting stuck in drain-after-restore. Please add a test in nvme_driver/src/tests.rs that saves state with a mix of empty/non-empty IO queues and verifies restore produces usable issuers/queues (and does not hang waiting for drain completion).
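The requested test could take roughly the following shape. This is a sketch only: the real test belongs in nvme_driver/src/tests.rs and would drive the actual driver's save/restore harness, while `MockQueueState`, `RestoredQueue`, and `restore_queues` below are hypothetical stand-ins modeling the intended semantics.

```rust
// Sketch of a save/restore test over a mix of empty and non-empty I/O
// queues. All types and functions here are illustrative mocks.

#[derive(Clone, Debug, PartialEq)]
enum DrainState {
    Draining,
    SelfDrained,
}

#[derive(Clone)]
struct MockQueueState {
    qid: u16,
    pending_cmds: usize,
}

struct RestoredQueue {
    qid: u16,
    drain: DrainState,
}

/// Eager restore: every saved queue comes back; queues with pending
/// commands drain, empty queues are marked self-drained.
fn restore_queues(saved: &[MockQueueState]) -> Vec<RestoredQueue> {
    saved
        .iter()
        .map(|q| RestoredQueue {
            qid: q.qid,
            drain: if q.pending_cmds > 0 {
                DrainState::Draining
            } else {
                DrainState::SelfDrained
            },
        })
        .collect()
}

fn main() {
    // Save state with a mix of empty and non-empty I/O queues.
    let saved = vec![
        MockQueueState { qid: 1, pending_cmds: 0 },
        MockQueueState { qid: 2, pending_cmds: 3 },
        MockQueueState { qid: 3, pending_cmds: 0 },
    ];
    let restored = restore_queues(&saved);

    // Every saved queue must come back (eager restore), and empty queues
    // must not be left waiting on a drain that will never complete.
    assert_eq!(restored.len(), saved.len());
    assert_eq!(restored[0].drain, DrainState::SelfDrained);
    assert_eq!(restored[1].drain, DrainState::Draining);
    assert_eq!(restored[2].drain, DrainState::SelfDrained);
    println!("ok");
}
```

A driver-level version of this would additionally issue I/O on each restored queue and assert it completes, which is what catches a queue stuck in drain-after-restore.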
```diff
 if !q.queue_data.handler_data.pending_cmds.commands.is_empty() {
     drain_after_restore_template.new_draining()
 } else if q.queue_data.qid == 1 {
     drain_after_restore_for_qid1
         .take()
         .expect("only QID 1 should be empty in eager restore")
         .unwrap_or_else(|| drain_after_restore_template.new_self_drained())
 } else {
-    drain_after_restore_template.new_draining()
+    drain_after_restore_template.new_self_drained()
 },
```
DrainAfterRestoreBuilder::new_self_drained() is used for empty restored queues. If the last Draining queue finishes and signals before a given empty queue is restored, that queue’s waiter will be created after the uncached signal and can remain stuck in SelfDrained forever (no new guest IO processed on that queue). Use DrainAfterRestoreBuilder::new_for_new_queue() (or pre-create/register all SelfDrained waiters before starting any draining queues) for empty restored queues to avoid missing the drain-complete signal when draining finishes early.
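The hazard the reviewer describes is an edge-triggered signal being missed by a late-registered waiter. A level-triggered design (waiters check shared state under a lock rather than waiting for a one-shot notification) avoids it, which is also why pre-registering all waiters before any drain starts works. The sketch below is illustrative, not the real `DrainAfterRestoreBuilder`: `DrainTracker` is a hypothetical stand-in built on `std::sync` primitives.

```rust
// Hedged sketch of the missed-signal hazard and a level-triggered fix.
// DrainTracker is illustrative; the real driver uses its own builder types.
use std::sync::{Arc, Condvar, Mutex};

struct DrainTracker {
    outstanding: Mutex<usize>, // draining queues not yet finished
    cv: Condvar,
}

impl DrainTracker {
    fn new(draining: usize) -> Arc<Self> {
        Arc::new(Self {
            outstanding: Mutex::new(draining),
            cv: Condvar::new(),
        })
    }

    /// Called when a draining queue finishes.
    fn drain_done(&self) {
        let mut n = self.outstanding.lock().unwrap();
        *n -= 1;
        if *n == 0 {
            self.cv.notify_all();
        }
    }

    /// Level-triggered wait: even a waiter created *after* the last
    /// drain_done() observes outstanding == 0 and returns immediately,
    /// so it cannot get stuck the way an edge-triggered waiter can.
    fn wait_drained(&self) {
        let mut n = self.outstanding.lock().unwrap();
        while *n > 0 {
            n = self.cv.wait(n).unwrap();
        }
    }
}

fn main() {
    let tracker = DrainTracker::new(1);
    tracker.drain_done(); // last draining queue finishes before the waiter exists
    tracker.wait_drained(); // late waiter still unblocks: state is checked, not an edge
    println!("ok");
}
```

Either fix in the review comment achieves this ordering guarantee: `new_for_new_queue()` presumably avoids waiting on the drain signal at all, while pre-registering all SelfDrained waiters ensures none is created after the signal fires.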