[core] Queue::drop is acquiring the snatch guard as well
#8691
base: trunk
```diff
@@ -854,7 +854,11 @@ impl Device {
                 user_closures.mappings,
                 user_closures.blas_compact_ready,
                 queue_empty,
-            ) = queue_result
+            ) = queue_result;
+            // Queue::drop is acquiring the snatch lock as well
+            drop(snatch_guard);
+        } else {
+            drop(snatch_guard);
         };

         // Based on the queue empty status, and the current finished submission index, determine the result of the poll.
```
```diff
@@ -909,7 +913,6 @@ impl Device {

         // Don't hold the locks while calling release_gpu_resources.
         drop(fence);
-        drop(snatch_guard);
```
Member

question: I think it's fine—but I'm not positive—to no longer be locking the snatch lock when we take the device lost closure. Tagging in @cwfitzgerald to double-check my conclusion here. Does that sound right, Connor?
Member

I don't see any reason why this would be problematic. The snatch lock doesn't guard the device lost closure in any way.
```diff
         if should_release_gpu_resource {
             self.release_gpu_resources();
```
question: It seems like the potential for `Queue::drop` getting called here is when:

- `Device::maintain` is called on this thread (let's call it thread 1) while a last strong ref. to the queue is held on another thread (let's call that thread 2);
- thread 2 drops its strong ref. to the `queue`, and thread 1 follows the branch to call `queue.maintain(…)`.

Does that sound right? Are you using `wgpu` from multiple threads?
Correct - thread 1 is calling `Instance::poll_all(true)` in a loop, and other threads 2+ create/drop queues.
Could we add a comment (like eric's) to say why the guard is dropped there?

This kind of thing is really hard to reason about and the effects are deeply non-local, so retaining as much of the knowledge as we can about the decisions that were made is super important.