
Conversation

@rattus128 (Contributor)

This is hopefully the full root-cause fix for:

#10891

Primary commit message:

commit 53bd09926cf0f680d0fd67afcb2d0a289d71940d
Author: Rattus <[email protected]>
Date:   Sun Dec 7 21:23:05 2025 +1000

    Account for dequantization and type-casts in offload costs
    
    When measuring the cost of offload, identify weights that need a type
    change or dequantization and add the size of the conversion result
    to the offload cost.
    
    This is mutually exclusive with lowvram patches, which already
    have a large conservative estimate that won't overlap the dequant
    cost, so don't double count.
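
To make the accounting concrete, here is a minimal sketch of the idea in Python (hypothetical names like weight_offload_cost; not the actual ComfyUI implementation):

import torch

def weight_offload_cost(weight, compute_dtype, has_lowvram_patches):
    # Base cost: the bytes the stored weight itself occupies.
    cost = weight.numel() * weight.element_size()
    if has_lowvram_patches:
        # lowvram patches already carry a large conservative estimate,
        # so the dequant/cast cost must not be counted twice.
        return cost
    if weight.dtype != compute_dtype:
        # A cast or dequantization materializes a second tensor at the
        # compute dtype; add the size of that conversion result.
        cost += weight.numel() * torch.empty((), dtype=compute_dtype).element_size()
    return cost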

Example test case:

RTX3060 Flux2 workflow with ModelComputeDtype node set to fp32

Before:

Requested to load Flux2TEModel_
loaded partially; 10508.42 MB usable, 10025.59 MB loaded, 7155.01 MB offloaded, 480.00 MB buffer reserved, lowvram patches: 0
!!! Exception during processing !!! Allocation on device 
...
    x = self.mlp(x)
        ^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/text_encoders/llama.py", line 327, in forward
    return self.down_proj(self.activation(self.gate_proj(x)) * self.up_proj(x))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ops.py", line 608, in forward
    return self.forward_comfy_cast_weights(input, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ops.py", line 599, in forward_comfy_cast_weights
    weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ops.py", line 127, in cast_bias_weight
    weight = weight.dequantize()
             ^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/quant_ops.py", line 197, in dequantize
    return LAYOUTS[self._layout_type].dequantize(self._qdata, **self._layout_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/quant_ops.py", line 431, in dequantize
    plain_tensor = torch.ops.aten._to_copy.default(qdata, dtype=orig_dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv2/lib/python3.12/site-packages/torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device 

Got an OOM, unloading all loaded models.
Prompt executed in 7.47 seconds
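
The missing term is easy to quantify: dequantizing to fp32 allocates four bytes per element on top of the stored weight, and the old estimate never counted that allocation. For illustration only (made-up shape, assuming a 1-byte-per-element quantized layout):

import torch

# Illustrative numbers: one 4096x4096 weight stored at 1 byte/element.
numel = 4096 * 4096
stored = numel * 1  # what the old estimate counted
dequant = numel * torch.empty((), dtype=torch.float32).element_size()  # the allocation that OOMed
print(f"{stored / 2**20:.0f} MiB stored, {dequant / 2**20:.0f} MiB extra at fp32")
# -> 16 MiB stored, 64 MiB extra at fp32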

After:

[image: screenshot of the result after the fix]

Other commit messages in this PR:

Handle the case where the attribute doesn't exist by returning a static
sentinel (distinct from None). If the sentinel is passed in as the set
value, del the attr.

So that the loader can know the size of weights for dequant accounting.
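
The sentinel behaviour described above can be sketched as follows (hypothetical helper names, not the actual ComfyUI code):

_MISSING = object()  # static sentinel, distinct from None

def get_attr(obj, name):
    # A missing attribute reads as the sentinel instead of raising.
    return getattr(obj, name, _MISSING)

def set_attr(obj, name, value):
    # Setting the sentinel back means "the attribute did not exist":
    # delete it rather than storing the sentinel object.
    if value is _MISSING:
        if hasattr(obj, name):
            delattr(obj, name)
    else:
        setattr(obj, name, value)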
@Balladie (Contributor) commented Dec 7, 2025

Confirmed that it resolves #10891 (comment).
