
Commit c1f0cd3

Merge branch 'main' into modular-z

2 parents: 0bb53ba + 54fa074

17 files changed: +1293 additions, −69 deletions


docs/source/en/_toctree.yml

Lines changed: 1 addition & 1 deletion

@@ -651,7 +651,7 @@
     - local: api/pipelines/wuerstchen
       title: Wuerstchen
     - local: api/pipelines/z_image
-      title: Z-Image
+      title: Z-Image
     title: Image
   - sections:
     - local: api/pipelines/allegro

docs/source/en/api/pipelines/z_image.md

Lines changed: 34 additions & 1 deletion

@@ -26,8 +26,41 @@ specific language governing permissions and limitations under the License.

 Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably within 16G VRAM consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

+## Image-to-image
+
+Use [`ZImageImg2ImgPipeline`] to transform an existing image based on a text prompt.
+
+```python
+import torch
+from diffusers import ZImageImg2ImgPipeline
+from diffusers.utils import load_image
+
+pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+init_image = load_image(url).resize((1024, 1024))
+
+prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
+image = pipe(
+    prompt,
+    image=init_image,
+    strength=0.6,
+    num_inference_steps=9,
+    guidance_scale=0.0,
+    generator=torch.Generator("cuda").manual_seed(42),
+).images[0]
+image.save("zimage_img2img.png")
+```
+
 ## ZImagePipeline

 [[autodoc]] ZImagePipeline
   - all
-  - __call__
+  - __call__
+
+## ZImageImg2ImgPipeline
+
+[[autodoc]] ZImageImg2ImgPipeline
+  - all
+  - __call__
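
For reference, a minimal text-to-image sketch with `ZImagePipeline`, mirroring the Turbo settings of the image-to-image example added above; the prompt, seed, and exact step count are illustrative, not part of this commit:

```python
import torch
from diffusers import ZImagePipeline

# Text-to-image sketch using the same Turbo-style settings as the img2img
# example above (few steps, guidance_scale=0.0). Prompt and seed are illustrative.
pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    "A photorealistic portrait of a red panda wearing a scarf, soft window light",
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_t2i.png")
```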

docs/source/en/quantization/modelopt.md

Lines changed: 3 additions & 3 deletions

@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

 # NVIDIA ModelOpt

-[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
+[NVIDIA-ModelOpt](https://github.com/NVIDIA/Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

 Before you begin, make sure you have nvidia_modelopt installed.

@@ -57,7 +57,7 @@ image.save("output.png")
 >
 > The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
 >
-> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).
+> More details can be found [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples).

 ## NVIDIAModelOptConfig

@@ -86,7 +86,7 @@ The quantization methods supported are as follows:
 | **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now` |

-Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
+Refer to the [official modelopt documentation](https://nvidia.github.io/Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.

 ## Serializing and Deserializing quantized models
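
A minimal sketch of how the NVFP4 parameters from the table above might be combined with `NVIDIAModelOptConfig`; the model id, subfolder, block size value, and the exact keyword spelling are assumptions for illustration, not taken from this diff:

```python
import torch
from diffusers import FluxTransformer2DModel, NVIDIAModelOptConfig

# Sketch only: quant_type / channel_quantize / block_quantize follow the NVFP4
# row in the table above; the concrete values, model id, and subfolder are
# assumptions for illustration.
quant_config = NVIDIAModelOptConfig(
    quant_type="NVFP4",
    channel_quantize=-1,  # per the table, only -1 is supported for now
    block_quantize=16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```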

scripts/convert_hunyuan_video1_5_to_diffusers.py

Lines changed: 27 additions & 2 deletions

@@ -69,6 +69,11 @@
         "target_size": 960,
         "task_type": "i2v",
     },
+    "480p_i2v_step_distilled": {
+        "target_size": 640,
+        "task_type": "i2v",
+        "use_meanflow": True,
+    },
 }

 SCHEDULER_CONFIGS = {
@@ -93,6 +98,9 @@
     "720p_i2v_distilled": {
         "shift": 7.0,
     },
+    "480p_i2v_step_distilled": {
+        "shift": 7.0,
+    },
 }

 GUIDANCE_CONFIGS = {
@@ -117,6 +125,9 @@
     "720p_i2v_distilled": {
         "guidance_scale": 1.0,
     },
+    "480p_i2v_step_distilled": {
+        "guidance_scale": 1.0,
+    },
 }


@@ -126,7 +137,7 @@ def swap_scale_shift(weight):
     return new_weight


-def convert_hyvideo15_transformer_to_diffusers(original_state_dict):
+def convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=None):
     """
     Convert HunyuanVideo 1.5 original checkpoint to Diffusers format.
     """
@@ -142,6 +153,20 @@ def convert_hyvideo15_transformer_to_diffusers(original_state_dict):
     )
     converted_state_dict["time_embed.timestep_embedder.linear_2.bias"] = original_state_dict.pop("time_in.mlp.2.bias")

+    if config.use_meanflow:
+        converted_state_dict["time_embed.timestep_embedder_r.linear_1.weight"] = original_state_dict.pop(
+            "time_r_in.mlp.0.weight"
+        )
+        converted_state_dict["time_embed.timestep_embedder_r.linear_1.bias"] = original_state_dict.pop(
+            "time_r_in.mlp.0.bias"
+        )
+        converted_state_dict["time_embed.timestep_embedder_r.linear_2.weight"] = original_state_dict.pop(
+            "time_r_in.mlp.2.weight"
+        )
+        converted_state_dict["time_embed.timestep_embedder_r.linear_2.bias"] = original_state_dict.pop(
+            "time_r_in.mlp.2.bias"
+        )
+
     # 2. context_embedder.time_text_embed.timestep_embedder <- txt_in.t_embedder
     converted_state_dict["context_embedder.time_text_embed.timestep_embedder.linear_1.weight"] = (
         original_state_dict.pop("txt_in.t_embedder.mlp.0.weight")
@@ -627,7 +652,7 @@ def convert_transformer(args):
     config = TRANSFORMER_CONFIGS[args.transformer_type]
     with init_empty_weights():
         transformer = HunyuanVideo15Transformer3DModel(**config)
-    state_dict = convert_hyvideo15_transformer_to_diffusers(original_state_dict)
+    state_dict = convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=transformer.config)
     transformer.load_state_dict(state_dict, strict=True, assign=True)

     return transformer
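
For reference, the meanflow-specific key renames introduced above, collected into a single mapping (illustration only; the script itself performs these renames with the explicit `pop` calls shown in the diff):

```python
# Original checkpoint key -> diffusers key, applied only when
# config.use_meanflow is set (summary of the renames above).
MEANFLOW_KEY_MAP = {
    "time_r_in.mlp.0.weight": "time_embed.timestep_embedder_r.linear_1.weight",
    "time_r_in.mlp.0.bias": "time_embed.timestep_embedder_r.linear_1.bias",
    "time_r_in.mlp.2.weight": "time_embed.timestep_embedder_r.linear_2.weight",
    "time_r_in.mlp.2.bias": "time_embed.timestep_embedder_r.linear_2.bias",
}
```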

src/diffusers/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -664,6 +664,7 @@
             "WuerstchenCombinedPipeline",
             "WuerstchenDecoderPipeline",
             "WuerstchenPriorPipeline",
+            "ZImageImg2ImgPipeline",
             "ZImagePipeline",
         ]
     )
@@ -1364,6 +1365,7 @@
             WuerstchenCombinedPipeline,
             WuerstchenDecoderPipeline,
             WuerstchenPriorPipeline,
+            ZImageImg2ImgPipeline,
             ZImagePipeline,
         )

src/diffusers/models/transformers/transformer_hunyuan_video15.py

Lines changed: 18 additions & 3 deletions

@@ -184,19 +184,32 @@ class HunyuanVideo15TimeEmbedding(nn.Module):
             The dimension of the output embedding.
     """

-    def __init__(self, embedding_dim: int):
+    def __init__(self, embedding_dim: int, use_meanflow: bool = False):
         super().__init__()

         self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
         self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)

+        self.use_meanflow = use_meanflow
+        self.time_proj_r = None
+        self.timestep_embedder_r = None
+        if use_meanflow:
+            self.time_proj_r = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
+            self.timestep_embedder_r = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
+
     def forward(
         self,
         timestep: torch.Tensor,
+        timestep_r: Optional[torch.Tensor] = None,
     ) -> torch.Tensor:
         timesteps_proj = self.time_proj(timestep)
         timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=timestep.dtype))

+        if timestep_r is not None:
+            timesteps_proj_r = self.time_proj_r(timestep_r)
+            timesteps_emb_r = self.timestep_embedder_r(timesteps_proj_r.to(dtype=timestep.dtype))
+            timesteps_emb = timesteps_emb + timesteps_emb_r
+
         return timesteps_emb


@@ -567,6 +580,7 @@ def __init__(
         # YiYi Notes: config based on target_size_config https://github.com/yiyixuxu/hy15/blob/main/hyvideo/pipelines/hunyuan_video_pipeline.py#L205
         target_size: int = 640,  # did not name sample_size since it is in pixel spaces
         task_type: str = "i2v",
+        use_meanflow: bool = False,
     ) -> None:
         super().__init__()

@@ -582,7 +596,7 @@ def __init__(
         )
         self.context_embedder_2 = HunyuanVideo15ByT5TextProjection(text_embed_2_dim, 2048, inner_dim)

-        self.time_embed = HunyuanVideo15TimeEmbedding(inner_dim)
+        self.time_embed = HunyuanVideo15TimeEmbedding(inner_dim, use_meanflow=use_meanflow)

         self.cond_type_embed = nn.Embedding(3, inner_dim)

@@ -612,6 +626,7 @@ def forward(
         timestep: torch.LongTensor,
         encoder_hidden_states: torch.Tensor,
         encoder_attention_mask: torch.Tensor,
+        timestep_r: Optional[torch.LongTensor] = None,
         encoder_hidden_states_2: Optional[torch.Tensor] = None,
         encoder_attention_mask_2: Optional[torch.Tensor] = None,
         image_embeds: Optional[torch.Tensor] = None,
@@ -643,7 +658,7 @@ def forward(
         image_rotary_emb = self.rope(hidden_states)

         # 2. Conditional embeddings
-        temb = self.time_embed(timestep)
+        temb = self.time_embed(timestep, timestep_r=timestep_r)

         hidden_states = self.x_embedder(hidden_states)
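
A standalone sketch of the meanflow time embedding added above: two independent sinusoidal projections and MLPs whose outputs are summed. The embedding dimension here is illustrative, not read from the checkpoint:

```python
import torch
from diffusers.models.embeddings import TimestepEmbedding, Timesteps

embedding_dim = 3072  # illustrative; the real value comes from the transformer config

# Embedders for the current timestep t and the meanflow reference timestep r,
# mirroring the modules constructed in HunyuanVideo15TimeEmbedding above.
time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
time_proj_r = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
timestep_embedder_r = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)

t = torch.tensor([999.0])
t_r = torch.tensor([980.0])  # the next timestep in the schedule

# When timestep_r is provided, the two embeddings are simply added.
emb = timestep_embedder(time_proj(t)) + timestep_embedder_r(time_proj_r(t_r))
print(emb.shape)  # torch.Size([1, 3072])
```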

src/diffusers/models/transformers/transformer_prx.py

Lines changed: 30 additions & 5 deletions

@@ -16,7 +16,6 @@

 import torch
 from torch import nn
-from torch.nn.functional import fold, unfold

 from ...configuration_utils import ConfigMixin, register_to_config
 from ...utils import logging
@@ -532,7 +531,19 @@ def img2seq(img: torch.Tensor, patch_size: int) -> torch.Tensor:
         Flattened patch sequence of shape `(B, L, C * patch_size * patch_size)`, where `L = (H // patch_size) * (W
         // patch_size)` is the number of patches.
     """
-    return unfold(img, kernel_size=patch_size, stride=patch_size).transpose(1, 2)
+    b, c, h, w = img.shape
+    p = patch_size
+
+    # Reshape to (B, C, H//p, p, W//p, p) separating grid and patch dimensions
+    img = img.reshape(b, c, h // p, p, w // p, p)
+
+    # Permute to (B, H//p, W//p, C, p, p) using einsum
+    # n=batch, c=channels, h=grid_height, p=patch_height, w=grid_width, q=patch_width
+    img = torch.einsum("nchpwq->nhwcpq", img)
+
+    # Flatten to (B, L, C * p * p)
+    img = img.reshape(b, -1, c * p * p)
+    return img


 def seq2img(seq: torch.Tensor, patch_size: int, shape: torch.Tensor) -> torch.Tensor:
@@ -554,12 +565,26 @@ def seq2img(seq: torch.Tensor, patch_size: int, shape: torch.Tensor) -> torch.Tensor:
         Reconstructed image tensor of shape `(B, C, H, W)`.
     """
     if isinstance(shape, tuple):
-        shape = shape[-2:]
+        h, w = shape[-2:]
     elif isinstance(shape, torch.Tensor):
-        shape = (int(shape[0]), int(shape[1]))
+        h, w = (int(shape[0]), int(shape[1]))
     else:
         raise NotImplementedError(f"shape type {type(shape)} not supported")
-    return fold(seq.transpose(1, 2), shape, kernel_size=patch_size, stride=patch_size)
+
+    b, l, d = seq.shape
+    p = patch_size
+    c = d // (p * p)
+
+    # Reshape back to grid structure: (B, H//p, W//p, C, p, p)
+    seq = seq.reshape(b, h // p, w // p, c, p, p)
+
+    # Permute back to image layout: (B, C, H//p, p, W//p, p)
+    # n=batch, h=grid_height, w=grid_width, c=channels, p=patch_height, q=patch_width
+    seq = torch.einsum("nhwcpq->nchpwq", seq)

+    # Final reshape to (B, C, H, W)
+    seq = seq.reshape(b, c, h, w)
+    return seq


 class PRXTransformer2DModel(ModelMixin, ConfigMixin, AttentionMixin):
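
A quick equivalence check for the reshape/einsum rewrite above against the removed `unfold`/`fold` path; this assumes the module-level helpers `img2seq` and `seq2img` are importable from `transformer_prx` as shown in the diff:

```python
import torch
from torch.nn.functional import unfold

from diffusers.models.transformers.transformer_prx import img2seq, seq2img

img = torch.randn(2, 4, 64, 64)
p = 8

# New reshape/einsum path vs. the previous unfold-based patchification.
seq_new = img2seq(img, p)
seq_old = unfold(img, kernel_size=p, stride=p).transpose(1, 2)
assert torch.allclose(seq_new, seq_old)

# Round trip recovers the original image.
assert torch.allclose(seq2img(seq_new, p, img.shape), img)
```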

src/diffusers/pipelines/__init__.py

Lines changed: 2 additions & 2 deletions

@@ -404,7 +404,7 @@
         "Kandinsky5T2IPipeline",
         "Kandinsky5I2IPipeline",
     ]
-    _import_structure["z_image"] = ["ZImagePipeline"]
+    _import_structure["z_image"] = ["ZImageImg2ImgPipeline", "ZImagePipeline"]
     _import_structure["skyreels_v2"] = [
         "SkyReelsV2DiffusionForcingPipeline",
         "SkyReelsV2DiffusionForcingImageToVideoPipeline",
@@ -841,7 +841,7 @@
            WuerstchenDecoderPipeline,
            WuerstchenPriorPipeline,
        )
-        from .z_image import ZImagePipeline
+        from .z_image import ZImageImg2ImgPipeline, ZImagePipeline

    try:
        if not is_onnx_available():

src/diffusers/pipelines/auto_pipeline.py

Lines changed: 3 additions & 0 deletions

@@ -119,6 +119,7 @@
 )
 from .wan import WanImageToVideoPipeline, WanPipeline, WanVideoToVideoPipeline
 from .wuerstchen import WuerstchenCombinedPipeline, WuerstchenDecoderPipeline
+from .z_image import ZImageImg2ImgPipeline, ZImagePipeline


 AUTO_TEXT2IMAGE_PIPELINES_MAPPING = OrderedDict(
@@ -162,6 +163,7 @@
         ("cogview4-control", CogView4ControlPipeline),
         ("qwenimage", QwenImagePipeline),
         ("qwenimage-controlnet", QwenImageControlNetPipeline),
+        ("z-image", ZImagePipeline),
     ]
 )

@@ -189,6 +191,7 @@
         ("qwenimage", QwenImageImg2ImgPipeline),
         ("qwenimage-edit", QwenImageEditPipeline),
         ("qwenimage-edit-plus", QwenImageEditPlusPipeline),
+        ("z-image", ZImageImg2ImgPipeline),
     ]
 )
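
With the `z-image` entries registered above, the auto pipelines can resolve Z-Image checkpoints. A minimal sketch; the model id is taken from the Z-Image docs earlier in this commit, and the generation settings are illustrative:

```python
import torch
from diffusers import AutoPipelineForImage2Image, AutoPipelineForText2Image
from diffusers.utils import load_image

# Text-to-image resolves to ZImagePipeline via the new "z-image" mapping entry.
pipe_t2i = AutoPipelineForText2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")

# Image-to-image resolves to ZImageImg2ImgPipeline.
pipe_i2i = AutoPipelineForImage2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")

init_image = load_image(
    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
).resize((1024, 1024))

image = pipe_i2i(
    "A fantasy landscape with mountains and a river",
    image=init_image,
    strength=0.6,
    num_inference_steps=9,
    guidance_scale=0.0,
).images[0]
```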

src/diffusers/pipelines/hunyuan_video1_5/pipeline_hunyuan_video1_5_image2video.py

Lines changed: 10 additions & 0 deletions

@@ -852,6 +852,15 @@ def __call__(
                 # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
                 timestep = t.expand(latent_model_input.shape[0]).to(latent_model_input.dtype)

+                if self.transformer.config.use_meanflow:
+                    if i == len(timesteps) - 1:
+                        timestep_r = torch.tensor([0.0], device=device)
+                    else:
+                        timestep_r = timesteps[i + 1]
+                    timestep_r = timestep_r.expand(latents.shape[0]).to(latents.dtype)
+                else:
+                    timestep_r = None
+
                 # Step 1: Collect model inputs needed for the guidance method
                 # conditional inputs should always be first element in the tuple
                 guider_inputs = {
@@ -893,6 +902,7 @@ def __call__(
                     hidden_states=latent_model_input,
                     image_embeds=image_embeds,
                     timestep=timestep,
+                    timestep_r=timestep_r,
                     attention_kwargs=self.attention_kwargs,
                     return_dict=False,
                     **cond_kwargs,
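
A toy illustration of the timestep pairing added above: with meanflow enabled, each denoising step is paired with the next timestep in the schedule, and the final step is paired with 0. The schedule values here are made up for illustration:

```python
import torch

# Toy schedule; real values come from the pipeline's scheduler.
timesteps = torch.tensor([1000.0, 750.0, 500.0, 250.0])

for i, t in enumerate(timesteps):
    if i == len(timesteps) - 1:
        timestep_r = torch.tensor([0.0])
    else:
        timestep_r = timesteps[i + 1]
    print(int(t.item()), int(timestep_r.item()))
# 1000 750
# 750 500
# 500 250
# 250 0
```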

0 commit comments
