
Uiuc vlm pr compressed fixed #511

Open

immuntasir wants to merge 35 commits into google:main from PLAN-Lab:uiuc-vlm-pr-compressed-fixed

Conversation

@immuntasir
Contributor

Resolves #510

This PR introduces multimodal support to Tunix’s Gemma3 model and adds a new vision-language DPO demonstration notebook (vl_dpo_demo_gemma3.ipynb), extending the framework to handle image-text reasoning and multimodal alignment. Key changes include:

  • Multimodal Gemma3 Model:
    1. Added SigLIP vision encoder and MultimodalProjector to Gemma3, enabling joint image–text forward passes.
    2. Updated ModelConfig and parameter mapping to support multimodal checkpoints (multimodal=True).
  • DPO Example Notebook: Introduced examples/vl_dpo_demo_gemma3.ipynb, demonstrating multimodal DPO fine-tuning with image-conditioned prompts using multimodal Gemma3.
  • Gemma3 Tokenizer Adapter: Extended TokenizerAdapter to support Hugging Face Processor objects.
  • VLM Sampler and Utils:
    1. Added VLMSampler for multimodal generation (tunix/generate/vlm_sampler.py)
    2. Added preprocess_image() utility in tunix/generate/utils.py for SigLIP/CLIP-style normalization.
  • DPO Pipeline Updates: Modified tunix/sft/dpo/dpo_trainer.py to handle multimodal data ({"text", "image"} prompts) and propagate image tensors through training (see the sketch after this list).
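As a concrete reference, here is a hypothetical sketch of a single training example in the multimodal prompt format described above. Only the {"text", "image"} prompt structure comes from this PR; every other field name and the image shape are illustrative assumptions, not the actual schema.

import numpy as np

# Hypothetical multimodal DPO example. The {"text", "image"} prompt layout is
# what the updated dpo_trainer.py is described as handling; the remaining keys
# and the image shape are illustrative only.
example = {
    "prompts": {
        "text": "Describe what is happening in the image.",
        # Preprocessed pixel tensor, e.g. the output of preprocess_image().
        "image": np.zeros((896, 896, 3), dtype=np.float32),
    },
    "chosen": "A dog is catching a frisbee in a park.",
    "rejected": "The image shows an empty room.",
}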

@Tianjiao-Yu led this effort and @jxiong21029 contributed to the Gemma3 integration. Please also tag @Tianjiao-Yu with any questions, comments, or feedback.

Colab Notebook
vl_dpo_demo_gemma3.ipynb

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed the Contribution Guidelines.

@google-cla

google-cla bot commented Oct 6, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@jxiong21029 jxiong21029 force-pushed the uiuc-vlm-pr-compressed-fixed branch from 052a484 to 03a1804 Compare October 9, 2025 15:37
@Tianjiao-Yu Tianjiao-Yu force-pushed the uiuc-vlm-pr-compressed-fixed branch from 03a1804 to 0b1d000 Compare October 10, 2025 02:48
@abheesht17
Collaborator

@immuntasir - let me know when this is ready for review!

@abheesht17 abheesht17 self-requested a review October 10, 2025 04:05
@immuntasir
Contributor Author

> @immuntasir - let me know when this is ready for review!

@Tianjiao-Yu confirmed that this is ready for review.

Contributor Author


@Tianjiao-Yu I think this should be removed from this PR.

Collaborator


+1

Collaborator

@abheesht17 abheesht17 left a comment


Quick review, I'll do another pass tomorrow

"source": [
"# Fine-tuning a Visual Language Model (VLM) using DPO\n",
"\n",
"This notebook demonstrates how to fine-tune a Visual Language Model (VLM), specifically the Gemma 3-1B-it model, using the Direct Preference Optimization (DPO) algorithm.\n",
Collaborator


Gemma 3-1B-it model

This is a text-only model though. 4B onwards are VLMs


This can be used when inputs are raw strings. Tokenization, padding and
preprocessing is taken care of by `DPOTrainer`.
preprocessing is taken care of by `DpoTrainer`.
Collaborator


Revert?

Collaborator


+1

elif self._tokenizer_type == TokenizerType.HFP:
  inputs = self._tokenizer(text=text, **kwargs)
  if 'images' in kwargs:
    return inputs['input_ids'], inputs['pixel_values']
Collaborator


Better to return a dictionary here rather than a tuple (in case we add more modalities later)?
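A minimal sketch of that suggestion, assuming the same processor call as in the snippet above; the helper name is hypothetical and the actual keys may differ:

def _encode_multimodal(processor, text, **kwargs):
  # Hypothetical helper sketching the dict-return idea: additional modalities
  # become extra keys, so the return type never changes.
  inputs = processor(text=text, **kwargs)
  outputs = {'input_ids': inputs['input_ids']}
  if 'images' in kwargs:
    outputs['pixel_values'] = inputs['pixel_values']
  return outputs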

Comment on lines 29 to +30
HF: str = 'hf' # huggingface tokenizer
HFP: str = 'hfp' # huggingface processor
Collaborator


Is the only difference between these two that the processor can take images, and other modalities too? If yes, do you think we should just use HF processor everywhere (and remove HF tokeniser)?

Because if processor(text) works, we can just use processor everywhere


I don't think every tokenizer has an associated processor definition, so it probably makes sense to have both.

Comment on lines 30 to 32
# Defaults compatible with CLIP / many SigLIP configs; override if needed.
_CLIP_MEAN = jnp.array([0.48145466, 0.4578275, 0.40821073], dtype=jnp.float32)
_CLIP_STD = jnp.array([0.26862954, 0.26130258, 0.27577711], dtype=jnp.float32)
Collaborator


Do you think we can move it to models/siglip?

    mean: Iterable[float] = _CLIP_MEAN,
    std: Iterable[float] = _CLIP_STD,
) -> jnp.ndarray:
  """Resize + normalize images for SigLIP.
Collaborator


Just SigLIP? Does it not work for other vision models? In generate/utils.py, we should have generic functions (as much as possible)
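For illustration, a generic resize-and-normalize sketch of what such a utility could look like, assuming inputs scaled to [0, 1]; this is not the PR's actual preprocess_image implementation, and the default image size is illustrative:

import jax
import jax.numpy as jnp

_CLIP_MEAN = jnp.array([0.48145466, 0.4578275, 0.40821073], dtype=jnp.float32)
_CLIP_STD = jnp.array([0.26862954, 0.26130258, 0.27577711], dtype=jnp.float32)

def preprocess_image(images, image_size=224, mean=_CLIP_MEAN, std=_CLIP_STD):
  # images: [..., H, W, 3] floats in [0, 1]. Resize to a square resolution and
  # apply per-channel normalization. The CLIP-style defaults can be overridden
  # for other vision encoders, which is what keeps the utility generic.
  target_shape = (*images.shape[:-3], image_size, image_size, images.shape[-1])
  images = jax.image.resize(images, target_shape, method="bilinear")
  return (images - mean) / std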


if self.config.multimodal:
  assert pixel_values is not None
  image_mask = last_tokens == 262144  # 262144: <image_soft_token>
Collaborator


Better to define this somewhere instead of hardcoding
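A minimal sketch of that suggestion; only the 262144 value comes from the diff above, while the constant name and helper are hypothetical:

import jax.numpy as jnp

# Name the special id once instead of hardcoding it at the call site.
IMAGE_SOFT_TOKEN_ID = 262144  # Gemma3 <image_soft_token>

def make_image_mask(last_tokens: jnp.ndarray) -> jnp.ndarray:
  # Boolean mask marking positions that hold image soft tokens.
  return last_tokens == IMAGE_SOFT_TOKEN_ID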

@jxiong21029

jxiong21029 commented Nov 1, 2025

Oh, I didn't mean to request so many reviews. Not sure how that happened. Maybe from the CLA failing?

@jxiong21029 jxiong21029 force-pushed the uiuc-vlm-pr-compressed-fixed branch 7 times, most recently from 5afdf7b to bed8fa7 Compare November 1, 2025 22:53
@jxiong21029

Rebased to fix emails for CLA.

Collaborator

@abheesht17 abheesht17 left a comment


Sorry, getting back to this. This LGTM! Could you please resolve the merge conflicts?

@Tianjiao-Yu

Hi, Abheesht, I made a small update to switch the demo to use Gemma-3-4B-IT, which is a true VLM, instead of Gemma-3-1B-IT. Could you please take a quick look when you have a moment? After that, I’ll proceed with resolving the remaining merge conflicts. Thanks!

@ridcl

ridcl commented Jan 4, 2026

Hi guys! First of all, thanks for adding multimodal support - it works great in my use case, and I can't wait to see it merged!

I noticed that the number of images is currently restricted to one per input, i.e. pixel_values is assumed to be of size [B, H, W, C], while the original Gemma implementation actually allows inputs of size [B, N, H, W, C]. More precisely, in the original Gemma implementation there are two versions of patchify_images:

  1. gemma.multimodal.image.patchify_images with signature (legacy?):
def patchify_images(
    images: typing.Float["B H W C"],
    patch_size: int = _DEFAULT_PATCH_SIZE,
    padding: str = "VALID",
) -> typing.Float["P D"]: ...
  2. gemma.multimodal.vision.patchify_images with signature (actually used in the transformer):
def patchify_images(self, images: Float["*B H W C"]) -> Float["*B P D"]: ...

From a quick test, I think that to add support for multiple images here it should be enough to add a batch dimension in a few places, like PatchEmbed.__call__():

  @jax.named_scope("patch_embed")
  def __call__(self, x: jaxtyping.Array) -> jaxtyping.Array:
    # x: [*B,H,W,3] -> conv -> [*B,H/P,W/P,D] -> [*B,N,D]
    x = self.proj(x)
    *b, h, w, d = x.shape            # was: b, h, w, d = x.shape
    x = x.reshape(*b, h * w, d)      # was: x = x.reshape(b, h * w, d)
    x = shard(x, self.cfg.shd_config.act_bnd)
    return x

But my local branch diverged significantly from this PR, so I can't really test the full fix just yet.

@abheesht17
Collaborator

> Hi, Abheesht, I made a small update to switch the demo to use Gemma-3-4B-IT, which is a true VLM, instead of Gemma-3-1B-IT. Could you please take a quick look when you have a moment? After that, I’ll proceed with resolving the remaining merge conflicts. Thanks!

Looks good. Could you please give me edit access to this branch? I'll resolve merge conflicts and make a few changes (especially regarding the multiple images point). Thanks!

@ridcl

ridcl commented Jan 23, 2026

@abheesht17 For the multiple image support, you may want to look at this commit as a reference point (mostly files tunix/models/gemma3/model.py, tunix/models/siglip/model.py and tunix/models/siglip/preprocess.py, the rest should be formatting).

Another change that you might be interested in is saving LoRA params for multimodal Gemma. Alternatively, I can create a separate pull request for it after the current PR is merged.

Collaborator

@abheesht17 abheesht17 left a comment


I was going through this again, and found a few issues:

  • Image tokens should have bidirectional attention, but I don't see that in the code.
  • We should support multiple images.
  • We have a Hugging Face preprocessor ("hfp"), but we don't seem to be using it. Also, the special tokens in HF preprocessor/tokeniser are different from the upstream GDM implementation.
  • Gemma 3 uses special start of image tokens, end of image tokens, etc., which are not there in the code.

I have a WIP PR for resolving some of these issues. Give me some time.

_CLIP_STD = jnp.array([0.26862954, 0.26130258, 0.27577711], dtype=jnp.float32)


def preprocess(
Collaborator


I don't see this function being used anywhere

},
"outputs": [],
"source": [
"gemma_tokenizer = tokenizer_lib.Tokenizer(tokenizer_path=GEMMA_TOKENIZER_PATH)"
Collaborator


Why can we not use the HF processor directly?

Comment on lines +182 to +183
"model_config = dataclasses.replace(\n",
" model_config, multimodal=True, num_embed=262208\n",
Collaborator


Why don't we just expose multimodal as an arg in gemma3_model_lib.ModelConfig.gemma3_4b(multimodal=True)?
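A sketch of that suggestion, using a heavily simplified stand-in for the real config class; only the 262208 value and the multimodal flag come from the notebook cell above, the text-only number is illustrative:

import dataclasses

@dataclasses.dataclass(frozen=True)
class ModelConfig:  # simplified stand-in for gemma3_model_lib.ModelConfig
  num_embed: int
  multimodal: bool = False

  @classmethod
  def gemma3_4b(cls, multimodal: bool = False) -> "ModelConfig":
    # 262208 is the value the notebook passes with multimodal=True; the
    # text-only number below is illustrative.
    return cls(num_embed=262208 if multimodal else 262144,
               multimodal=multimodal)

Callers could then write ModelConfig.gemma3_4b(multimodal=True) directly instead of wrapping the config in dataclasses.replace.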

@@ -927,18 +1001,26 @@ def __call__(
positions: jaxtyping.Array, # [B, L]
cache: Cache | None, # (sequence length L')
attention_mask: jaxtyping.Array, # [B, L, L']
Collaborator

@abheesht17 abheesht17 Jan 26, 2026


Gemma 3 is supposed to have bidirectional attention for image tokens, but I don't see that here, or in the VLM DPO notebook.
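To make the point concrete, here is an illustrative sketch (not the PR's code, and not necessarily how Gemma 3 builds its masks) that combines a causal mask with bidirectional attention among image tokens. It treats all image positions as a single block, so with multiple images per prompt you would further restrict pairs to tokens of the same image:

import jax.numpy as jnp

def causal_with_bidirectional_images(image_mask: jnp.ndarray) -> jnp.ndarray:
  # image_mask: [B, L] bool, True at image-token positions.
  # Returns a [B, L, L] bool mask: text tokens attend causally, while image
  # tokens additionally attend to one another bidirectionally.
  seq_len = image_mask.shape[-1]
  causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))[None, :, :]
  image_pairs = image_mask[:, :, None] & image_mask[:, None, :]
  return causal | image_pairs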

@ridcl ridcl mentioned this pull request Feb 7, 2026

Development

Successfully merging this pull request may close these issues.

VLM support for DPO
