Add MoMa (Mixture of Modality-Aware Experts) VLM architecture #107

Open: amazloumi wants to merge 4 commits into main from moma-arch

amazloumi (Member) commented May 18, 2026

Summary

  • Adds paper-faithful MoMa as the 4th VLM architecture (arch = "moma"), following Lin et al. 2024 (arXiv:2407.21770). Single shared Q/K/V/O attention + per-modality MoE FFN groups with expert-choice + Sigmoid routing.
  • Token routing is two-stage: deterministic by modality_ids (level 1, reusing MoT's existing mechanism), then learned expert-choice + Sigmoid within each modality group (level 2, with Gumbel-Sigmoid noise per paper Eq. 5).
  • v1 supports training only — expert-choice routing is non-causal; auxiliary routers for inference (paper §2.4) are deferred to v2.
  • Uses pre-norm blocks rather than the paper's Swin-style post-norm.
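
The two-stage routing described above can be sketched in plain Python. This is a minimal illustration, not the code in kempnerforge/model/moma.py: the function name `route`, the argument layout, and the fixed per-expert `capacity` are assumptions made for the example, and the noise term uses the common Gumbel-Sigmoid form (logit plus the difference of two Gumbel samples), which may differ in detail from the paper's Eq. 5.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gumbel(rng):
    # Sample from Gumbel(0, 1); clamp u away from 0 and 1 to keep the logs finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def route(router_logits, modality_ids, experts_by_modality, capacity, rng=None):
    """Hypothetical two-stage router.

    router_logits[t][e] is the logit of token t for expert e.
    Returns {expert: [(token, gate), ...]} with at most `capacity` tokens each.
    """
    assignments = {}
    for modality, experts in experts_by_modality.items():
        # Level 1: deterministic split by modality id.
        tokens = [t for t, m in enumerate(modality_ids) if m == modality]
        for e in experts:
            scored = []
            for t in tokens:
                logit = router_logits[t][e]
                if rng is not None:
                    # Training-time noise (assumed Gumbel-Sigmoid form).
                    logit += gumbel(rng) - gumbel(rng)
                scored.append((sigmoid(logit), t))
            # Level 2: expert choice, each expert keeps its top-`capacity` tokens.
            scored.sort(key=lambda s: (-s[0], s[1]))
            assignments[e] = [(t, gate) for gate, t in scored[:capacity]]
    return assignments


# Five tokens: 0 = text, 1 = image; experts 0-1 serve text, 2-3 serve image.
modality_ids = [0, 0, 1, 1, 0]
experts_by_modality = {0: [0, 1], 1: [2, 3]}
router_logits = [
    [2.0, -1.0, 0.0, 0.0],
    [0.5, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -2.0],
    [0.0, 0.0, -1.0, 0.5],
    [-0.5, 1.0, 0.0, 0.0],
]
picked = route(router_logits, modality_ids, experts_by_modality, capacity=2)
```

Because each expert selects its own tokens, a token may be processed by several experts in its modality group (token 1 above) or be ranked last by all of them; the sigmoid gate scores each token-expert pair independently instead of normalizing across experts.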

Deferred to v2

  • Auxiliary routers for inference causality (paper §2.4) — required for autoregressive generation under EC routing.
  • Upcycling helper (1t1i seed → multi-expert MoMa, paper §2.5) — paper reports a 1.16–1.2× speedup from this staged-training trick.
  • torch.compile support — modality_ids scatter/gather + EC top-k currently cause graph breaks; JobConfig.validate emits a warning.
  • Expert Parallelism (rejected in v1 — per-modality expert groups need EP-aware dispatch that isn't wired yet).
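
The first deferred item follows from how expert-choice selection works: whether an expert keeps a token depends on the scores of every token in the sequence, including later ones. A toy illustration, assuming a single expert with a fixed capacity of 2 (the helper name `expert_choice` and the scores are made up for the example):

```python
def expert_choice(scores, capacity):
    """Return the token positions one expert keeps: its top-`capacity` scores."""
    ranked = sorted(range(len(scores)), key=lambda t: (-scores[t], t))
    return set(ranked[:capacity])


prefix = [0.9, 0.4]      # router scores for tokens 0 and 1
full = prefix + [0.7]    # the same prefix plus one future token

kept_prefix = expert_choice(prefix, capacity=2)  # token 1 is kept
kept_full = expert_choice(full, capacity=2)      # token 1 is displaced by token 2
```

Token 1 is routed to the expert when only the prefix is visible but dropped once token 2 arrives, so its output would change retroactively during step-by-step decoding. That is the causality problem the paper's auxiliary routers are meant to address.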

Testing

  • ruff format: clean (148 files unchanged)
  • ruff check: All checks passed
  • pytest tests/unit/: 1342 passed, 2 skipped (~180s)
  • pytest tests/integration/: 77 passed on CUDA (~44s)
  • torchrun --nproc_per_node=2 -m pytest tests/distributed/test_vlm_*_fsdp.py: 30 passed (~38s)
  • torchrun --nproc_per_node=2 -m pytest tests/distributed/test_fsdp.py test_checkpoint.py: 17 passed, 1 skipped (regression sanity on non-VLM paths)
  • End-to-end smoke on 4× H200 (uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/vlm_7b_moma.toml): 46.87 B params total, loss volatile through 5-step warmup as expected, ~9.8k tok/s steady-state, ~105 GB / 140 GB per GPU
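
As a rough sanity check on the parameter count, the expert FFN weights can be estimated from the shapes quoted in the header note of configs/train/vlm_7b_moma.toml (dim=4096, n_layers=32, ffn ~14336, 8 SwiGLU experts per layer). The sketch assumes each SwiGLU expert holds three bias-free dim × ffn matrices (gate, up, down) and treats the approximate FFN width as exact; it ignores attention, embeddings, routers, and the vision side.

```python
dim = 4096
ffn_hidden = 14336       # the config header says "~14336"; treated as exact here
n_layers = 32
experts_per_layer = 8    # 4 image + 4 text

# Assumed SwiGLU layout: gate, up, and down projections, no biases.
params_per_expert = 3 * dim * ffn_hidden
expert_params_total = params_per_expert * experts_per_layer * n_layers

print(f"{expert_params_total / 1e9:.2f} B")  # prints "45.10 B"
```

Under these assumptions the expert FFNs alone account for about 45.1 B parameters, which is in the same range as the 46.87 B total reported in the smoke run.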

Closes #

@amazloumi amazloumi requested review from Naeemkh and mmshad May 18, 2026 21:01
codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 94.35028% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kempnerforge/config/job.py | 11.11% | 7 Missing and 1 partial ⚠️ |
| kempnerforge/model/moma.py | 97.97% | 1 Missing and 1 partial ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| kempnerforge/config/vlm.py | 100.00% <100.00%> (ø) |
| kempnerforge/model/transformer.py | 94.41% <100.00%> (+0.74%) ⬆️ |
| kempnerforge/model/vlm.py | 98.94% <100.00%> (+0.13%) ⬆️ |
| kempnerforge/model/moma.py | 97.97% <97.97%> (ø) |
| kempnerforge/config/job.py | 85.41% <11.11%> (-7.77%) ⬇️ |

# Parameter / memory note: with the default 7B-dense-shaped backbone
# (dim=4096, n_layers=32, ffn ~14336) and 8 SwiGLU experts per layer
# (4 image + 4 text), total params is much larger than dense 7B. Use
# FSDP=4 + activation_checkpointing="full" to fit on 4x H200. For a
Collaborator

configs/train/vlm_7b_moma.toml gives three conflicting messages on activation_checkpointing:

  • L22 (header note): Use FSDP=4 + activation_checkpointing="full" to fit on 4x H200 - recommends it.
  • L84 (just above the field): AC=full is currently a no-op for MoMa - says it does nothing.
  • L89 (just above the field): Required given the per-layer expert duplication on 4x H200 - says it's mandatory.

A user reading top-down is told to enable AC, then that it's required, then that it's inert.

Suggest:

  1. Reword L22 to flag the no-op and recommend reducing moma_experts_per_modality on 4x H200 instead.
  2. Reword L89's "Required" to something like "Intended once apply_ac is refactored; currently inert".

"MoMa + Expert Parallelism is not supported in v1. Per-modality "
"expert groups need EP-aware dispatch that is not yet wired."
)
if self.train.compile_model:
Collaborator

The MoMa branch of JobConfig.validate (job.py:225-247) already warns on compile_model (lines 241-247). The AC=full no-op fits the same precedent but is silent: a user authoring a fresh MoMa config without reading vlm_7b_moma.toml's NOTE block gets no warning that activation checkpointing has no effect.

Add a sibling warning right after the compile block (import logging is already in scope there):

  if self.train.activation_checkpointing == "full":
      logging.getLogger(__name__).warning(
          "AC=full is currently a no-op for MoMa (apply_ac matches "
          "TransformerBlock only; MoMaBlock is a sibling nn.Module). "
          "Use ac='selective' (wraps the Attention submodule, still works) "
          "or reduce moma_experts_per_modality until the apply_ac refactor lands."
      )

Trigger only on "full", not != "none": apply_ac's selective branch (parallel.py:120) uses isinstance(m, Attention), and MoMaBlock.attention is a standard Attention (moma.py:401), so selective mode does wrap MoMa correctly. Only full is broken.

Could ride on the upcoming apply_ac refactor PR, or land as a small patch here.
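
The isinstance behavior described above can be reproduced with stand-in classes. This is only a sketch: `Module`, `build`, and `count_wrapped` are hypothetical helpers written for the illustration (no PyTorch involved), and the matching logic is a simplification of what the comment says apply_ac does, not a copy of parallel.py.

```python
class Module:
    """Tiny stand-in for nn.Module with recursive traversal."""

    def __init__(self):
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def modules(self):
        yield self
        for child in self.children:
            yield from child.modules()


class Attention(Module):
    pass


class TransformerBlock(Module):
    def __init__(self):
        super().__init__()
        self.attention = self.add(Attention())


class MoMaBlock(Module):
    """A sibling of TransformerBlock, not a subclass, that reuses Attention."""

    def __init__(self):
        super().__init__()
        self.attention = self.add(Attention())


def build(block_cls, n_layers=2):
    model = Module()
    for _ in range(n_layers):
        model.add(block_cls())
    return model


def count_wrapped(model, mode):
    # "full" matches whole blocks by type; "selective" matches Attention submodules.
    target = TransformerBlock if mode == "full" else Attention
    return sum(isinstance(m, target) for m in model.modules())


dense = build(TransformerBlock)
moma = build(MoMaBlock)
```

With two layers each, the dense model matches 2 modules in both modes, while the MoMa model matches 0 in "full" and 2 in "selective". That silent zero is the case the proposed warning would surface.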

mmshad (Collaborator) commented May 21, 2026

I tested the MoMa architecture on two H100 GPUs and it works fine. I added a few minor comments on the PR.
