Conversation
Pull request overview
This PR adds initial support for LiquidAI’s LFM2 model family in the Megatron backend by introducing new model configs, a Primus implementation of the LFM2 short-convolution “attention” layer, and a patch to swap Megatron’s decoder layer spec builder when LFM-specific layer types are configured.
Changes:
- Add LFM/LFM2 Megatron model configuration YAMLs (base + LFM2-8B-A1B).
- Introduce `Lfm2ShortConv` (a Primus implementation) and an LFM-aware GPT layer spec builder.
- Add a Megatron patch to route `get_gpt_decoder_layer_specs` to the Primus implementation, and provide an example pretrain config.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `primus/configs/models/megatron/primus_megatron_model.yaml` | Adds LFM2-related config knobs (`lfm_layer_types`, `conv_L_cache`, `conv_bias`). |
| `primus/configs/models/megatron/lfm2_8B_A1B.yaml` | New LFM2-8B-A1B model config (tokenizer + model/MoE/GQA settings). |
| `primus/configs/models/megatron/lfm_base.yaml` | New LFM base config defaults (RoPE, RMSNorm, long context, MoE settings). |
| `primus/backends/megatron/training/tokenizer/tokenizer.py` | Adds `Lfm2Tokenizer` to the allowed tokenizer type set. |
| `primus/backends/megatron/patches/gpt_decoder_layer_specs_patches.py` | Registers a patch to override Megatron's decoder layer spec builder when LFM is configured. |
| `primus/backends/megatron/core/transformer/attention.py` | Adds `Lfm2ShortConv` module (depthwise conv + projections) with fast/slow paths. |
| `primus/backends/megatron/core/models/gpt/gpt_layer_specs.py` | Adds Primus GPT decoder layer spec selection supporting `lfm_layer_types` ("conv" vs "full_attention"). |
| `examples/megatron/configs/MI355X/lfm2_8B_A1B-BF16-pretrain.yaml` | Example training config wiring in `lfm_layer_types` and training overrides. |
```python
if is_causal_conv1d_available():
    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
else:
    print(f"Unable to import causal_conv1d!!!")
    causal_conv1d_fn, causal_conv1d_update = None, None
```
This module prints to stdout at import time when causal_conv1d isn't available. Import-time prints are noisy in distributed training and can break log formatting; prefer using the project's rank-aware logging (e.g., log_rank_0/warning_rank_0) or warnings.warn, and avoid side effects during import.
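A minimal sketch of the logging-based alternative, assuming the module keeps the availability check at import time (`warnings.warn` is shown here; the project's `log_rank_0` helper would slot in the same way):

```python
import warnings

if is_causal_conv1d_available():
    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
else:
    # warn once per location instead of printing from every rank at import time
    warnings.warn("causal_conv1d is not installed; Lfm2ShortConv will fall back to the slow path.")
    causal_conv1d_fn, causal_conv1d_update = None, None
```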
```python
from megatron.core.inference.contexts import BaseInferenceContext
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.process_groups_config import ProcessGroupCollection
from megatron.core.transformer.attention import AttnMaskType
```
AttnMaskType is imported from megatron.core.transformer.attention, while the rest of the codebase consistently imports it from megatron.core.transformer.enums (e.g., primus/backends/megatron/core/transformer/multi_latent_attention.py:12). If megatron.core.transformer.attention doesn’t export AttnMaskType in some Megatron versions, this will raise ImportError at import time; consider switching to the enums import for consistency and compatibility.
Suggested change:
```diff
-from megatron.core.transformer.attention import AttnMaskType
+from megatron.core.transformer.enums import AttnMaskType
```
```python
self.L_cache = args.conv_L_cache
self.bias = args.conv_bias
```
get_args() is expected to always have conv_L_cache and conv_bias, but Primus's MegatronArgBuilder drops config keys that aren't present in Megatron's argparse defaults (see primus/backends/megatron/argument_builder.py:86-110). If the underlying Megatron version doesn’t define these args, this will raise AttributeError when constructing the layer. Consider using getattr(args, ..., default) with sensible defaults and/or adding an args patch to register these flags with Megatron’s parser.
Suggested change:
```diff
-self.L_cache = args.conv_L_cache
-self.bias = args.conv_bias
+self.L_cache = getattr(args, "conv_L_cache", 1)
+self.bias = getattr(args, "conv_bias", False)
```
```python
# patch its local binding.
try:
    import gpt_builders as gpt_builders_module  # pyright: ignore[reportMissingImports]
except ImportError as exc:
    log_rank_0(
        "[Patch:megatron.gpt.decoder_layer_specs] Failed to import gpt_builders; "
        "cannot patch local get_gpt_decoder_layer_specs binding."
    )
    raise RuntimeError("Failed to import required module gpt_builders") from exc
```
If gpt_builders can’t be imported, this patch raises and aborts training. Given primus/core/utils/import_utils.py explicitly treats gpt_builders as version-dependent, it would be safer to log and continue after patching megatron.core.models.gpt.gpt_layer_specs (older Megatron versions won’t need the local-binding patch anyway).
Suggested change:
```diff
-# patch its local binding.
+# patch its local binding when that version-dependent module is available.
 try:
     import gpt_builders as gpt_builders_module  # pyright: ignore[reportMissingImports]
-except ImportError as exc:
+except ImportError:
     log_rank_0(
-        "[Patch:megatron.gpt.decoder_layer_specs] Failed to import gpt_builders; "
-        "cannot patch local get_gpt_decoder_layer_specs binding."
+        "[Patch:megatron.gpt.decoder_layer_specs] gpt_builders is unavailable; "
+        "skipping local get_gpt_decoder_layer_specs binding patch."
     )
-    raise RuntimeError("Failed to import required module gpt_builders") from exc
+    return
```
```python
    HAVE_TE = True
except ImportError:
    HAVE_TE = False
```
TESpecProvider is only imported inside the try block, but get_lfm_layer_with_transformer_engine_spec() unconditionally instantiates it. If transformer_engine isn't installed, TESpecProvider will be undefined and this will raise NameError. Consider guarding with if not HAVE_TE: raise RuntimeError(...) (or import TESpecProvider inside the function and raise a clear error) so failures are deterministic and actionable.
Suggested change:
```python
class TESpecProvider:  # type: ignore[no-redef]
    """Fallback provider that raises a clear error when TE is unavailable."""

    def __init__(self, *args, **kwargs):
        raise RuntimeError(
            "Transformer Engine support is unavailable because the "
            "`transformer_engine` package is not installed."
        )
```
```python
def get_gpt_decoder_layer_specs(
    config: TransformerConfig,
    use_transformer_engine: bool,
    normalization: Optional[str] = None,
    qk_l2_norm: Optional[bool] = False,
    vp_stage: Optional[int] = None,
    pp_rank: Optional[int] = None,
) -> TransformerBlockSubmodules:
    """GPT block spec."""
    assert use_transformer_engine, "use_transformer_engine must be True"
```
The function is annotated to return TransformerBlockSubmodules, but it actually returns a Python list of per-layer specs (layer_specs). If callers expect the annotated type (or upstream Megatron’s signature differs), this can cause runtime breakage and makes the code harder to reason about. Align the return type annotation (and remove the unused TransformerBlockSubmodules import if appropriate) with what’s actually returned/expected by Megatron.
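A minimal sketch of the alignment this asks for, assuming the body really does return a plain list of per-layer specs (only the signature is shown; the ellipsis stands for the unchanged body):

```python
from typing import Optional

from megatron.core.transformer.spec_utils import ModuleSpec


def get_gpt_decoder_layer_specs(
    config: TransformerConfig,
    use_transformer_engine: bool,
    normalization: Optional[str] = None,
    qk_l2_norm: Optional[bool] = False,
    vp_stage: Optional[int] = None,
    pp_rank: Optional[int] = None,
) -> list[ModuleSpec]:  # annotate what the function actually builds and returns
    """GPT block spec."""
    ...
```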
```python
args = get_args()

lfm_layer_types = args.lfm_layer_types

if lfm_layer_types is None:
    lfm_layer_types = ["full_attention"] * args.num_layers
else:
    assert isinstance(
        lfm_layer_types, list
    ), f"lfm_layer_types must be list[str] or None, got {type(lfm_layer_types)}"
    invalid_type_values = [
        layer_type for layer_type in lfm_layer_types if not isinstance(layer_type, str)
    ]
    assert (
        not invalid_type_values
    ), f"lfm_layer_types must contain only str values, got invalid values: {invalid_type_values}"
    valid_layer_types = {"conv", "full_attention"}
    invalid_layer_types = sorted(
        {layer_type for layer_type in lfm_layer_types if layer_type not in valid_layer_types}
    )
    assert not invalid_layer_types, (
        f"Invalid lfm_layer_types values: {invalid_layer_types}. "
        f"Only {sorted(valid_layer_types)} are allowed."
    )

assert len(lfm_layer_types) == args.num_layers, (
    f"Invalid lfm_layer_types length: {len(lfm_layer_types)}, "
    f"expected {args.num_layers} (args.num_layers)."
)
```
lfm_layer_types defaults and length validation are based on args.num_layers, but the layer construction loop uses config.num_layers. If these ever differ, this will either fail the length check or index out of range later. Use config.num_layers consistently for both defaulting and validation (and only fall back to args when necessary).
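A minimal sketch of the single-source-of-truth fix, assuming `config.num_layers` is authoritative for the block being built:

```python
# derive the layer count once and use it for both defaulting and validation
num_layers = config.num_layers

if lfm_layer_types is None:
    lfm_layer_types = ["full_attention"] * num_layers

assert len(lfm_layer_types) == num_layers, (
    f"Invalid lfm_layer_types length: {len(lfm_layer_types)}, "
    f"expected {num_layers} (config.num_layers)."
)
```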
```python
# attention_mask: [batch, 1, seq, seq]
hidden_states = hidden_states.transpose(0, 1)

if is_fast_path_available and "cuda" in hidden_states.device.type and not is_torchdynamo_compiling():
```
hidden_states.device.type is typically the string 'cuda'/'cpu'; using "cuda" in hidden_states.device.type works but is an odd predicate and can hide mistakes. Prefer an explicit check like hidden_states.device.type == "cuda" for clarity.
Suggested change:
```diff
-if is_fast_path_available and "cuda" in hidden_states.device.type and not is_torchdynamo_compiling():
+if (
+    is_fast_path_available
+    and hidden_states.device.type == "cuda"
+    and not is_torchdynamo_compiling()
+):
```
```python
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_block import TransformerBlockSubmodules
from megatron.core.transformer.transformer_config import TransformerConfig

from primus.backends.megatron.core.transformer.attention import (
    Lfm2ShortConv,
    Lfm2ShortConvSubmodules,
)


def get_gpt_decoder_layer_specs(
    config: TransformerConfig,
    use_transformer_engine: bool,
    normalization: Optional[str] = None,
    qk_l2_norm: Optional[bool] = False,
    vp_stage: Optional[int] = None,
    pp_rank: Optional[int] = None,
) -> TransformerBlockSubmodules:
    """GPT block spec."""
    assert use_transformer_engine, "use_transformer_engine must be True"
```
get_gpt_decoder_layer_specs is annotated/typed as returning TransformerBlockSubmodules, but the function actually constructs and returns layer_specs (a list of ModuleSpec). This mismatch can confuse callers and static checkers, and may indicate the wrong API is being patched; update the return type/imports to match the actual return value (or adjust implementation to return the expected object).
```python
args = get_args()

lfm_layer_types = args.lfm_layer_types

if lfm_layer_types is None:
    lfm_layer_types = ["full_attention"] * args.num_layers
else:
    assert isinstance(
        lfm_layer_types, list
    ), f"lfm_layer_types must be list[str] or None, got {type(lfm_layer_types)}"
    invalid_type_values = [
        layer_type for layer_type in lfm_layer_types if not isinstance(layer_type, str)
    ]
    assert (
        not invalid_type_values
    ), f"lfm_layer_types must contain only str values, got invalid values: {invalid_type_values}"
    valid_layer_types = {"conv", "full_attention"}
    invalid_layer_types = sorted(
        {layer_type for layer_type in lfm_layer_types if layer_type not in valid_layer_types}
    )
    assert not invalid_layer_types, (
        f"Invalid lfm_layer_types values: {invalid_layer_types}. "
        f"Only {sorted(valid_layer_types)} are allowed."
    )

assert len(lfm_layer_types) == args.num_layers, (
    f"Invalid lfm_layer_types length: {len(lfm_layer_types)}, "
    f"expected {args.num_layers} (args.num_layers)."
)
```
lfm_layer_types validation uses args.num_layers, but layer construction iterates over config.num_layers. If these differ, lfm_layer_types[layer_number] can raise IndexError or silently misconfigure layers. Use a single source of truth (prefer config.num_layers) for both the default fill and the length check.
```python
# Parse config.moe_layer_freq to determine the pattern of expert/dense layers.
# 0 stands for dense layers, 1 stands for expert layers.
# For integer N: Creates a pattern with one expert layer every N layers.
# For string pattern: Evaluates the str directly (e.g. "[1,0,1]" for alternating expert/dense).
if isinstance(config.moe_layer_freq, int):
    moe_layer_pattern = [1 if (i % config.moe_layer_freq == 0) else 0 for i in range(config.num_layers)]
elif isinstance(config.moe_layer_freq, list):
    moe_layer_pattern = config.moe_layer_freq
    assert len(moe_layer_pattern) == config.num_layers, (
        f"Invalid length of moe_layer_pattern: {len(moe_layer_pattern)}, "
        f"expected {config.num_layers}, "
        f"current moe layer pattern: {config.moe_layer_freq}"
    )
else:
    raise ValueError(f"Invalid moe_layer_freq: {type(config.moe_layer_freq)}, {config.moe_layer_freq}")
```
The comment says string moe_layer_freq patterns are evaluated, but the code only accepts int or list and raises for str. Either update the comment to reflect the real behavior, or add explicit handling for str here (if this function can be called before the existing moe_layer_freq normalization patches run).
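If the string form should be supported here, a minimal sketch using `ast.literal_eval` (a safer stand-in for the `eval` the comment alludes to; this assumes string patterns are plain Python list literals like `"[1,0,1]"`):

```python
import ast

moe_layer_freq = config.moe_layer_freq
if isinstance(moe_layer_freq, str):
    # normalize "[1,0,1]" -> [1, 0, 1] before the int/list branches run
    moe_layer_freq = ast.literal_eval(moe_layer_freq)
```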
```python
)

assert not use_kitchen, "use_kitchen is not supported with LFM model."
# Note: patch_gpt_decoder_layer_specs must be applied bafter te_spec_provider_patches
```
Typo in comment: "bafter" -> "after".
Suggested change:
```diff
-# Note: patch_gpt_decoder_layer_specs must be applied bafter te_spec_provider_patches
+# Note: patch_gpt_decoder_layer_specs must be applied after te_spec_provider_patches
```
```python
# gpt_builders imports get_gpt_decoder_layer_specs directly; force import and
# patch its local binding.
try:
    import gpt_builders as gpt_builders_module  # pyright: ignore[reportMissingImports]
except ImportError as exc:
    log_rank_0(
        "[Patch:megatron.gpt.decoder_layer_specs] Failed to import gpt_builders; "
        "cannot patch local get_gpt_decoder_layer_specs binding."
    )
    raise RuntimeError("Failed to import required module gpt_builders") from exc

gpt_builders_module.get_gpt_decoder_layer_specs = primus_get_gpt_decoder_layer_specs
log_rank_0(
    "[Patch:megatron.gpt.decoder_layer_specs] Patched "
    "gpt_builders.get_gpt_decoder_layer_specs "
    f"-> {primus_get_gpt_decoder_layer_specs.__name__}"
)
```
This patch hard-fails if gpt_builders cannot be imported. gpt_builders only exists in some Megatron versions (Primus already treats it as optional in primus/core/utils/import_utils.py); in versions without it, this should be a no-op rather than aborting training. Consider using the same lazy-import pattern and only patch the local binding when the module is present.
Suggested change:
```diff
-# gpt_builders imports get_gpt_decoder_layer_specs directly; force import and
-# patch its local binding.
-try:
-    import gpt_builders as gpt_builders_module  # pyright: ignore[reportMissingImports]
-except ImportError as exc:
-    log_rank_0(
-        "[Patch:megatron.gpt.decoder_layer_specs] Failed to import gpt_builders; "
-        "cannot patch local get_gpt_decoder_layer_specs binding."
-    )
-    raise RuntimeError("Failed to import required module gpt_builders") from exc
-
-gpt_builders_module.get_gpt_decoder_layer_specs = primus_get_gpt_decoder_layer_specs
-log_rank_0(
-    "[Patch:megatron.gpt.decoder_layer_specs] Patched "
-    "gpt_builders.get_gpt_decoder_layer_specs "
-    f"-> {primus_get_gpt_decoder_layer_specs.__name__}"
-)
+# gpt_builders imports get_gpt_decoder_layer_specs directly; patch its local
+# binding only when that optional module exists in this Megatron version.
+import importlib
+import importlib.util
+
+if importlib.util.find_spec("gpt_builders") is None:
+    log_rank_0(
+        "[Patch:megatron.gpt.decoder_layer_specs] gpt_builders is not available; "
+        "skipping patch of local get_gpt_decoder_layer_specs binding."
+    )
+else:
+    gpt_builders_module = importlib.import_module("gpt_builders")
+    gpt_builders_module.get_gpt_decoder_layer_specs = primus_get_gpt_decoder_layer_specs
+    log_rank_0(
+        "[Patch:megatron.gpt.decoder_layer_specs] Patched "
+        "gpt_builders.get_gpt_decoder_layer_specs "
+        f"-> {primus_get_gpt_decoder_layer_specs.__name__}"
+    )
```
```python
def __init__(
    self,
    config: TransformerConfig,
    submodules: Optional[Lfm2ShortConvSubmodules],
    layer_number: int,
    attn_mask_type: AttnMaskType,
    attention_type: str | None = None,
    cp_comm_type: str | None = None,
    pg_collection: ProcessGroupCollection | None = None,
):
```
submodules is typed as Optional[...] but is dereferenced unconditionally (submodules.in_proj / submodules.out_proj). Either make submodules non-optional in the signature, or add an explicit check with a clear error so failures are easier to diagnose.
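A minimal sketch of the explicit-check option (the error message wording is illustrative):

```python
if submodules is None:
    raise ValueError(
        "Lfm2ShortConv requires Lfm2ShortConvSubmodules (in_proj/out_proj specs); got None."
    )
```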
```yaml
# LFM2 model configuration
# lfm_layer_types: ["conv", "conv", "full_attention", "conv", "conv", "conv", "full_attention", "conv", "conv", "conv", "full_attention", "conv", "conv", "conv", "full_attention", "conv", "conv", "conv", "full_attention", "conv", "conv", "full_attention", "conv", "conv"]
lfm_layer_types: null # list[str]
conv_L_cache: 3 # int
conv_bias: false # bool
```
These new config fields (lfm_layer_types, conv_L_cache, conv_bias) may be silently dropped when converting Primus configs to Megatron args: MegatronArgBuilder.update() only keeps keys present in Megatron's argparse defaults (primus/backends/megatron/argument_builder.py:92-110). If Megatron doesn't define these flags, megatron.training.get_args() won't have them and the new LFM code (e.g., args.conv_L_cache / args.lfm_layer_types) will fail at runtime. Consider adding an args patch to inject these fields into Megatron args (with defaults) during build_args, or extending the argument builder to allow these Primus-specific keys.
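A minimal sketch of the args-patch option, following Megatron's extra-args-provider convention (the `_add_lfm_args` name and exact flag spellings are hypothetical; the real patch would have to match how Primus wires providers into `build_args`):

```python
def _add_lfm_args(parser):
    """Hypothetical provider: register LFM-specific flags with Megatron's argparse parser."""
    group = parser.add_argument_group(title="lfm")
    group.add_argument("--lfm-layer-types", nargs="*", type=str, default=None)
    group.add_argument("--conv-l-cache", type=int, default=3)
    group.add_argument("--conv-bias", action="store_true", default=False)
    return parser
```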
```python
from megatron.training import get_args

args = get_args()

lfm_layer_types = args.lfm_layer_types

if lfm_layer_types is None:
    lfm_layer_types = ["full_attention"] * args.num_layers
else:
```
args.lfm_layer_types is accessed directly; if this attribute isn't present in the Megatron args namespace (e.g., because it isn't part of Megatron's argparse defaults and was dropped during config conversion), this will raise AttributeError. Use getattr(args, "lfm_layer_types", None) (and similarly for any new LFM args) to make the code resilient across Megatron versions/config paths.
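A minimal sketch of the defensive lookup:

```python
# tolerate Megatron arg namespaces that never registered the LFM flags
lfm_layer_types = getattr(args, "lfm_layer_types", None)
```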
```bash
export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key
```
The default-value quoting in these parameter expansions will include literal single quotes in the resulting env var values (e.g., HF_TOKEN becomes 'your_hf_token'), which will break authentication if users rely on the default/template. Use unquoted defaults or the same double-quote pattern used elsewhere in the repo (e.g., ${HF_TOKEN:-"your_hf_token"}).
Suggested change:
```diff
-export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
-export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key
+export HF_TOKEN="${HF_TOKEN:-"your_hf_token"}" # make it your own hf token
+export WANDB_API_KEY="${WANDB_API_KEY:-"your_wandb_api_key"}" # make it your own wandb api key
```
| if [ "$TURBO_USE_GROUPED_MLP" = "True" ]; then | ||
| export PRIMUS_TURBO_GROUPED_GEMM_BACKEND=TRITON | ||
| # export PRIMUS_TURBO_GROUPED_GEMM_BACKEND=CK | ||
| # export PRIMUS_TURBO_GROUPED_GEMM_BACKEND=HIPBLASLT | ||
| fi | ||
|
|
||
| export PRECISION_TYPE=FP8 # BF16 | ||
| export EXP=examples/megatron/configs/MI355X/lfm2_8B_A1B-${PRECISION_TYPE}-pretrain.yaml | ||
| # export EXP=examples/megatron/configs/MI355X/qwen3_30B_A3B-BF16-pretrain.yaml | ||
| export PRIMUS_TEAM=amd | ||
| PRIMUS_USER="tas-mi325x-$(date +%Y%m%d)" | ||
| export PRIMUS_USER | ||
| export PRIMUS_EXP_NAME=lfm2_8B_A1B_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_EP${PRIMUS_EP}_legacygg${LEGACY_GG}_turbogg${TURBO_USE_GROUPED_MLP}_${PRIMUS_TURBO_GROUPED_GEMM_BACKEND} |
PRIMUS_TURBO_GROUPED_GEMM_BACKEND is referenced in PRIMUS_EXP_NAME even when TURBO_USE_GROUPED_MLP is False, but it is only exported inside the True branch. This makes the experiment name depend on whether the variable was previously set in the shell (or end up empty). Consider setting an explicit default when grouped MLP is disabled (e.g., "NONE") or constructing PRIMUS_EXP_NAME conditionally.
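One way to pin the default, assuming a `NONE` placeholder is acceptable in experiment names:

```bash
if [ "$TURBO_USE_GROUPED_MLP" = "True" ]; then
    export PRIMUS_TURBO_GROUPED_GEMM_BACKEND=TRITON
else
    # explicit default so PRIMUS_EXP_NAME never inherits stale shell state
    export PRIMUS_TURBO_GROUPED_GEMM_BACKEND=NONE
fi
```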
```bash
    --use_pytorch_profiler $PROFILE \
    --profile_step_end 7 \
    --profile_step_start 6 \
    2>&1 | tee "output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME/log_node_${NODE_RANK}.txt"
```
NODE_RANK is used in the log filename but is never set in this script. If NODE_RANK is unset, logs will be written to a path with an empty rank (e.g., log_node_.txt) and may collide across processes. Set a default (e.g., NODE_RANK=${PET_NODE_RANK:-0}) or use a safer fallback in the filename.
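A minimal sketch of the suggested fallback (`PET_NODE_RANK` is the node-rank variable torchrun exports):

```bash
# default to torchrun's node rank, then 0 for single-node runs
NODE_RANK=${NODE_RANK:-${PET_NODE_RANK:-0}}
```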
```python
if is_causal_conv1d_available():
    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
else:
    print(f"Unable to import causal_conv1d!!!")
    causal_conv1d_fn, causal_conv1d_update = None, None

kernel_modules = (causal_conv1d_fn, causal_conv1d_update)
is_fast_path_available = all(kernel_modules)
```
This module prints to stdout at import time when causal_conv1d isn't available. In distributed runs this can spam logs from every rank and is hard to control. Prefer using the project logger (or at least warnings.warn) and avoid side effects at import time; also consider checking availability when the fast path is actually requested.
Suggested change:
```diff
-if is_causal_conv1d_available():
-    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
-else:
-    print(f"Unable to import causal_conv1d!!!")
-    causal_conv1d_fn, causal_conv1d_update = None, None
-
-kernel_modules = (causal_conv1d_fn, causal_conv1d_update)
-is_fast_path_available = all(kernel_modules)
+causal_conv1d_fn, causal_conv1d_update = None, None
+
+
+def _get_causal_conv1d_kernels():
+    global causal_conv1d_fn, causal_conv1d_update
+    if causal_conv1d_fn is not None and causal_conv1d_update is not None:
+        return causal_conv1d_fn, causal_conv1d_update
+    if not is_causal_conv1d_available():
+        return None, None
+    from causal_conv1d import causal_conv1d_fn as _causal_conv1d_fn
+    from causal_conv1d import causal_conv1d_update as _causal_conv1d_update
+    causal_conv1d_fn, causal_conv1d_update = _causal_conv1d_fn, _causal_conv1d_update
+    return causal_conv1d_fn, causal_conv1d_update
+
+
+kernel_modules = (causal_conv1d_fn, causal_conv1d_update)
+is_fast_path_available = is_causal_conv1d_available()
```
```python
except ImportError as exc:
    log_rank_0(
        "[Patch:megatron.gpt.decoder_layer_specs] Failed to import gpt_builders; "
        "cannot patch local get_gpt_decoder_layer_specs binding."
    )
    raise RuntimeError("Failed to import required module gpt_builders") from exc
```
If gpt_builders is not importable (older Megatron versions), this patch raises RuntimeError and aborts training even though those versions may not need the additional local-binding patch. To preserve the repo's stated multi-version compatibility (see primus/core/utils/import_utils.py), handle ImportError by logging and returning instead of raising.
Suggested change:
```diff
-except ImportError as exc:
+except ImportError:
     log_rank_0(
         "[Patch:megatron.gpt.decoder_layer_specs] Failed to import gpt_builders; "
-        "cannot patch local get_gpt_decoder_layer_specs binding."
+        "skipping local get_gpt_decoder_layer_specs binding patch for compatibility."
     )
-    raise RuntimeError("Failed to import required module gpt_builders") from exc
+    return
```
```python
    'The fp8 argument in "get_gpt_layer_with_transformer_engine_spec" has been deprecated'
    " and will be removed soon. Please update your code accordingly."
)

assert not use_kitchen, "use_kitchen is not supported with LFM model."
# Note: patch_gpt_decoder_layer_specs must be applied bafter te_spec_provider_patches
```
The deprecation warning text references "get_gpt_layer_with_transformer_engine_spec", but this code is in get_lfm_layer_with_transformer_engine_spec. Also, the comment has a typo ("bafter"). Please correct these to avoid misleading users and to keep logs/searchability accurate.
Suggested change:
```diff
-    'The fp8 argument in "get_gpt_layer_with_transformer_engine_spec" has been deprecated'
+    'The fp8 argument in "get_lfm_layer_with_transformer_engine_spec" has been deprecated'
     " and will be removed soon. Please update your code accordingly."
 )

 assert not use_kitchen, "use_kitchen is not supported with LFM model."
-# Note: patch_gpt_decoder_layer_specs must be applied bafter te_spec_provider_patches
+# Note: patch_gpt_decoder_layer_specs must be applied after te_spec_provider_patches
```
```python
@register_patch(
    "megatron.gpt.decoder_layer_specs",
    backend="megatron",
    phase="before_train",
    description=(
        "Monkey patch get_gpt_decoder_layer_specs to use Primus implementation "
        "when lfm_layer_types is configured for LFM model."
    ),
    condition=lambda ctx: getattr(get_args(ctx), "lfm_layer_types", None) is not None,
    priority=42,  # must patch after te_spec_provider_patches
)
```
New LFM-specific behavior is introduced via monkey-patching Megatron's get_gpt_decoder_layer_specs, but there are no accompanying unit tests to validate (1) conv vs full_attention layer selection and (2) patch behavior across Megatron versions (with/without gpt_builders). Given the existing unit test suite under tests/unit_tests, please add targeted tests to prevent regressions.
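A rough sketch of the first test, heavily hedged: the `SimpleNamespace` stubs below would need whatever additional `TransformerConfig` fields the builder actually reads, and the import path assumes the file listed in the summary table:

```python
from types import SimpleNamespace
from unittest import mock

import primus.backends.megatron.core.models.gpt.gpt_layer_specs as specs_mod


def test_conv_vs_full_attention_layer_selection():
    # hypothetical stubs; extend with any fields the real builder touches
    config = SimpleNamespace(num_layers=2, moe_layer_freq=1)
    fake_args = SimpleNamespace(
        num_layers=2,
        lfm_layer_types=["conv", "full_attention"],
        conv_L_cache=3,
        conv_bias=False,
    )
    with mock.patch.object(specs_mod, "get_args", return_value=fake_args):
        layer_specs = specs_mod.get_gpt_decoder_layer_specs(
            config, use_transformer_engine=True
        )
    assert len(layer_specs) == 2
```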
```bash
export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key
```
Same issue as run.sh: ${HF_TOKEN:-'your_hf_token'} / ${WANDB_API_KEY:-'your_wandb_api_key'} export values containing literal single quotes. Prefer ${HF_TOKEN:-"your_hf_token"} (as used in other repo scripts) or omit the inner quotes.
Suggested change:
```diff
-export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
-export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key
+export HF_TOKEN="${HF_TOKEN:-your_hf_token}" # make it your own hf token
+export WANDB_API_KEY="${WANDB_API_KEY:-your_wandb_api_key}" # make it your own wandb api key
```
```bash
export PRIMUS_EXP_NAME=lfm2_8B_A1B_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_EP${PRIMUS_EP}_legacygg${LEGACY_GG}_turbogg${TURBO_USE_GROUPED_MLP}_${PRIMUS_TURBO_GROUPED_GEMM_BACKEND}
export PRIMUS_EXP_NAME=debug
```
PRIMUS_EXP_NAME is computed and immediately overwritten with debug, so the earlier value is dead code and the output/log path will always be .../debug/.... If this is meant to be optional, gate it behind an env var/flag; otherwise remove the computed assignment to avoid confusion.
Suggested change:
```diff
-export PRIMUS_EXP_NAME=lfm2_8B_A1B_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_EP${PRIMUS_EP}_legacygg${LEGACY_GG}_turbogg${TURBO_USE_GROUPED_MLP}_${PRIMUS_TURBO_GROUPED_GEMM_BACKEND}
-export PRIMUS_EXP_NAME=debug
+DEFAULT_PRIMUS_EXP_NAME=lfm2_8B_A1B_${PRECISION_TYPE}_MBS${MBS}_GBS${GBS}_EP${PRIMUS_EP}_legacygg${LEGACY_GG}_turbogg${TURBO_USE_GROUPED_MLP}_${PRIMUS_TURBO_GROUPED_GEMM_BACKEND}
+if [ "${USE_COMPUTED_PRIMUS_EXP_NAME:-0}" = "1" ]; then
+    export PRIMUS_EXP_NAME="$DEFAULT_PRIMUS_EXP_NAME"
+else
+    export PRIMUS_EXP_NAME=debug
+fi
```
```python
    sequence_len_offset: Optional[int] = None,
    *,
    inference_params: Optional[BaseInferenceContext] = None,
) -> tuple[Tensor, Tensor]:
```
The forward() type annotation says it returns tuple[Tensor, Tensor], but the implementation returns (out.transpose(0, 1), None). Please update the return annotation to allow None (e.g., tuple[Tensor, Optional[Tensor]]) to match behavior and avoid type/contract confusion.
Suggested change:
```diff
-) -> tuple[Tensor, Tensor]:
+) -> tuple[Tensor, Optional[Tensor]]:
```
```diff
 # hyper parameters
 train_iters: 10
-micro_batch_size: 8
+micro_batch_size: 16
 global_batch_size: 512
```
This PR is titled "LFM model support", but this change bumps Qwen3-30B micro_batch_size from 8 to 16. If this is unrelated to LFM2 enablement, consider moving it to a separate PR (or note the rationale in the PR description) to keep review/rollback scope clear.