[Issue]: Using basic instructions to run simple example results in crash on gfx950

### Problem Description

I followed the basic instructions to pull the image and start the container:

```
docker pull rocm/atom-dev:latest

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/atom-dev:latest
```

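Once inside the container, I ran the example command. The engine initializes successfully, but the run then fails with a `ValueError` raised from `tokenizer.apply_chat_template`: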

```
root@smci355-ccs-aus-m11-05:/# python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
[aiter] import [module_aiter_core] under /app/aiter-test/aiter/jit/module_aiter_core.so
[atom 14:10:03] Engine kwargs: {'trust_remote_code': False, 'tensor_parallel_size': 1, 'data_parallel_size': 1, 'enforce_eager': False, 'enable_prefix_caching': True, 'port': 8006, 'kv_cache_dtype': 'fp8', 'max_model_len': None, 'max_num_batched_tokens': 16384, 'attn_prefill_chunk_size': 16384, 'enable_chunked_prefill': True, 'scheduler_delay_factor': 0.0, 'max_num_seqs': 512, 'gpu_memory_utilization': 0.9, 'load_dummy': False, 'enable_expert_parallel': False, 'torch_profiler_dir': None, 'enable_dp_attention': False, 'kv_transfer_config': '{}', 'mark_trace': False, 'online_quant_config': None, 'hf_overrides': None, 'kv_cache_block_size': 16, 'compilation_config': CompilationConfig(level=3, use_cudagraph=True, local_cache_dir=None, cudagraph_capture_sizes=[1, 2, 4], cuda_graph_sizes=[512], debug_dump_path='', traced_files=set(), cache_dir='', use_inductor=True, cudagraph_mode=None, compilation_time=0.0, splitting_ops=None, cudagraph_copy_inputs=False, inductor_compile_config={}, compile_sizes=None, static_forward_context={}), 'speculative_config': None, 'enable_tbo': False, 'enable_tbo_decode': False, 'enable_low_latency': False}
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 654/654 [00:00<00:00, 8.04MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 177/177 [00:00<00:00, 1.28MB/s]
`torch_dtype` is deprecated! Use `dtype` instead!
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.6k/50.6k [00:00<00:00, 100MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 17.5MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 479kB/s]
[atom 14:10:05] Engine Core Mgr: Creating EngineCore for DP rank 0/1
[atom 14:10:05] Creating EngineCore process: DP rank 0, will use GPUs 0 to 0
[atom 14:10:05] Engine Core Mgr: Starting EngineCore for DP rank 0/1
[aiter] import [module_aiter_core] under /app/aiter-test/aiter/jit/module_aiter_core.so
[aiter] message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_d34088bc'), local_subscribe_addr='ipc:///tmp/fcb8f94e-1ad1-4c38-a69c-8ef6559882c6', remote_subscribe_addr=None, remote_addr_ipv6=False)
[aiter] import [module_aiter_core] under /app/aiter-test/aiter/jit/module_aiter_core.so
[aiter] merge tuned file under model_configs/ and configs/ /app/aiter-test/aiter/configs/bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/dsv4_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/qwen32B_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/gptoss_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/llama70B_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/kimi_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/kimik2_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/llama405B_bf16_tuned_gemm.csv:/app/aiter-test/aiter/configs/model_configs/glm5_bf16_tuned_gemm.csv
[atom 14:10:16] Create lazy wrapper for FusedMoE to change the naming
[atom 14:10:16] ModelRunner rank=0, dp_rank_local=0, local_device_rank=0, device=cuda:0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[aiter] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
[atom 14:10:17] disable_mmap: False
[atom.model_loader.weight_utils 14:10:18] Using model weights format ['*.safetensors']
[atom.model_loader.weight_utils 14:10:37] Time spent downloading weights for meta-llama/Meta-Llama-3-8B: 19.594633 seconds
Loading safetensors shards[meta-llama/Meta-Llama-3-8B]: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11.24it/s]
[atom 14:10:39] Model load done: meta-llama/Meta-Llama-3-8B
[atom 14:10:42] Using cache directory: /root/.cache/atom/torch_compile_cache/297e7c3ac8/rank_0/backbone for vLLM's torch.compile
[atom 14:10:42] Dynamo bytecode transform time: 2.24 s
[atom 14:10:42] Cache the graph for dynamic shape for later use
[atom 14:10:43] Compiling a graph for dynamic shape takes 1.58 s
[atom 14:10:43] Computation graph saved to /root/.cache/atom/torch_compile_cache/297e7c3ac8/rank_0/backbone/computation_graph.py
[aiter] import [module_rmsnorm_quant] under /app/aiter-test/aiter/jit/module_rmsnorm_quant.so
[aiter] shape is M:16384, N:6144, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:16384, N:4096, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:16384, N:28672, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] import [module_activation] under /app/aiter-test/aiter/jit/module_activation.so
[aiter] shape is M:16384, N:4096, K:14336 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:2, N:128256, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] import [module_sample] under /app/aiter-test/aiter/jit/module_sample.so
[atom 14:10:46] Model Runner0/1: warmup_model 6.06 seconds with 2 reqs 16384 tokens
[atom 14:10:46] Model warmup done: meta-llama/Meta-Llama-3-8B
[atom 14:10:46] Memory budget: total_gpu=287.98GB, free=270.40GB, utilization=0.9, budget=259.19GB, peak_torch=16.85GB, cudagraph_est=0.34GB, safety=5.76GB, available_for_kv=236.24GB, block_bytes=1081344, num_kvcache_blocks=234579
[atom 14:10:46] Concurrent capacity vs context length (max_model_len=8192, block_size=16, max_slots=512, pool_blocks=234579):
   10% (    819 tok):     52 blk/req → max_concurrent=512   (bound by slots)
   30% (   2457 tok):    154 blk/req → max_concurrent=512   (bound by slots)
   50% (   4096 tok):    256 blk/req → max_concurrent=512   (bound by slots)
   70% (   5734 tok):    359 blk/req → max_concurrent=512   (bound by slots)
   90% (   7372 tok):    461 blk/req → max_concurrent=508   (bound by blocks)
  100% (   8192 tok):    512 blk/req → max_concurrent=458   (bound by blocks)
[atom 14:10:46] Binding KV cache for target model starting at layer_id=0
/opt/venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  return func(*args, **kwargs)
[rank0]:[W613 14:10:46.310280669 ProcessGroupNCCL.cpp:5140] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
Capturing bs=4, max_q_len=1:   0%|                                                                                                                                        | 0/3 [00:00<?, ?it/s][aiter] import [module_rope_2c_cached_positions_fwd] under /app/aiter-test/aiter/jit/module_rope_2c_cached_positions_fwd.so
[aiter] import [module_cache] under /app/aiter-test/aiter/jit/module_cache.so
[aiter] LoadKernel: _ZN5aiter31pa_bf16_pertokenFp8_gqa8_2tg_4wE hsaco: /app/aiter-test/hsa//gfx950/pa/pa_bf16_pertokenFp8_gqa8_2tg_4w.co
[aiter] shape is M:4, N:4096, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:4, N:28672, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:4, N:4096, K:14336 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:4, N:128256, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
Capturing bs=2, max_q_len=1:  33%|██████████████████████████████████████████▋                                                                                     | 1/3 [00:00<00:00,  7.60it/s][aiter] shape is M:2, N:4096, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:2, N:28672, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:2, N:4096, K:14336 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
Capturing bs=1, max_q_len=1:  33%|██████████████████████████████████████████▋                                                                                     | 1/3 [00:00<00:00,  7.60it/s][aiter] shape is M:1, N:4096, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:1, N:28672, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:1, N:4096, K:14336 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
[aiter] shape is M:1, N:128256, K:4096 dtype='torch.bfloat16' otype='torch.bfloat16' bias=False, scaleAB=False, bpreshuffle=False, not found tuned config in /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config! using torch solution:0
Capturing bs=1, max_q_len=1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 10.02it/s]
[atom 14:10:46] Post-init memory: actual=254.65GB (88.4%), target=259.19GB (90%)
[atom 14:10:46] Engine Core: cudagraph capture[1, 2, 4] cost: 0.31 seconds
[atom 14:10:46] Engine Core: load model runner success
[atom 14:10:47] Engine Core: EngineCore fully initialized and ready
[atom 14:10:47] Engine Core Mgr: DP rank 0 is fully initialized and ready
[atom 14:10:47] Engine Core Mgr: All EngineCores are fully initialized and ready
[atom 14:10:47] Engine Core Mgr: All 1 EngineCores initialized and ready
[atom 14:10:47] KV transfer config loaded: {}
[atom 14:10:47] LLMEngine init with 1 data parallel ranks
[atom 14:10:47] LLMEngine init with 1 data parallel ranks
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/ATOM/atom/examples/simple_inference.py", line 94, in <module>
    main()
  File "/app/ATOM/atom/examples/simple_inference.py", line 69, in main
    apply_chat_template(tokenizer, custom_encoder, [{"role": "user", "content": p}])
  File "/app/ATOM/atom/entrypoints/openai/chat_encoders.py", line 118, in apply_chat_template
    return tokenizer.apply_chat_template(messages, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 3009, in apply_chat_template
    chat_template = self.get_chat_template(chat_template, tools)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 3191, in get_chat_template
    raise ValueError(
ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
[atom 14:10:48] Engine Core Mgr: Shutting down all 1 EngineCores

```
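Reading the traceback, the exception comes from `tokenizer.apply_chat_template` (called through `atom/entrypoints/openai/chat_encoders.py`), after the log reports that all EngineCores are initialized. That suggests the failure may be about the tokenizer for `meta-llama/Meta-Llama-3-8B` having no `chat_template` set, rather than anything specific to gfx950. Below is a minimal sketch of a guard that would avoid the error. The helper `build_prompt` and the two stand-in tokenizer classes are hypothetical names used only for illustration; this is not ATOM's actual code, and falling back to the raw prompt text is an assumption about the desired behavior.

```python
from typing import Any


def build_prompt(tokenizer: Any, user_text: str) -> str:
    """Return a prompt string, using the chat template only if one is set.

    Hypothetical helper for illustration; not part of ATOM.
    """
    messages = [{"role": "user", "content": user_text}]
    # Hugging Face tokenizers expose a `chat_template` attribute. It is None
    # (or missing) when the checkpoint ships without a template.
    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # Assumed fallback: pass the raw text through unchanged.
    return user_text


class BaseModelTokenizer:
    """Stand-in for a tokenizer without a chat template (hypothetical)."""

    chat_template = None

    def apply_chat_template(self, messages, **kwargs):
        # Mirrors the failure mode seen in the traceback above.
        raise ValueError("tokenizer.chat_template is not set")


class ChatModelTokenizer:
    """Stand-in for a tokenizer that does have a chat template (hypothetical)."""

    chat_template = "{{ messages }}"

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=True):
        return "<user>" + messages[0]["content"] + "<assistant>"


print(build_prompt(BaseModelTokenizer(), "Hello"))
print(build_prompt(ChatModelTokenizer(), "Hello"))
```

With the stand-in that has no template, the prompt is passed through unchanged instead of raising. A tokenizer that does ship a chat template takes the first branch.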

### Operating System

Ubuntu 24.04.4 LTS

### CPU

AMD EPYC 9575F 64-Core Processor

### GPU

AMD Instinct MI355 OAM

### ROCm Version

7.2.4

### ROCm Component

_No response_

### Steps to Reproduce

```
docker pull rocm/atom-dev:latest

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/atom-dev:latest

# Inside the container, as root:
export HF_TOKEN=<hf token>
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

<details>
<summary>rocminfo --support output</summary>

```
Paste output here
```

</details>


### Additional Information

_No response_

[Issue]: Using basic instructions to run simple example results in crash on gfx950 #1207
