Segmentation fault when using XPU GPU acceleration #1480

@pohyongrui

I'm trying to train a MACE model using Intel PVC GPUs, but I consistently receive a segmentation fault like this:

Segmentation fault from GPU at 0xff00000021a2a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 1 (Write), banned: 1, aborting.
Abort was called at 288 line in file:
/home/ubit/rpmbuild/BUILD/intel-compute-runtime-25.18.33578.42/shared/source/os_interface/linux/drm_neo.cpp
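When comparing faults across several runs, it can help to pull the fields out of the fault line programmatically. The following is a minimal sketch, assuming the line always has the shape shown above; the pattern is derived from this single example only, other runtime versions may print it differently, and the names `FAULT_RE` and `parse_gpu_fault` are hypothetical helpers, not part of MACE or the Intel compute runtime.

```python
import re

# Pattern built from the one fault line in this report (an assumption:
# other intel-compute-runtime versions may format the message differently).
FAULT_RE = re.compile(
    r"Segmentation fault from GPU at (?P<address>0x[0-9a-fA-F]+), "
    r"ctx_id: (?P<ctx_id>\d+) \((?P<engine>\w+)\) "
    r"type: \d+ \((?P<type>\w+)\), "
    r"level: \d+ \((?P<level>\w+)\), "
    r"access: \d+ \((?P<access>\w+)\), "
    r"banned: (?P<banned>\d+)"
)


def parse_gpu_fault(line):
    """Return the fault fields as a dict, or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["address"] = int(fields["address"], 16)  # faulting GPU address
    fields["ctx_id"] = int(fields["ctx_id"])
    fields["banned"] = bool(int(fields["banned"]))
    return fields
```

Feeding each job's stderr through `parse_gpu_fault` line by line would show whether the fault type, page-table level, and access kind stay the same from run to run.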

This is what my training configuration file looks like:

name: "test"
seed: 123
foundation_model: "test.model"
multiheads_finetuning: False
default_dtype: "float64"
compute_avg_num_neighbors: True
E0s: "{6: -.46589652E-01, 8: -.52704460E-01, 46: -.14756416E+01}"
pair_repulsion: True
train_file: "training_set.xyz"
valid_fraction: 0.1
test_file: "test_set.xyz"
energy_weight: 1.0
forces_weight: 10.0
energy_key: "DFT_energy"
forces_key: "DFT_forces"
stress_key: "DFT_stress"
scaling: "rms_forces_scaling"
lr: 0.001
max_num_epochs: 200
swa: True
ema: True
ema_decay: 0.995
amsgrad: True
keep_checkpoints: True
num_worker: 0
batch_size: 4
valid_batch_size: 4
device: xpu

The segmentation fault usually appears after a certain number of epochs have passed. I've found that reducing the training set size and batch size helps delay the onset of this problem, but doing so prolongs the training time significantly.
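Because the onset depends on the number of epochs, the training set size, and the batch size, one thing worth checking is whether device memory use grows over time. The following is a rough sketch of a helper that could be called periodically from a custom training loop. It assumes a PyTorch build whose `torch.xpu` module exposes `memory_allocated` and `memory_reserved`, and it returns `None` otherwise. The name `xpu_memory_mib` is hypothetical, and nothing in this report shows where such a call could be hooked into the MACE training script.

```python
def xpu_memory_mib():
    """Return allocated/reserved XPU memory in MiB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return None
    # Assumption: these memory queries exist in the installed PyTorch build.
    if not (hasattr(xpu, "memory_allocated") and hasattr(xpu, "memory_reserved")):
        return None
    mib = 1024 ** 2
    return {
        "allocated": xpu.memory_allocated() / mib,
        "reserved": xpu.memory_reserved() / mib,
    }
```

If the logged values climb steadily from epoch to epoch, that would point toward memory growth as a factor; flat values would suggest looking elsewhere.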
