Segmentation fault when using XPU GPU acceleration

I'm trying to train a MACE model using Intel PVC GPUs, but I consistently receive a segmentation fault like this:

```
Segmentation fault from GPU at 0xff00000021a2a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 1 (Write), banned: 1, aborting.
Abort was called at 288 line in file:
/home/ubit/rpmbuild/BUILD/intel-compute-runtime-25.18.33578.42/shared/source/os_interface/linux/drm_neo.cpp

```
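For triage, the fault line can be split into its fields so that repeated crashes are easy to compare. This is only a sketch: the regular expression is written against the single message shown above, the group names (for example `engine` for the `CCS` token) are my own labels rather than anything defined by the driver, and `parse_gpu_fault` is a hypothetical helper.

```python
import re

# Pattern written against the one fault line quoted in this issue.
# Group names are illustrative labels, not official driver terminology.
FAULT_RE = re.compile(
    r"Segmentation fault from GPU at (?P<address>0x[0-9a-fA-F]+), "
    r"ctx_id: (?P<ctx_id>\d+) \((?P<engine>\w+)\) "
    r"type: (?P<type_code>\d+) \((?P<type>\w+)\), "
    r"level: (?P<level_code>\d+) \((?P<level>\w+)\), "
    r"access: (?P<access_code>\d+) \((?P<access>\w+)\), "
    r"banned: (?P<banned>\d+)"
)


def parse_gpu_fault(line):
    """Return the fault fields as a dict, or None if the line does not match."""
    match = FAULT_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Segmentation fault from GPU at 0xff00000021a2a000, ctx_id: 1 (CCS) "
        "type: 0 (NotPresent), level: 1 (PDE), access: 1 (Write), "
        "banned: 1, aborting."
    )
    print(parse_gpu_fault(sample))
```

Running this over the log of each failed run would show whether the faulting address, fault type, and access kind stay the same between crashes.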

This is what my training configuration file looks like:

```
name: "test"
seed: 123
foundation_model: "test.model"
multiheads_finetuning: False
default_dtype: "float64"
compute_avg_num_neighbors: True
E0s: "{6: -.46589652E-01, 8: -.52704460E-01, 46: -.14756416E+01}"
pair_repulsion: True
train_file: "training_set.xyz"
valid_fraction: 0.1
test_file: "test_set.xyz"
energy_weight: 1.0
forces_weight: 10.0
energy_key: "DFT_energy"
forces_key: "DFT_forces"
stress_key: "DFT_stress"
scaling: "rms_forces_scaling"
lr: 0.001
max_num_epochs: 200
swa: True
ema: True
ema_decay: 0.995
amsgrad: True
keep_checkpoints: True
num_worker: 0
batch_size: 4
valid_batch_size: 4
device: xpu
```

The segmentation fault usually appears only after several epochs have completed. Reducing the training set size and the batch size delays the onset of the problem, but it also prolongs training significantly.
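Since smaller batches delay the crash, one thing that could be checked is whether device memory grows from epoch to epoch. That is an unconfirmed guess on my part, not a diagnosis. The sketch below assumes a PyTorch build that exposes `torch.xpu.memory_allocated` and `torch.xpu.max_memory_allocated`, and it returns `None` when they are missing. The helpers `xpu_memory_mib` and `is_growing` are hypothetical names, and where to call them inside the training loop is left open.

```python
import importlib.util


def xpu_memory_mib():
    """Return (allocated, peak) XPU memory in MiB, or None if unavailable."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    import torch

    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return None  # no usable XPU device
    if not hasattr(xpu, "memory_allocated"):
        return None  # this PyTorch build lacks XPU memory statistics
    mib = 1024 ** 2
    return xpu.memory_allocated() / mib, xpu.max_memory_allocated() / mib


def is_growing(samples, tolerance=0.05):
    """True if the last sample exceeds the first by more than `tolerance`."""
    if len(samples) < 2:
        return False
    return samples[-1] > samples[0] * (1.0 + tolerance)


if __name__ == "__main__":
    # Assumed usage: record one reading at the end of every epoch.
    history = []
    reading = xpu_memory_mib()
    if reading is not None:
        history.append(reading[0])
    print("reading:", reading, "growing:", is_growing(history))
```

If the allocated figure climbs steadily until the fault, that would point towards memory exhaustion or fragmentation. A flat curve would suggest the cause lies elsewhere.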


Segmentation fault when using XPU GPU acceleration #1480

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Segmentation fault when using XPU GPU acceleration #1480

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions