I'm trying to train a MACE model using Intel PVC GPUs, but I consistently receive a segmentation fault like this:
Segmentation fault from GPU at 0xff00000021a2a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 1 (Write), banned: 1, aborting.
Abort was called at 288 line in file:
/home/ubit/rpmbuild/BUILD/intel-compute-runtime-25.18.33578.42/shared/source/os_interface/linux/drm_neo.cpp
This is what my training configuration file looks like:
name: "test"
seed: 123
foundation_model: "test.model"
multiheads_finetuning: False
default_dtype: "float64"
compute_avg_num_neighbors: True
E0s: "{6: -.46589652E-01, 8: -.52704460E-01, 46: -.14756416E+01}"
pair_repulsion: True
train_file: "training_set.xyz"
valid_fraction: 0.1
test_file: "test_set.xyz"
energy_weight: 1.0
forces_weight: 10.0
energy_key: "DFT_energy"
forces_key: "DFT_forces"
stress_key: "DFT_stress"
scaling: "rms_forces_scaling"
lr: 0.001
max_num_epochs: 200
swa: True
ema: True
ema_decay: 0.995
amsgrad: True
keep_checkpoints: True
num_worker: 0
batch_size: 4
valid_batch_size: 4
device: xpu
The segmentation fault usually appears only after a number of epochs have completed. Reducing the training set size and batch size delays the onset of the problem, but it also prolongs the training time significantly.
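Because the fault shows up later with smaller batches, one thing worth checking is whether device memory grows from epoch to epoch. Below is a minimal diagnostic sketch for that, not something taken from MACE itself. The helper names `xpu_memory_mib` and `is_growing` are hypothetical, and the sketch assumes a PyTorch build whose `torch.xpu` module exposes `memory_allocated()` and `memory_reserved()`; if it does not, the helper simply returns None.

```python
def xpu_memory_mib():
    """Return (allocated, reserved) XPU memory in MiB, or None if unavailable.

    Assumption: the installed PyTorch build exposes torch.xpu.memory_allocated
    and torch.xpu.memory_reserved. Older builds may not, hence the guards.
    """
    try:
        import torch
    except ImportError:
        return None
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return None
    if not (hasattr(xpu, "memory_allocated") and hasattr(xpu, "memory_reserved")):
        return None
    mib = 1024 ** 2
    return xpu.memory_allocated() / mib, xpu.memory_reserved() / mib


def is_growing(samples, tolerance_mib=1.0):
    """Return True if every sample exceeds the previous one by more than tolerance_mib."""
    if len(samples) < 2:
        return False
    return all(b - a > tolerance_mib for a, b in zip(samples, samples[1:]))
```

Calling `xpu_memory_mib()` once at the end of each epoch and passing the collected allocated values to `is_growing` would show whether usage climbs steadily before the fault, which would point toward memory exhaustion rather than a one-off invalid access.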