[BUG] _DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter. #5087

@myyelishu

Bug summary

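I first trained the model graph.pth on a GPU on cluster A, then used that model to run LAMMPS on the CPU nodes of cluster B. Both clusters have the GPU build of DeePMD-kit 3.1.0, installed offline. The error occurred while LAMMPS was running MD. My MD run has two stages: the first brings the system to equilibrium and the second applies shear. After many further tests, I have seen two kinds of errors in total. The first occurs partway through the first MD stage, with this error output:

“ 477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0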
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_6 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
cc_ext, _40, fp, ap, input_prec, = _39
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_41 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
node_ebd_inp = torch.slice(_2, 2)
repflows = self.repflows
_3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd, edge_ebd, h2, rot_mat, sw, = _3
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
_72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
_73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
if has_spin:

Traceback of TorchScript, original code (most recent call last):
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
comm_dict: Optional[dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic

    ext_atom_mask = self.make_atom_mask(extended_atype)
    ret_dict = self.forward_atomic(
               ~~~~~~~~~~~~~~~~~~~ <--- HERE
        extended_coord,
        torch.where(ext_atom_mask, extended_atype, 0),

File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 238, in forward_atomic
if self.do_grad_r() or self.do_grad_c():
extended_coord.requires_grad_(True)
descriptor, rot_mat, g2, h2, sw = self.descriptor(
~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa3.py", line 498, in forward
node_ebd_inp = node_ebd_ext[:, :nloc, :]
# repflows
node_ebd, edge_ebd, h2, rot_mat, sw = self.repflows(
~~~~~~~~~~~~~ <--- HERE
nlist,
extended_coord,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/repflows.py", line 599, in forward
assert "recv_num" in comm_dict
assert "communicator" in comm_dict
ret = torch.ops.deepmd.border_op(
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
comm_dict["send_list"],
comm_dict["send_proc"],
RuntimeError: index out of range in self (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1749640039377/work/source/lmp/pair_deepmd.cpp:253)
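Last command: run 500”

(The first two lines of this output, steps 477 and 478, together with the final "run 500", show that the job terminated partway through.) The second error appeared after I increased the skin value of the neighbor command on top of the previous setup. With that change the first MD stage ran to completion, but the job failed just as the second MD stage was about to start, with this error output:

“Setting up Verlet run ...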
Unit style : metal
Current step : 0
Time step : 0.001
ERROR on proc 141: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_6 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
cc_ext, _40, fp, ap, input_prec, = _39
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_41 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
node_ebd_inp = torch.slice(_2, 2)
repflows = self.repflows
_3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd, edge_ebd, h2, rot_mat, sw, = _3
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
_72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
_73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
if has_spin:

Traceback of TorchScript, original code (most recent call last):
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
comm_dict: Optional[dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic

    ext_atom_mask = self.make_atom_mask(extended_atype)
    ret_dict = self.forward_atomic(
               ~~~~~~~~~~~~~~~~~~~ <--- HERE
        extended_coord,
        torch.where(ext_atom_mask, extended_atype, 0),

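File "/public/ERROR on proc 162: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.”

As mentioned above, I ran many tests, and all of them failed:
1. Descriptor: models trained with dpa3 and with se_e2_a, both using the GPU build of 3.1.0 with the PyTorch backend, failed when running MD.
2. Training the model on a GPU on cluster A and then running LAMMPS on the same GPU also failed.
3. Using the same model, running MD with LAMMPS from either the CPU build or the GPU build of DeePMD-kit 3.1.0 failed.

After many attempts, one LAMMPS run did succeed. I trained a model graph.pth with the PyTorch backend and the se_e2_a descriptor, and graph.pth failed when used for MD in LAMMPS. I then converted that model from the PyTorch backend to the TensorFlow backend with the command dp convert-backend graph.pth graph.pb and passed the result to LAMMPS, and this time the MD ran successfully. This seems to suggest that .pth models have some problem during MD simulation in LAMMPS. The dpa3 descriptor does currently give better accuracy: on my dataset, dpa3 is twice as accurate as se_e2_a. However, dpa3 only supports the PyTorch backend, so I cannot convert it to the TensorFlow backend and use it for MD in LAMMPS.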
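The reasoning above (last thermo rows at steps 477 and 478, but "Last command: run 500") can be checked mechanically on a log. The following is a minimal sketch, not part of DeePMD-kit or LAMMPS: the function name `last_thermo_step` is hypothetical, and it assumes thermo rows are lines whose first field is an integer step followed only by numeric columns, as in the excerpt quoted in this report.

```python
import re


def last_thermo_step(log_text):
    """Return (last thermo step seen, steps requested by the last 'run N')."""
    last_step = None
    requested = None
    for line in log_text.splitlines():
        fields = line.split()
        # Assumption: a thermo row starts with an integer step and every
        # remaining column parses as a number.
        if len(fields) > 1 and fields[0].isdigit():
            try:
                for value in fields[1:]:
                    float(value)
            except ValueError:
                pass
            else:
                last_step = int(fields[0])
        match = re.match(r"\s*Last command:\s*run\s+(\d+)", line)
        if match:
            requested = int(match.group(1))
    return last_step, requested


# Lines copied from the first error output in this report.
excerpt = """\
477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Last command: run 500
"""

step, requested = last_thermo_step(excerpt)
if step is not None and requested is not None and step < requested:
    print(f"run stopped at step {step} of {requested}")
```

A last step smaller than the requested run length indicates that the stage aborted mid-run rather than during setup, which is what distinguishes the first error from the second one.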

DeePMD-kit Version

3.1.0

Backend and its version

PyTorch v2.6.0-gUnknown

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

LAMMPS run command: mpirun -n 192 /data/home/liuchang/deepmd-kit/bin/lmp -i in.tin -l lmp.log
LAMMPS input file in.tin:
#######################initialization
units metal

dimension 3

boundary p p p

atom_style atomic

#######################system definition

read_data structure.lmp

#######################simulation settings

# set potential

pair_style deepmd graph.pb

pair_coeff * * C B

neighbor 3.5 nsq

neigh_modify every 1 delay 0 check yes

# thermo output

thermo 1

thermo_style custom step temp pe etotal press vol pxx pyy pzz lx ly lz xy xz yz

thermo_modify flush yes lost error line one

timestep 0.001

#initial relax

dump RE all custom 1 int.traj id type x y z

velocity all create 10.0 32546 dist gaussian mom yes rot yes

fix MD all npt temp 10 10 0.1 x 0 1000000 1 y 0 1000000 1 z 0 1000000 1 #xy 0.0 0.0 1.0 xz 0.0 0.0 1.0 yz 0.0 0.0 1.0 #iso 0 0 1.0

run 500

unfix MD

undump RE

#load
change_box all triclinic

reset_timestep 0

fix MD1 all npt temp 10 10 0.1 x 1000000 1000000 1 y 1000000 1000000 1 z 1000000 1000000 1

fix MD2 all deform 1 xz erate 0.002 units lattice remap x

dump dump all custom 50 load.traj id type x y z

# The 0.000005 in strain and the 0.005 in "erate 0.005" are related: the latter is 1000 times the former, matching the fact that a picosecond is 1000 femtoseconds
variable strain equal step*0.000002
variable p1 equal "v_strain"

variable px equal "-pxx/10000"
variable py equal "-pyy/10000"
variable pz equal "-pzz/10000"
variable psyz equal "-pyz/10000"
variable psxz equal "-pxz/10000"
variable psxy equal "-pxy/10000"

fix out1 all print 50 "${p1} ${px} ${py} ${pz} ${psyz} ${psxz} ${psxy}" file Stress.dat screen no

run 2000
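The comment on the strain variable in the input above describes a factor of 1000 between the erate value and the per-step strain. A short worked check, using the values that actually appear in the input (erate 0.002, timestep 0.001, run 2000) rather than the 0.005 and 0.000005 mentioned in the comment, is sketched below. It assumes metal units (time in ps) and that the engineering shear strain is the erate multiplied by elapsed time; the constant and function names are illustrative only.

```python
import math

TIMESTEP_PS = 0.001    # "timestep 0.001" with "units metal", i.e. 1 fs
ERATE_PER_PS = 0.002   # "fix MD2 all deform 1 xz erate 0.002 ..."
RUN_STEPS = 2000       # "run 2000"

# Strain accumulated in one step: strain rate (1/ps) times step length (ps).
strain_per_step = ERATE_PER_PS * TIMESTEP_PS


def strain_at(step):
    """Engineering shear strain after `step` steps, as in 'step*0.000002'."""
    return step * strain_per_step


# The coefficient in "variable strain equal step*0.000002" should match.
assert math.isclose(strain_per_step, 0.000002)

print(f"strain per step: {strain_per_step:.6f}")
print(f"strain at end of run: {strain_at(RUN_STEPS):.4f}")
```

Under these assumptions the coefficient in the strain variable is consistent with the erate and timestep in the input, so the first column of Stress.dat tracks the applied shear strain.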

Steps to Reproduce

DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.

Further Information, Files, and Links

No response
