Description
Bug summary
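I first trained the model graph.pth on a GPU on cluster A, then used that model to run LAMMPS on CPU nodes of cluster B. On both clusters, the GPU build of DeePMD-kit 3.1.0 was installed from the offline package. The LAMMPS MD run failed. My MD run has two stages: the first equilibrates the system and the second applies shear. After many further tests, I found two kinds of errors in total. The first occurs partway through the first MD stage, with the following error message:
“ 477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0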
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_6 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
cc_ext, _40, fp, ap, input_prec, = _39
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_41 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
node_ebd_inp = torch.slice(_2, 2)
repflows = self.repflows
_3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd, edge_ebd, h2, rot_mat, sw, = _3
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
_72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
_73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
if has_spin:
Traceback of TorchScript, original code (most recent call last):
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
comm_dict: Optional[dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic
ext_atom_mask = self.make_atom_mask(extended_atype)
ret_dict = self.forward_atomic(
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
torch.where(ext_atom_mask, extended_atype, 0),
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 238, in forward_atomic
if self.do_grad_r() or self.do_grad_c():
extended_coord.requires_grad_(True)
descriptor, rot_mat, g2, h2, sw = self.descriptor(
~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa3.py", line 498, in forward
node_ebd_inp = node_ebd_ext[:, :nloc, :]
# repflows
node_ebd, edge_ebd, h2, rot_mat, sw = self.repflows(
~~~~~~~~~~~~~ <--- HERE
nlist,
extended_coord,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/repflows.py", line 599, in forward
assert "recv_num" in comm_dict
assert "communicator" in comm_dict
ret = torch.ops.deepmd.border_op(
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
comm_dict["send_list"],
comm_dict["send_proc"],
RuntimeError: index out of range in self (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1749640039377/work/source/lmp/pair_deepmd.cpp:253)
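Last command: run 500”
(The thermo rows for steps 477 and 478 at the top of this log, together with the final "run 500", show that the job terminated partway through the run.) The second error appears when, starting from the first case, I increase the skin value of the neighbor command. The first MD stage then runs to completion, but the job fails just as the second MD stage is about to start, with the following error message:
“Setting up Verlet run ...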
Unit style : metal
Current step : 0
Time step : 0.001
ERROR on proc 141: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_6 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
cc_ext, _40, fp, ap, input_prec, = _39
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_41 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
node_ebd_inp = torch.slice(_2, 2)
repflows = self.repflows
_3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd, edge_ebd, h2, rot_mat, sw, = _3
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
_72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
_73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
if has_spin:
Traceback of TorchScript, original code (most recent call last):
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
comm_dict: Optional[dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic
ext_atom_mask = self.make_atom_mask(extended_atype)
ret_dict = self.forward_atomic(
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
torch.where(ext_atom_mask, extended_atype, 0),
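File "/public/ERROR on proc 162: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.”

As mentioned above, I ran several tests, and all of them failed:
1. Descriptor: with the GPU build of 3.1.0 and the PyTorch backend, models trained with dpa3 and with se_e2_a both failed during MD.
2. Training the model on a GPU on cluster A and then running LAMMPS on the same GPU also failed.
3. With the same model, running MD with LAMMPS from either the CPU build or the GPU build of DeePMD-kit 3.1.0 failed.

After many attempts, one LAMMPS run did succeed. I trained graph.pth with the PyTorch backend and the se_e2_a descriptor, and graph.pth failed when used for MD in LAMMPS. I then converted the model from the PyTorch backend to the TensorFlow backend with the command dp convert-backend graph.pth graph.pb and passed the result to LAMMPS, and this time the MD run completed. This suggests that .pth models have a problem during MD simulation in LAMMPS. The dpa3 descriptor is clearly more accurate: on my dataset, dpa3 is roughly twice as accurate as se_e2_a. However, dpa3 only supports the PyTorch backend, so I cannot convert it to the TensorFlow backend and run MD with it in LAMMPS.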
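To make the "terminated partway" reading of the first log concrete, here is a minimal Python sketch that compares the last thermo step printed before the error with the step count requested by the last run command. The helper names `last_thermo_step` and `requested_steps` are hypothetical, the sample text is an abbreviated copy of the log above, and the row detection assumes the all-numeric thermo format used in this input file.

```python
import re


def last_thermo_step(log_text):
    """Return the step of the last all-numeric thermo row before the first ERROR line."""
    last = None
    for line in log_text.splitlines():
        if line.lstrip().startswith("ERROR"):
            break
        fields = line.split()
        # Assumption: a thermo row starts with an integer step followed by numbers only.
        if len(fields) >= 2 and fields[0].isdigit():
            try:
                [float(x) for x in fields[1:]]
            except ValueError:
                continue
            last = int(fields[0])
    return last


def requested_steps(log_text):
    """Return N from the 'Last command: run N' line, or None if it is absent."""
    match = re.search(r"Last command: run (\d+)", log_text)
    return int(match.group(1)) if match else None


# Abbreviated sample taken from the first error log (traceback omitted).
sample = """ 477 14.671651 -1840.8705 -1840.4779 951095.46 1007.5899 957687.5 947820.98 947777.91 31.101126 2.3915695 13.546425 0 0 0
478 14.848564 -1840.4616 -1840.0643 956489.71 1006.907 963265.36 953626.93 952576.83 31.093733 2.3909153 13.544167 0 0 0
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error
Last command: run 500"""

done = last_thermo_step(sample)
wanted = requested_steps(sample)
print(f"stopped after step {done} of {wanted}")
```

Running the same check on the second log would find no thermo row before the error, which matches the failure occurring during setup of the second stage.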
DeePMD-kit Version
3.1.0
Backend and its version
PyTorch v2.6.0-gUnknown
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
LAMMPS run command: mpirun -n 192 /data/home/liuchang/deepmd-kit/bin/lmp -i in.tin -l lmp.log
LAMMPS input file in.tin:
#######################initialization
units metal
dimension 3
boundary p p p
atom_style atomic
#######################system definition
read_data structure.lmp
#######################simulation settings
# set potential
pair_style deepmd graph.pb
pair_coeff * * C B
neighbor 3.5 nsq
neigh_modify every 1 delay 0 check yes
# thermo output
thermo 1
thermo_style custom step temp pe etotal press vol pxx pyy pzz lx ly lz xy xz yz
thermo_modify flush yes lost error line one
timestep 0.001
#initial relax
dump RE all custom 1 int.traj id type x y z
velocity all create 10.0 32546 dist gaussian mom yes rot yes
fix MD all npt temp 10 10 0.1 x 0 1000000 1 y 0 1000000 1 z 0 1000000 1 #xy 0.0 0.0 1.0 xz 0.0 0.0 1.0 yz 0.0 0.0 1.0 #iso 0 0 1.0
run 500
unfix MD
undump RE
#load
change_box all triclinic
reset_timestep 0
fix MD1 all npt temp 10 10 0.1 x 1000000 1000000 1 y 1000000 1000000 1 z 1000000 1000000 1
fix MD2 all deform 1 xz erate 0.002 units lattice remap x
dump dump all custom 50 load.traj id type x y z
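# Relation between the strain factor 0.000005 and the 0.005 in "erate 0.005": the latter is 1000 times the former, matching a picosecond being 1000 times a femtosecond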
variable strain equal step*0.000002
variable p1 equal "v_strain"
variable px equal "-pxx/10000"
variable py equal "-pyy/10000"
variable pz equal "-pzz/10000"
variable psyz equal "-pyz/10000"
variable psxz equal "-pxz/10000"
variable psxy equal "-pxy/10000"
fix out1 all print 50 "${p1} ${px} ${py} ${pz} ${psyz} ${psxz} ${psxy}" file Stress.dat screen no
run 2000
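The per-step strain factor and the stress scaling in the input above can be checked with a short Python sketch. It assumes metal units (time in ps, pressure in bar) and takes the values from this file: timestep 0.001, erate 0.002, and the division by 10000 used in the stress variables. The function names are hypothetical.

```python
import math

TIMESTEP_PS = 0.001    # "timestep 0.001" (metal units, picoseconds)
ERATE_PER_PS = 0.002   # "fix MD2 all deform 1 xz erate 0.002 ..."
BAR_PER_GPA = 10000.0  # 1 GPa = 10^4 bar

# Engineering strain accumulated per MD step: strain rate times step length.
STRAIN_PER_STEP = ERATE_PER_PS * TIMESTEP_PS


def strain_at(step):
    """Strain after `step` MD steps, as in 'variable strain equal step*0.000002'."""
    return step * STRAIN_PER_STEP


def stress_gpa(pressure_bar):
    """Mirror 'variable px equal "-pxx/10000"': flip the sign and convert bar to GPa."""
    return -pressure_bar / BAR_PER_GPA


print(f"strain per step: {STRAIN_PER_STEP:.1e}")
print(f"strain after 2000 steps: {strain_at(2000):.4f}")
```

With these values the per-step factor comes out as 2e-6, which agrees with the 0.000002 in the strain variable, and the 2000-step shear stage reaches a total strain of about 0.004.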
Steps to Reproduce
DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Further Information, Files, and Links
No response