[BUG] _DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.

### Bug summary

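I first trained the model graph.pth on a GPU on cluster A, then used that model to run LAMMPS on the CPU nodes of cluster B. Both clusters have the GPU build of DeePMD-kit 3.1.0, installed from offline packages. The run failed during the LAMMPS MD. My MD has two stages: the first finds the equilibrium state and the second applies shear. I then ran several more tests and found two kinds of error in total. The first occurs partway through the first MD stage, with the following error: “
       477   14.671651     -1840.8705     -1840.4779      951095.46      1007.5899      957687.5       947820.98      947777.91      31.101126      2.3915695      13.546425      0              0              0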
       478   14.848564     -1840.4616     -1840.0643      956489.71      1006.907       963265.36      953626.93      952576.83      31.093733      2.3909153      13.544167      0              0              0            
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
    comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
    _6 = (self).need_sorted_nlist_for_lower()
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _7 = (self).get_fitting_net()
    model_predict = annotate(Dict[str, Tensor], {})
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
    cc_ext, _40, fp, ap, input_prec, = _39
    atomic_model = self.atomic_model
    atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _41 = (self).atomic_output_def()
    training = self.training
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
    ext_atom_mask = (self).make_atom_mask(extended_atype, )
    _3 = torch.where(ext_atom_mask, extended_atype, 0)
    ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
    _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
      pass
    descriptor = self.descriptor
    _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _16
    enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
  File "code/__torch__/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
    node_ebd_inp = torch.slice(_2, 2)
    repflows = self.repflows
    _3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
          ~~~~~~~~~~~~~~~~~ <--- HERE
    node_ebd, edge_ebd, h2, rot_mat, sw, = _3
    concat_output_tebd = self.concat_output_tebd
  File "code/__torch__/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
      _72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
      _73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
      ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
            ~~~~~~~~~~~~~~~~~~~~ <--- HERE
      node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
      if has_spin:

Traceback of TorchScript, original code (most recent call last):
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
        comm_dict: Optional[dict[str, torch.Tensor]] = None,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.atomic_model.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic
    
        ext_atom_mask = self.make_atom_mask(extended_atype)
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            torch.where(ext_atom_mask, extended_atype, 0),
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 238, in forward_atomic
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)
        descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                          ~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa3.py", line 498, in forward
        node_ebd_inp = node_ebd_ext[:, :nloc, :]
        # repflows
        node_ebd, edge_ebd, h2, rot_mat, sw = self.repflows(
                                              ~~~~~~~~~~~~~ <--- HERE
            nlist,
            extended_coord,
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/repflows.py", line 599, in forward
                assert "recv_num" in comm_dict
                assert "communicator" in comm_dict
                ret = torch.ops.deepmd.border_op(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                    comm_dict["send_list"],
                    comm_dict["send_proc"],
RuntimeError: index out of range in self (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1749640039377/work/source/lmp/pair_deepmd.cpp:253)
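Last command: run             500” (steps 477 and 478 in the first two lines of this output, together with the final run 500, show that the job terminated partway through the run).

The second error appears when, starting from the setup of the first case, I increase the skin value of the neighbor command. The first MD stage then runs to completion, but the run fails just as the second MD stage is about to start, with the following error: “
Setting up Verlet run ...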
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001
ERROR on proc 141: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 66, in forward_lower
    comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
    _6 = (self).need_sorted_nlist_for_lower()
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _6, )
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _7 = (self).get_fitting_net()
    model_predict = annotate(Dict[str, Tensor], {})
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 232, in forward_common_lower
    cc_ext, _40, fp, ap, input_prec, = _39
    atomic_model = self.atomic_model
    atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _41 = (self).atomic_output_def()
    training = self.training
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 53, in forward_common_atomic
    ext_atom_mask = (self).make_atom_mask(extended_atype, )
    _3 = torch.where(ext_atom_mask, extended_atype, 0)
    ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
    _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 96, in forward_atomic
      pass
    descriptor = self.descriptor
    _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _16
    enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
  File "code/__torch__/deepmd/pt/model/descriptor/dpa3.py", line 53, in forward
    node_ebd_inp = torch.slice(_2, 2)
    repflows = self.repflows
    _3 = (repflows).forward(nlist, extended_coord0, extended_atype, node_ebd_ext, mapping, comm_dict, )
          ~~~~~~~~~~~~~~~~~ <--- HERE
    node_ebd, edge_ebd, h2, rot_mat, sw, = _3
    concat_output_tebd = self.concat_output_tebd
  File "code/__torch__/deepmd/pt/model/descriptor/repflows.py", line 326, in forward
      _72 = torch.tensor(real_nloc, dtype=3, device=torch.device("cpu"))
      _73 = torch.tensor(torch.sub(real_nall, real_nloc), dtype=3, device=torch.device("cpu"))
      ret = ops.deepmd.border_op(_66, _67, _68, _69, _70, node_ebd0, _71, _72, _73)
            ~~~~~~~~~~~~~~~~~~~~ <--- HERE
      node_ebd_ext1 = torch.unsqueeze(ret[0], 0)
      if has_spin:

Traceback of TorchScript, original code (most recent call last):
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/ener_model.py", line 119, in forward_lower
        comm_dict: Optional[dict[str, torch.Tensor]] = None,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/make_model.py", line 287, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.atomic_model.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/public/home/xuyouwei/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 249, in forward_common_atomic
    
        ext_atom_mask = self.make_atom_mask(extended_atype)
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            torch.where(ext_atom_mask, extended_atype, 0),
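  File "/public/ERROR on proc 162: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.”.

As mentioned above, I ran several tests, and all of them failed:

1. Descriptor: with the GPU build of 3.1.0 and the PyTorch backend, models trained with dpa3 and with se_e2_a both failed when running MD.
2. Training the model on a GPU on cluster A and then running LAMMPS on the same GPU also failed.
3. With the same model, running MD with either the CPU build of DeePMD-kit 3.1.0 or the GPU LAMMPS failed in both cases.

After many attempts, one LAMMPS run did succeed. I trained a model graph.pth with the PyTorch backend and the se_e2_a descriptor, and graph.pth failed when used for MD in LAMMPS. I then converted that model from the PyTorch backend to the TensorFlow backend with the command dp convert-backend graph.pth graph.pb and passed it to LAMMPS for MD, and this time the run succeeded. This seems to indicate that .pth models have a problem during MD simulation in LAMMPS.

The dpa3 descriptor currently does give better accuracy: on my dataset, dpa3 is about twice as accurate as se_e2_a. However, dpa3 only supports the PyTorch backend, so I cannot convert it to the TensorFlow backend and pass it to LAMMPS for MD.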
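Because the output of several MPI ranks is interleaved in these logs, it can help to extract the last completed thermo step and the ranks that reported an error. The following is a minimal sketch, not part of DeePMD-kit or LAMMPS: the function name `summarize_log` and both regular expressions are hypothetical, and it assumes that thermo lines start with the step number (as with the thermo_style custom step ... setting used here) and that errors appear as "ERROR on proc N:".

```python
import re

# Assumption: LAMMPS reports per-rank failures as "ERROR on proc <rank>:".
ERROR_RE = re.compile(r"ERROR on proc (\d+):")
# Assumption: a thermo line starts with an integer step followed by a number.
THERMO_RE = re.compile(r"^\s*(\d+)\s+[-+]?\d")


def summarize_log(text):
    """Return (last thermo step seen or None, sorted list of failing ranks)."""
    last_step = None
    ranks = set()
    for line in text.splitlines():
        # findall also catches error headers fused into another rank's line.
        ranks.update(int(r) for r in ERROR_RE.findall(line))
        match = THERMO_RE.match(line)
        if match:
            last_step = int(match.group(1))
    return last_step, sorted(ranks)


sample = """\
       477   14.671651     -1840.8705     -1840.4779
       478   14.848564     -1840.4616     -1840.0643
ERROR on proc 34: DeePMD-kit C API Error: DeePMD-kit Error
Last command: run             500
"""

print(summarize_log(sample))  # -> (478, [34])
```

Applied to the first log, this reports step 478 as the last completed step out of the requested 500, which matches the observation that the job stopped partway through.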

### DeePMD-kit Version

3.1.0

### Backend and its version

PyTorch  v2.6.0-gUnknown

### How did you download the software?

Offline packages

### Input Files, Running Commands, Error Log, etc.

LAMMPS run command: mpirun -n 192 /data/home/liuchang/deepmd-kit/bin/lmp -i in.tin -l lmp.log
LAMMPS input file in.tin:
#######################initialization
units           metal

dimension       3

boundary        p p p

atom_style      atomic

#######################system definition

read_data        structure.lmp   

#######################simulation settings
# set potential

pair_style      deepmd graph.pb

pair_coeff      * * C B

neighbor        3.5 nsq

neigh_modify    every 1 delay 0 check yes

# thermo output

thermo          1

thermo_style    custom step temp pe etotal press vol pxx pyy pzz lx ly lz xy xz yz

thermo_modify   flush yes lost error line one

timestep        0.001 

#initial relax

dump            RE all custom 1 int.traj id type x y z

velocity        all create 10.0 32546 dist gaussian mom yes rot yes

fix             MD all npt temp 10 10 0.1 x 0 1000000 1 y 0 1000000 1 z 0 1000000 1 #xy 0.0 0.0 1.0 xz 0.0 0.0 1.0 yz 0.0 0.0 1.0 #iso 0 0 1.0

run             500

unfix           MD

undump		    RE

#load
change_box       all triclinic

reset_timestep  0

fix             MD1 all npt temp 10 10 0.1 x 1000000 1000000 1 y 1000000 1000000 1 z 1000000 1000000 1

fix             MD2 all deform 1 xz erate 0.002 units lattice remap x

dump            dump all custom 50 load.traj id type x y z

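#Relation between the strain factor 0.000005 and the erate value 0.005: the latter is 1000 times the former, matching the factor of 1000 between picoseconds and femtoseconds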
variable	strain equal step*0.000002
variable        p1 equal "v_strain"

variable        px equal "-pxx/10000"
variable        py equal "-pyy/10000"
variable        pz equal "-pzz/10000"
variable        psyz equal "-pyz/10000"
variable        psxz equal "-pxz/10000"
variable        psxy equal "-pxy/10000"

fix             out1 all print 50 "${p1} ${px} ${py} ${pz} ${psyz} ${psxz} ${psxy}" file Stress.dat screen no

run             2000
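
The per-step factor in the strain variable can be checked against the deformation settings. The sketch below is only an arithmetic check under stated assumptions: time is in picoseconds (units metal), the engineering shear strain grows as erate times elapsed time, and the function name `strain_per_step` is hypothetical.

```python
import math


def strain_per_step(erate_per_ps, timestep_ps):
    """Engineering strain accumulated in one MD step (assumes strain = erate * time)."""
    return erate_per_ps * timestep_ps


erate = 0.002      # from: fix MD2 all deform 1 xz erate 0.002 ...
timestep = 0.001   # from: timestep 0.001 (ps in metal units)
n_steps = 2000     # from: run 2000

per_step = strain_per_step(erate, timestep)
total_strain = per_step * n_steps

# The input file uses "variable strain equal step*0.000002".
assert math.isclose(per_step, 0.000002)
assert math.isclose(total_strain, 0.004)
print(per_step, total_strain)
```

Under these assumptions the factor 0.000002 in the strain variable is consistent with erate 0.002 and a 0.001 ps timestep, and the second stage reaches a total shear strain of about 0.004.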

### Steps to Reproduce

DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.

### Further Information, Files, and Links

_No response_
