============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm, configfile: ../../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_all.py 
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:./dvm_matmul_allreduce_log/worker_0.log. Environment variable [RANK_ID=0] is exported.
Start worker process with rank id:1, log file:./dvm_matmul_allreduce_log/worker_1.log. Environment variable [RANK_ID=1] is exported.
Start worker process with rank id:2, log file:./dvm_matmul_allreduce_log/worker_2.log. Environment variable [RANK_ID=2] is exported.
Start worker process with rank id:3, log file:./dvm_matmul_allreduce_log/worker_3.log. Environment variable [RANK_ID=3] is exported.
[WARNING] ME(1227474:281473664413376,MainProcess):2025-07-15-11:21:07.575.951 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
[WARNING] ME(1231866:281473342435008,MainProcess):2025-07-15-11:21:12.363.648 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:12.366.186 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:33714, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:12.366.264 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:8118, retry to reconnect(1/1)...
[WARNING] ME(1231966:281473205661376,MainProcess):2025-07-15-11:21:12.487.058 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:12.489.857 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:33730, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231966,ffff2f68efa0,python):2025-07-15-11:21:12.489.857 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33730 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:12.489.940 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] ME(1232017:281472828894912,MainProcess):2025-07-15-11:21:12.531.734 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1232017,ffff18f3efa0,python):2025-07-15-11:21:12.534.348 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33742 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:12.534.347 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:33742, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:12.534.574 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:33748, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1232017,ffff19f5efa0,python):2025-07-15-11:21:12.534.607 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33748 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:12.534.619 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] ME(1231915:281473691545280,MainProcess):2025-07-15-11:21:12.635.040 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:12.637.900 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:33754, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231915,ffff47ffefa0,python):2025-07-15-11:21:12.637.901 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33754 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:12.637.982 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:12.866.493 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:12.866.572 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:12.990.178 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:33764, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:12.990.222 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1231966,ffff306aefa0,python):2025-07-15-11:21:12.990.220 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33764 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:13.035.309 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:13.138.234 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:33780, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231915,ffff4d62efa0,python):2025-07-15-11:21:13.138.266 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33780 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:13.138.285 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:13.367.012 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:33796, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231866,ffff3891efa0,python):2025-07-15-11:21:13.367.012 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33796 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:13.367.086 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:13.490.780 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:13.535.420 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:13.638.866 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:13.867.524 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:33804, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1231866,ffff337eefa0,python):2025-07-15-11:21:13.867.527 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:33804 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:13.867.598 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:13.990.893 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:14.035.524 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:14.139.004 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:14.368.541 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:14.491.000 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:14.535.632 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:14.639.198 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:14.868.820 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:14.868.900 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:14.991.145 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:14.991.187 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:15.035.758 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:15.035.801 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:15.139.422 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:15.139.507 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:16.647.253 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1231866,ffff9e96eec0,python):2025-07-15-11:21:16.647.641 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1231866,fffee0baefa0,python):2025-07-15-11:21:16.647.916 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:8118, node_rank:2130706433, total_rank_size:4, local_rank_size4
[WARNING] HCCL_ADPT(1231866,fffee0baefa0,python):2025-07-15-11:21:16.648.012 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1231866,fffee0baefa0,python):2025-07-15-11:21:16.648.049 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1231866,fffee0baefa0,python):2025-07-15-11:21:16.648.078 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1231866,fffee0baefa0,python):2025-07-15-11:21:16.655.934 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [0] physical device id: 0
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:16.735.986 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1231966,ffff966feec0,python):2025-07-15-11:21:16.736.299 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1231966,fffe93ffefa0,python):2025-07-15-11:21:16.736.629 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:8118, node_rank:2130706433, total_rank_size:4, local_rank_size4
[WARNING] HCCL_ADPT(1231966,fffe93ffefa0,python):2025-07-15-11:21:16.736.748 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1231966,fffe93ffefa0,python):2025-07-15-11:21:16.736.784 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1231966,fffe93ffefa0,python):2025-07-15-11:21:16.736.812 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1231966,fffe93ffefa0,python):2025-07-15-11:21:16.737.425 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [2] physical device id: 2
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:16.797.432 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1232017,ffff7ffaeec0,python):2025-07-15-11:21:16.797.743 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1232017,fffec187efa0,python):2025-07-15-11:21:16.797.986 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:8118, node_rank:2130706433, total_rank_size:4, local_rank_size4
[WARNING] HCCL_ADPT(1232017,fffec187efa0,python):2025-07-15-11:21:16.798.093 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1232017,fffec187efa0,python):2025-07-15-11:21:16.798.131 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1232017,fffec187efa0,python):2025-07-15-11:21:16.798.160 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1232017,fffec187efa0,python):2025-07-15-11:21:16.798.729 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [3] physical device id: 3
Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm/test_dvm_matmul_allreduce.py", line 27, in <module>
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

[WARNING] DEVICE(1231915,ffffb365eec0,python):2025-07-15-11:23:58.712.212 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed
[ERROR] ME(1227474:281473664413376,MainProcess):2025-07-15-11:24:00.407.492 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 1231915 exit with exception.
Error code: 1.
[WARNING] ME(1227474:281473664413376,MainProcess):2025-07-15-11:24:00.407.817 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers.
[ERROR] ME(1227474:281473664413376,MainProcess):2025-07-15-11:24:36.512.26 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 1231845 exit with exception.
[ERROR] ME(1227474:281473664413376,MainProcess):2025-07-15-11:24:36.522.30 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['1']
./dvm_matmul_allreduce_log/worker_1.log-16-[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:13.638.866 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
./dvm_matmul_allreduce_log/worker_1.log-17-[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:14.139.004 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
./dvm_matmul_allreduce_log/worker_1.log-18-[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:14.639.198 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
./dvm_matmul_allreduce_log/worker_1.log-19-[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:15.139.422 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
./dvm_matmul_allreduce_log/worker_1.log-20-[WARNING] DISTRIBUTED(1231915,ffffb365eec0,python):2025-07-15-11:21:15.139.507 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
./dvm_matmul_allreduce_log/worker_1.log:21:Traceback (most recent call last):
./dvm_matmul_allreduce_log/worker_1.log-22-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm/test_dvm_matmul_allreduce.py", line 27, in <module>
./dvm_matmul_allreduce_log/worker_1.log-23-    init()
./dvm_matmul_allreduce_log/worker_1.log-24-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
./dvm_matmul_allreduce_log/worker_1.log-25-    init_hccl()
./dvm_matmul_allreduce_log/worker_1.log:26:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
./dvm_matmul_allreduce_log/worker_1.log-27-
./dvm_matmul_allreduce_log/worker_1.log-28----------------------------------------------------
./dvm_matmul_allreduce_log/worker_1.log-29-- C++ Call Stack: (For framework developers)
./dvm_matmul_allreduce_log/worker_1.log-30----------------------------------------------------
./dvm_matmul_allreduce_log/worker_1.log-31-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice
--
./dvm_matmul_allreduce_log/scheduler.log-96-[WARNING] DISTRIBUTED(1231845,ffffa6f9eec0,python):2025-07-15-11:24:19.391.888 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
./dvm_matmul_allreduce_log/scheduler.log-97-[WARNING] DISTRIBUTED(1231845,ffffa6f9eec0,python):2025-07-15-11:24:24.392.077 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
./dvm_matmul_allreduce_log/scheduler.log-98-[WARNING] DISTRIBUTED(1231845,ffffa6f9eec0,python):2025-07-15-11:24:24.392.163 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
./dvm_matmul_allreduce_log/scheduler.log-99-[WARNING] DISTRIBUTED(1231845,ffffa6f9eec0,python):2025-07-15-11:24:29.392.365 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
./dvm_matmul_allreduce_log/scheduler.log-100-[WARNING] DISTRIBUTED(1231845,ffffa6f9eec0,python):2025-07-15-11:24:29.392.443 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
./dvm_matmul_allreduce_log/scheduler.log:101:[ERROR] DISTRIBUTED(1231845,ffff3b7eefa0,python):2025-07-15-11:24:29.408.830 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 1 is timed out. It may exit with exception, please check this node's log.
./dvm_matmul_allreduce_log/scheduler.log:102:[ERROR] DISTRIBUTED(1231845,ffffa6f9eec0,python):2025-07-15-11:24:34.392.675 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
./dvm_matmul_allreduce_log/scheduler.log:103:Traceback (most recent call last):
./dvm_matmul_allreduce_log/scheduler.log-104-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm/test_dvm_matmul_allreduce.py", line 27, in <module>
./dvm_matmul_allreduce_log/scheduler.log-105-    init()
./dvm_matmul_allreduce_log/scheduler.log-106-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
./dvm_matmul_allreduce_log/scheduler.log-107-    init_cluster()
./dvm_matmul_allreduce_log/scheduler.log:108:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
./dvm_matmul_allreduce_log/scheduler.log-109-
./dvm_matmul_allreduce_log/scheduler.log-110----------------------------------------------------
./dvm_matmul_allreduce_log/scheduler.log-111-- C++ Call Stack: (For framework developers)
./dvm_matmul_allreduce_log/scheduler.log-112----------------------------------------------------
./dvm_matmul_allreduce_log/scheduler.log-113-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: ./dvm_matmul_allreduce_log.
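Note: the failure chain above reads bottom-up. Worker 1 dies inside init() because aclrtSetDevice rejects device id 1 (ret[507033]) even though 8 devices are visible; the scheduler then marks node 1 as timed out, and msrun turns that into a non-zero exit. A minimal, hypothetical sketch for reproducing the worker-side init in isolation follows; the explicit mindspore.set_device() call is the replacement the deprecation warning earlier in this log recommends, but its exact signature and the RANK_ID-to-device mapping here are assumptions, not taken from the test source:

    # Hypothetical standalone repro of the failing worker init path.
    # Assumptions: mindspore.set_device(device_target, device_id) signature,
    # and that msrun's exported RANK_ID maps 1:1 to the local device id.
    import os
    import mindspore as ms
    from mindspore.communication import init

    device_id = int(os.environ.get("RANK_ID", "0"))  # msrun exports RANK_ID per worker
    ms.set_device("Ascend", device_id)               # pin the device before HCCL init
    init()  # calls init_hccl(); this is where aclrtSetDevice ret[507033] surfaced

Running this under each RANK_ID would show whether device id 1 alone is rejected (e.g. the card is held by a stale process), which is what "please check if device id is valid" points at.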
F
=================================== FAILURES ===================================
__________________________ test_dvm_matmul_allreduce ___________________________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level1', card_mark='allcards', essential_mark='essential')
    def test_dvm_matmul_allreduce():
        """
        Feature: DVM operator test.
        Description: msrun dvm matmul + allreduce 4P case.
        Expectation: success
        """
        return_code = os.system(
            "MS_DEV_GRAPH_KERNEL_FLAGS='--enable_cluster_ops=MatMul,AllReduce' "\
            "msrun --worker_num=4 --local_worker_num=4 --join=True --log_dir=./dvm_matmul_allreduce_log "\
            "python test_dvm_matmul_allreduce.py"
        )
>       assert return_code == 0
E       assert 256 == 0

test_all.py:58: AssertionError
=========================== short test summary info ============================
FAILED test_all.py::test_dvm_matmul_allreduce - assert 256 == 0
======================== 1 failed in 215.62s (0:03:35) =========================
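Why the assertion compares 256 to 0: on POSIX, os.system() returns the raw wait status rather than the child's exit code, so msrun exiting with code 1 surfaces as 256 (the exit code shifted left by 8 bits). A small sketch of decoding it; "exit 1" stands in for the real msrun command line:

    import os

    status = os.system("exit 1")              # stand-in for the msrun invocation
    print(status)                             # 256 on POSIX: exit_code << 8
    print(os.waitstatus_to_exitcode(status))  # 1 (Python >= 3.9)
    print(os.WEXITSTATUS(status))             # 1, the POSIX-only equivalent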