============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm, configfile: ../../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_all.py 
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:./dvm_allreduce_log/worker_0.log. Environment variable [RANK_ID=0] is exported.
Start worker process with rank id:1, log file:./dvm_allreduce_log/worker_1.log. Environment variable [RANK_ID=1] is exported.
Start worker process with rank id:2, log file:./dvm_allreduce_log/worker_2.log. Environment variable [RANK_ID=2] is exported.
Start worker process with rank id:3, log file:./dvm_allreduce_log/worker_3.log. Environment variable [RANK_ID=3] is exported.
[WARNING] ME(1178852:281473819209408,MainProcess):2025-07-15-11:17:00.203.916 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
[WARNING] ME(1179105:281473774382784,MainProcess):2025-07-15-11:17:04.818.070 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:04.820.689 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:42646, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:04.820.777 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:8118, retry to reconnect(1/1)...
[WARNING] ME(1179129:281473788538560,MainProcess):2025-07-15-11:17:04.923.866 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:04.926.405 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:42658, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:04.926.510 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1179129,ffff5226efa0,python):2025-07-15-11:17:04.926.450 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42658 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] ME(1179143:281473558376128,MainProcess):2025-07-15-11:17:04.990.960 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1179143,ffff3fffefa0,python):2025-07-15-11:17:04.993.675 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42674 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:04.993.672 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:42674, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:04.993.880 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:42684, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179143,ffff4572efa0,python):2025-07-15-11:17:04.993.914 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42684 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:04.993.926 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] ME(1179150:281473865019072,MainProcess):2025-07-15-11:17:05.106.347 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1179150,ffff56b6efa0,python):2025-07-15-11:17:05.108.935 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42694 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:05.108.960 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 19 source: 127.0.0.1:42694, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:05.109.163 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:42698, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:05.109.218 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1179150,ffff57b8efa0,python):2025-07-15-11:17:05.109.219 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42698 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:05.320.892 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:05.320.935 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:05.426.743 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:42712, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179129,ffff5328efa0,python):2025-07-15-11:17:05.426.775 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42712 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:05.426.791 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:05.494.721 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:05.609.776 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:05.821.220 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 20 source: 127.0.0.1:42726, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179105,ffff5251efa0,python):2025-07-15-11:17:05.821.254 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42726 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:05.821.265 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:05.927.437 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:05.994.836 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:06.109.985 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:06.321.506 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:42740, destination: 127.0.0.1:8118
[WARNING] DISTRIBUTED(1179105,ffff514fefa0,python):2025-07-15-11:17:06.321.541 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:42740 to 127.0.0.1:8118 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:06.321.548 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:06.427.647 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:06.495.021 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:06.610.160 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:06.822.296 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:06.927.781 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:06.995.244 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:07.110.276 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:07.322.448 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:07.322.496 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:07.427.931 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:07.427.978 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:07.495.512 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:07.495.600 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:07.610.451 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:07.610.514 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:09.111.624 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1179105,ffffb855eec0,python):2025-07-15-11:17:09.111.977 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1179105,ffff0480efa0,python):2025-07-15-11:17:09.112.231 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:8118, node_rank:2130706433, total_rank_size:4, local_rank_size4
[WARNING] HCCL_ADPT(1179105,ffff0480efa0,python):2025-07-15-11:17:09.112.330 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1179105,ffff0480efa0,python):2025-07-15-11:17:09.112.365 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1179105,ffff0480efa0,python):2025-07-15-11:17:09.112.394 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1179105,ffff0480efa0,python):2025-07-15-11:17:09.119.481 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [0] physical device id: 0
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:09.233.697 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1179143,ffffab75eec0,python):2025-07-15-11:17:09.233.960 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1179143,fffeed06efa0,python):2025-07-15-11:17:09.234.222 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:8118, node_rank:2130706433, total_rank_size:4, local_rank_size4
[WARNING] HCCL_ADPT(1179143,fffeed06efa0,python):2025-07-15-11:17:09.234.343 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1179143,fffeed06efa0,python):2025-07-15-11:17:09.234.379 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1179143,fffeed06efa0,python):2025-07-15-11:17:09.234.408 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1179143,fffeed06efa0,python):2025-07-15-11:17:09.234.964 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [2] physical device id: 2
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:09.406.717 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1179150,ffffbdbceec0,python):2025-07-15-11:17:09.406.989 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1179150,fffec37eefa0,python):2025-07-15-11:17:09.407.248 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:8118, node_rank:2130706433, total_rank_size:4, local_rank_size4
[WARNING] HCCL_ADPT(1179150,fffec37eefa0,python):2025-07-15-11:17:09.407.368 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1179150,fffec37eefa0,python):2025-07-15-11:17:09.407.404 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1179150,fffec37eefa0,python):2025-07-15-11:17:09.407.432 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1179150,fffec37eefa0,python):2025-07-15-11:17:09.407.842 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [3] physical device id: 3
Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm/test_dvm_allreduce.py", line 27, in <module>
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

[WARNING] DEVICE(1179129,ffffb92deec0,python):2025-07-15-11:19:52.994.271 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed
[ERROR] ME(1178852:281473819209408,MainProcess):2025-07-15-11:19:54.373.388 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 1179129 exit with exception. Error code: 1.
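Editor's note: the traceback above is the root cause of this run. mindspore.communication.init() calls init_hccl(), and aclrtSetDevice returns ret[507033] while binding device id 1 even though 8 devices are reported. A minimal pre-flight sketch like the one below can make that kind of device-binding problem fail faster with a clearer message; it is hypothetical and not part of test_dvm_allreduce.py, and it assumes that msrun's exported RANK_ID maps one-to-one to the local NPU id and that mindspore.set_device() accepts a device id, as the deprecation warning earlier in this log suggests.

# Hypothetical pre-flight check, not taken from the test sources in this log.
import os

import mindspore as ms
from mindspore.communication import init


def init_with_device_check():
    """Bind this worker's NPU explicitly before building the HCCL communicator."""
    rank_id = int(os.environ.get("RANK_ID", "0"))          # exported per worker by msrun (see above)
    visible = os.environ.get("ASCEND_RT_VISIBLE_DEVICES")  # optional device whitelist on the host
    if visible is not None and str(rank_id) not in visible.split(","):
        raise RuntimeError(f"RANK_ID {rank_id} not in ASCEND_RT_VISIBLE_DEVICES={visible}")
    ms.set_device("Ascend", rank_id)  # assumed signature; replaces set_context(device_target=...)
    init()  # init() -> init_hccl() -> aclrtSetDevice, the call that returned ret[507033] here


if __name__ == "__main__":
    init_with_device_check()

If this check passes but aclrtSetDevice still fails, the device may be occupied or unhealthy on the host, which a rank-id check alone cannot detect.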
[WARNING] ME(1178852:281473819209408,MainProcess):2025-07-15-11:19:54.373.863 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers.
[ERROR] ME(1178852:281473819209408,MainProcess):2025-07-15-11:20:28.558.396 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 1179098 exit with exception.
[ERROR] ME(1178852:281473819209408,MainProcess):2025-07-15-11:20:28.559.707 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['1']
./dvm_allreduce_log/worker_1.log-16-[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:05.927.437 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
./dvm_allreduce_log/worker_1.log-17-[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:06.427.647 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
./dvm_allreduce_log/worker_1.log-18-[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:06.927.781 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
./dvm_allreduce_log/worker_1.log-19-[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:07.427.931 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
./dvm_allreduce_log/worker_1.log-20-[WARNING] DISTRIBUTED(1179129,ffffb92deec0,python):2025-07-15-11:17:07.427.978 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
./dvm_allreduce_log/worker_1.log:21:Traceback (most recent call last):
./dvm_allreduce_log/worker_1.log-22-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm/test_dvm_allreduce.py", line 27, in <module>
./dvm_allreduce_log/worker_1.log-23-    init()
./dvm_allreduce_log/worker_1.log-24-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
./dvm_allreduce_log/worker_1.log-25-    init_hccl()
./dvm_allreduce_log/worker_1.log:26:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
./dvm_allreduce_log/worker_1.log-27-
./dvm_allreduce_log/worker_1.log-28----------------------------------------------------
./dvm_allreduce_log/worker_1.log-29-- C++ Call Stack: (For framework developers)
./dvm_allreduce_log/worker_1.log-30----------------------------------------------------
./dvm_allreduce_log/worker_1.log-31-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice
--
./dvm_allreduce_log/scheduler.log-96-[WARNING] DISTRIBUTED(1179098,ffffb404eec0,python):2025-07-15-11:20:11.901.253 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
./dvm_allreduce_log/scheduler.log-97-[WARNING] DISTRIBUTED(1179098,ffffb404eec0,python):2025-07-15-11:20:16.901.419 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
./dvm_allreduce_log/scheduler.log-98-[WARNING] DISTRIBUTED(1179098,ffffb404eec0,python):2025-07-15-11:20:16.901.485 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
./dvm_allreduce_log/scheduler.log-99-[WARNING] DISTRIBUTED(1179098,ffffb404eec0,python):2025-07-15-11:20:21.901.611 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
./dvm_allreduce_log/scheduler.log-100-[WARNING] DISTRIBUTED(1179098,ffffb404eec0,python):2025-07-15-11:20:21.901.669 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
./dvm_allreduce_log/scheduler.log:101:[ERROR] DISTRIBUTED(1179098,ffff4cffefa0,python):2025-07-15-11:20:23.416.296 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 1 is timed out. It may exit with exception, please check this node's log.
./dvm_allreduce_log/scheduler.log:102:[ERROR] DISTRIBUTED(1179098,ffffb404eec0,python):2025-07-15-11:20:26.901.824 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
./dvm_allreduce_log/scheduler.log:103:Traceback (most recent call last):
./dvm_allreduce_log/scheduler.log-104-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/graph_kernel/comm/test_dvm_allreduce.py", line 27, in <module>
./dvm_allreduce_log/scheduler.log-105-    init()
./dvm_allreduce_log/scheduler.log-106-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
./dvm_allreduce_log/scheduler.log-107-    init_cluster()
./dvm_allreduce_log/scheduler.log:108:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
./dvm_allreduce_log/scheduler.log-109-
./dvm_allreduce_log/scheduler.log-110----------------------------------------------------
./dvm_allreduce_log/scheduler.log-111-- C++ Call Stack: (For framework developers)
./dvm_allreduce_log/scheduler.log-112----------------------------------------------------
./dvm_allreduce_log/scheduler.log-113-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: ./dvm_allreduce_log.
F
=================================== FAILURES ===================================
______________________________ test_dvm_allreduce ______________________________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level1', card_mark='allcards', essential_mark='essential')
    def test_dvm_allreduce():
        """
        Feature: DVM operator test.
        Description: msrun dvm allreduce 4P case.
        Expectation: success
        """
        return_code = os.system(
            "MS_DEV_GRAPH_KERNEL_FLAGS='--enable_cluster_ops=AllReduce' "\
            "msrun --worker_num=4 --local_worker_num=4 --join=True --log_dir=./dvm_allreduce_log "\
            "python test_dvm_allreduce.py"
        )
>       assert return_code == 0
E       assert 256 == 0

test_all.py:32: AssertionError
=========================== short test summary info ============================
FAILED test_all.py::test_dvm_allreduce - assert 256 == 0
======================== 1 failed in 215.41s (0:03:35) =========================
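Editor's note: the worker script itself is not captured in this log beyond line 27 calling init(). For reference, a plausible minimal shape of the "msrun dvm allreduce 4P case" worker, inferred from the traceback and the test docstring, is sketched below; everything other than init() and ops.AllReduce is an illustrative assumption, not the actual content of test_dvm_allreduce.py.

# Hypothetical sketch of a DVM AllReduce worker; not the real test_dvm_allreduce.py.
import numpy as np

import mindspore as ms
from mindspore import Tensor, nn, ops
from mindspore.communication import get_rank, init

ms.set_context(mode=ms.GRAPH_MODE)
init()  # line 27 of the real script; worker 1 failed inside this call (aclrtSetDevice ret[507033])


class AllReduceNet(nn.Cell):
    """One AllReduce over the default hccl_world_group created in the log above."""

    def __init__(self):
        super().__init__()
        self.all_reduce = ops.AllReduce()  # defaults to a sum over the world group

    def construct(self, x):
        return self.all_reduce(x)


net = AllReduceNet()
output = net(Tensor(np.ones([2, 4]).astype(np.float32)))
# With worker_num=4, every element should be 4.0 once the collective succeeds.
print("rank", get_rank(), output)

Launched through the msrun command shown in test_all.py, with MS_DEV_GRAPH_KERNEL_FLAGS='--enable_cluster_ops=AllReduce' asking graph kernel (DVM) to take over the AllReduce, each of the 4 ranks would print a tensor of 4.0s after a successful run; in this job worker 1 never got past init(), so the launcher killed the remaining workers and pytest reported the non-zero exit status 256.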