============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel, configfile: ../../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_entry_msrun_pynative_hccl.py
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:worker_0.log. Environment variable [RANK_ID=0] is exported.
Start worker process with rank id:1, log file:worker_1.log. Environment variable [RANK_ID=1] is exported.
Start worker process with rank id:2, log file:worker_2.log. Environment variable [RANK_ID=2] is exported.
Start worker process with rank id:3, log file:worker_3.log. Environment variable [RANK_ID=3] is exported.
Start worker process with rank id:4, log file:worker_4.log. Environment variable [RANK_ID=4] is exported.
Start worker process with rank id:5, log file:worker_5.log. Environment variable [RANK_ID=5] is exported.
Start worker process with rank id:6, log file:worker_6.log. Environment variable [RANK_ID=6] is exported.
Start worker process with rank id:7, log file:worker_7.log. Environment variable [RANK_ID=7] is exported.
[WARNING] ME(1156162:281473887891136,MainProcess):2025-07-15-11:08:49.810.760 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_pynative_hccl_allreduce.py
[WARNING] DISTRIBUTED(1156247,ffff237eefa0,python):2025-07-15-11:08:55.330.253 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:37988 to 127.0.0.1:10969 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:55.330.253 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:37988, destination: 127.0.0.1:10969
[WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:55.330.492 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:37998, destination: 127.0.0.1:10969
[WARNING] DISTRIBUTED(1156247,ffff28ebefa0,python):2025-07-15-11:08:55.330.525 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:37998 to 127.0.0.1:10969 is successfully created.
System errno: Success [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:55.330.537 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py [WARNING] DISTRIBUTED(1156243,ffff277eefa0,python):2025-07-15-11:08:55.377.131 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38012 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:55.377.136 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38012, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:55.377.349 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38018, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156243,ffff2cb5efa0,python):2025-07-15-11:08:55.377.383 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38018 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:55.377.394 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:55.466.658 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38026, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156239,ffff4eddefa0,python):2025-07-15-11:08:55.466.726 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38026 to 127.0.0.1:10969 is successfully created. 
System errno: Success [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:55.466.758 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py [WARNING] DISTRIBUTED(1156255,ffff2fffefa0,python):2025-07-15-11:08:55.505.974 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38040 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:55.505.977 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38040, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:55.506.036 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38056, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156251,ffff2cabefa0,python):2025-07-15-11:08:55.506.038 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38056 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:55.506.104 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:55.506.192 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38070, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:55.506.232 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(1156255,ffff3559efa0,python):2025-07-15-11:08:55.506.228 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38070 to 127.0.0.1:10969 is successfully created. System errno: Success ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py [WARNING] DISTRIBUTED(1156259,ffff3734efa0,python):2025-07-15-11:08:55.712.633 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38076 to 127.0.0.1:10969 is successfully created. 
System errno: Success [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:55.712.633 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38076, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:55.712.854 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38082, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156259,ffff3836efa0,python):2025-07-15-11:08:55.712.883 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38082 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:55.712.901 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:55.831.539 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py [WARNING] DISTRIBUTED(1156265,ffff5167efa0,python):2025-07-15-11:08:55.869.360 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38096 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:55.869.364 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38096, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:55.869.578 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38098, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156265,ffff5269efa0,python):2025-07-15-11:08:55.869.610 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38098 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:55.869.623 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:55.877.960 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). ============================= test session starts ============================== platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/pynative/data_parallel plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 collected 1 item test_pynative_hccl_allreduce.py [WARNING] DISTRIBUTED(1156269,ffff1cc3efa0,python):2025-07-15-11:08:55.936.557 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38102 to 127.0.0.1:10969 is successfully created. 
System errno: Success [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:55.936.557 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:38102, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:55.936.740 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38106, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156269,ffff1dc5efa0,python):2025-07-15-11:08:55.936.772 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38106 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:55.936.779 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:55.967.078 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38108, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156239,ffff4fdfefa0,python):2025-07-15-11:08:55.967.089 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38108 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:55.967.134 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:56.006.335 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:38116, destination: 127.0.0.1:10969 [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:56.006.380 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10969 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(1156251,ffff2dadefa0,python):2025-07-15-11:08:56.006.378 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:38116 to 127.0.0.1:10969 is successfully created. System errno: Success [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:56.006.802 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:56.213.450 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:56.331.670 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:56.370.234 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:56.378.076 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:56.437.292 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). 
[WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:56.467.625 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:56.506.911 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:56.506.942 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200). [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:56.713.566 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:56.831.788 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200). [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:56.870.355 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:56.878.184 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200). [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:56.937.401 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:56.967.743 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200). [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:57.007.043 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:57.007.061 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:08:57.007.089 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 4 rank id: 4 [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:08:57.007.103 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3 [WARNING] PS(1156251,ffff9460eec0,python):2025-07-15-11:08:57.007.703 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] PS(1156255,ffff9c09eec0,python):2025-07-15-11:08:57.007.703 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156255,ffff9c09eec0,python):2025-07-15-11:08:57.007.760 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DEVICE(1156251,ffff9460eec0,python):2025-07-15-11:08:57.007.759 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:57.213.699 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. 
[WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:08:57.213.744 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 5 rank id: 5 [WARNING] PS(1156259,ffff9ee6eec0,python):2025-07-15-11:08:57.214.411 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156259,ffff9ee6eec0,python):2025-07-15-11:08:57.214.485 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:57.331.914 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:08:57.331.962 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2 [WARNING] PS(1156247,ffff8f9ceec0,python):2025-07-15-11:08:57.332.751 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156247,ffff8f9ceec0,python):2025-07-15-11:08:57.332.827 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:57.370.481 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:08:57.370.526 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 6 rank id: 6 [WARNING] PS(1156265,ffffb91aeec0,python):2025-07-15-11:08:57.371.065 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156265,ffffb91aeec0,python):2025-07-15-11:08:57.371.119 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:57.378.301 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156243,ffff9365eec0,python):2025-07-15-11:08:57.378.345 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1 [WARNING] PS(1156243,ffff9365eec0,python):2025-07-15-11:08:57.378.958 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156243,ffff9365eec0,python):2025-07-15-11:08:57.379.017 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:57.437.528 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:08:57.437.568 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 7 rank id: 7 [WARNING] PS(1156269,ffff8476eec0,python):2025-07-15-11:08:57.438.103 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156269,ffff8476eec0,python):2025-07-15-11:08:57.438.154 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. 
[WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:57.467.875 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:57.467.931 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0 [WARNING] PS(1156239,ffffb692eec0,python):2025-07-15-11:08:57.468.480 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. [WARNING] DEVICE(1156239,ffffb692eec0,python):2025-07-15-11:08:57.468.535 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. [WARNING] DISTRIBUTED(1156239,ffffb692eec0,python):2025-07-15-11:08:59.250.170 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1 [WARNING] DEVICE(1156239,fffef695efa0,python):2025-07-15-11:08:59.250.758 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(1156239,fffef695efa0,python):2025-07-15-11:08:59.250.916 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(1156239,fffef695efa0,python):2025-07-15-11:08:59.250.954 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(1156239,fffef695efa0,python):2025-07-15-11:08:59.251.012 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(1156239,fffef695efa0,python):2025-07-15-11:08:59.258.868 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(1156239,fffef491efa0,python):2025-07-15-11:08:59.259.269 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 [WARNING] DISTRIBUTED(1156251,ffff9460eec0,python):2025-07-15-11:09:01.812.688 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1 [WARNING] DEVICE(1156251,fffed4baefa0,python):2025-07-15-11:09:01.813.202 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(1156251,fffed4baefa0,python):2025-07-15-11:09:01.813.304 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(1156251,fffed4baefa0,python):2025-07-15-11:09:01.813.341 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(1156251,fffed4baefa0,python):2025-07-15-11:09:01.813.373 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(1156251,fffed4baefa0,python):2025-07-15-11:09:01.813.883 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DISTRIBUTED(1156255,ffff9c09eec0,python):2025-07-15-11:09:01.814.203 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1 [WARNING] DEVICE(1156251,fffe87ffefa0,python):2025-07-15-11:09:01.814.260 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 [WARNING] DEVICE(1156255,fffe97ffefa0,python):2025-07-15-11:09:01.814.733 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(1156255,fffe97ffefa0,python):2025-07-15-11:09:01.814.826 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(1156255,fffe97ffefa0,python):2025-07-15-11:09:01.814.863 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(1156255,fffe97ffefa0,python):2025-07-15-11:09:01.814.894 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(1156255,fffe97ffefa0,python):2025-07-15-11:09:01.815.239 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(1156255,fffe977eefa0,python):2025-07-15-11:09:01.815.525 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 [WARNING] DISTRIBUTED(1156259,ffff9ee6eec0,python):2025-07-15-11:09:02.020.442 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1 [WARNING] DEVICE(1156259,fffedeefefa0,python):2025-07-15-11:09:02.021.126 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(1156259,fffedeefefa0,python):2025-07-15-11:09:02.021.253 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(1156259,fffedeefefa0,python):2025-07-15-11:09:02.021.290 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(1156259,fffedeefefa0,python):2025-07-15-11:09:02.021.321 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(1156259,fffedeefefa0,python):2025-07-15-11:09:02.021.806 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(1156259,fffede6eefa0,python):2025-07-15-11:09:02.022.344 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 [WARNING] DISTRIBUTED(1156247,ffff8f9ceec0,python):2025-07-15-11:09:02.143.847 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1 [WARNING] DEVICE(1156247,fffed9beefa0,python):2025-07-15-11:09:02.144.386 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(1156247,fffed9beefa0,python):2025-07-15-11:09:02.144.488 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(1156247,fffed9beefa0,python):2025-07-15-11:09:02.144.526 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(1156247,fffed9beefa0,python):2025-07-15-11:09:02.144.558 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(1156247,fffed9beefa0,python):2025-07-15-11:09:02.145.102 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(1156247,fffeceefefa0,python):2025-07-15-11:09:02.145.477 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 [WARNING] DISTRIBUTED(1156265,ffffb91aeec0,python):2025-07-15-11:09:02.192.738 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1 [WARNING] DEVICE(1156265,fffef90aefa0,python):2025-07-15-11:09:02.193.257 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(1156265,fffef90aefa0,python):2025-07-15-11:09:02.193.354 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(1156265,fffef90aefa0,python):2025-07-15-11:09:02.193.392 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(1156265,fffef90aefa0,python):2025-07-15-11:09:02.193.424 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1156265,fffef90aefa0,python):2025-07-15-11:09:02.193.960 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1156265,fffef889efa0,python):2025-07-15-11:09:02.194.359 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1156269,ffff8476eec0,python):2025-07-15-11:09:02.225.731 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1156269,fffec4baefa0,python):2025-07-15-11:09:02.226.219 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10969, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1156269,fffec4baefa0,python):2025-07-15-11:09:02.226.309 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1156269,fffec4baefa0,python):2025-07-15-11:09:02.226.342 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1156269,fffec4baefa0,python):2025-07-15-11:09:02.226.371 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1156269,fffec4baefa0,python):2025-07-15-11:09:02.226.833 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1156269,fffe77ffefa0,python):2025-07-15-11:09:02.227.254 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
F

=================================== FAILURES ===================================
____________________ test_msrun_pynative_hccl_allreduce_8p _____________________

    def test_msrun_pynative_hccl_allreduce_8p():
        '''
        Feature: allreduce op in pynative mode.
        Description: Test allreduce op in pynative mode.
        Expectation: Run success.
        '''
>       D.init()

test_pynative_hccl_allreduce.py:76: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

backend_name = 'hccl'

    def init(backend_name=None):
        """
        Initialize distributed backends required by communication services, e.g. ``"hccl"`` / ``"nccl"`` / ``"mccl"``.
        It is usually used in distributed parallel scenarios and set before using communication services.

        Note:
            - The full name of ``"hccl"`` is Huawei Collective Communication Library(HCCL).
            - The full name of ``"nccl"`` is NVIDIA Collective Communication Library(NCCL).
            - The full name of ``"mccl"`` is MindSpore Collective Communication Library(MCCL).
            - In Ascend hardware platforms, ``init()`` should be set before the definition of any Tensor and Parameter,
              and the instantiation and execution of any operation and net.

        Args:
            backend_name (str): Backend, using ``"hccl"`` / ``"nccl"`` / ``"mccl"``. ``"hccl"`` should be used for Ascend hardware platforms,
                ``"nccl"`` for GPU hardware platforms and ``"mccl"`` for CPU hardware platforms.
                If not set, inference is automatically made based on the hardware
                platform type (device_target). Default: ``None`` .

        Raises:
            TypeError: If `backend_name` is not a string.
            RuntimeError: If device target is invalid, or backend is invalid, or distributed initialization fails,
                or the environment variables RANK_ID/MINDSPORE_HCCL_CONFIG_PATH
                have not been exported when backend is HCCL.

        Supported Platforms:
            ``Ascend`` ``GPU`` ``CPU``

        Examples:
            .. note::
                Before running the following examples, you need to configure the communication environment variables.
                For Ascend/GPU/CPU devices, it is recommended to use the msrun startup method
                without any third-party or configuration file dependencies.
                Please see the `msrun start up `_ for more details.

            >>> from mindspore.communication import init
            >>> init()
        """
        host_init = _host_distribute()
        device_target = context.get_context("device_target")
        if backend_name is None:
            if device_target == "Ascend":
                backend_name = "hccl"
            elif device_target == "GPU":
                backend_name = "nccl"
            elif device_target == "CPU":
                backend_name = "mccl"
            else:
                raise RuntimeError("For 'set_context', the argument 'device_target' {} is not supported in "
                                   "parallel initialization, please use Ascend, GPU or CPU.".format(device_target))
        if not isinstance(backend_name, str):
            raise TypeError("For 'init', the argument 'backend_name' must be a string, "
                            "but got the type : {}".format(type(backend_name)))
        if os.getenv("MS_ROLE") == "MS_SCHED":
            backend_name = "mccl"

        _set_elegant_exit_handle()
        if backend_name == "hccl":
            if _is_ps_mode():
                # Use MindSpore cluster to build network for Parameter Server training.
                init_cluster()
                if _is_role_sched() or _is_role_pserver():
                    raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'")
                if _get_ps_context("worker_num") == 1:
                    GlobalComm.INITED = True
                    return
            if device_target != "Ascend":
                raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', "
                                   "but got 'hccl'.".format(DEVICE_TO_BACKEND[device_target], device_target))
            if is_initialized(device_target):
                logger.warning(f"For 'init' in Ascend backend, the backend is already initialized, please set it before "
                               "the definition of any Tensor and Parameter, and the instantiation and execution of any "
                               "operation and net, otherwise the 'init' may not take effect.")
            if not host_init:
                _check_parallel_envs()
            GlobalComm.BACKEND = Backend("hccl")
            _check_hccl()
>           init_hccl()
E           RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
E           
E           ----------------------------------------------------
E           - C++ Call Stack: (For framework developers)
E           ----------------------------------------------------
E           mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: RuntimeError
=============================== warnings summary ===============================
../../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
    setattr(self, word, getattr(machar, word).flat[0])
../../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
    return self._float_to_str(self.smallest_subnormal)
../../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
    setattr(self, word, getattr(machar, word).flat[0])
../../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
    return self._float_to_str(self.smallest_subnormal)
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2.py:57
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2.py:57: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("batchnorm_fold2")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad.py:56
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad.py:56: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("batchnorm_fold2_grad")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad_reduce.py:48
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("batchnorm_fold2_grad_reduce")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul.py:51
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul.py:51: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("correction_mul")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:51
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:51: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("correction_mul_grad")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:143
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:143: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("correction_mul_grad_reduce")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perlayer")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad.py:92
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad.py:92: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perlayer_grad_d")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad_reduce.py:49
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad_reduce.py:49: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perlayer_grad_d_reduce")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perchannel")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad.py:91
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad.py:91: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d_reduce")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_perchannel")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel_grad.py:81
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel_grad.py:81: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_perchannel_grad")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer.py:54
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer.py:54: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_per_layer")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer_grad.py:81
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer_grad.py:81: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_per_layer_grad")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perchannel.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perchannel.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("minmax_update_perchannel")
../../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("minmax_update_perlayer")
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED test_pynative_hccl_allreduce.py::test_msrun_pynative_hccl_allreduce_8p
================== 1 failed, 22 warnings in 171.38s (0:02:51) ==================
[WARNING] DEVICE(1156243,ffff9365eec0,python):2025-07-15-11:11:41.432.505 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed
[INFO] PS(1156243,ffff257aefa0,python):2025-07-15-11:11:42.008.853 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
[INFO] PS(1156243,ffff25fbefa0,python):2025-07-15-11:11:42.009.240 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[ERROR] ME(1156162:281473887891136,MainProcess):2025-07-15-11:11:42.902.219 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 1156243 exit with exception. Error code: 1.
[WARNING] ME(1156162:281473887891136,MainProcess):2025-07-15-11:11:42.902.543 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers.
[ERROR] ME(1156162:281473887891136,MainProcess):2025-07-15-11:12:18.950.219 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 1156237 exit with exception.
[ERROR] ME(1156162:281473887891136,MainProcess):2025-07-15-11:12:18.951.353 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['0', '2', '3', '4', '5', '6', '7']
worker_1.log-52- ``"nccl"`` for GPU hardware platforms and ``"mccl"`` for CPU hardware platforms.
worker_1.log-53- If not set, inference is automatically made based on the hardware
worker_1.log-54- platform type (device_target). Default: ``None`` .
worker_1.log-55-
worker_1.log-56- Raises:
worker_1.log:57: TypeError: If `backend_name` is not a string.
worker_1.log:58: RuntimeError: If device target is invalid, or backend is invalid, or distributed initialization fails,
worker_1.log-59- or the environment variables RANK_ID/MINDSPORE_HCCL_CONFIG_PATH
worker_1.log-60- have not been exported when backend is HCCL.
worker_1.log-61-
worker_1.log-62- Supported Platforms:
worker_1.log-63- ``Ascend`` ``GPU`` ``CPU``
--
worker_1.log-84- elif device_target == "GPU":
worker_1.log-85- backend_name = "nccl"
worker_1.log-86- elif device_target == "CPU":
worker_1.log-87- backend_name = "mccl"
worker_1.log-88- else:
worker_1.log:89: raise RuntimeError("For 'set_context', the argument 'device_target' {} is not supported in "
worker_1.log-90- "parallel initialization, please use Ascend, GPU or CPU.".format(device_target))
worker_1.log-91- if not isinstance(backend_name, str):
worker_1.log:92: raise TypeError("For 'init', the argument 'backend_name' must be a string, "
worker_1.log-93- "but got the type : {}".format(type(backend_name)))
worker_1.log-94- if os.getenv("MS_ROLE") == "MS_SCHED":
worker_1.log-95- backend_name = "mccl"
worker_1.log-96-
worker_1.log-97- _set_elegant_exit_handle()
worker_1.log-98- if backend_name == "hccl":
worker_1.log-99- if _is_ps_mode():
worker_1.log-100- # Use MindSpore cluster to build network for Parameter Server training.
worker_1.log-101- init_cluster() worker_1.log-102- if _is_role_sched() or _is_role_pserver(): worker_1.log:103: raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'") worker_1.log-104- if _get_ps_context("worker_num") == 1: worker_1.log-105- GlobalComm.INITED = True worker_1.log-106- return worker_1.log-107- if device_target != "Ascend": worker_1.log:108: raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', " worker_1.log-109- "but got 'hccl'.".format(DEVICE_TO_BACKEND[device_target], device_target)) worker_1.log-110- if is_initialized(device_target): worker_1.log-111- logger.warning(f"For 'init' in Ascend backend, the backend is already initialized, please set it before " worker_1.log-112- "the definition of any Tensor and Parameter, and the instantiation and execution of any " worker_1.log-113- "operation and net, otherwise the 'init' may not take effect.") worker_1.log-114- if not host_init: worker_1.log-115- _check_parallel_envs() worker_1.log-116- GlobalComm.BACKEND = Backend("hccl") worker_1.log-117- _check_hccl() worker_1.log-118-> init_hccl() worker_1.log:119:E RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. worker_1.log-120-E worker_1.log-121-E ---------------------------------------------------- worker_1.log-122-E - C++ Call Stack: (For framework developers) worker_1.log-123-E ---------------------------------------------------- worker_1.log-124-E mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice worker_1.log-125- worker_1.log:126:/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: RuntimeError worker_1.log-127-=============================== warnings summary =============================== worker_1.log-128-../../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549 worker_1.log-129- /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. worker_1.log-130- setattr(self, word, getattr(machar, word).flat[0]) worker_1.log-131- grep: __pycache__/test_entry_msrun_pynative_hccl.cpython-39-pytest-6.2.5.pyc: binary file matches -- scheduler.log-97-[WARNING] DISTRIBUTED(1156237,ffffb1cfeec0,python):2025-07-15-11:12:01.790.951 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... scheduler.log-98-[WARNING] DISTRIBUTED(1156237,ffffb1cfeec0,python):2025-07-15-11:12:06.791.170 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 7 alive nodes. scheduler.log-99-[WARNING] DISTRIBUTED(1156237,ffffb1cfeec0,python):2025-07-15-11:12:06.791.261 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... scheduler.log-100-[WARNING] DISTRIBUTED(1156237,ffffb1cfeec0,python):2025-07-15-11:12:11.791.468 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 7 alive nodes. 
scheduler.log-101-[WARNING] DISTRIBUTED(1156237,ffffb1cfeec0,python):2025-07-15-11:12:11.791.536 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log:102:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.693 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 0 is timed out. It may exit with exception, please check this node's log.
scheduler.log:103:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.768 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 2 is timed out. It may exit with exception, please check this node's log.
scheduler.log:104:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.798 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log.
scheduler.log:105:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.823 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 4 is timed out. It may exit with exception, please check this node's log.
scheduler.log:106:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.848 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 5 is timed out. It may exit with exception, please check this node's log.
scheduler.log:107:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.872 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 6 is timed out. It may exit with exception, please check this node's log.
scheduler.log:108:[ERROR] DISTRIBUTED(1156237,ffff4a1cefa0,python):2025-07-15-11:12:13.306.896 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 7 is timed out. It may exit with exception, please check this node's log.
scheduler.log:109:[ERROR] DISTRIBUTED(1156237,ffffb1cfeec0,python):2025-07-15-11:12:16.791.691 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 7 abnormal compute graph nodes.
scheduler.log-110-F
scheduler.log-111-
scheduler.log-112-=================================== FAILURES ===================================
scheduler.log-113-____________________ test_msrun_pynative_hccl_allreduce_8p _____________________
scheduler.log-114-
--
scheduler.log-143-            ``"nccl"`` for GPU hardware platforms and ``"mccl"`` for CPU hardware platforms.
scheduler.log-144-            If not set, inference is automatically made based on the hardware
scheduler.log-145-            platform type (device_target). Default: ``None`` .
scheduler.log-146-
scheduler.log-147-    Raises:
scheduler.log:148:        TypeError: If `backend_name` is not a string.
scheduler.log:149:        RuntimeError: If device target is invalid, or backend is invalid, or distributed initialization fails,
scheduler.log-150-            or the environment variables RANK_ID/MINDSPORE_HCCL_CONFIG_PATH
scheduler.log-151-            have not been exported when backend is HCCL.
scheduler.log-152-
scheduler.log-153-    Supported Platforms:
scheduler.log-154-        ``Ascend`` ``GPU`` ``CPU``
--
scheduler.log-175-        elif device_target == "GPU":
scheduler.log-176-            backend_name = "nccl"
scheduler.log-177-        elif device_target == "CPU":
scheduler.log-178-            backend_name = "mccl"
scheduler.log-179-        else:
scheduler.log:180:            raise RuntimeError("For 'set_context', the argument 'device_target' {} is not supported in "
scheduler.log-181-                               "parallel initialization, please use Ascend, GPU or CPU.".format(device_target))
scheduler.log-182-    if not isinstance(backend_name, str):
scheduler.log:183:        raise TypeError("For 'init', the argument 'backend_name' must be a string, "
scheduler.log-184-                        "but got the type : {}".format(type(backend_name)))
scheduler.log-185-    if os.getenv("MS_ROLE") == "MS_SCHED":
scheduler.log-186-        backend_name = "mccl"
scheduler.log-187-
scheduler.log-188-    _set_elegant_exit_handle()
scheduler.log-189-    if backend_name == "hccl":
scheduler.log-190-        if _is_ps_mode():
scheduler.log-191-            # Use MindSpore cluster to build network for Parameter Server training.
scheduler.log-192-            init_cluster()
scheduler.log-193-            if _is_role_sched() or _is_role_pserver():
scheduler.log:194:                raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'")
scheduler.log-195-            if _get_ps_context("worker_num") == 1:
scheduler.log-196-                GlobalComm.INITED = True
scheduler.log-197-                return
scheduler.log-198-        if device_target != "Ascend":
scheduler.log:199:            raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', "
scheduler.log-200-                               "but got 'hccl'.".format(DEVICE_TO_BACKEND[device_target], device_target))
scheduler.log-201-        if is_initialized(device_target):
scheduler.log-202-            logger.warning(f"For 'init' in Ascend backend, the backend is already initialized, please set it before "
scheduler.log-203-                           "the definition of any Tensor and Parameter, and the instantiation and execution of any "
scheduler.log-204-                           "operation and net, otherwise the 'init' may not take effect.")
--
scheduler.log-208-        _check_hccl()
scheduler.log-209-        init_hccl()
scheduler.log-210-        GlobalComm.WORLD_COMM_GROUP = HCCL_WORLD_COMM_GROUP
scheduler.log-211-    elif backend_name == "nccl":
scheduler.log-212-        if device_target != "GPU":
scheduler.log:213:            raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', "
scheduler.log-214-                               "but got 'nccl'.".format(DEVICE_TO_BACKEND[device_target], device_target))
scheduler.log-215-        init_cluster()
scheduler.log-216-        GlobalComm.BACKEND = Backend("nccl")
scheduler.log-217-        GlobalComm.WORLD_COMM_GROUP = NCCL_WORLD_COMM_GROUP
scheduler.log-218-    elif backend_name == "mccl":
scheduler.log-219->       init_cluster()
scheduler.log:220:E       RuntimeError: The total number of timed out node is 7. Timed out node list is: [const vector]{0, 2, 3, 4, 5, 6, 7}, worker 0 is the first one timed out, please check its log.
scheduler.log-221-E
scheduler.log-222-E       ----------------------------------------------------
scheduler.log-223-E       - C++ Call Stack: (For framework developers)
scheduler.log-224-E       ----------------------------------------------------
scheduler.log-225-E       mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
scheduler.log-226-
scheduler.log:227:/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:213: RuntimeError
scheduler.log-228-=============================== warnings summary ===============================
scheduler.log-229-../../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
scheduler.log-230-  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
scheduler.log-231-    setattr(self, word, getattr(machar, word).flat[0])
scheduler.log-232- Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .
F
=================================== FAILURES ===================================
_______________________ test_pynative_hccl_allreduce_8p ________________________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level1', card_mark='allcards', essential_mark='essential')
    def test_pynative_hccl_allreduce_8p():
        '''
        Feature: run allreduce op in pynative mode using msrun.
        Description: Test case entry allreduce op in pynative mode.
        Expectation: Run success.
        '''
        return_code = os.system(
            "msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 --master_port=10969 --join=True "\
            "pytest -s test_pynative_hccl_allreduce.py::test_msrun_pynative_hccl_allreduce_8p"
        )
>       assert return_code == 0
E       assert 256 == 0

test_entry_msrun_pynative_hccl.py:33: AssertionError
=========================== short test summary info ============================
FAILED test_entry_msrun_pynative_hccl.py::test_pynative_hccl_allreduce_8p - a...
======================== 1 failed in 216.07s (0:03:36) =========================
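A note on the final `assert 256 == 0` in the entry test above: on POSIX systems `os.system` returns the raw wait status rather than the child's exit code, so the `msrun` invocation exiting with code 1 shows up as 256. The following is an illustrative sketch only (not taken from this run); the `false` command simply stands in for the failing `msrun ... pytest ...` command.

import os

# 'false' is a stand-in for the failing "msrun ... pytest ..." invocation above; it exits with code 1.
status = os.system("false")

# os.system returns the raw wait status on POSIX: the child's exit code is packed into the high byte,
# which is why an exit code of 1 is reported as 256 by the assertion in the entry test.
if os.WIFEXITED(status):
    print("raw wait status :", status)                   # 256
    print("decoded exit code:", os.WEXITSTATUS(status))  # 1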