============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/mint, configfile: ../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_mint_comm_op.py 
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:worker_0.log. Environment variable [RANK_ID=0] is exported.
Start worker process with rank id:1, log file:worker_1.log. Environment variable [RANK_ID=1] is exported.
Start worker process with rank id:2, log file:worker_2.log. Environment variable [RANK_ID=2] is exported.
Start worker process with rank id:3, log file:worker_3.log. Environment variable [RANK_ID=3] is exported.
Start worker process with rank id:4, log file:worker_4.log. Environment variable [RANK_ID=4] is exported.
Start worker process with rank id:5, log file:worker_5.log. Environment variable [RANK_ID=5] is exported.
Start worker process with rank id:6, log file:worker_6.log. Environment variable [RANK_ID=6] is exported.
Start worker process with rank id:7, log file:worker_7.log. Environment variable [RANK_ID=7] is exported.
[WARNING] ME(1412768:281473674899136,MainProcess):2025-07-15-12:13:37.686.542 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:43.453.159 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45028, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:43.453.294 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:10666, retry to reconnect(1/1)...
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:43.560.142 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45042, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412845,ffff38f7efa0,python):2025-07-15-12:13:43.560.149 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45042 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:43.560.221 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:43.664.998 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45058, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412853,ffff4563efa0,python):2025-07-15-12:13:43.665.057 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45058 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:43.665.145 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:43.748.559 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45074, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412861,ffff18ebefa0,python):2025-07-15-12:13:43.748.559 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45074 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:43.748.700 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:43.748.741 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45078, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412869,ffff20c3efa0,python):2025-07-15-12:13:43.748.750 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45078 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:43.748.863 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:43.823.431 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45082, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412849,ffff2533efa0,python):2025-07-15-12:13:43.823.431 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45082 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:43.823.562 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:43.862.308 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45088, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412873,ffff332defa0,python):2025-07-15-12:13:43.862.328 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45088 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:43.862.381 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:43.953.468 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:43.953.535 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:44.060.675 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45102, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:44.060.728 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412845,ffff39f9efa0,python):2025-07-15-12:13:44.060.732 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45102 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412865,ffff2d17efa0,python):2025-07-15-12:13:44.152.633 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45118 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:44.152.632 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:45118, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:44.152.871 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45134, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412865,ffff2e19efa0,python):2025-07-15-12:13:44.152.902 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45134 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:44.152.917 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:44.165.600 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45142, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412853,ffff4665efa0,python):2025-07-15-12:13:44.165.606 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45142 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:44.165.662 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:44.248.976 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45148, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:44.249.025 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412861,ffff19edefa0,python):2025-07-15-12:13:44.249.025 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45148 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:44.249.099 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45150, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:44.249.156 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412869,ffff21c5efa0,python):2025-07-15-12:13:44.249.158 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45150 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:44.323.788 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45162, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:44.323.839 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412849,ffff2635efa0,python):2025-07-15-12:13:44.323.840 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45162 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:44.362.597 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45168, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:44.362.638 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412873,ffff342fefa0,python):2025-07-15-12:13:44.362.640 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45168 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:44.453.922 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:45184, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:44.453.966 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1412857,ffff505befa0,python):2025-07-15-12:13:44.453.997 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45184 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:44.561.404 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:44.653.468 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:44.666.381 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:44.749.598 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:44.749.693 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:44.824.403 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:44.863.040 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:44.954.229 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:45198, destination: 127.0.0.1:10666
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:44.954.267 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10666 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1412857,ffff4f59efa0,python):2025-07-15-12:13:44.954.277 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:45198 to 127.0.0.1:10666 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:45.061.510 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:45.153.572 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:45.166.461 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:45.249.707 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:45.249.799 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:45.324.517 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:45.363.151 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:45.454.998 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:45.561.612 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:45.653.669 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:45.666.564 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:45.749.842 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:45.749.899 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:45.824.648 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:45.863.265 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:45.955.136 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:45.955.189 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3
[WARNING] PS(1412857,ffffb707eec0,python):2025-07-15-12:13:45.955.785 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412857,ffffb707eec0,python):2025-07-15-12:13:45.955.838 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:46.061.727 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:46.061.767 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0
[WARNING] PS(1412845,ffffa0aaeec0,python):2025-07-15-12:13:46.062.276 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412845,ffffa0aaeec0,python):2025-07-15-12:13:46.062.327 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:46.153.778 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:46.153.815 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 5 rank id: 5
[WARNING] PS(1412865,ffff94c6eec0,python):2025-07-15-12:13:46.154.312 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412865,ffff94c6eec0,python):2025-07-15-12:13:46.154.360 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:46.166.694 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:46.166.735 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2
[WARNING] PS(1412853,ffffad19eec0,python):2025-07-15-12:13:46.167.373 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412853,ffffad19eec0,python):2025-07-15-12:13:46.167.429 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:46.250.012 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:46.250.003 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:46.250.049 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 4 rank id: 4
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:46.250.051 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 6 rank id: 6
[WARNING] PS(1412869,ffff8873eec0,python):2025-07-15-12:13:46.250.462 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412869,ffff8873eec0,python):2025-07-15-12:13:46.250.513 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] PS(1412861,ffff809ceec0,python):2025-07-15-12:13:46.250.871 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412861,ffff809ceec0,python):2025-07-15-12:13:46.250.935 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:46.324.791 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412849,ffff8ce9eec0,python):2025-07-15-12:13:46.324.845 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
[WARNING] PS(1412849,ffff8ce9eec0,python):2025-07-15-12:13:46.325.458 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412849,ffff8ce9eec0,python):2025-07-15-12:13:46.325.509 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:46.363.398 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:46.363.453 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 7 rank id: 7
[WARNING] PS(1412873,ffff9adceec0,python):2025-07-15-12:13:46.363.828 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist.
[WARNING] DEVICE(1412873,ffff9adceec0,python):2025-07-15-12:13:46.363.877 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node.
[WARNING] DISTRIBUTED(1412873,ffff9adceec0,python):2025-07-15-12:13:48.094.941 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:48.095.419 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412873,fffedb04efa0,python):2025-07-15-12:13:48.095.517 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412873,fffedb04efa0,python):2025-07-15-12:13:48.095.556 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:48.095.588 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:48.096.025 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 1
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:48.596.275 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 398/400, sleep 1
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:49.096.515 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 397/400, sleep 1
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:49.597.012 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 396/400, sleep 1
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:50.097.613 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 395/400, sleep 2
[WARNING] DEVICE(1412873,fffedb04efa0,python):2025-07-15-12:13:50.598.189 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 394/400, sleep 2
[WARNING] DISTRIBUTED(1412857,ffffb707eec0,python):2025-07-15-12:13:50.769.501 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412857,fffef6efefa0,python):2025-07-15-12:13:50.770.087 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412857,fffef6efefa0,python):2025-07-15-12:13:50.770.200 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412857,fffef6efefa0,python):2025-07-15-12:13:50.770.250 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412857,fffef6efefa0,python):2025-07-15-12:13:50.770.279 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DEVICE(1412857,fffef6efefa0,python):2025-07-15-12:13:50.770.778 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 2
[WARNING] DISTRIBUTED(1412845,ffffa0aaeec0,python):2025-07-15-12:13:50.876.099 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412845,fffee0baefa0,python):2025-07-15-12:13:50.876.619 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412845,fffee0baefa0,python):2025-07-15-12:13:50.876.714 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412845,fffee0baefa0,python):2025-07-15-12:13:50.876.749 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412845,fffee0baefa0,python):2025-07-15-12:13:50.876.779 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1412845,fffee0baefa0,python):2025-07-15-12:13:50.884.373 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412845,fffe927cefa0,python):2025-07-15-12:13:50.884.712 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1412865,ffff94c6eec0,python):2025-07-15-12:13:50.948.499 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412865,fffe8fffefa0,python):2025-07-15-12:13:50.949.063 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412865,fffe8fffefa0,python):2025-07-15-12:13:50.949.156 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412865,fffe8fffefa0,python):2025-07-15-12:13:50.949.192 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412865,fffe8fffefa0,python):2025-07-15-12:13:50.949.222 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1412865,fffe8fffefa0,python):2025-07-15-12:13:50.949.666 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412865,fffe8f7eefa0,python):2025-07-15-12:13:50.949.996 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1412853,ffffad19eec0,python):2025-07-15-12:13:50.955.074 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412853,fffeed0aefa0,python):2025-07-15-12:13:50.955.630 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412853,fffeed0aefa0,python):2025-07-15-12:13:50.955.731 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412853,fffeed0aefa0,python):2025-07-15-12:13:50.955.771 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412853,fffeed0aefa0,python):2025-07-15-12:13:50.955.804 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1412853,fffeed0aefa0,python):2025-07-15-12:13:50.956.590 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412853,fffeec89efa0,python):2025-07-15-12:13:50.957.088 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1412869,ffff8873eec0,python):2025-07-15-12:13:50.991.392 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412869,fffec8baefa0,python):2025-07-15-12:13:50.991.882 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412869,fffec8baefa0,python):2025-07-15-12:13:50.991.980 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412869,fffec8baefa0,python):2025-07-15-12:13:50.992.018 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412869,fffec8baefa0,python):2025-07-15-12:13:50.992.051 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1412869,fffec8baefa0,python):2025-07-15-12:13:50.992.502 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412869,fffe7bffefa0,python):2025-07-15-12:13:50.992.886 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1412861,ffff809ceec0,python):2025-07-15-12:13:51.030.134 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 0, submit_now: 1
[WARNING] DEVICE(1412861,fffec0baefa0,python):2025-07-15-12:13:51.030.671 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10666, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1412861,fffec0baefa0,python):2025-07-15-12:13:51.030.770 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1412861,fffec0baefa0,python):2025-07-15-12:13:51.030.808 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1412861,fffec0baefa0,python):2025-07-15-12:13:51.030.840 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1412861,fffec0baefa0,python):2025-07-15-12:13:51.031.379 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412861,fffe73ffefa0,python):2025-07-15-12:13:51.031.783 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1412873,fffedb04efa0,python):2025-07-15-12:13:51.098.701 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412873,fffeda83efa0,python):2025-07-15-12:13:51.099.073 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1412857,fffef6efefa0,python):2025-07-15-12:13:51.271.132 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1412857,fffef66eefa0,python):2025-07-15-12:13:51.271.593 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0

============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/mint
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 0 items / 1 error

==================================== ERRORS ====================================
_____________________ ERROR collecting test_distributed.py _____________________
test_distributed.py:55: in <module>
    init_process_group()
/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/mint/distributed/distributed.py:470: in init_process_group
    init(backend)
/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: in init
    init_hccl()
E   RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
E
E   ----------------------------------------------------
E   - C++ Call Stack: (For framework developers)
E   ----------------------------------------------------
E   mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice
=============================== warnings summary ===============================
../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
    setattr(self, word, getattr(machar, word).flat[0])

../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
    return self._float_to_str(self.smallest_subnormal)

../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
    setattr(self, word, getattr(machar, word).flat[0])

../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
    return self._float_to_str(self.smallest_subnormal)

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2.py:57
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2.py:57: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("batchnorm_fold2")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad.py:56
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad.py:56: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("batchnorm_fold2_grad")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad_reduce.py:48
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("batchnorm_fold2_grad_reduce")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul.py:51
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul.py:51: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("correction_mul")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:51
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:51: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("correction_mul_grad")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:143
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:143: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("correction_mul_grad_reduce")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perlayer")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad.py:92
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad.py:92: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perlayer_grad_d")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad_reduce.py:49
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad_reduce.py:49: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perlayer_grad_d_reduce")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perchannel")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad.py:91
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad.py:91: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d_reduce")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_perchannel")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel_grad.py:81
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel_grad.py:81: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_perchannel_grad")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer.py:54
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer.py:54: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_per_layer")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer_grad.py:81
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer_grad.py:81: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("fake_quant_per_layer_grad")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perchannel.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perchannel.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("minmax_update_perchannel")

../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50
  /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute
    @fusion_manager.register("minmax_update_perlayer")

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
ERROR test_distributed.py - RuntimeError: Call aclrtSetDevice failed, ret[507...
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
================== 22 warnings, 1 error in 166.67s (0:02:46) ===================
[WARNING] DEVICE(1412849,ffff8ce9eec0,python):2025-07-15-12:16:24.881.522 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed
[INFO] PS(1412849,ffff16fdefa0,python):2025-07-15-12:16:25.459.310 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
[INFO] PS(1412849,ffff177eefa0,python):2025-07-15-12:16:25.459.642 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[ERROR] ME(1412768:281473674899136,MainProcess):2025-07-15-12:16:26.571.376 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 1412849 exit with exception. Error code: 2.
[WARNING] ME(1412768:281473674899136,MainProcess):2025-07-15-12:16:26.571.702 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers.
[ERROR] ME(1412768:281473674899136,MainProcess):2025-07-15-12:17:02.791.397 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 1412843 exit with exception.
[ERROR] ME(1412768:281473674899136,MainProcess):2025-07-15-12:17:02.792.504 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['0', '2', '3', '4', '5', '6', '7']
test_distributed.py-58-)
test_distributed.py-59-context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend")
test_distributed.py-60-rank = get_rank()
test_distributed.py-61-size = get_world_size()
test_distributed.py-62-if size % 2 != 0:
test_distributed.py:63: raise RuntimeError("Group size should be divided by 2 exactly.")
test_distributed.py-64-
test_distributed.py-65-
test_distributed.py-66-def test_hccl_new_group():
test_distributed.py-67- """
test_distributed.py-68- Feature: test distributed op
--
test_distributed.py-88- name = "hccl_" + str(2) + "_" + hashlib.sha1(bytes("_".join(map(str, range(2))), "utf-8")).hexdigest()
test_distributed.py-89- assert group == name
test_distributed.py-90- # timeout case
test_distributed.py-91- #group = new_group(list(range(9)))
test_distributed.py-92- #assert group == ""
test_distributed.py:93: with pytest.raises(TypeError):
test_distributed.py-94- new_group(1)
test_distributed.py:95: with pytest.raises(TypeError):
test_distributed.py-96- new_group(True)
test_distributed.py-97- if rank == 0 or rank == 1:
test_distributed.py:98: with pytest.raises(ValueError):
test_distributed.py-99- new_group([0, 0, 1, 1])
test_distributed.py-100-
test_distributed.py-101-
test_distributed.py-102-def test_hccl_get_backend():
test_distributed.py-103- """
--
test_distributed.py-112- name = "hccl_" + str(size) + "_" + hashlib.sha1(bytes("_".join(map(str, range(size))), "utf-8")).hexdigest()
test_distributed.py-113- group = new_group(list(range(size)), 1)
test_distributed.py-114- assert group == name
test_distributed.py-115- backend = get_backend(group)
test_distributed.py-116- assert backend == "hccl"
test_distributed.py:117: with pytest.raises(TypeError):
test_distributed.py-118- backend = get_backend(1)
test_distributed.py-119-
test_distributed.py-120-
test_distributed.py-121-def test_hccl_get_global_rank():
test_distributed.py-122- """
--
test_distributed.py-124- Description: test comm op in python native
test_distributed.py-125- Expectation: success
test_distributed.py-126- """
test_distributed.py-127- global_rank = get_global_rank(None, rank)
test_distributed.py-128- assert global_rank == rank
test_distributed.py:129: with pytest.raises(TypeError):
test_distributed.py-130- get_global_rank(0, rank)
test_distributed.py:131: with pytest.raises(TypeError):
test_distributed.py-132- get_global_rank(None, "rank")
test_distributed.py-133- if rank == 0:
test_distributed.py-134- group = new_group(list(range(2)))
test_distributed.py-135- global_rank = get_global_rank(group, 1)
test_distributed.py-136- assert global_rank == 1
--
test_distributed.py-154- Description: test comm op in python native
test_distributed.py-155- Expectation: success
test_distributed.py-156- """
test_distributed.py-157- group_rank = get_group_rank(None, rank)
test_distributed.py-158- assert group_rank == rank
test_distributed.py:159: with pytest.raises(TypeError):
test_distributed.py-160- get_group_rank(0, rank)
test_distributed.py:161: with pytest.raises(TypeError):
test_distributed.py-162- get_group_rank(None, "rank")
test_distributed.py-163- if rank == 0:
test_distributed.py-164- group = new_group(list(range(2)))
test_distributed.py-165- group_rank = get_group_rank(group, 1)
test_distributed.py-166- assert group_rank == 1
--
test_distributed.py-185- Expectation: success
test_distributed.py-186- """
test_distributed.py-187- ranks = get_process_group_ranks()
test_distributed.py-188- print("ranks is:", ranks)
test_distributed.py-189- assert ranks == list(range(size))
test_distributed.py:190: with pytest.raises(TypeError):
test_distributed.py-191- get_process_group_ranks(0)
test_distributed.py:192: with pytest.raises(TypeError):
test_distributed.py-193- get_process_group_ranks(True)
test_distributed.py-194- if rank == 0:
test_distributed.py-195- group = new_group(list(range(2)))
test_distributed.py-196- ranks = get_process_group_ranks(group)
test_distributed.py-197- assert ranks == list(range(2))
--
test_distributed.py-270- sum_input_tensor1 = input_tensor * (rank + 1)
test_distributed.py-271- sum_output_handle = all_reduce(sum_input_tensor1, group=name)
test_distributed.py-272- except_sum_output = input_tensor * (sum(list(range(1, 3))))
test_distributed.py-273- assert np.allclose(sum_input_tensor1.asnumpy(), except_sum_output.asnumpy())
test_distributed.py-274- # exception scenarios
test_distributed.py:275: with pytest.raises(TypeError):
test_distributed.py-276- all_reduce(1)
test_distributed.py:277: with pytest.raises(TypeError):
test_distributed.py-278- all_reduce(sum_input_tensor, op=1)
test_distributed.py:279: with pytest.raises(TypeError):
test_distributed.py-280- all_reduce(sum_input_tensor, op="test")
test_distributed.py:281: with pytest.raises(TypeError):
test_distributed.py-282- all_reduce(sum_input_tensor, group=1)
test_distributed.py:283: with pytest.raises(TypeError):
test_distributed.py-284- all_reduce(sum_input_tensor, async_op="test")
test_distributed.py-285-
test_distributed.py-286-
test_distributed.py-287-def test_hccl_all_gather_into_tensor():
test_distributed.py-288- """
--
test_distributed.py-315- )
test_distributed.py-316- except_output_tensor = cat([input_tensor1, input_tensor1])
test_distributed.py-317- assert output_handle is None
test_distributed.py-318- assert np.allclose(output_tensor1.asnumpy(), except_output_tensor.asnumpy())
test_distributed.py-319- # exception scenarios
test_distributed.py:320: with pytest.raises(TypeError):
test_distributed.py-321- all_gather_into_tensor(1)
test_distributed.py:322: with pytest.raises(TypeError):
test_distributed.py-323- all_gather_into_tensor(output_tensor, input_tensor, group=1)
test_distributed.py:324: with pytest.raises(TypeError):
test_distributed.py-325- all_gather_into_tensor(output_tensor, input_tensor, async_op="test")
test_distributed.py:326: with pytest.raises(TypeError):
test_distributed.py-327- all_gather_into_tensor([1], input_tensor)
test_distributed.py:328: with pytest.raises(TypeError):
test_distributed.py-329- all_gather_into_tensor(output_tensor, [1])
test_distributed.py-330- output_tensor = ms.Tensor(np.zeros([3, 3]).astype(np.float32))
test_distributed.py:331: with pytest.raises(ValueError):
test_distributed.py-332- all_gather_into_tensor(output_tensor, input_tensor)
test_distributed.py-333- _pynative_executor.sync()
test_distributed.py-334- output_tensor = ms.Tensor(np.zeros([3, 3 * size]).astype(np.float32))
test_distributed.py:335: with pytest.raises(ValueError):
test_distributed.py-336- all_gather_into_tensor(output_tensor, input_tensor)
test_distributed.py-337- _pynative_executor.sync()
test_distributed.py-338- output_tensor = ms.Tensor(np.zeros([3 * size, 3]).astype(np.int32))
test_distributed.py:339: with pytest.raises(ValueError):
test_distributed.py-340- all_gather_into_tensor(output_tensor, input_tensor)
test_distributed.py-341- _pynative_executor.sync()
test_distributed.py-342-
test_distributed.py-343-def test_hccl_all_gather_into_tensor_uneven():
test_distributed.py-344- """
--
test_distributed.py-403- group=name,
test_distributed.py-404- )
test_distributed.py-405- assert output_handle is None
test_distributed.py-406- assert np.allclose(output_tensor1.asnumpy(), except_output)
test_distributed.py-407- # # exception scenarios
test_distributed.py:408: with pytest.raises(TypeError):
test_distributed.py-409- all_gather_into_tensor_uneven(1)
test_distributed.py:410: with pytest.raises(TypeError):
test_distributed.py-411- all_gather_into_tensor_uneven(output_tensor, input_tensor, group=1)
test_distributed.py:412: with pytest.raises(TypeError):
test_distributed.py-413- all_gather_into_tensor_uneven(output_tensor, input_tensor, async_op="test")
test_distributed.py:414: with pytest.raises(TypeError):
test_distributed.py-415- all_gather_into_tensor_uneven([1], input_tensor)
test_distributed.py:416: with pytest.raises(TypeError):
test_distributed.py-417- all_gather_into_tensor_uneven(output_tensor, [1])
test_distributed.py:418: with pytest.raises(ValueError):
test_distributed.py-419- output_split_sizes1 = [r + 3 for r in range(size)]
test_distributed.py-420- all_gather_into_tensor_uneven(
test_distributed.py-421- output_tensor, input_tensor, output_split_sizes=output_split_sizes1
test_distributed.py-422- )
test_distributed.py-423- _pynative_executor.sync()
test_distributed.py:424: with pytest.raises(ValueError):
test_distributed.py-425- output_split_sizes1 = [r + 1 for r in range(size + 3)]
test_distributed.py-426- all_gather_into_tensor_uneven(
test_distributed.py-427- output_tensor, input_tensor, output_split_sizes=output_split_sizes1
test_distributed.py-428- )
test_distributed.py-429- _pynative_executor.sync()
test_distributed.py-430- output_tensor = ms.Tensor(np.zeros([5 * size]).astype(np.float32))
test_distributed.py:431: with pytest.raises(ValueError):
test_distributed.py-432- all_gather_into_tensor_uneven(
test_distributed.py-433- output_tensor, input_tensor, output_split_sizes=output_split_sizes
test_distributed.py-434- )
test_distributed.py-435- _pynative_executor.sync()
test_distributed.py-436- output_tensor = ms.Tensor(np.zeros([total_size]).astype(np.int32))
test_distributed.py:437: with pytest.raises(ValueError):
test_distributed.py-438- all_gather_into_tensor_uneven(output_tensor, input_tensor)
test_distributed.py-439- _pynative_executor.sync()
test_distributed.py-440-
test_distributed.py-441-
test_distributed.py-442-def test_hccl_reduce_scatter_tensor_type():
--
test_distributed.py-500- output_handle = reduce_scatter_tensor(output_tensor1, input_tensor1, group=name)
test_distributed.py-501- except_output_tensor = ms.Tensor(np.ones([3, 3]).astype(np.float32)) * 2
test_distributed.py-502- assert output_handle is None
test_distributed.py-503- assert np.allclose(output_tensor1.asnumpy(), except_output_tensor.asnumpy())
test_distributed.py-504- # exception scenarios
test_distributed.py:505: with pytest.raises(TypeError):
test_distributed.py-506- reduce_scatter_tensor(1)
test_distributed.py:507: with pytest.raises(TypeError):
test_distributed.py-508- reduce_scatter_tensor(output_tensor, input_tensor, op=1)
test_distributed.py:509: with pytest.raises(TypeError):
test_distributed.py-510- reduce_scatter_tensor(output_tensor, input_tensor, op="test")
test_distributed.py:511: with pytest.raises(TypeError):
test_distributed.py-512- reduce_scatter_tensor(output_tensor, input_tensor, group=1)
test_distributed.py:513: with pytest.raises(TypeError):
test_distributed.py-514- reduce_scatter_tensor(output_tensor, input_tensor, async_op="test")
test_distributed.py:515: with pytest.raises(TypeError):
test_distributed.py-516- reduce_scatter_tensor([1], input_tensor)
test_distributed.py:517: with pytest.raises(TypeError):
test_distributed.py-518- reduce_scatter_tensor(output_tensor, [1])
test_distributed.py-519- output_tensor = ms.Tensor(np.zeros([1, 3]).astype(np.float32))
test_distributed.py:520: with pytest.raises(ValueError):
test_distributed.py-521- reduce_scatter_tensor(output_tensor, input_tensor)
test_distributed.py-522- _pynative_executor.sync()
test_distributed.py-523- output_tensor = ms.Tensor(np.zeros([3, 1]).astype(np.float32))
test_distributed.py:524: with pytest.raises(ValueError):
test_distributed.py-525- reduce_scatter_tensor(output_tensor, input_tensor)
test_distributed.py-526- _pynative_executor.sync()
test_distributed.py-527- output_tensor = ms.Tensor(np.zeros([3, 3]).astype(np.int32))
test_distributed.py:528: with pytest.raises(ValueError):
test_distributed.py-529- reduce_scatter_tensor(output_tensor, input_tensor)
test_distributed.py-530- _pynative_executor.sync()
test_distributed.py-531-
test_distributed.py-532-
test_distributed.py-533-
--
test_distributed.py-592- group=name,
test_distributed.py-593- )
test_distributed.py-594- assert output_handle is None
test_distributed.py-595- assert np.allclose(output_tensor1.asnumpy(), expected_output)
test_distributed.py-596- # exception scenarios
test_distributed.py:597: with pytest.raises(TypeError):
test_distributed.py-598- reduce_scatter_tensor_uneven(1)
test_distributed.py:599: with pytest.raises(TypeError):
test_distributed.py-600- reduce_scatter_tensor_uneven(output_tensor, input_tensor, group=1)
test_distributed.py:601: with pytest.raises(TypeError):
test_distributed.py-602- reduce_scatter_tensor_uneven(output_tensor, input_tensor, async_op="test")
test_distributed.py:603: with pytest.raises(TypeError):
test_distributed.py-604- reduce_scatter_tensor_uneven([1], input_tensor)
test_distributed.py:605: with pytest.raises(TypeError):
test_distributed.py-606- reduce_scatter_tensor_uneven(output_tensor, [1])
test_distributed.py:607: with pytest.raises(ValueError):
test_distributed.py-608- input_split_sizes1 = [r + 3 for r in range(size)]
test_distributed.py-609- reduce_scatter_tensor_uneven(
test_distributed.py-610- output_tensor, input_tensor, input_split_sizes=input_split_sizes1
test_distributed.py-611- )
test_distributed.py-612- _pynative_executor.sync()
test_distributed.py:613: with pytest.raises(ValueError):
test_distributed.py-614- input_split_sizes1 = [r + 1 for r in range(size + 3)]
test_distributed.py-615- reduce_scatter_tensor_uneven(
test_distributed.py-616- output_tensor, input_tensor, input_split_sizes=input_split_sizes1
test_distributed.py-617- )
test_distributed.py-618- _pynative_executor.sync()
test_distributed.py-619- output_tensor = ms.Tensor(np.zeros([rank + 1]).astype(np.int32))
test_distributed.py:620: with pytest.raises(ValueError):
test_distributed.py-621- reduce_scatter_tensor_uneven(output_tensor, input_tensor)
test_distributed.py-622- _pynative_executor.sync()
test_distributed.py-623-
test_distributed.py-624-
test_distributed.py-625-def test_hccl_reduce_type():
--
test_distributed.py-696-
assert np.allclose(sum_input_tensor1.asnumpy(), except_sum_output.asnumpy()) test_distributed.py-697- else: test_distributed.py-698- except_sum_output = input_tensor * (rank + 1) test_distributed.py-699- assert np.allclose(sum_input_tensor1.asnumpy(), except_sum_output.asnumpy()) test_distributed.py-700- # 异常场景 test_distributed.py:701: with pytest.raises(TypeError): test_distributed.py-702- reduce(1) test_distributed.py:703: with pytest.raises(TypeError): test_distributed.py-704- reduce(sum_input_tensor, dst="test") test_distributed.py:705: with pytest.raises(TypeError): test_distributed.py-706- reduce(sum_input_tensor, dst=0, op=1) test_distributed.py:707: with pytest.raises(TypeError): test_distributed.py-708- reduce(sum_input_tensor, dst=0, op="test") test_distributed.py:709: with pytest.raises(TypeError): test_distributed.py-710- reduce(sum_input_tensor, dst=0, group=1) test_distributed.py:711: with pytest.raises(TypeError): test_distributed.py-712- reduce(sum_input_tensor, dst=0, async_op="test") test_distributed.py-713- test_distributed.py-714- test_distributed.py-715-def test_hccl_batch_isend_irecv(): test_distributed.py-716- """ -- test_distributed.py-742- except_output = ms.Tensor(2, dtype=ms.float32) test_distributed.py-743- assert np.allclose(recv_tensor.asnumpy(), except_output.asnumpy()) test_distributed.py-744- # 异常场景 test_distributed.py-745- send_op = P2POp("isend", send_tensor, next_rank) test_distributed.py-746- recv_op = P2POp("irecv", recv_tensor, prev_rank, group="11") test_distributed.py:747: with pytest.raises(TypeError): test_distributed.py-748- batch_isend_irecv() test_distributed.py-749- test_distributed.py-750- if rank == 0 or rank == 1: test_distributed.py-751- next_rank = (rank + 1) % 2 test_distributed.py-752- prev_rank = (rank + size - 1) % 2 -- test_distributed.py-769- except_output = ms.Tensor(2, dtype=ms.float32) test_distributed.py-770- assert np.allclose(recv_tensor.asnumpy(), except_output.asnumpy()) test_distributed.py-771- # 异常场景 test_distributed.py-772- send_op = P2POp(isend, send_tensor, next_rank) test_distributed.py-773- recv_op = P2POp(irecv, recv_tensor, prev_rank, group="11") test_distributed.py:774: with pytest.raises(TypeError): test_distributed.py-775- batch_isend_irecv() test_distributed.py-776- test_distributed.py-777- # 异常场景 test_distributed.py:778: with pytest.raises(TypeError): test_distributed.py-779- batch_isend_irecv() test_distributed.py:780: with pytest.raises(TypeError): test_distributed.py-781- batch_isend_irecv(1) test_distributed.py:782: with pytest.raises(TypeError): test_distributed.py-783- batch_isend_irecv([]) test_distributed.py:784: with pytest.raises(TypeError): test_distributed.py-785- batch_isend_irecv([1]) test_distributed.py-786- barrier() test_distributed.py-787- test_distributed.py-788- test_distributed.py-789-def test_hccl_broadcast(): -- test_distributed.py-823- except_output_tensor = ms.Tensor( test_distributed.py-824- np.arange(8).reshape([2, 4]).astype(np.float32) test_distributed.py-825- ) test_distributed.py-826- assert np.allclose(tensor.asnumpy(), except_output_tensor.asnumpy()) test_distributed.py-827- # 异常场景 test_distributed.py:828: with pytest.raises(TypeError): test_distributed.py-829- broadcast(1, src=0) test_distributed.py:830: with pytest.raises(TypeError): test_distributed.py-831- broadcast(tensor, src="test") test_distributed.py:832: with pytest.raises(TypeError): test_distributed.py-833- broadcast(tensor, src=0, group=1) test_distributed.py:834: with pytest.raises(TypeError): 
test_distributed.py-835- broadcast(tensor, src=0, async_op="test") test_distributed.py-836- test_distributed.py-837- test_distributed.py-838-def test_hccl_barrier(): test_distributed.py-839- """ -- test_distributed.py-854- name = "hccl_" + str(2) + "_" + hashlib.sha1(bytes("_".join(map(str, range(2))), "utf-8")).hexdigest() test_distributed.py-855- assert group == name test_distributed.py-856- output_handle = barrier(group=name) test_distributed.py-857- assert output_handle is None test_distributed.py-858- # 异常场景 test_distributed.py:859: with pytest.raises(TypeError): test_distributed.py-860- barrier(group=1) test_distributed.py:861: with pytest.raises(TypeError): test_distributed.py-862- barrier(async_op="test") test_distributed.py-863- test_distributed.py-864- test_distributed.py-865-def test_hccl_send(): test_distributed.py-866- """ -- test_distributed.py-888- out = recv(output, src=1, group=group) test_distributed.py-889- assert out == 0 test_distributed.py-890- assert np.allclose(output.asnumpy(), input_tensor.asnumpy()) test_distributed.py-891- test_distributed.py-892- # 异常场景 test_distributed.py:893: with pytest.raises(TypeError): test_distributed.py-894- send(1) test_distributed.py:895: with pytest.raises(TypeError): test_distributed.py-896- send(input_tensor, dst="test") test_distributed.py:897: with pytest.raises(TypeError): test_distributed.py-898- send(input_tensor, group=1) test_distributed.py:899: with pytest.raises(ValueError): test_distributed.py-900- send(input_tensor, dst=rank) test_distributed.py-901- test_distributed.py-902- test_distributed.py-903-def test_hccl_recv(): test_distributed.py-904- """ -- test_distributed.py-925- else: test_distributed.py-926- out = recv(output, src=1, group=group) test_distributed.py-927- assert out == 0 test_distributed.py-928- assert np.allclose(output.asnumpy(), input_tensor.asnumpy()) test_distributed.py-929- # 异常场景 test_distributed.py:930: with pytest.raises(TypeError): test_distributed.py-931- recv(1) test_distributed.py:932: with pytest.raises(TypeError): test_distributed.py-933- recv(output, src="test") test_distributed.py:934: with pytest.raises(TypeError): test_distributed.py-935- recv(output, group=1) test_distributed.py-936- test_distributed.py-937- test_distributed.py-938-def test_hccl_isend(): test_distributed.py-939- """ -- test_distributed.py-965- out = recv(output, src=1, group=group) test_distributed.py-966- assert out == 0 test_distributed.py-967- assert np.allclose(output.asnumpy(), input_tensor.asnumpy()) test_distributed.py-968- test_distributed.py-969- # 异常场景 test_distributed.py:970: with pytest.raises(TypeError): test_distributed.py-971- isend(1) test_distributed.py:972: with pytest.raises(TypeError): test_distributed.py-973- isend(input_tensor, dst="test") test_distributed.py:974: with pytest.raises(TypeError): test_distributed.py-975- isend(input_tensor, group=1) test_distributed.py:976: with pytest.raises(ValueError): test_distributed.py-977- isend(input_tensor, dst=rank) test_distributed.py-978- test_distributed.py-979- test_distributed.py-980-def test_hccl_irecv(): test_distributed.py-981- """ -- test_distributed.py-1004- handle = irecv(output, src=1, group=group) test_distributed.py-1005- assert handle is not None test_distributed.py-1006- handle.wait() test_distributed.py-1007- assert np.allclose(output.asnumpy(), input_tensor.asnumpy()) test_distributed.py-1008- # 异常场景 test_distributed.py:1009: with pytest.raises(TypeError): test_distributed.py-1010- irecv(1) test_distributed.py:1011: with 
pytest.raises(TypeError): test_distributed.py-1012- irecv(output, src="test") test_distributed.py:1013: with pytest.raises(TypeError): test_distributed.py-1014- irecv(output, group=1) test_distributed.py-1015- test_distributed.py-1016- test_distributed.py-1017-def test_hccl_all_to_all(): test_distributed.py-1018- """ -- test_distributed.py-1083- ) test_distributed.py-1084- assert np.allclose( test_distributed.py-1085- recv_tensor_list[1].asnumpy(), except_output_tensor[1].asnumpy() test_distributed.py-1086- ) test_distributed.py-1087- # 异常场景 test_distributed.py:1088: with pytest.raises(TypeError): test_distributed.py-1089- all_to_all(1) test_distributed.py:1090: with pytest.raises(TypeError): test_distributed.py-1091- all_to_all(output_tensors, 1) test_distributed.py:1092: with pytest.raises(TypeError): test_distributed.py-1093- all_to_all(output_tensors, input_tensors, group=1) test_distributed.py:1094: with pytest.raises(TypeError): test_distributed.py-1095- all_to_all(output_tensors, input_tensors, async_op="1") test_distributed.py:1096: with pytest.raises(ValueError): test_distributed.py-1097- output_tensors = [] test_distributed.py-1098- for _ in range(size): test_distributed.py-1099- output_tensors.append(ms.Tensor(np.ones([1, 1]).astype(np.int32))) test_distributed.py-1100- all_to_all(output_tensors, input_tensors) test_distributed.py-1101- _pynative_executor.sync() -- test_distributed.py-1142- handle = all_to_all_single(output, tensor, [1, 2], [1, 2], group=group) test_distributed.py-1143- assert handle is None test_distributed.py-1144- except_output_tensor = ms.Tensor([[0, 0, 0.0], [12, 13, 14], [1, 1, 1]]) test_distributed.py-1145- assert np.allclose(output.asnumpy(), except_output_tensor.asnumpy()) test_distributed.py-1146- # 异常场景 test_distributed.py:1147: with pytest.raises(TypeError): test_distributed.py-1148- all_to_all_single(1, input_tensor) test_distributed.py:1149: with pytest.raises(TypeError): test_distributed.py-1150- all_to_all_single(output_tensor, 1) test_distributed.py:1151: with pytest.raises(TypeError): test_distributed.py-1152- all_to_all_single(output_tensor, input_tensor, group=1) test_distributed.py:1153: with pytest.raises(TypeError): test_distributed.py-1154- all_to_all_single(output_tensor, input_tensor, async_op="1") test_distributed.py:1155: with pytest.raises(ValueError): test_distributed.py-1156- input_tensor = ms.Tensor(np.ones([size - 1, 1]).astype(np.float32)) test_distributed.py-1157- all_to_all_single(output_tensor, input_tensor) test_distributed.py:1158: with pytest.raises(ValueError): test_distributed.py-1159- input_tensor = ms.Tensor(np.ones([size, 1]).astype(np.float32)) * rank test_distributed.py-1160- output_tensor = ms.Tensor(np.zeros([size, 1]).astype(np.int32)) test_distributed.py-1161- all_to_all_single(output_tensor, input_tensor) test_distributed.py-1162- _pynative_executor.sync() test_distributed.py-1163- -- test_distributed.py-1207- ) test_distributed.py-1208- assert np.allclose( test_distributed.py-1209- output_tensor1[1].asnumpy(), except_output_tensor[1].asnumpy() test_distributed.py-1210- ) test_distributed.py-1211- # 异常场景 test_distributed.py:1212: with pytest.raises(TypeError): test_distributed.py-1213- all_gather(1) test_distributed.py:1214: with pytest.raises(TypeError): test_distributed.py-1215- all_gather(output_tensor, input_tensor, group=1) test_distributed.py:1216: with pytest.raises(TypeError): test_distributed.py-1217- all_gather(output_tensor, input_tensor, async_op="test") test_distributed.py:1218: with 
pytest.raises(TypeError): test_distributed.py-1219- all_gather([1], input_tensor) test_distributed.py:1220: with pytest.raises(TypeError): test_distributed.py-1221- all_gather(output_tensor, [1]) test_distributed.py:1222: with pytest.raises(TypeError): test_distributed.py-1223- output_tensor = [ test_distributed.py-1224- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1225- ms.Tensor(np.zeros([3, 3]).astype(np.int32)), test_distributed.py-1226- ] test_distributed.py-1227- all_gather(output_tensor, input_tensor) test_distributed.py:1228: with pytest.raises(TypeError): test_distributed.py-1229- output_tensor = [ test_distributed.py-1230- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1231- ms.Tensor(np.zeros([1, 3]).astype(np.float32)), test_distributed.py-1232- ] test_distributed.py-1233- all_gather(output_tensor, input_tensor) test_distributed.py:1234: with pytest.raises(TypeError): test_distributed.py-1235- output_tensor = [] test_distributed.py-1236- for _ in range(size): test_distributed.py-1237- output_tensor.append(ms.Tensor(np.zeros([3, 3]).astype(np.int32))) test_distributed.py-1238- all_gather(output_tensor, input_tensor) test_distributed.py-1239- _pynative_executor.sync() -- test_distributed.py-1308- output_handle = reduce_scatter(output_tensor1, input_tensor1, group=name) test_distributed.py-1309- except_output_tensor = ms.Tensor(np.ones([3, 3]).astype(np.float32)) * 2 test_distributed.py-1310- assert output_handle is None test_distributed.py-1311- assert np.allclose(output_tensor1.asnumpy(), except_output_tensor.asnumpy()) test_distributed.py-1312- # 异常场景 test_distributed.py:1313: with pytest.raises(TypeError): test_distributed.py-1314- reduce_scatter(1) test_distributed.py:1315: with pytest.raises(TypeError): test_distributed.py-1316- reduce_scatter(output_tensor, input_tensor, op=1) test_distributed.py:1317: with pytest.raises(TypeError): test_distributed.py-1318- reduce_scatter(output_tensor, input_tensor, op="test") test_distributed.py:1319: with pytest.raises(TypeError): test_distributed.py-1320- reduce_scatter(output_tensor, input_tensor, group=1) test_distributed.py:1321: with pytest.raises(TypeError): test_distributed.py-1322- reduce_scatter(output_tensor, input_tensor, async_op="test") test_distributed.py:1323: with pytest.raises(TypeError): test_distributed.py-1324- reduce_scatter([1], input_tensor) test_distributed.py:1325: with pytest.raises(TypeError): test_distributed.py-1326- reduce_scatter(output_tensor, [1]) test_distributed.py:1327: with pytest.raises(TypeError): test_distributed.py-1328- input_tensor1 = [ test_distributed.py-1329- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1330- ms.Tensor(np.zeros([3, 3]).astype(np.int32)), test_distributed.py-1331- ] test_distributed.py-1332- reduce_scatter(output_tensor, input_tensor1) test_distributed.py-1333- output_tensor = ms.Tensor(np.zeros([1, 3]).astype(np.float32)) test_distributed.py:1334: with pytest.raises(TypeError): test_distributed.py-1335- reduce_scatter(output_tensor, input_tensor) test_distributed.py-1336- _pynative_executor.sync() test_distributed.py-1337- output_tensor = ms.Tensor(np.zeros([3, 1]).astype(np.float32)) test_distributed.py:1338: with pytest.raises(TypeError): test_distributed.py-1339- reduce_scatter(output_tensor, input_tensor) test_distributed.py-1340- _pynative_executor.sync() test_distributed.py-1341- output_tensor = ms.Tensor(np.zeros([3, 3]).astype(np.int32)) test_distributed.py:1342: with pytest.raises(TypeError): 
test_distributed.py-1343- reduce_scatter(output_tensor, input_tensor) test_distributed.py-1344- _pynative_executor.sync() test_distributed.py-1345- test_distributed.py-1346- test_distributed.py-1347-def test_hccl_reduce_scatter_diff_shape(): -- test_distributed.py-1372- assert output_handle is not None test_distributed.py-1373- output_handle.wait() test_distributed.py-1374- assert np.allclose(output_tensor.asnumpy(), expect_output) test_distributed.py-1375- # output tensor shape not match real op output. test_distributed.py-1376- output_tensor = ms.Tensor(np.zeros([size+1]).astype(np.int32)) test_distributed.py:1377: with pytest.raises(TypeError): test_distributed.py-1378- reduce_scatter(output_tensor, input_list) test_distributed.py-1379- _pynative_executor.sync() test_distributed.py-1380- test_distributed.py-1381- test_distributed.py-1382-def test_hccl_gather(): -- test_distributed.py-1431- ) test_distributed.py-1432- assert np.allclose( test_distributed.py-1433- output_tensor1[1].asnumpy(), except_output_tensor[1].asnumpy() test_distributed.py-1434- ) test_distributed.py-1435- # 异常场景 test_distributed.py:1436: with pytest.raises(TypeError): test_distributed.py-1437- gather(1) test_distributed.py:1438: with pytest.raises(TypeError): test_distributed.py-1439- gather(input_tensor, output_tensor, group=1) test_distributed.py:1440: with pytest.raises(TypeError): test_distributed.py-1441- gather(input_tensor, output_tensor, dst="test") test_distributed.py:1442: with pytest.raises(TypeError): test_distributed.py-1443- gather(input_tensor, output_tensor, async_op="test") test_distributed.py:1444: with pytest.raises(TypeError): test_distributed.py-1445- gather([1], output_tensor) test_distributed.py:1446: with pytest.raises(TypeError): test_distributed.py-1447- gather(input_tensor, [1]) test_distributed.py:1448: with pytest.raises(TypeError): test_distributed.py-1449- output_tensor1 = [ test_distributed.py-1450- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1451- ms.Tensor(np.zeros([3, 3]).astype(np.int32)), test_distributed.py-1452- ] test_distributed.py-1453- gather(input_tensor, output_tensor1) test_distributed.py:1454: with pytest.raises(TypeError): test_distributed.py-1455- output_tensor1 = [ test_distributed.py-1456- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1457- ms.Tensor(np.zeros([1, 3]).astype(np.float32)), test_distributed.py-1458- ] test_distributed.py-1459- gather(input_tensor, output_tensor1) test_distributed.py:1460: with pytest.raises(TypeError): test_distributed.py-1461- output_tensor = [] test_distributed.py-1462- for _ in range(size): test_distributed.py-1463- output_tensor.append(ms.Tensor(np.zeros([3, 3]).astype(np.int32))) test_distributed.py-1464- gather(input_tensor, output_tensor, dst=rank) test_distributed.py-1465- _pynative_executor.sync() -- test_distributed.py-1502- output_handle = scatter(output_tensor1, input_tensor1, src=0, group=name) test_distributed.py-1503- except_output_tensor = ms.Tensor(np.ones([3, 3]).astype(np.float32)) test_distributed.py-1504- assert output_handle is None test_distributed.py-1505- assert np.allclose(output_tensor1.asnumpy(), except_output_tensor.asnumpy()) test_distributed.py-1506- # 异常场景 test_distributed.py:1507: with pytest.raises(TypeError): test_distributed.py-1508- scatter(1) test_distributed.py:1509: with pytest.raises(TypeError): test_distributed.py-1510- scatter(output_tensor, input_tensor, src="test") test_distributed.py:1511: with pytest.raises(TypeError): 
test_distributed.py-1512- scatter(output_tensor, input_tensor, group=1) test_distributed.py:1513: with pytest.raises(TypeError): test_distributed.py-1514- scatter(output_tensor, input_tensor, async_op="test") test_distributed.py:1515: with pytest.raises(TypeError): test_distributed.py-1516- scatter([1], input_tensor) test_distributed.py:1517: with pytest.raises(TypeError): test_distributed.py-1518- scatter(output_tensor, [1]) test_distributed.py:1519: with pytest.raises(TypeError): test_distributed.py-1520- input_tensor1 = [ test_distributed.py-1521- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1522- ms.Tensor(np.zeros([3, 3]).astype(np.int32)), test_distributed.py-1523- ] test_distributed.py-1524- scatter(output_tensor, input_tensor1) test_distributed.py-1525- _pynative_executor.sync() test_distributed.py:1526: with pytest.raises(TypeError): test_distributed.py-1527- input_tensor1 = [ test_distributed.py-1528- ms.Tensor(np.zeros([3, 3]).astype(np.float32)), test_distributed.py-1529- ms.Tensor(np.zeros([1, 3]).astype(np.float32)), test_distributed.py-1530- ] test_distributed.py-1531- scatter(output_tensor, input_tensor1) test_distributed.py-1532- _pynative_executor.sync() test_distributed.py-1533- output_tensor = ms.Tensor(np.zeros([1, 3]).astype(np.float32)) test_distributed.py:1534: with pytest.raises(TypeError): test_distributed.py-1535- scatter(output_tensor, input_tensor, src=rank) test_distributed.py-1536- _pynative_executor.sync() test_distributed.py-1537- output_tensor = ms.Tensor(np.zeros([3, 1]).astype(np.float32)) test_distributed.py:1538: with pytest.raises(TypeError): test_distributed.py-1539- scatter(output_tensor, input_tensor, src=rank) test_distributed.py-1540- _pynative_executor.sync() test_distributed.py-1541- output_tensor = ms.Tensor(np.zeros([3, 3]).astype(np.int32)) test_distributed.py:1542: with pytest.raises(TypeError): test_distributed.py-1543- scatter(output_tensor, input_tensor, src=rank) test_distributed.py-1544- _pynative_executor.sync() test_distributed.py-1545- test_distributed.py-1546- test_distributed.py-1547-def test_hccl_scalar(): -- test_mint_chunk.py-110- test_mint_chunk.py-111- if context_mode == ms.GRAPH_MODE: test_mint_chunk.py-112- dims = 2 test_mint_chunk.py-113- test_cell.set_inputs(input_dyn, chunks, dims) test_mint_chunk.py-114- input_tensor = Tensor(np.arange(24).reshape((4, 2, 3)).astype(np.int64)) test_mint_chunk.py:115: with pytest.raises(RuntimeError): test_mint_chunk.py-116- _ = test_cell(input_tensor, chunks, dims) test_mint_chunk.py-117- _pynative_executor.sync() test_mint_chunk.py-118- test_mint_chunk.py-119- test_mint_chunk.py-120-@arg_mark(plat_marks=['platform_ascend'], level_mark='level1', card_mark='onecard', essential_mark='unessential') -- test_mint_chunk.py-130- chunks = 3 test_mint_chunk.py-131- dims = 0 test_mint_chunk.py-132- test_cell = test_utils.to_cell_obj(mint.chunk) test_mint_chunk.py-133- test_cell.set_inputs(input_dyn, chunks, dims) test_mint_chunk.py-134- input_tensor = Tensor(np.arange(24).reshape((4, 2, 3)).astype(np.int64)) test_mint_chunk.py:135: with pytest.raises(RuntimeError): test_mint_chunk.py-136- _ = test_cell(input_tensor, chunks, dims) test_mint_chunk.py-137- _pynative_executor.sync() test_mint_chunk.py-138- test_mint_chunk.py-139- test_mint_chunk.py-140-@arg_mark(plat_marks=['platform_ascend'], level_mark='level1', card_mark='onecard', essential_mark='unessential') -- test_mint_chunk.py-176- chunks = 3 test_mint_chunk.py-177- dims = 1 test_mint_chunk.py-178- test_cell = 
test_utils.to_cell_obj(ops.grad(mint.chunk, (0,))) test_mint_chunk.py-179- test_cell.set_inputs(input_dyn, chunks, dims) test_mint_chunk.py-180- input_tensor = Tensor(np.arange(24).reshape((4, 2, 3)).astype(np.float64)) test_mint_chunk.py:181: with pytest.raises(RuntimeError): test_mint_chunk.py-182- _ = test_cell(input_tensor, chunks, dims) test_mint_chunk.py-183- _pynative_executor.sync() test_mint_chunk.py-184- test_mint_chunk.py-185- test_mint_chunk.py-186-@arg_mark(plat_marks=['platform_ascend'], level_mark='level1', card_mark='onecard', essential_mark='unessential') -- test_mint_chunk.py-196- chunks = 2 test_mint_chunk.py-197- dims = 0 test_mint_chunk.py-198- expect = [np.array(np.arange(10).reshape((5, 2)), dtype=np.float32), test_mint_chunk.py-199- np.array(np.arange(10, 20).reshape((5, 2)), dtype=np.float32)] test_mint_chunk.py-200- if context_mode == ms.GRAPH_MODE: test_mint_chunk.py:201: with pytest.raises(RuntimeError): test_mint_chunk.py-202- _ = chunk_forward_func(x, ms.mutable(chunks), dims) test_mint_chunk.py-203- _pynative_executor.sync() test_mint_chunk.py-204- test_mint_chunk.py:205: with pytest.raises(RuntimeError): test_mint_chunk.py-206- _ = chunk_forward_func(x, chunks, ms.mutable(dims)) test_mint_chunk.py-207- else: test_mint_chunk.py-208- out = chunk_forward_func(x, ms.mutable(chunks), ms.mutable(dims)) test_mint_chunk.py-209- for res, exp in zip(out, expect): test_mint_chunk.py-210- assert np.allclose(res.asnumpy(), exp) -- test_comm_cpu.py-43-) test_comm_cpu.py-44-context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend") test_comm_cpu.py-45-rank = get_rank() test_comm_cpu.py-46-size = get_world_size() test_comm_cpu.py-47-if size % 2 != 0: test_comm_cpu.py:48: raise RuntimeError("Group size should be divided by 2 exactly.") test_comm_cpu.py-49- test_comm_cpu.py-50- test_comm_cpu.py-51-def test_cpu_new_group(): test_comm_cpu.py-52- """ test_comm_cpu.py-53- Feature: test distributed op -- overload/test_mint_max.py-55- output = net(x) overload/test_mint_max.py-56- expect_output = 25.0 overload/test_mint_max.py-57- assert np.allclose(output.asnumpy(), expect_output) overload/test_mint_max.py-58- overload/test_mint_max.py-59- x_np = np.array([[1., 25., 5., 7.], [4., 11., 6., 21.]]).astype(np.float32) overload/test_mint_max.py:60: with pytest.raises(TypeError): overload/test_mint_max.py-61- net(x_np) overload/test_mint_max.py-62- _pynative_executor.sync() overload/test_mint_max.py-63- overload/test_mint_max.py-64- overload/test_mint_max.py-65-@arg_mark(plat_marks=['platform_ascend'], -- overload/test_mint_max.py-82- expect_output0 = np.array([[25.], [21.]], dtype=np.float32) overload/test_mint_max.py-83- expect_output1 = np.array([[1], [3]], dtype=np.float32) overload/test_mint_max.py-84- assert np.allclose(output[0].asnumpy(), expect_output0) overload/test_mint_max.py-85- assert np.allclose(output[1].asnumpy(), expect_output1) overload/test_mint_max.py-86- overload/test_mint_max.py:87: with pytest.raises(TypeError): overload/test_mint_max.py-88- net(x_np, dim=-1, keepdim=True) overload/test_mint_max.py-89- _pynative_executor.sync() overload/test_mint_max.py-90- overload/test_mint_max.py:91: with pytest.raises(TypeError): overload/test_mint_max.py-92- net(x, dim=None, keepdim=True) overload/test_mint_max.py-93- _pynative_executor.sync() overload/test_mint_max.py-94- overload/test_mint_max.py:95: with pytest.raises(TypeError): overload/test_mint_max.py-96- net(x_np, dim=None, keepdim=False) overload/test_mint_max.py-97- _pynative_executor.sync() 
overload/test_mint_max.py-98- overload/test_mint_max.py:99: with pytest.raises(TypeError): overload/test_mint_max.py-100- net(x_np, dim=-1, keepdim=-1) overload/test_mint_max.py-101- _pynative_executor.sync() overload/test_mint_max.py-102- overload/test_mint_max.py-103- overload/test_mint_max.py-104-@arg_mark(plat_marks=['platform_ascend'], -- overload/test_mint_min.py-55- output = net(x) overload/test_mint_min.py-56- expect_output = 1.0 overload/test_mint_min.py-57- assert np.allclose(output.asnumpy(), expect_output) overload/test_mint_min.py-58- overload/test_mint_min.py-59- x_np = np.array([[1., 25., 5., 7.], [4., 11., 6., 21.]]).astype(np.float32) overload/test_mint_min.py:60: with pytest.raises(TypeError): overload/test_mint_min.py-61- net(x_np) overload/test_mint_min.py-62- _pynative_executor.sync() overload/test_mint_min.py-63- overload/test_mint_min.py-64- overload/test_mint_min.py-65-@arg_mark(plat_marks=['platform_ascend'], -- overload/test_mint_min.py-82- expect_output0 = np.array([[1.], [6.]], dtype=np.float32) overload/test_mint_min.py-83- expect_output1 = np.array([[0], [2]], dtype=np.float32) overload/test_mint_min.py-84- assert np.allclose(output[0].asnumpy(), expect_output0) overload/test_mint_min.py-85- assert np.allclose(output[1].asnumpy(), expect_output1) overload/test_mint_min.py-86- overload/test_mint_min.py:87: with pytest.raises(TypeError): overload/test_mint_min.py-88- net(x_np, dim=-1, keepdim=True) overload/test_mint_min.py-89- _pynative_executor.sync() overload/test_mint_min.py-90- overload/test_mint_min.py:91: with pytest.raises(TypeError): overload/test_mint_min.py-92- net(x, dim=None, keepdim=True) overload/test_mint_min.py-93- _pynative_executor.sync() overload/test_mint_min.py-94- overload/test_mint_min.py:95: with pytest.raises(TypeError): overload/test_mint_min.py-96- net(x_np, dim=None, keepdim=False) overload/test_mint_min.py-97- _pynative_executor.sync() overload/test_mint_min.py-98- overload/test_mint_min.py:99: with pytest.raises(TypeError): overload/test_mint_min.py-100- net(x_np, dim=-1, keepdim=-1) overload/test_mint_min.py-101- _pynative_executor.sync() overload/test_mint_min.py-102- overload/test_mint_min.py-103- overload/test_mint_min.py-104-@arg_mark(plat_marks=['platform_ascend'], -- overload/test_mint_clamp.py-64- """ overload/test_mint_clamp.py-65- ms.set_context(mode=mode) overload/test_mint_clamp.py-66- x_np = np.array([[1., 25., 5., 7.], [4., 11., 6., 21.]]).astype(np.float32) overload/test_mint_clamp.py-67- x = Tensor(x_np, ms.float32) overload/test_mint_clamp.py-68- net = ClampNet() overload/test_mint_clamp.py:69: with pytest.raises(TypeError): overload/test_mint_clamp.py-70- net(x, Tensor(5, ms.float32), 20) overload/test_mint_clamp.py-71- _pynative_executor.sync() overload/test_mint_clamp.py-72- overload/test_mint_clamp.py-73- overload/test_mint_clamp.py-74-@arg_mark(plat_marks=['platform_ascend'], -- overload/test_mint_clamp.py-84- """ overload/test_mint_clamp.py-85- ms.set_context(mode=mode) overload/test_mint_clamp.py-86- x_np = np.array([[1., 25., 5., 7.], [4., 11., 6., 21.]]).astype(np.float32) overload/test_mint_clamp.py-87- x = Tensor(x_np, ms.float32) overload/test_mint_clamp.py-88- net = ClampNet() overload/test_mint_clamp.py:89: with pytest.raises(ValueError) as error_info: overload/test_mint_clamp.py-90- net(x, 5, True) overload/test_mint_clamp.py-91- _pynative_executor.sync() overload/test_mint_clamp.py-92- assert "For Clamp, the dtype of 'input', 'min' and 'max' must not be bool." 
in str(error_info.value) -- test_full.py-132- value = Tensor(np.array([3]).astype(np.float32)) test_full.py-133- out = test_cell(size, value, ms.int32) test_full.py-134- expect_output = np.full((2, 3), 3, np.int32) test_full.py-135- assert np.allclose(out.asnumpy(), expect_output) test_full.py-136- test_full.py:137: with pytest.raises(ValueError): test_full.py-138- value = Tensor(np.array([3, 4]).astype(np.float32)) test_full.py-139- _ = test_cell(size, value, ms.int32) test_full.py-140- _pynative_executor.sync() test_full.py-141- test_full.py-142- -- test_full.py-160- assert np.allclose(dvalue.asnumpy(), expect_dvalue) test_full.py-161- test_full.py-162- # The forward graph will be remove by optimization pass in GRAPH_MODE, since the input value is not used during test_full.py-163- # backward. This will likely cause input validation loss in dynamic scene. test_full.py-164- if context_mode == ms.PYNATIVE_MODE: test_full.py:165: with pytest.raises(ValueError): test_full.py-166- value = Tensor(np.array([2, 3]).astype(np.float32)) test_full.py-167- _ = test_cell(size, value, ms.int32) test_full.py-168- _pynative_executor.sync() -- msrun_log/worker_1.log-14-[WARNING] DISTRIBUTED(1411319,ffff8a0eeec0,python):2025-07-15-12:09:36.738.103 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1 msrun_log/worker_1.log-15-[WARNING] PS(1411319,ffff8a0eeec0,python):2025-07-15-12:09:36.738.913 [mindspore/ccsrc/ps/core/file_configuration.cc:24] Initialize] The file: is not exist. msrun_log/worker_1.log-16-[WARNING] DEVICE(1411319,ffff8a0eeec0,python):2025-07-15-12:09:36.738.983 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_node.cc:33] Start] Failed to initialize the configuration for this mccl collective node. msrun_log/worker_1.log-17-collected 0 items / 1 error msrun_log/worker_1.log-18- msrun_log/worker_1.log:19:==================================== ERRORS ==================================== msrun_log/worker_1.log:20:____________________ ERROR collecting test_syncbatchnorm.py ____________________ msrun_log/worker_1.log-21-test_syncbatchnorm.py:9: in msrun_log/worker_1.log-22- init() msrun_log/worker_1.log-23-/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: in init msrun_log/worker_1.log-24- init_hccl() msrun_log/worker_1.log:25:E RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. 
msrun_log/worker_1.log-26-E msrun_log/worker_1.log-27-E ---------------------------------------------------- msrun_log/worker_1.log-28-E - C++ Call Stack: (For framework developers) msrun_log/worker_1.log-29-E ---------------------------------------------------- msrun_log/worker_1.log-30-E mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice -- msrun_log/worker_1.log-92-../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48 msrun_log/worker_1.log-93- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute msrun_log/worker_1.log-94- @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d_reduce") msrun_log/worker_1.log-95- msrun_log/worker_1.log-96-../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52 msrun_log/worker_1.log:97:ERROR: not found: /home/jenkins/mindspore/testcases/testcases/tests/st/mint/test_syncbatchnorm.py::test_sync_batch_norm_backward_world_size_2_channel_3_dim_4 msrun_log/worker_1.log-98-(no name '/home/jenkins/mindspore/testcases/testcases/tests/st/mint/test_syncbatchnorm.py::test_sync_batch_norm_backward_world_size_2_channel_3_dim_4' in any of []) msrun_log/worker_1.log-99- msrun_log/worker_1.log-100- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute msrun_log/worker_1.log-101- @fusion_manager.register("fake_quant_perchannel") msrun_log/worker_1.log-102- -- msrun_log/worker_1.log-120- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute msrun_log/worker_1.log-121- @fusion_manager.register("minmax_update_perlayer") msrun_log/worker_1.log-122- msrun_log/worker_1.log-123--- Docs: https://docs.pytest.org/en/stable/warnings.html msrun_log/worker_1.log-124-=========================== short test summary info ============================ msrun_log/worker_1.log:125:ERROR test_syncbatchnorm.py - RuntimeError: Call aclrtSetDevice failed, ret[5... msrun_log/worker_1.log-126-================== 22 warnings, 1 error in 168.84s (0:02:48) =================== msrun_log/worker_1.log-127-[WARNING] DEVICE(1411319,ffff8a0eeec0,python):2025-07-15-12:12:18.815.008 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed msrun_log/worker_1.log-128-[INFO] PS(1411319,ffff0bffefa0,python):2025-07-15-12:12:19.383.838 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success! msrun_log/worker_1.log-129-[INFO] PS(1411319,ffff20dcefa0,python):2025-07-15-12:12:19.384.322 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success! 
-- msrun_log/scheduler.log-90-[WARNING] DISTRIBUTED(1411313,ffff88daeec0,python):2025-07-15-12:12:41.334.668 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... msrun_log/scheduler.log-91-[WARNING] DISTRIBUTED(1411313,ffff88daeec0,python):2025-07-15-12:12:46.334.883 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 1 alive nodes. msrun_log/scheduler.log-92-[WARNING] DISTRIBUTED(1411313,ffff88daeec0,python):2025-07-15-12:12:46.334.928 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... msrun_log/scheduler.log-93-[WARNING] DISTRIBUTED(1411313,ffff88daeec0,python):2025-07-15-12:12:51.335.083 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 1 alive nodes. msrun_log/scheduler.log-94-[WARNING] DISTRIBUTED(1411313,ffff88daeec0,python):2025-07-15-12:12:51.335.126 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... msrun_log/scheduler.log:95:[ERROR] DISTRIBUTED(1411313,ffff212befa0,python):2025-07-15-12:12:51.349.550 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 0 is timed out. It may exit with exception, please check this node's log. msrun_log/scheduler.log:96:[ERROR] DISTRIBUTED(1411313,ffff88daeec0,python):2025-07-15-12:12:56.335.338 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes. msrun_log/scheduler.log-97-collected 0 items / 1 error msrun_log/scheduler.log-98- msrun_log/scheduler.log:99:==================================== ERRORS ==================================== msrun_log/scheduler.log:100:____________________ ERROR collecting test_syncbatchnorm.py ____________________ msrun_log/scheduler.log-101-test_syncbatchnorm.py:9: in msrun_log/scheduler.log-102- init() msrun_log/scheduler.log-103-/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:213: in init msrun_log/scheduler.log-104- init_cluster() msrun_log/scheduler.log:105:E RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{0}, worker 0 is the first one timed out, please check its log. 
msrun_log/scheduler.log-106-E msrun_log/scheduler.log-107-E ---------------------------------------------------- msrun_log/scheduler.log-108-E - C++ Call Stack: (For framework developers) msrun_log/scheduler.log-109-E ---------------------------------------------------- msrun_log/scheduler.log-110-E mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState -- msrun_log/scheduler.log-172-../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48 msrun_log/scheduler.log-173- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute msrun_log/scheduler.log-174- @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d_reduce") msrun_log/scheduler.log-175- msrun_log/scheduler.log-176-../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52 msrun_log/scheduler.log:177:ERROR: not found: /home/jenkins/mindspore/testcases/testcases/tests/st/mint/test_syncbatchnorm.py::test_sync_batch_norm_backward_world_size_2_channel_3_dim_4 msrun_log/scheduler.log-178-(no name '/home/jenkins/mindspore/testcases/testcases/tests/st/mint/test_syncbatchnorm.py::test_sync_batch_norm_backward_world_size_2_channel_3_dim_4' in any of []) msrun_log/scheduler.log-179- msrun_log/scheduler.log-180- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute msrun_log/scheduler.log-181- @fusion_manager.register("fake_quant_perchannel") msrun_log/scheduler.log-182- -- msrun_log/scheduler.log-200- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute msrun_log/scheduler.log-201- @fusion_manager.register("minmax_update_perlayer") msrun_log/scheduler.log-202- msrun_log/scheduler.log-203--- Docs: https://docs.pytest.org/en/stable/warnings.html msrun_log/scheduler.log-204-=========================== short test summary info ============================ msrun_log/scheduler.log:205:ERROR test_syncbatchnorm.py - RuntimeError: The total number of timed out nod... 
msrun_log/scheduler.log-206-================== 22 warnings, 1 error in 206.97s (0:03:26) =================== -- test_tcp_store.py-25-init_process_group() test_tcp_store.py-26-context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend") test_tcp_store.py-27-this_rank = get_rank() test_tcp_store.py-28-size = get_world_size() test_tcp_store.py-29-if size % 2 != 0: test_tcp_store.py:30: raise RuntimeError("Group size should be divided by 2 exactly.") test_tcp_store.py-31- test_tcp_store.py-32-context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend") test_tcp_store.py-33-def test_TCPStore(): test_tcp_store.py-34- """ test_tcp_store.py-35- Feature: test distributed op -- test_tcp_store.py-38- """ test_tcp_store.py-39- TCPStore() test_tcp_store.py-40- TCPStore("11") test_tcp_store.py-41- test_tcp_store.py-42- test_tcp_store.py:43:def test_TCPStore_TypeError(): test_tcp_store.py-44- """ test_tcp_store.py-45- Feature: test distributed op test_tcp_store.py-46- Description: test tcp store in python native test_tcp_store.py-47- Expectation: success test_tcp_store.py-48- """ test_tcp_store.py-49- store = TCPStore("") test_tcp_store.py:50: with pytest.raises(TypeError): test_tcp_store.py-51- store.set("key1", 1) test_tcp_store.py:52: with pytest.raises(TypeError): test_tcp_store.py-53- store.set(2, "{'a':1}") test_tcp_store.py:54: with pytest.raises(TypeError): test_tcp_store.py-55- store.set("key3", [1, 2, 3]) test_tcp_store.py:56: with pytest.raises(TypeError): test_tcp_store.py-57- store.set("key5") test_tcp_store.py:58: with pytest.raises(TypeError): test_tcp_store.py-59- store.delete_key(4) test_tcp_store.py:60: with pytest.raises(TypeError): test_tcp_store.py-61- store.get(2) test_tcp_store.py-62- test_tcp_store.py-63- test_tcp_store.py-64-def test_set(): test_tcp_store.py-65- """ -- worker_1.log-15-platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 worker_1.log-16-rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/mint worker_1.log-17-plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 worker_1.log-18-collected 0 items / 1 error worker_1.log-19- worker_1.log:20:==================================== ERRORS ==================================== worker_1.log:21:_____________________ ERROR collecting test_distributed.py _____________________ worker_1.log-22-test_distributed.py:55: in worker_1.log-23- init_process_group() worker_1.log-24-/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/mint/distributed/distributed.py:470: in init_process_group worker_1.log-25- init(backend) worker_1.log-26-/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: in init worker_1.log-27- init_hccl() worker_1.log:28:E RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. 
worker_1.log-29-E worker_1.log-30-E ---------------------------------------------------- worker_1.log-31-E - C++ Call Stack: (For framework developers) worker_1.log-32-E ---------------------------------------------------- worker_1.log-33-E mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice -- worker_1.log-120- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute worker_1.log-121- @fusion_manager.register("minmax_update_perlayer") worker_1.log-122- worker_1.log-123--- Docs: https://docs.pytest.org/en/stable/warnings.html worker_1.log-124-=========================== short test summary info ============================ worker_1.log:125:ERROR test_distributed.py - RuntimeError: Call aclrtSetDevice failed, ret[507... worker_1.log-126-!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!! worker_1.log-127-================== 22 warnings, 1 error in 166.67s (0:02:46) =================== worker_1.log-128-[WARNING] DEVICE(1412849,ffff8ce9eec0,python):2025-07-15-12:16:24.881.522 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed worker_1.log-129-[INFO] PS(1412849,ffff16fdefa0,python):2025-07-15-12:16:25.459.310 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success! worker_1.log-130-[INFO] PS(1412849,ffff177eefa0,python):2025-07-15-12:16:25.459.642 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success! grep: __pycache__/test_syncbatchnorm_msrun.cpython-39-pytest-6.2.5.pyc: binary file matches grep: __pycache__/test_mint_comm_op.cpython-39-pytest-6.2.5.pyc: binary file matches grep: __pycache__/test_inner_syncbatchnorm_msrun.cpython-39-pytest-6.2.5.pyc: binary file matches grep: __pycache__/test_syncbatchnorm.cpython-39-pytest-6.2.5.pyc: binary file matches grep: __pycache__/test_distributed.cpython-39-pytest-6.2.5.pyc: binary file matches grep: __pycache__/test_inner_syncbatchnorm.cpython-39-pytest-6.2.5.pyc: binary file matches -- scheduler.log-90-[WARNING] DISTRIBUTED(1412843,ffff8429eec0,python):2025-07-15-12:16:45.561.642 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... scheduler.log-91-[WARNING] DISTRIBUTED(1412843,ffff8429eec0,python):2025-07-15-12:16:50.561.739 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 7 alive nodes. scheduler.log-92-[WARNING] DISTRIBUTED(1412843,ffff8429eec0,python):2025-07-15-12:16:50.561.777 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... scheduler.log-93-[WARNING] DISTRIBUTED(1412843,ffff8429eec0,python):2025-07-15-12:16:55.561.878 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 7 alive nodes. 
scheduler.log-94-[WARNING] DISTRIBUTED(1412843,ffff8429eec0,python):2025-07-15-12:16:55.561.920 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... scheduler.log:95:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.256 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 0 is timed out. It may exit with exception, please check this node's log. scheduler.log:96:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.322 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 2 is timed out. It may exit with exception, please check this node's log. scheduler.log:97:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.350 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log. scheduler.log:98:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.375 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 4 is timed out. It may exit with exception, please check this node's log. scheduler.log:99:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.448 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 5 is timed out. It may exit with exception, please check this node's log. scheduler.log:100:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.477 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 6 is timed out. It may exit with exception, please check this node's log. scheduler.log:101:[ERROR] DISTRIBUTED(1412843,ffff17ffefa0,python):2025-07-15-12:16:57.078.501 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 7 is timed out. It may exit with exception, please check this node's log. scheduler.log:102:[ERROR] DISTRIBUTED(1412843,ffff8429eec0,python):2025-07-15-12:17:00.562.013 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 7 abnormal compute graph nodes. scheduler.log-103-============================= test session starts ============================== scheduler.log-104-platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1 scheduler.log-105-rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/mint scheduler.log-106-plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0 scheduler.log-107-collected 0 items / 1 error scheduler.log-108- scheduler.log:109:==================================== ERRORS ==================================== scheduler.log:110:_____________________ ERROR collecting test_distributed.py _____________________ scheduler.log-111-test_distributed.py:55: in scheduler.log-112- init_process_group() scheduler.log-113-/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/mint/distributed/distributed.py:470: in init_process_group scheduler.log-114- init(backend) scheduler.log-115-/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:213: in init scheduler.log-116- init_cluster() scheduler.log:117:E RuntimeError: The total number of timed out node is 7. 
Timed out node list is: [const vector]{0, 2, 3, 4, 5, 6, 7}, worker 0 is the first one timed out, please check its log. scheduler.log-118-E scheduler.log-119-E ---------------------------------------------------- scheduler.log-120-E - C++ Call Stack: (For framework developers) scheduler.log-121-E ---------------------------------------------------- scheduler.log-122-E mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState -- scheduler.log-209- /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute scheduler.log-210- @fusion_manager.register("minmax_update_perlayer") scheduler.log-211- scheduler.log-212--- Docs: https://docs.pytest.org/en/stable/warnings.html scheduler.log-213-=========================== short test summary info ============================ scheduler.log:214:ERROR test_distributed.py - RuntimeError: The total number of timed out node ... scheduler.log-215-!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!! scheduler.log-216-================== 22 warnings, 1 error in 203.28s (0:03:23) =================== -- test_comm_object.py-34-) test_comm_object.py-35-context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend") test_comm_object.py-36-rank = get_rank() test_comm_object.py-37-size = get_world_size() test_comm_object.py-38-if size % 2 != 0: test_comm_object.py:39: raise RuntimeError("Group size should be divided by 2 exactly.") test_comm_object.py-40- test_comm_object.py-41- test_comm_object.py-42-def test_all_gather_object(): test_comm_object.py-43- """ test_comm_object.py-44- Feature: test distributed op -- test_comm_object.py-53- all_gather_object(object_gather_list, obj) test_comm_object.py-54- assert len(object_gather_list) == size test_comm_object.py-55- for i in range(size): test_comm_object.py-56- assert object_gather_list[i] == str(i) test_comm_object.py-57- #异常用例 test_comm_object.py:58: with pytest.raises(TypeError): test_comm_object.py-59- all_gather_object(object_gather_list, obj, group=1) test_comm_object.py:60: with pytest.raises(TypeError): test_comm_object.py-61- all_gather_object(None, obj) test_comm_object.py:62: with pytest.raises(TypeError): test_comm_object.py-63- all_gather_object({None}, obj) test_comm_object.py:64: with pytest.raises(TypeError): test_comm_object.py-65- all_gather_object({}, obj) test_comm_object.py:66: with pytest.raises(TypeError): test_comm_object.py-67- all_gather_object(1, obj) test_comm_object.py-68- test_comm_object.py-69- test_comm_object.py-70-def test_broadcast_object_list(): test_comm_object.py-71- """ -- test_comm_object.py-82- broadcast_object_list(object_list) test_comm_object.py-83- assert len(object_list) == size test_comm_object.py-84- for i in range(size): test_comm_object.py-85- assert object_list[i] == str(i) test_comm_object.py-86- #异常用例 test_comm_object.py:87: with pytest.raises(TypeError): test_comm_object.py-88- broadcast_object_list(object_list, group=1) test_comm_object.py:89: with pytest.raises(TypeError): test_comm_object.py-90- broadcast_object_list(object_list, src="1") test_comm_object.py:91: with pytest.raises(TypeError): test_comm_object.py-92- broadcast_object_list(None) test_comm_object.py:93: with pytest.raises(TypeError): test_comm_object.py-94- broadcast_object_list({}) test_comm_object.py:95: with 
test_comm_object.py-96-        broadcast_object_list(1)
test_comm_object.py-97-
test_comm_object.py-98-
test_comm_object.py-99-def test_gather_object():
test_comm_object.py-100-    """
--
test_comm_object.py-110-    assert len(object_gather_list) == size
test_comm_object.py-111-    if rank == 0:
test_comm_object.py-112-        for i in range(size):
test_comm_object.py-113-            assert object_gather_list[i] == str(i)
test_comm_object.py-114-    # exception cases
test_comm_object.py:115:    with pytest.raises(TypeError):
test_comm_object.py-116-        gather_object(obj, object_gather_list, group=1)
test_comm_object.py:117:    with pytest.raises(TypeError):
test_comm_object.py-118-        gather_object(obj, object_gather_list, dst="1")
test_comm_object.py-119-    if rank == 0:
test_comm_object.py:120:        with pytest.raises(TypeError):
test_comm_object.py-121-            gather_object(obj, None)
test_comm_object.py:122:        with pytest.raises(TypeError):
test_comm_object.py-123-            gather_object(obj, {})
test_comm_object.py:124:        with pytest.raises(TypeError):
test_comm_object.py-125-            gather_object(obj, 1)
test_comm_object.py-126-
test_comm_object.py-127-
test_comm_object.py-128-def test_hccl_scatter_object_list():
test_comm_object.py-129-    """
--
test_comm_object.py-152-        scatter_object_input_list.append(str(i))
test_comm_object.py-153-    scatter_object_list(scatter_object_output_list, scatter_object_input_list, src=2, group=group)
test_comm_object.py-154-    assert len(scatter_object_input_list) == 2
test_comm_object.py-155-    assert scatter_object_output_list[0] == str(rank-2)
test_comm_object.py-156-    # exception cases
test_comm_object.py:157:    with pytest.raises(TypeError):
test_comm_object.py-158-        scatter_object_list(scatter_object_output_list, scatter_object_input_list, group=1)
test_comm_object.py:159:    with pytest.raises(TypeError):
test_comm_object.py-160-        scatter_object_list(scatter_object_output_list, scatter_object_input_list, src="1")
test_comm_object.py-161-    if rank == 0:
test_comm_object.py:162:        with pytest.raises(TypeError):
test_comm_object.py-163-            scatter_object_list(scatter_object_output_list, None)
test_comm_object.py:164:        with pytest.raises(TypeError):
test_comm_object.py-165-            scatter_object_list(scatter_object_output_list, {})
test_comm_object.py:166:        with pytest.raises(TypeError):
test_comm_object.py-167-            scatter_object_list(scatter_object_output_list, {None})
test_comm_object.py:168:        with pytest.raises(TypeError):
test_comm_object.py-169-            scatter_object_list(None, scatter_object_input_list)
test_comm_object.py:170:        with pytest.raises(TypeError):
test_comm_object.py-171-            scatter_object_list({}, scatter_object_input_list)
Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .
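The msrun traceback ends with a generic "check logs in directory" message; the actual root cause (the first worker to time out, per the scheduler error above) has to be pulled out of the per-rank logs. The triage helper below is hypothetical; the worker_*.log naming in the current directory is an assumption about this CI layout, not something msrun guarantees.

    # Hypothetical helper for triaging a failed msrun job: print the first
    # ERROR/Traceback line from each per-rank log. The worker_*.log naming
    # is assumed for this CI layout.
    import glob
    import re

    ERROR_PATTERN = re.compile(r"\[ERROR\]|Traceback")

    def first_error(path):
        # Return (line number, line text) of the first error-looking line, or None.
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if ERROR_PATTERN.search(line):
                    return lineno, line.rstrip()
        return None

    for log in sorted(glob.glob("worker_*.log")):
        hit = first_error(log)
        if hit:
            print(f"{log}: line {hit[0]}: {hit[1]}")
        else:
            print(f"{log}: no error line found")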
F
=================================== FAILURES ===================================
______________________________ test_hccl_mint_ops ______________________________

    @arg_mark(plat_marks=["platform_ascend910b"], level_mark="level1", card_mark="allcards", essential_mark="essential")
    def test_hccl_mint_ops():
        """
        Feature: mpi run 8P case
        Description: mpi run 8P case
        Expectation: success
        """
        return_code = os.system(
            "msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 --master_port=10666 --join=True "\
            "pytest -s test_distributed.py"
        )
>       assert return_code == 0
E       assert 256 == 0

test_mint_comm_op.py:29: AssertionError
=========================== short test summary info ============================
FAILED test_mint_comm_op.py::test_hccl_mint_ops - assert 256 == 0
======================== 1 failed in 212.44s (0:03:32) =========================
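A note on reading the final assertion: on Linux, os.system() returns the raw wait status rather than the child's exit code, so "assert 256 == 0" means the inner msrun/pytest invocation exited with status 1 (the collection error shown earlier), not a distinct failure code 256. A minimal check:

    # Decode the 256 seen in the assertion above into an ordinary exit code.
    import os

    status = 256                                # raw wait status returned by os.system()
    print(os.waitstatus_to_exitcode(status))    # -> 1 (Python 3.9+)
    print(os.WEXITSTATUS(status))               # -> 1 (POSIX equivalent)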