============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel, configfile: ../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_checkpoints_convert.py 
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_0.log. Environment variable [RANK_ID=0] is exported.
Start worker process with rank id:1, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_1.log. Environment variable [RANK_ID=1] is exported.
Start worker process with rank id:2, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_2.log. Environment variable [RANK_ID=2] is exported.
Start worker process with rank id:3, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_3.log. Environment variable [RANK_ID=3] is exported.
Start worker process with rank id:4, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_4.log. Environment variable [RANK_ID=4] is exported.
Start worker process with rank id:5, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_5.log. Environment variable [RANK_ID=5] is exported.
Start worker process with rank id:6, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_6.log. Environment variable [RANK_ID=6] is exported.
Start worker process with rank id:7, log file:./test_checkpoints_convert_by_layout/msrun_log/worker_7.log. Environment variable [RANK_ID=7] is exported.
[WARNING] ME(1442327:281473739124416,MainProcess):2025-07-15-13:48:54.628.591 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442412:281473207496384,MainProcess):2025-07-15-13:48:59.977.811 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:48:59.980.998 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39510, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:48:59.981.096 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:10805, retry to reconnect(1/1)...
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442420:281473383395008,MainProcess):2025-07-15-13:49:00.126.347 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(1442406:281472942534336,MainProcess):2025-07-15-13:49:00.126.384 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442406,ffff1f09efa0,python):2025-07-15-13:49:00.129.171 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39518 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:00.129.177 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39518, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:00.129.176 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39520, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442420,ffff394fefa0,python):2025-07-15-13:49:00.129.184 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39520 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:00.129.244 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:00.129.365 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39536, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:00.129.413 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1442406,ffff200befa0,python):2025-07-15-13:49:00.129.411 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39536 to 127.0.0.1:10805 is successfully created. System errno: Success
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442416:281472905637568,MainProcess):2025-07-15-13:49:00.160.489 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442416,ffff1cd7efa0,python):2025-07-15-13:49:00.163.327 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39544 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:00.163.337 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39544, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:00.163.531 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39556, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442416,ffff1dd9efa0,python):2025-07-15-13:49:00.163.559 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39556 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:00.163.573 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442424:281473345515200,MainProcess):2025-07-15-13:49:00.260.526 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442424,ffff370defa0,python):2025-07-15-13:49:00.263.333 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39564 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:00.263.331 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39564, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:00.263.545 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39580, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442424,ffff380fefa0,python):2025-07-15-13:49:00.263.579 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39580 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:00.263.590 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442429:281472956952256,MainProcess):2025-07-15-13:49:00.267.585 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442429,ffff1b7eefa0,python):2025-07-15-13:49:00.270.365 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39588 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:00.270.365 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39588, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:00.270.639 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39590, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442429,ffff20e7efa0,python):2025-07-15-13:49:00.270.672 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39590 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:00.270.686 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442435:281473626205888,MainProcess):2025-07-15-13:49:00.440.044 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442435,ffff437eefa0,python):2025-07-15-13:49:00.442.881 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39596 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:00.442.885 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39596, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:00.443.155 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39598, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442435,ffff48caefa0,python):2025-07-15-13:49:00.443.181 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39598 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:00.443.202 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:00.481.234 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:00.481.288 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/auto_parallel
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

checkpoints_convert.py 
[WARNING] ME(1442440:281473313992384,MainProcess):2025-07-15-13:49:00.499.047 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] DISTRIBUTED(1442440,ffff352befa0,python):2025-07-15-13:49:00.501.568 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39604 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:00.501.571 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:39604, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:00.501.757 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39618, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:00.501.800 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1442440,ffff362defa0,python):2025-07-15-13:49:00.501.794 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39618 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:00.629.498 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39632, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442420,ffff3a51efa0,python):2025-07-15-13:49:00.629.530 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39632 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:00.629.549 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:00.630.114 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:00.664.121 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:00.764.155 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:00.771.250 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:00.943.822 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:00.981.525 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:39634, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:00.981.568 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(1442412,ffff2fd7efa0,python):2025-07-15-13:49:00.981.569 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39634 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:01.002.384 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:01.130.157 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:01.130.234 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:01.164.240 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:01.264.275 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:01.271.369 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:01.443.937 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:01.481.776 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:39642, destination: 127.0.0.1:10805
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:01.481.818 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:10805 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(1442412,ffff2ed5efa0,python):2025-07-15-13:49:01.481.831 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:39642 to 127.0.0.1:10805 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:01.502.495 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:01.630.274 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/1200).
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:01.630.341 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:01.664.349 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:01.764.386 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:01.771.478 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:01.944.043 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:01.982.509 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/1200).
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:02.002.601 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:02.130.380 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/1200).
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:02.130.451 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:02.164.450 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:02.264.489 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:02.271.583 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:02.444.159 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/1200).
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:02.482.654 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442412,ffff968beec0,python):2025-07-15-13:49:02.482.702 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:02.502.724 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:02.502.763 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 7 rank id: 7
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:02.630.521 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:02.630.576 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:02.630.569 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:02.630.614 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:02.664.573 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:02.664.617 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:02.764.623 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:02.764.665 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 4 rank id: 4
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:02.771.733 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:02.771.792 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 5 rank id: 5
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:02.944.315 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:02.944.377 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 6 rank id: 6
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.266.623 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.266.918 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442440,fffee880efa0,python):2025-07-15-13:49:04.267.169 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442440,fffee880efa0,python):2025-07-15-13:49:04.267.274 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442440,fffee880efa0,python):2025-07-15-13:49:04.267.309 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442440,fffee880efa0,python):2025-07-15-13:49:04.267.357 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DEVICE(1442440,fffee880efa0,python):2025-07-15-13:49:04.267.864 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 2
[WARNING] ME(1442440:281473313992384,MainProcess):2025-07-15-13:49:04.272.296 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.452.262 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.452.595 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442406,fffe83ffefa0,python):2025-07-15-13:49:04.452.942 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442406,fffe83ffefa0,python):2025-07-15-13:49:04.453.042 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442406,fffe83ffefa0,python):2025-07-15-13:49:04.453.102 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442406,fffe83ffefa0,python):2025-07-15-13:49:04.453.135 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] ME(1442406:281472942534336,MainProcess):2025-07-15-13:49:04.459.319 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] DISTRIBUTED(1442406,fffe83ffefa0,python):2025-07-15-13:49:04.461.059 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1442406,fffe81fbefa0,python):2025-07-15-13:49:04.461.589 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] PARALLEL(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.473.267 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.477.379 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5488101015797526856 [const vector]{3, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.477.872 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-16057586909177180503 [const vector]{5, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.478.062 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-6853331267304275293 [const vector]{6, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.478.437 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-15700679239691767905 [const vector]{4, 5, 6, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442440,ffff9ce4eec0,python):2025-07-15-13:49:04.478.832 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-2688051859485673701 [const vector]{1, 3, 5, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.489.402 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.489.781 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442420,fffeec80efa0,python):2025-07-15-13:49:04.490.165 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442420,fffeec80efa0,python):2025-07-15-13:49:04.490.286 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442420,fffeec80efa0,python):2025-07-15-13:49:04.490.338 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442420,fffeec80efa0,python):2025-07-15-13:49:04.490.368 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1442420,fffeec80efa0,python):2025-07-15-13:49:04.491.043 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1442420,fffee1b1efa0,python):2025-07-15-13:49:04.491.546 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.491.748 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.492.073 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442416,fffec587efa0,python):2025-07-15-13:49:04.492.413 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442416,fffec587efa0,python):2025-07-15-13:49:04.492.523 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442416,fffec587efa0,python):2025-07-15-13:49:04.492.581 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442416,fffec587efa0,python):2025-07-15-13:49:04.492.613 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1442416,fffec587efa0,python):2025-07-15-13:49:04.493.132 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1442416,fffec506efa0,python):2025-07-15-13:49:04.493.451 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] ME(1442416:281472905637568,MainProcess):2025-07-15-13:49:04.498.148 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] ME(1442420:281473383395008,MainProcess):2025-07-15-13:49:04.499.815 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.577.854 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.578.180 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442424,fffe9bffefa0,python):2025-07-15-13:49:04.578.574 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442424,fffe9bffefa0,python):2025-07-15-13:49:04.578.673 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442424,fffe9bffefa0,python):2025-07-15-13:49:04.578.730 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442424,fffe9bffefa0,python):2025-07-15-13:49:04.578.757 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1442424,fffe9bffefa0,python):2025-07-15-13:49:04.579.279 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1442424,fffe9b7eefa0,python):2025-07-15-13:49:04.579.607 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] ME(1442424:281473345515200,MainProcess):2025-07-15-13:49:04.584.071 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.598.803 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.599.199 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442429,fffec8baefa0,python):2025-07-15-13:49:04.599.542 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442429,fffec8baefa0,python):2025-07-15-13:49:04.599.663 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442429,fffec8baefa0,python):2025-07-15-13:49:04.599.729 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442429,fffec8baefa0,python):2025-07-15-13:49:04.599.763 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1442429,fffec8baefa0,python):2025-07-15-13:49:04.600.465 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1442429,fffe7bffefa0,python):2025-07-15-13:49:04.601.024 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] ME(1442429:281472956952256,MainProcess):2025-07-15-13:49:04.611.488 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] PARALLEL(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.659.573 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.664.050 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-16453000547691086251 [const vector]{0, 4}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.664.626 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5208665662337742843 [const vector]{0, 2}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.664.831 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5004544844489628105 [const vector]{0, 1}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.665.235 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-6301172352641561019 [const vector]{0, 1, 2, 3}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442406,ffff86c0eec0,python):2025-07-15-13:49:04.665.677 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-5226697808808137312 [const vector]{0, 2, 4, 6}, async: 0, submit_now: 0
[WARNING] PARALLEL(1442416,ffff848deec0,python):2025-07-15-13:49:04.703.301 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.706.569 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.706.917 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group
[WARNING] DEVICE(1442435,fffef0baefa0,python):2025-07-15-13:49:04.707.184 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:10805, node_rank:2130706433, total_rank_size:8, local_rank_size8
[WARNING] HCCL_ADPT(1442435,fffef0baefa0,python):2025-07-15-13:49:04.707.278 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo
[WARNING] HCCL_ADPT(1442435,fffef0baefa0,python):2025-07-15-13:49:04.707.343 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package.
[WARNING] DEVICE(1442435,fffef0baefa0,python):2025-07-15-13:49:04.707.373 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.707.646 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-511848487187618470 [const vector]{2, 6}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442435,fffef0baefa0,python):2025-07-15-13:49:04.707.940 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.708.209 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5208665662337742843 [const vector]{0, 2}, async: 0, submit_now: 0
[WARNING] DEVICE(1442435,fffea3ffefa0,python):2025-07-15-13:49:04.708.330 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.708.412 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-3358271254418797552 [const vector]{2, 3}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.708.792 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-6301172352641561019 [const vector]{0, 1, 2, 3}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442416,ffff848deec0,python):2025-07-15-13:49:04.709.212 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-5226697808808137312 [const vector]{0, 2, 4, 6}, async: 0, submit_now: 0
[WARNING] ME(1442435:281473626205888,MainProcess):2025-07-15-13:49:04.712.919 [mindspore/ops/primitive.py:220] The in_strategy/in_layout of the operator in your network will not take effect in stand_alone mode. This means the the shard function called in the network is ignored. If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
[WARNING] PARALLEL(1442420,ffffa107eec0,python):2025-07-15-13:49:04.721.342 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.725.667 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5488101015797526856 [const vector]{3, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.726.192 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-4190060298023907007 [const vector]{1, 3}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.726.393 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-3358271254418797552 [const vector]{2, 3}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.726.799 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-6301172352641561019 [const vector]{0, 1, 2, 3}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442420,ffffa107eec0,python):2025-07-15-13:49:04.727.222 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-2688051859485673701 [const vector]{1, 3, 5, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442440,fffee880efa0,python):2025-07-15-13:49:04.768.326 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DEVICE(1442440,fffe29fbefa0,python):2025-07-15-13:49:04.768.885 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0
[WARNING] PARALLEL(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.789.973 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.794.332 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-16453000547691086251 [const vector]{0, 4}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.794.913 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5435772415009061329 [const vector]{4, 6}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.795.122 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-6541264347459079684 [const vector]{4, 5}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.795.518 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-15700679239691767905 [const vector]{4, 5, 6, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442424,ffff9ec5eec0,python):2025-07-15-13:49:04.795.933 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-5226697808808137312 [const vector]{0, 2, 4, 6}, async: 0, submit_now: 0
[WARNING] PARALLEL(1442429,ffff879ceec0,python):2025-07-15-13:49:04.822.099 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.826.450 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-12944936785892925600 [const vector]{1, 5}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.826.982 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-16057586909177180503 [const vector]{5, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.827.183 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-6541264347459079684 [const vector]{4, 5}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.827.575 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-15700679239691767905 [const vector]{4, 5, 6, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442429,ffff879ceec0,python):2025-07-15-13:49:04.827.999 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-2688051859485673701 [const vector]{1, 3, 5, 7}, async: 0, submit_now: 0
[WARNING] PARALLEL(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.912.301 [mindspore/ccsrc/frontend/parallel/pipeline_transformer/pipeline_transformer.cc:258] MainGraph] Pipeline Parallel with no 'lazy_inline' is deprecated, '@lazy_inline' should be enabled
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.916.305 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-511848487187618470 [const vector]{2, 6}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.916.817 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-5435772415009061329 [const vector]{4, 6}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.917.018 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 2-6853331267304275293 [const vector]{6, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.917.419 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-15700679239691767905 [const vector]{4, 5, 6, 7}, async: 0, submit_now: 0
[WARNING] DISTRIBUTED(1442435,ffffaf80eec0,python):2025-07-15-13:49:04.917.816 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 4-5226697808808137312 [const vector]{0, 2, 4, 6}, async: 0, submit_now: 0
F
=================================== FAILURES ===================================
______________________ test_checkpoints_convert_by_layout ______________________

    def test_checkpoints_convert_by_layout():
        """
        test checkpoints convert using layout.
        """
        layout = Layout(device_matrix=(2, 2, 2), alias_name=('dp', 'sp', 'mp'))
        src_in_strategy = {
            'mul_0.weight': (layout('dp', ('sp', 'mp')), layout(('sp', 'mp'))),
            'matmul_0.weight': (layout('dp', 'sp'), layout('sp', 'mp')),
            'matmul_1.weight': (layout('dp', 'None'), layout('None', ('sp', 'mp'))),
            'add_0.weight': (layout('dp', 'mp'), layout('mp')),
        }
        dst_in_strategy = {
            'mul_0.weight': (layout('dp', 'mp'), layout('mp')),
            'matmul_0.weight': (layout(('dp', 'sp'), 'mp'), layout('mp', 'None')),
            'matmul_1.weight': (layout('dp', ('sp', 'mp')), layout(('sp', 'mp'), 'None')),
            'add_0.weight': (layout('dp', ('sp', 'mp')), layout(('sp', 'mp'))),
        }
>       run_convert_checkpoint_by_layout(
            src_in_strategy=src_in_strategy,
            dst_in_strategy=dst_in_strategy,
            enable_parallel_optimizer=False,
        )

checkpoints_convert.py:127:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
checkpoints_convert.py:60: in run_convert_checkpoint_by_layout
    init()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

backend_name = 'hccl'

    def init(backend_name=None):
        """
        Initialize distributed backends required by communication services, e.g. ``"hccl"`` / ``"nccl"`` / ``"mccl"``.
        It is usually used in distributed parallel scenarios and set before using communication services.

        Note:
            - The full name of ``"hccl"`` is Huawei Collective Communication Library(HCCL).
            - The full name of ``"nccl"`` is NVIDIA Collective Communication Library(NCCL).
            - The full name of ``"mccl"`` is MindSpore Collective Communication Library(MCCL).
            - In Ascend hardware platforms, ``init()`` should be set before the definition of any Tensor and Parameter,
              and the instantiation and execution of any operation and net.

        Args:
            backend_name (str): Backend, using ``"hccl"`` / ``"nccl"`` / ``"mccl"``. ``"hccl"`` should be used for
                Ascend hardware platforms, ``"nccl"`` for GPU hardware platforms and ``"mccl"`` for CPU hardware
                platforms. If not set, inference is automatically made based on the hardware platform type
                (device_target). Default: ``None`` .

        Raises:
            TypeError: If `backend_name` is not a string.
            RuntimeError: If device target is invalid, or backend is invalid, or distributed initialization fails,
                or the environment variables RANK_ID/MINDSPORE_HCCL_CONFIG_PATH have not been exported when backend
                is HCCL.

        Supported Platforms:
            ``Ascend`` ``GPU`` ``CPU``

        Examples:
            .. note::
                Before running the following examples, you need to configure the communication environment variables.

                For Ascend/GPU/CPU devices, it is recommended to use the msrun startup method without any third-party
                or configuration file dependencies.
                Please see the `msrun start up `_ for more details.

            >>> from mindspore.communication import init
            >>> init()
        """
        host_init = _host_distribute()
        device_target = context.get_context("device_target")

        if backend_name is None:
            if device_target == "Ascend":
                backend_name = "hccl"
            elif device_target == "GPU":
                backend_name = "nccl"
            elif device_target == "CPU":
                backend_name = "mccl"
            else:
                raise RuntimeError("For 'set_context', the argument 'device_target' {} is not supported in "
                                   "parallel initialization, please use Ascend, GPU or CPU.".format(device_target))
        if not isinstance(backend_name, str):
            raise TypeError("For 'init', the argument 'backend_name' must be a string, "
                            "but got the type : {}".format(type(backend_name)))
        if os.getenv("MS_ROLE") == "MS_SCHED":
            backend_name = "mccl"

        _set_elegant_exit_handle()
        if backend_name == "hccl":
            if _is_ps_mode():
                # Use MindSpore cluster to build network for Parameter Server training.
                init_cluster()
                if _is_role_sched() or _is_role_pserver():
                    raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'")
                if _get_ps_context("worker_num") == 1:
                    GlobalComm.INITED = True
                    return
            if device_target != "Ascend":
                raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', "
                                   "but got 'hccl'.".format(DEVICE_TO_BACKEND[device_target], device_target))
            if is_initialized(device_target):
                logger.warning(f"For 'init' in Ascend backend, the backend is already initialized, please set it before "
                               "the definition of any Tensor and Parameter, and the instantiation and execution of any "
                               "operation and net, otherwise the 'init' may not take effect.")
            if not host_init:
                _check_parallel_envs()
            GlobalComm.BACKEND = Backend("hccl")
            _check_hccl()
>           init_hccl()
E           RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
E           
E           ----------------------------------------------------
E           - C++ Call Stack: (For framework developers)
E           ----------------------------------------------------
E           mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: RuntimeError
=============================== warnings summary ===============================
../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
    setattr(self, word, getattr(machar, word).flat[0])
../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
    return self._float_to_str(self.smallest_subnormal)
../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
    setattr(self, word, getattr(machar, word).flat[0])
../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89
  /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
return self._float_to_str(self.smallest_subnormal) ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2.py:57 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2.py:57: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("batchnorm_fold2") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad.py:56 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad.py:56: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("batchnorm_fold2_grad") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad_reduce.py:48 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/batchnorm_fold2_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("batchnorm_fold2_grad_reduce") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul.py:51 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul.py:51: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("correction_mul") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:51 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:51: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("correction_mul_grad") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:143 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/correction_mul_grad.py:143: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("correction_mul_grad_reduce") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer.py:50 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_learned_scale_quant_perlayer") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad.py:92 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad.py:92: DeprecationWarning: 
te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_learned_scale_quant_perlayer_grad_d") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad_reduce.py:49 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perlayer_grad_reduce.py:49: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_learned_scale_quant_perlayer_grad_d_reduce") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel.py:50 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_learned_scale_quant_perchannel") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad.py:91 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad.py:91: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_learned_scale_quant_perchannel_grad_reduce.py:48: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_learned_scale_quant_perchannel_grad_d_reduce") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel.py:52: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_quant_perchannel") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel_grad.py:81 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perchannel_grad.py:81: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_quant_perchannel_grad") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer.py:54 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer.py:54: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute 
@fusion_manager.register("fake_quant_per_layer") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer_grad.py:81 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/fake_quant_perlayer_grad.py:81: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("fake_quant_per_layer_grad") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perchannel.py:50 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perchannel.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("minmax_update_perchannel") ../../../../../../anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50 /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/ops/_op_impl/_custom_op/minmax_update_perlayer.py:50: DeprecationWarning: te_fusion.fusion_manager.fusion_manager.register is deprecated,please replace it with tbe.common.register.register_op_compute @fusion_manager.register("minmax_update_perlayer") -- Docs: https://docs.pytest.org/en/stable/warnings.html =========================== short test summary info ============================ FAILED checkpoints_convert.py::test_checkpoints_convert_by_layout - RuntimeEr... ================== 1 failed, 22 warnings in 167.74s (0:02:47) ================== [WARNING] DEVICE(1442412,ffff968beec0,python):2025-07-15-13:51:42.545.527 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed [ERROR] ME(1442327:281473739124416,MainProcess):2025-07-15-13:51:44.428.79 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 1442412 exit with exception. Error code: 1. [WARNING] ME(1442327:281473739124416,MainProcess):2025-07-15-13:51:44.433.59 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers. [ERROR] ME(1442327:281473739124416,MainProcess):2025-07-15-13:52:18.889.840 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 1442404 exit with exception. [ERROR] ME(1442327:281473739124416,MainProcess):2025-07-15-13:52:18.890.932 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['0', '2', '3', '4', '5', '7'] ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-70- ``"nccl"`` for GPU hardware platforms and ``"mccl"`` for CPU hardware platforms. ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-71- If not set, inference is automatically made based on the hardware ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-72- platform type (device_target). Default: ``None`` . ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-73- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-74- Raises: ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:75: TypeError: If `backend_name` is not a string. 
./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:76: RuntimeError: If device target is invalid, or backend is invalid, or distributed initialization fails, ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-77- or the environment variables RANK_ID/MINDSPORE_HCCL_CONFIG_PATH ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-78- have not been exported when backend is HCCL. ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-79- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-80- Supported Platforms: ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-81- ``Ascend`` ``GPU`` ``CPU`` -- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-102- elif device_target == "GPU": ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-103- backend_name = "nccl" ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-104- elif device_target == "CPU": ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-105- backend_name = "mccl" ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-106- else: ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:107: raise RuntimeError("For 'set_context', the argument 'device_target' {} is not supported in " ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-108- "parallel initialization, please use Ascend, GPU or CPU.".format(device_target)) ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-109- if not isinstance(backend_name, str): ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:110: raise TypeError("For 'init', the argument 'backend_name' must be a string, " ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-111- "but got the type : {}".format(type(backend_name))) ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-112- if os.getenv("MS_ROLE") == "MS_SCHED": ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-113- backend_name = "mccl" ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-114- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-115- _set_elegant_exit_handle() ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-116- if backend_name == "hccl": ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-117- if _is_ps_mode(): ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-118- # Use MindSpore cluster to build network for Parameter Server training. 
./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-119- init_cluster() ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-120- if _is_role_sched() or _is_role_pserver(): ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:121: raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'") ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-122- if _get_ps_context("worker_num") == 1: ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-123- GlobalComm.INITED = True ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-124- return ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-125- if device_target != "Ascend": ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:126: raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', " ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-127- "but got 'hccl'.".format(DEVICE_TO_BACKEND[device_target], device_target)) ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-128- if is_initialized(device_target): ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-129- logger.warning(f"For 'init' in Ascend backend, the backend is already initialized, please set it before " ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-130- "the definition of any Tensor and Parameter, and the instantiation and execution of any " ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-131- "operation and net, otherwise the 'init' may not take effect.") ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-132- if not host_init: ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-133- _check_parallel_envs() ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-134- GlobalComm.BACKEND = Backend("hccl") ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-135- _check_hccl() ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-136-> init_hccl() ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:137:E RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-138-E ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-139-E ---------------------------------------------------- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-140-E - C++ Call Stack: (For framework developers) ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-141-E ---------------------------------------------------- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-142-E mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-143- ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log:144:/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:203: RuntimeError ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-145-=============================== warnings summary =============================== ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-146-../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549 ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-147- /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. 
./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-148- setattr(self, word, getattr(machar, word).flat[0]) ./test_checkpoints_convert_by_layout/msrun_log/worker_1.log-149- -- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-97-[WARNING] DISTRIBUTED(1442404,ffffb412eec0,python):2025-07-15-13:52:02.052.656 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-98-[WARNING] DISTRIBUTED(1442404,ffffb412eec0,python):2025-07-15-13:52:07.052.744 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 7 alive nodes. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-99-[WARNING] DISTRIBUTED(1442404,ffffb412eec0,python):2025-07-15-13:52:07.052.778 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-100-[WARNING] DISTRIBUTED(1442404,ffffb412eec0,python):2025-07-15-13:52:12.052.865 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 7 alive nodes. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-101-[WARNING] DISTRIBUTED(1442404,ffffb412eec0,python):2025-07-15-13:52:12.052.902 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:102:[ERROR] DISTRIBUTED(1442404,ffff47ffefa0,python):2025-07-15-13:52:14.071.971 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 0 is timed out. It may exit with exception, please check this node's log. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:103:[ERROR] DISTRIBUTED(1442404,ffff47ffefa0,python):2025-07-15-13:52:14.072.042 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 2 is timed out. It may exit with exception, please check this node's log. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:104:[ERROR] DISTRIBUTED(1442404,ffff47ffefa0,python):2025-07-15-13:52:14.072.071 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:105:[ERROR] DISTRIBUTED(1442404,ffff47ffefa0,python):2025-07-15-13:52:14.072.096 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 4 is timed out. It may exit with exception, please check this node's log. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:106:[ERROR] DISTRIBUTED(1442404,ffff47ffefa0,python):2025-07-15-13:52:14.072.121 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 5 is timed out. It may exit with exception, please check this node's log. 
./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:107:[ERROR] DISTRIBUTED(1442404,ffff47ffefa0,python):2025-07-15-13:52:14.072.146 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 7 is timed out. It may exit with exception, please check this node's log. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:108:[ERROR] DISTRIBUTED(1442404,ffffb412eec0,python):2025-07-15-13:52:17.052.993 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 6 abnormal compute graph nodes. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-109-F ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-110- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-111-=================================== FAILURES =================================== ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-112-______________________ test_checkpoints_convert_by_layout ______________________ ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-113- -- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-158- ``"nccl"`` for GPU hardware platforms and ``"mccl"`` for CPU hardware platforms. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-159- If not set, inference is automatically made based on the hardware ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-160- platform type (device_target). Default: ``None`` . ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-161- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-162- Raises: ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:163: TypeError: If `backend_name` is not a string. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:164: RuntimeError: If device target is invalid, or backend is invalid, or distributed initialization fails, ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-165- or the environment variables RANK_ID/MINDSPORE_HCCL_CONFIG_PATH ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-166- have not been exported when backend is HCCL. 
./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-167- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-168- Supported Platforms: ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-169- ``Ascend`` ``GPU`` ``CPU`` -- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-190- elif device_target == "GPU": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-191- backend_name = "nccl" ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-192- elif device_target == "CPU": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-193- backend_name = "mccl" ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-194- else: ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:195: raise RuntimeError("For 'set_context', the argument 'device_target' {} is not supported in " ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-196- "parallel initialization, please use Ascend, GPU or CPU.".format(device_target)) ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-197- if not isinstance(backend_name, str): ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:198: raise TypeError("For 'init', the argument 'backend_name' must be a string, " ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-199- "but got the type : {}".format(type(backend_name))) ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-200- if os.getenv("MS_ROLE") == "MS_SCHED": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-201- backend_name = "mccl" ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-202- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-203- _set_elegant_exit_handle() ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-204- if backend_name == "hccl": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-205- if _is_ps_mode(): ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-206- # Use MindSpore cluster to build network for Parameter Server training. 
./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-207- init_cluster() ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-208- if _is_role_sched() or _is_role_pserver(): ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:209: raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'") ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-210- if _get_ps_context("worker_num") == 1: ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-211- GlobalComm.INITED = True ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-212- return ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-213- if device_target != "Ascend": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:214: raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', " ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-215- "but got 'hccl'.".format(DEVICE_TO_BACKEND[device_target], device_target)) ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-216- if is_initialized(device_target): ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-217- logger.warning(f"For 'init' in Ascend backend, the backend is already initialized, please set it before " ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-218- "the definition of any Tensor and Parameter, and the instantiation and execution of any " ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-219- "operation and net, otherwise the 'init' may not take effect.") -- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-223- _check_hccl() ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-224- init_hccl() ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-225- GlobalComm.WORLD_COMM_GROUP = HCCL_WORLD_COMM_GROUP ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-226- elif backend_name == "nccl": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-227- if device_target != "GPU": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:228: raise RuntimeError("For 'init', the argument 'backend_name' should be '{}' to init '{}', " ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-229- "but got 'nccl'.".format(DEVICE_TO_BACKEND[device_target], device_target)) ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-230- init_cluster() ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-231- GlobalComm.BACKEND = Backend("nccl") ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-232- GlobalComm.WORLD_COMM_GROUP = NCCL_WORLD_COMM_GROUP ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-233- elif backend_name == "mccl": ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-234-> init_cluster() ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:235:E RuntimeError: The total number of timed out node is 6. Timed out node list is: [const vector]{0, 2, 3, 4, 5, 7}, worker 0 is the first one timed out, please check its log. 
./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-236-E ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-237-E ---------------------------------------------------- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-238-E - C++ Call Stack: (For framework developers) ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-239-E ---------------------------------------------------- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-240-E mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-241- ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log:242:/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py:213: RuntimeError ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-243-=============================== warnings summary =============================== ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-244-../../../../../../.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549 ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-245- /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-246- setattr(self, word, getattr(machar, word).flat[0]) ./test_checkpoints_convert_by_layout/msrun_log/scheduler.log-247- Traceback (most recent call last): File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in sys.exit(main()) File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main run(args) File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run process_manager.run() File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run self.join_processes() File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes raise RuntimeError("Distributed job exited with exception. Please check logs in " RuntimeError: Distributed job exited with exception. Please check logs in directory: ./test_checkpoints_convert_by_layout/msrun_log. F =================================== FAILURES =================================== ______________________ test_checkpoints_convert_by_layout ______________________ @arg_mark(plat_marks=["platform_ascend910b"], level_mark="level1", card_mark="allcards", essential_mark="essential") def test_checkpoints_convert_by_layout(): """ Feature: Test checkpoints convert with layout. Description: Test distributed checkpoints convert specified by layout. Expectation: The convert checkpoints is correct. """ os.system("rm -rf ./test_checkpoints_convert_by_layout/") return_code = os.system( "msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 " \ "--master_port=10805 --join=True " \ "--log_dir=./test_checkpoints_convert_by_layout/msrun_log " \ "pytest -s checkpoints_convert.py::test_checkpoints_convert_by_layout" ) > assert return_code == 0 E assert 256 == 0 test_checkpoints_convert.py:32: AssertionError =========================== short test summary info ============================ FAILED test_checkpoints_convert.py::test_checkpoints_convert_by_layout - asse... 
======================== 1 failed in 211.65s (0:03:31) =========================
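Note on the failure chain above: the primary fault is the worker-side "Call aclrtSetDevice failed, ret[507033] ... device id[1]" raised inside init(); the scheduler "timed out node" errors and the outer "assert 256 == 0" are follow-on symptoms, since msrun killed the remaining workers after worker 1442412 exited. The sketch below is a minimal, hypothetical standalone probe (not part of the test suite) for checking whether each msrun worker can claim its Ascend card before re-running the test. The script name check_device.py and the ASCEND_RT_VISIBLE_DEVICES environment variable are assumptions; set_device() and init("hccl") mirror the calls already visible in the traceback, and the probe must be launched through msrun so that RANK_ID and the cluster environment are set.

# check_device.py -- hypothetical per-worker probe, launched by msrun like the failing test.
import os
import mindspore as ms
from mindspore.communication import init, get_rank

# Assumed CANN setting: restrict/confirm the physical cards visible to this job.
os.environ.setdefault("ASCEND_RT_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7")

ms.set_device("Ascend")   # the log above recommends set_device() over device_target in set_context
init("hccl")              # this is the call that fails with aclrtSetDevice ret[507033] in the test
print(f"rank {get_rank()} initialized, RANK_ID={os.environ.get('RANK_ID')}")

# Example launch, mirroring the msrun flags used by the failing test:
#   msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 \
#         --master_port=10805 --join=True --log_dir=./probe_msrun_log check_device.py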