============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3, configfile: ../../../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_deepseekv3_pretrain.py enable lazy inline in pp using gpt dataset.
make: Nothing to be done for 'default'.
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_0.log. Environment variable [RANK_ID=0] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.581.822 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 144-167 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:1, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log. Environment variable [RANK_ID=1] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.637.697 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 24-47 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:2, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_2.log. Environment variable [RANK_ID=2] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.696.223 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 96-119 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:3, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_3.log. Environment variable [RANK_ID=3] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.776.594 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 72-95 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:4, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_4.log. Environment variable [RANK_ID=4] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.848.267 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 0-23 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:5, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_5.log. Environment variable [RANK_ID=5] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.909.932 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 120-143 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:6, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_6.log. Environment variable [RANK_ID=6] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:38:59.971.117 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 48-71 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:7, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_7.log. Environment variable [RANK_ID=7] is exported.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:39:00.400.47 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 168-191 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/pretrain_deepseek3_mte_gptdataset.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:39:00.100.386 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
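The launcher output above follows one pattern per rank: export RANK_ID, pin the worker to a 24-core range with taskset, and start run_mindformer.py with the same --config and --register_path while redirecting output to worker_N.log. Below is a minimal sketch of that pattern, not MindSpore's actual launcher; the CPU ranges and paths are copied from the log, the helper name and I/O handling are assumptions.

# Hedged sketch of the per-rank spawn pattern visible in the launcher log above.
import os
import subprocess

BASE = "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3"
CONFIG = f"{BASE}/pretrain_deepseek3_mte_gptdataset.yaml"
REGISTER = f"{BASE}/../mindformers/research/deepseek3/"
LOG_DIR = f"{BASE}/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset"
CPU_RANGES = ["144-167", "24-47", "96-119", "72-95", "0-23", "120-143", "48-71", "168-191"]

def spawn_workers():
    procs = []
    for rank, cpus in enumerate(CPU_RANGES):
        env = dict(os.environ, RANK_ID=str(rank))       # "Environment variable [RANK_ID=N] is exported."
        cmd = [
            "taskset", "-c", cpus, "python",
            f"{BASE}/../mindformers/run_mindformer.py",
            "--config", CONFIG,
            "--register_path", REGISTER,
        ]
        log = open(f"{LOG_DIR}/worker_{rank}.log", "w")  # per-rank log file as in the log above
        procs.append(subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT))
    for p in procs:                                      # "Waiting all processes to exit..."
        p.wait()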
2025-07-15 10:39:08,371 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,371 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,372 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,372 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.389.316 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.390.038 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.390.450 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.390.563 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.390.655 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.390.765 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.390.920 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.391.129 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.391.391 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.391.499 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912072:281473543564992,MainProcess):2025-07-15-10:39:08.391.600 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:08.393.554 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52396, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:08.393.625 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:7125, retry to reconnect(1/1)...
2025-07-15 10:39:08,445 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,446 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,446 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,446 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.463.471 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.464.210 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.464.627 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
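Every worker emits the same block of context.set_context deprecation warnings; they all point at the replacement APIs quoted in the messages. Below is a hedged migration sketch: the API names come from the warnings themselves, while the argument values are placeholders and exact signatures may differ between MindSpore versions.

# Hedged sketch of the replacements named in the deprecation warnings above.
# Argument values are illustrative placeholders, not values from this run.
import mindspore as ms

# old style: ms.set_context(device_target=..., max_call_depth=..., deterministic=..., ...)
ms.set_device("Ascend")                  # replaces the 'device_target' parameter
ms.set_recursion_limit(10000)            # replaces 'max_call_depth'
ms.set_deterministic(True)               # replaces 'deterministic'
ms.runtime.set_memory(max_size="58GB")   # replaces 'max_device_memory' / 'mempool_block_size' / 'memory_optimize_level'

# 'save_graphs' / 'save_graphs_path' are now controlled by environment variables:
#   MS_DEV_SAVE_GRAPHS=1  MS_DEV_SAVE_GRAPHS_PATH=/path/to/ir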
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.464.746 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.464.841 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.464.954 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.465.115 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.465.333 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.465.621 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.465.735 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912068:281473778380480,MainProcess):2025-07-15-10:39:08.465.840 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:08.467.820 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52404, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:08.467.898 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:7125, retry to reconnect(1/1)...
2025-07-15 10:39:08,487 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,488 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,488 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,488 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.507.406 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.508.153 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.508.595 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.508.709 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.508.806 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.508.919 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.509.079 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.509.293 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.509.578 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.509.688 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912076:281473777594048,MainProcess):2025-07-15-10:39:08.509.790 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:08.511.716 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52410, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:08.511.790 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:7125, retry to reconnect(1/1)...
2025-07-15 10:39:08,532 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,533 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,533 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,534 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.551.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.551.731 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.552.145 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.552.258 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.552.350 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.552.463 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.552.623 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.552.823 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.553.095 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.553.206 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912080:281473296690880,MainProcess):2025-07-15-10:39:08.553.308 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:08.554.994 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52412, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:08.555.064 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:7125, retry to reconnect(1/1)...
2025-07-15 10:39:08,612 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,613 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,613 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,614 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.631.173 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.631.905 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.632.306 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.632.420 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.632.511 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.632.622 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.632.782 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.632.988 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.633.269 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.633.378 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912084:281472835120832,MainProcess):2025-07-15-10:39:08.633.482 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:08.635.533 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52414, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912084,fffefabeefa0,python):2025-07-15-10:39:08.635.559 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52414 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:08.635.606 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
2025-07-15 10:39:08,796 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,797 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,797 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,798 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.815.089 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.815.815 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.816.229 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.816.342 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.816.435 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.816.548 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.816.707 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.816.908 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.817.179 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.817.291 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912092:281473195765440,MainProcess):2025-07-15-10:39:08.817.394 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:08.819.323 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52422, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912092,ffff0bffefa0,python):2025-07-15-10:39:08.819.323 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52422 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:08.819.392 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
2025-07-15 10:39:08,820 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,821 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,821 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,821 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.838.796 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.839.534 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.839.982 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.840.094 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.840.185 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.840.296 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.840.453 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.840.661 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.840.930 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.841.035 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912088:281472919269056,MainProcess):2025-07-15-10:39:08.841.135 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:08.843.177 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52428, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912088,fffefb7eefa0,python):2025-07-15-10:39:08.843.184 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52428 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:08.843.242 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:08.893.732 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:08.893.774 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
2025-07-15 10:39:08,943 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:39:08,943 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:39:08,944 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:39:08,944 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.961.083 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.961.828 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.962.251 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.962.370 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.962.491 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.962.610 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.962.774 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.962.999 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.963.290 [mindspore/context.py:921] For 'context.set_context', 'dataset_broadcast_opt_level' parameter is deprecated, and will be removed in the next version, Please use 'dataset_broadcast_opt_level' instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.963.406 [mindspore/context.py:921] For 'context.set_context', 'compute_communicate_fusion_level' parameter is deprecated, and will be removed in the next version, Please use 'computation_communication_fusion_level' instead.
[WARNING] ME(912096:281473611394752,MainProcess):2025-07-15-10:39:08.963.513 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:08.965.539 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:52440, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912096,ffff2902efa0,python):2025-07-15-10:39:08.965.539 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52440 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:08.965.613 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:08.968.011 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:08.968.053 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:09.011.898 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:09.011.939 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:09.055.173 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:09.055.213 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server.
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:09.135.845 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:52454, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912084,fffefbc0efa0,python):2025-07-15-10:39:09.135.874 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52454 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:09.135.893 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:09.319.597 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:52456, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912092,ffff113fefa0,python):2025-07-15-10:39:09.319.623 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52456 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:09.319.642 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:09.343.464 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:52470, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912088,ffff00c5efa0,python):2025-07-15-10:39:09.343.494 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52470 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:09.343.504 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:09.394.030 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:52472, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:09.394.072 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(912072,ffff25fcefa0,python):2025-07-15-10:39:09.394.069 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:52472 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:09.465.849 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:41618, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912096,ffff2a04efa0,python):2025-07-15-10:39:09.465.876 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41618 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:09.465.892 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:09.468.307 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:41620, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:09.468.348 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(912068,ffff33f8efa0,python):2025-07-15-10:39:09.468.346 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41620 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:09.512.187 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:41636, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912076,ffff33eeefa0,python):2025-07-15-10:39:09.512.218 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41636 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:09.512.228 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:09.555.428 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:41646, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912080,ffff1744efa0,python):2025-07-15-10:39:09.555.457 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41646 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:09.555.469 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:09.636.520 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:09.820.098 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:09.843.928 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:09.894.281 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:41660, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:09.894.321 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912072,ffff24faefa0,python):2025-07-15-10:39:09.894.323 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41660 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:09.966.472 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:09.968.560 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:41670, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:09.968.599 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912068,ffff32f6efa0,python):2025-07-15-10:39:09.968.617 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41670 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:10.012.451 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:41672, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912076,ffff32ecefa0,python):2025-07-15-10:39:10.012.482 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41672 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:10.012.490 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:10.055.675 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:41684, destination: 127.0.0.1:7125
[WARNING] DISTRIBUTED(912080,ffff1642efa0,python):2025-07-15-10:39:10.055.700 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:41684 to 127.0.0.1:7125 is successfully created. System errno: Success
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:10.055.715 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:10.136.626 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:10.320.203 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:10.344.032 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:10.394.817 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:10.466.585 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:10.469.103 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:10.513.028 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:10.556.225 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400).
[WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:10.636.731 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400).
[WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:10.820.310 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400).
[WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:10.844.133 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400).
[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:10.894.984 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
[WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:10.966.696 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400).
[WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:10.969.220 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:11.013.142 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:11.056.414 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:11.056.456 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:11.136.927 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:11.136.970 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 4 rank id: 4 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:11.320.483 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:11.320.524 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 6 rank id: 6 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:11.344.290 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:11.344.325 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 5 rank id: 5 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:11.395.163 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:11.395.204 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. 
[MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:11.466.884 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:11.466.924 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 7 rank id: 7 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:11.469.414 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:11.469.456 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:11.513.328 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:11.513.365 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [WARNING] GRAPH_KERNEL(912080,ffff9bdceec0,python):2025-07-15-10:39:12.829.181 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:12.832.405 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912080,ffff9bdceec0,python):2025-07-15-10:39:12.832.599 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912080,fffebeefefa0,python):2025-07-15-10:39:12.832.817 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912080,fffebeefefa0,python):2025-07-15-10:39:12.832.916 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912080,fffebeefefa0,python):2025-07-15-10:39:12.832.952 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(912080,fffebeefefa0,python):2025-07-15-10:39:12.832.983 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(912080,fffebeefefa0,python):2025-07-15-10:39:12.833.328 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 2 2025-07-15 10:39:12,834 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt' 2025-07-15 10:39:12,861 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:12,861 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:12,862 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:39:12,862 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:12,862 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:12,862 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:39:12,863 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:12,863 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:12,863 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:12,863 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:12,864 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:39:12,864 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:12,865 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt' [WARNING] GRAPH_KERNEL(912084,ffff8059eec0,python):2025-07-15-10:39:12.904.349 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:12.907.776 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912084,ffff8059eec0,python):2025-07-15-10:39:12.907.995 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912084,fffea30befa0,python):2025-07-15-10:39:12.908.219 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912084,fffea30befa0,python):2025-07-15-10:39:12.908.306 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912084,fffea30befa0,python):2025-07-15-10:39:12.908.341 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(912084,fffea30befa0,python):2025-07-15-10:39:12.908.373 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(912084,fffea30befa0,python):2025-07-15-10:39:12.908.762 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 1 2025-07-15 10:39:12,909 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt' 2025-07-15 10:39:12,935 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:12,936 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:12,936 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:39:12,936 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:12,936 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:12,937 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:39:12,937 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:12,937 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:12,937 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:12,938 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:12,938 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:39:12,939 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:12,939 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt' 2025-07-15 10:39:12,977 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:12,978 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:12,978 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:39:12,979 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:12,979 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:12,979 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:12,980 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:12,980 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:12,980 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:12,981 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:12,981 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:12,981 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:39:12,981 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:39:12,981 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:12,982 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:12,982 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:12,983 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 1, num_shards: 2 2025-07-15 10:39:13,069 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:13,070 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:13,070 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:39:13,070 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:13,071 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:13,071 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,071 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:13,072 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:13,072 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,072 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:13,072 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:13,073 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:39:13,073 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:39:13,073 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:13,073 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:13,073 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:13,074 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 0, num_shards: 2 [WARNING] GRAPH_KERNEL(912092,ffff95d8eec0,python):2025-07-15-10:39:13.077.090 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:13.080.579 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912092,ffff95d8eec0,python):2025-07-15-10:39:13.080.781 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912092,fffeb906efa0,python):2025-07-15-10:39:13.081.015 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912092,fffeb906efa0,python):2025-07-15-10:39:13.081.119 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912092,fffeb906efa0,python):2025-07-15-10:39:13.081.157 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(912092,fffeb906efa0,python):2025-07-15-10:39:13.081.189 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(912092,fffeb906efa0,python):2025-07-15-10:39:13.081.593 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 1 2025-07-15 10:39:13,082 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt' 2025-07-15 10:39:13,109 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:13,109 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:13,109 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:39:13,110 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:13,110 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:13,110 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:39:13,110 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:13,111 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:13,111 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:13,111 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:13,111 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:39:13,112 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:13,112 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt' [WARNING] GRAPH_KERNEL(912088,ffff855deec0,python):2025-07-15-10:39:13.114.373 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:13.117.808 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912088,ffff855deec0,python):2025-07-15-10:39:13.118.017 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912088,fffea8baefa0,python):2025-07-15-10:39:13.118.238 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912088,fffea8baefa0,python):2025-07-15-10:39:13.118.339 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912088,fffea8baefa0,python):2025-07-15-10:39:13.118.373 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(912088,fffea8baefa0,python):2025-07-15-10:39:13.118.402 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(912088,fffea8baefa0,python):2025-07-15-10:39:13.118.906 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 2 2025-07-15 10:39:13,119 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt' 2025-07-15 10:39:13,146 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:13,146 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:13,146 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:39:13,146 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:13,147 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:13,147 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:39:13,147 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:13,147 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:13,148 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:13,148 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:13,148 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:39:13,149 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:13,149 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt' [WARNING] GRAPH_KERNEL(912068,ffffb892eec0,python):2025-07-15-10:39:13.191.952 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:13.195.397 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912068,ffffb892eec0,python):2025-07-15-10:39:13.195.618 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912068,fffedb2eefa0,python):2025-07-15-10:39:13.195.876 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912068,fffedb2eefa0,python):2025-07-15-10:39:13.195.974 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912068,fffedb2eefa0,python):2025-07-15-10:39:13.196.011 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(912068,fffedb2eefa0,python):2025-07-15-10:39:13.196.043 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] GRAPH_KERNEL(912096,ffffae9eeec0,python):2025-07-15-10:39:13.196.698 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" 2025-07-15 10:39:13,197 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt' [WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:13.200.344 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912096,ffffae9eeec0,python):2025-07-15-10:39:13.200.576 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912096,fffed11fefa0,python):2025-07-15-10:39:13.200.879 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912096,fffed11fefa0,python):2025-07-15-10:39:13.200.973 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912096,fffed11fefa0,python):2025-07-15-10:39:13.201.010 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(912096,fffed11fefa0,python):2025-07-15-10:39:13.201.042 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(912096,fffed11fefa0,python):2025-07-15-10:39:13.201.557 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 2 2025-07-15 10:39:13,202 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt' [WARNING] DISTRIBUTED(912068,fffedb2eefa0,python):2025-07-15-10:39:13.203.358 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912068,fffed92aefa0,python):2025-07-15-10:39:13.203.669 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:39:13,224 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:13,224 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:13,224 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:39:13,224 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:13,225 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:13,225 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:39:13,225 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:13,225 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:13,226 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:13,226 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:13,226 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 
2025-07-15 10:39:13,227 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:13,227 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt' 2025-07-15 10:39:13,229 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:13,229 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:13,229 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:39:13,229 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:13,230 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:13,230 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:39:13,230 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:13,231 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:13,231 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:13,231 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 
'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:13,231 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:39:13,232 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:13,232 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt' 2025-07-15 10:39:13,247 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:13,247 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:13,248 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:39:13,248 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:13,248 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:13,249 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,249 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:13,249 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:13,249 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,250 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:13,250 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:13,250 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 
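
For reference, the per-card and global batch sizes that the trainer logs above follow from a straightforward product. A minimal Python sketch, not part of the test suite, with values copied from the parallel_config and trainer messages of this run:

batch_size = 1                  # per micro-batch, from the yaml used in this case
micro_batch_num = 2             # from the parallel_config logged above
micro_batch_interleave_num = 1  # no interleaving in this run
data_parallel = 2               # from the parallel_config logged above

per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num
global_batch_size = per_batch_size * data_parallel
print(per_batch_size, global_batch_size)  # 2 4, matching the trainer log
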
2025-07-15 10:39:13,251 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:39:13,251 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:13,251 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:13,251 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:13,252 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 1, num_shards: 2 [WARNING] GRAPH_KERNEL(912076,ffffb886eec0,python):2025-07-15-10:39:13.265.403 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". graph_kernel_flags = "None" [WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:13.268.922 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(912076,ffffb886eec0,python):2025-07-15-10:39:13.269.138 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(912076,fffedb0befa0,python):2025-07-15-10:39:13.269.374 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(912076,fffedb0befa0,python):2025-07-15-10:39:13.269.470 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(912076,fffedb0befa0,python):2025-07-15-10:39:13.269.505 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 
[WARNING] DEVICE(912076,fffedb0befa0,python):2025-07-15-10:39:13.269.533 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(912076,fffedb0befa0,python):2025-07-15-10:39:13.269.967 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912076,fffeda8aefa0,python):2025-07-15-10:39:13.270.278 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:39:13,270 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt' 2025-07-15 10:39:13,285 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:13,285 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:13,286 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:39:13,286 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:13,286 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:13,287 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,287 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:13,287 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:13,288 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,288 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:13,288 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:13,288 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 
2025-07-15 10:39:13,289 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:39:13,289 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:13,289 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:13,289 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:13,290 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 0, num_shards: 2 2025-07-15 10:39:13,301 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:39:13,301 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:39:13,302 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:39:13,302 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:39:13,302 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:39:13,302 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:39:13,302 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:39:13,303 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:39:13,303 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:39:13,303 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:39:13,304 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 
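
A quick consistency check one can run against the parallel_config logged above: the product of the data, model, context and pipeline parallel degrees should tile the 8 devices this case launches. This is a hedged sketch with values copied from the log; expert_parallel is left out of the product on the assumption that it is folded into the other dimensions for the MoE layers:

parallel_config = {"data_parallel": 2, "model_parallel": 2,
                   "context_parallel": 1, "pipeline_stage": 2}
device_num = 8  # this case is launched on 8 cards

world = 1
for key in ("data_parallel", "model_parallel", "context_parallel", "pipeline_stage"):
    world *= parallel_config[key]
assert world == device_num, f"layout needs {world} devices, launch provides {device_num}"
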
2025-07-15 10:39:13,304 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:39:13,305 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt' [WARNING] DISTRIBUTED(912080,fffebeefefa0,python):2025-07-15-10:39:13.333.748 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912080,fffebe6eefa0,python):2025-07-15-10:39:13.334.130 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:39:13,363 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:13,364 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:13,364 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:39:13,364 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:13,365 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:13,365 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,366 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:13,366 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:13,366 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,367 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:13,367 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:13,367 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 
2025-07-15 10:39:13,367 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:39:13,367 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:13,368 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:13,368 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:13,369 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 0, num_shards: 2 2025-07-15 10:39:13,370 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:13,370 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:13,370 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:39:13,371 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:13,371 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:13,372 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,372 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:13,372 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:13,372 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,373 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:13,373 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:13,373 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:39:13,373 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 
2025-07-15 10:39:13,374 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:13,374 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:13,374 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:13,375 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 1, num_shards: 2 [WARNING] DISTRIBUTED(912084,fffea30befa0,python):2025-07-15-10:39:13.409.162 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912084,fffea28aefa0,python):2025-07-15-10:39:13.409.582 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 make: Entering directory '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/dataset/blended_datasets' 2025-07-15 10:39:13,449 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:39:13,449 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:39:13,450 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:39:13,450 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:39:13,451 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:39:13,451 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:39:13,451 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:39:13,451 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:39:13,452 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 
2025-07-15 10:39:13,452 - mindformers./output/log[mindformers/trainer/base_trainer.py:234] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is False, gradient_accumulation_steps will not take effect in pipeline parallel, batch size per card will be changed: per_batch_size = batch_size * micro_batch_num * micro_batch_interleave_num = 2 = 1 * 2 * 1). 2025-07-15 10:39:13,452 - mindformers./output/log[mindformers/trainer/base_trainer.py:241] - INFO - global_batch_size = per_batch_size * data_parallel = 2 * 2 = 4 2025-07-15 10:39:13,452 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:39:13,453 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:39:13,453 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:39:13,453 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:39:13,453 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:39:13,454 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is [[2, 1], [2, 1], [2, 1], [2, 1]], shard_id: 1, num_shards: 2 make: Nothing to be done for 'default'. make: Leaving directory '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/dataset/blended_datasets' [WARNING] DISTRIBUTED(912092,fffeb906efa0,python):2025-07-15-10:39:13.581.981 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912092,fffeb885efa0,python):2025-07-15-10:39:13.582.389 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 [WARNING] DISTRIBUTED(912088,fffea8baefa0,python):2025-07-15-10:39:13.619.261 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912088,fffe5bffefa0,python):2025-07-15-10:39:13.619.741 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 [WARNING] DISTRIBUTED(912096,fffed11fefa0,python):2025-07-15-10:39:13.701.967 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(912096,fffed09eefa0,python):2025-07-15-10:39:13.702.388 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:41:55,183 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying 
to run with a single device, please set use_parallel=False. If not, please check the error message above.
2025-07-15 10:41:55,184 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
    build_context(config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in <module>
    main(config_)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
    raise exc
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
    build_context(config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

[WARNING] DEVICE(912072,ffffaa93eec0,python):2025-07-15-10:41:55.458.821 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed
[ERROR] ME(911743:281473094381248,MainProcess):2025-07-15-10:41:57.358.669 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 912072 exit with exception. Error code: 1.
[WARNING] ME(911743:281473094381248,MainProcess):2025-07-15-10:41:57.358.986 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers.
[ERROR] ME(911743:281473094381248,MainProcess):2025-07-15-10:42:27.703.424 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 912066 exit with exception.
[ERROR] ME(911743:281473094381248,MainProcess):2025-07-15-10:42:27.704.618 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['1']
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-39-[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:10.894.984 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-40-[MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-41-[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:11.395.163 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-42-[WARNING] DISTRIBUTED(912072,ffffaa93eec0,python):2025-07-15-10:39:11.395.204 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-43-[MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log:44:2025-07-15 10:41:55,183 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log:45:2025-07-15 10:41:55,184 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-46- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-47- result = run_func(*args, **kwargs) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-48- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-49- build_context(config) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-50- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-53- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-54- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-55- init() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-56- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-57- init_hccl() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log:58:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-59- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-60----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-61-- C++ Call Stack: (For framework developers) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-62----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-63-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-64- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-65- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log:66:Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-67- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-68- main(config_) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-69- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-70- raise exc /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-71- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-78- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-79- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-80- init() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-81- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-82- init_hccl() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log:83:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-84- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-85----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-86-- C++ Call Stack: (For framework developers) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-87----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_1.log-88-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-118-[WARNING] DISTRIBUTED(912066,ffffbf07eec0,python):2025-07-15-10:42:10.575.155 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-119-[WARNING] DISTRIBUTED(912066,ffffbf07eec0,python):2025-07-15-10:42:15.575.384 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 8 alive nodes. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-120-[WARNING] DISTRIBUTED(912066,ffffbf07eec0,python):2025-07-15-10:42:15.575.476 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-121-[WARNING] DISTRIBUTED(912066,ffffbf07eec0,python):2025-07-15-10:42:20.575.680 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 8 alive nodes. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-122-[WARNING] DISTRIBUTED(912066,ffffbf07eec0,python):2025-07-15-10:42:20.575.806 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:123:[ERROR] DISTRIBUTED(912066,ffff396cefa0,python):2025-07-15-10:42:25.094.370 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 1 is timed out. It may exit with exception, please check this node's log. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:124:[ERROR] DISTRIBUTED(912066,ffffbf07eec0,python):2025-07-15-10:42:25.576.011 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:125:2025-07-15 10:42:25,576 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:126:2025-07-15 10:42:25,578 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-127- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-128- result = run_func(*args, **kwargs) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-129- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-130- build_context(config) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-131- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-134- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-135- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-136- init() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-137- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-138- init_cluster() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:139:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-140- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-141----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-142-- C++ Call Stack: (For framework developers) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-143----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-144-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-145- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-146- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:147:Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-148- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-149- main(config_) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-150- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-151- raise exc /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-152- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-159- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-160- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-161- init()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-162- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-163- init_cluster()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log:164:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-165-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-166----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-167-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-168----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/scheduler.log-169-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/.
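
The first real failure in this run is worker 1's aclrtSetDevice error during communication init (see the traceback above); the scheduler timeout and the msrun RuntimeError are fallout from that worker exiting. A hypothetical standalone sketch of the same call path, using the MindSpore APIs the traceback passes through, can help verify outside the test harness whether the device can be claimed at all; use_parallel=False is the fallback the mindformers error message suggests for single-device runs:

import os
import mindspore as ms
from mindspore.communication import init

device_id = int(os.getenv("DEVICE_ID", "1"))  # worker 1 failed while claiming device id 1
ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend", device_id=device_id)
try:
    init("hccl")  # assumes the msrun/rank environment variables are already exported
    print(f"device {device_id} claimed and HCCL initialized")
except RuntimeError as err:
    # ret[507033] in the log means aclrtSetDevice could not set this device; the message
    # asks to check that the device id is valid (a busy or unavailable NPU is one possible cause).
    print(f"communication init failed on device {device_id}: {err}")
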
[MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True
F
=================================== FAILURES ===================================
________ test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_mte_8p_gptdataset _________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level1', card_mark='allcards', essential_mark='essential')
    def test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_mte_8p_gptdataset():
        """
        Feature: test deepseekv3 cell dp2mp2ep4pp2mb4gas1bs1 mte 8p gptdataset
        Description: test deepseekv3 cell dp2mp2ep4pp2mb4gas1bs1 mte 8p gptdataset
        Expectation: st pass
        """
        case_name = "deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset"
        sh_path = os.path.split(os.path.realpath(__file__))[0]
        file_path = f"{sh_path}/pretrain_deepseek3_mte_gptdataset.yaml"
        device_num = 8
        master_port = 7125
        hccl_if_base_port = 63375
        os.makedirs(os.path.join(sh_path, case_name), exist_ok=True)
        clear_directory(f"{sh_path}/{case_name}")
        env_cmd = 'export MS_DEV_RUNTIME_CONF="memory_statistics:True";'
        env_cmd += 'export MS_MEMORY_STATISTIC=1'
        os.system(f"{env_cmd};bash {sh_path}/run_llm.sh {device_num} {file_path} \
                  {case_name} {master_port} {hccl_if_base_port} pp gpt")
        # check train over
        check_pair = {"Training Over": 1}
        real_log_path = log_path_preprocess(case_name, device_num)
        for log_path in real_log_path:
>           check_log(log_path, check_pair)

test_deepseekv3_pretrain.py:482:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

file_path = './deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_0.log'
check_pairs = {'Training Over': 1}

    def check_log(file_path, check_pairs=None):
        # check the number of key in check_pairs in log file is equal to the value
        log_error_count = subprocess.check_output(
            ["grep -rE '%s' %s | wc -l" % ("ERROR|Traceback", file_path)], shell=True)
        log_cnt = str(log_error_count, 'utf-8').strip()
        if log_cnt != "0":
            os.system(f"cat {file_path}")
            assert log_cnt == "0", f"Error found in {file_path}"
        if check_pairs is not None:
            for key_word, value in check_pairs.items():
                log_output = subprocess.check_output(
                    ["grep -r '%s' %s | wc -l" % (key_word, file_path)], shell=True)
                log_cnt = str(log_output, 'utf-8').strip()
>               assert log_cnt == str(value), (f"Failed to find {key_word} in {file_path} or content is not correct."
                                               f"Expected occurrences: {value}, but got {log_cnt}")
E               AssertionError: Failed to find Training Over in ./deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset/worker_0.log or content is not correct.Expected occurrences: 1, but got 0

../utils.py:160: AssertionError
=========================== short test summary info ============================
FAILED test_deepseekv3_pretrain.py::test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_mte_8p_gptdataset
======================== 1 failed in 226.16s (0:03:46) =========================
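
Because the assertion only reports that 'Training Over' is missing from worker_0.log, the grep excerpts spliced into the output above are what actually point at the root cause in worker_1.log. A small hypothetical triage helper (not part of utils.py) that prints the first ERROR/Traceback hit in each log of the case directory named by the msrun failure:

import re
from pathlib import Path

case_dir = Path("deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_gptdataset")  # assumed relative to the test directory
pattern = re.compile(r"ERROR|Traceback")

for log_file in sorted(case_dir.glob("*.log")):
    for lineno, line in enumerate(log_file.read_text(errors="ignore").splitlines(), 1):
        if pattern.search(line):
            print(f"{log_file.name}:{lineno}: {line.strip()}")
            break
    else:
        print(f"{log_file.name}: no ERROR/Traceback found")
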