============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3, configfile: ../../../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_deepseekv3_pretrain.py enable lazy inline in pp
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_0.log. Environment variable [RANK_ID=0] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.328.712 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 144-167 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:1, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log. Environment variable [RANK_ID=1] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.396.129 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 24-47 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:2, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_2.log. Environment variable [RANK_ID=2] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.462.358 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 96-119 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:3, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_3.log. Environment variable [RANK_ID=3] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.527.208 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 72-95 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:4, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_4.log. Environment variable [RANK_ID=4] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.592.195 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 0-23 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:5, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_5.log. Environment variable [RANK_ID=5] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.658.036 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 120-143 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:6, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_6.log. Environment variable [RANK_ID=6] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.750.448 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 48-71 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:7, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_7.log. Environment variable [RANK_ID=7] is exported.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.814.921 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 168-191 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:17:22.880.626 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
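The eight launch commands above all follow one pattern: the parent test process pins each worker to a 24-core block with taskset, exports RANK_ID for that worker, and starts run_mindformer.py with the same --config and --register_path arguments, redirecting output to worker_N.log. The sketch below mirrors that pattern and is illustrative only; the helper name and the hard-coded core map are assumptions taken from the log, not the real launcher in mindspore/parallel/cluster/process_entity/_utils.py.

# Hypothetical sketch of the per-rank launch loop implied by the log above.
import os
import subprocess

CORE_MAP = {0: "144-167", 1: "24-47", 2: "96-119", 3: "72-95",
            4: "0-23", 5: "120-143", 6: "48-71", 7: "168-191"}  # taskset ranges seen in the log

def launch_workers(config_yaml, register_path, workdir):
    """Spawn one run_mindformer.py worker per rank, pinned to its CPU block."""
    procs = []
    for rank, cores in CORE_MAP.items():
        env = dict(os.environ, RANK_ID=str(rank))  # "Environment variable [RANK_ID=N] is exported."
        cmd = ["taskset", "-c", cores, "python", "../mindformers/run_mindformer.py",
               "--config", config_yaml, "--register_path", register_path]
        log = open(os.path.join(workdir, f"worker_{rank}.log"), "w")
        procs.append(subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT, cwd=workdir))
    return procs  # the parent then waits for all workers to exit

After spawning, the parent simply blocks on the workers, which is what the "Distributed job is spawned. Waiting all processes to exit..." line reports.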
2025-07-15 10:17:30,966 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty.
2025-07-15 10:17:30,967 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty.
2025-07-15 10:17:30,967 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt']
2025-07-15 10:17:30,968 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output'
[WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.987.558 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead.
[WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.988.294 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead.
[WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.988.685 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead.
[WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.988.797 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead.
[WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.989.113 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead.
[WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.989.243 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.989.338 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.989.447 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.989.631 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899768:281473836510912,MainProcess):2025-07-15-10:17:30.989.848 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:30.991.880 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40560, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:30.991.954 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:7124, retry to reconnect(1/1)... 2025-07-15 10:17:31,055 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,055 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:31,055 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,056 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.752.86 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.760.48 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.764.61 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. 
[WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.765.80 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.769.04 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.770.45 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.771.45 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.772.60 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.774.51 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899764:281473520299712,MainProcess):2025-07-15-10:17:31.776.76 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:31.079.712 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40572, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:31.079.788 [mindspore/ccsrc/distributed/rpc/tcp/tcp_client.cc:76] Connect] Failed to connect to the tcp server : 127.0.0.1:7124, retry to reconnect(1/1)... 2025-07-15 10:17:31,131 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,131 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:17:31,131 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,132 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.150.942 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.151.680 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.152.076 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.152.192 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.152.513 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.152.648 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.152.741 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.152.852 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.153.033 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899772:281473613426368,MainProcess):2025-07-15-10:17:31.153.248 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. 
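Every worker prints the same block of context.set_context deprecation warnings because the test configuration still drives the legacy context keys. The warnings themselves name the replacement APIs, so a migration would look roughly like the sketch below; the keyword arguments and values are assumptions inferred from the warning text and from this run (for example the 54GB memory cap reported later), so verify them against the MindSpore version actually installed.

# Rough migration sketch based only on the replacement APIs named in the warnings above;
# argument names and values are assumptions, not verified signatures.
import mindspore as ms

ms.set_device("Ascend")                 # replaces context.set_context(device_target=...)
ms.runtime.set_memory(max_size="54GB")  # replaces max_device_memory / mempool_block_size / memory_optimize_level
ms.set_recursion_limit(10000)           # replaces max_call_depth
ms.set_deterministic(False)             # replaces the deterministic parameter

# save_graphs / save_graphs_path move to the MS_DEV_SAVE_GRAPHS and
# MS_DEV_SAVE_GRAPHS_PATH environment variables; ascend_config is split across
# mindspore.device_context.ascend.op_precision.* and
# mindspore.device_context.ascend.op_tuning.op_compile().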
[WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:31.155.145 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40578, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899772,ffff2921efa0,python):2025-07-15-10:17:31.155.155 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40578 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:31.155.215 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 2025-07-15 10:17:31,183 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,183 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:31,184 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,184 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.202.812 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.203.538 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.203.937 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.204.051 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.204.356 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.204.488 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. 
[WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.204.583 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.204.694 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.204.866 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899776:281473336536768,MainProcess):2025-07-15-10:17:31.205.063 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:31.206.981 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40584, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899776,ffff18a0efa0,python):2025-07-15-10:17:31.207.003 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40584 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:31.207.039 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 2025-07-15 10:17:31,211 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,211 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:31,211 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,212 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.230.779 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.231.501 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.231.894 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. 
[WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.232.009 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.232.314 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.232.445 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.232.539 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.232.648 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.232.831 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899780:281473133964992,MainProcess):2025-07-15-10:17:31.233.041 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:31.235.127 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40586, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:31.235.197 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(899780,ffff0c8eefa0,python):2025-07-15-10:17:31.235.149 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40586 to 127.0.0.1:7124 is successfully created. System errno: Success 2025-07-15 10:17:31,350 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,351 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:17:31,351 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,352 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:31,358 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,359 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:31,359 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,360 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.370.394 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.371.165 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.371.564 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.371.683 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.372.002 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.372.139 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.372.237 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. 
[WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.372.351 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.372.541 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899784:281473510076096,MainProcess):2025-07-15-10:17:31.372.764 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899784,ffff22f9efa0,python):2025-07-15-10:17:31.374.713 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40598 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:31.374.713 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40598, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:31.374.909 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40614, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899784,ffff23fbefa0,python):2025-07-15-10:17:31.374.939 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40614 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:31.374.949 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.378.723 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.379.488 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.379.884 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.379.998 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. 
[WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.380.304 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.380.440 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.380.538 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.380.647 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.380.827 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899788:281473237184192,MainProcess):2025-07-15-10:17:31.381.042 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899788,ffff12b6efa0,python):2025-07-15-10:17:31.382.787 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40616 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:31.382.787 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40616, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:31.382.968 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40620, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899788,ffff13b8efa0,python):2025-07-15-10:17:31.382.997 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40620 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:31.383.007 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:31.492.064 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:31.492.107 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server. 2025-07-15 10:17:31,568 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:31,569 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:17:31,569 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:17:31,570 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:31.579.899 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:173] Register] Failed to connect to the meta server node url: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:31.579.935 [mindspore/ccsrc/distributed/cluster/topology/compute_graph_node.cc:363] ReconnectWithTimeoutWindow] Failed to register and try to reconnect to the meta server. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.588.752 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.589.502 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.589.915 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.590.038 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.590.358 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.590.517 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.590.620 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.590.735 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. 
[WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.590.931 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(899792:281473398599360,MainProcess):2025-07-15-10:17:31.591.155 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:31.593.197 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:40636, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899792,ffff17ffefa0,python):2025-07-15-10:17:31.593.196 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40636 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:31.593.272 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:31.655.459 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40646, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899772,ffff2a23efa0,python):2025-07-15-10:17:31.655.486 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40646 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:31.655.504 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:31.707.252 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40662, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899776,ffff19a2efa0,python):2025-07-15-10:17:31.707.287 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40662 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:31.707.294 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:31.735.445 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40668, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899780,ffff0d90efa0,python):2025-07-15-10:17:31.735.476 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40668 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:31.735.491 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:31.875.584 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). 
[WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:31.883.428 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:31.992.365 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40676, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899768,ffff3770efa0,python):2025-07-15-10:17:31.992.399 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40676 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:31.992.411 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:32.080.196 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40688, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899764,ffff2497efa0,python):2025-07-15-10:17:32.080.228 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40688 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:32.080.237 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:32.093.509 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:40698, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899792,ffff1d55efa0,python):2025-07-15-10:17:32.093.535 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40698 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:32.093.552 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:32.156.116 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:32.207.700 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:32.235.956 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:32.375.696 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:32.383.538 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). 
[WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:32.492.642 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:40700, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899768,ffff366eefa0,python):2025-07-15-10:17:32.492.677 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40700 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:32.492.684 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:32.580.473 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 23 source: 127.0.0.1:40708, destination: 127.0.0.1:7124 [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:32.580.511 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7124 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(899764,ffff1f7eefa0,python):2025-07-15-10:17:32.580.529 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:40708 to 127.0.0.1:7124 is successfully created. System errno: Success [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:32.594.144 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:32.656.245 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:32.707.802 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:32.736.068 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:32.875.798 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400). [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:32.883.639 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400). [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:32.993.168 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:33.081.062 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:33.094.258 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:33.156.359 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400). [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:33.207.905 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400). 
[WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:33.236.175 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400). [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:33.375.899 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/14400). [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:33.383.738 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(4/14400). [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:33.493.279 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:33.581.243 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:33.581.284 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:33.594.443 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:33.594.485 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 7 rank id: 7 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:33.656.541 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:33.656.582 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:33.708.064 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:33.708.103 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:33.736.356 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:33.736.401 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 4 rank id: 4 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. 
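A quick arithmetic check on the MOC report at the end of the line above (and its repeats from the other ranks below), assuming the runtime's usual 1G = 1024M convention:

    # User-defined MindSpore MOC size vs. the value the runtime reports as used.
    user_defined_gb = 54
    print(user_defined_gb * 1024)   # 55296 -> matches "MindSpore Used MOC Size:55296M"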
[MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:33.876.072 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:33.876.114 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 5 rank id: 5 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:33.883.981 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:33.884.030 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 6 rank id: 6 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62092M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:33.993.454 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:33.993.493 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1 [MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [MS_RUNTIME_PROF]Device MOC Size:62420M, Device free MOC Size:62091M, Reserved MOC size for Other Components(HCCL/rts/etc.):7124M, Recommend Reserved MOC size for Other Components:3880M, User define MindSpore MOC Size:54G, MindSpore Used MOC Size:55296M. [WARNING] GRAPH_KERNEL(899764,ffffa930eec0,python):2025-07-15-10:17:35.289.529 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:35.293.142 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:35.293.362 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(899764,fffeccbaefa0,python):2025-07-15-10:17:35.293.597 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899764,fffeccbaefa0,python):2025-07-15-10:17:35.293.692 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(899764,fffeccbaefa0,python):2025-07-15-10:17:35.293.729 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(899764,fffeccbaefa0,python):2025-07-15-10:17:35.293.760 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group 2025-07-15 10:17:35,295 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt' [WARNING] DISTRIBUTED(899764,fffeccbaefa0,python):2025-07-15-10:17:35.301.251 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(899764,fffe7e7cefa0,python):2025-07-15-10:17:35.301.565 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,321 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,322 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,322 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,322 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,323 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,323 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:17:35,323 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,323 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,324 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,324 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:17:35,324 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:17:35,325 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,325 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt' [WARNING] GRAPH_KERNEL(899792,ffffa1efeec0,python):2025-07-15-10:17:35.330.993 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:35.334.584 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:35.334.811 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(899792,fffec506efa0,python):2025-07-15-10:17:35.335.050 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899792,fffec506efa0,python):2025-07-15-10:17:35.335.147 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(899792,fffec506efa0,python):2025-07-15-10:17:35.335.183 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(899792,fffec506efa0,python):2025-07-15-10:17:35.335.212 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(899792,fffec506efa0,python):2025-07-15-10:17:35.335.677 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(899792,fffec485efa0,python):2025-07-15-10:17:35.336.001 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,336 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt' 2025-07-15 10:17:35,363 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,363 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,364 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,364 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,364 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,364 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:17:35,364 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,365 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,365 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,365 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:17:35,365 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:17:35,366 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,367 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt' [WARNING] GRAPH_KERNEL(899772,ffffaebdeec0,python):2025-07-15-10:17:35.406.623 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:35.410.274 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:35.410.526 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(899772,fffed22eefa0,python):2025-07-15-10:17:35.410.783 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899772,fffed22eefa0,python):2025-07-15-10:17:35.410.880 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(899772,fffed22eefa0,python):2025-07-15-10:17:35.410.917 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(899772,fffed22eefa0,python):2025-07-15-10:17:35.410.950 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(899772,fffed22eefa0,python):2025-07-15-10:17:35.411.369 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(899772,fffed1adefa0,python):2025-07-15-10:17:35.411.689 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,412 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt' 2025-07-15 10:17:35,439 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,439 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,439 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,440 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,440 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,440 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:17:35,440 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,441 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,441 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,441 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:17:35,441 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:17:35,442 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,443 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt' 2025-07-15 10:17:35,453 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:17:35,454 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:17:35,454 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:17:35,454 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:17:35,455 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,455 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,455 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,456 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,456 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,457 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:17:35,457 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,457 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:17:35,457 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:17:35,457 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:17:35,458 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:17:35,458 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:17:35,466 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,466 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,466 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:17:35,467 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,467 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 
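The global-batch-size line above is printed a little awkwardly ("... = 4 = 1 * 2 * 2 * 1)"); spelled out with the values it reports:

    batch_size, data_parallel = 1, 2
    micro_batch_num, micro_batch_interleave_num = 2, 1
    global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num
    print(global_batch_size)  # 4, as the trainer log states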
[WARNING] GRAPH_KERNEL(899776,ffff9e3ceec0,python):2025-07-15-10:17:35.483.055 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". graph_kernel_flags = "None" [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:35.486.417 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:35.486.642 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(899776,fffec187efa0,python):2025-07-15-10:17:35.486.866 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899776,fffec187efa0,python):2025-07-15-10:17:35.486.965 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(899776,fffec187efa0,python):2025-07-15-10:17:35.487.001 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(899776,fffec187efa0,python):2025-07-15-10:17:35.487.034 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(899776,fffec187efa0,python):2025-07-15-10:17:35.487.400 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(899776,fffec106efa0,python):2025-07-15-10:17:35.487.696 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,488 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt' [WARNING] GRAPH_KERNEL(899780,ffff9229eec0,python):2025-07-15-10:17:35.500.918 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. "--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". 
graph_kernel_flags = "None" [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:35.504.456 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 2025-07-15 10:17:35,504 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:35.504.697 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group 2025-07-15 10:17:35,504 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 [WARNING] DEVICE(899780,fffeb587efa0,python):2025-07-15-10:17:35.504.943 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899780,fffeb587efa0,python):2025-07-15-10:17:35.505.035 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(899780,fffeb587efa0,python):2025-07-15-10:17:35.505.073 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(899780,fffeb587efa0,python):2025-07-15-10:17:35.505.106 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group 2025-07-15 10:17:35,505 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
[WARNING] DISTRIBUTED(899780,fffeb587efa0,python):2025-07-15-10:17:35.505.536 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group 2025-07-15 10:17:35,505 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] [WARNING] DEVICE(899780,fffeb506efa0,python):2025-07-15-10:17:35.505.885 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,505 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,506 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,506 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt' 2025-07-15 10:17:35,506 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,506 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,507 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,507 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:17:35,507 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,508 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:17:35,508 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:17:35,508 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:17:35,508 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 
2025-07-15 10:17:35,509 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:17:35,515 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,515 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,515 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,515 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,516 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,516 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:17:35,516 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,516 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO 
- Create training dataset finish, dataset size:6 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,517 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:17:35,518 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:17:35,518 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,518 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt' 2025-07-15 10:17:35,533 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,533 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,533 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,533 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,534 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,534 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:17:35,534 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,535 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,535 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,535 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:17:35,535 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:17:35,536 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,536 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt' 2025-07-15 10:17:35,588 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:17:35,588 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:17:35,589 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:17:35,589 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:17:35,589 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,590 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,590 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,590 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,590 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,591 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:17:35,591 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,591 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:17:35,591 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:17:35,592 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:17:35,592 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:17:35,593 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:17:35,600 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,600 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,600 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:17:35,601 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,601 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 
2025-07-15 10:17:35,633 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:17:35,633 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:17:35,634 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:17:35,634 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:17:35,634 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,635 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,635 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,635 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,636 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,636 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:17:35,636 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,636 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:17:35,637 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:17:35,637 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:17:35,637 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:17:35,638 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None [WARNING] GRAPH_KERNEL(899788,ffff9850eec0,python):2025-07-15-10:17:35.639.784 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. 
"--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". graph_kernel_flags = "None" [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:35.643.309 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:35.643.534 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(899788,fffebb2eefa0,python):2025-07-15-10:17:35.643.770 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899788,fffebb2eefa0,python):2025-07-15-10:17:35.643.874 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(899788,fffebb2eefa0,python):2025-07-15-10:17:35.643.909 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(899788,fffebb2eefa0,python):2025-07-15-10:17:35.643.940 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(899788,fffebb2eefa0,python):2025-07-15-10:17:35.644.337 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(899788,fffebaadefa0,python):2025-07-15-10:17:35.644.652 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,645 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,645 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt' 2025-07-15 10:17:35,645 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,645 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:17:35,646 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,646 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... [WARNING] GRAPH_KERNEL(899784,ffffa894eec0,python):2025-07-15-10:17:35.670.615 [mindspore/ccsrc/backend/common/graph_kernel/graph_kernel_flags.cc:116] ParseFlags] For 'context.set_context', the flag 'None' in the parameter 'graph_kernel_flags' is invalid. Valid flag format is "--key=value", flags are separated by spaces(e.g. 
"--key1=value1 --key2=value2"). bool flag's value can be implicit, the "--key" means "--key=true". graph_kernel_flags = "None" 2025-07-15 10:17:35,670 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:17:35,671 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:17:35,671 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:17:35,671 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,672 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:35.673.902 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 2025-07-15 10:17:35,673 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. [WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:35.674.151 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 
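The moe_config dump above is easier to read as a plain dict; the sketch below keeps only a few of the logged fields and adds two sanity checks that are assumptions for illustration, not MindFormers validation rules:

moe_config = {
    "expert_num": 4,            # routed experts
    "num_experts_chosen": 2,    # top-k per token
    "shared_expert_num": 1,
    "capacity_factor": 1.5,
    "routing_policy": "TopkRouterV2",
}
# assumed checks, for illustration only
assert moe_config["num_experts_chosen"] <= moe_config["expert_num"]
assert moe_config["capacity_factor"] > 0
print(f"top-{moe_config['num_experts_chosen']} of {moe_config['expert_num']} routed experts, "
      f"{moe_config['shared_expert_num']} shared expert, policy {moe_config['routing_policy']}")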
[WARNING] DEVICE(899784,fffe8f7eefa0,python):2025-07-15-10:17:35.674.390 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7124, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(899784,fffe8f7eefa0,python):2025-07-15-10:17:35.674.506 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... [WARNING] HCCL_ADPT(899784,fffe8f7eefa0,python):2025-07-15-10:17:35.674.545 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} [WARNING] DEVICE(899784,fffe8f7eefa0,python):2025-07-15-10:17:35.674.579 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:17:35,674 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 
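The HcclSetGlobalCommInfo warnings above come from a failed dynamic-symbol lookup in libhccl_plugin.so. A hedged, standalone sketch of such a probe using ctypes (the library name is taken from the log and is machine-specific; this is not the MindSpore loader itself):

import ctypes

def has_symbol(lib_path: str, symbol: str) -> bool:
    """Return True if the shared library loads and exports `symbol`."""
    try:
        lib = ctypes.CDLL(lib_path)
        getattr(lib, symbol)          # raises AttributeError when the symbol is absent
        return True
    except (OSError, AttributeError):
        return False

# On the CANN package used in this run the symbol is missing, hence the warning above.
print(has_symbol("libhccl_plugin.so", "HcclSetGlobalCommInfo"))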
[WARNING] DISTRIBUTED(899784,fffe8f7eefa0,python):2025-07-15-10:17:35.675.085 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(899784,fffe8efdefa0,python):2025-07-15-10:17:35.675.395 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 1 2025-07-15 10:17:35,675 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,675 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:17:35,675 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt' 2025-07-15 10:17:35,675 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt' 2025-07-15 10:17:35,682 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,683 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,683 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:17:35,683 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,683 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:17:35,702 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:17:35,703 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:17:35,703 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:17:35,703 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:17:35,704 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:17:35,704 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
2025-07-15 10:17:35,704 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 0, 'moe_intermediate_size': 512, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': True, 'npu_nums_per_device': 2, 'use_gmm': False, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:17:35,704 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:17:35,705 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:17:35,705 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:17:35,705 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:17:35,706 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:17:35,706 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt' 2025-07-15 10:17:35,802 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:35,807 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 
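Two facts in the parallel_config record above can be checked with plain arithmetic: the product of the parallel dimensions must cover the 8 launched workers, and vocab_emb_dp is reset as soon as pipeline_stage exceeds 1. A minimal sketch using the logged values:

parallel_config = {"data_parallel": 2, "model_parallel": 2, "context_parallel": 1,
                   "pipeline_stage": 2, "expert_parallel": 2, "micro_batch_num": 2}

devices = (parallel_config["data_parallel"] * parallel_config["model_parallel"]
           * parallel_config["context_parallel"] * parallel_config["pipeline_stage"])
assert devices == 8  # matches the 8 worker processes of this test case

vocab_emb_dp = True
if parallel_config["pipeline_stage"] > 1:
    vocab_emb_dp = False  # mirrors "vocab_emd_dp will be reset to False" in the log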
2025-07-15 10:17:35,810 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:17:35,811 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:17:35,811 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:17:35,811 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:17:35,812 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,812 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,813 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,813 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,813 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,814 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:17:35,814 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,814 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:17:35,814 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:17:35,814 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:17:35,815 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 
2025-07-15 10:17:35,816 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:17:35,822 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,823 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,823 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:17:35,823 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,823 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:17:35,842 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:35,847 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:17:35,849 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:17:35,850 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:17:35,850 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:17:35,850 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:17:35,851 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:17:35,851 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,851 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:17:35,852 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:17:35,852 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:17:35,852 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 
2025-07-15 10:17:35,852 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:17:35,853 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:17:35,853 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:17:35,853 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:17:35,853 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:17:35,854 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:17:35,861 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:17:35,862 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:17:35,862 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:17:35,862 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:17:35,862 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:17:35,927 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:35,933 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:17:35,957 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:35,962 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:17:35,980 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:35,980 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:35,980 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 2025-07-15 10:17:35,980 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:35,981 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:35,981 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 
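The per-stage layer bookkeeping above ("num_layers per stage: [[1, 1]]", accumulated "[[1, 2]]", pipeline id list "[0, 1]") follows from splitting the model's 2 layers over 2 pipeline stages. A simplified sketch, not the mindformers.models.utils implementation:

from itertools import accumulate

def layers_per_stage(num_layers: int, pipeline_stages: int) -> list:
    base, extra = divmod(num_layers, pipeline_stages)
    # assumption for the sketch: any remainder goes to the earliest stages
    return [base + (1 if i < extra else 0) for i in range(pipeline_stages)]

per_stage = layers_per_stage(2, 2)
assert per_stage == [1, 1]
assert list(accumulate(per_stage)) == [1, 2]
stage_of_layer = [s for s, n in enumerate(per_stage) for _ in range(n)]
assert stage_of_layer == [0, 1]   # pipeline id of layer 0 and layer 1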
2025-07-15 10:17:35,981 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:35,981 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} 2025-07-15 10:17:35,981 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:35,981 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:17:35,994 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:35,999 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:17:36,001 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,021 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:36,021 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:36,022 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 2025-07-15 10:17:36,022 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:36,022 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:36,022 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 
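The select_recompute and select_comm_recompute entries above are regex patterns applied to cell names. The sketch below shows the matching idea only; the candidate names are invented for the example, and only the patterns come from the log:

import re

select_recompute = [r"feed_forward\.mul", r"feed_forward\.w1\.activation\.silu"]
select_comm_recompute = [r".*\.norm"]

candidates = ["feed_forward.mul", "feed_forward.w1.activation.silu",
              "attention.wo", "ffn.norm"]
for name in candidates:
    recompute = any(re.fullmatch(p, name) for p in select_recompute)
    comm_recompute = any(re.fullmatch(p, name) for p in select_comm_recompute)
    print(f"{name}: recompute={recompute}, comm_recompute={comm_recompute}")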
2025-07-15 10:17:36,022 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:36,023 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} 2025-07-15 10:17:36,023 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:36,023 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:17:36,037 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,042 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,056 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,078 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,089 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,095 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,097 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,098 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 2025-07-15 10:17:36,106 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 2025-07-15 10:17:36,106 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False 2025-07-15 10:17:36,109 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:36,109 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:36,109 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 2025-07-15 10:17:36,109 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:36,110 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:36,110 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 
2025-07-15 10:17:36,110 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:36,110 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} 2025-07-15 10:17:36,110 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:36,110 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:17:36,114 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,114 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 2025-07-15 10:17:36,114 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,115 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 2025-07-15 10:17:36,117 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 2025-07-15 10:17:36,117 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", "model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", 
"model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,130 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,130 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,136 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,136 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:36,136 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:36,136 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 2025-07-15 10:17:36,137 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:36,137 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:36,137 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:17:36,137 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:36,137 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} 2025-07-15 10:17:36,137 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:36,138 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:17:36,139 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 2025-07-15 10:17:36,147 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 2025-07-15 10:17:36,147 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False 2025-07-15 10:17:36,148 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:36,153 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:17:36,155 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,155 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 2025-07-15 10:17:36,155 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,155 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 
2025-07-15 10:17:36,156 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train.......... 2025-07-15 10:17:36,156 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config.......... 2025-07-15 10:17:36,157 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,157 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 2025-07-15 10:17:36,158 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", "model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", "model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,165 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,178 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:36,178 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:36,178 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 
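The optimizer param-group dump above separates weights into a decay group (weight_decay 0.1) and a no_decay group (weight_decay 0.0). Re-deriving that split with the simple assumption that any parameter whose name contains "norm" is exempt from decay reproduces the logged grouping for this model; the short name list below is a sample, not the full set:

param_names = [
    "model.tok_embeddings.embedding_weight",
    "model.layers.0.attention.q2l_proj.weight",
    "model.layers.0.ffn_norm.weight",
    "model.layers.0.attention.lq_norm.weight",
    "model.norm_out.weight",
    "lm_head.weight",
]
groups = {"decay": {"weight_decay": 0.1, "params": []},
          "no_decay": {"weight_decay": 0.0, "params": []}}
for name in param_names:
    key = "no_decay" if "norm" in name else "decay"
    groups[key]["params"].append(name)

print(groups["no_decay"]["params"])  # the three *norm* weights from the sample above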
2025-07-15 10:17:36,178 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:36,179 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:36,179 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:17:36,179 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:36,179 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} [WARNING] DISTRIBUTED(899764,ffffa930eec0,python):2025-07-15-10:17:36.179.767 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: cb4ececddcb4517ca0bcddafd23813b9 [const vector]{0, 4}, async: 0, submit_now: 1 2025-07-15 10:17:36,179 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:36,180 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:17:36,181 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:17:36,184 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,186 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:17:36,192 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,196 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train.......... 2025-07-15 10:17:36,197 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config.......... 2025-07-15 10:17:36,198 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,211 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,217 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 [WARNING] DISTRIBUTED(899792,ffffa1efeec0,python):2025-07-15-10:17:36.220.081 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: e30609fbce6a1a756f50a31ec86eae83 [const vector]{3, 7}, async: 0, submit_now: 1 2025-07-15 10:17:36,223 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,226 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 
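The CreateCommunicationGroup warnings above register two-rank groups such as {0, 4}, {2, 6} and {3, 7}. Reading them as pairs of ranks four positions apart (8 devices split over 2 pipeline stages) is an assumption of this sketch, which only reproduces the pairing arithmetic:

world_size = 8
pipeline_stage = 2
ranks_per_stage = world_size // pipeline_stage   # 4 ranks in each stage

pairs = [[r, r + ranks_per_stage] for r in range(ranks_per_stage)]
print(pairs)   # [[0, 4], [1, 5], [2, 6], [3, 7]]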
2025-07-15 10:17:36,232 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,235 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 2025-07-15 10:17:36,235 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False 2025-07-15 10:17:36,242 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,243 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 2025-07-15 10:17:36,243 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,243 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 2025-07-15 10:17:36,243 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,245 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 2025-07-15 10:17:36,246 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", "model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", 
"model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,249 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,251 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,252 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 2025-07-15 10:17:36,260 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 2025-07-15 10:17:36,260 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False 2025-07-15 10:17:36,267 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,268 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 2025-07-15 10:17:36,268 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,268 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 2025-07-15 10:17:36,270 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 
2025-07-15 10:17:36,271 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", "model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", "model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,283 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,284 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train.......... 2025-07-15 10:17:36,284 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config.......... 2025-07-15 10:17:36,288 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,291 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 2025-07-15 10:17:36,299 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 
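The "model built, but weights is unloaded" message above is gated on checkpoint_name_or_path being present and non-empty in the model config. A hypothetical sketch of that gate (maybe_load_weights and the example checkpoint name are placeholders, not MindFormers APIs):

def maybe_load_weights(config: dict) -> str:
    ckpt = config.get("checkpoint_name_or_path")
    if not ckpt:   # attribute missing or None, as in this test run
        return "model built, but weights not loaded"
    return f"would load weights from {ckpt}"   # a real pipeline would load the checkpoint here

print(maybe_load_weights({}))
print(maybe_load_weights({"checkpoint_name_or_path": "deepseekv3_example.ckpt"}))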
2025-07-15 10:17:36,300 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False [WARNING] DISTRIBUTED(899772,ffffaebdeec0,python):2025-07-15-10:17:36.307.375 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: d9639340c2f0051c1a7a09da5ef07ed4 [const vector]{2, 6}, async: 0, submit_now: 1 2025-07-15 10:17:36,307 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,307 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 2025-07-15 10:17:36,307 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,308 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 2025-07-15 10:17:36,308 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train.......... 2025-07-15 10:17:36,308 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config.......... 2025-07-15 10:17:36,309 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 2025-07-15 10:17:36,310 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", 
"model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", "model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,329 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:36,330 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:36,330 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 2025-07-15 10:17:36,330 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:36,330 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:36,331 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. [WARNING] DISTRIBUTED(899776,ffff9e3ceec0,python):2025-07-15-10:17:36.331.094 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: e30609fbce6a1a756f50a31ec86eae83 [const vector]{3, 7}, async: 0, submit_now: 1 2025-07-15 10:17:36,331 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:36,331 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} 2025-07-15 10:17:36,331 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:36,331 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:17:36,347 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train.......... 2025-07-15 10:17:36,347 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config.......... 2025-07-15 10:17:36,351 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,362 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1]] 2025-07-15 10:17:36,362 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2]] 2025-07-15 10:17:36,362 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1] 2025-07-15 10:17:36,362 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0] 2025-07-15 10:17:36,363 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1]] 2025-07-15 10:17:36,363 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 
2025-07-15 10:17:36,363 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0]]} 2025-07-15 10:17:36,363 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0]]} 2025-07-15 10:17:36,363 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:17:36,363 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} [WARNING] DISTRIBUTED(899780,ffff9229eec0,python):2025-07-15-10:17:36.370.208 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: cb4ececddcb4517ca0bcddafd23813b9 [const vector]{0, 4}, async: 0, submit_now: 1 2025-07-15 10:17:36,383 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,386 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,405 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,417 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:17:36,436 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:17:36,438 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,443 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,446 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 2025-07-15 10:17:36,455 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 2025-07-15 10:17:36,455 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False 2025-07-15 10:17:36,462 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,462 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 2025-07-15 10:17:36,463 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,463 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 2025-07-15 10:17:36,465 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 
2025-07-15 10:17:36,466 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", "model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", "model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,469 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,475 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 1 2025-07-15 10:17:36,477 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:134] - INFO - Using 2 data parallel, 1 context parallel and 2 model parallel for the embedding lookup. 2025-07-15 10:17:36,486 - mindformers./output/log[mindformers/models/modeling_utils.py:1494] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None. 2025-07-15 10:17:36,486 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1643] - INFO - Predict run mode:False 2025-07-15 10:17:36,494 - mindformers./output/log[mindformers/trainer/base_trainer.py:715] - INFO - Network Parameters: 91 M. 2025-07-15 10:17:36,494 - mindformers./output/log[mindformers/trainer/base_trainer.py:1010] - INFO - .........Build Optimizer For Train.......... 
2025-07-15 10:17:36,494 - mindformers./output/log[mindformers/trainer/base_trainer.py:581] - INFO - .........Build Optimizer From Config.......... 2025-07-15 10:17:36,494 - mindformers./output/log[mindformers/trainer/base_trainer.py:628] - INFO - .........Build LR Schedule From Config.......... 2025-07-15 10:17:36,496 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:77] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False. 2025-07-15 10:17:36,497 - mindformers./output/log[mindformers/trainer/optimizer_grouped_parameters.py:116] - INFO - Param groups = { "decay": { "weight_decay": 0.1, "params": [ "model.tok_embeddings.embedding_weight", "model.layers.0.attention.q2l_proj.weight", "model.layers.0.attention.l2q_nope_proj.weight", "model.layers.0.attention.l2q_pe_proj.weight", "model.layers.0.attention.kv2l_k_pe.weight", "model.layers.0.attention.kv2l_latent_kv.weight", "model.layers.0.attention.lkv2kv_k_nope.weight", "model.layers.0.attention.lkv2kv_v.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.routed_experts.ffn.w1.weight", "model.layers.0.feed_forward.routed_experts.ffn.w2.weight", "model.layers.0.feed_forward.routed_experts.ffn.w3.weight", "model.layers.0.feed_forward.routed_experts.router.dense.weight", "model.layers.0.feed_forward.shared_experts.w1.weight", "model.layers.0.feed_forward.shared_experts.w2.weight", "model.layers.0.feed_forward.shared_experts.w3.weight", "model.layers.1.attention.q2l_proj.weight", "model.layers.1.attention.l2q_nope_proj.weight", "model.layers.1.attention.l2q_pe_proj.weight", "model.layers.1.attention.kv2l_k_pe.weight", "model.layers.1.attention.kv2l_latent_kv.weight", "model.layers.1.attention.lkv2kv_k_nope.weight", "model.layers.1.attention.lkv2kv_v.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.routed_experts.ffn.w1.weight", "model.layers.1.feed_forward.routed_experts.ffn.w2.weight", "model.layers.1.feed_forward.routed_experts.ffn.w3.weight", "model.layers.1.feed_forward.routed_experts.router.dense.weight", "model.layers.1.feed_forward.shared_experts.w1.weight", "model.layers.1.feed_forward.shared_experts.w2.weight", "model.layers.1.feed_forward.shared_experts.w3.weight", "model.mtp_hidden_fusers.0.dense.weight", "lm_head.weight" ] }, "no_decay": { "weight_decay": 0.0, "params": [ "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.0.attention.lq_norm.weight", "model.layers.0.attention.lkv_norm.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.1.attention.lq_norm.weight", "model.layers.1.attention.lkv_norm.weight", "model.mtp_hidden_fusers.0.norm.weight", "model.mtp_hidden_fusers.0.norm_emb.weight", "model.mtp_norms.0.weight", "model.norm_out.weight" ] } } 2025-07-15 10:17:36,503 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train.......... 2025-07-15 10:17:36,503 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config.......... 
[WARNING] DISTRIBUTED(899788,ffff9850eec0,python):2025-07-15-10:17:36.525.774 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: d9639340c2f0051c1a7a09da5ef07ed4 [const vector]{2, 6}, async: 0, submit_now: 1
2025-07-15 10:17:36,536 - mindformers./output/log[mindformers/trainer/base_trainer.py:1019] - INFO - .........Build Running Wrapper From Config For Train..........
2025-07-15 10:17:36,536 - mindformers./output/log[mindformers/trainer/base_trainer.py:665] - INFO - .........Build Model Wrapper for Train From Config..........
[WARNING] DISTRIBUTED(899784,ffffa894eec0,python):2025-07-15-10:17:36.559.333 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 12426c956d1bc5017082b12a969b0b7c [const vector]{1, 5}, async: 0, submit_now: 1
2025-07-15 10:20:04,480 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
2025-07-15 10:20:04,481 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
    build_context(config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in <module>
    main(config_)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
    raise exc
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
    build_context(config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

[WARNING] DEVICE(899768,ffffbc09eec0,python):2025-07-15-10:20:04.543.607 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed
[ERROR] ME(899444:281473369566912,MainProcess):2025-07-15-10:20:06.442.664 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 899768 exit with exception. Error code: 1.
[WARNING] ME(899444:281473369566912,MainProcess):2025-07-15-10:20:06.443.037 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers.
[ERROR] ME(899444:281473369566912,MainProcess):2025-07-15-10:20:40.105.838 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 899762 exit with exception.
[ERROR] ME(899444:281473369566912,MainProcess):2025-07-15-10:20:40.107.351 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['1']
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-38-[WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:33.493.279 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400).
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-39-[MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-40-[WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:33.993.454 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-41-[WARNING] DISTRIBUTED(899768,ffffbc09eec0,python):2025-07-15-10:17:33.993.493 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-42-[MS_RUNTIME_PROF]The jit_level is: O1, and enable kernelbykernel executor.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log:43:2025-07-15 10:20:04,480 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log:44:2025-07-15 10:20:04,481 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-45- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-46- result = run_func(*args, **kwargs)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-47- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-48- build_context(config)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-49- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
--
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-52- self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-53- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-54- init()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-55- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-56- init_hccl()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log:57:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-58-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-59----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-60-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-61----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-62-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-63-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-64-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log:65:Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-66- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in <module>
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-67- main(config_)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-68- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-69- raise exc
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-70- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
--
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-77- self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-78- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-79- init()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-80- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-81- init_hccl()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log:82:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-83-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-84----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-85-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-86----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_1.log-87-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice
--
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-114-[WARNING] DISTRIBUTED(899762,ffff8abfeec0,python):2025-07-15-10:20:23.090.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-115-[WARNING] DISTRIBUTED(899762,ffff8abfeec0,python):2025-07-15-10:20:28.090.633 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 8 alive nodes.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-116-[WARNING] DISTRIBUTED(899762,ffff8abfeec0,python):2025-07-15-10:20:28.090.668 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-117-[WARNING] DISTRIBUTED(899762,ffff8abfeec0,python):2025-07-15-10:20:33.090.767 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 8 alive nodes.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-118-[WARNING] DISTRIBUTED(899762,ffff8abfeec0,python):2025-07-15-10:20:33.090.804 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:119:[ERROR] DISTRIBUTED(899762,ffff0525efa0,python):2025-07-15-10:20:35.107.905 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 1 is timed out. It may exit with exception, please check this node's log.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:120:[ERROR] DISTRIBUTED(899762,ffff8abfeec0,python):2025-07-15-10:20:38.090.908 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:121:2025-07-15 10:20:38,091 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:122:2025-07-15 10:20:38,092 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-123- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-124- result = run_func(*args, **kwargs)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-125- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-126- build_context(config)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-127- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
--
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-130- self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-131- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-132- init()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-133- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-134- init_cluster()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:135:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-136-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-137----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-138-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-139----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-140-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-141-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-142-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:143:Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-144- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in <module>
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-145- main(config_)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-146- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-147- raise exc
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-148- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
--
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-155- self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-156- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-157- init()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-158- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-159- init_cluster()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log:160:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-161-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-162----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-163-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-164----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/scheduler.log-165-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState

Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/.
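The failure chain above starts in worker 1: mindspore.communication.init() calls init_hccl(), which aborts because aclrtSetDevice returns 507033 for device id 1 even though 8 devices are reported; the scheduler then times the node out and msrun tears the whole job down. A minimal sketch of the same call path outside the test harness is shown below; it assumes an Ascend host with the dynamic-cluster environment already exported by msrun, and the device id is simply taken from this run.

# Sketch only: replay the init path that fails in worker 1 of this run.
# Assumes msrun (or an equivalent launcher) has already exported the
# cluster environment variables; without them init() cannot build the cluster.
import mindspore as ms
from mindspore.communication import init

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend", device_id=1)
init()  # -> init_hccl(); in this run it raised aclrtSetDevice ret[507033]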
[MS_DEV_RUNTIME_CONF]Runtime config: memory_statistics:True
F

=================================== FAILURES ===================================
______________ test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm ______________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level0', card_mark='allcards', essential_mark='essential')
    def test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm():
        """
        Feature: test deepseekv3 cell dp2mp2ep4pp2mb4gas1bs1 8p bmm
        Description: test deepseekv3 cell dp2mp2ep4pp2mb4gas1bs1 8p bmm
        Expectation: st pass
        """
        case_name = "deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm"
        sh_path = os.path.split(os.path.realpath(__file__))[0]
        parallel_speed_up_json = {'matmul_grad_comm_overlap': True, "pp_1f1b_overlap": "AlltoAllV,AlltoAll"}
        deepseek_config = DeepseekConfig(parallel_speed_up_json=parallel_speed_up_json,
                                         use_gmm=False, num_layer=1,
                                         pp_interleave_num=1, first_k_dense_replace=0)
        file_path = prepare_deepseekv3_testcase_env(case_name, deepseek_config)
        device_num = 8
        master_port = 7124
        hccl_if_base_port = 63355
        env_cmd = 'export MS_DEV_RUNTIME_CONF="memory_statistics:True";'
        env_cmd += 'export MS_MEMORY_STATISTIC=1'
        os.system(f"{env_cmd}; bash {sh_path}/run_llm.sh {device_num} {file_path} \
                  {case_name} {master_port} {hccl_if_base_port} pp")
        # check train over
        check_pair = {"Training Over": 1}
        real_log_path = log_path_preprocess(case_name, device_num)
        for log_path in real_log_path:
>           check_log(log_path, check_pair)

test_deepseekv3_pretrain.py:173: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

file_path = './deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_0.log'
check_pairs = {'Training Over': 1}

    def check_log(file_path, check_pairs=None):
        # check the number of key in check_pairs in log file is equal to the value
        log_error_count = subprocess.check_output(
            ["grep -rE '%s' %s | wc -l" % ("ERROR|Traceback", file_path)], shell=True)
        log_cnt = str(log_error_count, 'utf-8').strip()
        if log_cnt != "0":
            os.system(f"cat {file_path}")
        assert log_cnt == "0", f"Error found in {file_path}"
        if check_pairs is not None:
            for key_word, value in check_pairs.items():
                log_output = subprocess.check_output(
                    ["grep -r '%s' %s | wc -l" % (key_word, file_path)], shell=True)
                log_cnt = str(log_output, 'utf-8').strip()
>               assert log_cnt == str(value), (f"Failed to find {key_word} in {file_path} or content is not correct."
                                               f"Expected occurrences: {value}, but got {log_cnt}")
E               AssertionError: Failed to find Training Over in ./deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_0.log or content is not correct.Expected occurrences: 1, but got 0

../utils.py:160: AssertionError
=========================== short test summary info ============================
FAILED test_deepseekv3_pretrain.py::test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm
======================== 1 failed in 215.82s (0:03:35) =========================
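The assertion that fails is the second check in check_log: each worker log must contain no ERROR/Traceback lines and exactly one "Training Over" line, and this run has zero of the latter in worker_0.log (plus the ERROR lines shown above). A self-contained replay of that criterion in pure Python, instead of the grep/wc pipeline used by ../utils.py, might look like the sketch below; the function name and the path are illustrative.

# Sketch: replay the pass criterion from check_log() without shelling out.
import re
from pathlib import Path

def replay_check(log_file, key_word="Training Over", expected=1):
    text = Path(log_file).read_text(errors="ignore")
    error_hits = len(re.findall(r"ERROR|Traceback", text))
    key_hits = text.count(key_word)
    assert error_hits == 0, f"{error_hits} ERROR/Traceback hits in {log_file}"
    assert key_hits == expected, f"expected {expected}x '{key_word}', got {key_hits}"

replay_check("./deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_bmm/worker_0.log")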