============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3, configfile: ../../../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_deepseekv3_pretrain.py enable lazy inline in pp
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
Start worker process with rank id:0, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_0.log. Environment variable [RANK_ID=0] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.424.574 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 144-167 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:1, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log. Environment variable [RANK_ID=1] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.469.078 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 24-47 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:2, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_2.log. Environment variable [RANK_ID=2] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.514.275 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 96-119 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:3, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_3.log. Environment variable [RANK_ID=3] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.560.275 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 72-95 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:4, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_4.log. Environment variable [RANK_ID=4] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.608.008 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 0-23 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:5, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_5.log. Environment variable [RANK_ID=5] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.659.136 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 120-143 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:6, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_6.log. Environment variable [RANK_ID=6] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.711.456 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 48-71 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
Start worker process with rank id:7, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_7.log. Environment variable [RANK_ID=7] is exported.
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.762.899 [mindspore/parallel/cluster/process_entity/_utils.py:62] Launch process with command: taskset -c 168-191 python /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py --config /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml --register_path /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/research/deepseek3/
[WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:30:15.813.592 [mindspore/parallel/cluster/process_entity/_api.py:267] Distributed job is spawned. Waiting all processes to exit...
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
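Editor's note: the eight taskset-pinned launch commands above all share the same shape (one core range, one RANK_ID, one worker_N.log per rank). A minimal sketch of how such a per-rank spawn loop could be reproduced outside the test harness is shown below; the core ranges and paths are copied from the log, while the loop itself is an illustration and not the actual logic of mindspore.parallel.cluster.

```python
# Hypothetical re-creation of the per-rank launch seen above; the test itself
# delegates this to MindSpore's cluster launcher (process_entity/_utils.py).
import os
import subprocess

CONFIG = "deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/pretrain_deepseek3.yaml"
REGISTER_PATH = "../mindformers/research/deepseek3/"
CORE_RANGES = ["144-167", "24-47", "96-119", "72-95", "0-23", "120-143", "48-71", "168-191"]

workers = []
for rank, cores in enumerate(CORE_RANGES):
    env = dict(os.environ, RANK_ID=str(rank))           # RANK_ID is exported per worker, as logged
    cmd = ["taskset", "-c", cores, "python", "run_mindformer.py",
           "--config", CONFIG, "--register_path", REGISTER_PATH]
    log = open(f"worker_{rank}.log", "w")                # one worker_N.log per rank
    workers.append((subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT), log))

for proc, log in workers:                                # "Waiting all processes to exit..."
    proc.wait()
    log.close()
```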
return self._float_to_str(self.smallest_subnormal) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. setattr(self, word, getattr(machar, word).flat[0]) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero. return self._float_to_str(self.smallest_subnormal) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. setattr(self, word, getattr(machar, word).flat[0]) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero. return self._float_to_str(self.smallest_subnormal) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. setattr(self, word, getattr(machar, word).flat[0]) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero. return self._float_to_str(self.smallest_subnormal) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. setattr(self, word, getattr(machar, word).flat[0]) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero. return self._float_to_str(self.smallest_subnormal) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero. setattr(self, word, getattr(machar, word).flat[0]) /home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero. return self._float_to_str(self.smallest_subnormal) 2025-07-15 10:30:24,059 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,059 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,060 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:24,060 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:30:24,060 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,060 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,061 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:24,061 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:24,074 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,074 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:24,074 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,075 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.791.90 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.793.21 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.799.65 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.800.83 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.803.62 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.804.79 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. 
Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.804.74 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.805.88 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.808.08 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.809.01 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.809.50 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.810.34 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.810.49 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.811.31 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.811.63 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.812.42 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.813.58 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. 
Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.814.26 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907036:281473233514176,MainProcess):2025-07-15-10:30:24.815.73 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907040:281473804267200,MainProcess):2025-07-15-10:30:24.816.37 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:24.083.607 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59020, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907036,ffff127eefa0,python):2025-07-15-10:30:24.083.610 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59020 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:24.083.636 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59028, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907040,ffff3482efa0,python):2025-07-15-10:30:24.083.659 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59028 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:24.083.684 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:24.083.715 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.937.33 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.944.56 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.948.66 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.949.75 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. 
Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.952.80 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.954.12 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.955.04 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.956.13 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.957.82 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907048:281473494871744,MainProcess):2025-07-15-10:30:24.959.63 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:24.097.592 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59034, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907048,ffff2212efa0,python):2025-07-15-10:30:24.097.613 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59034 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:24.097.645 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 2025-07-15 10:30:24,149 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,149 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:30:24,150 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,150 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:24,164 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,165 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:24,165 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,166 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.170.137 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.170.879 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.171.276 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.171.388 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.171.690 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.171.821 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.171.916 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. 
[WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.172.024 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.172.206 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907052:281473278013120,MainProcess):2025-07-15-10:30:24.172.406 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:24.174.283 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59036, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907052,ffff1525efa0,python):2025-07-15-10:30:24.174.307 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59036 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:24.174.351 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.184.539 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.185.292 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.185.694 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.185.810 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.186.122 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.186.260 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. 
[WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.186.355 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.186.479 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.186.667 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907044:281473352462016,MainProcess):2025-07-15-10:30:24.186.876 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(907044,ffff1993efa0,python):2025-07-15-10:30:24.188.774 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59044 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:24.188.763 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59044, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:24.188.963 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59054, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907044,ffff1a95efa0,python):2025-07-15-10:30:24.188.993 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59054 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:24.189.002 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 2025-07-15 10:30:24,337 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,338 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:24,338 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,338 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:24,349 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,349 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:30:24,349 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,350 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.358.120 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.358.886 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.359.293 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.359.411 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.359.735 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.359.876 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.359.973 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.360.085 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.360.276 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907056:281473104998080,MainProcess):2025-07-15-10:30:24.360.500 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. 
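Editor's note: every worker prints the same block of `context.set_context` deprecation warnings, and each warning names its replacement API. A minimal migration sketch based only on those named replacements is below; the argument values are placeholders and the keyword name passed to `runtime.set_memory` is an assumption, since the warnings give the target APIs but not their signatures.

```python
# Sketch of the migration suggested by the deprecation warnings above.
# Values are placeholders; the set_memory keyword name is an assumption.
import os
import mindspore as ms
from mindspore import runtime

ms.set_device("Ascend")                   # replaces set_context(device_target=...)
ms.set_recursion_limit(10000)             # replaces set_context(max_call_depth=...)
ms.set_deterministic(False)               # replaces set_context(deterministic=...)
runtime.set_memory(max_size="58GB")       # replaces max_device_memory / mempool_block_size /
                                          # memory_optimize_level (keyword name assumed)
os.environ["MS_DEV_SAVE_GRAPHS"] = "0"            # replaces set_context(save_graphs=...)
os.environ["MS_DEV_SAVE_GRAPHS_PATH"] = "./graphs"  # replaces set_context(save_graphs_path=...)
```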
[WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:24.362.477 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59070, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907056,ffff0ad3efa0,python):2025-07-15-10:30:24.362.477 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59070 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:24.362.553 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.368.754 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.369.494 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.369.894 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.370.007 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.370.314 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.370.461 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.370.560 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.370.674 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.370.857 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. 
[WARNING] ME(907062:281472888860352,MainProcess):2025-07-15-10:30:24.371.057 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:24.372.797 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59072, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907062,fffefdf2efa0,python):2025-07-15-10:30:24.372.807 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59072 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:24.372.860 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 2025-07-15 10:30:24,513 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:24,514 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:24,514 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'filepath_prefix', 'processor', 'remove_redundancy', 'resume_by_last_timestamp_ckpt'] 2025-07-15 10:30:24,515 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.533.810 [mindspore/context.py:1412] For 'context.set_context', the parameter 'device_target' will be deprecated and removed in a future version. Please use the api mindspore.set_device() instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.534.593 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_device_memory' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.000 [mindspore/context.py:1412] For 'context.set_context', the parameter 'max_call_depth' will be deprecated and removed in a future version. Please use the api mindspore.set_recursion_limit() instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.120 [mindspore/context.py:1412] For 'context.set_context', the parameter 'ascend_config' will be deprecated and removed in a future version. Please use the api mindspore.device_context.ascend.op_precision.precision_mode(), mindspore.device_context.ascend.op_precision.op_precision_mode(), mindspore.device_context.ascend.op_precision.matmul_allow_hf32(), mindspore.device_context.ascend.op_precision.conv_allow_hf32(), mindspore.device_context.ascend.op_tuning.op_compile() instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.439 [mindspore/context.py:921] For 'context.set_context', 'matmul_grad_comm_overlap' parameter is deprecated, and will be removed in the next version, Please use 'grad_matmul_communication_overlap' instead. 
[WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.580 [mindspore/context.py:1412] For 'context.set_context', the parameter 'memory_optimize_level' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.681 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.794 [mindspore/context.py:1412] For 'context.set_context', the parameter 'save_graphs_path' will be deprecated and removed in a future version. Please use the env MS_DEV_SAVE_GRAPHS_PATH instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.535.985 [mindspore/context.py:1412] For 'context.set_context', the parameter 'deterministic' will be deprecated and removed in a future version. Please use the api mindspore.set_deterministic() instead. [WARNING] ME(907066:281473677127360,MainProcess):2025-07-15-10:30:24.536.201 [mindspore/context.py:1412] For 'context.set_context', the parameter 'mempool_block_size' will be deprecated and removed in a future version. Please use the api mindspore.runtime.set_memory() instead. [WARNING] DISTRIBUTED(907066,ffff2ceeefa0,python):2025-07-15-10:30:24.538.235 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59080 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:24.538.235 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 21 source: 127.0.0.1:59080, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:24.538.452 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59090, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:24.538.494 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 1 [WARNING] DISTRIBUTED(907066,ffff2df0efa0,python):2025-07-15-10:30:24.538.490 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59090 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:24.583.934 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59104, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:24.583.949 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59120, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907036,ffff1380efa0,python):2025-07-15-10:30:24.583.964 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59104 to 127.0.0.1:7125 is successfully created. 
System errno: Success [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:24.583.980 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(907040,ffff3584efa0,python):2025-07-15-10:30:24.583.979 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59120 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:24.583.996 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:24.597.852 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59136, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907048,ffff2314efa0,python):2025-07-15-10:30:24.597.884 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59136 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:24.597.892 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:24.674.576 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59152, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907052,ffff1627efa0,python):2025-07-15-10:30:24.674.607 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59152 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:24.674.621 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:24.689.578 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:24.862.848 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59156, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907056,ffff0bd5efa0,python):2025-07-15-10:30:24.862.871 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59156 to 127.0.0.1:7125 is successfully created. System errno: Success [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:24.862.896 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:24.873.069 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:485] Connect] Connection 22 source: 127.0.0.1:59166, destination: 127.0.0.1:7125 [WARNING] DISTRIBUTED(907062,fffefef4efa0,python):2025-07-15-10:30:24.873.102 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:79] ConnectedEventHandler] Connection from 127.0.0.1:59166 to 127.0.0.1:7125 is successfully created. 
System errno: Success [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:24.873.112 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:25.039.074 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:25.084.585 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:25.084.712 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:25.098.347 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:25.175.198 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:25.189.687 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:25.363.380 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:25.373.627 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:25.539.184 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:25.584.697 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:25.584.821 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:25.598.453 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:25.675.308 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:25.689.791 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(3/14400). [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:25.863.487 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:25.873.762 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. 
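Editor's note: the "Topology build timed out" retries end once all eight workers have registered with the scheduler at 127.0.0.1:7125, after which each process reports "Cluster is successfully initialized" and its rank id (below). run_mindformer.py performs this handshake internally; purely as a sketch of the same step, a standalone worker under MindSpore's dynamic-cluster environment variables might look like the following, where the scheduler host/port values are assumptions mirrored from the log.

```python
# Minimal sketch of the cluster handshake being reported in the log; the real test
# relies on run_mindformer.py to do this. Scheduler address mirrors 127.0.0.1:7125.
import os
from mindspore.communication import init, get_rank, get_group_size

os.environ.setdefault("MS_SCHED_HOST", "127.0.0.1")   # meta server / scheduler address (assumed)
os.environ.setdefault("MS_SCHED_PORT", "7125")
os.environ.setdefault("MS_WORKER_NUM", "8")
os.environ.setdefault("MS_ROLE", "MS_WORKER")

init()                                                # joins hccl_world_group {0..7}
print(f"This node rank id: {get_rank()} of {get_group_size()}")
```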
[WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:25.873.803 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 6 rank id: 6 [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:26.039.373 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:26.039.444 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 7 rank id: 7 [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:26.084.833 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:26.084.879 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 0 rank id: 0 [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:26.084.944 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:26.084.987 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1 [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:26.098.574 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:26.098.616 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 3 rank id: 3 [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:26.175.442 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:26.175.497 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 4 rank id: 4 [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:26.189.926 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:26.189.971 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 2 rank id: 2 [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:26.363.615 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. 
[WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:26.363.662 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 5 rank id: 5 [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:27.621.393 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:27.621.621 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(907062,fffea695efa0,python):2025-07-15-10:30:27.621.854 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(907062,fffea695efa0,python):2025-07-15-10:30:27.621.962 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907062,fffea695efa0,python):2025-07-15-10:30:27.621.998 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907062,fffea695efa0,python):2025-07-15-10:30:27.622.045 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(907062,fffea695efa0,python):2025-07-15-10:30:27.622.358 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 1 2025-07-15 10:30:27,623 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt' 2025-07-15 10:30:27,649 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:27,650 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:27,650 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:27,650 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:27,650 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:30:27,651 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 
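At this point each rank registers the global communication group hccl_world_group over ranks {0, 1, 2, 3, 4, 5, 6, 7} and asynchronously creates its HCCL communicator. As a rough standalone illustration only (this is not the code path run_mindformer.py takes), the same world group can be brought up from user code with mindspore.communication once the launcher has exported the rank environment variables:

    # Minimal sketch; assumes an msrun-style launcher has already exported
    # RANK_ID and related variables for this process, as in the log above.
    from mindspore.communication import init, get_rank, get_group_size

    init("hccl")  # creates and initializes hccl_world_group across all ranks
    print("rank", get_rank(), "of", get_group_size())  # expects 8 ranks here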
2025-07-15 10:30:27,651 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:27,651 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:27,651 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:27,652 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:27,652 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 
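The parallel_config dumped above (data_parallel=2, model_parallel=2, context_parallel=1, pipeline_stage=2, expert_parallel=2) has to tile the launched world size: with the usual device-count relation dp * mp * cp * pp = 2 * 2 * 1 * 2 = 8, matching the eight worker processes (expert_parallel shards the MoE experts within the data-parallel dimension and does not enter this product). A quick sanity check using the values copied from the log; the check itself is only illustrative:

    # Values taken from the parallel_config line above.
    parallel_config = {"data_parallel": 2, "model_parallel": 2,
                       "context_parallel": 1, "pipeline_stage": 2}
    devices = 1
    for axis in ("data_parallel", "model_parallel", "context_parallel", "pipeline_stage"):
        devices *= parallel_config[axis]
    assert devices == 8  # equals the 8 ranks launched for this 8p test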
2025-07-15 10:30:27,653 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:27,653 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt' [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:27.770.405 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:27.770.678 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(907066,fffed587efa0,python):2025-07-15-10:30:27.770.909 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(907066,fffed587efa0,python):2025-07-15-10:30:27.771.009 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907066,fffed587efa0,python):2025-07-15-10:30:27.771.048 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907066,fffed587efa0,python):2025-07-15-10:30:27.771.104 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DEVICE(907066,fffed587efa0,python):2025-07-15-10:30:27.771.581 [mindspore/ccsrc/plugin/device/cpu/hal/hardware/ms_collective_comm_lib.cc:251] QueryUniqueID] Retry to lookup the unique id for group hccl_world_group from the meta server node...Retry time: 399/400, sleep 1 2025-07-15 10:30:27,772 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt' 2025-07-15 10:30:27,783 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:27,784 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:27,784 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:30:27,784 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:27,785 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:27,785 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:27,785 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:30:27,786 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:30:27,786 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:27,786 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:30:27,786 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:27,787 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:30:27,787 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:27,787 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:27,787 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:27,788 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:30:27,794 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:30:27,795 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:27,795 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:27,795 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:30:27,795 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 
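The base_trainer.py:204 message above compresses the effective batch-size computation; as printed it reads "= 4 = 1 * 2 * 2 * 1)" with an unmatched parenthesis, but the intended relation is global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 1 * 2 * 2 * 1 = 4. Restated plainly:

    # Numbers copied from the base_trainer.py:204 log line above.
    batch_size = 1
    data_parallel = 2
    micro_batch_num = 2
    micro_batch_interleave_num = 1
    global_batch_size = (batch_size * data_parallel
                         * micro_batch_num * micro_batch_interleave_num)
    assert global_batch_size == 4  # the value reported in the log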
2025-07-15 10:30:27,799 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:27,799 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:27,799 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:27,799 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:27,800 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:30:27,800 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:30:27,800 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:27,801 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:27,801 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:27,801 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:27,801 - 
mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:30:27,802 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:27,802 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt' [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:27.810.919 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:27.811.165 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(907036,fffebaefefa0,python):2025-07-15-10:30:27.811.407 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(907036,fffebaefefa0,python):2025-07-15-10:30:27.811.505 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907036,fffebaefefa0,python):2025-07-15-10:30:27.811.566 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907036,fffebaefefa0,python):2025-07-15-10:30:27.811.598 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group 2025-07-15 10:30:27,812 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt' [WARNING] DISTRIBUTED(907036,fffebaefefa0,python):2025-07-15-10:30:27.819.332 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(907036,fffeb8ebefa0,python):2025-07-15-10:30:27.819.653 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:27,839 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:27,839 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:27,840 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
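The HcclSetGlobalCommInfo warning that every rank prints comes from a failed dynamic-symbol lookup in libhccl_plugin.so: the installed CANN package predates that API, so the adapter logs "not supported" and continues rather than aborting. The same optional-symbol probe can be reproduced with ctypes (the library path is the environment-specific one from the log):

    import ctypes

    LIB = ("/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/"
           "mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so")

    def has_symbol(lib_path, name):
        """True if the shared object exports `name`; ctypes raises
        AttributeError when the underlying dlsym lookup fails."""
        try:
            getattr(ctypes.CDLL(lib_path), name)
            return True
        except (OSError, AttributeError):
            return False

    print(has_symbol(LIB, "HcclSetGlobalCommInfo"))  # False on this CANN version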
2025-07-15 10:30:27,840 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:27,840 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:30:27,840 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:30:27,840 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:27,841 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:27,841 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:27,841 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:27,842 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 
2025-07-15 10:30:27,842 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:27,843 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt' [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:27.858.579 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:27.858.783 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(907048,fffecaefefa0,python):2025-07-15-10:30:27.858.992 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(907048,fffecaefefa0,python):2025-07-15-10:30:27.859.085 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907048,fffecaefefa0,python):2025-07-15-10:30:27.859.135 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907048,fffecaefefa0,python):2025-07-15-10:30:27.859.165 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(907048,fffecaefefa0,python):2025-07-15-10:30:27.859.629 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(907048,fffeca6eefa0,python):2025-07-15-10:30:27.859.954 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:27,860 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt' 2025-07-15 10:30:27,887 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:27,887 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:27,887 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 
2025-07-15 10:30:27,888 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:27,888 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:30:27,888 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:30:27,888 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:27,889 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:27,889 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:27,889 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:27,889 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 
2025-07-15 10:30:27,890 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:27,890 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt' 2025-07-15 10:30:27,939 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:27,940 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:27,940 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:30:27,940 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:27,941 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:27,941 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:27,942 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:27.942.144 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:27.942.387 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group 2025-07-15 10:30:27,942 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. [WARNING] DEVICE(907052,fffec880efa0,python):2025-07-15-10:30:27.942.615 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 2025-07-15 10:30:27,942 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 
[WARNING] HCCL_ADPT(907052,fffec880efa0,python):2025-07-15-10:30:27.942.700 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907052,fffec880efa0,python):2025-07-15-10:30:27.942.759 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907052,fffec880efa0,python):2025-07-15-10:30:27.942.787 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group 2025-07-15 10:30:27,943 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). [WARNING] DISTRIBUTED(907052,fffec880efa0,python):2025-07-15-10:30:27.943.343 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group 2025-07-15 10:30:27,943 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:27,943 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. [WARNING] DEVICE(907052,fffebdb1efa0,python):2025-07-15-10:30:27.943.690 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:27,943 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:27,943 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:27,943 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt' 2025-07-15 10:30:27,944 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:27,945 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:30:27,952 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:30:27,952 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:27,952 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:27,953 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 
2025-07-15 10:30:27,953 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:27.967.939 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:27.968.192 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(907044,fffec276efa0,python):2025-07-15-10:30:27.968.435 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(907044,fffec276efa0,python):2025-07-15-10:30:27.968.533 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907044,fffec276efa0,python):2025-07-15-10:30:27.968.589 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907044,fffec276efa0,python):2025-07-15-10:30:27.968.620 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(907044,fffec276efa0,python):2025-07-15-10:30:27.969.144 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(907044,fffec1f5efa0,python):2025-07-15-10:30:27.969.462 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:27,969 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt' 2025-07-15 10:30:27,970 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:27,970 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:27,971 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:27,971 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:27,971 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 
2025-07-15 10:30:27,971 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:30:27,971 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:27,972 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:27,972 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:27,972 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:27,973 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:30:27,973 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:27,974 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt' 2025-07-15 10:30:27,982 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:27,983 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:27,983 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:30:27,983 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:27,984 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:27,984 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:27,984 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:30:27,985 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:30:27,985 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:27,985 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:30:27,986 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:27,986 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:30:27,986 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:27,986 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:27,986 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:27,987 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:30:27,995 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:30:27,995 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:27,995 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:27,995 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:30:27,996 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 
2025-07-15 10:30:27,996 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:27,996 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:27,997 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:27,997 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:27,997 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:27,997 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:30:27,997 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:30:27,997 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:27,998 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:27,998 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:30:27,998 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:27,998 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:27,998 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:27,998 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:27,999 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:27,999 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:27,999 - mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:30:27,999 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:30:27,999 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:30:27,999 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:28,000 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:28,000 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt' 2025-07-15 10:30:28,000 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 
2025-07-15 10:30:28,000 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:28,000 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:30:28,001 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:28,001 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:28,001 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:28,002 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:30:28,008 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:30:28,009 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:28,009 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:28,009 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:30:28,010 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:30:28,111 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:28,112 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:28,112 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 2025-07-15 10:30:28,112 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:28,113 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:28,113 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:28,113 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:30:28,113 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:30:28,114 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 
2025-07-15 10:30:28,114 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:30:28,114 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:28,114 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:30:28,115 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:28,115 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:28,115 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:28,116 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None [WARNING] DISTRIBUTED(907062,fffea695efa0,python):2025-07-15-10:30:28.122.944 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group 2025-07-15 10:30:28,123 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 [WARNING] DEVICE(907062,fffea493dfa0,python):2025-07-15-10:30:28.123.403 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:28,123 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:28,123 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:28,124 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:30:28,124 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:30:28,126 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:30:28,131 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 
[WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:28.140.363 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}, async: 1, submit_now: 1 [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:28.140.617 [mindspore/ccsrc/distributed/collective/collective_manager.cc:393] CreateCommunicationGroup] This group's communicator is async created hccl_world_group [WARNING] DEVICE(907056,fffe777eefa0,python):2025-07-15-10:30:28.140.844 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:254] SetGlobalCommInfo] Start to SetGlobalCommInfo for hccl_world_group, master_ip:2130706433, master_port:7125, node_rank:2130706433, total_rank_size:8, local_rank_size8 [WARNING] HCCL_ADPT(907056,fffe777eefa0,python):2025-07-15-10:30:28.140.945 [mindspore/ccsrc/utils/dlopen_macro.h:165] DlsymAscend] Dynamically load symbol HcclSetGlobalCommInfo failed, result = /home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/../lib/plugin/ascend/libhccl_plugin.so: undefined symbol: HcclSetGlobalCommInfo [WARNING] HCCL_ADPT(907056,fffe777eefa0,python):2025-07-15-10:30:28.141.005 [mindspore/ccsrc/plugin/res_manager/ascend/hccl_adapter/hccl_adapter.cc:635] HcclSetGlobalCommInfo] Func HcclSetGlobalCommInfo is not supported in CANN package. [WARNING] DEVICE(907056,fffe777eefa0,python):2025-07-15-10:30:28.141.039 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:265] SetGlobalCommInfo] End to SetGlobalCommInfo for hccl_world_group [WARNING] DISTRIBUTED(907056,fffe777eefa0,python):2025-07-15-10:30:28.141.613 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(907056,fffe76fdefa0,python):2025-07-15-10:30:28.141.914 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:28,142 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt' 2025-07-15 10:30:28,142 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:28,143 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:28,143 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
2025-07-15 10:30:28,144 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:28,144 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:28,144 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:28,145 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:30:28,145 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:30:28,145 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:28,146 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:30:28,146 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:28,146 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:30:28,146 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:28,146 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:28,147 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:28,148 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:30:28,155 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:30:28,155 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:28,155 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:28,156 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:30:28,156 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 
2025-07-15 10:30:28,169 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config swap_config is empty. 2025-07-15 10:30:28,169 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config metric is empty. 2025-07-15 10:30:28,169 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config monitor_config is empty. 2025-07-15 10:30:28,169 - mindformers./output/log[mindformers/tools/register/template.py:683] - WARNING - Some configs in yaml are useless for train: ['auto_tune', 'autotune_per_step', 'eval_callbacks', 'eval_dataset', 'eval_dataset_task', 'filepath_prefix', 'processor'] 2025-07-15 10:30:28,170 - mindformers./output/log[mindformers/trainer/trainer.py:1008] - INFO - Load configs in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/general/run_general_task.yaml to build trainer. 2025-07-15 10:30:28,170 - mindformers./output/log[mindformers/trainer/trainer.py:1044] - INFO - ..........Init Config.......... 2025-07-15 10:30:28,170 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 4, 'capacity_factor': 1.5, 'aux_loss_factor': 0.05, 'num_experts_chosen': 2, 'expert_group_size': 2, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV2', 'norm_topk_prob': False, 'enable_sdrop': False, 'use_fused_ops_topkrouter': True, 'router_dense_type': 'float32', 'shared_expert_num': 1, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': 3, 'n_group': 8, 'first_k_dense_replace': 1, 'moe_intermediate_size': 2048, 'routed_scaling_factor': 2.5, 'aux_loss_types': ['expert'], 'aux_loss_factors': [0.0001], 'z_loss_factor': 0.0, 'balance_via_topk_bias': True, 'topk_bias_update_rate': 0.0001, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': 1, 'use_gating_sigmoid': True, 'enable_deredundency': False, 'npu_nums_per_device': 2, 'use_gmm': True, 'enable_gmm_safe_tokens': True, 'use_fused_ops_permute': True, 'callback_moe_droprate': False} 2025-07-15 10:30:28,171 - mindformers./output/log[mindformers/core/parallel_config.py:48] - INFO - initial swap_config from dict: {'swap': False, 'layer_swap': None, 'op_swap': None, 'default_prefetch': 1} 2025-07-15 10:30:28,171 - mindformers./output/log[mindformers/core/parallel_config.py:55] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': True, 'select_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True, 'select_recompute_exclude': False, 'select_comm_recompute_exclude': False} 2025-07-15 10:30:28,171 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 2, 'context_parallel': 1, 'expert_parallel': 2, 'pipeline_stage': 2, 'micro_batch_num': 2, 'seq_split_num': 1, 'use_seq_parallel': True, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': True, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1} 2025-07-15 10:30:28,171 - 
mindformers./output/log[mindformers/core/parallel_config.py:63] - INFO - pipeline_stage = 2 > 1, vocab_emd_dp will be reset to False. 2025-07-15 10:30:28,172 - mindformers./output/log[mindformers/tools/utils.py:166] - INFO - set output path to '/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/output' 2025-07-15 10:30:28,172 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt' 2025-07-15 10:30:28,268 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. [WARNING] DISTRIBUTED(907066,fffed587efa0,python):2025-07-15-10:30:28.272.289 [mindspore/ccsrc/distributed/collective/collective_manager.cc:1021] CreateDeviceCommunicator] Begin initialize communication group on the device side: hccl_world_group [WARNING] DEVICE(907066,fffe877eefa0,python):2025-07-15-10:30:28.272.767 [mindspore/ccsrc/plugin/res_manager/ascend/collective/ascend_communication_group.cc:169] InitByRootInfoConfig] Start to initialize communicator by HcclCommInitRootInfoConfig for hccl_world_group, hcclBufferSize is 200 MB, hcclDeterministic is 0 2025-07-15 10:30:28,273 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:30:28,308 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,309 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,309 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,309 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,309 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,309 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:30:28,310 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,310 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,310 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,310 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:30:28,314 - mindformers./output/log[mindformers/trainer/base_trainer.py:107] - INFO - host_name: ascend213, host_ip: 121.37.54.128 2025-07-15 10:30:28,315 - mindformers./output/log[mindformers/trainer/base_trainer.py:113] - INFO - Now Running Task is: text_generation, Model is: deepseekV3 2025-07-15 10:30:28,315 - mindformers./output/log[mindformers/trainer/base_trainer.py:143] - WARNING - Input model name is not in the supported list or unspecified. 
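The four layer-placement lists logged above (num_layers per stage, their accumulation, the pipeline id list and the interleave id list) can be reproduced with a small stand-alone sketch; this illustrates the reported layout for 4 layers, pipeline_stage=2 and 2 interleaves, and is not the mindformers implementation:

    # Illustrative reconstruction of the logged pipeline layer distribution.
    num_layers, pipeline_stage, interleave = 4, 2, 2
    per_chunk = num_layers // (pipeline_stage * interleave)                      # 1 layer per (interleave, stage) slot

    layers_per_stage = [[per_chunk] * pipeline_stage for _ in range(interleave)]  # [[1, 1], [1, 1]]

    acc, total = [], 0
    for chunk in layers_per_stage:
        row = []
        for n in chunk:
            total += n
            row.append(total)
        acc.append(row)                                                           # [[1, 2], [3, 4]]

    stage_ids = [s for _ in range(interleave)
                 for s in range(pipeline_stage) for _ in range(per_chunk)]        # [0, 1, 0, 1]
    interleave_ids = [i for i in range(interleave)
                      for _ in range(pipeline_stage * per_chunk)]                 # [0, 0, 1, 1]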
2025-07-15 10:30:28,315 - mindformers./output/log[mindformers/trainer/base_trainer.py:144] - WARNING - See the list of supported task and model name: ['codellama_34b', 'common', 'deepseek1_5_7b', 'deepseek_33b', 'glm3_6b', 'glm4_9b', 'gpt2', 'gpt2_13b', 'gpt2_52b', 'gpt2_lora', 'gpt2_xl', 'gpt2_xl_lora', 'internlm_7b', 'internlm_7b_lora', 'llama2_13b', 'llama2_70b', 'llama2_7b', 'llama2_7b_lora', 'llama_7b_slora', 'yi_34b', 'yi_6b'] 2025-07-15 10:30:28,316 - mindformers./output/log[mindformers/trainer/base_trainer.py:145] - WARNING - The default model config: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task 2025-07-15 10:30:28,316 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:28,316 - mindformers./output/log[mindformers/trainer/trainer.py:323] - INFO - ==========Trainer Init Success!========== 2025-07-15 10:30:28,317 - mindformers./output/log[mindformers/trainer/trainer.py:406] - WARNING - sink_size will not be able to set in a future release. Modifying sink_size may cause functional issues when resuming training from a checkpoint. 2025-07-15 10:30:28,317 - mindformers./output/log[mindformers/trainer/trainer.py:1117] - INFO - ..........Init Model.......... 2025-07-15 10:30:28,318 - mindformers./output/log[mindformers/trainer/base_trainer.py:204] - INFO - Pipeline parallel was opened: pipeline_stages = 2, full batch is True, gradient_accumulation_steps will not take effect in pipeline parallel, global batch size will be changed: global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num = 4 = 1 * 2 * 2 * 1). 2025-07-15 10:30:28,318 - mindformers./output/log[mindformers/trainer/base_trainer.py:338] - WARNING - When using the pipeline parallel mode, the MFPipelineWithLossScaleCell class is used by default. 2025-07-15 10:30:28,318 - mindformers./output/log[mindformers/trainer/base_trainer.py:346] - INFO - PipelineWrapper under evaluate or predict mode will not take effect. 2025-07-15 10:30:28,318 - mindformers./output/log[mindformers/trainer/base_trainer.py:920] - INFO - .........Build Dataset For Train.......... 2025-07-15 10:30:28,318 - mindformers./output/log[mindformers/trainer/base_trainer.py:464] - INFO - .........Build Dataset From Config.......... 2025-07-15 10:30:28,319 - mindformers./output/log[mindformers/dataset/causal_language_model_dataset.py:302] - INFO - Now Create Causal Language Model Dataset. 2025-07-15 10:30:28,319 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:30:28,320 - mindformers./output/log[mindformers/dataset/base_dataset.py:83] - INFO - Now dataset_strategy is full_batch, shard_id: None, num_shards: None 2025-07-15 10:30:28,320 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:30:28,324 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:30:28,325 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 
2025-07-15 10:30:28,327 - mindformers./output/log[mindformers/trainer/base_trainer.py:924] - INFO - Create train dataset finish, dataset size:6 2025-07-15 10:30:28,327 - mindformers./output/log[mindformers/trainer/utils.py:176] - INFO - Will be Training epochs:1, sink_size:1 2025-07-15 10:30:28,327 - mindformers./output/log[mindformers/trainer/utils.py:178] - INFO - Create training dataset finish, dataset size:6 2025-07-15 10:30:28,328 - mindformers./output/log[mindformers/trainer/base_trainer.py:971] - INFO - .........Build Net For Train.......... 2025-07-15 10:30:28,328 - mindformers./output/log[mindformers/trainer/base_trainer.py:498] - INFO - .........Build Network From Config.......... 2025-07-15 10:30:28,329 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 2025-07-15 10:30:28,337 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,355 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn [WARNING] DISTRIBUTED(907062,ffff838deec0,python):2025-07-15-10:30:28.361.244 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: c0a3056327cfa3486f48201b4f88a5fa [const vector]{6, 7}, async: 0, submit_now: 1 2025-07-15 10:30:28,435 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:30:28,440 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:30:28,449 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,449 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,449 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,449 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,449 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,450 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:30:28,450 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,450 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,450 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,450 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:30:28,470 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 
2025-07-15 10:30:28,477 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,478 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:30:28,483 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:30:28,496 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:30:28,501 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,501 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,501 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,501 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,501 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,502 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:30:28,502 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,502 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,502 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,502 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} [WARNING] DISTRIBUTED(907066,ffffb289eec0,python):2025-07-15-10:30:28.502.900 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: c0a3056327cfa3486f48201b4f88a5fa [const vector]{6, 7}, async: 0, submit_now: 1 2025-07-15 10:30:28,504 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,504 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,504 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,504 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,505 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,505 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 
2025-07-15 10:30:28,505 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,505 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,505 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,505 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:30:28,521 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 2025-07-15 10:30:28,525 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 2025-07-15 10:30:28,529 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,533 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,547 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn 2025-07-15 10:30:28,551 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn [WARNING] DISTRIBUTED(907048,ffffa7aceec0,python):2025-07-15-10:30:28.553.848 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: c22c60349630d688cef20a3fd708ad87 [const vector]{2, 3}, async: 0, submit_now: 1 [WARNING] DISTRIBUTED(907036,ffff9818eec0,python):2025-07-15-10:30:28.558.048 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: c2eb282156233b5d827219971c8b04c2 [const vector]{0, 1}, async: 0, submit_now: 1 2025-07-15 10:30:28,616 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,616 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,616 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,616 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,616 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,617 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 
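The 2-rank communication groups being created in these records ({0, 1}, {2, 3}, {4, 5}, {6, 7}) are consistent with splitting the 8 ranks into consecutive pairs, as model_parallel=2 would do. The hashed group names in the log do not identify which parallel dimension they belong to, so the following is only an illustrative guess:

    # Illustrative only: one grouping of 8 ranks that yields the pairs seen in the log.
    world_size, model_parallel = 8, 2
    mp_groups = [list(range(r, r + model_parallel)) for r in range(0, world_size, model_parallel)]
    print(mp_groups)   # [[0, 1], [2, 3], [4, 5], [6, 7]]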
2025-07-15 10:30:28,617 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,617 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,617 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,617 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:30:28,636 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 2025-07-15 10:30:28,643 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,655 - mindformers./output/log[mindformers/version_control.py:140] - INFO - The Lazy Inline compilation acceleration feature is turned on. 2025-07-15 10:30:28,659 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,660 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,660 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,660 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,660 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1216] - INFO - Enable flash attention. 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:30:28,661 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn [WARNING] DISTRIBUTED(907052,ffff9abfeec0,python):2025-07-15-10:30:28.667.542 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 5270c475a57ca8f687fafaee7ef53c7f [const vector]{4, 5}, async: 0, submit_now: 1 2025-07-15 10:30:28,680 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 
2025-07-15 10:30:28,688 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,707 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn [WARNING] DISTRIBUTED(907044,ffff9f2feec0,python):2025-07-15-10:30:28.713.024 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: c22c60349630d688cef20a3fd708ad87 [const vector]{2, 3}, async: 0, submit_now: 1 2025-07-15 10:30:28,838 - mindformers./output/log[mindformers/models/utils.py:190] - INFO - num_layers per stage: [[1, 1], [1, 1]] 2025-07-15 10:30:28,838 - mindformers./output/log[mindformers/models/utils.py:191] - INFO - Accumulated num_layers per stage: [[1, 2], [3, 4]] 2025-07-15 10:30:28,838 - mindformers./output/log[mindformers/models/utils.py:193] - INFO - Pipeline id list with start_stage: [0, 1, 0, 1] 2025-07-15 10:30:28,839 - mindformers./output/log[mindformers/models/utils.py:194] - INFO - Interleave id list: [0, 0, 1, 1] 2025-07-15 10:30:28,839 - mindformers./output/log[mindformers/models/utils.py:212] - INFO - Formative layer_recompute: [[1, 1], [1, 1]] 2025-07-15 10:30:28,839 - mindformers./output/log[mindformers/models/utils.py:214] - INFO - The configuration of select_recompute_exclude and select_comm_recompute_exclude have the highest priority. 2025-07-15 10:30:28,839 - mindformers./output/log[mindformers/models/utils.py:220] - INFO - Formative select_recompute: {'feed_forward\\.mul': [[0, 0], [0, 0]], 'feed_forward\\.w1\\.activation\\.silu': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,839 - mindformers./output/log[mindformers/models/utils.py:221] - INFO - Formative select_comm_recompute: {'.*\\.norm': [[0, 0], [0, 0]]} 2025-07-15 10:30:28,839 - mindformers./output/log[mindformers/models/utils.py:222] - INFO - Formative select_recompute_exclude: {} 2025-07-15 10:30:28,840 - mindformers./output/log[mindformers/models/utils.py:223] - INFO - Formative select_comm_recompute_exclude: {} 2025-07-15 10:30:28,859 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1042] - WARNING - first_k_dense_replace is provided in MoEConfig, a normal dense FFN will be used in this block. 2025-07-15 10:30:28,867 - mindformers./output/log[mindformers/models/utils.py:423] - INFO - Set full recompute at layer 0 2025-07-15 10:30:28,886 - mindformers./output/log[mindformers/research/deepseek3/deepseek2_model.py:1072] - INFO - MoE config is provided, use MoE FFN with shared ffn [WARNING] DISTRIBUTED(907056,ffff906feec0,python):2025-07-15-10:30:28.892.297 [mindspore/ccsrc/distributed/collective/collective_manager.cc:341] CreateCommunicationGroup] Start to create communication group: 5270c475a57ca8f687fafaee7ef53c7f [const vector]{4, 5}, async: 0, submit_now: 1 2025-07-15 10:33:10,893 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above. 
2025-07-15 10:33:10,894 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
    build_context(config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in <module>
    main(config_)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
    raise exc
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main
    build_context(config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
---------------------------------------------------- - C++ Call Stack: (For framework developers) ---------------------------------------------------- mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice [WARNING] DEVICE(907040,ffffba1deec0,python):2025-07-15-10:33:11.136.860 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_res_manager.cc:350] SyncAllStreams] The ascend_res_manager_ is nullptr in scenarios where it is not actually executed [ERROR] ME(906714:281472839708352,MainProcess):2025-07-15-10:33:13.696.72 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 907040 exit with exception. Error code: 1. [WARNING] ME(906714:281472839708352,MainProcess):2025-07-15-10:33:13.699.38 [mindspore/parallel/cluster/process_entity/_api.py:369] There's worker exits with exception, kill all other workers. [ERROR] ME(906714:281472839708352,MainProcess):2025-07-15-10:33:47.556.811 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 907034 exit with exception. [ERROR] ME(906714:281472839708352,MainProcess):2025-07-15-10:33:47.558.314 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['1'] /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-32-[WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:24.583.996 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:494] Connect] Waiting for the state of the connection to 127.0.0.1:7125 to be connected...Retry number: 2 /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-33-[WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:25.084.712 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(1/14400). /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-34-[WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:25.584.821 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:246] BuildCluster] Topology build timed out., retry(2/14400). /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-35-[WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:26.084.944 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:249] BuildCluster] Cluster is successfully initialized. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-36-[WARNING] DISTRIBUTED(907040,ffffba1deec0,python):2025-07-15-10:30:26.084.987 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:355] PostProcess] This node 1 rank id: 1 /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log:37:2025-07-15 10:33:10,893 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log:38:2025-07-15 10:33:10,894 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-39- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-40- result = run_func(*args, **kwargs) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-41- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-42- build_context(config) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-43- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-46- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-47- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-48- init() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-49- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-50- init_hccl() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log:51:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-52- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-53----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-54-- C++ Call Stack: (For framework developers) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-55----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-56-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-57- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-58- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log:59:Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-60- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-61- main(config_) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-62- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-63- raise exc /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-64- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-71- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-72- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-73- init() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-74- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-75- init_hccl() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log:76:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-77- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-78----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-79-- C++ Call Stack: (For framework developers) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-80----------------------------------------------------- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_1.log-81-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-118-[WARNING] DISTRIBUTED(907034,ffffb971eec0,python):2025-07-15-10:33:30.566.358 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-119-[WARNING] DISTRIBUTED(907034,ffffb971eec0,python):2025-07-15-10:33:35.566.473 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 8 alive nodes. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-120-[WARNING] DISTRIBUTED(907034,ffffb971eec0,python):2025-07-15-10:33:35.566.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-121-[WARNING] DISTRIBUTED(907034,ffffb971eec0,python):2025-07-15-10:33:40.566.653 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 8 alive nodes. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-122-[WARNING] DISTRIBUTED(907034,ffffb971eec0,python):2025-07-15-10:33:40.566.688 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:154] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster... /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:123:[ERROR] DISTRIBUTED(907034,ffff2f7eefa0,python):2025-07-15-10:33:41.086.190 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 1 is timed out. It may exit with exception, please check this node's log. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:124:[ERROR] DISTRIBUTED(907034,ffffb971eec0,python):2025-07-15-10:33:45.566.787 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes. /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:125:2025-07-15 10:33:45,567 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:126:2025-07-15 10:33:45,568 - mindformers./output/log[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last): /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-127- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-128- result = run_func(*args, **kwargs) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-129- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 68, in main /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-130- build_context(config) /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-131- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/build_context.py", line 464, in build_context -- /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-134- self.parallel_opr.init_communication() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-135- File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-136- init() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-137- File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-138- init_cluster() /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:139:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log. 
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-140-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-141-----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-142-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-143-----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-144-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-145-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-146-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:147:Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-148-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/../mindformers/run_mindformer.py", line 336, in
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-149-    main(config_)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-150-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-151-    raise exc
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-152-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
--
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-159-    self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-160-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-161-    init()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-162-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-163-    init_cluster()
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log:164:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-165-
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-166-----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-167-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-168-----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/scheduler.log-169-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: /home/jenkins/mindspore/testcases/testcases/tests/st/networks/llm_parallel_feature/deepseekv3/deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/.
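The two tracebacks above both end in MindSpore's cluster initialization: build_context() -> init_communication() -> mindspore.communication.init() -> init_cluster(), and the scheduler aborts once worker 1 fails to register in time. A minimal sketch of that per-rank bring-up, for reference only (it assumes the msrun launcher has already exported the cluster environment variables such as MS_WORKER_NUM, MS_SCHED_HOST, MS_SCHED_PORT and RANK_ID; it is not the test's own code):

import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

# Minimal sketch of the communication bring-up that raised the timeout above.
# init() blocks until every rank has registered with the scheduler; when a
# worker never registers, the scheduler reports it as timed out, as seen here.
ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init()
print(f"rank {get_rank()} of {get_group_size()} joined the cluster")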
F
=================================== FAILURES ===================================
_______ test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance ________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level0', card_mark='allcards', essential_mark='essential')
    def test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance():
        """
        Feature: test deepseekv3 cell dp2mp2ep4pp2mb4gas1bs1 8p gmm 1b1f performance
        Description: test deepseekv3 cell dp2mp2ep4pp2mb4gas1bs1 8p gmm 1b1f performance
        Expectation: st pass
        """
        case_name = "deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance"
        sh_path = os.path.split(os.path.realpath(__file__))[0]
        # set the speed up json
        parallel_speed_up_json = {'matmul_grad_comm_overlap': True, 'pp_1f1b_overlap': 'AlltoAllV,AlltoAll'}
        # set the config
        deepseek_config = DeepseekConfig(num_samples=24,
                                         hidden_size=4096,
                                         intermediate_size=8192,
                                         moe_intermediate_size=2048,
                                         parallel_speed_up_json=parallel_speed_up_json,
                                         use_gmm=True,
                                         enable_deredundency=False,
                                         npu_nums_per_device=2,
                                         use_fused_ops_permute=True,
                                         use_fused_swiglu=True,
                                         enable_fa_var_len=True,
                                         use_fused_rope=True,
                                         pp_interleave_num=2,
                                         deterministic="OFF"
                                         )
        file_path = prepare_deepseekv3_testcase_env(case_name, deepseek_config)
        # set the communication parameters
        device_num = 8
        master_port = 7125
        hccl_if_base_port = 63415
        # set env for training
        graph_kernel_flags = "--enable_pass=grouped_matmul_assignadd_fusion " \
                             "--enable_cluster_ops=MatMul,BatchMatMul,Reshape --online_tuning=1"
        os.system(f"bash {sh_path}/run_llm.sh {device_num} \
            {file_path} {case_name} {master_port} {hccl_if_base_port} pp mindrecord \"{graph_kernel_flags}\"")
        # check train over
        check_pair = {"Training Over": 1}
        real_log_path = log_path_preprocess(case_name, device_num)
        for log_path in real_log_path:
>           check_log(log_path, check_pair)

test_deepseekv3_pretrain.py:369:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

file_path = './deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_0.log'
check_pairs = {'Training Over': 1}

    def check_log(file_path, check_pairs=None):
        # check the number of key in check_pairs in log file is equal to the value
        log_error_count = subprocess.check_output(
            ["grep -rE '%s' %s | wc -l" % ("ERROR|Traceback", file_path)], shell=True)
        log_cnt = str(log_error_count, 'utf-8').strip()
        if log_cnt != "0":
            os.system(f"cat {file_path}")
        assert log_cnt == "0", f"Error found in {file_path}"
        if check_pairs is not None:
            for key_word, value in check_pairs.items():
                log_output = subprocess.check_output(
                    ["grep -r '%s' %s | wc -l" % (key_word, file_path)], shell=True)
                log_cnt = str(log_output, 'utf-8').strip()
>               assert log_cnt == str(value), (f"Failed to find {key_word} in {file_path} or content is not correct."
                                               f"Expected occurrences: {value}, but got {log_cnt}")
E               AssertionError: Failed to find Training Over in ./deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance/worker_0.log or content is not correct.Expected occurrences: 1, but got 0

../utils.py:160: AssertionError
=========================== short test summary info ============================
FAILED test_deepseekv3_pretrain.py::test_deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance
======================== 1 failed in 229.72s (0:03:49) =========================
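The failed assertion comes from check_log(), which greps worker_0.log for "Training Over"; training never completed because the cluster never formed, so the root cause sits in the log of the worker the scheduler blamed (worker 1). A small triage sketch along the same grep-based lines as check_log (hypothetical helper; it assumes only the worker_<rank>.log naming visible in this run's output directory):

import subprocess

# Hypothetical triage helper: print the ERROR/Traceback lines from the log of
# the worker that the scheduler reported as timed out (worker 1 in this run).
LOG_DIR = "./deepseekv3_cell_dp2mp2ep2pp2mb4gas1bs1_8p_1b1f_performance"

def first_errors(rank, max_lines=20):
    """Return up to max_lines ERROR/Traceback lines from worker_<rank>.log."""
    path = f"{LOG_DIR}/worker_{rank}.log"
    result = subprocess.run(
        ["grep", "-nE", "ERROR|Traceback", path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()[:max_lines]

if __name__ == "__main__":
    for line in first_errors(1):
        print(line)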