============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-6.2.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf, configfile: ../../../../../../sault/virtual_test/virtualenv_002/sault/config/pytest.ini
plugins: forked-1.6.0, hydra-core-1.3.2, xdist-1.32.0, anyio-4.9.0
collected 1 item

test_qwen_grpo.py
WORKDIR is /home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf
PYTHONPATH is /home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindformers/:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:/home/jenkins/mindspore/testcases/testcases:
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jenkins/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for type is zero.
  return self._float_to_str(self.smallest_subnormal)
 0%| | 0/4 [00:00<?, ?it/s]
Start worker process with rank id:0, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_0.log. Environment variable [RANK_ID=0] is exported.
Start worker process with rank id:1, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log. Environment variable [RANK_ID=1] is exported.
Start worker process with rank id:2, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_2.log. Environment variable [RANK_ID=2] is exported.
Start worker process with rank id:3, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_3.log. Environment variable [RANK_ID=3] is exported.
Start worker process with rank id:4, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_4.log. Environment variable [RANK_ID=4] is exported.
Start worker process with rank id:5, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_5.log. Environment variable [RANK_ID=5] is exported.
Start worker process with rank id:6, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_6.log. Environment variable [RANK_ID=6] is exported.
Start worker process with rank id:7, log file:/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_7.log. Environment variable [RANK_ID=7] is exported.
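The eight worker processes above are started by MindSpore's distributed launcher, which exports RANK_ID for each one before the test script runs. As a rough orientation only (this snippet is not part of the test), the per-worker setup that follows in the log amounts to reading that environment and bringing up the HCCL communication group; the same init() -> init_hccl() path reappears in the traceback further down.

    # Minimal per-worker sketch, assuming the launcher has already exported RANK_ID
    # and the usual Ascend networking variables for mindspore.communication.init().
    import os
    import mindspore as ms
    from mindspore.communication import init, get_rank, get_group_size

    rank_id = int(os.environ.get("RANK_ID", "0"))           # e.g. RANK_ID=3 for worker_3
    ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
    init()                                                   # on Ascend this ends in init_hccl()
    print(f"worker {get_rank()}/{get_group_size()} up (RANK_ID={rank_id})")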
2025-07-15 11:40:36,955 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py:77] - INFO - GRPOTrainer: _init_grpo_configs Namespace(config='./qwen2_5/grpo_config_st.yaml', sft_path_infer='./qwen2_5/predict_qwen2_5_7b_instruct_st.yaml', sft_path_train='./qwen2_5/finetune_qwen2_5_7b_st.yaml', vocab_path='/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_5/vocab.json', merges_file_path='/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_5/merges.txt', save_data_file='/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/grpo_data/grpo.mindrecord', mind_dataset_dir='/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/dataset/mini_gsm8k.mindrecord', save_ckpt_dir='/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/ckpt/train', use_parallel='True', load_sft_checkpoint_infer='', load_sft_checkpoint_train='', load_ref_checkpoint='', enable_compile_cache='False', pre_num_generations=1, pre_store_data=16, reward_funcs=['format_reward'], reward_weights=[1.0]) in main task
2025-07-15 11:40:36,964 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py:90] - INFO - vllm mode: VllmMode.ORIGIN, hf_config_path: ./config.json
2025-07-15 11:40:37,287 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py:102] - INFO - GRPOTrainer: _init_reward_fn
2025-07-15 11:40:37,288 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py:54] - INFO - GRPOTrainer: start init workers
2025-07-15 11:40:37,288 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py:57] - INFO - init InferWorker
2025-07-15 11:40:37,323 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config moe_config is empty.
2025-07-15 11:40:37,324 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py:66] - INFO - launch actor roll out sft_config_infer.use_parallel True
2025-07-15 11:40:37,358 - mindformers./output/log[mindformers/core/context/build_context.py:168] - INFO - Predict context config, jit_level: O0, infer_boost: on
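The Namespace dump above is emitted once per rank and fixes the inputs of the run: the mini GSM8K mindrecord, a single format_reward with weight 1.0, no checkpoints to load, and use_parallel passed as the string 'True'. Below is a sketch of the argument set it implies; names and defaults are copied from the dump, while the real parser lives in the repo's grpo_train.py and may differ.

    import argparse

    parser = argparse.ArgumentParser("grpo_train sketch")
    parser.add_argument("--config", default="./qwen2_5/grpo_config_st.yaml")
    parser.add_argument("--sft_path_infer", default="./qwen2_5/predict_qwen2_5_7b_instruct_st.yaml")
    parser.add_argument("--sft_path_train", default="./qwen2_5/finetune_qwen2_5_7b_st.yaml")
    parser.add_argument("--vocab_path")
    parser.add_argument("--merges_file_path")
    parser.add_argument("--save_data_file")
    parser.add_argument("--mind_dataset_dir")
    parser.add_argument("--save_ckpt_dir")
    parser.add_argument("--use_parallel", default="True")            # a string in the dump, not a bool
    parser.add_argument("--load_sft_checkpoint_infer", default="")
    parser.add_argument("--load_sft_checkpoint_train", default="")
    parser.add_argument("--load_ref_checkpoint", default="")
    parser.add_argument("--enable_compile_cache", default="False")
    parser.add_argument("--pre_num_generations", type=int, default=1)
    parser.add_argument("--pre_store_data", type=int, default=16)
    parser.add_argument("--reward_funcs", nargs="+", default=["format_reward"])
    parser.add_argument("--reward_weights", nargs="+", type=float, default=[1.0])
    args = parser.parse_args([])   # path arguments default to None here; the CI run passes them explicitly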
[MS_ALLOC_CONF]Runtime config: enable_vmm:False
2025-07-15 11:40:42,342 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_3.ckpt'
2025-07-15 11:40:42,345 - mindformers./output/log[mindformers/core/context/build_context.py:383] - INFO - cann workqueue cpus: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191]
2025-07-15 11:40:42,345 - mindformers./output/log[mindformers/core/context/build_context.py:387] - WARNING - CANN use cpus: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191], model get empty cpu list, disable binding cores
2025-07-15 11:40:42,346 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 3, device_num: 8
2025-07-15 11:40:42,346 - mindformers./output/log[mindformers/core/parallel_config.py:41] - INFO - initial moe_config from dict: {'expert_num': 1, 'capacity_factor': 1.1, 'aux_loss_factor': 0.05, 'num_experts_chosen': 1, 'expert_group_size': None, 'group_wise_a2a': False, 'comp_comm_parallel': False, 'comp_comm_parallel_degree': 2, 'save_token_distribution': False, 'cur_layer': 0, 'enable_cold_hot_expert': False, 'update_step': 10000, 'hot_expert_num': 0, 'cold_token_percent': 1.0, 'moe_module_name': '', 'routing_policy': 'TopkRouterV1', 'norm_topk_prob': True, 'enable_sdrop': False, 'use_fused_ops_topkrouter': False, 'router_dense_type': 'float32', 'shared_expert_num': 0, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': None, 'n_group': None, 'first_k_dense_replace': True, 'moe_intermediate_size': 1407, 'routed_scaling_factor': 1.0, 'aux_loss_types': None, 'aux_loss_factors': None, 'z_loss_factor': 0.0, 'balance_via_topk_bias': False, 'topk_bias_update_rate': 0.0, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': None, 'use_gating_sigmoid': False, 'enable_deredundency': False, 'npu_nums_per_device': 1, 'use_gmm': False, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': False, 'callback_moe_droprate': False}
2025-07-15 11:40:42,347 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 4, 'context_parallel': 1, 'expert_parallel': 1, 'pipeline_stage': 1, 'micro_batch_num': 1, 'seq_split_num': 1, 'use_seq_parallel': False, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': False, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1}
2025-07-15 11:40:42,676 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_5.ckpt'
2025-07-15 11:40:42,680 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 5, device_num: 8
2025-07-15 11:40:42,688 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_4.ckpt'
2025-07-15 11:40:42,689 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_0.ckpt'
2025-07-15 11:40:42,693 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 0, device_num: 8
2025-07-15 11:40:42,694 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 4, device_num: 8
2025-07-15 11:40:42,709 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_6.ckpt'
2025-07-15 11:40:42,714 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 6, device_num: 8
2025-07-15 11:40:42,784 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_2.ckpt'
2025-07-15 11:40:42,789 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 2, device_num: 8
2025-07-15 11:40:42,816 - mindformers./output/log[mindformers/tools/utils.py:181] - INFO - set strategy path to './output/strategy/ckpt_strategy_rank_7.ckpt'
2025-07-15 11:40:42,822 - mindformers./output/log[mindformers/core/context/build_context.py:395] - INFO - cpu_affinity, rank_id: 7, device_num: 8
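Every rank logs the same moe_config and parallel_config, so one copy of each is kept above. A quick sanity check on the layout: the product of the parallel dimensions has to equal the eight launched workers, which it does (2 * 4 * 1 * 1 = 8). The values below are copied from the logged record; the check itself is only illustrative.

    # Consistency check for the logged parallel layout against the launcher's device count.
    parallel_config = {"data_parallel": 2, "model_parallel": 4, "context_parallel": 1, "pipeline_stage": 1}
    device_num = 8
    world = (parallel_config["data_parallel"] * parallel_config["model_parallel"]
             * parallel_config["context_parallel"] * parallel_config["pipeline_stage"])
    assert world == device_num, f"layout covers {world} devices, launcher started {device_num}"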
'shared_expert_num': 0, 'use_shared_expert_gating': False, 'max_router_load': 131072, 'topk_method': 'greedy', 'topk_group': None, 'n_group': None, 'first_k_dense_replace': True, 'moe_intermediate_size': 1407, 'routed_scaling_factor': 1.0, 'aux_loss_types': None, 'aux_loss_factors': None, 'z_loss_factor': 0.0, 'balance_via_topk_bias': False, 'topk_bias_update_rate': 0.0, 'use_allgather_dispatcher': False, 'moe_shared_expert_overlap': False, 'expert_model_parallel': None, 'use_gating_sigmoid': False, 'enable_deredundency': False, 'npu_nums_per_device': 1, 'use_gmm': False, 'enable_gmm_safe_tokens': False, 'use_fused_ops_permute': False, 'callback_moe_droprate': False}
2025-07-15 11:40:42,824 - mindformers./output/log[mindformers/core/parallel_config.py:61] - INFO - initial parallel_config from dict: {'data_parallel': 2, 'model_parallel': 4, 'context_parallel': 1, 'expert_parallel': 1, 'pipeline_stage': 1, 'micro_batch_num': 1, 'seq_split_num': 1, 'use_seq_parallel': False, 'optimizer_shard': None, 'gradient_aggregation_group': 4, 'vocab_emb_dp': False, 'context_parallel_algo': 'colossalai_cp', 'ulysses_degree_in_cp': 1, 'mem_coeff': 0.1}
2025-07-15 11:43:21,652 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
Traceback (most recent call last):
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/./qwen2_5/grpo_train.py", line 45, in <module>
    main()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/./qwen2_5/grpo_train.py", line 22, in main
    trainer = GRPOTrainer(args)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py", line 55, in __init__
    self.infer = InferWorker(grpo_config=self.grpo_config,
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py", line 67, in __init__
    build_context(sft_config_infer)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindformers/mindformers/core/context/build_context.py", line 464, in build_context
    ctx = Context(mf_config)
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindformers/mindformers/core/context/build_context.py", line 71, in __init__
    self.parallel_opr.init_communication()
  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
    init()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
    init_hccl()
RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice

[MS_ALLOC_CONF]Runtime config: enable_vmm:False
[ERROR] ME(1374691:281473718808256,MainProcess):2025-07-15-11:43:24.912.767 [mindspore/parallel/cluster/process_entity/_api.py:363] Worker process 1374813 exit with exception. Error code: 1.
[ERROR] ME(1374691:281473718808256,MainProcess):2025-07-15-11:43:59.198.740 [mindspore/parallel/cluster/process_entity/_api.py:382] Scheduler process 1374807 exit with exception.
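For reference, the parallel layout logged above (data_parallel=2, model_parallel=4, context_parallel=1, pipeline_stage=1) multiplies out to exactly the 8 devices reported by device_num: 8, so the aclrtSetDevice failure on device id 1 points at device availability or occupancy rather than a mis-sized layout. The sketch below shows the kind of pre-flight check this implies; the helper name check_parallel_layout is a hypothetical illustration and not part of MindRLHF or MindFormers.

# Hypothetical sanity check (illustrative only): verify that the configured
# parallel layout matches the number of devices msrun is asked to spawn.
def check_parallel_layout(parallel_config, device_num):
    dp = parallel_config.get('data_parallel', 1)
    mp = parallel_config.get('model_parallel', 1)
    cp = parallel_config.get('context_parallel', 1)
    pp = parallel_config.get('pipeline_stage', 1)
    needed = dp * mp * cp * pp
    if needed != device_num:
        raise ValueError(f"layout needs {needed} devices "
                         f"(dp={dp}, mp={mp}, cp={cp}, pp={pp}), "
                         f"but {device_num} are configured")

# With the values from this run: 2 * 4 * 1 * 1 == 8, matching device_num: 8.
check_parallel_layout({'data_parallel': 2, 'model_parallel': 4,
                       'context_parallel': 1, 'pipeline_stage': 1}, 8)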
[ERROR] ME(1374691:281473718808256,MainProcess):2025-07-15-11:43:59.200.438 [mindspore/parallel/cluster/process_entity/_api.py:603] Time out nodes are ['1']
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-12-2025-07-15 11:40:37,294 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py:54] - INFO - GRPOTrainer: start init workers
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-13-2025-07-15 11:40:37,294 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py:57] - INFO - init InferWorker
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-14-2025-07-15 11:40:37,329 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config moe_config is empty.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-15-2025-07-15 11:40:37,330 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py:66] - INFO - launch actor roll out sft_config_infer.use_parallel True
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-16-2025-07-15 11:40:37,367 - mindformers./output/log[mindformers/core/context/build_context.py:168] - INFO - Predict context config, jit_level: O0, infer_boost: on
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log:17:2025-07-15 11:43:21,652 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log:18:Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-19-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/./qwen2_5/grpo_train.py", line 45, in <module>
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-20-    main()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-21-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/./qwen2_5/grpo_train.py", line 22, in main
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-22-    trainer = GRPOTrainer(args)
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-23-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py", line 55, in __init__
--
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-30-    self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-31-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-32-    init()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-33-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 203, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-34-    init_hccl()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log:35:RuntimeError: Call aclrtSetDevice failed, ret[507033]. Got device count[8] and device id[1], please check if device id is valid.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-36-
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-37----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-38-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-39----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_1.log-40-mindspore/ccsrc/plugin/res_manager/ascend/hal_manager/ascend_hal_manager.cc:67 InitDevice
--
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-12-2025-07-15 11:40:37,630 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py:54] - INFO - GRPOTrainer: start init workers
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-13-2025-07-15 11:40:37,631 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py:57] - INFO - init InferWorker
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-14-2025-07-15 11:40:37,665 - mindformers./output/log[mindformers/tools/register/template.py:84] - WARNING - The input config moe_config is empty.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-15-2025-07-15 11:40:37,666 - mindformers./output/log[/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/worker/infer_worker.py:66] - INFO - launch actor roll out sft_config_infer.use_parallel True
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-16-2025-07-15 11:40:37,698 - mindformers./output/log[mindformers/core/context/build_context.py:168] - INFO - Predict context config, jit_level: O0, infer_boost: on
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log:17:[ERROR] DISTRIBUTED(1374807,ffff11f8efa0,python):2025-07-15-11:43:52.238.772 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 1 is timed out. It may exit with exception, please check this node's log.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log:18:[ERROR] DISTRIBUTED(1374807,ffff9b17eec0,python):2025-07-15-11:43:55.262.712 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log:19:2025-07-15 11:43:55,271 - mindformers./output/log[mindformers/core/context/parallel.py:88] - ERROR - Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log:20:Traceback (most recent call last):
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-21-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/./qwen2_5/grpo_train.py", line 45, in <module>
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-22-    main()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-23-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/./qwen2_5/grpo_train.py", line 22, in main
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-24-    trainer = GRPOTrainer(args)
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-25-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindrlhf/mindrlhf/trainer/spmd/grpo_trainer.py", line 55, in __init__
--
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-32-    self.parallel_opr.init_communication()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-33-  File "/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/mindformers/mindformers/core/context/parallel.py", line 86, in init_communication
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-34-    init()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-35-  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/communication/management.py", line 213, in init
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-36-    init_cluster()
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log:37:RuntimeError: The total number of timed out node is 1.
Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-38-
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-39----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-40-- C++ Call Stack: (For framework developers)
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-41----------------------------------------------------
/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/scheduler.log-42-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
Traceback (most recent call last):
  File "/home/jenkins/anaconda3/envs/ci39/bin/msrun", line 8, in <module>
    sys.exit(main())
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/jenkins/anaconda3/envs/ci39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: /home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log.
F
=================================== FAILURES ===================================
________________________________ test_qwen_grpo ________________________________

    @arg_mark(plat_marks=['platform_ascend910b'], level_mark='level0', card_mark='allcards', essential_mark='essential')
    def test_qwen_grpo():
        """
        Feature: test Qwen GRPO training
        Description: test Qwen GRPO training
        Expectation: success
        """
        os.system(f"bash {root_path}/run_qwen_grpo_test.sh")
        log_path = f"{root_path}/qwen2_one_log/worker_0.log"
        check_pair = {"Save checkpoints in": 1}
>       check_log(log_path, check_pair)

test_qwen_grpo.py:36:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

file_path = '/home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_0.log'
check_pairs = {'Save checkpoints in': 1}

    def check_log(file_path, check_pairs=None):
        # check the number of key in check_pairs in log file is equal to the value
        if check_pairs is not None:
            for key_word, value in check_pairs.items():
                log_output = subprocess.check_output(
                    ["grep -r '%s' %s | wc -l" % (key_word, file_path)], shell=True)
                log_cnt = str(log_output, 'utf-8').strip()
>               assert log_cnt == str(value), (f"Failed to find {key_word} in {file_path} or content is not correct."
                                               f"Expected occurrences: {value}, but got {log_cnt}")
E               AssertionError: Failed to find Save checkpoints in in /home/jenkins/mindspore/testcases/testcases/tests/st/mindrlhf/qwen2_one_log/worker_0.log or content is not correct.Expected occurrences: 1, but got 0

utils.py:27: AssertionError
=========================== short test summary info ============================
FAILED test_qwen_grpo.py::test_qwen_grpo - AssertionError: Failed to find Sav...
======================== 1 failed in 232.37s (0:03:52) =========================
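The failing assertion comes from the check_log helper shown above (utils.py:27), which counts how many lines of a worker log contain a keyword and compares the count with the expected value; here "Save checkpoints in" never appears in worker_0.log, so the count is 0 instead of the expected 1. Below is a self-contained sketch of that check reconstructed from the captured source; replacing the grep | wc -l subprocess with a pure-Python line count is an illustrative substitution, not the test suite's actual implementation.

# Sketch of the log check performed by utils.check_log (reconstructed);
# the in-Python line count stands in for the original "grep ... | wc -l".
def check_log(file_path, check_pairs=None):
    if check_pairs is None:
        return
    for key_word, expected in check_pairs.items():
        with open(file_path, errors="ignore") as f:
            log_cnt = sum(1 for line in f if key_word in line)
        assert log_cnt == expected, (
            f"Failed to find {key_word} in {file_path} or content is not correct. "
            f"Expected occurrences: {expected}, but got {log_cnt}")

# Usage mirroring the failing test case:
# check_log("qwen2_one_log/worker_0.log", {"Save checkpoints in": 1})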