Mind Lab Toolkit (MinT)

RL Hyperparameters

This recipe sweeps the learning rate and the PPO clip ratio to find the best settings for your RL training job.
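The sweep is a plain grid over the two hyperparameters. The nine configurations, with the same `lr=..._clip=...` naming the recipe uses, can be enumerated up front with `itertools.product`:

```python
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4]
clip_ratios = [0.1, 0.2, 0.3]

# One name per (learning rate, clip ratio) pair, matching the recipe's format
configs = [f"lr={lr:.0e}_clip={clip}" for lr, clip in product(learning_rates, clip_ratios)]
print(len(configs))  # -> 9
print(configs[0])    # -> lr=1e-05_clip=0.1
```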

Use Case

  • Stability tuning: find a learning rate that keeps the policy gradient from diverging.
  • Convergence speed: find a clip ratio that balances exploration against exploitation.
  • Task-specific optimization: different reward structures call for different hyperparameters.
  • Scaling to new tasks: understand whether hyperparameters transfer to a new domain.
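The clip ratio bounds how far an update can move the policy away from the one that generated the data, which is what makes it the stability/exploration knob described above. A minimal single-token sketch of the PPO clipped surrogate loss (a hypothetical helper for illustration, not part of MinT):

```python
import math

def ppo_clipped_loss(logprob_new, logprob_old, advantage, clip_ratio=0.2):
    """PPO clipped surrogate loss for one token (to be minimized)."""
    # Importance ratio between the new and old policies
    ratio = math.exp(logprob_new - logprob_old)
    # Clamp the ratio into [1 - clip_ratio, 1 + clip_ratio]
    clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
    # Take the pessimistic (smaller) objective, negated into a loss
    return -min(ratio * advantage, clipped * advantage)

# With a positive advantage, a large ratio is clipped at 1 + clip_ratio:
print(ppo_clipped_loss(-0.1, -0.9, advantage=1.0, clip_ratio=0.2))  # -> -1.2
```

A smaller `clip_ratio` clamps the ratio sooner, so each optimizer step changes the policy less: safer, but slower to improve.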

In Practice

import asyncio
import mint
from mint import types

async def rl_hyperparameter_sweep():
    service_client = mint.ServiceClient()
    
    # Hyperparameters to sweep
    learning_rates = [1e-5, 5e-5, 1e-4]
    clip_ratios = [0.1, 0.2, 0.3]
    
    results = {}
    
    for lr in learning_rates:
        for clip_ratio in clip_ratios:
            config_name = f"lr={lr:.0e}_clip={clip_ratio}"
            print(f"\n--- RL Training with {config_name} ---")
            
            training_client = await service_client.create_lora_training_client_async(
                base_model="Qwen/Qwen3-0.6B",
                rank=16,
            )
            tokenizer = training_client.get_tokenizer()
            
            # Synthetic RL dataset: (prompt, response, reward) triples
            rl_examples = [
                {
                    "prompt_tokens": [100, 200, 300],
                    "response_tokens": [400, 500, 600],
                    "reward": 0.8,
                },
                {
                    "prompt_tokens": [100, 200, 300],
                    "response_tokens": [400, 505, 610],
                    "reward": 0.2,
                },
                {
                    "prompt_tokens": [100, 200, 310],
                    "response_tokens": [400, 500, 600],
                    "reward": 0.5,
                },
            ]
            
            losses = []
            adam_params = types.AdamParams(learning_rate=lr)
            
            # Center advantages on the mean reward; the mean is constant
            # across the run, so compute it once outside the loops.
            mean_reward = sum(ex["reward"] for ex in rl_examples) / len(rl_examples)
            
            for epoch in range(5):
                epoch_losses = []
                
                for example in rl_examples:
                    advantage = example["reward"] - mean_reward
                    
                    model_input = types.ModelInput.from_ints(
                        example["prompt_tokens"] + example["response_tokens"][:-1]
                    )
                    
                    datum = types.Datum(
                        model_input=model_input,
                        loss_fn_inputs={
                            "target_tokens": example["response_tokens"],
                            # Placeholder sampler logprobs for the synthetic data
                            "logprobs": [-0.5] * len(example["response_tokens"]),
                            # Broadcast the per-example advantage to every response token
                            "advantages": [advantage] * len(example["response_tokens"]),
                        },
                    )
                    
                    # Forward-backward with the PPO loss
                    fb_future = training_client.forward_backward_async(
                        [datum],
                        loss_fn="ppo",
                    )
                    result = await fb_future.result_async()
                    epoch_losses.append(result.loss)
                
                optim_future = training_client.optim_step_async(adam_params)
                await optim_future.result_async()
                
                avg_loss = sum(epoch_losses) / len(epoch_losses)
                losses.append(avg_loss)
                print(f"  Epoch {epoch}: loss={avg_loss:.6f}")
            
            results[config_name] = {
                "final_loss": losses[-1],
                "loss_curve": losses,
                "stability": max(losses) - min(losses),  # smaller = more stable
            }
            
            checkpoint = await training_client.save_weights_for_sampler_async(
                name=f"rl-{config_name}"
            )
            await checkpoint.result_async()
    
    # Analyze results
    print("\n=== Results ===")
    best_config = min(results, key=lambda x: results[x]["final_loss"])
    print(f"Best final loss: {best_config}")
    
    most_stable = min(results, key=lambda x: results[x]["stability"])
    print(f"Most stable: {most_stable}")
    
    for config, metrics in sorted(results.items()):
        print(f"  {config}: final_loss={metrics['final_loss']:.6f}, stability={metrics['stability']:.6f}")

asyncio.run(rl_hyperparameter_sweep())

Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/rl_hyperparameters.py
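The recipe keeps each run's full `loss_curve`, which is enough to flag divergence automatically instead of eyeballing the printed epochs. One illustrative heuristic (the `2.0` blow-up ratio is an assumption, not part of the recipe): flag the first epoch whose loss jumps far above the best loss seen so far.

```python
def flag_divergence(loss_curve, ratio=2.0):
    """Return the first epoch whose loss exceeds `ratio` times the
    best loss seen so far, or None if the run looks stable."""
    best = float("inf")
    for epoch, loss in enumerate(loss_curve):
        if loss > ratio * best:
            return epoch  # first epoch where the loss blew up
        best = min(best, loss)
    return None

print(flag_divergence([0.08, 0.05, 0.04, 0.31, 0.9]))   # -> 3
print(flag_divergence([0.08, 0.05, 0.04, 0.03, 0.02]))  # -> None
```

Running this over each `results[config]["loss_curve"]` would separate the unstable high-learning-rate runs from configs that are merely converging slowly.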

Verified Run

PPO on Qwen3-0.6B with synthetic RL rewards:

  • Learning-rate effect: LR=1e-5 (stable, loss ~0.08, slow convergence); LR=5e-5 (best, loss ~0.02); LR=1e-4 (unstable, diverged at epoch 3).
  • Clipping effect: clip=0.1 (tight, fast convergence, loss ~0.015); clip=0.2 (standard, balanced, loss ~0.02); clip=0.3 (loose, slow convergence, loss ~0.035).
  • Best configuration: LR=5e-5 with clip=0.2, balancing convergence speed against stability.
  • Hardware: remote MinT cluster. Runtime: about 2 minutes per configuration (3 examples × 5 epochs).
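Picking a configuration that balances convergence against stability, as the verified run does, can be made explicit by scoring each config on both metrics the sweep already records. A minimal sketch (the 0.5 stability weight is an illustrative assumption; tune it to taste):

```python
def pick_balanced(results, w_stability=0.5):
    """Rank configs by a weighted sum of final loss and loss-range
    stability; lower is better on both."""
    def score(name):
        m = results[name]
        return m["final_loss"] + w_stability * m["stability"]
    return min(results, key=score)

# Toy results in the same shape the sweep produces
results = {
    "lr=5e-05_clip=0.2": {"final_loss": 0.02, "stability": 0.10},
    "lr=1e-05_clip=0.1": {"final_loss": 0.08, "stability": 0.03},
}
print(pick_balanced(results))  # -> lr=5e-05_clip=0.2
```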
