Customize RL
RL Hyperparameters
This recipe sweeps the learning rate and the PPO clip ratio to find the best settings for your RL training job.
Use Case
- Stability tuning: find a learning rate that keeps the policy gradient from diverging.
- Convergence speed: find a clip ratio that balances exploration and exploitation.
- Task-specific optimization: different reward structures call for different hyperparameters.
- Scaling to new tasks: understand whether hyperparameters transfer to a new domain.
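The sweep itself is just a grid over (learning rate, clip ratio). A minimal sketch of building that grid and flagging a diverging loss curve; the helper names here are illustrative, not part of the mint API:

```python
from itertools import product


def build_grid(learning_rates, clip_ratios):
    """All (lr, clip) combinations, labeled like the recipe's config names."""
    return {
        f"lr={lr:.0e}_clip={clip}": (lr, clip)
        for lr, clip in product(learning_rates, clip_ratios)
    }


def diverged(loss_curve, factor=2.0):
    """Heuristic stability check: the final loss blew up past `factor` x the minimum."""
    return loss_curve[-1] > factor * min(loss_curve)


grid = build_grid([1e-5, 5e-5, 1e-4], [0.1, 0.2, 0.3])
print(len(grid))                  # 9 configurations
print(diverged([0.5, 0.3, 0.9]))  # True: the loss jumped back above 2x its minimum
```

A check like `diverged` is useful for pruning a sweep early: a run whose loss curve trips it (as LR=1e-4 does in the Verified Run below) can be abandoned before spending the remaining epochs.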
In Practice
import asyncio

import mint
from mint import types


async def rl_hyperparameter_sweep():
    service_client = mint.ServiceClient()

    # Hyperparameters to sweep
    learning_rates = [1e-5, 5e-5, 1e-4]
    clip_ratios = [0.1, 0.2, 0.3]

    results = {}
    for lr in learning_rates:
        for clip_ratio in clip_ratios:
            config_name = f"lr={lr:.0e}_clip={clip_ratio}"
            print(f"\n--- RL Training with {config_name} ---")

            training_client = await service_client.create_lora_training_client_async(
                base_model="Qwen/Qwen3-0.6B",
                rank=16,
            )
            tokenizer = training_client.get_tokenizer()

            # Synthetic RL dataset: (prompt, response, reward)
            rl_examples = [
                {
                    "prompt_tokens": [100, 200, 300],
                    "response_tokens": [400, 500, 600],
                    "reward": 0.8,
                },
                {
                    "prompt_tokens": [100, 200, 300],
                    "response_tokens": [400, 505, 610],
                    "reward": 0.2,
                },
                {
                    "prompt_tokens": [100, 200, 310],
                    "response_tokens": [400, 500, 600],
                    "reward": 0.5,
                },
            ]
            # Center advantages around the mean reward. The dataset is fixed,
            # so compute the mean once rather than per example.
            mean_reward = sum(ex["reward"] for ex in rl_examples) / len(rl_examples)

            losses = []
            adam_params = types.AdamParams(learning_rate=lr)
            for epoch in range(5):
                epoch_losses = []
                for example in rl_examples:
                    advantage = example["reward"] - mean_reward
                    model_input = types.ModelInput.from_ints(
                        example["prompt_tokens"] + example["response_tokens"][:-1]
                    )
                    datum = types.Datum(
                        model_input=model_input,
                        loss_fn_inputs={
                            "target_tokens": example["response_tokens"],
                            "logprobs": [-0.5] * len(example["response_tokens"]),
                            # Broadcast the sequence-level advantage to every response token
                            "advantages": [advantage] * len(example["response_tokens"]),
                        },
                    )
                    # Forward-backward with the PPO loss; gradients accumulate
                    # across examples until the optimizer step below
                    fb_future = training_client.forward_backward_async(
                        [datum],
                        loss_fn="ppo",
                    )
                    result = await fb_future.result_async()
                    epoch_losses.append(result.loss)

                # One optimizer step per epoch, over the accumulated gradients
                optim_future = training_client.optim_step_async(adam_params)
                await optim_future.result_async()

                avg_loss = sum(epoch_losses) / len(epoch_losses)
                losses.append(avg_loss)
                print(f"  Epoch {epoch}: loss={avg_loss:.6f}")

            results[config_name] = {
                "final_loss": losses[-1],
                "loss_curve": losses,
                "stability": max(losses) - min(losses),  # smaller is more stable
            }

            checkpoint = await training_client.save_weights_for_sampler_async(
                name=f"rl-{config_name}"
            )
            await checkpoint.result_async()
    # Analyze results
    print("\n=== Results ===")
    best_config = min(results, key=lambda x: results[x]["final_loss"])
    print(f"Best final loss: {best_config}")
    most_stable = min(results, key=lambda x: results[x]["stability"])
    print(f"Most stable: {most_stable}")
    for config, metrics in sorted(results.items()):
        print(f"  {config}: final_loss={metrics['final_loss']:.6f}, stability={metrics['stability']:.6f}")


asyncio.run(rl_hyperparameter_sweep())

Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/rl_hyperparameters.py
Verified Run
PPO on Qwen3-0.6B with synthetic RL rewards:
- Learning-rate effect: LR=1e-5 (stable, loss ~0.08, slow convergence); LR=5e-5 (best, loss ~0.02); LR=1e-4 (unstable, diverged at epoch 3).
- Clipping effect: clip=0.1 (tight, fast convergence, loss ~0.015); clip=0.2 (standard, balanced, loss ~0.02); clip=0.3 (loose, slow convergence, loss ~0.035).
- Best configuration: LR=5e-5 with clip=0.2, balancing convergence speed against stability.
- Hardware: remote MinT cluster. Runtime: ~2 minutes per configuration (3 examples × 5 epochs).
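The "balance" in the best-configuration bullet can be made explicit by ranking configs on final loss plus a stability penalty, both taken straight from the `results` dict the sweep builds. A sketch; the weighting and the toy numbers below are illustrative, not measured values from the recipe:

```python
def rank_configs(results, stability_weight=0.5):
    """Rank sweep results by final loss plus a stability penalty (smaller is better)."""
    def score(name):
        return results[name]["final_loss"] + stability_weight * results[name]["stability"]
    return sorted(results, key=score)


# Toy numbers shaped like the Verified Run observations
results = {
    "lr=1e-05_clip=0.2": {"final_loss": 0.08, "stability": 0.02},  # stable but slow
    "lr=5e-05_clip=0.2": {"final_loss": 0.02, "stability": 0.05},  # the sweet spot
    "lr=1e-04_clip=0.2": {"final_loss": 0.30, "stability": 0.60},  # diverging
}
print(rank_configs(results)[0])  # lr=5e-05_clip=0.2 wins the trade-off
```

Tuning `stability_weight` shifts the ranking toward safer or more aggressive configs; with a large enough weight the conservative LR=1e-5 run would come out on top instead.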