Commit 35aa111

[algo] support cispo algorithm (#6572)
1 parent 1174780 commit 35aa111

15 files changed: +566 -85 lines changed

docs/source/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion
@@ -561,7 +561,7 @@ Reward model parameters are used in PPO and GRPO.
- dataset_shuffle: Whether to shuffle the dataset. Default is True.
- truncation_strategy: How to handle inputs exceeding `max_length`. Supported values are `delete` and `left`, meaning deletion and left-side truncation respectively; the default is `left`. Note that for multi-modal models,
left-side truncation may remove multi-modal tokens and cause a shape-mismatch error in the model forward pass. With the `delete` strategy, over-long or encoding-failed samples are replaced by resampling other data from the original dataset.
564- - loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo'], default is 'grpo'. For details, see this [pr](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348)
564+ - loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo', 'dapo', 'cispo'], default is 'grpo'. For details, refer to the [documentation](./GRPO/DeveloperGuide/loss_types.md)
- log_completions: Whether to log the model-generated content during training, to be used together with `--report_to wandb/swanlab`. Default is False.
  - Note: If `--report_to wandb/swanlab` is not set, a `completions.jsonl` file is created in the checkpoint directory to store the generated content.
- use_vllm: Whether to use vLLM as the infer_backend for GRPO generation. Default is False.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# Clipped Importance Sampling Policy Optimization (CISPO)

**Version requirement**: ms-swift>=3.11

Clipped Importance Sampling Policy Optimization (CISPO) is a reinforcement learning algorithm proposed in the [MiniMax-M1](https://arxiv.org/abs/2506.13585) paper. Compared with GRPO (Group Relative Policy Optimization), CISPO clips the importance sampling weights themselves.

## Algorithm Overview

For clarity, we explain CISPO by contrasting it with GRPO.

GRPO limits the magnitude of policy updates by clipping the policy ratio. Its loss function is:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[\min\left(r_t(\theta) \cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio.

For long reasoning chains, this form of clipping can cause the following problem:

**Gradient suppression of critical tokens**: In complex reasoning tasks, certain critical low-probability tokens (such as *However, Recheck, Wait, Aha*) are crucial for triggering deep thinking and correcting reasoning errors. These tokens have low probability under the old policy $\pi_{\theta_{\text{old}}}$; when the new policy tries to increase their probability, the policy ratio $r_t(\theta)$ becomes large and GRPO's clipping mechanism discards their gradient contribution.

### CISPO's Solution

The core idea of CISPO is to clip the importance sampling weights while preserving gradient updates. Specifically, CISPO's loss function is:

$$
\mathcal{L}_{\text{CISPO}}(\theta) = -\mathbb{E}\left[\text{detach}\left(\min(r_t(\theta), \epsilon_{\text{high}})\right) \cdot \hat{A}_t \cdot \log \pi_\theta(a_t|s_t)\right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio.

**Key mechanisms**:
- Clip the importance sampling weights: $\min(r_t(\theta), \epsilon_{\text{high}})$
- **Detach operation**: the clipped weights do not participate in gradient computation and act as constant coefficients
- Gradients come from the $\log \pi_\theta(a_t|s_t)$ term, so every token contributes a gradient

## Implementation Details

The pseudo-code implementation of CISPO is as follows:

```python
log_ratio = per_token_logps - old_per_token_logps
importance_weights = torch.exp(log_ratio)  # r_t(θ) = π_θ / π_θ_old

clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()

per_token_loss = -clamped_ratios * advantages.unsqueeze(1) * per_token_logps
```

## Parameter Configuration

CISPO training can be enabled on top of `GRPOTrainer` by setting the following parameters:

```bash
--loss_type cispo
--epsilon_high 5.0
```

> Compared with other algorithms, CISPO generally uses a larger value for epsilon_high. The MiniMax paper does not give a specific setting; the value here follows the experimental setup of the [ScaleRL](https://arxiv.org/pdf/2510.13786) paper.

For other training parameters, refer to the [GRPO parameter documentation](../../Command-line-parameters.md#grpo参数).
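
For intuition, the following minimal PyTorch snippet (illustrative only, not code from this commit) reproduces the gradient behaviour described above: for a low-probability token with positive advantage whose ratio already exceeds $1+\epsilon$, the GRPO objective yields a zero gradient, while CISPO's detached clipped weight still lets the gradient flow through $\log \pi_\theta$.

```python
import torch

# A single token with positive advantage whose ratio already exceeds 1 + eps.
adv = torch.tensor(1.0)
eps, eps_high = 0.2, 5.0
old_logp = torch.log(torch.tensor(0.01))                # low-prob token under pi_old
logp = torch.log(torch.tensor(0.05)).requires_grad_()   # pi_theta raised it: r = 5.0

ratio = torch.exp(logp - old_logp)

# GRPO: min(r * A, clip(r, 1-eps, 1+eps) * A) selects the clipped (constant) branch,
# so the token contributes no gradient.
grpo_loss = -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
grpo_loss.backward()
print(logp.grad)   # zero gradient: the token is effectively dropped from the update

# CISPO: the clipped weight is detached; the gradient flows through log pi_theta.
logp2 = torch.log(torch.tensor(0.05)).requires_grad_()
ratio2 = torch.exp(logp2 - old_logp)
cispo_loss = -torch.clamp(ratio2, max=eps_high).detach() * adv * logp2
cispo_loss.backward()
print(logp2.grad)  # non-zero gradient: the update still raises this token's log-prob
```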

docs/source/Instruction/GRPO/AdvancedResearch/DAPO.md

Lines changed: 3 additions & 2 deletions
@@ -49,8 +49,9 @@ DAPO uses token-level normalization to avoid the bias that response length introduces into the loss computation.

Parameters:

52- - `loss_type bnpo` enables token-level normalization
52+ - `loss_type bnpo/dapo` enables token-level normalization

54+ > For the loss_type formulas, refer to the [documentation](../DeveloperGuide/loss_types.md)

## Overlong Filtering
DAPO argues that forcibly truncated responses carry noisy rewards, which can make it hard for the model to distinguish quality issues from length issues. It therefore filters out truncated samples during training so that they do not contribute to the loss computation.
@@ -92,7 +93,7 @@

| Parameter | Type | Value |
|----------------------|-----------|-------------|
95- | `--loss_type` | `str` | `bnpo` |
96+ | `--loss_type` | `str` | `bnpo`/`dapo` |
| `--epsilon_high` | `float` | `0.28` |
| `--dynamic_sample` | `bool` | `true` |
| `--max_resample_times` | `int` | `3` |

docs/source/Instruction/GRPO/AdvancedResearch/index.rst

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ Advanced Research
   RLOO.md
   REINFORCEPP.md
   CHORD.md
13+    CISPO.md

docs/source/Instruction/GRPO/DeveloperGuide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ Developer Guide
.. toctree::
   :maxdepth: 1

6+    loss_types.md
   multi_turn.md
   multi_task.md
   reward_function.md
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
# Loss Types

GRPO training supports five different loss types, which differ mainly in the dimension over which the loss is normalized.

## Loss Function

At the token level, GRPO training uses the following loss:

$$\mathcal{L}_{i,t} = -\min\left(\rho_{i,t} A_{i,t}, \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_{i,t}\right)$$

When `loss_type cispo` is set, the CISPO loss is used instead:

$$\mathcal{L}_{i,t}^{\text{CISPO}} = -\text{detach}\left(\min(\rho_{i,t}, \epsilon_{\text{high}})\right) \cdot A_{i,t} \cdot \log \pi_\theta(y_{i,t}|y_{i,<t})$$

where:
- $\rho_{i,t} = \frac{\pi_\theta(y_{i,t}|y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|y_{i,<t})}$ is the importance sampling weight
- $A_{i,t}$ is the advantage
- $\epsilon$ and $\epsilon_{\text{high}}$ are clipping parameters
- $\text{detach}(\cdot)$ means the term does not participate in gradient computation

## GRPO

`--loss_type grpo`

GRPO is the standard loss implementation: the token-level losses are averaged within each sample, and then averaged over all samples.

**Formula:**

$$\mathcal{L}_{\text{GRPO}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T_i} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}$$

where:
- $N$ is the number of samples in the batch
- $T_i$ is the number of completion tokens of the $i$-th sample

**Normalization dimension:** sample level (average over each sample's tokens first, then over all samples)

## BNPO (Batch Normalized Policy Optimization)

`--loss_type bnpo`

BNPO sums the losses of all tokens of all samples and divides by the total number of completion tokens.

**Formula:**

$$\mathcal{L}_{\text{BNPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}}{\sum_{i=1}^{N} T_i}$$

where:
- $N$ is the number of samples in the batch
- $T_i$ is the number of completion tokens of the $i$-th sample

**Normalization dimension:** token level (average over all completion tokens)

## DR-GRPO

`--loss_type dr_grpo`

DR-GRPO sums the losses of all tokens of all samples and divides by the batch size times the maximum completion length.

**Formula:**

$$\mathcal{L}_{\text{DR-GRPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}}{N \times L_{\text{max}}}$$

where:
- $N$ is the number of samples in the batch
- $T_i$ is the number of completion tokens of the $i$-th sample
- $L_{\text{max}}$ is the maximum completion length

**Normalization dimension:** fixed (batch size × maximum completion length)

## CISPO

`--loss_type cispo`

The CISPO loss is normalized by the total number of completion tokens across all processes.

**Formula:**

$$\mathcal{L}_{\text{CISPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}^{\text{CISPO}}}{\sum_{\text{all processes}} \sum_{i=1}^{N_p} T_{p,i}}$$

where:
- $N$ is the number of samples in the current process's batch
- $T_i$ is the number of completion tokens of the $i$-th sample
- $N_p$ is the number of samples on process $p$

**Normalization dimension:** global token level (total completion tokens across all processes)

## DAPO

`--loss_type dapo`

DAPO is similar to BNPO in using token-level normalization, but it normalizes over the global data (across processes).

**Formula:**

$$\mathcal{L}_{\text{DAPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}}{\sum_{\text{all processes}} \sum_{i=1}^{N_p} T_{p,i}}$$

where:
- $N$ is the number of samples in the current process's batch
- $T_i$ is the number of completion tokens of the $i$-th sample
- $N_p$ is the number of samples on process $p$

**Normalization dimension:** global token level (total completion tokens across all processes)
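
To make the difference between these denominators concrete, here is a minimal single-process PyTorch sketch (illustrative, not the trainer's actual implementation; names such as `aggregate_loss` and `completion_mask` are assumed). In a real multi-GPU run the `dapo`/`cispo` denominator would be the completion-token count summed across all processes, e.g. via an all-reduce.

```python
import torch

def aggregate_loss(per_token_loss, completion_mask, loss_type, max_completion_length):
    """Reduce a (batch, seq_len) per-token loss to a scalar under each loss_type.

    completion_mask is 1 for completion tokens and 0 for padding. For 'cispo',
    per_token_loss is assumed to already be the CISPO per-token loss defined above.
    """
    masked_sum = (per_token_loss * completion_mask).sum()
    if loss_type == "grpo":
        # Average over each sample's tokens, then over the samples in the batch.
        per_sample = (per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1)
        return per_sample.mean()
    if loss_type == "bnpo":
        # Divide by this batch's completion-token count.
        return masked_sum / completion_mask.sum().clamp(min=1)
    if loss_type == "dr_grpo":
        # Fixed denominator: batch size x maximum completion length.
        return masked_sum / (per_token_loss.size(0) * max_completion_length)
    if loss_type in ("dapo", "cispo"):
        # Token-level, but normalized by the completion-token count across all
        # processes; a single process stands in for the global sum here.
        global_token_count = completion_mask.sum().clamp(min=1)
        return masked_sum / global_token_count
    raise ValueError(f"unknown loss_type: {loss_type}")
```

On a single process, `bnpo` and `dapo`/`cispo` reduce to the same denominator; the variants only diverge once the token count is accumulated over multiple data-parallel ranks (and, for `cispo`, the per-token loss itself is the CISPO variant).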

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion
@@ -572,7 +572,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
- reward_model_plugin: The logic for the reward model, which defaults to ORM logic. For more information, please refer to [Customized Reward Models](./GRPO/DeveloperGuide/reward_model.md#custom-reward-model).
- dataset_shuffle: Whether to shuffle the dataset randomly. Default is True.
- truncation_strategy: The method to handle inputs exceeding `max_length`. Supported values are `delete` and `left`, representing deletion and left-side truncation respectively. The default is `left`. Note that for multi-modal models, left-side truncation may remove multi-modal tokens and cause a shape mismatch error during model forward. With the delete strategy, over-long or encoding-failed samples are discarded, and new samples are resampled from the original dataset to maintain the intended batch size.
575- - loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo'], default is 'grpo'. For details, see this [pr](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348)
575+ - loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo', 'dapo', 'cispo'], default is 'grpo'. For details, refer to this [doc](./GRPO/DeveloperGuide/loss_types.md)
- log_completions: Whether to log the model-generated content during training, to be used in conjunction with `--report_to wandb/swanlab`, default is False.
  - Note: If `--report_to wandb/swanlab` is not set, a `completions.jsonl` will be created in the checkpoint to store the generated content.
- use_vllm: Whether to use vLLM as the infer_backend for GRPO generation, default is False.
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# Clipped Importance Sampling Policy Optimization (CISPO)

**Version requirement**: ms-swift>=3.11

Clipped Importance Sampling Policy Optimization (CISPO) is a reinforcement learning algorithm proposed in the [MiniMax-M1](https://arxiv.org/abs/2506.13585) paper. Compared to GRPO (Group Relative Policy Optimization), CISPO clips the importance sampling weights themselves.

## Algorithm Overview

For clarity, we explain CISPO by contrasting it with GRPO.

GRPO limits the magnitude of policy updates by clipping the policy ratio. Its loss function is:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[\min\left(r_t(\theta) \cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio.

When handling long reasoning chains, this clipping approach can lead to the following issue:

**Gradient suppression of critical tokens**: In complex reasoning tasks, certain critical low-probability tokens (such as *However, Recheck, Wait, Aha*) are crucial for triggering deep thinking and correcting reasoning errors. These tokens have low probability under the old policy $\pi_{\theta_{\text{old}}}$. When the new policy attempts to increase their probability, the policy ratio $r_t(\theta)$ becomes large, and GRPO's clipping mechanism discards their gradient contribution.

### CISPO's Solution

The core idea of CISPO is to clip the importance sampling weights while preserving gradient updates. Specifically, CISPO's loss function is:

$$
\mathcal{L}_{\text{CISPO}}(\theta) = -\mathbb{E}\left[\text{detach}\left(\min(r_t(\theta), \epsilon_{\text{high}})\right) \cdot \hat{A}_t \cdot \log \pi_\theta(a_t|s_t)\right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio.

**Key Mechanisms**:
- Clip the importance sampling weights: $\min(r_t(\theta), \epsilon_{\text{high}})$
- **Detach operation**: the clipped weights do not participate in gradient computation and serve as constant coefficients
- Gradients come from the $\log \pi_\theta(a_t|s_t)$ term, ensuring all tokens contribute gradients

## Implementation Details

The pseudo-code implementation of CISPO is as follows:

```python
log_ratio = per_token_logps - old_per_token_logps
importance_weights = torch.exp(log_ratio)  # r_t(θ) = π_θ / π_θ_old

clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()

per_token_loss = -clamped_ratios * advantages.unsqueeze(1) * per_token_logps
```

## Parameter Configuration

CISPO training can be enabled on top of `GRPOTrainer` by setting the following parameters:

```bash
--loss_type cispo
--epsilon_high 5.0
```

> Compared to other algorithms, CISPO generally uses a larger value for epsilon_high. The MiniMax paper does not provide specific parameter settings; the value used here follows the experimental setup of the [ScaleRL](https://arxiv.org/pdf/2510.13786) paper.

For other training parameters, refer to the [GRPO parameter documentation](../../Command-line-parameters.md#grpo-arguments).
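
To connect the snippet above with the normalization described in the loss-types document, here is a minimal sketch (an assumed helper, not the actual `GRPOTrainer` code) that assembles the per-token CISPO loss into a scalar; `completion_mask` and `total_tokens` are illustrative names, and a distributed run would obtain `total_tokens` by summing completion-token counts across processes.

```python
import torch

def cispo_loss(per_token_logps, old_per_token_logps, advantages,
               completion_mask, epsilon_high=5.0, total_tokens=None):
    """Sketch of the CISPO per-token loss and its reduction to a scalar.

    Shapes: per_token_logps, old_per_token_logps, and completion_mask are
    (batch, seq_len); advantages is (batch,).
    """
    # r_t = pi_theta / pi_theta_old
    importance_weights = torch.exp(per_token_logps - old_per_token_logps)

    # Clip from above and detach, so the weight acts as a constant coefficient
    # and every token keeps a gradient through log pi_theta.
    clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()

    per_token_loss = -clamped_ratios * advantages.unsqueeze(1) * per_token_logps

    # Normalize by the total completion-token count (global across processes
    # in distributed training; the local count is used here as a stand-in).
    if total_tokens is None:
        total_tokens = completion_mask.sum()
    return (per_token_loss * completion_mask).sum() / total_tokens
```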

docs/source_en/Instruction/GRPO/AdvancedResearch/DAPO.md

Lines changed: 4 additions & 2 deletions
@@ -42,7 +42,9 @@ GRPO normalizes losses at the sentence level, which introduces bias based on res
DAPO uses token-level normalization to avoid this bias in loss calculation.

Parameters:
45- - `loss_type bnpo` enables token-level normalization.
45+ - `loss_type bnpo/dapo` enables token-level normalization.

47+ > For the loss_type formula, please refer to the [documentation](../DeveloperGuide/loss_types.md).

## Overlong Filtering
DAPO argues that forcibly truncated responses contain high reward noise, making it difficult for the model to distinguish between quality issues and length issues. To address this, DAPO filters out truncated data during training, excluding it from loss computation.
@@ -78,7 +80,7 @@ In summary, the following parameters can be set based on GRPOTrainer to implemen

| Parameter | Type | Value |
|-----------------------|-----------|-------------|
81- | `--loss_type` | `str` | `bnpo` |
83+ | `--loss_type` | `str` | `bnpo`/`dapo` |
| `--epsilon_high` | `float` | `0.28` |
| `--dynamic_sample` | `bool` | `true` |
| `--max_resample_times`| `int` | `3` |

docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ Advanced Research
   REINFORCEPP.md
   RLOO.md
   CHORD.md
13+    CISPO.md
