Commit a97e30d

[dataset] refactor cached_dataset (#6561)
1 parent d370250 commit a97e30d

29 files changed (+215, -105 lines)

docs/source/Instruction/Command-line-parameters.md

Lines changed: 6 additions & 4 deletions
@@ -56,6 +56,9 @@
 - Subset: This parameter only takes effect when the dataset is an ID or a folder. If subsets were specified during registration and only one exists, that subset is selected by default; otherwise, 'default' is used. You can select multiple subsets with `/`, e.g., `<dataset_id>:subset1/subset2`, or use 'all' to select all registered subsets, e.g., `<dataset_id>:all`. A registration example can be found [here](https://modelscope.cn/datasets/swift/garbage_competition).
 - Sampling count: By default, the full dataset is used. You can sample the selected dataset by setting `#sample_count`. If the sample count is less than the total number of samples, random sampling without replacement is performed. If it exceeds the total, the dataset is repeated `sample_count // total_samples` times and an additional `sample_count % total_samples` samples are randomly drawn. Note: streaming datasets (`--streaming true`) only sample sequentially. With `--dataset_shuffle false`, non-streaming datasets also sample sequentially.
 - 🔥val_dataset: A list of validation dataset IDs or paths. Default is `[]`.
+- 🔥cached_dataset: Use a cached dataset (generated with the `swift export --to_cached_dataset true ...` command) to avoid GPU time being spent on tokenization during training/inference on large datasets. Default is `[]`. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
+- Tip: In "ms-swift>=3.11", cached_dataset only stores an additional length field in the dataset (to avoid storage pressure) and filters out samples that would raise errors. During training/inference, `--max_length` is supported for filtering/truncating over-long samples, and `--packing` is also supported. The actual data preprocessing runs during training and overlaps with it, so it does not affect training speed.
+- cached_dataset is shared between `ms-swift` and `Megatron-SWIFT`, and supports pt/sft/infer/rlhf (requires "ms-swift>=3.11").
 - 🔥split_dataset_ratio: The ratio for splitting a validation set from the training set when val_dataset is not specified. Default is `0.`, i.e., no validation split.
 - Note: In "ms-swift<3.6", the default value of this parameter was 0.01.
 - data_seed: Random seed for the dataset, default is 42.
@@ -450,8 +453,6 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
 - packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter does not take effect for streaming packing.)
 - lazy_tokenize: Whether to use lazy_tokenize. If set to False, all dataset samples are tokenized before training (for multimodal models this includes reading images from disk). Default is None: False for LLM training and True for MLLM training, to save memory.
 - Note: If you want to perform image data augmentation, set lazy_tokenize (or streaming) to True and modify the encode method of the Template class.
-- cached_dataset: Use a cached dataset during training (generated with the `swift export --to_cached_dataset true ...` command) to avoid GPU time being spent on tokenization when training on large datasets. Default is `[]`. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
-- Note: cached_dataset supports `--packing`, but does not support `--lazy_tokenize` or `--streaming`.
 - use_logits_to_keep: Pass logits_to_keep in `forward` based on labels to reduce the computation and storage of unnecessary logits, lowering memory usage and speeding up training. Default is None (automatic selection).
 - acc_strategy: Strategy for computing accuracy during training and validation. Options are `seq`-level and `token`-level accuracy; default is `token`.
 - max_new_tokens: Generation parameter override. Maximum number of tokens to generate when predict_with_generate=True, default is 64.
@@ -700,8 +701,9 @@ App arguments inherit from [deployment arguments](#部署参数), [Web-UI arguments](#Web-UI参数)
 - max_length: max_length of the calibration set, default is 2048.
 - quant_batch_size: Quantization batch size, default is 1.
 - group_size: Group size for quantization, default is 128.
-- to_cached_dataset: Tokenize the dataset in advance and export it, default is False. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
-- Note: Data packing is performed during training, not in this step.
+- to_cached_dataset: Tokenize the dataset in advance and export it, default is False. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset). For more details, see `cached_dataset`.
+- Tip: cached_dataset requires the training and validation sets to be split in advance. You can specify the validation set via `--split_dataset_ratio` or `--val_dataset`.
+- template_mode: Supports the `cached_dataset` feature for `swift rlhf` training. This parameter only takes effect with `--to_cached_dataset true`. Options: 'train', 'rlhf', and 'kto'. `swift pt/sft` uses 'train', `swift rlhf --rlhf_type kto` uses 'kto', and other rlhf algorithms use 'rlhf'. Note: the 'gkd', 'ppo', and 'grpo' algorithms currently do not support `cached_dataset`. Default is 'train'.
 - to_ollama: Generate the Modelfile required by Ollama. Default is False.
 - 🔥to_mcore: Convert HF-format weights to Megatron format. Default is False.
 - to_hf: Convert Megatron-format weights to HF format. Default is False.
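
The cached_dataset flow introduced above boils down to two steps: export a tokenized cache, then point training at it. A minimal sketch for ms-swift>=3.11, condensed from the sft.sh example further down in this commit (model, dataset, and cache paths are illustrative, and the flags shown are a subset of the full example):

# Step 1: pre-tokenize and export; the train/validation split is fixed at this stage.
swift export \
    --model Qwen/Qwen2.5-7B \
    --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT' \
    --split_dataset_ratio 0.01 \
    --to_cached_dataset true \
    --output_dir ./qwen2_5_cached_dataset

# Step 2: train from the cache; --max_length filtering/truncation is applied here, not at export time.
swift sft \
    --model Qwen/Qwen2.5-7B \
    --cached_dataset './qwen2_5_cached_dataset' \
    --max_length 8192 \
    --train_type full \
    --torch_dtype bfloat16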

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 0 additions & 2 deletions
@@ -282,8 +282,6 @@ Megatron training arguments inherit from the Megatron arguments and basic arguments (**shared with ms-swift
 - Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Set `max_epochs` to ensure training exits after the corresponding number of epochs and that the weights are validated and saved.
 - Note: Streaming datasets can skip the preprocessing wait by overlapping preprocessing with training. Preprocessing for streaming datasets runs only on rank 0 and is synchronized to other processes via data distribution, **which is usually less efficient than the data-shard reading used by non-streaming datasets**. When the training world_size is large, preprocessing and data distribution become a training bottleneck.
 - lazy_tokenize: Whether to use lazy_tokenize. If set to False, all dataset samples are tokenized before training (for multimodal models this includes reading images from disk). Default is None: False for LLM training and True for MLLM training, to save memory.
-- cached_dataset: Use a cached dataset during training (generated with the `swift export --to_cached_dataset true ...` command) to avoid GPU time being spent on tokenization when training on large datasets. Default is `[]`. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
-- Note: cached_dataset supports `--packing`, but does not support `--lazy_tokenize` or `--streaming`. cached_dataset does not support CP for now.
 - enable_dft_loss: Whether to use [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training, default is False.
 - enable_channel_loss: Enable channel loss, default is `False`. You need to provide a "channel" field in the dataset; ms-swift groups and computes loss by this field (samples without a "channel" field fall into the default `None` channel). Dataset format reference: [channel loss](../Customization/Custom-dataset.md#channel-loss). Channel loss is compatible with packing/padding_free/loss_scale and similar techniques.
 - new_special_tokens: Additional special tokens to add. Default is `[]`. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora/new_special_tokens.sh).
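
For Megatron-SWIFT, the cache produced by `swift export` is consumed the same way. A trimmed sketch of the LoRA case, based on the mcore.sh example later in this commit (parallelism, MoE, and recomputation flags are omitted; `--packing` and `--max_length` follow the documentation above rather than lines visible in the diff; see the full script below):

megatron sft \
    --model Qwen/Qwen3-30B-A3B-Base \
    --load_safetensors true \
    --save_safetensors true \
    --cached_dataset './qwen3_cached_dataset' \
    --train_type lora \
    --packing true \
    --max_length 8192 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --save megatron_output/Qwen3-30B-A3B-Base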

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 6 additions & 4 deletions
@@ -55,6 +55,9 @@ The command-line arguments will be introduced in four categories: basic argument
 - Subset: This parameter is only effective when the dataset is a dataset ID or a folder. If subsets were specified during registration and only one exists, that subset is selected by default; otherwise, the default subset `'default'` is used. You can select multiple subsets using `/`, e.g., `<dataset_id>:subset1/subset2`. You can also use `'all'` to select all registered subsets, e.g., `<dataset_id>:all`. See an example of registration [here](https://modelscope.cn/datasets/swift/garbage_competition).
 - Sampling count: By default, the full dataset is used. You can sample the dataset by specifying `#sample_count`. If the sample count is less than the total number of samples, random sampling without replacement is performed. If the sample count exceeds the total, the dataset is repeated `sample_count // total_samples` times, with an additional `sample_count % total_samples` samples randomly sampled. Note: For streaming datasets (`--streaming true`), only sequential sampling is performed. If `--dataset_shuffle false` is set, non-streaming datasets also use sequential sampling.
 - 🔥val_dataset: A list of validation dataset IDs or paths. Default is `[]`.
+- 🔥cached_dataset: Use a cached dataset (generated with the `swift export --to_cached_dataset true ...` command) to avoid GPU time being consumed by tokenization during training/inference on large datasets. Default is `[]`. For examples, refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
+- Note: In "ms-swift>=3.11", cached_dataset only stores an additional length field in the dataset (to avoid storage pressure) and filters out data samples that would cause errors. During training/inference, the `--max_length` parameter is supported for filtering/truncating excessively long data, and the `--packing` parameter is also supported. The actual data preprocessing occurs during training and overlaps with it, so it does not affect training speed.
+- cached_dataset is compatible between `ms-swift` and `Megatron-SWIFT`, and supports pt/sft/infer/rlhf (requires "ms-swift>=3.11").
 - 🔥split_dataset_ratio: The ratio for splitting a validation set from the training set when `val_dataset` is not specified. Default is `0.`, meaning no splitting occurs.
 - Note: In "ms-swift<3.6", the default value was `0.01`.
 - data_seed: Random seed for dataset operations. Default is `42`.
@@ -458,8 +461,6 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 - packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing.)
 - lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
 - Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
-- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
-- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
 - use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
 - acc_strategy: Strategy for calculating accuracy during training and validation. Options are `seq`-level and `token`-level accuracy, with `token` as the default.
 - max_new_tokens: Generation parameter override. The maximum number of tokens to generate when `predict_with_generate=True`, defaulting to 64.
@@ -718,8 +719,9 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
 - max_length: Max length for the calibration set, default value is 2048.
 - quant_batch_size: Quantization batch size, default is 1.
 - group_size: Group size for quantization, default is 128.
-- to_cached_dataset: Pre-tokenize the dataset and export it in advance, default is False. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
-- Note: data packing is performed during training, not in this step.
+- to_cached_dataset: Pre-tokenize the dataset and export it in advance, default is False. See the example [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset). For more information, please refer to `cached_dataset`.
+- Note: cached_dataset requires the training set and validation set to be split in advance. You can specify the validation set via `--split_dataset_ratio` or `--val_dataset`.
+- template_mode: Supports the `cached_dataset` feature for `swift rlhf` training. This parameter only takes effect when `--to_cached_dataset true` is set. Available options: 'train', 'rlhf', and 'kto'. `swift pt/sft` uses 'train', `swift rlhf --rlhf_type kto` uses 'kto', and other rlhf algorithms use 'rlhf'. Note: Currently, the 'gkd', 'ppo', and 'grpo' algorithms do not support `cached_dataset`. Default is 'train'.
 - to_ollama: Generate the Modelfile required by Ollama. Default is False.
 - 🔥to_mcore: Convert weights from HF format to Megatron format. Default is False.
 - to_hf: Convert weights from Megatron format to HF format. Default is False.
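
For RLHF-style data, the export arguments above add `--template_mode`: the cache is produced with the rlhf (or kto) template and then consumed by the trainer. A minimal DPO sketch, condensed from the Qwen3-VL example added later in this commit; running it through `swift rlhf` rather than `megatron rlhf` is an assumption based on the stated pt/sft/infer/rlhf support in "ms-swift>=3.11":

# Export a DPO-ready cache; --template_mode rlhf marks it for preference data.
swift export \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --dataset swift/RLAIF-V-Dataset \
    --split_dataset_ratio 0.01 \
    --to_cached_dataset true \
    --template_mode rlhf \
    --output_dir ./qwen3_vl_cached_dataset

# Train DPO from the cache.
swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --cached_dataset ./qwen3_vl_cached_dataset \
    --train_type lora \
    --max_length 8192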

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 0 additions & 2 deletions
@@ -300,8 +300,6 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
 - Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights accordingly.
 - Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution. **This is generally less efficient than the data sharding approach used in non-streaming datasets.** When the training world_size is large, preprocessing and data distribution can become a training bottleneck.
 - lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
-- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default is `[]`. Example: [here](https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset).
-- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`. Cached dataset is currently not supported for CP.
 - enable_dft_loss: Whether to use [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training, default is False.
 - enable_channel_loss: Enable channel-based loss. Default is `False`. Requires a `"channel"` field in the dataset. ms-swift groups and computes loss by this field (samples without `"channel"` are grouped into the default `None` channel). Dataset format reference: [channel loss](../Customization/Custom-dataset.md#channel-loss). Channel loss is compatible with packing, padding_free, and loss_scale techniques.
 - new_special_tokens: List of additional special tokens to be added. Default is `[]`. Example usage can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora/new_special_tokens.sh).
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+# ms-swift>=3.11
+OMP_NUM_THREADS=14 \
+IMAGE_MAX_TOKEN_NUM=1024 \
+VIDEO_MAX_TOKEN_NUM=128 \
+FPS_MAX_FRAMES=16 \
+swift export \
+    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
+    --dataset swift/RLAIF-V-Dataset \
+    --split_dataset_ratio 0.01 \
+    --dataset_num_proc 8 \
+    --to_cached_dataset true \
+    --template_mode rlhf \
+    --output_dir ./qwen3_vl_cached_dataset
+
+
+# 16s/it; 8 * 65GiB
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
+NPROC_PER_NODE=8 \
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+IMAGE_MAX_TOKEN_NUM=1024 \
+VIDEO_MAX_TOKEN_NUM=128 \
+FPS_MAX_FRAMES=16 \
+megatron rlhf \
+    --rlhf_type dpo \
+    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
+    --cached_dataset qwen3_vl_cached_dataset \
+    --load_from_cache_file true \
+    --train_type full \
+    --tensor_model_parallel_size 4 \
+    --expert_tensor_parallel_size 1 \
+    --pipeline_model_parallel_size 2 \
+    --expert_model_parallel_size 4 \
+    --moe_permute_fusion true \
+    --moe_grouped_gemm true \
+    --moe_shared_expert_overlap true \
+    --moe_aux_loss_coeff 1e-6 \
+    --micro_batch_size 1 \
+    --global_batch_size 4 \
+    --packing true \
+    --recompute_granularity full \
+    --recompute_method uniform \
+    --recompute_num_layers 1 \
+    --finetune true \
+    --cross_entropy_loss_fusion true \
+    --lr 1e-5 \
+    --lr_warmup_fraction 0.05 \
+    --min_lr 1e-6 \
+    --save megatron_output/Qwen3-VL-30B-A3B-Instruct \
+    --eval_interval 500 \
+    --save_interval 500 \
+    --max_length 8192 \
+    --max_epochs 1 \
+    --num_workers 8 \
+    --dataset_num_proc 8 \
+    --no_save_optim true \
+    --no_save_rng true \
+    --sequence_parallel true \
+    --freeze_llm false \
+    --freeze_vit true \
+    --freeze_aligner true \
+    --optimizer_cpu_offload true \
+    --use_precision_aware_optimizer true \
+    --optimizer_offload_fraction 0.4 \
+    --attention_backend flash \
+    --rpo_alpha 0.1 \
+    --beta 0.1 \
+    --loss_type sigmoid

examples/export/cached_dataset/mcore.sh

Lines changed: 15 additions & 5 deletions
@@ -1,8 +1,7 @@
-# Note: cached_dataset does not support CP temporarily.
+# ms-swift>=3.11
 swift export \
     --model Qwen/Qwen3-30B-A3B-Base \
     --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT' \
-    --max_length 8192 \
     --split_dataset_ratio 0.01 \
     --dataset_num_proc 64 \
     --to_cached_dataset true \
@@ -14,18 +13,20 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron sft \
-    --load Qwen3-30B-A3B-Base-mcore \
+    --model Qwen/Qwen3-30B-A3B-Base \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora false \
     --cached_dataset './qwen3_cached_dataset' \
     --train_type lora \
     --lora_rank 32 \
     --lora_alpha 64 \
     --target_modules all-linear \
-    --split_dataset_ratio 0.01 \
     --moe_permute_fusion true \
     --expert_model_parallel_size 4 \
     --moe_grouped_gemm true \
     --moe_shared_expert_overlap true \
-    --moe_aux_loss_coeff 1e-3 \
+    --moe_aux_loss_coeff 1e-6 \
     --micro_batch_size 1 \
     --global_batch_size 16 \
     --recompute_granularity full \
@@ -48,3 +49,12 @@ megatron sft \
     --no_save_rng true \
     --sequence_parallel true \
     --attention_backend flash
+
+
+CUDA_VISIBLE_DEVICES=0 \
+swift infer \
+    --adapters megatron_output/Qwen3-30B-A3B-Base/vx-xxx/checkpoint-xxx \
+    --load_data_args true \
+    --attn_impl flash_attn \
+    --stream true \
+    --max_new_tokens 512

examples/export/cached_dataset/pretrained.sh

Lines changed: 1 addition & 2 deletions
@@ -1,7 +1,7 @@
+# ms-swift>=3.11
 swift export \
     --model Qwen/Qwen2.5-7B \
     --dataset 'AI-ModelScope/ruozhiba:all' \
-    --max_length 8192 \
     --dataset_num_proc 64 \
     --to_cached_dataset true \
     --split_dataset_ratio 0.01 \
@@ -17,7 +17,6 @@ swift pt \
     --train_type full \
     --cached_dataset './pretrain_cached_dataset' \
     --num_train_epochs 3 \
-    --split_dataset_ratio 0.01 \
     --torch_dtype bfloat16 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \

examples/export/cached_dataset/sft.sh

Lines changed: 1 addition & 2 deletions
@@ -1,7 +1,7 @@
+# ms-swift>=3.11
 swift export \
     --model Qwen/Qwen2.5-7B \
     --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT' \
-    --max_length 8192 \
     --dataset_num_proc 64 \
     --split_dataset_ratio 0.01 \
     --to_cached_dataset true \
@@ -16,7 +16,6 @@ swift sft \
     --train_type full \
     --cached_dataset './qwen2_5_cached_dataset' \
     --num_train_epochs 3 \
-    --split_dataset_ratio 0.01 \
     --torch_dtype bfloat16 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
