Training and Deployment with Private Datasets#

Overview#

This tutorial explains how to configure FluxVLA with your own collected and prepared datasets, covering the complete workflow from training to real-robot deployment.

Before proceeding with this tutorial, please ensure the following prerequisites are met:

  1. The environment has been set up according to the Installation Guide

  2. Raw data has been converted to the LeRobot v2.1 format following the Real-Robot Data Preparation guide

  3. Pre-trained model weights (e.g., GR00T-N1.5-3B) have been downloaded

This tutorial uses configs/gr00t/gr00t_eagle_3b_aloha_full_finetune.py as the base template, demonstrating how to create a complete training and deployment configuration for your private dataset.


Overall Workflow#

The complete workflow for training and deployment with a private dataset is as follows:

  1. Prepare the data — Ensure the data has been converted to LeRobot v2.1 format

  2. Create the configuration file — Modify key parameters based on an existing template

  3. Launch training — Execute distributed training using scripts/train.sh

  4. Real-robot deployment — Perform inference using the trained checkpoint

Among these steps, creating the configuration file is the most critical. A configuration file comprises four major modules, each of which must be adjusted according to the characteristics of your dataset:

| Configuration Module | Description | Modification Required |
|---|---|---|
| model | Model architecture and pre-trained weights | Case-dependent |
| train_dataloader | Training data paths, camera configuration, dimension parameters | Required |
| runner | Training hyperparameters (epochs, learning rate, etc.) | Recommended |
| inference | Task descriptions and hardware configuration for real-robot deployment | Required |

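These four modules are top-level Python dicts that sit side by side in a single configuration file. A minimal skeleton (the ... placeholders are filled in during Steps 2–5 below):

model = dict(...)             # Step 2: architecture and pre-trained weights
train_dataloader = dict(...)  # Step 3: data paths, cameras, dimension parameters
runner = dict(...)            # Step 4: training hyperparameters
inference = dict(...)         # Step 5: real-robot deployment settings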

Step 1: Verify the Dataset Structure#

Before beginning the configuration, confirm that your dataset conforms to the LeRobot v2.1 format with the following directory structure:

your_dataset_path/
├── data/
│   ├── train/
│   │   ├── episode_0.parquet
│   │   ├── episode_1.parquet
│   │   └── ...
│   └── video/
│       ├── episode_0/
│       │   ├── observation.images.cam_high.mp4
│       │   ├── observation.images.cam_left_wrist.mp4
│       │   └── ...
│       └── ...
├── info.json
└── meta.json

You need to determine the following key information, which will be referenced repeatedly in subsequent configuration steps:

| Information to Confirm | Example Value (Aloha Dual-Arm) | Your Value |
|---|---|---|
| Dataset path | ./datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090 | |
| Camera name list | cam_high, cam_left_wrist, cam_right_wrist | |
| Number of cameras | 3 | |
| Raw action dimensionality of the robot (ori_action_dim) | 14 (7 per arm) | |
| Robot state dimensionality | 14 | |
| Task description (natural language) | "pick up the green bowl with right arm" | |

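If you are unsure about any of these values, most of them can be recovered from the dataset itself. The following is a minimal inspection sketch (plain pandas, not part of FluxVLA); the paths and column names follow the structure shown above and should be adapted to your dataset:

import pandas as pd
from pathlib import Path

root = Path('./datasets/your_dataset_path')   # ⬅️ your dataset root

# Camera names: derived from the video filenames of the first episode
video_dir = root / 'data' / 'video' / 'episode_0'
cameras = sorted(p.stem.replace('observation.images.', '') for p in video_dir.glob('*.mp4'))
print('cameras:', cameras, '| count:', len(cameras))

# State / action dimensionality: read one episode parquet and check the array lengths
df = pd.read_parquet(root / 'data' / 'train' / 'episode_0.parquet')
print('state dim :', len(df['observation.state'].iloc[0]))
print('action dim:', len(df['actions'].iloc[0]))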

Step 2: Configure the Model (model)#

The model configuration defines the network architecture of the VLA. If you use the same GR00T-Eagle architecture as the Aloha example for full fine-tuning, you typically only need to adjust the dimension-related parameters in vla_head.

model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=64,             # Internal model state dimension (after padding)
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_inference_timesteps=4,
        num_steps=32,
        traj_length=10,           # Predicted trajectory length
        action_dim=32,            # Internal model action dimension (after padding)
        ori_action_dim=14),       # ⬅️ Modify: your robot's raw action dimension
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

Parameters to Modify#

| Parameter | Description | Modification Guideline |
|---|---|---|
| ori_action_dim | The raw action dimensionality of the robot | Modify according to your robot. Aloha dual-arm = 14, UR3 single-arm = 7 |
| action_dim | Internal action dimensionality after padding | Generally, keeping the default value of 32 is sufficient |
| state_dim | Internal state dimensionality after padding | Generally, keeping the default value of 64 is sufficient |
| traj_length | Number of future trajectory steps to predict | Default is 10; adjust based on task complexity |

Note: action_dim and state_dim represent the internal dimensionalities after padding and do not need to equal the raw dimensionalities of the robot. ori_action_dim corresponds to the actual action space dimensionality of your robot.
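
As a rough illustration of this padding, the sketch below (plain NumPy, not FluxVLA's internal code) zero-pads a 14-dim Aloha action into the 32-dim internal representation and keeps a mask marking which dimensions are real:

import numpy as np

ori_action_dim, action_dim = 14, 32                 # raw vs. padded internal dimension

raw_action = np.random.uniform(-1.0, 1.0, size=ori_action_dim)

padded_action = np.zeros(action_dim, dtype=np.float32)
padded_action[:ori_action_dim] = raw_action         # real values occupy the first 14 slots

action_mask = np.zeros(action_dim, dtype=bool)
action_mask[:ori_action_dim] = True                 # marks the real (non-padding) dimensions

print(padded_action.shape, int(action_mask.sum()))  # (32,) 14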

inference_model (Optional)#

If a different model configuration is required during inference (e.g., a different pre-trained path), an additional inference_model can be defined with the same format as model:

inference_model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        dtype=None,
        vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=64,
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_steps=32,
        num_inference_timesteps=4,
        traj_length=10,
        ori_action_dim=14,        # ⬅️ Keep consistent with model
        action_dim=32),
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

Step 3: Configure the Training Data (train_dataloader)#

This is the section that requires the most careful modification. You need to adjust the data paths, camera configuration, and dimension parameters according to the characteristics of your dataset.

train_dataloader = dict(
    per_device_batch_size=8,          # Batch size per GPU
    per_device_num_workers=4,         # Number of data-loading worker processes per GPU
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        statistic_keys=[
            'observation.state', 'observation.eepose', 'timestamp'
        ],
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=[
                    # ⬅️ Modify: replace with your dataset paths
                    './datasets/your_dataset_path/batch_01',
                    './datasets/your_dataset_path/batch_02',
                ],
                transforms=[...],     # See the detailed explanation below
                action_window_size=32 # ⬅️ Modify: action window size
            )
        ]))

3.1 Dataset Path#

data_root_path supports a list format, allowing you to specify multiple data batch paths simultaneously:

data_root_path=[
    './datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
],

3.2 Data Transform Pipeline (transforms)#

The transform pipeline is the core of data preprocessing and should be configured with the following 5 transform steps in order:

Step 1: ProcessParquetInputs — Parse Parquet Data#

dict(
    type='ProcessParquetInputs',
    embodiment_id=0,              # ⬅️ Robot type ID (Aloha=0, UR3=2)
    parquet_keys=[                # Fields to read from the parquet files
        'observation.state',
        'observation.eepose',
        'timestamp',
        'actions',
        'info',
        'stats',
        'action_masks'
    ],
    video_keys=[                  # ⬅️ Modify: your camera video key names
        'observation.images.cam_high',
        'observation.images.cam_left_wrist',
        'observation.images.cam_right_wrist'
    ],
    name_mappings={'observation.state': ['states']}),

⚠️ Important: The camera names in video_keys must match the filenames under the video/ directory in the dataset. For example, if your video file is observation.images.head_cam.mp4, the corresponding key should be observation.images.head_cam.

Step 2: ParquetPrompter — Generate Text Prompts#

dict(type='ParquetPrompter'),

This step extracts task descriptions (the task field) from the Parquet data for use as text prompts. No additional configuration is required.

Step 3: ProcessPromptsWithImage — Text Tokenization and Image Association#

dict(
    type='ProcessPromptsWithImage',
    max_len=900,
    num_images=3,             # ⬅️ Modify: number of cameras
    tokenizer=dict(
        type='PretrainedTokenizer',
        model_path='path/to/eagle2_hg_model',
    )),

| Parameter | Description |
|---|---|
| num_images | Number of cameras; must match the number of entries in video_keys |
| max_len | Maximum text sequence length |

Step 4: Image Preprocessing#

dict(type='ResizeImages', height=224, width=224),
dict(
    type='NormalizeImages',
    means=[[123.515625, 116.04492188, 103.59375],   # Per-camera RGB means
           [123.515625, 116.04492188, 103.59375],
           [123.515625, 116.04492188, 103.59375]],   # ⬅️ List length = number of cameras
    stds=[[58.27148438, 57.02636719, 57.27539062],
          [58.27148438, 57.02636719, 57.27539062],
          [58.27148438, 57.02636719, 57.27539062]],
),

Note: The length of the means and stds lists must equal the number of cameras. When using the Eagle model, the default values shown above are sufficient.

Step 5: NormalizeStatesAndActions — State and Action Normalization#

dict(
    type='NormalizeStatesAndActions',
    state_dim=64,             # Must match model.vla_head.state_dim
    action_dim=32,            # Must match model.vla_head.action_dim
    state_key='proprio',
    action_key='action',
    norm_type='mean_std')     # Normalization method: 'mean_std' or 'min_max'

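For intuition, the two norm_type options correspond to the standard schemes sketched below. This is a minimal illustration of what mean-std and min-max normalization typically compute, not the actual NormalizeStatesAndActions implementation, and the statistics key names ('mean', 'std', 'min', 'max') are illustrative:

import numpy as np

def normalize(x, stats, norm_type='mean_std', eps=1e-8):
    # 'mean_std': zero-mean, unit-variance using dataset statistics
    if norm_type == 'mean_std':
        return (x - stats['mean']) / (stats['std'] + eps)
    # 'min_max': rescale each dimension into [-1, 1]
    if norm_type == 'min_max':
        return 2.0 * (x - stats['min']) / (stats['max'] - stats['min'] + eps) - 1.0
    raise ValueError(f'unknown norm_type: {norm_type}')

Whichever scheme is chosen here must also be used when denormalizing actions at inference time (see Step 5 and the FAQ below).
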
3.3 Core Parameter Reference Table#

The following table summarizes all parameters in the data configuration that need to be modified according to your specific scenario:

| Parameter | Location | Description | Aloha Example Value |
|---|---|---|---|
| data_root_path | ParquetDataset | Dataset path | ./datasets/RealRobot_AgileX_aloha_lerobot_v2/... |
| embodiment_id | ProcessParquetInputs | Robot type ID | 0 |
| video_keys | ProcessParquetInputs | List of camera video keys | 3 cameras |
| num_images | ProcessPromptsWithImage | Number of cameras | 3 |
| means / stds | NormalizeImages | Image normalization parameters (list length = number of cameras) | 3 groups |
| state_dim | NormalizeStatesAndActions | Internal state dimensionality (must match the model) | 64 |
| action_dim | NormalizeStatesAndActions | Internal action dimensionality (must match the model) | 32 |
| norm_type | NormalizeStatesAndActions | Normalization method | mean_std |
| action_window_size | ParquetDataset | Action window size | 32 |


Step 4: Configure Training Parameters (runner)#

Training parameters typically do not require significant modifications. Adjust them appropriately based on your data volume and GPU resources:

runner = dict(
    type='FSDPTrainRunner',
    max_epochs=6,                     # ⬅️ Number of training epochs; increase for small datasets
    learning_rate=2e-5,               # ⬅️ Learning rate
    weight_decay=0.0,
    max_grad_norm=1.0,
    sampler=None,
    tokenizer=dict(
        type='PretrainedTokenizer',
        model_path='path/to/eagle2_hg_model',
    ),
    collator=dict(
        type='DictCollator',
        keys=[
            'states', 'observation.eepose', 'timestamp', 'images',
            'img_masks', 'lang_tokens', 'lang_masks', 'actions',
            'action_masks', 'embodiment_ids'
        ],
        meta_keys=['task_description', 'prompt', 'info', 'stats']),
    metric=dict(
        type='VLAMetric',
        active_trackers=('jsonl', 'wandb'),
        run_dir='work_dirs',
        grad_accumulation_steps=1,
        window_size=1),
    lr_scheduler_type='constant',     # Learning-rate schedule
    warmup_ratio=0.0,
    enable_gradient_checkpointing=False,
    enable_mixed_precision_training=True,
    mixed_precision_dtype='bf16',
    change_key_name=False)

Common Adjustment Recommendations#

| Parameter | Adjustment Guideline |
|---|---|
| max_epochs | For smaller datasets (< 100 episodes), 10–24 is recommended; for larger datasets, 3–6 is sufficient |
| learning_rate | For full fine-tuning, a range of 1e-5 to 5e-5 is recommended |
| lr_scheduler_type | Options include constant (constant rate) or linear-warmup+cosine-decay (warmup + cosine decay) |
| warmup_ratio | When using cosine decay, a value of 0.03 is recommended |
| enable_gradient_checkpointing | Set to True when GPU memory is insufficient, trading computation time for memory |
| per_device_batch_size (in train_dataloader) | Reduce when GPU memory is insufficient, e.g., to 4 or 2 |

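As a concrete example, a hypothetical adjustment for a small dataset (fewer than 100 episodes) on memory-limited GPUs might change only the keys below, following the guidelines above; the scheduler string is the one quoted in the table, and all other runner keys stay as in the template:

runner = dict(
    type='FSDPTrainRunner',
    max_epochs=16,                                    # more passes over a small dataset
    learning_rate=2e-5,
    lr_scheduler_type='linear-warmup+cosine-decay',   # warmup + cosine decay
    warmup_ratio=0.03,                                # recommended with cosine decay
    enable_gradient_checkpointing=True,               # trade compute time for GPU memory
    # ... remaining keys unchanged from the template above
)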

Step 5: Configure Inference Deployment (inference)#

The inference configuration is used for real-robot deployment and requires specifying task descriptions, the data processing pipeline, and the robot hardware interface.

inference = dict(
    type='AlohaInferenceRunner',      # ⬅️ Inference runner type; choose according to your robot
    seed=7,
    task_descriptions={               # ⬅️ Modify: your task descriptions
        '1': 'pick up the green bowl with right arm',
        '2': 'place it on the blue plate with left arm',
    },
    dataset=dict(
        type='PrivateInferenceDataset',
        embodiment_id=0,              # ⬅️ Same as training
        img_keys=[                    # ⬅️ Modify: camera names (without the prefix)
            'cam_high',
            'cam_left_wrist',
            'cam_right_wrist'
        ],
        transforms=[
            dict(
                type='ProcessPromptsWithImage',
                max_len=900,
                num_images=3,         # ⬅️ Same as training
                tokenizer=dict(type='PretrainedTokenizer')),
            dict(type='ResizeImages', height=224, width=224),
            dict(
                type='NormalizeImages',
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]],
            ),
            dict(
                type='NormalizeStatesAndActions',
                state_dim=64,         # ⬅️ Same as training
                state_key='proprio',
                action_key='action',
                norm_type='mean_std') # ⬅️ Same as training
        ]),
    denormalize_action=dict(
        type='DenormalizePrivateAction',
        norm_type='mean_std',         # ⬅️ Must match the training norm_type
        action_dim=14),               # ⬅️ The robot's raw action dimension
    operator=dict(                    # ⬅️ Modify: your robot's ROS topic configuration
        type='AlohaOperator',
        img_front_topic='/camera_f/color/image_raw',
        img_left_topic='/camera_l/color/image_raw',
        img_right_topic='/camera_r/color/image_raw',
        img_front_depth_topic='/camera_f/depth/image_raw',
        img_left_depth_topic='/camera_l/depth/image_raw',
        img_right_depth_topic='/camera_r/depth/image_raw',
        puppet_arm_left_cmd_topic='/master/joint_left',
        puppet_arm_right_cmd_topic='/master/joint_right',
        puppet_arm_left_topic='/puppet/joint_left',
        puppet_arm_right_topic='/puppet/joint_right',
        robot_base_topic='/odom_raw',
        robot_base_cmd_topic='/cmd_vel',
    ))

Key Points for Inference Configuration#

  1. type: Select the inference runner according to your robot platform. The Aloha dual-arm system uses AlohaInferenceRunner, while UR-series robots use URInferenceRunner.

  2. task_descriptions: Defines the available task instructions. Keys are task IDs (strings), and values are natural language task descriptions. Tasks are selected via keyboard input during inference.

  3. dataset.transforms: The data preprocessing pipeline during inference must remain consistent with the one used during training (normalization method, image dimensions, etc.).

  4. denormalize_action: Converts the model’s normalized action output into real robot actions. Both norm_type and action_dim must be consistent with the training configuration (see the sketch after this list).

  5. operator: Defines the ROS topic communication interface for the robot and should be modified according to your actual hardware configuration.

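The following is a minimal sketch of the mean-std denormalization described in point 4, not the actual DenormalizePrivateAction implementation; the per-dimension statistics come from the training run, and the key names used here are illustrative:

import numpy as np

def denormalize_action(norm_action, stats, norm_type='mean_std'):
    # Inverse of the training-time normalization: map the model output
    # back into the robot's raw action range
    if norm_type == 'mean_std':
        return norm_action * stats['std'] + stats['mean']
    if norm_type == 'min_max':
        return (norm_action + 1.0) / 2.0 * (stats['max'] - stats['min']) + stats['min']
    raise ValueError(f'unknown norm_type: {norm_type}')

# Example: recover a 14-dim Aloha action (ori_action_dim=14) from a normalized output
stats = {'mean': np.zeros(14), 'std': np.ones(14)}
raw = denormalize_action(np.random.randn(14), stats)
print(raw.shape)  # (14,)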

Step 6: Launch Training#

Once the configuration file is prepared, training can be launched. Save the configuration file to the configs/ directory (e.g., configs/gr00t/my_custom_finetune.py) and execute the following:

# Set environment variables (single-node, 8-GPU example)
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

# Launch training
bash scripts/train.sh \
    configs/gr00t/my_custom_finetune.py \
    work_dirs/my_custom_finetune

Upon completion of training, the following files will be generated in work_dirs/my_custom_finetune/:

  • checkpoint_*.pt — Model checkpoints

  • dataset_statistics.json — Dataset statistics (used during inference)

  • config.yaml / config.json — Configuration backups


Step 7: Real-Robot Deployment#

Use the trained checkpoint for real-robot inference:

python scripts/inference_real_robot.py \
    --config configs/gr00t/my_custom_finetune.py \
    --ckpt-path work_dirs/my_custom_finetune/checkpoint_epoch_6.pt

⚠️ Important: The configuration file used during inference should be the same as the one used during training to ensure that data normalization methods, dimension parameters, and other settings are fully consistent.


Complete Example: Aloha Dual-Arm Robot#

The following is the complete configuration from configs/gr00t/gr00t_eagle_3b_aloha_full_finetune.py, serving as a reference template for creating your own configuration file.

# ============================================================
# Model configuration
# ============================================================
model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=64,
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_inference_timesteps=4,
        num_steps=32,
        traj_length=10,
        action_dim=32,
        ori_action_dim=14),           # Aloha dual-arm: 7 (left) + 7 (right) = 14
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

inference_model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        dtype=None,
        vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=64,
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_steps=32,
        num_inference_timesteps=4,
        traj_length=10,
        ori_action_dim=14,
        action_dim=32),
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

# ============================================================
# Data configuration
# ============================================================
train_dataloader = dict(
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        statistic_keys=[
            'observation.state', 'observation.eepose', 'timestamp'
        ],
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=[
                    './datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
                ],
                transforms=[
                    dict(
                        type='ProcessParquetInputs',
                        embodiment_id=0,          # Aloha
                        parquet_keys=[
                            'observation.state', 'observation.eepose',
                            'timestamp', 'actions', 'info', 'stats',
                            'action_masks'
                        ],
                        video_keys=[              # 3 cameras
                            'observation.images.cam_high',
                            'observation.images.cam_left_wrist',
                            'observation.images.cam_right_wrist'
                        ],
                        name_mappings={'observation.state': ['states']}),
                    dict(type='ParquetPrompter'),
                    dict(
                        type='ProcessPromptsWithImage',
                        max_len=900,
                        num_images=3,             # Number of cameras
                        tokenizer=dict(
                            type='PretrainedTokenizer',
                            model_path='path/to/eagle2_hg_model',
                        )),
                    dict(type='ResizeImages', height=224, width=224),
                    dict(
                        type='NormalizeImages',
                        means=[[123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375]],
                        stds=[[58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062]],
                    ),
                    dict(
                        type='NormalizeStatesAndActions',
                        state_dim=64,
                        action_dim=32,
                        state_key='proprio',
                        action_key='action',
                        norm_type='mean_std')
                ],
                action_window_size=32)
        ]))

# ============================================================
# Training configuration
# ============================================================
runner = dict(
    type='FSDPTrainRunner',
    max_epochs=6,
    learning_rate=2e-5,
    weight_decay=0.0,
    max_grad_norm=1.0,
    sampler=None,
    tokenizer=dict(
        type='PretrainedTokenizer',
        model_path='path/to/eagle2_hg_model',
    ),
    collator=dict(
        type='DictCollator',
        keys=[
            'states', 'observation.eepose', 'timestamp', 'images',
            'img_masks', 'lang_tokens', 'lang_masks', 'actions',
            'action_masks', 'embodiment_ids'
        ],
        meta_keys=['task_description', 'prompt', 'info', 'stats']),
    metric=dict(
        type='VLAMetric',
        active_trackers=('jsonl', 'wandb'),
        run_dir='work_dirs',
        grad_accumulation_steps=1,
        window_size=1),
    lr_scheduler_type='constant',
    warmup_ratio=0.0,
    enable_gradient_checkpointing=False,
    enable_mixed_precision_training=True,
    mixed_precision_dtype='bf16',
    change_key_name=False)

# ============================================================
# Inference configuration
# ============================================================
inference = dict(
    type='AlohaInferenceRunner',
    seed=7,
    task_descriptions={
        '1': 'pick up the green bowl with right arm',
        '2': 'place it on the blue plate with left arm',
        # ... add more entries for your tasks
    },
    dataset=dict(
        type='PrivateInferenceDataset',
        embodiment_id=0,
        img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
        transforms=[
            dict(
                type='ProcessPromptsWithImage',
                max_len=900,
                num_images=3,
                tokenizer=dict(type='PretrainedTokenizer')),
            dict(type='ResizeImages', height=224, width=224),
            dict(
                type='NormalizeImages',
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]],
            ),
            dict(
                type='NormalizeStatesAndActions',
                state_dim=64,
                state_key='proprio',
                action_key='action',
                norm_type='mean_std')
        ]),
    denormalize_action=dict(
        type='DenormalizePrivateAction',
        norm_type='mean_std',
        action_dim=14),               # Raw action dimension
    operator=dict(
        type='AlohaOperator',
        img_front_topic='/camera_f/color/image_raw',
        img_left_topic='/camera_l/color/image_raw',
        img_right_topic='/camera_r/color/image_raw',
        img_front_depth_topic='/camera_f/depth/image_raw',
        img_left_depth_topic='/camera_l/depth/image_raw',
        img_right_depth_topic='/camera_r/depth/image_raw',
        puppet_arm_left_cmd_topic='/master/joint_left',
        puppet_arm_right_cmd_topic='/master/joint_right',
        puppet_arm_left_topic='/puppet/joint_left',
        puppet_arm_right_topic='/puppet/joint_right',
        robot_base_topic='/odom_raw',
        robot_base_cmd_topic='/cmd_vel',
    ))

Frequently Asked Questions#

Q: How to adapt the configuration for a different number of cameras?#

The following locations must be updated synchronously:

  1. ProcessParquetInputs.video_keys — List all camera video keys

  2. ProcessPromptsWithImage.num_images — Set the number of cameras

  3. NormalizeImages.means / stds — Provide one group of normalization parameters per camera

  4. inference.dataset.img_keys — List the camera names for inference

For example, a configuration with 2 cameras:

video_keys=[
    'observation.images.cam_high',
    'observation.images.cam_wrist'
],
# ...
num_images=2,
# ...
means=[[123.515625, 116.04492188, 103.59375],
       [123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
      [58.27148438, 57.02636719, 57.27539062]],

Q: What happens if the normalization methods are inconsistent between training and inference?#

Inference performance will degrade significantly. Ensure that the following parameters are fully consistent between the training and inference configurations:

  • norm_type (mean_std / min_max)

  • state_dim, action_dim

  • Image normalization means / stds

  • Image dimensions height / width

Q: How to perform mixed training with multiple datasets?#

Add multiple ParquetDataset configurations to the datasets list:

datasets=[
    dict(
        type='ParquetDataset',
        data_root_path=['./datasets/task_A/batch_01'],
        transforms=[...],
        action_window_size=32),
    dict(
        type='ParquetDataset',
        data_root_path=['./datasets/task_B/batch_01'],
        transforms=[...],
        action_window_size=32),
]

Q: How to select an appropriate action_window_size?#

action_window_size determines the length of the action sequence included in each training sample. It should match num_steps in the model configuration, since both represent the action chunk length, so that the model generates an entire window of actions in a single prediction.
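
The sketch below (plain NumPy, not FluxVLA code) illustrates what the window means: each training sample pairs the observation at step t with the next action_window_size actions, so the model predicts one full chunk per forward pass. Shapes assume the Aloha example (ori_action_dim=14):

import numpy as np

action_window_size = 32                       # keep equal to num_steps in the model config
episode_actions = np.random.randn(200, 14)    # dummy episode: (T, ori_action_dim)

t = 50                                               # current timestep
chunk = episode_actions[t:t + action_window_size]    # (32, 14) action chunk to predict
print(chunk.shape)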