Training and Deployment with Private Datasets#
Overview#
This tutorial explains how to configure FluxVLA for the complete workflow from training to real-robot deployment, using datasets you have collected and prepared yourself.
Before proceeding with this tutorial, please ensure the following prerequisites are met:
The environment has been set up according to the Installation Guide
Raw data has been converted to the LeRobot v2.1 format following the Real-Robot Data Preparation guide
Pre-trained model weights (e.g., GR00T-N1.5-3B) have been downloaded
This tutorial uses configs/gr00t/gr00t_eagle_3b_aloha_full_finetune.py as the base template, demonstrating how to create a complete training and deployment configuration for your private dataset.
Overall Workflow#
The complete workflow for training and deployment with a private dataset is as follows:
Prepare the data — Ensure the data has been converted to LeRobot v2.1 format
Create the configuration file — Modify key parameters based on an existing template
Launch training — Execute distributed training using `scripts/train.sh`
Real-robot deployment — Perform inference using the trained checkpoint
Among these steps, creating the configuration file is the most critical. A configuration file comprises four major modules, each of which must be adjusted according to the characteristics of your dataset:
| Configuration Module | Description | Modification Required |
|---|---|---|
| `model` | Model architecture and pre-trained weights | Case-dependent |
| `train_dataloader` | Training data paths, camera configuration, dimension parameters | Required |
| `runner` | Training hyperparameters (epochs, learning rate, etc.) | Recommended |
| `inference` | Task descriptions and hardware configuration for real-robot deployment | Required |
Step 1: Verify the Dataset Structure#
Before beginning the configuration, confirm that your dataset conforms to the LeRobot v2.1 format with the following directory structure:
your_dataset_path/
├── data/
│ ├── train/
│ │ ├── episode_0.parquet
│ │ ├── episode_1.parquet
│ │ └── ...
│ └── video/
│ ├── episode_0/
│ │ ├── observation.images.cam_high.mp4
│ │ ├── observation.images.cam_left_wrist.mp4
│ │ └── ...
│ └── ...
├── info.json
└── meta.json
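Before moving on, it can save debugging time to verify this layout programmatically. The following sketch (a hypothetical helper, not part of FluxVLA) checks for the files and directories shown above:

```python
from pathlib import Path

def check_lerobot_dataset(root: str) -> list[str]:
    """Return a list of problems found in a LeRobot v2.1-style dataset layout."""
    p = Path(root)
    problems = []
    # Top-level metadata files
    for required in ("info.json", "meta.json"):
        if not (p / required).is_file():
            problems.append(f"missing {required}")
    # Episode parquet files
    train_dir = p / "data" / "train"
    if not train_dir.is_dir():
        problems.append("missing data/train/")
    elif not list(train_dir.glob("episode_*.parquet")):
        problems.append("no episode_*.parquet files in data/train/")
    # Per-episode video directory
    if not (p / "data" / "video").is_dir():
        problems.append("missing data/video/")
    return problems
```

An empty return value means the basic structure is in place; run it against your dataset path before filling in the table below.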
You need to determine the following key information, which will be referenced repeatedly in subsequent configuration steps:
| Information to Confirm | Example Value (Aloha Dual-Arm) | Your Value |
|---|---|---|
| Dataset path | `./datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090` | |
| Camera name list | `cam_high`, `cam_left_wrist`, `cam_right_wrist` | |
| Number of cameras | 3 | |
| Raw action dimensionality of the robot (`ori_action_dim`) | 14 (7 per arm) | |
| Robot state dimensionality | 14 | |
| Task description (natural language) | `pick up the green bowl with right arm` | |
Step 2: Configure the Model (model)#
The model configuration defines the network architecture of the VLA. If you are using the same GR00T-Eagle architecture as Aloha for full fine-tuning, you typically only need to focus on the dimension-related parameters in vla_head.
model = dict(
type='LlavaVLA',
pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=64, # internal state dimensionality (after padding)
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_inference_timesteps=4,
num_steps=32,
traj_length=10, # predicted trajectory length
action_dim=32, # internal action dimensionality (after padding)
ori_action_dim=14), # ⬅️ Modify: your robot's raw action dimensionality
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
Parameters to Modify#
| Parameter | Description | Modification Guideline |
|---|---|---|
| `ori_action_dim` | The raw action dimensionality of the robot | Modify according to your robot. Aloha dual-arm = 14, UR3 single-arm = 7 |
| `action_dim` | Internal action dimensionality after padding | Generally, keeping the default value of 32 is sufficient |
| `state_dim` | Internal state dimensionality after padding | Generally, keeping the default value of 64 is sufficient |
| `traj_length` | Number of future trajectory steps to predict | Default is 10; adjust based on task complexity |
Note: `action_dim` and `state_dim` represent the internal dimensionalities after padding and do not need to equal the raw dimensionalities of the robot. `ori_action_dim` corresponds to the actual action space dimensionality of your robot.
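As an illustration of this padding relationship (a minimal sketch, not the library's actual implementation), mapping a 14-dimensional Aloha action into the model's 32-dimensional internal space might look like this:

```python
import numpy as np

ORI_ACTION_DIM = 14   # raw robot action dimensionality (Aloha dual-arm)
ACTION_DIM = 32       # internal model dimensionality after padding

def pad_action(raw_action: np.ndarray) -> np.ndarray:
    """Zero-pad a (..., ORI_ACTION_DIM) action to the model's internal width."""
    padded = np.zeros(raw_action.shape[:-1] + (ACTION_DIM,), dtype=raw_action.dtype)
    padded[..., :ORI_ACTION_DIM] = raw_action
    return padded

def unpad_action(padded: np.ndarray) -> np.ndarray:
    """Recover the raw action by dropping the padded dimensions."""
    return padded[..., :ORI_ACTION_DIM]
```

Because the extra dimensions carry no information, only `ori_action_dim` has to change when you switch robots.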
inference_model (Optional)#
If a different model configuration is required during inference (e.g., a different pre-trained path), an additional inference_model can be defined with the same format as model:
inference_model = dict(
type='LlavaVLA',
pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
dtype=None,
vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=64,
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_steps=32,
num_inference_timesteps=4,
traj_length=10,
ori_action_dim=14, # ⬅️ keep consistent with model
action_dim=32),
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
Step 3: Configure the Training Data (train_dataloader)#
This is the section that requires the most careful modification. You need to adjust the data paths, camera configuration, and dimension parameters according to the characteristics of your dataset.
train_dataloader = dict(
per_device_batch_size=8, # batch size per GPU
per_device_num_workers=4, # number of data-loading workers per GPU
dataset=dict(
type='DistributedRepeatingDataset',
name_mappings={'observation.state': ['proprio', 'action']},
statistic_keys=[
'observation.state', 'observation.eepose', 'timestamp'
],
datasets=[
dict(
type='ParquetDataset',
data_root_path=[
# ⬅️ Modify: replace with your dataset paths
'./datasets/your_dataset_path/batch_01',
'./datasets/your_dataset_path/batch_02',
],
transforms=[...], # see the detailed description below
action_window_size=32 # ⬅️ Modify: action window size
)
]))
3.1 Dataset Path#
data_root_path supports a list format, allowing you to specify multiple data batch paths simultaneously:
data_root_path=[
'./datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
],
3.2 Data Transform Pipeline (transforms)#
The transform pipeline is the core of data preprocessing and should be configured with the following 5 transform steps in order:
Step 1: ProcessParquetInputs — Parse Parquet Data#
dict(
type='ProcessParquetInputs',
embodiment_id=0, # ⬅️ robot type ID (Aloha=0, UR3=2)
parquet_keys=[ # fields to read from the parquet files
'observation.state',
'observation.eepose',
'timestamp',
'actions',
'info',
'stats',
'action_masks'
],
video_keys=[ # ⬅️ Modify: your camera video keys
'observation.images.cam_high',
'observation.images.cam_left_wrist',
'observation.images.cam_right_wrist'
],
name_mappings={'observation.state': ['states']}),
⚠️ Important: The camera names in `video_keys` must match the filenames under the `video/` directory in the dataset. For example, if your video file is `observation.images.head_cam.mp4`, the corresponding key should be `observation.images.head_cam`.
Step 2: ParquetPrompter — Generate Text Prompts#
dict(type='ParquetPrompter'),
This step extracts task descriptions (the task field) from the Parquet data for use as text prompts. No additional configuration is required.
Step 3: ProcessPromptsWithImage — Text Tokenization and Image Association#
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3, # ⬅️ Modify: number of cameras
tokenizer=dict(
type='PretrainedTokenizer',
model_path='path/to/eagle2_hg_model',
)),
| Parameter | Description |
|---|---|
| `num_images` | Number of cameras; must match the number of entries in `video_keys` |
| `max_len` | Maximum text sequence length |
Step 4: Image Preprocessing#
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375], # per-camera RGB means
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]], # ⬅️ list length = number of cameras
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]],
),
Note: The length of the `means` and `stds` lists must equal the number of cameras. When using the Eagle model, the default values shown above are sufficient.
Step 5: NormalizeStatesAndActions — State and Action Normalization#
dict(
type='NormalizeStatesAndActions',
state_dim=64, # must match model.vla_head.state_dim
action_dim=32, # must match model.vla_head.action_dim
state_key='proprio',
action_key='action',
norm_type='mean_std') # normalization method: 'mean_std' or 'min_max'
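For intuition, `mean_std` normalization applies per-dimension statistics computed over the dataset. A minimal sketch (not FluxVLA's actual implementation) of the forward and inverse transforms:

```python
import numpy as np

def normalize_mean_std(x: np.ndarray, mean: np.ndarray, std: np.ndarray,
                       eps: float = 1e-8) -> np.ndarray:
    """Map raw states/actions into the normalized space the model is trained on."""
    return (x - mean) / (std + eps)

def denormalize_mean_std(x_norm: np.ndarray, mean: np.ndarray, std: np.ndarray,
                         eps: float = 1e-8) -> np.ndarray:
    """Inverse transform, as applied to model outputs at deployment time."""
    return x_norm * (std + eps) + mean
```

The statistics come from the dataset statistics saved during training, which is why inference must use the same `norm_type` and the same statistics file.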
3.3 Core Parameter Reference Table#
The following table summarizes all parameters in the data configuration that need to be modified according to your specific scenario:
| Parameter | Location | Description | Aloha Example Value |
|---|---|---|---|
| `data_root_path` | `ParquetDataset` | Dataset path | `./datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090` |
| `embodiment_id` | `ProcessParquetInputs` | Robot type ID | 0 |
| `video_keys` | `ProcessParquetInputs` | List of camera video keys | 3 cameras |
| `num_images` | `ProcessPromptsWithImage` | Number of cameras | 3 |
| `means` / `stds` | `NormalizeImages` | Image normalization parameters (list length = number of cameras) | 3 groups |
| `state_dim` | `NormalizeStatesAndActions` | Internal state dimensionality (must match the model) | 64 |
| `action_dim` | `NormalizeStatesAndActions` | Internal action dimensionality (must match the model) | 32 |
| `norm_type` | `NormalizeStatesAndActions` | Normalization method | `'mean_std'` |
| `action_window_size` | `ParquetDataset` | Action window size | 32 |
Step 4: Configure Training Parameters (runner)#
Training parameters typically do not require significant modifications. Adjust them appropriately based on your data volume and GPU resources:
runner = dict(
type='FSDPTrainRunner',
max_epochs=6, # ⬅️ number of epochs; increase for smaller datasets
learning_rate=2e-5, # ⬅️ learning rate
weight_decay=0.0,
max_grad_norm=1.0,
sampler=None,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='path/to/eagle2_hg_model',
),
collator=dict(
type='DictCollator',
keys=[
'states', 'observation.eepose', 'timestamp', 'images',
'img_masks', 'lang_tokens', 'lang_masks', 'actions',
'action_masks', 'embodiment_ids'
],
meta_keys=['task_description', 'prompt', 'info', 'stats']),
metric=dict(
type='VLAMetric',
active_trackers=('jsonl', 'wandb'),
run_dir='work_dirs',
grad_accumulation_steps=1,
window_size=1),
lr_scheduler_type='constant', # learning-rate schedule
warmup_ratio=0.0,
enable_gradient_checkpointing=False,
enable_mixed_precision_training=True,
mixed_precision_dtype='bf16',
change_key_name=False)
Common Adjustment Recommendations#
| Parameter | Adjustment Guideline |
|---|---|
| `max_epochs` | For smaller datasets (< 100 episodes), 10–24 is recommended; for larger datasets, 3–6 is sufficient |
| `learning_rate` | For full fine-tuning, a range of 1e-5 to 5e-5 is recommended |
| `lr_scheduler_type` | Options include `constant` and `cosine` |
| `warmup_ratio` | When using cosine decay, a value of 0.03 is recommended |
| `enable_gradient_checkpointing` | Set to `True` to reduce GPU memory usage at some training-speed cost |
| `per_device_batch_size` | Reduce when GPU memory is insufficient, e.g., to 4 or 2 |
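When adjusting the per-device batch size, it helps to track the effective global batch size, since it also scales with GPU count and gradient accumulation. A small arithmetic sketch:

```python
def effective_batch_size(per_device_batch_size: int, num_gpus: int,
                         grad_accumulation_steps: int = 1) -> int:
    """Global batch size consumed per optimizer step under data parallelism."""
    return per_device_batch_size * num_gpus * grad_accumulation_steps

# The default config on a single 8-GPU node:
#   effective_batch_size(8, 8, 1) == 64
# Halving the per-device batch while doubling accumulation keeps it constant:
#   effective_batch_size(4, 8, 2) == 64
```

Keeping the effective batch size constant when reducing `per_device_batch_size` avoids also having to retune the learning rate.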
Step 5: Configure Inference Deployment (inference)#
The inference configuration is used for real-robot deployment and requires specifying task descriptions, the data processing pipeline, and the robot hardware interface.
inference = dict(
type='AlohaInferenceRunner', # ⬅️ inference runner type; choose per robot
seed=7,
task_descriptions={ # ⬅️ Modify: your task descriptions
'1': 'pick up the green bowl with right arm',
'2': 'place it on the blue plate with left arm',
},
dataset=dict(
type='PrivateInferenceDataset',
embodiment_id=0, # ⬅️ same as training
img_keys=[ # ⬅️ Modify: camera names (without the key prefix)
'cam_high',
'cam_left_wrist',
'cam_right_wrist'
],
transforms=[
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3, # ⬅️ same as training
tokenizer=dict(type='PretrainedTokenizer')),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]],
),
dict(
type='NormalizeStatesAndActions',
state_dim=64, # ⬅️ same as training
state_key='proprio',
action_key='action',
norm_type='mean_std') # ⬅️ same as training
]),
denormalize_action=dict(
type='DenormalizePrivateAction',
norm_type='mean_std', # ⬅️ must match the training norm_type
action_dim=14), # ⬅️ the robot's raw action dimensionality
operator=dict( # ⬅️ Modify: your robot's ROS topic configuration
type='AlohaOperator',
img_front_topic='/camera_f/color/image_raw',
img_left_topic='/camera_l/color/image_raw',
img_right_topic='/camera_r/color/image_raw',
img_front_depth_topic='/camera_f/depth/image_raw',
img_left_depth_topic='/camera_l/depth/image_raw',
img_right_depth_topic='/camera_r/depth/image_raw',
puppet_arm_left_cmd_topic='/master/joint_left',
puppet_arm_right_cmd_topic='/master/joint_right',
puppet_arm_left_topic='/puppet/joint_left',
puppet_arm_right_topic='/puppet/joint_right',
robot_base_topic='/odom_raw',
robot_base_cmd_topic='/cmd_vel',
))
Key Points for Inference Configuration#
- `type`: Select the inference runner according to your robot platform. The Aloha dual-arm system uses `AlohaInferenceRunner`, while UR-series robots use `URInferenceRunner`.
- `task_descriptions`: Defines the available task instructions. Keys are task IDs (strings), and values are natural language task descriptions. Tasks are selected via keyboard input during inference.
- `dataset.transforms`: The data preprocessing pipeline during inference must remain consistent with the one used during training (normalization method, image dimensions, etc.).
- `denormalize_action`: Converts the model's normalized action output into real robot actions. Both `norm_type` and `action_dim` must be consistent with the training configuration.
- `operator`: Defines the ROS topic communication interface for the robot and should be modified according to your actual hardware configuration.
Step 6: Launch Training#
Once the configuration file is prepared, training can be launched. Save the configuration file to the configs/ directory (e.g., configs/gr00t/my_custom_finetune.py) and execute the following:
# Set environment variables (single-node, 8-GPU example)
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
# Launch training
bash scripts/train.sh \
configs/gr00t/my_custom_finetune.py \
work_dirs/my_custom_finetune
Upon completion of training, the following files will be generated in work_dirs/my_custom_finetune/:
- `checkpoint_*.pt` — Model checkpoints
- `dataset_statistics.json` — Dataset statistics (used during inference)
- `config.yaml` / `config.json` — Configuration backups
Step 7: Real-Robot Deployment#
Use the trained checkpoint for real-robot inference:
python scripts/inference_real_robot.py \
--config configs/gr00t/my_custom_finetune.py \
--ckpt-path work_dirs/my_custom_finetune/checkpoint_epoch_6.pt
⚠️ Important: The configuration file used during inference should be the same as the one used during training to ensure that data normalization methods, dimension parameters, and other settings are fully consistent.
Complete Example: Aloha Dual-Arm Robot#
The following is the complete configuration from configs/gr00t/gr00t_eagle_3b_aloha_full_finetune.py, serving as a reference template for creating your own configuration file.
# ============================================================
# Model configuration
# ============================================================
model = dict(
type='LlavaVLA',
pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=64,
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_inference_timesteps=4,
num_steps=32,
traj_length=10,
action_dim=32,
ori_action_dim=14), # Aloha dual-arm: left 7 + right 7 = 14
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
inference_model = dict(
type='LlavaVLA',
pretrained_name_or_path='./checkpoints/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
dtype=None,
vlm_path='fluxvla/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=64,
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_steps=32,
num_inference_timesteps=4,
traj_length=10,
ori_action_dim=14,
action_dim=32),
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
# ============================================================
# Data configuration
# ============================================================
train_dataloader = dict(
per_device_batch_size=8,
per_device_num_workers=4,
dataset=dict(
type='DistributedRepeatingDataset',
name_mappings={'observation.state': ['proprio', 'action']},
statistic_keys=[
'observation.state', 'observation.eepose', 'timestamp'
],
datasets=[
dict(
type='ParquetDataset',
data_root_path=[
'./datasets/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
],
transforms=[
dict(
type='ProcessParquetInputs',
embodiment_id=0, # Aloha
parquet_keys=[
'observation.state', 'observation.eepose',
'timestamp', 'actions', 'info', 'stats',
'action_masks'
],
video_keys=[ # 3 cameras
'observation.images.cam_high',
'observation.images.cam_left_wrist',
'observation.images.cam_right_wrist'
],
name_mappings={'observation.state': ['states']}),
dict(type='ParquetPrompter'),
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3, # number of cameras
tokenizer=dict(
type='PretrainedTokenizer',
model_path='path/to/eagle2_hg_model',
)),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]],
),
dict(
type='NormalizeStatesAndActions',
state_dim=64,
action_dim=32,
state_key='proprio',
action_key='action',
norm_type='mean_std')
],
action_window_size=32)
]))
# ============================================================
# Training configuration
# ============================================================
runner = dict(
type='FSDPTrainRunner',
max_epochs=6,
learning_rate=2e-5,
weight_decay=0.0,
max_grad_norm=1.0,
sampler=None,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='path/to/eagle2_hg_model',
),
collator=dict(
type='DictCollator',
keys=[
'states', 'observation.eepose', 'timestamp', 'images',
'img_masks', 'lang_tokens', 'lang_masks', 'actions',
'action_masks', 'embodiment_ids'
],
meta_keys=['task_description', 'prompt', 'info', 'stats']),
metric=dict(
type='VLAMetric',
active_trackers=('jsonl', 'wandb'),
run_dir='work_dirs',
grad_accumulation_steps=1,
window_size=1),
lr_scheduler_type='constant',
warmup_ratio=0.0,
enable_gradient_checkpointing=False,
enable_mixed_precision_training=True,
mixed_precision_dtype='bf16',
change_key_name=False)
# ============================================================
# Inference configuration
# ============================================================
inference = dict(
type='AlohaInferenceRunner',
seed=7,
task_descriptions={
'1': 'pick up the green bowl with right arm',
'2': 'place it on the blue plate with left arm',
# ... add more entries for your tasks
},
dataset=dict(
type='PrivateInferenceDataset',
embodiment_id=0,
img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
transforms=[
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3,
tokenizer=dict(type='PretrainedTokenizer')),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]],
),
dict(
type='NormalizeStatesAndActions',
state_dim=64,
state_key='proprio',
action_key='action',
norm_type='mean_std')
]),
denormalize_action=dict(
type='DenormalizePrivateAction',
norm_type='mean_std',
action_dim=14), # raw action dimensionality
operator=dict(
type='AlohaOperator',
img_front_topic='/camera_f/color/image_raw',
img_left_topic='/camera_l/color/image_raw',
img_right_topic='/camera_r/color/image_raw',
img_front_depth_topic='/camera_f/depth/image_raw',
img_left_depth_topic='/camera_l/depth/image_raw',
img_right_depth_topic='/camera_r/depth/image_raw',
puppet_arm_left_cmd_topic='/master/joint_left',
puppet_arm_right_cmd_topic='/master/joint_right',
puppet_arm_left_topic='/puppet/joint_left',
puppet_arm_right_topic='/puppet/joint_right',
robot_base_topic='/odom_raw',
robot_base_cmd_topic='/cmd_vel',
))
Frequently Asked Questions#
Q: How to adapt the configuration for a different number of cameras?#
The following locations must be updated synchronously:
- `ProcessParquetInputs.video_keys` — List all camera video keys
- `ProcessPromptsWithImage.num_images` — Set the number of cameras
- `NormalizeImages.means` / `stds` — Provide one group of normalization parameters per camera
- `inference.dataset.img_keys` — List the camera names for inference
For example, a configuration with 2 cameras:
video_keys=[
'observation.images.cam_high',
'observation.images.cam_wrist'
],
# ...
num_images=2,
# ...
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]],
Q: What happens if the normalization methods are inconsistent between training and inference?#
Inference performance will degrade significantly. Ensure that the following parameters are fully consistent between the training and inference configurations:
- `norm_type` (`mean_std` / `min_max`)
- `state_dim`, `action_dim`
- Image normalization `means` / `stds`
- Image dimensions `height` / `width`
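These consistency requirements can be checked mechanically. Below is a hypothetical helper (not part of FluxVLA) that diffs the relevant fields between the training and inference transform lists, following the config layout used in this tutorial:

```python
def find_transform(transforms: list, ttype: str) -> dict:
    """Return the first transform dict of the given type."""
    for t in transforms:
        if t.get('type') == ttype:
            return t
    raise KeyError(ttype)

def check_consistency(train_transforms: list, infer_transforms: list) -> list[str]:
    """List the fields that differ between training and inference pipelines."""
    checks = [
        ('NormalizeStatesAndActions', ['norm_type', 'state_dim']),
        ('ResizeImages', ['height', 'width']),
        ('NormalizeImages', ['means', 'stds']),
    ]
    mismatches = []
    for ttype, fields in checks:
        a = find_transform(train_transforms, ttype)
        b = find_transform(infer_transforms, ttype)
        for f in fields:
            if a.get(f) != b.get(f):
                mismatches.append(f'{ttype}.{f}')
    return mismatches
```

Running such a check before deployment catches silent normalization drift, which otherwise only shows up as degraded robot behavior.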
Q: How to perform mixed training with multiple datasets?#
Add multiple ParquetDataset configurations to the datasets list:
datasets=[
dict(
type='ParquetDataset',
data_root_path=['./datasets/task_A/batch_01'],
transforms=[...],
action_window_size=32),
dict(
type='ParquetDataset',
data_root_path=['./datasets/task_B/batch_01'],
transforms=[...],
action_window_size=32),
]
Q: How to select an appropriate action_window_size?#
`action_window_size` determines the length of the action sequence included in each training sample. It should match `num_steps` in the model head; both represent the action chunk length, so that the model generates an entire window of actions in a single prediction.
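To make the windowing concrete, here is a minimal sketch of how a training sample's action chunk could be sliced from an episode. The repeat-last-action padding at episode boundaries is an illustrative assumption, not necessarily what FluxVLA does:

```python
import numpy as np

def sample_action_window(episode_actions: np.ndarray, t: int,
                         action_window_size: int = 32) -> np.ndarray:
    """Slice the action chunk starting at timestep t from one episode.

    When the episode ends inside the window, the last action is repeated
    so every sample has the same (action_window_size, dim) shape.
    """
    window = episode_actions[t:t + action_window_size]
    if len(window) < action_window_size:
        pad = np.repeat(window[-1:], action_window_size - len(window), axis=0)
        window = np.concatenate([window, pad], axis=0)
    return window
```

A larger window lets one prediction cover more of the trajectory but makes each sample harder to fit; the default of 32 is a reasonable starting point for manipulation tasks like the Aloha examples.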