GR00T-N1.5 Model Training#
This tutorial provides a comprehensive guide on how to fine-tune the GR00T-N1.5 model using the FluxVLA framework. The content covers two typical scenarios:
LIBERO Simulation Data Training – using the ready-to-use LIBERO benchmark dataset for rapid validation (see the LIBERO Simulation Data Training section).
Private Real-Robot Data Training – using self-collected robot data for customized fine-tuning (see Real-Robot Data Preparation).
Model Architecture#
GR00T-N1.5 in FluxVLA consists of the following three core modules:
| Module | Type | Function |
|---|---|---|
| VLA Container | LlavaVLA | Model wrapper responsible for loading pretrained weights and coordinating submodules |
| Vision-Language Backbone | EagleBackbone | Eagle2 multimodal model for image feature extraction and language understanding |
| Action Head | FlowMatchingHead | Flow Matching-based action generator that maps vision-language features to robot action sequences |
The model maps pretrained checkpoint weights to the FluxVLA module structure via name_mapping:
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
}
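The snippet below is an illustrative sketch (not the actual FluxVLA loading code) of what such a prefix mapping does: checkpoint parameter names that start with a mapped prefix are renamed so they line up with the FluxVLA module hierarchy.
def remap_state_dict(checkpoint_state_dict, name_mapping):
    """Rename checkpoint keys so they match the FluxVLA module structure."""
    remapped = {}
    for ckpt_key, tensor in checkpoint_state_dict.items():
        new_key = ckpt_key
        for fluxvla_prefix, ckpt_prefix in name_mapping.items():
            if ckpt_key.startswith(ckpt_prefix):
                new_key = fluxvla_prefix + ckpt_key[len(ckpt_prefix):]
                break
        remapped[new_key] = tensor
    return remapped

# e.g. 'backbone.eagle_model.layer.0.weight' -> 'vlm_backbone.vlm.layer.0.weight'
#      'action_head.mlp.weight'              -> 'vla_head.mlp.weight'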
Scenario 2: Private Real-Robot Data Training#
This section uses the Aloha dual-arm robot as an example to demonstrate how to train GR00T-N1.5 with self-collected data and deploy it on a real robot. The workflow for other robots such as UR3 single-arm is similar; only the dimension and camera parameters need to be adjusted.
1. Data Preparation#
1.1 Data Format Requirements#
Private data must be converted to the LeRobot v2.1 format. If the original data is in HDF5 format, the data conversion script can be used for conversion (see Real-Robot Data Preparation).
Directory structure after conversion:
your_dataset_path/
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── ...
├── videos/
│   └── chunk-000/
│       ├── observation.images.cam_high/
│       │   ├── episode_000000.mp4
│       │   └── ...
│       ├── observation.images.cam_left_wrist/
│       │   └── ...
│       └── observation.images.cam_right_wrist/
│           └── ...
└── meta/
    ├── episodes.jsonl
    ├── episodes_stats.jsonl
    ├── info.json
    └── tasks.jsonl
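As a quick sanity check after conversion, the structure can be verified with a small script. The snippet below is a sketch that assumes a single chunk-000 directory and an illustrative dataset path:
from pathlib import Path

root = Path('/path/to/your_dataset_path')
episodes = sorted((root / 'data' / 'chunk-000').glob('episode_*.parquet'))
for cam_dir in sorted((root / 'videos' / 'chunk-000').iterdir()):
    videos = list(cam_dir.glob('episode_*.mp4'))
    # every camera directory should hold one mp4 per parquet episode
    print(f'{cam_dir.name}: {len(videos)} videos / {len(episodes)} episodes')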
1.2 Identifying Key Information#
Before starting the configuration, first determine the following data characteristics:
| Information | Aloha Dual-Arm Example | UR3 Single-Arm Example |
|---|---|---|
| Dataset Path | | |
| Camera Names | cam_high, cam_left_wrist, cam_right_wrist | |
| Number of Cameras | 3 | 2 |
| Action Dimension | 14 (left arm 7 + right arm 7) | 7 (6 joints + 1 gripper) |
| State Dimension | 14 | 7 |
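These values can usually be read directly from the converted dataset's metadata. The sketch below assumes the standard LeRobot v2.1 meta/info.json schema, where each feature records its shape (the exact action key name may vary between conversion scripts):
import json
from pathlib import Path

info = json.loads(Path('/path/to/your_dataset_path/meta/info.json').read_text())

camera_keys = [k for k in info['features'] if k.startswith('observation.images.')]
action_key = 'action' if 'action' in info['features'] else 'actions'

print('cameras:   ', camera_keys)                                        # 3 for Aloha, 2 for UR3
print('state_dim: ', info['features']['observation.state']['shape'][0])  # 14 or 7
print('action_dim:', info['features'][action_key]['shape'][0])           # 14 or 7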
2. Configuration File Details#
Using configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py as the reference example.
2.1 Model Configuration#
model = dict(
type='LlavaVLA',
pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
        state_dim=14,  # Aloha dual-arm state dimension
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_inference_timesteps=4,
        num_step=32,  # number of Flow Matching sampling steps
traj_length=10,
        action_dim=14),  # Aloha dual-arm action dimension
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
Differences from the LIBERO configuration:
| Parameter | LIBERO | Aloha | Description |
|---|---|---|---|
| state_dim | 8 | 14 | Robot state dimension |
| action_dim | 7 | 14 | Robot action dimension |
| num_step | Not set (default) | 32 | Number of Flow Matching sampling steps; more complex action spaces require more steps |
| traj_length | 10 | 10 | Predicted trajectory length; identical in both cases |
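To see how num_inference_timesteps, traj_length, and action_dim interact at inference time, the following is a generic Euler-style flow-matching sampler, shown only as a conceptual sketch (velocity_net stands in for the learned action head; this is not the FluxVLA FlowMatchingHead implementation):
import torch

def sample_actions(velocity_net, condition, traj_length=10, action_dim=14,
                   num_inference_timesteps=4):
    # start the action chunk from Gaussian noise: (batch, traj_length, action_dim)
    actions = torch.randn(1, traj_length, action_dim)
    dt = 1.0 / num_inference_timesteps
    for i in range(num_inference_timesteps):
        t = torch.full((1,), i * dt)                    # current integration time in [0, 1)
        velocity = velocity_net(actions, t, condition)  # predicted velocity field
        actions = actions + dt * velocity               # Euler integration step
    return actions  # still normalized; denormalize before sending to the robot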
For inference, an additional inference_model can be defined with the same format as model, generally only requiring the addition of the dtype=None parameter:
inference_model = dict(
type='LlavaVLA',
pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
dtype=None,
vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=14,
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_step=32,
num_inference_timesteps=4,
traj_length=10,
action_dim=14),
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
2.2 Data Configuration#
train_dataloader = dict(
batch_size=128,
per_device_batch_size=8,
per_device_num_workers=4,
dataset=dict(
type='DistributedRepeatingDataset',
name_mappings={'observation.state': ['proprio', 'action']},
statistic_keys=[
'observation.state', 'observation.eepose', 'timestamp'
],
datasets=[
dict(
type='ParquetDataset',
data_root_path=[
'/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
'/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250616_20250630_02_4090',
            # ... more data batches can be added here
],
transforms=[
dict(
type='ProcessParquetInputs',
embodiment_id=30,
parquet_keys=[
'observation.state', 'observation.eepose',
'timestamp', 'actions', 'info', 'stats',
'action_masks'
],
video_keys=[
'observation.images.cam_high',
'observation.images.cam_left_wrist',
'observation.images.cam_right_wrist'
],
name_mappings={'observation.state': ['states']}),
dict(type='ParquetPrompter'),
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
)),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='NormalizeStatesAndActions',
action_dim=14,
state_key='proprio',
action_key='action',
use_quantiles=False)
],
action_window_size=32)
]))
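Because the config file is plain Python, the list of data batches does not have to be maintained by hand. A small sketch (adjust the glob pattern to your storage layout):
from glob import glob

# collect every converted batch under the dataset root in a stable order
data_root_path = sorted(glob(
    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/*_02_4090'))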
Key differences from the LIBERO data configuration:
| Configuration Item | LIBERO | Aloha Private Data | Adaptation Notes |
|---|---|---|---|
| data_root_path | Single path (string) | List of paths | Private data is typically stored in separate batches |
| name_mappings | state → proprio; action → action | state → [proprio, action] | In LIBERO, state and action are mapped separately; in private data, state serves as the reference for both proprio and action |
| embodiment_id | 31 | 30 (Aloha) | Different robot platforms use different IDs |
| video_keys | 2 cameras | 3 cameras | Based on the actual hardware configuration |
| num_images | 2 | 3 | Must match the number of video_keys |
| max_len | 600 | 900 | More cameras require a longer token sequence |
| action_window_size | 10 | 32 | Real-robot scenarios typically require a longer action window |
Note
The length of the means and stds lists in NormalizeImages must equal the number of cameras. When using the Eagle backbone, the default values shown above are applicable to all cameras.
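Since the config is plain Python, the per-camera lists can also be built programmatically so they always stay in sync with the camera count. A minimal sketch for the Aloha three-camera setup:
num_cameras = 3  # cam_high, cam_left_wrist, cam_right_wrist
per_camera_mean = [123.515625, 116.04492188, 103.59375]
per_camera_std = [58.27148438, 57.02636719, 57.27539062]

normalize_images = dict(
    type='NormalizeImages',
    means=[list(per_camera_mean) for _ in range(num_cameras)],
    stds=[list(per_camera_std) for _ in range(num_cameras)])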
2.3 Training Configuration#
runner = dict(
type='FSDPTrainRunner',
    max_epochs=3,  # with large datasets, fewer epochs are sufficient
learning_rate=2e-5,
weight_decay=0.0,
max_grad_norm=1.0,
sampler=None,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
),
collator=dict(
type='DictCollator',
keys=[
'states', 'observation.eepose', 'timestamp', 'images',
'img_masks', 'lang_tokens', 'lang_masks', 'actions',
'action_masks', 'embodiment_ids'
],
meta_keys=['task_description', 'prompt', 'info', 'stats']),
metric=dict(
type='VLAMetric',
active_trackers=('jsonl', 'wandb'),
run_dir='work_dirs',
wandb_project='FluxVLA',
wandb_entity='limx',
grad_accumulation_steps=1,
window_size=1),
lr_scheduler_type='constant',
warmup_ratio=0.0,
enable_gradient_checkpointing=True,
enable_mixed_precision_training=True,
mixed_precision_dtype='bf16',
sharding_strategy='full-shard',
change_key_name=False)
Recommended number of training epochs:
| Data Scale | Recommended Epochs | Description |
|---|---|---|
| < 100 episodes | 10 ~ 24 | Small data volume requires more epochs to fit |
| 100 ~ 1000 episodes | 3 ~ 6 | Medium scale, typical real-robot collection volume |
| > 1000 episodes | 1 ~ 3 | Large-scale data requires only a few epochs |
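A rough way to reason about these recommendations is to estimate how many optimizer steps an epoch provides; all numbers below are assumptions to adjust for your own dataset:
num_episodes = 500          # e.g. a typical real-robot collection
frames_per_episode = 400    # assumed average episode length
global_batch_size = 128     # train_dataloader.batch_size
epochs = 4

steps_per_epoch = num_episodes * frames_per_episode // global_batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # -> 1562 steps per epoch, 6248 in total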
2.4 Inference Configuration#
inference = dict(
type='AlohaInferenceRunner',
seed=7,
task_descriptions={
'1': 'pick up the yellow chicken toy with left arm',
'2': 'place it in the brown flat cardboard box with right arm',
        # ... add more entries for your actual tasks
},
dataset=dict(
type='PrivateInferenceDataset',
embodiment_id=30,
img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
transforms=[
dict(type='PrivatePrompter'),
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3,
tokenizer=dict(type='PretrainedTokenizer')),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='NormalizeStatesAndActions',
action_dim=14,
state_key='proprio',
action_key='action',
use_quantiles=False)
]),
denormalize_action=dict(
type='DenormalizePrivateAction',
use_quantiles=False),
operator=dict(
type='AlohaOperator',
img_front_topic='/camera_f/color/image_raw',
img_left_topic='/camera_l/color/image_raw',
img_right_topic='/camera_r/color/image_raw',
img_front_depth_topic='/camera_f/depth/image_raw',
img_left_depth_topic='/camera_l/depth/image_raw',
img_right_depth_topic='/camera_r/depth/image_raw',
puppet_arm_left_cmd_topic='/master/joint_left',
puppet_arm_right_cmd_topic='/master/joint_right',
puppet_arm_left_topic='/puppet/joint_left',
puppet_arm_right_topic='/puppet/joint_right',
robot_base_topic='/odom_raw',
robot_base_cmd_topic='/cmd_vel'))
Key points of the inference configuration:
task_descriptions – keys are task IDs (strings) and values are natural-language task descriptions; tasks are selected via keyboard during inference
PrivateInferenceDataset – dataset type for real-robot inference; loads dataset_statistics.json for denormalization
PrivatePrompter – replaces the ParquetPrompter used during training; constructs prompts from the inference inputs
DenormalizePrivateAction – converts normalized model output actions back to real robot actions
operator – defines the ROS topic communication interface; must be modified to match the actual hardware
Warning
The data preprocessing parameters during inference (normalization method, image size, action_dim, etc.) must be consistent with those used during training; otherwise, inference performance will degrade significantly.
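A simple way to guard against such mismatches is to compare the shared preprocessing parameters of the training and inference transform lists in the same config. The sketch below assumes the config is a standalone Python file whose top-level dicts can be loaded with runpy:
import runpy

cfg = runpy.run_path('configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py')

def by_type(transform_list):
    return {t['type']: t for t in transform_list}

train_tf = by_type(cfg['train_dataloader']['dataset']['datasets'][0]['transforms'])
infer_tf = by_type(cfg['inference']['dataset']['transforms'])

for name in ('ResizeImages', 'NormalizeImages', 'NormalizeStatesAndActions',
             'ProcessPromptsWithImage'):
    for key, value in train_tf[name].items():
        if key == 'tokenizer' or key not in infer_tf[name]:
            continue  # tokenizer paths may legitimately differ between stages
        assert infer_tf[name][key] == value, f'{name}.{key}: {value} != {infer_tf[name][key]}'
print('training and inference preprocessing parameters are consistent')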
3. Launching Training#
cd /path/to/fluxvla
# Single node, 8 GPU training
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train
Multi-node training example (2 nodes ร 8 GPUs):
Node 0 (master node):
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train
Node 1:
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=1
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train
4. Real-Robot Deployment#
Use the trained checkpoint for inference:
python scripts/inference_real_robot.py \
--config configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
--ckpt-path work_dirs/gr00t_eagle_3b_aloha_4090_full_train/checkpoint_epoch_3.pt
Adapting to Other Robots: UR3 Single-Arm Example#
Using configs/gr00t/gr00t_eagle_3b_ur3_full_train.py as reference, the key differences between the UR3 and Aloha configurations are as follows:
| Configuration Item | Aloha Dual-Arm | UR3 Single-Arm | Description |
|---|---|---|---|
| state_dim / action_dim | 14 | 7 | UR3 has only one manipulator arm |
| embodiment_id | 30 | 31 | Identifier for different robot platforms |
| video_keys | 3 cameras | 2 cameras | UR3 typically uses head + wrist cameras |
| num_images | 3 | 2 | Must match the number of cameras |
| max_len | 900 | 600 | Fewer cameras require a shorter token sequence |
| inference.type | AlohaInferenceRunner | | Different inference runners |
| operator | AlohaOperator | | Different hardware communication interfaces |
When adapting to a new robot, create a new configuration file and adjust the parameters listed above. If the existing inference runners and operators do not meet the requirements, refer to the private_engine tutorial for how to extend them.
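For orientation, the sketch below collects the deltas from the table above into one place; the camera key names are assumptions for a head + wrist setup, and the shipped configs/gr00t/gr00t_eagle_3b_ur3_full_train.py remains the authoritative reference:
ur3_overrides = dict(
    state_dim=7,        # single 6-joint arm + gripper
    action_dim=7,
    embodiment_id=31,
    num_images=2,       # must match the number of video_keys
    max_len=600,        # fewer image tokens, shorter prompt budget
    video_keys=[        # hypothetical key names; check your converted dataset
        'observation.images.cam_high',
        'observation.images.cam_wrist',
    ])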
Available Configuration Summary#
| Configuration File | Data | Backbone | Robot |
|---|---|---|---|
| gr00t_eagle_3b_libero_10_full_train.py | LIBERO-10 | Eagle | Simulation (LIBERO) |
| | LIBERO-10 | PaliGemma | Simulation (LIBERO) |
| gr00t_eagle_3b_aloha_4090_full_train.py | Aloha private data | Eagle | Aloha dual-arm |
| | Aloha towel folding (3 cameras) | Eagle | Aloha dual-arm |
| | Aloha towel folding (4 cameras) | Eagle | Aloha dual-arm |
| gr00t_eagle_3b_ur3_full_train.py | UR3 private data | Eagle | UR3 single-arm |
Configuring Weights & Biases#
Training loss curves and metrics can be tracked and visualized through Weights & Biases.
# Install
pip install wandb
# Log in
wandb login
# Set project information
export WANDB_PROJECT=FluxVLA
export WANDB_ENTITY=your-team-name
export WANDB_MODE=online
# To disable tracking
export WANDB_MODE=disabled
Frequently Asked Questions#
Q: Can LIBERO training and private data training be combined?
Yes. Simply configure both LIBERO and private data ParquetDataset entries in the datasets list. However, note that the embodiment_id, video_keys, action_dim, and other parameters may differ between the two, and their respective transforms must be configured separately.
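A sketch of such a mixed setup is shown below; the LIBERO path is a placeholder, and the two transform lists should be copied from the respective single-source configs (gr00t_eagle_3b_libero_10_full_train.py and gr00t_eagle_3b_aloha_4090_full_train.py):
libero_transforms = [...]   # embodiment_id=31, 2 cameras, action_dim=7
aloha_transforms = [...]    # embodiment_id=30, 3 cameras, action_dim=14

datasets = [
    dict(
        type='ParquetDataset',
        data_root_path='/path/to/data/libero_10_lerobot',   # placeholder path
        transforms=libero_transforms,
        action_window_size=10),
    dict(
        type='ParquetDataset',
        data_root_path=[
            '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
        ],
        transforms=aloha_transforms,
        action_window_size=32),
]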
Q: How can training be resumed from an interruption?
Use the --resume-from flag to specify the checkpoint path:
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train \
--resume-from work_dirs/gr00t_eagle_3b_aloha_4090_full_train/checkpoint_epoch_2.pt
Q: What can be done if GPU memory is insufficient?
The following approaches can be attempted:
Reduce per_device_batch_size (e.g., from 8 to 4 or 2)
Enable enable_gradient_checkpointing=True (trades computation time for memory)
Adjust the FSDP sharding strategy (sharding_strategy='full-shard')
Q: What are the differences between GR00T-N1.5 and π₀?
Both share the same VLA head (FlowMatchingHead) in FluxVLA. The primary difference lies in the vision-language backbone: GR00T-N1.5 uses EagleBackbone (the Eagle2 model), while π₀ uses a PaliGemma backbone. The Eagle backbone generally performs better in multi-image scenarios.
Q: How can configuration file parameters be overridden?
Parameters can be dynamically overridden via the command line using --cfg-options:
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
work_dirs/gr00t_custom \
--cfg-options runner.max_epochs=50 runner.learning_rate=1e-5