GR00T-N1.5 Model Training#

This tutorial provides a comprehensive guide on how to fine-tune the GR00T-N1.5 model using the FluxVLA framework. The content covers two typical scenarios:

  1. LIBERO Simulation Data Training: uses the ready-to-use LIBERO benchmark dataset for rapid validation (see LIBERO Simulation Data Training).

  2. Private Real-Robot Data Training: uses self-collected robot data for customized fine-tuning (see Real-Robot Data Preparation).

Model Architecture#

GR00T-N1.5 in FluxVLA consists of the following three core modules:

| Module | Type | Function |
|---|---|---|
| VLA Container | LlavaVLA | Model wrapper responsible for loading pretrained weights and coordinating submodules |
| Vision-Language Backbone | EagleBackbone | Eagle2 multimodal model for image feature extraction and language understanding |
| Action Head | FlowMatchingHead | Flow Matching-based action generator that maps vision-language features to robot action sequences |
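
The three modules form a single forward path: the backbone fuses the camera images and the language instruction into a feature sequence, and the Flow Matching head conditions on those features together with the robot state to produce an action chunk. The following is a purely conceptual sketch of that data flow; the attribute and method names are illustrative, not the actual FluxVLA API.

# Conceptual data flow through the three modules (illustrative names only,
# not the real FluxVLA forward signature).
def conceptual_forward(model, images, lang_tokens, state):
    # 1. Eagle2 backbone: encode images and language into a shared feature sequence
    vl_features = model.backbone(images, lang_tokens)          # e.g. (B, num_tokens, 1536)
    # 2. Flow Matching head: generate an action trajectory conditioned on the
    #    vision-language features and the proprioceptive state
    actions = model.action_head.generate(vl_features, state)   # (B, traj_length, action_dim)
    return actions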

The model maps pretrained checkpoint weights to the FluxVLA module structure via name_mapping:

name_mapping={
    'vlm_backbone.vlm': 'backbone.eagle_model',
    'vla_head': 'action_head'
}
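
Conceptually this is a prefix substitution over parameter names applied when the pretrained checkpoint is loaded. The sketch below illustrates the idea; it assumes the dictionary values are prefixes used in the original GR00T-N1.5 checkpoint and the keys are the corresponding FluxVLA submodule paths (illustrative pseudocode, not the FluxVLA loading routine).

# Illustrative only: translate checkpoint parameter names such as
# 'backbone.eagle_model.xxx' into FluxVLA names such as 'vlm_backbone.vlm.xxx'
# by prefix substitution. The mapping direction is an assumption.
name_mapping = {
    'vlm_backbone.vlm': 'backbone.eagle_model',
    'vla_head': 'action_head',
}

def remap_checkpoint_keys(checkpoint_state_dict, name_mapping):
    remapped = {}
    for key, tensor in checkpoint_state_dict.items():
        new_key = key
        for fluxvla_prefix, ckpt_prefix in name_mapping.items():
            if key.startswith(ckpt_prefix):
                new_key = fluxvla_prefix + key[len(ckpt_prefix):]
                break
        remapped[new_key] = tensor
    return remapped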

Scenario 2: Private Real-Robot Data Training#

This section uses the Aloha dual-arm robot as an example to demonstrate how to train GR00T-N1.5 on self-collected data and deploy it on a real robot. The workflow for other robots, such as the single-arm UR3, is similar; only the dimension and camera parameters need to be adjusted.

1. Data Preparation#

1.1 Data Format Requirements#

Private data must be converted to the LeRobot v2.1 format. If the original data is in HDF5 format, it can be converted with the provided data conversion script (see Real-Robot Data Preparation).

Directory structure after conversion:

your_dataset_path/
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── ...
├── videos/
│   └── chunk-000/
│       ├── observation.images.cam_high/
│       │   ├── episode_000000.mp4
│       │   └── ...
│       ├── observation.images.cam_left_wrist/
│       │   └── ...
│       └── observation.images.cam_right_wrist/
│           └── ...
└── meta/
    ├── episodes.jsonl
    ├── episodes_stats.jsonl
    ├── info.json
    └── tasks.jsonl
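
Before starting any configuration work, it can help to sanity-check that the converted dataset actually matches this layout. The following standalone script (not part of FluxVLA) counts parquet episodes, lists per-camera videos, and verifies that the metadata files exist:

import sys
from pathlib import Path

def check_lerobot_layout(root: str) -> None:
    """Sanity-check a converted LeRobot v2.1 dataset directory."""
    root = Path(root)
    parquets = sorted((root / 'data').rglob('episode_*.parquet'))
    print(f'parquet episodes: {len(parquets)}')

    # one sub-directory per camera inside each chunk
    for chunk_dir in sorted((root / 'videos').glob('chunk-*')):
        for cam_dir in sorted(p for p in chunk_dir.iterdir() if p.is_dir()):
            videos = list(cam_dir.glob('episode_*.mp4'))
            print(f'{chunk_dir.name}/{cam_dir.name}: {len(videos)} videos')

    for meta_file in ('episodes.jsonl', 'episodes_stats.jsonl', 'info.json', 'tasks.jsonl'):
        status = 'ok' if (root / 'meta' / meta_file).exists() else 'MISSING'
        print(f'meta/{meta_file}: {status}')

if __name__ == '__main__':
    check_lerobot_layout(sys.argv[1])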

1.2 Identifying Key Information#

Before starting the configuration, first determine the following data characteristics:

| Information | Aloha Dual-Arm Example | UR3 Single-Arm Example |
|---|---|---|
| Dataset Path | RealRobot_AgileX_aloha_lerobot_v2/... | RealRobot_UR3_lerobot_v2/... |
| Camera Names | cam_high, cam_left_wrist, cam_right_wrist | cam_high, cam_wrist |
| Number of Cameras | 3 | 2 |
| Action Dimension | 14 (left arm 7 + right arm 7) | 7 (6 joints + 1 gripper) |
| State Dimension | 14 | 7 |
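
Most of these values can be read directly from the converted dataset's meta/info.json, whose features section records the shape of every observation and action key. The snippet below assumes the standard LeRobot v2.1 field names; adjust the keys if your conversion script writes them differently.

import json
from pathlib import Path

def summarize_dataset(root: str) -> None:
    """Print camera names and state/action dimensions from meta/info.json."""
    info = json.loads((Path(root) / 'meta' / 'info.json').read_text())
    features = info['features']

    cameras = [k for k in features if k.startswith('observation.images.')]
    state_dim = features['observation.state']['shape'][0]
    # the action key may be named 'action' or 'actions' depending on the conversion
    action_key = 'action' if 'action' in features else 'actions'
    action_dim = features[action_key]['shape'][0]

    print(f'cameras    : {cameras} ({len(cameras)} total)')
    print(f'state_dim  : {state_dim}')
    print(f'action_dim : {action_dim}')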

2. Configuration File Details#

This section uses configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py as the reference example.

2.1 Model Configuration#

model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=14,             # Aloha dual-arm state dimension
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_inference_timesteps=4,
        num_step=32,              # Flow Matching sampling steps
        traj_length=10,
        action_dim=14),           # Aloha dual-arm action dimension
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

Differences from the LIBERO configuration:

| Parameter | LIBERO | Aloha | Description |
|---|---|---|---|
| state_dim | 8 | 14 | Robot state dimension |
| action_dim | 7 | 14 | Robot action dimension |
| num_step | Not set (default) | 32 | Number of Flow Matching sampling steps; more complex action spaces require more steps |
| traj_length | 10 | 10 | Predicted trajectory length; identical in both cases |

For inference, an additional inference_model can be defined with the same structure as model; in most cases the only change is adding dtype=None to the backbone configuration:

inference_model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        dtype=None,
        vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=14,
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_step=32,
        num_inference_timesteps=4,
        traj_length=10,
        action_dim=14),
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

2.2 Data Configuration#

train_dataloader = dict(
    batch_size=128,
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        statistic_keys=[
            'observation.state', 'observation.eepose', 'timestamp'
        ],
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=[
                    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
                    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250616_20250630_02_4090',
                    # ... more data batches can be added here
                ],
                transforms=[
                    dict(
                        type='ProcessParquetInputs',
                        embodiment_id=30,
                        parquet_keys=[
                            'observation.state', 'observation.eepose',
                            'timestamp', 'actions', 'info', 'stats',
                            'action_masks'
                        ],
                        video_keys=[
                            'observation.images.cam_high',
                            'observation.images.cam_left_wrist',
                            'observation.images.cam_right_wrist'
                        ],
                        name_mappings={'observation.state': ['states']}),
                    dict(type='ParquetPrompter'),
                    dict(
                        type='ProcessPromptsWithImage',
                        max_len=900,
                        num_images=3,
                        tokenizer=dict(
                            type='PretrainedTokenizer',
                            model_path='/path/to/models/eagle2_hg_model',
                        )),
                    dict(type='ResizeImages', height=224, width=224),
                    dict(
                        type='NormalizeImages',
                        means=[[123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375]],
                        stds=[[58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062]]),
                    dict(
                        type='NormalizeStatesAndActions',
                        action_dim=14,
                        state_key='proprio',
                        action_key='action',
                        use_quantiles=False)
                ],
                action_window_size=32)
        ]))
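
After this transform pipeline, each sample is collated into a batch whose keys match the collator configuration in Section 2.3. For orientation, a single Aloha sample roughly looks as follows; the shapes are inferred from the config above and should be treated as assumptions rather than guarantees.

# Rough shape of one Aloha training sample after the transforms above
# (shapes inferred from the config; treat them as assumptions).
sample = {
    'states':         ...,  # float32, (14,) - observation.state mapped to proprio
    'images':         ...,  # float32, (3, 3, 224, 224) - 3 cameras, resized and normalized
    'img_masks':      ...,  # bool, (3,) - which camera slots are valid
    'lang_tokens':    ...,  # int64, up to max_len=900 tokens for the prompt
    'lang_masks':     ...,  # bool, same length as lang_tokens
    'actions':        ...,  # float32, (32, 14) - action_window_size x action_dim
    'action_masks':   ...,  # bool, (32,) - marks padded steps near episode ends
    'embodiment_ids': 30,   # Aloha embodiment id
}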

Key differences from the LIBERO data configuration:

| Configuration Item | LIBERO | Aloha Private Data | Adaptation Notes |
|---|---|---|---|
| data_root_path | Single path (string) | List of paths | Private data is typically stored in separate batches |
| name_mappings | state → proprio; action → action | state → [proprio, action] | In LIBERO, state and action are mapped separately; in private data, state serves as the reference for both proprio and action |
| embodiment_id | 31 | 30 (Aloha) | Different robot platforms use different IDs |
| video_keys | 2 cameras | 3 cameras | Based on actual hardware configuration |
| num_images | 2 | 3 | Must match the number of video_keys |
| max_len | 600 | 900 | More cameras require a longer token sequence |
| action_window_size | 10 | 32 | Real-robot scenarios typically require a longer action window |

Note

The length of the means and stds lists in NormalizeImages must equal the number of cameras. When using the Eagle backbone, the default values shown above are applicable to all cameras.
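
These means and stds are per-channel RGB statistics on a 0-255 pixel scale; after ResizeImages, each camera frame is normalized channel-wise with its own set of statistics. A minimal numpy sketch of the equivalent operation (illustrative, not the NormalizeImages implementation):

import numpy as np

# Channel-wise normalization of a 0-255 RGB frame, one set of statistics per camera.
MEAN = np.array([123.515625, 116.04492188, 103.59375], dtype=np.float32)
STD = np.array([58.27148438, 57.02636719, 57.27539062], dtype=np.float32)

def normalize_image(img_hwc_uint8: np.ndarray) -> np.ndarray:
    """img_hwc_uint8: (224, 224, 3) uint8 RGB frame, already resized."""
    img = img_hwc_uint8.astype(np.float32)
    return (img - MEAN) / STD  # broadcasts over height and width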

2.3 Training Configuration#

runner = dict(
    type='FSDPTrainRunner',
    max_epochs=3,                     # fewer epochs suffice when the real-robot data volume is large
    learning_rate=2e-5,
    weight_decay=0.0,
    max_grad_norm=1.0,
    sampler=None,
    tokenizer=dict(
        type='PretrainedTokenizer',
        model_path='/path/to/models/eagle2_hg_model',
    ),
    collator=dict(
        type='DictCollator',
        keys=[
            'states', 'observation.eepose', 'timestamp', 'images',
            'img_masks', 'lang_tokens', 'lang_masks', 'actions',
            'action_masks', 'embodiment_ids'
        ],
        meta_keys=['task_description', 'prompt', 'info', 'stats']),
    metric=dict(
        type='VLAMetric',
        active_trackers=('jsonl', 'wandb'),
        run_dir='work_dirs',
        wandb_project='FluxVLA',
        wandb_entity='limx',
        grad_accumulation_steps=1,
        window_size=1),
    lr_scheduler_type='constant',
    warmup_ratio=0.0,
    enable_gradient_checkpointing=True,
    enable_mixed_precision_training=True,
    mixed_precision_dtype='bf16',
    sharding_strategy='full-shard',
    change_key_name=False)
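
The global batch_size, per_device_batch_size, and GPU count together determine how many gradient-accumulation steps are needed per optimizer update. A quick check with the values from this config, assuming the runner derives accumulation as batch_size / (per_device_batch_size × num_gpus); the exact bookkeeping is internal to FSDPTrainRunner:

# Batch-size bookkeeping for the Aloha config on a single 8-GPU node
# (assumed relation; the runner's internal accounting may differ).
batch_size = 128            # global batch size from train_dataloader
per_device_batch_size = 8
num_gpus = 8                # MLP_WORKER_GPU=8, single node

samples_per_step = per_device_batch_size * num_gpus       # 64 samples per forward/backward pass
grad_accumulation_steps = batch_size // samples_per_step  # 2 accumulation steps per optimizer update
print(samples_per_step, grad_accumulation_steps)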

Recommended number of training epochs:

| Data Scale | Recommended max_epochs | Description |
|---|---|---|
| < 100 episodes | 10 ~ 24 | Small data volumes require more epochs to fit |
| 100 ~ 1000 episodes | 3 ~ 6 | Medium scale, typical real-robot collection volume |
| > 1000 episodes | 1 ~ 3 | Large-scale data requires only a few epochs |

2.4 Inference Configuration#

inference = dict(
    type='AlohaInferenceRunner',
    seed=7,
    task_descriptions={
        '1': 'pick up the yellow chicken toy with left arm',
        '2': 'place it in the brown flat cardboard box with right arm',
        # ... add more entries according to the actual tasks
    },
    dataset=dict(
        type='PrivateInferenceDataset',
        embodiment_id=30,
        img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
        transforms=[
            dict(type='PrivatePrompter'),
            dict(
                type='ProcessPromptsWithImage',
                max_len=900,
                num_images=3,
                tokenizer=dict(type='PretrainedTokenizer')),
            dict(type='ResizeImages', height=224, width=224),
            dict(
                type='NormalizeImages',
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]]),
            dict(
                type='NormalizeStatesAndActions',
                action_dim=14,
                state_key='proprio',
                action_key='action',
                use_quantiles=False)
        ]),
    denormalize_action=dict(
        type='DenormalizePrivateAction',
        use_quantiles=False),
    operator=dict(
        type='AlohaOperator',
        img_front_topic='/camera_f/color/image_raw',
        img_left_topic='/camera_l/color/image_raw',
        img_right_topic='/camera_r/color/image_raw',
        img_front_depth_topic='/camera_f/depth/image_raw',
        img_left_depth_topic='/camera_l/depth/image_raw',
        img_right_depth_topic='/camera_r/depth/image_raw',
        puppet_arm_left_cmd_topic='/master/joint_left',
        puppet_arm_right_cmd_topic='/master/joint_right',
        puppet_arm_left_topic='/puppet/joint_left',
        puppet_arm_right_topic='/puppet/joint_right',
        robot_base_topic='/odom_raw',
        robot_base_cmd_topic='/cmd_vel'))

Key points of the inference configuration:

  • task_descriptions – Keys are task IDs (strings), values are natural language descriptions. Tasks are selected via keyboard input during inference

  • PrivateInferenceDataset – Dataset type for real-robot inference; loads dataset_statistics.json for denormalization

  • PrivatePrompter – Replaces the ParquetPrompter used during training; constructs prompts from the inference inputs

  • DenormalizePrivateAction – Converts normalized model output actions to real robot actions (see the sketch after this list)

  • operator – Defines the ROS topic communication interface; must be modified according to the actual hardware
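
With use_quantiles=False, normalization is a symmetric mean/std transform: actions are normalized as (a - mean) / std during training, so the model's outputs must be mapped back with a = a_norm * std + mean using the statistics stored in dataset_statistics.json. A minimal sketch of that inverse step (the JSON field names are assumptions; check your statistics file for the actual schema):

import json
import numpy as np

def denormalize_actions(norm_actions: np.ndarray, stats_path: str, key: str = 'actions') -> np.ndarray:
    """Map normalized model outputs back to real robot actions (illustrative)."""
    with open(stats_path) as f:
        stats = json.load(f)[key]
    mean = np.asarray(stats['mean'], dtype=np.float32)  # shape (action_dim,), e.g. (14,) for Aloha
    std = np.asarray(stats['std'], dtype=np.float32)
    return norm_actions * std + mean                    # inverse of (a - mean) / std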

Warning

The data preprocessing parameters during inference (normalization method, image size, action_dim, etc.) must be consistent with those used during training; otherwise, inference performance will degrade significantly.

3. Launching Training#

cd /path/to/fluxvla

# Single node, 8 GPU training
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
    work_dirs/gr00t_eagle_3b_aloha_4090_full_train

Multi-node training example (2 nodes × 8 GPUs):

Node 0 (master node):

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
    work_dirs/gr00t_eagle_3b_aloha_4090_full_train

Node 1:

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=1
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
    work_dirs/gr00t_eagle_3b_aloha_4090_full_train
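
The MLP_* variables describe the cluster topology. How scripts/train.sh consumes them is internal to the launcher, but they typically map onto the usual torch.distributed quantities as sketched below (illustrative arithmetic, not the launcher's actual code):

import os

# Typical interpretation of the MLP_* variables (assumed; the launch script
# may derive these quantities differently).
gpus_per_node = int(os.environ['MLP_WORKER_GPU'])  # 8
num_nodes = int(os.environ['MLP_WORKER_NUM'])      # 2 in the two-node example
node_rank = int(os.environ['MLP_ROLE_INDEX'])      # 0 on the master node, 1 on the second
master_addr = os.environ['MLP_WORKER_0_HOST']
master_port = int(os.environ['MLP_WORKER_0_PORT'])

world_size = gpus_per_node * num_nodes             # 16 training processes in total
first_rank_on_node = node_rank * gpus_per_node     # global rank of this node's local rank 0
print(world_size, first_rank_on_node, master_addr, master_port)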

4. Real-Robot Deployment#

Use the trained checkpoint for inference:

python scripts/inference_real_robot.py \
    --config configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
    --ckpt-path work_dirs/gr00t_eagle_3b_aloha_4090_full_train/checkpoint_epoch_3.pt

Adapting to Other Robots: UR3 Single-Arm Example#

Using configs/gr00t/gr00t_eagle_3b_ur3_full_train.py as reference, the key differences between the UR3 and Aloha configurations are as follows:

| Configuration Item | Aloha Dual-Arm | UR3 Single-Arm | Description |
|---|---|---|---|
| state_dim / action_dim | 14 | 7 | UR3 has only one manipulator arm |
| embodiment_id | 30 | 31 | Identifier for different robot platforms |
| video_keys | 3 cameras | 2 cameras (cam_high, cam_wrist) | UR3 typically uses head + wrist cameras |
| num_images | 3 | 2 | Must match the number of cameras |
| max_len | 900 | 600 | Fewer cameras require a shorter token sequence |
| inference.type | AlohaInferenceRunner | URInferenceRunner | Different inference runners |
| operator.type | AlohaOperator | UROperator | Different hardware communication interfaces |

When adapting to a new robot, simply create a new configuration file and adjust the parameters listed above. If the existing inference runners and operators do not meet the requirements, refer to the ../tutorials/private_engine tutorial for extension.

Available Configuration Summary#

| Configuration File | Data | Backbone | Robot |
|---|---|---|---|
| gr00t_eagle_3b_libero_10_full_train.py | LIBERO-10 | Eagle | Simulation (LIBERO) |
| gr00t_paligemma_3b_libero_10_full_train.py | LIBERO-10 | PaliGemma | Simulation (LIBERO) |
| gr00t_eagle_3b_aloha_4090_full_train.py | Aloha private data | Eagle | Aloha dual-arm |
| gr00t_eagle_3b_aloha_fold_towel_3cam_4090_full_train.py | Aloha towel folding (3 cameras) | Eagle | Aloha dual-arm |
| gr00t_eagle_3b_aloha_fold_towel_4cam_4090_full_train.py | Aloha towel folding (4 cameras) | Eagle | Aloha dual-arm |
| gr00t_eagle_3b_ur3_full_train.py | UR3 private data | Eagle | UR3 single-arm |

Configuring Weights & Biases#

Training loss curves and metrics can be tracked and visualized through Weights & Biases.

# Install
pip install wandb

# Log in
wandb login

# Set project information
export WANDB_PROJECT=FluxVLA
export WANDB_ENTITY=your-team-name
export WANDB_MODE=online

# To disable tracking
export WANDB_MODE=disabled

Frequently Asked Questions#

Q: Can LIBERO training and private data training be combined?

Yes. Simply configure both LIBERO and private data ParquetDataset entries in the datasets list. However, note that the embodiment_id, video_keys, action_dim, and other parameters may differ between the two, and their respective transforms must be configured separately.
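
A minimal sketch of how the mixed datasets list might look; the paths and the elided transform lists are placeholders, and each entry keeps its own embodiment_id, camera keys, and action window:

# Illustrative sketch of mixing LIBERO and private Aloha data in one dataloader.
# Paths are placeholders and the transform lists are elided; each ParquetDataset
# keeps its own embodiment_id, video_keys, action_dim, and action_window_size.
train_dataloader = dict(
    batch_size=128,
    per_device_batch_size=8,
    dataset=dict(
        type='DistributedRepeatingDataset',
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path='/path/to/data/libero_10_lerobot_v2',  # placeholder
                transforms=[...],  # LIBERO pipeline: embodiment_id=31, 2 cameras, action_dim=7
                action_window_size=10),
            dict(
                type='ParquetDataset',
                data_root_path=['/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/...'],
                transforms=[...],  # Aloha pipeline: embodiment_id=30, 3 cameras, action_dim=14
                action_window_size=32),
        ]))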

Q: How can training be resumed from an interruption?

Use the --resume-from flag to specify the checkpoint path:

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
    work_dirs/gr00t_eagle_3b_aloha_4090_full_train \
    --resume-from work_dirs/gr00t_eagle_3b_aloha_4090_full_train/checkpoint_epoch_2.pt

Q: What can be done if GPU memory is insufficient?

The following approaches can be attempted:

  1. Reduce per_device_batch_size (e.g., from 8 to 4 or 2)

  2. Set enable_gradient_checkpointing=True (trades computation time for memory)

  3. Adjust the FSDP sharding strategy (sharding_strategy='full-shard')

Q: What are the differences between GR00T-N1.5 and π₀?

Both share the same VLA head (FlowMatchingHead) in FluxVLA. The primary difference lies in the vision-language backbone: GR00T-N1.5 uses EagleBackbone (the Eagle2 model), while π₀ uses a PaliGemma backbone. The Eagle backbone generally performs better in multi-image scenarios.

Q: How can configuration file parameters be overridden?

Parameters can be dynamically overridden via the command line using --cfg-options:

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
    work_dirs/gr00t_custom \
    --cfg-options runner.max_epochs=50 runner.learning_rate=1e-5