GR00T-N1.5 Model Training#
This tutorial provides a comprehensive guide on how to fine-tune the GR00T-N1.5 model using the FluxVLA framework. The content covers two typical scenarios:
LIBERO Simulation Data Training – using the ready-to-use LIBERO benchmark dataset for rapid validation (see the LIBERO Simulation Data Training section).
Private Real-Robot Data Training – using self-collected robot data for customized fine-tuning (see Real-Robot Data Preparation).
Model Architecture#
GR00T-N1.5 in FluxVLA consists of the following three core modules:
| Module | Type | Function |
|---|---|---|
| VLA Container | LlavaVLA | Model wrapper responsible for loading pretrained weights and coordinating submodules |
| Vision-Language Backbone | EagleBackbone | Eagle2 multimodal model for image feature extraction and language understanding |
| Action Head | FlowMatchingHead | Flow Matching-based action generator that maps vision-language features to robot action sequences |
The model maps pretrained checkpoint weights to the FluxVLA module structure via name_mapping:
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
}
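The snippet below is an illustrative sketch (not the actual FluxVLA loading code) of what such a prefix mapping does: checkpoint parameter names that start with a mapped prefix are renamed so they line up with the FluxVLA module hierarchy.
def remap_state_dict(checkpoint_state_dict, name_mapping):
    """Rename checkpoint keys so they match the FluxVLA module structure."""
    remapped = {}
    for ckpt_key, tensor in checkpoint_state_dict.items():
        new_key = ckpt_key
        for fluxvla_prefix, ckpt_prefix in name_mapping.items():
            if ckpt_key.startswith(ckpt_prefix):
                new_key = fluxvla_prefix + ckpt_key[len(ckpt_prefix):]
                break
        remapped[new_key] = tensor
    return remapped

# e.g. 'backbone.eagle_model.layer.0.weight' -> 'vlm_backbone.vlm.layer.0.weight'
#      'action_head.mlp.weight'              -> 'vla_head.mlp.weight'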
Scenario 2: Private Real-Robot Data Training#
This section uses the Aloha dual-arm robot as an example to demonstrate how to train GR00T-N1.5 with self-collected data and deploy it on a real robot. The workflow for other robots such as UR3 single-arm is similar; only the dimension and camera parameters need to be adjusted.
1. Data Preparation#
1.1 Data Format Requirements#
Private data must be converted to the LeRobot v2.1 format. If the original data is in HDF5 format, the data conversion script can be used for conversion (see Real-Robot Data Preparation).
Directory structure after conversion:
your_dataset_path/
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── ...
├── videos/
│   └── chunk-000/
│       ├── observation.images.cam_high/
│       │   ├── episode_000000.mp4
│       │   └── ...
│       ├── observation.images.cam_left_wrist/
│       │   └── ...
│       └── observation.images.cam_right_wrist/
│           └── ...
└── meta/
    ├── episodes.jsonl
    ├── episodes_stats.jsonl
    ├── info.json
    └── tasks.jsonl
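As a quick sanity check after conversion, the structure can be verified with a small script. The snippet below is a sketch that assumes a single chunk-000 directory and an illustrative dataset path:
from pathlib import Path

root = Path('/path/to/your_dataset_path')
episodes = sorted((root / 'data' / 'chunk-000').glob('episode_*.parquet'))
for cam_dir in sorted((root / 'videos' / 'chunk-000').iterdir()):
    videos = list(cam_dir.glob('episode_*.mp4'))
    # every camera directory should hold one mp4 per parquet episode
    print(f'{cam_dir.name}: {len(videos)} videos / {len(episodes)} episodes')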
1.2 Identifying Key Information#
Before starting the configuration, first determine the following data characteristics:
| Information | Aloha Dual-Arm Example | UR3 Single-Arm Example |
|---|---|---|
| Dataset Path | | |
| Camera Names | cam_high, cam_left_wrist, cam_right_wrist | |
| Number of Cameras | 3 | 2 |
| Action Dimension | 14 (left arm 7 + right arm 7) | 7 (6 joints + 1 gripper) |
| State Dimension | 14 | 7 |
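These values can usually be read directly from the converted dataset's metadata. The sketch below assumes the standard LeRobot v2.1 meta/info.json schema, where each feature records its shape (the exact action key name may vary between conversion scripts):
import json
from pathlib import Path

info = json.loads(Path('/path/to/your_dataset_path/meta/info.json').read_text())

camera_keys = [k for k in info['features'] if k.startswith('observation.images.')]
action_key = 'action' if 'action' in info['features'] else 'actions'

print('cameras:   ', camera_keys)                                        # 3 for Aloha, 2 for UR3
print('state_dim: ', info['features']['observation.state']['shape'][0])  # 14 or 7
print('action_dim:', info['features'][action_key]['shape'][0])           # 14 or 7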
2. Configuration File Details#
Using configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py as the reference example.
2.1 Model Configuration#
model = dict(
type='LlavaVLA',
pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
        state_dim=14,  # Aloha dual-arm state dimension
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_inference_timesteps=4,
        num_step=32,  # number of Flow Matching sampling steps
traj_length=10,
        action_dim=14),  # Aloha dual-arm action dimension
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
Differences from the LIBERO configuration:
| Parameter | LIBERO | Aloha | Description |
|---|---|---|---|
| state_dim | 8 | 14 | Robot state dimension |
| action_dim | 7 | 14 | Robot action dimension |
| num_step | Not set (default) | 32 | Number of Flow Matching sampling steps; more complex action spaces require more steps |
| traj_length | 10 | 10 | Predicted trajectory length; identical in both cases |
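To see how num_inference_timesteps, traj_length, and action_dim interact at inference time, the following is a generic Euler-style flow-matching sampler, shown only as a conceptual sketch (velocity_net stands in for the learned action head; this is not the FluxVLA FlowMatchingHead implementation):
import torch

def sample_actions(velocity_net, condition, traj_length=10, action_dim=14,
                   num_inference_timesteps=4):
    # start the action chunk from Gaussian noise: (batch, traj_length, action_dim)
    actions = torch.randn(1, traj_length, action_dim)
    dt = 1.0 / num_inference_timesteps
    for i in range(num_inference_timesteps):
        t = torch.full((1,), i * dt)                    # current integration time in [0, 1)
        velocity = velocity_net(actions, t, condition)  # predicted velocity field
        actions = actions + dt * velocity               # Euler integration step
    return actions  # still normalized; denormalize before sending to the robot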
For inference, an additional inference_model can be defined with the same format as model, generally only requiring the addition of the dtype=None parameter:
inference_model = dict(
type='LlavaVLA',
pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
dtype=None,
vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=14,
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_step=32,
num_inference_timesteps=4,
traj_length=10,
action_dim=14),
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
2.2 Data Configuration#
train_dataloader = dict(
batch_size=128,
per_device_batch_size=8,
per_device_num_workers=4,
dataset=dict(
type='DistributedRepeatingDataset',
name_mappings={'observation.state': ['proprio', 'action']},
statistic_keys=[
'observation.state', 'observation.eepose', 'timestamp'
],
datasets=[
dict(
type='ParquetDataset',
data_root_path=[
'/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
'/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250616_20250630_02_4090',
            # ... more data batches can be added here
],
transforms=[
dict(
type='ProcessParquetInputs',
embodiment_id=30,
parquet_keys=[
'observation.state', 'observation.eepose',
'timestamp', 'actions', 'info', 'stats',
'action_masks'
],
video_keys=[
'observation.images.cam_high',
'observation.images.cam_left_wrist',
'observation.images.cam_right_wrist'
],
name_mappings={'observation.state': ['states']}),
dict(type='ParquetPrompter'),
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
)),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='NormalizeStatesAndActions',
action_dim=14,
state_key='proprio',
action_key='action',
use_quantiles=False)
],
action_window_size=32)
]))
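Because the config file is plain Python, the list of data batches does not have to be maintained by hand. A small sketch (adjust the glob pattern to your storage layout):
from glob import glob

# collect every converted batch under the dataset root in a stable order
data_root_path = sorted(glob(
    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/*_02_4090'))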
Key differences from the LIBERO data configuration:
| Configuration Item | LIBERO | Aloha Private Data | Adaptation Notes |
|---|---|---|---|
| data_root_path | Single path (string) | List of paths | Private data is typically stored in separate batches |
| name_mappings | state → proprio; action → action | state → [proprio, action] | In LIBERO, state and action are mapped separately; in private data, state serves as the reference for both proprio and action |
| embodiment_id | 31 | 30 (Aloha) | Different robot platforms use different IDs |
| video_keys | 2 cameras | 3 cameras | Based on the actual hardware configuration |
| num_images | 2 | 3 | Must match the number of video_keys |
| max_len | 600 | 900 | More cameras require a longer token sequence |
| action_window_size | 10 | 32 | Real-robot scenarios typically require a longer action window |
Note
The length of the means and stds lists in NormalizeImages must equal the number of cameras. When using the Eagle backbone, the default values shown above are applicable to all cameras.
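Since the config is plain Python, the per-camera lists can also be built programmatically so they always stay in sync with the camera count. A minimal sketch for the Aloha three-camera setup:
num_cameras = 3  # cam_high, cam_left_wrist, cam_right_wrist
per_camera_mean = [123.515625, 116.04492188, 103.59375]
per_camera_std = [58.27148438, 57.02636719, 57.27539062]

normalize_images = dict(
    type='NormalizeImages',
    means=[list(per_camera_mean) for _ in range(num_cameras)],
    stds=[list(per_camera_std) for _ in range(num_cameras)])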
2.3 Training Configuration#
runner = dict(
type='FSDPTrainRunner',
    max_epochs=3,  # with large datasets, fewer epochs are sufficient
learning_rate=2e-5,
weight_decay=0.0,
max_grad_norm=1.0,
sampler=None,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
),
collator=dict(
type='DictCollator',
keys=[
'states', 'observation.eepose', 'timestamp', 'images',
'img_masks', 'lang_tokens', 'lang_masks', 'actions',
'action_masks', 'embodiment_ids'
],
meta_keys=['task_description', 'prompt', 'info', 'stats']),
metric=dict(
type='VLAMetric',
active_trackers=('jsonl', 'wandb'),
run_dir='work_dirs',
wandb_project='FluxVLA',
wandb_entity='limx',
grad_accumulation_steps=1,
window_size=1),
lr_scheduler_type='constant',
warmup_ratio=0.0,
enable_gradient_checkpointing=True,
enable_mixed_precision_training=True,
mixed_precision_dtype='bf16',
sharding_strategy='full-shard',
change_key_name=False)
Recommended number of training epochs:
| Data Scale | Recommended Epochs | Description |
|---|---|---|
| < 100 episodes | 10 ~ 24 | Small data volume requires more epochs to fit |
| 100 ~ 1000 episodes | 3 ~ 6 | Medium scale, typical real-robot collection volume |
| > 1000 episodes | 1 ~ 3 | Large-scale data requires only a few epochs |
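A rough way to reason about these recommendations is to estimate how many optimizer steps an epoch provides; all numbers below are assumptions to adjust for your own dataset:
num_episodes = 500          # e.g. a typical real-robot collection
frames_per_episode = 400    # assumed average episode length
global_batch_size = 128     # train_dataloader.batch_size
epochs = 4

steps_per_epoch = num_episodes * frames_per_episode // global_batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # -> 1562 steps per epoch, 6248 in total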
2.4 Inference Configuration#
inference = dict(
type='AlohaInferenceRunner',
seed=7,
task_descriptions={
'1': 'pick up the yellow chicken toy with left arm',
'2': 'place it in the brown flat cardboard box with right arm',
        # ... add more entries for your actual tasks
},
dataset=dict(
type='PrivateInferenceDataset',
embodiment_id=30,
img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
transforms=[
dict(type='PrivatePrompter'),
dict(
type='ProcessPromptsWithImage',
max_len=900,
num_images=3,
tokenizer=dict(type='PretrainedTokenizer')),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='NormalizeStatesAndActions',
action_dim=14,
state_key='proprio',
action_key='action',
use_quantiles=False)
]),
denormalize_action=dict(
type='DenormalizePrivateAction',
use_quantiles=False),
operator=dict(
type='AlohaOperator',
img_front_topic='/camera_f/color/image_raw',
img_left_topic='/camera_l/color/image_raw',
img_right_topic='/camera_r/color/image_raw',
img_front_depth_topic='/camera_f/depth/image_raw',
img_left_depth_topic='/camera_l/depth/image_raw',
img_right_depth_topic='/camera_r/depth/image_raw',
puppet_arm_left_cmd_topic='/master/joint_left',
puppet_arm_right_cmd_topic='/master/joint_right',
puppet_arm_left_topic='/puppet/joint_left',
puppet_arm_right_topic='/puppet/joint_right',
robot_base_topic='/odom_raw',
robot_base_cmd_topic='/cmd_vel'))
Key points of the inference configuration:
task_descriptions – keys are task IDs (strings) and values are natural-language task descriptions; tasks are selected via keyboard during inference
PrivateInferenceDataset – dataset type for real-robot inference; loads dataset_statistics.json for denormalization
PrivatePrompter – replaces the ParquetPrompter used during training; constructs prompts from the inference inputs
DenormalizePrivateAction – converts normalized model output actions back to real robot actions
operator – defines the ROS topic communication interface; must be modified to match the actual hardware
Warning
The data preprocessing parameters during inference (normalization method, image size, action_dim, etc.) must be consistent with those used during training; otherwise, inference performance will degrade significantly.
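A simple way to guard against such mismatches is to compare the shared preprocessing parameters of the training and inference transform lists in the same config. The sketch below assumes the config is a standalone Python file whose top-level dicts can be loaded with runpy:
import runpy

cfg = runpy.run_path('configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py')

def by_type(transform_list):
    return {t['type']: t for t in transform_list}

train_tf = by_type(cfg['train_dataloader']['dataset']['datasets'][0]['transforms'])
infer_tf = by_type(cfg['inference']['dataset']['transforms'])

for name in ('ResizeImages', 'NormalizeImages', 'NormalizeStatesAndActions',
             'ProcessPromptsWithImage'):
    for key, value in train_tf[name].items():
        if key == 'tokenizer' or key not in infer_tf[name]:
            continue  # tokenizer paths may legitimately differ between stages
        assert infer_tf[name][key] == value, f'{name}.{key}: {value} != {infer_tf[name][key]}'
print('training and inference preprocessing parameters are consistent')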
3. Launching Training#
cd /path/to/fluxvla
# Single node, 8 GPU training
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train
Multi-node training example (2 nodes ร 8 GPUs):
Node 0 (master node):
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train
Node 1:
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=1
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train
4. Real-Robot Deployment#
Use the trained checkpoint for inference:
python scripts/inference_real_robot.py \
--config configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
--ckpt-path work_dirs/gr00t_eagle_3b_aloha_4090_full_train/checkpoint_epoch_3.pt
Adapting to Other Robots: UR3 Single-Arm Example#
Using configs/gr00t/gr00t_eagle_3b_ur3_full_train.py as reference, the key differences between the UR3 and Aloha configurations are as follows:
| Configuration Item | Aloha Dual-Arm | UR3 Single-Arm | Description |
|---|---|---|---|
| state_dim / action_dim | 14 | 7 | UR3 has only one manipulator arm |
| embodiment_id | 30 | 31 | Identifier for different robot platforms |
| video_keys | 3 cameras | 2 cameras | UR3 typically uses head + wrist cameras |
| num_images | 3 | 2 | Must match the number of cameras |
| max_len | 900 | 600 | Fewer cameras require a shorter token sequence |
| inference.type | AlohaInferenceRunner | | Different inference runners |
| operator | AlohaOperator | | Different hardware communication interfaces |
When adapting to a new robot, create a new configuration file and adjust the parameters listed above. If the existing inference runners and operators do not meet the requirements, refer to the private_engine tutorial for how to extend them.
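For orientation, the sketch below collects the deltas from the table above into one place; the camera key names are assumptions for a head + wrist setup, and the shipped configs/gr00t/gr00t_eagle_3b_ur3_full_train.py remains the authoritative reference:
ur3_overrides = dict(
    state_dim=7,        # single 6-joint arm + gripper
    action_dim=7,
    embodiment_id=31,
    num_images=2,       # must match the number of video_keys
    max_len=600,        # fewer image tokens, shorter prompt budget
    video_keys=[        # hypothetical key names; check your converted dataset
        'observation.images.cam_high',
        'observation.images.cam_wrist',
    ])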
Available Configuration Summary#
| Configuration File | Data | Backbone | Robot |
|---|---|---|---|
| gr00t_eagle_3b_libero_10_full_train.py | LIBERO-10 | Eagle | Simulation (LIBERO) |
| | LIBERO-10 | PaliGemma | Simulation (LIBERO) |
| gr00t_eagle_3b_aloha_4090_full_train.py | Aloha private data | Eagle | Aloha dual-arm |
| | Aloha towel folding (3 cameras) | Eagle | Aloha dual-arm |
| | Aloha towel folding (4 cameras) | Eagle | Aloha dual-arm |
| gr00t_eagle_3b_ur3_full_train.py | UR3 private data | Eagle | UR3 single-arm |
Configuring Weights & Biases#
Training loss curves and metrics can be tracked and visualized through Weights & Biases.
# Install
pip install wandb
# Log in
wandb login
# Set project information
export WANDB_PROJECT=FluxVLA
export WANDB_ENTITY=your-team-name
export WANDB_MODE=online
# To disable tracking
export WANDB_MODE=disabled
Frequently Asked Questions#
Q: Can LIBERO training and private data training be combined?
Yes. Simply configure both LIBERO and private data ParquetDataset entries in the datasets list. However, note that the embodiment_id, video_keys, action_dim, and other parameters may differ between the two, and their respective transforms must be configured separately.
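A sketch of such a mixed setup is shown below; the LIBERO path is a placeholder, and the two transform lists should be copied from the respective single-source configs (gr00t_eagle_3b_libero_10_full_train.py and gr00t_eagle_3b_aloha_4090_full_train.py):
libero_transforms = [...]   # embodiment_id=31, 2 cameras, action_dim=7
aloha_transforms = [...]    # embodiment_id=30, 3 cameras, action_dim=14

datasets = [
    dict(
        type='ParquetDataset',
        data_root_path='/path/to/data/libero_10_lerobot',   # placeholder path
        transforms=libero_transforms,
        action_window_size=10),
    dict(
        type='ParquetDataset',
        data_root_path=[
            '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
        ],
        transforms=aloha_transforms,
        action_window_size=32),
]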
Q: How can training be resumed from an interruption?
Use the --resume-from flag to specify the checkpoint path:
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_aloha_4090_full_train.py \
work_dirs/gr00t_eagle_3b_aloha_4090_full_train \
--resume-from work_dirs/gr00t_eagle_3b_aloha_4090_full_train/checkpoint_epoch_2.pt
Q: What can be done if GPU memory is insufficient?
The following approaches can be attempted:
Reduce per_device_batch_size (e.g., from 8 to 4 or 2)
Enable enable_gradient_checkpointing=True (trades computation time for memory)
Adjust the FSDP sharding strategy (sharding_strategy='full-shard')
Q: What are the differences between GR00T-N1.5 and π₀?
Both share the same VLA head (FlowMatchingHead) in FluxVLA. The primary difference lies in the vision-language backbone: GR00T-N1.5 uses EagleBackbone (the Eagle2 model), while π₀ uses a PaliGemma backbone. The Eagle backbone generally performs better in multi-image scenarios.
Q: How can configuration file parameters be overridden?
Parameters can be dynamically overridden via the command line using --cfg-options:
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
work_dirs/gr00t_custom \
--cfg-options runner.max_epochs=50 runner.learning_rate=1e-5