π₀.₅ Model Training#

This tutorial provides a comprehensive guide on how to fine-tune the π₀.₅ model using the FluxVLA framework. π₀.₅ is an improved version of π₀ (PI05FlowMatching inherits from PI0FlowMatching); the two share the same architectural design, with π₀.₅ incorporating further optimizations in action generation. The content covers two typical scenarios:

  1. LIBERO Simulation Data Training — Using the ready-to-use LIBERO benchmark dataset for rapid validation (see LIBERO Simulation Data Training, supporting both Full Finetune and LoRA)

  2. Private Real-Robot Data Training — Using self-collected robot data for customized fine-tuning (supporting both single-robot and multi-robot mixed training)

Model Architecture#

π₀.₅ in FluxVLA consists of the following core modules:

| Module | Type | Function |
| --- | --- | --- |
| VLA Container | PI05FlowMatching | Model wrapper that coordinates visual-language understanding and action generation (inherits from PI0FlowMatching) |
| Vision-Language Backbone | PaliGemma | PaliGemma 3B multimodal model for processing image and language inputs |
| Action Expert | GemmaLLMBackbone | Gemma 2B language model serving as the expert network for action generation |
| State/Action Projectors | LinearProjector | Maps robot states and actions to the model's internal dimensions |

Comparison with GR00T-N1.5:

| Feature | π₀.₅ | GR00T-N1.5 |
| --- | --- | --- |
| Model Type | PI05FlowMatching | LlavaVLA |
| Vision Backbone | PaliGemma (SigLIP + Gemma) | EagleBackbone (Eagle2) |
| Action Generation | Independent Gemma Expert + LinearProjector | FlowMatchingHead |
| Text Processing | ProcessPrompts | ProcessPromptsWithImage |
| Pretrained Weights | pi0/pi0.5 checkpoint (.safetensors) | GR00T-N1.5-3B |

Scenario 2: Private Real-Robot Data Training#

This section demonstrates how to train the π₀ model using private data. π₀ provides configuration files for Aloha dual-arm, UR3 single-arm, and Aloha + UR3 mixed training.

1. Data Preparation#

Private data must be converted to the LeRobot v2.1 format (see Real-Robot Data Preparation).

2. Aloha Dual-Arm Training#

Using configuration file configs/pi0/pi0_paligemma_aloha_full_train.py.

2.1 Model Configuration Differences#

Compared with the LIBERO configuration, the Aloha setup requires adjustments to the dimension parameters:

model = dict(
    type='PI0FlowMatching',
    # ... PaliGemma and Gemma Expert configs unchanged ...
    proj_width=1024,
    n_action_steps=32,                # Aloha uses a longer action sequence
    state_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_in_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_out_proj=dict(type='LinearProjector', in_dim=1024, out_dim=14),
    max_action_dim=14,                # maximum dual-arm action dimension
    # ...
    freeze_vlm_backbone=True,         # freeze the backbone; train only action-related layers
    pretrained_name_or_path=
        '/path/to/checkpoints/pi0/model.safetensors',
)

Dimension parameter comparison:

| Parameter | LIBERO | Aloha Dual-Arm | Description |
| --- | --- | --- | --- |
| state_proj.in_dim | 8 | 14 | 7-dimensional state per arm |
| action_in_proj.in_dim | 7 | 14 | 7-dimensional action per arm |
| action_out_proj.out_dim | 7 | 14 | Output dimension matches input |
| n_action_steps | 10 | 32 | Longer action sequences for real-robot scenarios |
| max_action_dim | Not set | 14 | Maximum action dimension limit |

Note

The Aloha configuration uses freeze_vlm_backbone=True to freeze the vision-language backbone, training only the action-related projection layers and the expert network. This is because when starting from the official pi0 checkpoint, the backbone has already been sufficiently pretrained.
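In PyTorch terms, freezing the backbone usually amounts to disabling gradients for its parameters so the optimizer updates only the remaining layers. A minimal sketch of this effect (the module names `vlm_backbone` and `action_out_proj` here are illustrative stand-ins, not the real FluxVLA classes):

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy model: a stand-in 'backbone' plus an action projection head."""

    def __init__(self):
        super().__init__()
        self.vlm_backbone = nn.Linear(16, 16)      # stand-in for PaliGemma
        self.action_out_proj = nn.Linear(16, 14)   # stand-in for LinearProjector

    def forward(self, x):
        return self.action_out_proj(self.vlm_backbone(x))

model = TinyVLA()

# Equivalent of freeze_vlm_backbone=True: stop gradients for backbone params.
for p in model.vlm_backbone.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the action projection remains trainable
```

Only the parameters with `requires_grad=True` should then be handed to the optimizer.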

2.2 Data Configuration#

train_dataloader = dict(
    batch_size=128,
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        statistic_keys=[
            'observation.state', 'observation.eepose', 'timestamp'
        ],
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=[
                    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
                    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250616_20250630_02_4090',
                    # ... more data batches can be added
                ],
                transforms=[
                    dict(
                        type='ProcessParquetInputs',
                        parquet_keys=[
                            'observation.state', 'observation.eepose',
                            'timestamp', 'actions', 'info', 'stats',
                            'action_masks'
                        ],
                        video_keys=[
                            'observation.images.cam_high',
                            'observation.images.cam_left_wrist',
                            'observation.images.cam_right_wrist'
                        ],
                        name_mappings={'observation.state': ['states']}),
                    dict(type='ParquetPrompter'),
                    dict(
                        type='ProcessPrompts',
                        tokenizer=dict(
                            type='PretrainedTokenizer',
                            model_path='/path/to/checkpoints/pi0',
                        )),
                    dict(type='ResizeImages', height=224, width=224),
                    dict(
                        type='NormalizeImages',
                        means=[[123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375]],
                        stds=[[58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062]]),
                    dict(
                        type='NormalizeStatesAndActions',
                        action_dim=14,
                        state_key='proprio',
                        action_key='action',
                        use_quantiles=False)
                ],
                action_window_size=32)
        ]))

2.3 Inference Configuration#

inference = dict(
    type='AlohaInferenceRunner',
    seed=7,
    task_descriptions={
        '1': 'pick up the yellow chicken toy with left arm',
        '2': 'place it in the brown flat cardboard box with right arm',
        # ... add entries for your actual tasks
    },
    dataset=dict(
        type='PrivateInferenceDataset',
        img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
        transforms=[
            dict(type='PrivatePrompter'),
            dict(
                type='ProcessPrompts',
                tokenizer=dict(type='PretrainedTokenizer')),
            dict(type='ResizeImages', height=224, width=224),
            dict(
                type='NormalizeImages',
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]]),
            dict(
                type='NormalizeStatesAndActions',
                action_dim=16,
                state_key='proprio',
                action_key='action',
                use_quantiles=False)
        ]),
    denormalize_action=dict(
        type='DenormalizePrivateAction',
        use_quantiles=False),
    operator=dict(
        type='AlohaOperator',
        img_front_topic='/camera_f/color/image_raw',
        img_left_topic='/camera_l/color/image_raw',
        img_right_topic='/camera_r/color/image_raw',
        img_front_depth_topic='/camera_f/depth/image_raw',
        img_left_depth_topic='/camera_l/depth/image_raw',
        img_right_depth_topic='/camera_r/depth/image_raw',
        puppet_arm_left_cmd_topic='/master/joint_left',
        puppet_arm_right_cmd_topic='/master/joint_right',
        puppet_arm_left_topic='/puppet/joint_left',
        puppet_arm_right_topic='/puppet/joint_right',
        robot_base_topic='/odom_raw',
        robot_base_cmd_topic='/cmd_vel'))

3. UR3 Single-Arm Training#

Using configuration file configs/pi0/pi0_paligemma_ur3_full_train.py.

Key differences from the Aloha configuration:

| Configuration Item | Aloha Dual-Arm | UR3 Single-Arm | Description |
| --- | --- | --- | --- |
| state_proj.in_dim | 14 | 7 | UR3 single-arm state dimension |
| action_in_proj.in_dim | 14 | 7 | UR3 single-arm action dimension |
| action_out_proj.out_dim | 14 | 7 | Output action dimension |
| max_action_dim | 14 | 7 | Maximum action dimension |
| video_keys | 3 cameras | 2 cameras | cam_high + cam_wrist |
| inference.type | AlohaInferenceRunner | URInferenceRunner | Different inference runners |
| operator.type | AlohaOperator | UROperator | Different hardware communication interfaces |

4. Multi-Robot Mixed Training#

π₀ supports mixed training with data from multiple robot types, enabling cross-robot generalization. Use configuration file configs/pi0/pi0_paligemma_aloha+ur_full_train.py.

Key configuration for mixed training:

model = dict(
    type='PI0FlowMatching',
    # ...
    n_action_steps=32,
    state_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_in_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_out_proj=dict(type='LinearProjector', in_dim=1024, out_dim=14),
    max_action_dim=14,               # maximum dimension across all robots
    # ...
)

The dataset uses a grouped dictionary format, organized by robot type:

train_dataloader = dict(
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        dim=14,                      # unified maximum dimension
        datasets=dict(
            aloha_4090=[             # Aloha 4090 data group
                dict(type='ParquetDataset', data_root_path='...', ...),
            ],
            aloha_4060=[             # Aloha 4060 data group
                dict(type='ParquetDataset', data_root_path='...', ...),
                dict(type='ParquetDataset', data_root_path='...', ...),
            ],
            ur3=[                    # UR3 data group
                dict(type='ParquetDataset', data_root_path='...', ...),
            ]
        )))

Warning

During mixed training, data from robots with different dimensionalities must be aligned to a unified dimension using the PadKeyToDim transform. For example, the 7-dimensional UR3 data needs to be padded to 14 dimensions:

# Add to the UR3 transforms list
dict(
    type='PadKeyToDim',
    keys=['actions', 'states', 'observation.eepose'],
    dim=14),
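Conceptually, this transform zero-pads the last axis of each named array up to the unified dimension. A hedged sketch of that behavior (this is an illustration of the idea, not the FluxVLA implementation):

```python
import numpy as np

def pad_key_to_dim(sample: dict, keys, dim: int) -> dict:
    """Zero-pad the last axis of each named array up to `dim` (sketch)."""
    for key in keys:
        arr = np.asarray(sample[key])
        short = dim - arr.shape[-1]
        if short > 0:
            pad = [(0, 0)] * (arr.ndim - 1) + [(0, short)]
            sample[key] = np.pad(arr, pad)  # padded slots are zeros
    return sample

# 7-dim UR3 sample padded to the unified 14-dim layout.
ur3_sample = {'actions': np.ones((32, 7)), 'states': np.ones(7)}
padded = pad_key_to_dim(ur3_sample, ['actions', 'states'], dim=14)
print(padded['actions'].shape, padded['states'].shape)
```

The padded slots carry no signal; max_action_dim=14 tells the model how wide the unified action space is.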

5. Launching Training and Deployment#

cd /path/to/fluxvla

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

# Aloha training
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_aloha_full_train.py \
    work_dirs/pi0_paligemma_aloha_full_train

# UR3 training
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_ur3_full_train.py \
    work_dirs/pi0_paligemma_ur3_full_train

# Aloha + UR3 mixed training
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_aloha+ur_full_train.py \
    work_dirs/pi0_paligemma_aloha+ur_full_train

Real-robot deployment:

python scripts/inference_real_robot.py \
    --config configs/pi0/pi0_paligemma_aloha_full_train.py \
    --ckpt-path work_dirs/pi0_paligemma_aloha_full_train/checkpoint_epoch_3.pt

Available Configuration Summary#

| Configuration File | Data | Training Strategy | Robot |
| --- | --- | --- | --- |
| pi0_paligemma_libero_10_full_train.py | LIBERO-10 | Full (from scratch) | Simulation |
| pi0_paligemma_libero_10_full_finetune_pytorch.py | LIBERO-10 | Full Finetune | Simulation |
| pi0_paligemma_libero_10_full_finetune.py | LIBERO-10 | Full Finetune | Simulation |
| pi0_paligemma_libero_10_lora_finetune.py | LIBERO-10 | LoRA | Simulation |
| pi0_paligemma_libero_90_full_finetune_pytorch.py | LIBERO-90 | Full Finetune | Simulation |
| pi0_paligemma_libero_spatial_full_finetune_pytorch.py | LIBERO-Spatial | Full Finetune | Simulation |
| pi0_paligemma_libero_object_full_finetune_pytorch.py | LIBERO-Object | Full Finetune | Simulation |
| pi0_paligemma_libero_goal_full_finetune_pytorch.py | LIBERO-Goal | Full Finetune | Simulation |
| pi0_paligemma_aloha_full_train.py | Aloha private data | Full | Aloha dual-arm |
| pi0_paligemma_ur3_full_train.py | UR3 private data | Full | UR3 single-arm |
| pi0_paligemma_aloha+ur_full_train.py | Aloha + UR3 | Full (mixed) | Multi-robot |

Frequently Asked Questions#

Q: How should I choose between Full Finetune and LoRA for π₀?

  • Full Finetune generally yields better performance but requires more GPU memory and longer training time. It is recommended for final deployment.

  • LoRA has lower memory requirements (DDP is sufficient) and faster training speed. It is suitable for rapid experimentation and hyperparameter search. LoRA typically uses a higher learning rate (e.g., 5e-4 vs. 2e-5).
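For intuition, the core LoRA idea is a frozen base weight plus a trainable low-rank update scaled by alpha / r; only the two small rank-r matrices receive gradients. A minimal illustrative sketch (not the FluxVLA LoRA implementation; class name and defaults are assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus scaled low-rank correction B(A(x)).
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(1024, 1024))
y = layer(torch.randn(2, 1024))
print(y.shape)
```

Because the update starts at zero, the adapted layer initially reproduces the pretrained behavior exactly, which is part of why LoRA tolerates the higher learning rate.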

Q: Why should the backbone be frozen (freeze_vlm_backbone=True) when training on private data?

The PaliGemma backbone in π₀ has undergone extensive pretraining and possesses strong visual and linguistic understanding capabilities. When private data is limited, freezing the backbone prevents overfitting while substantially reducing the number of trainable parameters. If sufficient data is available, it can also be set to False for full-parameter fine-tuning.

Q: How is data with different dimensionalities handled during mixed training?

The PadKeyToDim transform is used to pad lower-dimensional data to the maximum dimension. The max_action_dim parameter in the model is set to the maximum action dimension across all robots. The framework automatically handles dimension alignment after padding.

Q: Do π₀ and GR00T use different pretrained weight sources?

  • π₀ weights originate from the converted version provided by the openpi_pytorch project, in .safetensors format

  • GR00T-N1.5 weights originate from the GR00T-N1.5-3B checkpoint released by NVIDIA

Both are mapped to their respective module structures in FluxVLA via their corresponding name_mapping configurations.
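In general, this kind of mapping renames checkpoint keys to the target module layout before the weights are loaded. A hedged sketch of the mechanism (the prefixes `paligemma.` and `vlm_backbone.` are purely illustrative, not the actual FluxVLA name_mapping):

```python
import torch

def remap_state_dict(state_dict: dict, name_mapping: dict) -> dict:
    """Rename checkpoint keys by prefix before load_state_dict (sketch)."""
    remapped = {}
    for key, tensor in state_dict.items():
        for src, dst in name_mapping.items():
            if key.startswith(src):
                key = dst + key[len(src):]
                break
        remapped[key] = tensor
    return remapped

src = {'paligemma.embed.weight': torch.zeros(2, 2)}
mapping = {'paligemma.': 'vlm_backbone.'}  # illustrative prefixes only
print(list(remap_state_dict(src, mapping)))
```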

Q: How can training be resumed from an interruption?

bash scripts/train.sh \
    configs/pi0/pi0_paligemma_aloha_full_train.py \
    work_dirs/pi0_paligemma_aloha_full_train \
    --resume-from work_dirs/pi0_paligemma_aloha_full_train/checkpoint_epoch_2.pt

Q: How can configuration parameters be overridden?

bash scripts/train.sh \
    configs/pi0/pi0_paligemma_libero_10_full_finetune_pytorch.py \
    work_dirs/pi0_custom \
    --cfg-options runner.max_epochs=50 runner.learning_rate=1e-5