π₀.₅ Model Training#
This tutorial is a guide to fine-tuning the π₀.₅ model with the FluxVLA framework. π₀.₅ is an improved version of π₀ (`PI05FlowMatching` inherits from `PI0FlowMatching`); the two share the same architectural design, with π₀.₅ adding further optimizations in action generation. The tutorial covers two typical scenarios:

- **LIBERO Simulation Data Training** — using the ready-to-use LIBERO benchmark dataset for rapid validation (see LIBERO Simulation Data Training; supports both Full Finetune and LoRA)
- **Private Real-Robot Data Training** — using self-collected robot data for customized fine-tuning (supports both single-robot and multi-robot mixed training)
Model Architecture#
π₀.₅ in FluxVLA consists of the following core modules:
| Module | Type | Function |
|---|---|---|
| VLA Container | `PI05FlowMatching` | Model wrapper that coordinates visual-language understanding and action generation (inherits from `PI0FlowMatching`) |
| Vision-Language Backbone | PaliGemma 3B | Multimodal model for processing image and language inputs |
| Action Expert | Gemma 2B | Language model serving as the expert network for action generation |
| State/Action Projectors | `LinearProjector` | Maps robot states and actions to the model's internal dimensions |
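How these modules interact can be sketched as a rough forward pass. The sketch below is purely illustrative: the function names, the stand-in arithmetic, and the dummy token values are assumptions for clarity, not the actual FluxVLA API; only the 14/1024 dimensions come from the configs later in this tutorial.

```python
# Toy sketch of the pi0.5 forward flow (illustrative only; real modules are
# large transformers, and all names and stand-in math here are invented).

def linear_projector(x, in_dim, out_dim):
    """Stand-in for LinearProjector: map a vector to out_dim (pad/truncate)."""
    assert len(x) == in_dim
    return (x + [0.0] * out_dim)[:out_dim]

def pi05_step(image_tokens, text_tokens, state, noisy_action):
    # 1. The PaliGemma backbone fuses image + language tokens (stand-in: concat).
    context = image_tokens + text_tokens
    # 2. The state projector lifts the 14-dim robot state to the model width (1024).
    state_emb = linear_projector(state, in_dim=14, out_dim=1024)
    # 3. The action-in projector lifts the noisy action chunk the same way.
    action_emb = linear_projector(noisy_action, in_dim=14, out_dim=1024)
    # 4. The Gemma action expert fuses everything (toy stand-in: elementwise sum;
    #    a real expert would also attend over `context`).
    expert_out = [a + b for a, b in zip(state_emb, action_emb)]
    # 5. The action-out projector maps back to the 14-dim action space.
    return linear_projector(expert_out, in_dim=1024, out_dim=14)

action = pi05_step(image_tokens=[1, 2], text_tokens=[3],
                   state=[0.1] * 14, noisy_action=[0.2] * 14)
```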
Comparison with GR00T-N1.5:
| Feature | π₀.₅ | GR00T-N1.5 |
|---|---|---|
| Model Type | | |
| Vision Backbone | PaliGemma | |
| Action Generation | Independent Gemma Expert + LinearProjector | |
| Text Processing | | |
| Pretrained Weights | pi0/pi0.5 checkpoint (`.safetensors`) | GR00T-N1.5-3B |
Scenario 2: Private Real-Robot Data Training#
This section demonstrates how to train the π₀ model on private data. Configuration files are provided for Aloha dual-arm, UR3 single-arm, and Aloha + UR3 mixed training.
1. Data Preparation#
Private data must be converted to the LeRobot v2.1 format (see Real-Robot Data Preparation).
2. Aloha Dual-Arm Training#
Using configuration file `configs/pi0/pi0_paligemma_aloha_full_train.py`.
2.1 Model Configuration Differences#
Compared with the LIBERO configuration, the Aloha setup requires adjustments to the dimension parameters:
```python
model = dict(
    type='PI0FlowMatching',
    # ... PaliGemma and Gemma Expert configs unchanged ...
    proj_width=1024,
    n_action_steps=32,  # Aloha uses a longer action sequence
    state_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_in_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_out_proj=dict(type='LinearProjector', in_dim=1024, out_dim=14),
    max_action_dim=14,  # maximum action dimension for the dual arms
    # ...
    freeze_vlm_backbone=True,  # freeze the backbone; train only action-related layers
    pretrained_name_or_path='/path/to/checkpoints/pi0/model.safetensors',
)
```
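Mismatched dimensions in such a config typically only surface as shape errors deep in training. A small sanity check like the one below catches them up front; the nested-dict layout mirrors the snippet above, but treat the key names as assumptions to adapt to your actual config.

```python
# Minimal consistency check for the dimension parameters of a pi0-style config.
# The dict layout mirrors the config snippet above; adapt key names as needed.
config = {
    'proj_width': 1024,
    'max_action_dim': 14,
    'state_proj': {'in_dim': 14, 'out_dim': 1024},
    'action_in_proj': {'in_dim': 14, 'out_dim': 1024},
    'action_out_proj': {'in_dim': 1024, 'out_dim': 14},
}

def check_dims(cfg):
    # All projectors must agree on the model's internal width ...
    assert cfg['state_proj']['out_dim'] == cfg['proj_width']
    assert cfg['action_in_proj']['out_dim'] == cfg['proj_width']
    assert cfg['action_out_proj']['in_dim'] == cfg['proj_width']
    # ... and the action projectors on the robot's action dimensionality.
    # (The state dim may differ from the action dim: LIBERO uses an 8-dim
    # state with 7-dim actions, so it is deliberately not tied here.)
    assert cfg['action_in_proj']['in_dim'] == cfg['max_action_dim']
    assert cfg['action_out_proj']['out_dim'] == cfg['max_action_dim']
    return True

check_dims(config)
```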
Dimension parameter comparison:
| Parameter | LIBERO | Aloha Dual-Arm | Description |
|---|---|---|---|
| `state_proj.in_dim` | 8 | 14 | 7-dimensional state per arm |
| `action_in_proj.in_dim` | 7 | 14 | 7-dimensional action per arm |
| `action_out_proj.out_dim` | 7 | 14 | Output dimension matches input |
| `n_action_steps` | 10 | 32 | Longer action sequences for real-robot scenarios |
| `max_action_dim` | Not set | 14 | Maximum action dimension limit |
Note
The Aloha configuration uses freeze_vlm_backbone=True to freeze the vision-language backbone, training only the action-related projection layers and the expert network. This is because when starting from the official pi0 checkpoint, the backbone has already been sufficiently pretrained.
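The savings from freezing can be illustrated with ballpark parameter counts. This is a toy sketch: the counts are order-of-magnitude assumptions based on the stated 3B/2B model sizes, not measured values, and the module names are invented.

```python
# Rough illustration of what freeze_vlm_backbone=True changes: only the
# expert and the projection layers stay trainable. Counts are ballpark
# assumptions (PaliGemma ~3B, Gemma expert ~2B), not measured values.
modules = {
    'paligemma_backbone': {'params': 3_000_000_000, 'trainable': False},  # frozen
    'gemma_expert':       {'params': 2_000_000_000, 'trainable': True},
    'state_proj':         {'params': 14 * 1024,     'trainable': True},
    'action_in_proj':     {'params': 14 * 1024,     'trainable': True},
    'action_out_proj':    {'params': 1024 * 14,     'trainable': True},
}

trainable = sum(m['params'] for m in modules.values() if m['trainable'])
total = sum(m['params'] for m in modules.values())
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.0%})")
```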
2.2 Data Configuration#
```python
train_dataloader = dict(
    batch_size=128,
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        statistic_keys=[
            'observation.state', 'observation.eepose', 'timestamp'
        ],
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=[
                    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250601_20250615_02_4090',
                    '/path/to/data/RealRobot_AgileX_aloha_lerobot_v2/20250616_20250630_02_4090',
                    # ... more data batches can be added
                ],
                transforms=[
                    dict(
                        type='ProcessParquetInputs',
                        parquet_keys=[
                            'observation.state', 'observation.eepose',
                            'timestamp', 'actions', 'info', 'stats',
                            'action_masks'
                        ],
                        video_keys=[
                            'observation.images.cam_high',
                            'observation.images.cam_left_wrist',
                            'observation.images.cam_right_wrist'
                        ],
                        name_mappings={'observation.state': ['states']}),
                    dict(type='ParquetPrompter'),
                    dict(
                        type='ProcessPrompts',
                        tokenizer=dict(
                            type='PretrainedTokenizer',
                            model_path='/path/to/checkpoints/pi0',
                        )),
                    dict(type='ResizeImages', height=224, width=224),
                    dict(
                        type='NormalizeImages',
                        means=[[123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375],
                               [123.515625, 116.04492188, 103.59375]],
                        stds=[[58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062],
                              [58.27148438, 57.02636719, 57.27539062]]),
                    dict(
                        type='NormalizeStatesAndActions',
                        action_dim=14,
                        state_key='proprio',
                        action_key='action',
                        use_quantiles=False)
                ],
                action_window_size=32)
        ]))
```
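The `NormalizeImages` transform above applies the standard per-channel `(pixel - mean) / std` normalization, with one mean/std triplet per camera. A minimal stand-alone sketch of the arithmetic on a single RGB pixel (the helper name is hypothetical; the stats are copied from the config):

```python
# Per-channel image normalization as configured above:
# out = (pixel - mean) / std, applied independently to R, G, B.
MEANS = [123.515625, 116.04492188, 103.59375]
STDS = [58.27148438, 57.02636719, 57.27539062]

def normalize_pixel(rgb, means=MEANS, stds=STDS):
    return [(c - m) / s for c, m, s in zip(rgb, means, stds)]

# A pixel exactly at the channel means maps to zero; pure white lands a
# couple of standard deviations above the mean on every channel.
gray = normalize_pixel([123.515625, 116.04492188, 103.59375])
white = normalize_pixel([255.0, 255.0, 255.0])
```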
2.3 Inference Configuration#
```python
inference = dict(
    type='AlohaInferenceRunner',
    seed=7,
    task_descriptions={
        '1': 'pick up the yellow chicken toy with left arm',
        '2': 'place it in the brown flat cardboard box with right arm',
        # ... add more entries for your actual tasks
    },
    dataset=dict(
        type='PrivateInferenceDataset',
        img_keys=['cam_high', 'cam_left_wrist', 'cam_right_wrist'],
        transforms=[
            dict(type='PrivatePrompter'),
            dict(
                type='ProcessPrompts',
                tokenizer=dict(type='PretrainedTokenizer')),
            dict(type='ResizeImages', height=224, width=224),
            dict(
                type='NormalizeImages',
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]]),
            dict(
                type='NormalizeStatesAndActions',
                action_dim=16,
                state_key='proprio',
                action_key='action',
                use_quantiles=False)
        ]),
    denormalize_action=dict(
        type='DenormalizePrivateAction',
        use_quantiles=False),
    operator=dict(
        type='AlohaOperator',
        img_front_topic='/camera_f/color/image_raw',
        img_left_topic='/camera_l/color/image_raw',
        img_right_topic='/camera_r/color/image_raw',
        img_front_depth_topic='/camera_f/depth/image_raw',
        img_left_depth_topic='/camera_l/depth/image_raw',
        img_right_depth_topic='/camera_r/depth/image_raw',
        puppet_arm_left_cmd_topic='/master/joint_left',
        puppet_arm_right_cmd_topic='/master/joint_right',
        puppet_arm_left_topic='/puppet/joint_left',
        puppet_arm_right_topic='/puppet/joint_right',
        robot_base_topic='/odom_raw',
        robot_base_cmd_topic='/cmd_vel'))
```
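`DenormalizePrivateAction` inverts the action normalization before commands are sent to the robot. Assuming plain mean/std normalization (`use_quantiles=False`), the inverse is simply `x * std + mean`; the sketch below is a hypothetical stand-in with invented statistics, not the FluxVLA implementation.

```python
# Round-trip sketch for NormalizeStatesAndActions / DenormalizePrivateAction
# with use_quantiles=False: normalize as (x - mean) / std, denormalize as
# x * std + mean. The stats here are invented; real ones come from the
# dataset statistics.
def normalize(x, mean, std):
    return [(v - m) / s for v, m, s in zip(x, mean, std)]

def denormalize(x, mean, std):
    return [v * s + m for v, m, s in zip(x, mean, std)]

mean = [0.1, -0.2, 0.3]
std = [0.5, 0.4, 0.6]
raw = [0.35, -0.4, 0.9]

# Denormalizing a normalized action must recover the original values.
restored = denormalize(normalize(raw, mean, std), mean, std)
```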
3. UR3 Single-Arm Training#
Using configuration file `configs/pi0/pi0_paligemma_ur3_full_train.py`.
Key differences from the Aloha configuration:
| Configuration Item | Aloha Dual-Arm | UR3 Single-Arm | Description |
|---|---|---|---|
| `state_proj.in_dim` | 14 | 7 | UR3 single-arm state dimension |
| `action_in_proj.in_dim` | 14 | 7 | UR3 single-arm action dimension |
| `action_out_proj.out_dim` | 14 | 7 | Output action dimension |
| `max_action_dim` | 14 | 7 | Maximum action dimension |
| Camera inputs | 3 cameras | 2 cameras | |
| Inference runner | `AlohaInferenceRunner` | | Different inference runners |
| Operator | `AlohaOperator` | | Different hardware communication interfaces |
4. Multi-Robot Mixed Training#
π₀ supports mixed training with data from multiple robot types, enabling cross-robot generalization. Use configuration file `configs/pi0/pi0_paligemma_aloha+ur_full_train.py`.
Key configuration for mixed training:
```python
model = dict(
    type='PI0FlowMatching',
    # ...
    n_action_steps=32,
    state_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_in_proj=dict(type='LinearProjector', in_dim=14, out_dim=1024),
    action_out_proj=dict(type='LinearProjector', in_dim=1024, out_dim=14),
    max_action_dim=14,  # the maximum dimension across all robots
    # ...
)
```
The dataset uses a grouped dictionary format, organized by robot type:
```python
train_dataloader = dict(
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={'observation.state': ['proprio', 'action']},
        dim=14,  # unified maximum dimension
        datasets=dict(
            aloha_4090=[  # Aloha 4090 data group
                dict(type='ParquetDataset', data_root_path='...', ...),
            ],
            aloha_4060=[  # Aloha 4060 data group
                dict(type='ParquetDataset', data_root_path='...', ...),
                dict(type='ParquetDataset', data_root_path='...', ...),
            ],
            ur3=[  # UR3 data group
                dict(type='ParquetDataset', data_root_path='...', ...),
            ]
        )))
```
Warning
During mixed training, data from robots with different dimensionalities must be aligned to a unified dimension using the PadKeyToDim transform. For example, the 7-dimensional UR3 data needs to be padded to 14 dimensions:
```python
# Add to the UR3 transforms:
dict(
    type='PadKeyToDim',
    keys=['actions', 'states', 'observation.eepose'],
    dim=14),
```
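A minimal stand-in for what such a padding transform does is sketched below. The helper is hypothetical, and zero as the padding value is an assumption; check the actual `PadKeyToDim` implementation for the value and placement it uses.

```python
# Toy stand-in for a PadKeyToDim-style transform: zero-pad a 7-dim UR3
# action vector to the unified 14-dim layout shared with Aloha data.
def pad_to_dim(vec, dim, pad_value=0.0):
    if len(vec) > dim:
        raise ValueError(f"vector of length {len(vec)} exceeds target dim {dim}")
    return list(vec) + [pad_value] * (dim - len(vec))

ur3_action = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # 7-dim single-arm action
padded = pad_to_dim(ur3_action, dim=14)            # 14-dim, trailing zeros
```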
5. Launching Training and Deployment#
```bash
cd /path/to/fluxvla

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

# Aloha training
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_aloha_full_train.py \
    work_dirs/pi0_paligemma_aloha_full_train

# UR3 training
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_ur3_full_train.py \
    work_dirs/pi0_paligemma_ur3_full_train

# Aloha + UR3 mixed training
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_aloha+ur_full_train.py \
    work_dirs/pi0_paligemma_aloha+ur_full_train
```
Real-robot deployment:
```bash
python scripts/inference_real_robot.py \
    --config configs/pi0/pi0_paligemma_aloha_full_train.py \
    --ckpt-path work_dirs/pi0_paligemma_aloha_full_train/checkpoint_epoch_3.pt
```
Available Configuration Summary#
| Configuration File | Data | Training Strategy | Robot |
|---|---|---|---|
| | LIBERO-10 | Full (from scratch) | Simulation |
| | LIBERO-10 | Full Finetune | Simulation |
| | LIBERO-10 | Full Finetune | Simulation |
| | LIBERO-10 | LoRA | Simulation |
| | LIBERO-90 | Full Finetune | Simulation |
| | LIBERO-Spatial | Full Finetune | Simulation |
| | LIBERO-Object | Full Finetune | Simulation |
| | LIBERO-Goal | Full Finetune | Simulation |
| `configs/pi0/pi0_paligemma_aloha_full_train.py` | Aloha private data | Full | Aloha dual-arm |
| `configs/pi0/pi0_paligemma_ur3_full_train.py` | UR3 private data | Full | UR3 single-arm |
| `configs/pi0/pi0_paligemma_aloha+ur_full_train.py` | Aloha + UR3 | Full (mixed) | Multi-robot |
Frequently Asked Questions#
Q: How should I choose between Full Finetune and LoRA for π₀?
Full Finetune generally yields better performance but requires more GPU memory and longer training time. It is recommended for final deployment.
LoRA has lower memory requirements (DDP is sufficient) and faster training speed. It is suitable for rapid experimentation and hyperparameter search. LoRA typically uses a higher learning rate (e.g., 5e-4 vs. 2e-5).
Q: Why should the backbone be frozen (freeze_vlm_backbone=True) when training on private data?
The PaliGemma backbone in π₀ has undergone extensive pretraining and possesses strong visual and linguistic understanding capabilities. When private data is limited, freezing the backbone prevents overfitting while substantially reducing the number of trainable parameters. If sufficient data is available, it can also be set to False for full-parameter fine-tuning.
Q: How is data with different dimensionalities handled during mixed training?
The PadKeyToDim transform is used to pad lower-dimensional data to the maximum dimension. The max_action_dim parameter in the model is set to the maximum action dimension across all robots. The framework automatically handles dimension alignment after padding.
Q: Do π₀ and GR00T use different pretrained weight sources?
- π₀ weights originate from the converted version provided by the openpi_pytorch project, in `.safetensors` format
- GR00T-N1.5 weights originate from the GR00T-N1.5-3B checkpoint released by NVIDIA
Both are mapped to their respective module structures in FluxVLA via their corresponding name_mapping configurations.
Q: How can training be resumed from an interruption?
```bash
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_aloha_full_train.py \
    work_dirs/pi0_paligemma_aloha_full_train \
    --resume-from work_dirs/pi0_paligemma_aloha_full_train/checkpoint_epoch_2.pt
```
Q: How can configuration parameters be overridden?
```bash
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_libero_10_full_finetune_pytorch.py \
    work_dirs/pi0_custom \
    --cfg-options runner.max_epochs=50 runner.learning_rate=1e-5
```