LIBERO Simulation Data Training#
LIBERO is a widely used robotic manipulation benchmark environment comprising task suites of varying difficulty levels. This page describes how to train and evaluate both GR00T-N1.5 and π₀ models using the LIBERO dataset within the FluxVLA framework, with libero_10 serving as the primary example.
Data Preparation#
Download the LIBERO dataset (LeRobot v2.1 format):
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_10_no_noops_lerobotv2.1/*" --local-dir ./datasets
Other LIBERO task suites:
# libero_spatial
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_spatial_no_noops_lerobotv2.1/*" --local-dir ./datasets
# libero_object
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_object_no_noops_lerobotv2.1/*" --local-dir ./datasets
# libero_goal
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_goal_no_noops_lerobotv2.1/*" --local-dir ./datasets
Directory structure after download:
libero_10_no_noops_lerobotv2.1/
├── data/
│ └── chunk-000/
│ ├── episode_000000.parquet
│ └── ...
├── videos/
│ └── chunk-000/
│ ├── observation.images.image/
│ │ ├── episode_000000.mp4
│ │ └── ...
│ └── observation.images.wrist_image/
│ ├── episode_000000.mp4
│ └── ...
└── meta/
├── episodes.jsonl
├── episodes_stats.jsonl
├── info.json
└── tasks.jsonl
The LIBERO data includes 2 camera viewpoints (image as the primary view + wrist_image as the wrist-mounted view). Each frame contains a 14-dimensional state and a 7-dimensional action (end-effector position + rotation + gripper).
GR00T-N1.5 LIBERO Training#
Training uses the provided configuration file configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py; the key configuration items are described below.
1. Model Configuration#
model = dict(
type='LlavaVLA',
pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
vlm_backbone=dict(
type='EagleBackbone',
vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
vla_head=dict(
type='FlowMatchingHead',
state_dim=8, # State input dimension
hidden_size=1024,
input_embedding_dim=1536,
num_layers=1,
num_heads=4,
num_inference_timesteps=4,
traj_length=10, # Predict 10 future action steps
action_dim=7), # LIBERO action dimension
freeze_vlm_backbone=False,
name_mapping={
'vlm_backbone.vlm': 'backbone.eagle_model',
'vla_head': 'action_head'
},
freeze_projector=False)
In the LIBERO scenario, state_dim=8 (6-dimensional end-effector pose + 2-dimensional gripper state) and action_dim=7 (6-dimensional end-effector delta + 1-dimensional gripper).
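The state/action split above can be illustrated with plain Python slicing; the exact layout is assumed from the description here, not taken from the FluxVLA code:

```python
# Illustrative decomposition of LIBERO state and action vectors, following
# the dimensions described above (the layout is an assumption for clarity).
state = [0.0] * 8                     # 6-dim end-effector pose + 2-dim gripper state
eef_pose, gripper_state = state[:6], state[6:]

action = [0.0] * 7                    # 6-dim end-effector delta + 1-dim gripper command
eef_delta, gripper_cmd = action[:6], action[6:]

print(len(eef_pose), len(gripper_state), len(eef_delta), len(gripper_cmd))
# 6 2 6 1
```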
2. Data Configuration#
train_dataloader = dict(
batch_size=128,
per_device_batch_size=8,
per_device_num_workers=4,
dataset=dict(
type='DistributedRepeatingDataset',
name_mappings={
'observation.state': ['proprio'],
'action': ['action']
},
statistic_keys=['observation.state', 'timestamp', 'action'],
statistic_name='libero_10_no_noops',
datasets=dict(
type='ParquetDataset',
data_root_path='/path/to/data/LIBERO_lerobot/libero_10_no_noops_1.0.0_lerobot',
transforms=[
dict(
type='ProcessParquetInputs',
embodiment_id=31,
parquet_keys=[
'observation.state', 'timestamp', 'actions',
'info', 'stats', 'action_masks'
],
video_keys=[
'observation.images.image',
'observation.images.wrist_image',
],
name_mappings={
'observation.state': ['states'],
'actions': ['actions']
}),
dict(type='ParquetPrompter'),
dict(
type='ProcessPromptsWithImage',
max_len=600,
num_images=2,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
)),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='NormalizeStatesAndActions',
action_dim=14,
state_key='proprio',
action_key='action',
use_quantiles=False)
],
action_window_size=10,
action_key='action',
use_delta=False,
statistic_name='libero_10_no_noops',
window_start_idx=0)))
The data transformation pipeline executes the following steps in sequence:
- ProcessParquetInputs — Parses states, actions, and images from Parquet files
- ParquetPrompter — Extracts task descriptions from the data as language prompts
- ProcessPromptsWithImage — Encodes text and images into model input tokens
- ResizeImages — Resizes all images to a uniform 224×224 resolution
- NormalizeImages — Applies the standard normalization parameters for the Eagle model
- NormalizeStatesAndActions — Normalizes states and actions
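The final step, NormalizeStatesAndActions with use_quantiles=False, implies a per-dimension z-score against dataset statistics. A minimal sketch of that convention (function names are illustrative, not the FluxVLA implementation):

```python
# Per-dimension z-score normalization and its inverse, as implied by
# NormalizeStatesAndActions with use_quantiles=False. Illustrative only.
def normalize(x, mean, std, eps=1e-8):
    return [(xi - m) / (s + eps) for xi, m, s in zip(x, mean, std)]

def denormalize(x, mean, std, eps=1e-8):
    return [xi * (s + eps) + m for xi, m, s in zip(x, mean, std)]

mean, std = [0.1, -0.2], [0.5, 2.0]   # hypothetical dataset statistics
action = [0.6, 1.8]
round_trip = denormalize(normalize(action, mean, std), mean, std)
print([round(v, 6) for v in round_trip])  # [0.6, 1.8]
```

The same statistics must be reused at evaluation time to undo the normalization, which is why a denormalization step appears in the evaluation configuration below.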
3. Training Configuration#
runner = dict(
type='FSDPTrainRunner',
max_epochs=24,
learning_rate=2e-5,
weight_decay=0.0,
max_grad_norm=1.0,
sampler=None,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
),
collator=dict(
type='DictCollator',
keys=[
'states', 'observation.eepose', 'timestamp', 'images',
'img_masks', 'lang_tokens', 'lang_masks', 'actions',
'action_masks', 'embodiment_ids'
],
meta_keys=['task_description', 'prompt', 'info', 'stats']),
metric=dict(
type='VLAMetric',
active_trackers=('jsonl', 'wandb'),
run_dir='work_dirs',
wandb_project='FluxVLA',
wandb_entity='limx',
grad_accumulation_steps=1,
window_size=1),
lr_scheduler_type='constant',
warmup_ratio=0.0,
enable_gradient_checkpointing=True,
enable_mixed_precision_training=True,
mixed_precision_dtype='bf16',
sharding_strategy='full-shard',
change_key_name=False)
Training employs the FSDP (Fully Sharded Data Parallel) strategy for distributed training with BF16 mixed precision and a constant learning rate of 2e-5. Since the LIBERO dataset is relatively small, 24 epochs of training are required for sufficient convergence.
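The max_grad_norm=1.0 setting corresponds to global gradient-norm clipping: the whole gradient vector is rescaled when its L2 norm exceeds the cap. A minimal sketch of the operation (not the FSDP implementation):

```python
import math

# Global gradient-norm clipping, as implied by max_grad_norm=1.0.
def clip_grad_norm(grads, max_norm, eps=1e-12):
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total + eps))
    return [[g * scale for g in vec] for vec in grads], total

grads = [[3.0, 4.0]]                  # global norm 5.0
clipped, norm = clip_grad_norm(grads, max_norm=1.0)
print(norm, math.hypot(*clipped[0]))  # 5.0 and ~1.0
```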
4. Evaluation Configuration#
eval = dict(
type='LiberoEvalRunner',
task_suite_name='libero_10',
model_family='pi0',
eval_chunk_size=10,
resize_size=224,
num_trials_per_task=50,
num_steps_wait=10,
seed=7,
dataset=dict(
type='LiberoParquetEvalDataset',
transforms=[
dict(
type='ProcessLiberoEvalInputs',
embodiment_id=31,
img_keys=['agentview_image', 'robot0_eye_in_hand_image']),
dict(
type='TransformImage',
image_resize_strategy='resize-naive',
input_sizes=[[3, 224, 224], [3, 224, 224]],
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='ProcessPromptsWithImage',
max_len=600,
num_images=2,
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/models/eagle2_hg_model',
)),
dict(
type='LiberoProprioFromInputs',
use_quantiles=False,
pos_key='robot0_eef_pos',
quat_key='robot0_eef_quat',
gripper_key='robot0_gripper_qpos',
out_key='states'),
]),
denormalize_action=dict(
type='DenormalizeLiberoAction',
use_quantiles=False,
))
Evaluation runs in the LIBERO simulation environment, executing 50 trials for each task in libero_10 and reporting the success rate.
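The reported metric reduces to a per-task average over trial outcomes; the counts below are hypothetical:

```python
# Success rate over the num_trials_per_task=50 rollouts described above.
def success_rate(trial_outcomes):
    return sum(trial_outcomes) / len(trial_outcomes)

trials = [True] * 42 + [False] * 8   # hypothetical 42/50 successes
print(success_rate(trials))          # 0.84
```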
5. Launching Training#
cd /path/to/fluxvla
# Set environment variables (single node, 8 GPUs)
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
work_dirs/gr00t_eagle_3b_libero_10_full_train
On the Volcano Engine MLP platform, these environment variables are injected automatically, so simply run:
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
work_dirs/gr00t_eagle_3b_libero_10_full_train
6. Evaluation#
After training is complete, use eval.sh to perform evaluation:
export MLP_WORKER_GPU=1
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
bash scripts/eval.sh \
configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
work_dirs/gr00t_eagle_3b_libero_10_full_train/checkpoint_epoch_24.pt
Alternatively, append the --eval-after-train flag during training to trigger automatic evaluation upon completion:
bash scripts/train.sh \
configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
work_dirs/gr00t_eagle_3b_libero_10_full_train \
--eval-after-train
7. Training Outputs#
Upon completion, the following files are generated under the work_dirs/gr00t_eagle_3b_libero_10_full_train/ directory:
- checkpoint_*.pt — Model checkpoints
- dataset_statistics.json — Dataset statistics (used during evaluation and inference)
- config.yaml / config.json — Training configuration backups
- *.jsonl — Training logs (e.g., loss curves)
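Since dataset_statistics.json is consumed at evaluation and inference time, it is worth knowing what to expect inside. Its exact schema is not documented here; the snippet below round-trips a plausible per-key mean/std structure, which is an assumption:

```python
import json, os, tempfile

# Round-trip a hypothetical dataset_statistics.json; the real schema is
# produced during training and may differ from this sketch.
stats = {"action": {"mean": [0.0] * 7, "std": [1.0] * 7}}
path = os.path.join(tempfile.mkdtemp(), "dataset_statistics.json")
with open(path, "w") as f:
    json.dump(stats, f)
with open(path) as f:
    loaded = json.load(f)
print(len(loaded["action"]["mean"]))  # 7
```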
π₀ LIBERO Training#
1. Training Strategy Selection#
π₀.₅ supports two fine-tuning strategies:
| Strategy | Full Finetune | LoRA Finetune |
|---|---|---|
| Configuration File | `configs/pi05/pi05_paligemma_libero_10_full_finetune_pytorch.py` | `configs/pi0/pi0_paligemma_libero_10_lora_finetune.py` |
| Training Runner | `FSDPTrainRunner` | `DDPTrainRunner` |
| Backbone Freezing | Not frozen (`freeze_vlm_backbone=False`) | VLM and Expert frozen |
| Memory Requirements | Higher | Lower |
| Applicable Scenarios | Large-scale data, optimal performance | Limited data, constrained memory, rapid experimentation |
2. Full Finetune Configuration Details#
This section walks through the configuration file configs/pi05/pi05_paligemma_libero_10_full_finetune_pytorch.py.
2.1 Model Configuration#
model = dict(
type='PI05FlowMatching',
vlm_backbone=dict(
type='PaliGemma',
vlm_backbone_id='paligemma_3b_pt_224',
vlm_config=dict(...)), # PaliGemma detailed config (usually no changes needed)
proj_width=1024, # Projection dimension
n_action_steps=10, # Predict 10 future action steps
state_proj=dict(type='LinearProjector', in_dim=8, out_dim=1024),
action_in_proj=dict(type='LinearProjector', in_dim=7, out_dim=1024),
action_out_proj=dict(type='LinearProjector', in_dim=1024, out_dim=7),
action_time_mlp_in=dict(type='LinearProjector', in_dim=2048, out_dim=1024),
action_time_mlp_out=dict(type='LinearProjector', in_dim=1024, out_dim=1024),
llm_expert=dict(
type='GemmaLLMBackbone',
llm_backbone_id='gemma-2b_causal',
llm_family='gemma',
llm_config=dict(...)), # Gemma Expert detailed config (usually no changes needed)
freeze_vlm_backbone=False,
pretrained_name_or_path=
'/path/to/cache/openpi/openpi-assets/checkpoints/pi0_libero_pytorch/model.safetensors',
name_mapping={
'vlm_backbone.vlm.model.language_model':
'model.paligemma_with_expert.paligemma.model.language_model',
'vlm_backbone.vlm.model.vision_tower':
'model.paligemma_with_expert.paligemma.model.vision_tower',
'vlm_backbone.vlm.model.multi_modal_projector':
'model.paligemma_with_expert.paligemma.model.multi_modal_projector',
'action_time_mlp_in.projector': 'model.action_time_mlp_in',
'action_time_mlp_out.projector': 'model.action_time_mlp_out',
'llm_expert.llm.model':
'model.paligemma_with_expert.gemma_expert.model',
'state_proj.projector': 'model.state_proj',
'action_in_proj.projector': 'model.action_in_proj',
'action_out_proj.projector': 'model.action_out_proj'
})
Key dimension parameters:
| Parameter | LIBERO | Description |
|---|---|---|
| `state_proj.in_dim` | 8 | State input dimension (6-dimensional end-effector pose + 2-dimensional gripper) |
| `action_in_proj.in_dim` | 7 | Action input dimension |
| `action_out_proj.out_dim` | 7 | Action output dimension |
| `n_action_steps` | 10 | Length of the action sequence predicted per step |
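These dimensions can be sanity-checked with random matrices standing in for the linear projectors (weights are illustrative only; this is not the FluxVLA code):

```python
import numpy as np

# Shape check: the 8-dim state and 7-dim actions both map into the
# 1024-dim expert width, and action outputs map back to 7 dims.
rng = np.random.default_rng(0)
state_proj = rng.standard_normal((8, 1024))
action_in_proj = rng.standard_normal((7, 1024))
action_out_proj = rng.standard_normal((1024, 7))

state = rng.standard_normal(8)
actions = rng.standard_normal((10, 7))        # n_action_steps = 10

print((state @ state_proj).shape)             # (1024,)
print((actions @ action_in_proj).shape)       # (10, 1024)
print((actions @ action_in_proj @ action_out_proj).shape)  # (10, 7)
```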
2.2 Data Configuration#
train_dataloader = dict(
batch_size=128,
per_device_batch_size=8,
per_device_num_workers=4,
dataset=dict(
type='DistributedRepeatingDataset',
name_mappings={
'observation.state': ['proprio'],
'action': ['action']
},
statistic_keys=['observation.state', 'timestamp', 'action'],
statistic_name='libero_10_no_noops',
datasets=dict(
type='ParquetDataset',
data_root_path='/path/to/data/LIBERO_lerobot/libero_10_no_noops_1.0.0_lerobot',
transforms=[
dict(
type='ProcessParquetInputs',
parquet_keys=[
'observation.state', 'timestamp', 'actions',
'info', 'stats', 'action_masks'
],
video_keys=[
'observation.images.image',
'observation.images.wrist_image',
],
name_mappings={
'observation.state': ['states'],
'actions': ['actions']
}),
dict(type='ParquetPrompter'),
dict(
type='ProcessPrompts',
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/checkpoints/pi0',
)),
dict(type='ResizeImages', height=224, width=224),
dict(
type='NormalizeImages',
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='NormalizeStatesAndActions',
action_dim=14,
state_key='proprio',
action_key='action',
use_quantiles=False)
],
action_window_size=10,
action_key='action',
use_delta=False,
statistic_name='libero_10_no_noops',
window_start_idx=0)))
Note
π₀ uses ProcessPrompts instead of GR00T’s ProcessPromptsWithImage for text processing. The tokenizer paths also differ: π₀ uses the huggingface/pi0 tokenizer.
2.3 Training Configuration#
runner = dict(
type='FSDPTrainRunner',
max_epochs=24,
learning_rate=2e-5,
weight_decay=0.0,
max_grad_norm=1.0,
collator=dict(
type='DictCollator',
keys=[
'states', 'observation.eepose', 'timestamp', 'images',
'img_masks', 'lang_tokens', 'lang_masks', 'actions',
'action_masks'
],
meta_keys=['task_description', 'prompt', 'info', 'stats']),
sampler=None,
metric=dict(
type='VLAMetric',
active_trackers=('jsonl', 'wandb'),
run_dir='work_dirs',
wandb_project='FluxVLA',
wandb_entity='limx',
grad_accumulation_steps=1,
window_size=1),
lr_scheduler_type='constant',
warmup_ratio=0.0,
enable_gradient_checkpointing=True,
enable_mixed_precision_training=True,
mixed_precision_dtype='bf16',
sharding_strategy='full-shard',
change_key_name=False)
2.4 Evaluation Configuration#
eval = dict(
type='LiberoEvalRunner',
task_suite_name='libero_10',
model_family='pi0',
eval_chunk_size=10,
resize_size=224,
num_trials_per_task=50,
num_steps_wait=10,
seed=7,
dataset=dict(
type='LiberoParquetEvalDataset',
transforms=[
dict(
type='ProcessLiberoEvalInputs',
img_keys=['agentview_image', 'robot0_eye_in_hand_image']),
dict(
type='TransformImage',
image_resize_strategy='resize-naive',
input_sizes=[[3, 224, 224], [3, 224, 224]],
means=[[123.515625, 116.04492188, 103.59375],
[123.515625, 116.04492188, 103.59375]],
stds=[[58.27148438, 57.02636719, 57.27539062],
[58.27148438, 57.02636719, 57.27539062]]),
dict(
type='LiberoPromptFromInputs',
tokenizer=dict(
type='PretrainedTokenizer',
model_path='/path/to/checkpoints/pi0',
)),
dict(
type='LiberoProprioFromInputs',
use_quantiles=False,
pos_key='robot0_eef_pos',
quat_key='robot0_eef_quat',
gripper_key='robot0_gripper_qpos',
out_key='states'),
]),
denormalize_action=dict(
type='DenormalizeLiberoAction',
use_quantiles=False))
3. LoRA Finetune Configuration Details#
This section uses the configuration file configs/pi0/pi0_paligemma_libero_10_lora_finetune.py.
Key differences between LoRA mode and Full Finetune:
model = dict(
type='PI0FlowMatching',
# ... base configuration same as above ...
freeze_vlm_backbone=True, # Freeze VLM backbone
freeze_llm_expert=True, # Freeze action expert
use_lora=True, # Enable LoRA
lora_rank=32, # LoRA rank
lora_dropout=0.0,
lora_target_modules=[ # Modules targeted by LoRA
'q_proj', 'v_proj', 'k_proj', 'o_proj',
'state_proj.projector',
'action_in_proj.projector',
'action_out_proj.projector'
])
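The memory savings behind lora_rank=32 come from training only a low-rank update B @ A per targeted weight while the base weight stays frozen. A back-of-the-envelope count (the hidden size below is illustrative, not taken from the config):

```python
# Trainable-parameter fraction for one LoRA-adapted linear layer:
# frozen W (d_out x d_in) plus trainable B (d_out x r) and A (r x d_in).
d_in = d_out = 2048          # illustrative hidden size
rank = 32                    # lora_rank=32

full_params = d_out * d_in
lora_params = rank * d_in + d_out * rank
print(lora_params / full_params)   # 0.03125, i.e. ~3% of the layer is trainable
```

This is why LoRA fits on smaller GPUs and tolerates the higher learning rate used below.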
The training runner uses DDP instead of FSDP:
runner = dict(
type='DDPTrainRunner', # LoRA uses DDP
learning_rate=0.0005, # LoRA typically uses a higher learning rate
max_epochs=18,
save_epoch_interval=1,
# ... remainder same as Full Finetune ...
)
4. Launching Training#
cd /path/to/fluxvla
# Set environment variables (single node, 8 GPUs)
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
# Full Finetune
bash scripts/train.sh \
configs/pi0/pi0_paligemma_libero_10_full_finetune_pytorch.py \
work_dirs/pi0_paligemma_libero_10_full_finetune_pytorch
# Or LoRA Finetune
bash scripts/train.sh \
configs/pi0/pi0_paligemma_libero_10_lora_finetune.py \
work_dirs/pi0_paligemma_libero_10_lora_finetune
5. Evaluation#
export MLP_WORKER_GPU=1
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
bash scripts/eval.sh \
configs/pi0/pi0_paligemma_libero_10_full_finetune_pytorch.py \
work_dirs/pi0_paligemma_libero_10_full_finetune_pytorch/checkpoint_epoch_24.pt