LIBERO Simulation Data Training#

LIBERO is a widely used robotic manipulation benchmark environment comprising task suites of varying difficulty levels. This page describes how to train and evaluate both GR00T-N1.5 and π₀ models using the LIBERO dataset within the FluxVLA framework, with libero_10 serving as the primary example.

Data Preparation#

Download the LIBERO dataset (LeRobot v2.1 format):

huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_10_no_noops_lerobotv2.1/*" --local-dir ./datasets

Other LIBERO task suites:

# libero_spatial
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_spatial_no_noops_lerobotv2.1/*" --local-dir ./datasets
# libero_object
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_object_no_noops_lerobotv2.1/*" --local-dir ./datasets
# libero_goal
huggingface-cli download limxdynamics/FluxVLAData --repo-type dataset --include "libero_goal_no_noops_lerobotv2.1/*" --local-dir ./datasets
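
If you prefer Python over the CLI, the same download can be scripted with the huggingface_hub API (a minimal sketch; assumes the huggingface_hub package is installed, and any suite pattern above can be substituted):

from huggingface_hub import snapshot_download

# Download one task suite from the dataset repo (pattern is interchangeable).
snapshot_download(
    repo_id="limxdynamics/FluxVLAData",
    repo_type="dataset",
    allow_patterns=["libero_10_no_noops_lerobotv2.1/*"],
    local_dir="./datasets",
)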

Directory structure after download:

libero_10_no_noops_lerobotv2.1/
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── ...
├── videos/
│   └── chunk-000/
│       ├── observation.images.image/
│       │   ├── episode_000000.mp4
│       │   └── ...
│       └── observation.images.wrist_image/
│           ├── episode_000000.mp4
│           └── ...
└── meta/
    ├── episodes.jsonl
    ├── episodes_stats.jsonl
    ├── info.json
    └── tasks.jsonl

The LIBERO data includes two camera viewpoints: image (the primary view) and wrist_image (the wrist-mounted view). Each frame contains an 8-dimensional state (end-effector pose + gripper) and a 7-dimensional action (end-effector position + rotation + gripper).
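
For a quick sanity check of the layout, an episode parquet can be inspected directly (a sketch; assumes pandas with a parquet engine is installed, and column names follow the LeRobot v2.1 convention):

import pandas as pd

df = pd.read_parquet(
    "datasets/libero_10_no_noops_lerobotv2.1/data/chunk-000/episode_000000.parquet"
)
print(df.columns.tolist())              # available keys, e.g. observation.state, action
print(len(df["observation.state"][0])) # state dimensionality per frame
print(len(df["action"][0]))            # action dimensionality per frame (7)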

GR00T-N1.5 LIBERO Training#

Training uses the provided configuration file configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py; its key configuration items are described below.

1. Model Configuration#

model = dict(
    type='LlavaVLA',
    pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',
    vlm_backbone=dict(
        type='EagleBackbone',
        vlm_path='FluxVLA/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=8,              # State input dimension
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_inference_timesteps=4,
        traj_length=10,           # Predict 10 future action steps
        action_dim=7),            # LIBERO action dimension
    freeze_vlm_backbone=False,
    name_mapping={
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)

In the LIBERO scenario, state_dim=8 (6-dimensional end-effector pose + 2-dimensional gripper state) and action_dim=7 (6-dimensional end-effector delta + 1-dimensional gripper).
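
At inference time the flow-matching head integrates a learned velocity field from Gaussian noise to an action trajectory; num_inference_timesteps=4 corresponds to four Euler steps. A schematic sketch, where velocity_model is a hypothetical stand-in for the head's forward pass:

import torch

def sample_actions(velocity_model, cond, traj_length=10, action_dim=7, num_steps=4):
    """Euler integration of a learned velocity field from noise to actions."""
    x = torch.randn(1, traj_length, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        v = velocity_model(x, torch.tensor([t]), cond)  # predicted velocity at time t
        x = x + dt * v                                  # Euler step toward the data
        t += dt
    return x  # normalized actions of shape (1, traj_length, action_dim)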

2. Data Configuration#

train_dataloader = dict(
    batch_size=128,
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={
            'observation.state': ['proprio'],
            'action': ['action']
        },
        statistic_keys=['observation.state', 'timestamp', 'action'],
        statistic_name='libero_10_no_noops',
        datasets=dict(
            type='ParquetDataset',
            data_root_path='/path/to/data/LIBERO_lerobot/libero_10_no_noops_1.0.0_lerobot',
            transforms=[
                dict(
                    type='ProcessParquetInputs',
                    embodiment_id=31,
                    parquet_keys=[
                        'observation.state', 'timestamp', 'actions',
                        'info', 'stats', 'action_masks'
                    ],
                    video_keys=[
                        'observation.images.image',
                        'observation.images.wrist_image',
                    ],
                    name_mappings={
                        'observation.state': ['states'],
                        'actions': ['actions']
                    }),
                dict(type='ParquetPrompter'),
                dict(
                    type='ProcessPromptsWithImage',
                    max_len=600,
                    num_images=2,
                    tokenizer=dict(
                        type='PretrainedTokenizer',
                        model_path='/path/to/models/eagle2_hg_model',
                    )),
                dict(type='ResizeImages', height=224, width=224),
                dict(
                    type='NormalizeImages',
                    means=[[123.515625, 116.04492188, 103.59375],
                           [123.515625, 116.04492188, 103.59375]],
                    stds=[[58.27148438, 57.02636719, 57.27539062],
                          [58.27148438, 57.02636719, 57.27539062]]),
                dict(
                    type='NormalizeStatesAndActions',
                    action_dim=14,
                    state_key='proprio',
                    action_key='action',
                    use_quantiles=False)
            ],
            action_window_size=10,
            action_key='action',
            use_delta=False,
            statistic_name='libero_10_no_noops',
            window_start_idx=0)))

The data transformation pipeline executes the following steps in sequence:

  1. ProcessParquetInputs — Parses states, actions, and images from Parquet files

  2. ParquetPrompter — Extracts task descriptions from the data as language prompts

  3. ProcessPromptsWithImage — Encodes text and images into model input tokens

  4. ResizeImages — Resizes all images to a uniform 224×224 resolution

  5. NormalizeImages — Applies the standard normalization parameters for the Eagle model

  6. NormalizeStatesAndActions — Normalizes states and actions
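
Conceptually, every transform maps a sample dict to a sample dict, so the pipeline composes like the sketch below (illustrative only; FluxVLA builds the actual transform objects from the config dicts above):

from typing import Callable

class Compose:
    """Apply a list of dict-to-dict transforms in order."""

    def __init__(self, transforms: list[Callable[[dict], dict]]):
        self.transforms = transforms

    def __call__(self, sample: dict) -> dict:
        for transform in self.transforms:
            sample = transform(sample)
        return sample

# pipeline = Compose([parse_parquet, prompt, tokenize, resize, norm_images, norm_states])
# model_inputs = pipeline(raw_sample)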

3. Training Configuration#

runner = dict(
    type='FSDPTrainRunner',
    max_epochs=24,
    learning_rate=2e-5,
    weight_decay=0.0,
    max_grad_norm=1.0,
    sampler=None,
    tokenizer=dict(
        type='PretrainedTokenizer',
        model_path='/path/to/models/eagle2_hg_model',
    ),
    collator=dict(
        type='DictCollator',
        keys=[
            'states', 'observation.eepose', 'timestamp', 'images',
            'img_masks', 'lang_tokens', 'lang_masks', 'actions',
            'action_masks', 'embodiment_ids'
        ],
        meta_keys=['task_description', 'prompt', 'info', 'stats']),
    metric=dict(
        type='VLAMetric',
        active_trackers=('jsonl', 'wandb'),
        run_dir='work_dirs',
        wandb_project='FluxVLA',
        wandb_entity='limx',
        grad_accumulation_steps=1,
        window_size=1),
    lr_scheduler_type='constant',
    warmup_ratio=0.0,
    enable_gradient_checkpointing=True,
    enable_mixed_precision_training=True,
    mixed_precision_dtype='bf16',
    sharding_strategy='full-shard',
    change_key_name=False)

Training employs FSDP (Fully Sharded Data Parallel) for distributed training, with BF16 mixed precision and a constant learning rate of 2e-5. Because the LIBERO dataset is relatively small, the configuration trains for 24 epochs to reach sufficient convergence.
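
With batch_size=128 and per_device_batch_size=8, the number of gradient-accumulation micro-batches follows from the world size; on a single 8-GPU node that works out to 2 per optimizer step (a sketch of the arithmetic, assuming the runner derives accumulation this way):

global_batch_size = 128      # batch_size in the config
per_device_batch_size = 8
num_gpus = 8                 # single node in this example

samples_per_micro_step = per_device_batch_size * num_gpus        # 64
grad_accumulation_steps = global_batch_size // samples_per_micro_step
print(grad_accumulation_steps)  # 2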

4. Evaluation Configuration#

eval = dict(
    type='LiberoEvalRunner',
    task_suite_name='libero_10',
    model_family='pi0',
    eval_chunk_size=10,
    resize_size=224,
    num_trials_per_task=50,
    num_steps_wait=10,
    seed=7,
    dataset=dict(
        type='LiberoParquetEvalDataset',
        transforms=[
            dict(
                type='ProcessLiberoEvalInputs',
                embodiment_id=31,
                img_keys=['agentview_image', 'robot0_eye_in_hand_image']),
            dict(
                type='TransformImage',
                image_resize_strategy='resize-naive',
                input_sizes=[[3, 224, 224], [3, 224, 224]],
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]]),
            dict(
                type='ProcessPromptsWithImage',
                max_len=600,
                num_images=2,
                tokenizer=dict(
                    type='PretrainedTokenizer',
                    model_path='/path/to/models/eagle2_hg_model',
                )),
            dict(
                type='LiberoProprioFromInputs',
                use_quantiles=False,
                pos_key='robot0_eef_pos',
                quat_key='robot0_eef_quat',
                gripper_key='robot0_gripper_qpos',
                out_key='states'),
        ]),
    denormalize_action=dict(
        type='DenormalizeLiberoAction',
        use_quantiles=False,
    ))

Evaluation runs in the LIBERO simulation environment, executing 50 trials for each task in libero_10 and reporting the success rate.
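
In outline, the runner iterates over the suite's tasks, waits num_steps_wait simulation steps for objects to settle, then executes predicted actions in chunks of eval_chunk_size until success or timeout. A schematic sketch with hypothetical helper names (not the LiberoEvalRunner API):

def evaluate_suite(make_env, policy, tasks, num_trials=50,
                   num_steps_wait=10, chunk_size=10, max_steps=500):
    no_op = [0.0] * 6 + [-1.0]  # dummy action convention (assumed)
    results = {}
    for task in tasks:
        successes = 0
        for trial in range(num_trials):
            env, obs = make_env(task, seed=7 + trial)
            for _ in range(num_steps_wait):        # let simulated objects settle
                obs, _, _, _ = env.step(no_op)
            done, steps = False, 0
            while not done and steps < max_steps:  # max_steps: suite-dependent horizon
                chunk = policy(obs, task.language) # (chunk_size, action_dim)
                for action in chunk[:chunk_size]:
                    obs, _, done, _ = env.step(action)
                    steps += 1
                    if done:
                        break
            successes += int(done)
        results[task.name] = successes / num_trials
    return results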

5. Launching Training#

cd /path/to/fluxvla

# Set environment variables (single node, 8 GPUs)
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
    work_dirs/gr00t_eagle_3b_libero_10_full_train

On the Volcano Engine MLP platform, these environment variables are injected automatically, so you can simply run:

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
    work_dirs/gr00t_eagle_3b_libero_10_full_train

6. Evaluation#

After training is complete, use scripts/eval.sh to perform evaluation:

export MLP_WORKER_GPU=1
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/eval.sh \
    configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
    work_dirs/gr00t_eagle_3b_libero_10_full_train/checkpoint_epoch_24.pt

Alternatively, append the --eval-after-train flag during training to trigger automatic evaluation upon completion:

bash scripts/train.sh \
    configs/gr00t/gr00t_eagle_3b_libero_10_full_train.py \
    work_dirs/gr00t_eagle_3b_libero_10_full_train \
    --eval-after-train

7. Training Outputs#

Upon completion, the following files are generated under the work_dirs/gr00t_eagle_3b_libero_10_full_train/ directory:

  • checkpoint_*.pt — Model checkpoints

  • dataset_statistics.json — Dataset statistics (used during evaluation and inference)

  • config.yaml / config.json — Training configuration backups

  • *.jsonl — Training logs (e.g., loss curves)
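
Because the *.jsonl logs are line-delimited JSON, loss curves can be inspected with a few lines of Python (a sketch; the file and field names below are assumptions that depend on VLAMetric's output):

import json

losses = []
with open("work_dirs/gr00t_eagle_3b_libero_10_full_train/metrics.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "loss" in record:        # field name assumed
            losses.append(record["loss"])

print(f"{len(losses)} logged steps, final loss {losses[-1]:.4f}")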

π₀ LIBERO Training#

1. Training Strategy Selection#

π₀.₅ supports two fine-tuning strategies:

| Strategy | Full Finetune | LoRA Finetune |
| --- | --- | --- |
| Configuration File | pi05_paligemma_libero_10_full_finetune_pytorch.py | pi05_paligemma_libero_10_lora_finetune.py |
| Training Runner | FSDPTrainRunner | DDPTrainRunner |
| Backbone Freezing | Not frozen (freeze_vlm_backbone=False) | VLM and Expert frozen |
| Memory Requirements | Higher | Lower |
| Applicable Scenarios | Large-scale data, optimal performance | Limited data, constrained memory, rapid experimentation |

2. Full Finetune Configuration Details#

This section uses the configuration file configs/pi05/pi05_paligemma_libero_10_full_finetune_pytorch.py.

2.1 Model Configuration#

model = dict(
    type='PI05FlowMatching',
    vlm_backbone=dict(
        type='PaliGemma',
        vlm_backbone_id='paligemma_3b_pt_224',
        vlm_config=dict(...)),       # PaliGemma detailed config (usually no changes needed)
    proj_width=1024,                 # Projection dimension
    n_action_steps=10,               # Predict 10 future action steps
    state_proj=dict(type='LinearProjector', in_dim=8, out_dim=1024),
    action_in_proj=dict(type='LinearProjector', in_dim=7, out_dim=1024),
    action_out_proj=dict(type='LinearProjector', in_dim=1024, out_dim=7),
    action_time_mlp_in=dict(type='LinearProjector', in_dim=2048, out_dim=1024),
    action_time_mlp_out=dict(type='LinearProjector', in_dim=1024, out_dim=1024),
    llm_expert=dict(
        type='GemmaLLMBackbone',
        llm_backbone_id='gemma-2b_causal',
        llm_family='gemma',
        llm_config=dict(...)),       # Gemma Expert detailed config (usually no changes needed)
    freeze_vlm_backbone=False,
    pretrained_name_or_path=
        '/path/to/cache/openpi/openpi-assets/checkpoints/pi0_libero_pytorch/model.safetensors',
    name_mapping={
        'vlm_backbone.vlm.model.language_model':
            'model.paligemma_with_expert.paligemma.model.language_model',
        'vlm_backbone.vlm.model.vision_tower':
            'model.paligemma_with_expert.paligemma.model.vision_tower',
        'vlm_backbone.vlm.model.multi_modal_projector':
            'model.paligemma_with_expert.paligemma.model.multi_modal_projector',
        'action_time_mlp_in.projector': 'model.action_time_mlp_in',
        'action_time_mlp_out.projector': 'model.action_time_mlp_out',
        'llm_expert.llm.model':
            'model.paligemma_with_expert.gemma_expert.model',
        'state_proj.projector': 'model.state_proj',
        'action_in_proj.projector': 'model.action_in_proj',
        'action_out_proj.projector': 'model.action_out_proj'
    })
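
The name_mapping entries translate key prefixes between the openpi checkpoint layout and FluxVLA's module tree. A minimal sketch of how such a remap could be applied to a loaded safetensors state dict (illustrative; FluxVLA performs this internally):

from safetensors.torch import load_file

def remap_keys(state_dict, name_mapping):
    """Rename checkpoint keys by longest-prefix match against name_mapping."""
    inverse = {ckpt: ours for ours, ckpt in name_mapping.items()}
    remapped = {}
    for key, tensor in state_dict.items():
        for ckpt_prefix, our_prefix in sorted(inverse.items(), key=lambda kv: -len(kv[0])):
            if key.startswith(ckpt_prefix):
                key = our_prefix + key[len(ckpt_prefix):]
                break
        remapped[key] = tensor
    return remapped

# state = load_file(".../pi0_libero_pytorch/model.safetensors")
# state = remap_keys(state, model["name_mapping"])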

Key dimension parameters:

| Parameter | LIBERO | Description |
| --- | --- | --- |
| state_proj.in_dim | 8 | State input dimension (6-dimensional end-effector pose + 2-dimensional gripper) |
| action_in_proj.in_dim | 7 | Action input dimension |
| action_out_proj.out_dim | 7 | Action output dimension |
| n_action_steps | 10 | Length of the action sequence predicted per step |
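
These dimensions chain together inside the action expert: actions are projected from 7 to 1024 dimensions, concatenated with a time embedding (hence in_dim=2048 for action_time_mlp_in), passed through the two-layer MLP, and decoded back to 7 dimensions. A shape-only sketch (illustrative; the 1024-dimensional time embedding and SiLU nonlinearity are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

action_in_proj = nn.Linear(7, 1024)         # per-step action -> hidden
action_time_mlp_in = nn.Linear(2048, 1024)  # concat(action, time) -> hidden
action_time_mlp_out = nn.Linear(1024, 1024)
action_out_proj = nn.Linear(1024, 7)        # hidden -> per-step action

actions = torch.randn(1, 10, 7)             # n_action_steps=10 noisy actions
time_emb = torch.randn(1, 10, 1024)         # time embedding broadcast per step

x = action_in_proj(actions)                 # (1, 10, 1024)
x = torch.cat([x, time_emb], dim=-1)        # (1, 10, 2048)
x = action_time_mlp_out(F.silu(action_time_mlp_in(x)))  # (1, 10, 1024)
out = action_out_proj(x)                    # (1, 10, 7)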

2.2 Data Configuration#

train_dataloader = dict(
    batch_size=128,
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        name_mappings={
            'observation.state': ['proprio'],
            'action': ['action']
        },
        statistic_keys=['observation.state', 'timestamp', 'action'],
        statistic_name='libero_10_no_noops',
        datasets=dict(
            type='ParquetDataset',
            data_root_path='/path/to/data/LIBERO_lerobot/libero_10_no_noops_1.0.0_lerobot',
            transforms=[
                dict(
                    type='ProcessParquetInputs',
                    parquet_keys=[
                        'observation.state', 'timestamp', 'actions',
                        'info', 'stats', 'action_masks'
                    ],
                    video_keys=[
                        'observation.images.image',
                        'observation.images.wrist_image',
                    ],
                    name_mappings={
                        'observation.state': ['states'],
                        'actions': ['actions']
                    }),
                dict(type='ParquetPrompter'),
                dict(
                    type='ProcessPrompts',
                    tokenizer=dict(
                        type='PretrainedTokenizer',
                        model_path='/path/to/checkpoints/pi0',
                    )),
                dict(type='ResizeImages', height=224, width=224),
                dict(
                    type='NormalizeImages',
                    means=[[123.515625, 116.04492188, 103.59375],
                           [123.515625, 116.04492188, 103.59375]],
                    stds=[[58.27148438, 57.02636719, 57.27539062],
                          [58.27148438, 57.02636719, 57.27539062]]),
                dict(
                    type='NormalizeStatesAndActions',
                    action_dim=14,
                    state_key='proprio',
                    action_key='action',
                    use_quantiles=False)
            ],
            action_window_size=10,
            action_key='action',
            use_delta=False,
            statistic_name='libero_10_no_noops',
            window_start_idx=0)))

Note

π₀ uses ProcessPrompts instead of GR00T's ProcessPromptsWithImage for text processing. The tokenizer paths also differ: π₀ loads the PaliGemma tokenizer from the pi0 checkpoint directory, whereas GR00T uses the eagle2_hg_model path.

2.3 Training Configuration#

runner = dict(
    type='FSDPTrainRunner',
    max_epochs=24,
    learning_rate=2e-5,
    weight_decay=0.0,
    max_grad_norm=1.0,
    collator=dict(
        type='DictCollator',
        keys=[
            'states', 'observation.eepose', 'timestamp', 'images',
            'img_masks', 'lang_tokens', 'lang_masks', 'actions',
            'action_masks'
        ],
        meta_keys=['task_description', 'prompt', 'info', 'stats']),
    sampler=None,
    metric=dict(
        type='VLAMetric',
        active_trackers=('jsonl', 'wandb'),
        run_dir='work_dirs',
        wandb_project='FluxVLA',
        wandb_entity='limx',
        grad_accumulation_steps=1,
        window_size=1),
    lr_scheduler_type='constant',
    warmup_ratio=0.0,
    enable_gradient_checkpointing=True,
    enable_mixed_precision_training=True,
    mixed_precision_dtype='bf16',
    sharding_strategy='full-shard',
    change_key_name=False)

2.4 Evaluation Configuration#

eval = dict(
    type='LiberoEvalRunner',
    task_suite_name='libero_10',
    model_family='pi0',
    eval_chunk_size=10,
    resize_size=224,
    num_trials_per_task=50,
    num_steps_wait=10,
    seed=7,
    dataset=dict(
        type='LiberoParquetEvalDataset',
        transforms=[
            dict(
                type='ProcessLiberoEvalInputs',
                img_keys=['agentview_image', 'robot0_eye_in_hand_image']),
            dict(
                type='TransformImage',
                image_resize_strategy='resize-naive',
                input_sizes=[[3, 224, 224], [3, 224, 224]],
                means=[[123.515625, 116.04492188, 103.59375],
                       [123.515625, 116.04492188, 103.59375]],
                stds=[[58.27148438, 57.02636719, 57.27539062],
                      [58.27148438, 57.02636719, 57.27539062]]),
            dict(
                type='LiberoPromptFromInputs',
                tokenizer=dict(
                    type='PretrainedTokenizer',
                    model_path='/path/to/checkpoints/pi0',
                )),
            dict(
                type='LiberoProprioFromInputs',
                use_quantiles=False,
                pos_key='robot0_eef_pos',
                quat_key='robot0_eef_quat',
                gripper_key='robot0_gripper_qpos',
                out_key='states'),
        ]),
    denormalize_action=dict(
        type='DenormalizeLiberoAction',
        use_quantiles=False))

3. LoRA Finetune Configuration Details#

This section uses the configuration file configs/pi0/pi0_paligemma_libero_10_lora_finetune.py.

Key differences between LoRA mode and Full Finetune:

model = dict(
    type='PI0FlowMatching',
    # ... base configuration same as above ...
    freeze_vlm_backbone=True,        # Freeze VLM backbone
    freeze_llm_expert=True,          # Freeze action expert
    use_lora=True,                   # Enable LoRA
    lora_rank=32,                    # LoRA rank
    lora_dropout=0.0,
    lora_target_modules=[            # Modules targeted by LoRA
        'q_proj', 'v_proj', 'k_proj', 'o_proj',
        'state_proj.projector',
        'action_in_proj.projector',
        'action_out_proj.projector'
    ])
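
With lora_rank=32, each targeted linear layer keeps its frozen base weight and learns a low-rank residual on top. A generic sketch of the standard LoRA formulation (not FluxVLA's internal implementation; the alpha scaling is an assumption):

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank residual (standard LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 32.0,
                 dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # update starts at zero
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))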

The training runner uses DDP instead of FSDP:

runner = dict(
    type='DDPTrainRunner',           # LoRA uses DDP
    learning_rate=0.0005,            # LoRA typically uses a higher learning rate
    max_epochs=18,
    save_epoch_interval=1,
    # ... remainder same as Full Finetune ...
)

4. Launching Training#

cd /path/to/fluxvla

# Set environment variables (single node, 8 GPUs)
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

# Full Finetune
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_libero_10_full_finetune_pytorch.py \
    work_dirs/pi0_paligemma_libero_10_full_finetune_pytorch

# Or LoRA Finetune
bash scripts/train.sh \
    configs/pi0/pi0_paligemma_libero_10_lora_finetune.py \
    work_dirs/pi0_paligemma_libero_10_lora_finetune

5. Evaluation#

export MLP_WORKER_GPU=1
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/eval.sh \
    configs/pi0/pi0_paligemma_libero_10_full_finetune_pytorch.py \
    work_dirs/pi0_paligemma_libero_10_full_finetune_pytorch/checkpoint_epoch_24.pt