Model Configuration

Model configuration is defined as a dict. The core fields include:

  • type: Model type name, specifying the VLA model architecture to use

  • pretrained_name_or_path: Path to pre-trained weights or a HuggingFace model name

  • vlm_backbone: Vision-language model backbone configuration, defining the foundation model for vision-language understanding

  • vla_head: VLA action prediction head configuration (if applicable), defining the structure and hyperparameters of the action generation module

  • freeze_vlm_backbone / freeze_projector: Whether to freeze the corresponding module's weights, controlling which parameters receive gradient updates during training

  • name_mapping: Parameter-name mapping between the pre-trained checkpoint and the current model, used when loading weights saved under a different naming convention (see the remapping sketch after the example below)

Below is a complete model configuration example:

model = dict(                                   # Model configuration dict
    type='LlavaVLA',                            # Model type, using the LlavaVLA architecture
    pretrained_name_or_path='/path/to/models/GR00T-N1.5-3B',  # Pre-trained weight path or HuggingFace model name (here GR00T-N1.5-3B)
    vlm_backbone=dict(                          # Vision-language model backbone configuration
        type='EagleBackbone',                   # Backbone type, using the Eagle architecture
        vlm_path='limvla/models/third_party_models/eagle2_hg_model'),  # VLM model path
    vla_head=dict(                              # VLA action head configuration
        type='FlowMatchingHead',                # Action head type, using the Flow Matching method
        state_dim=8,                            # Robot state dimension (e.g., joint angles + gripper state)
        hidden_size=1024,                       # Hidden layer dimension
        input_embedding_dim=1536,               # Input embedding dimension (feature dim from VLM)
        num_layers=1,                           # Number of Transformer layers
        num_heads=4,                            # Number of attention heads
        num_inference_timesteps=4,              # Denoising steps during inference (more = more accurate but slower)
        traj_length=10,                         # Predicted action trajectory length (future frames)
        action_dim=7),                          # Action dimension (e.g., 6DoF pose + gripper)
    freeze_vlm_backbone=False,                  # Whether to freeze VLM backbone weights
    name_mapping={                              # Weight name mapping (for loading pre-trained weights)
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)                     # Whether to freeze projection layer weights
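
The name_mapping remapping is applied when the checkpoint is loaded. The following is a minimal sketch, not the framework's actual loader; the helper remap_state_dict and the assumed key direction (current-model prefix as dict key, checkpoint prefix as dict value) are inferred from the example above:

def remap_state_dict(state_dict, name_mapping):
    """Rename checkpoint keys so they match the current model's namespace.

    Assumes name_mapping maps current-model prefixes to checkpoint
    prefixes, so it is inverted before rewriting each key.
    """
    inverted = {ckpt: model for model, ckpt in name_mapping.items()}
    remapped = {}
    for key, value in state_dict.items():
        for ckpt_prefix, model_prefix in inverted.items():
            if key.startswith(ckpt_prefix):
                key = model_prefix + key[len(ckpt_prefix):]
                break
        remapped[key] = value
    return remapped

# With the mapping above, a checkpoint key such as
# 'backbone.eagle_model.layers.0.weight' would be loaded into
# 'vlm_backbone.vlm.layers.0.weight', and 'action_head.*' into 'vla_head.*'.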

Inference model configuration (optional): when the model architecture used for inference differs from the one used for training, or when certain pre-trained weights should not be loaded at inference time, use inference_model to specify a separate inference configuration. Its format is identical to model.
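
For example, a hypothetical inference_model that reuses the training architecture but reduces the denoising steps for speed; the specific values and the omission of pretrained_name_or_path are illustrative assumptions, not recommendations:

inference_model = dict(                         # Inference-only model configuration
    type='LlavaVLA',                            # Same architecture as training
    vlm_backbone=dict(
        type='EagleBackbone',
        vlm_path='limvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=8,
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_inference_timesteps=2,              # Fewer denoising steps for faster inference (illustrative value)
        traj_length=10,
        action_dim=7))
# pretrained_name_or_path is omitted here, assuming weights are restored
# from the training checkpoint rather than the original pre-trained model.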