Model Configuration
Model configuration is defined using a dict. The core fields include:

- type: Model type name, specifying the VLA model architecture to use.
- pretrained_name_or_path: Path to pre-trained weights, or a HuggingFace model name.
- vlm_backbone: Vision-language model backbone configuration, defining the foundation model used for vision-language understanding.
- vla_head: VLA action prediction head configuration (if applicable), defining the structure and hyperparameters of the action generation module.
- freeze_vlm_backbone / freeze_projector: Whether to freeze the corresponding module weights, controlling which parts of the model receive gradient updates during training.
- name_mapping: Key mapping between pre-trained checkpoint weights and the current model, used to load pre-trained models whose modules follow different naming conventions (see the sketch after this list).
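To make name_mapping concrete, the helper below is a hypothetical sketch of how such a prefix mapping is typically applied when loading a checkpoint; the actual loading logic lives inside the framework. It assumes the dict maps the current model's module prefixes to the checkpoint's prefixes, as in the example further down.

import torch

def load_with_name_mapping(model, ckpt_path, name_mapping):
    # Hypothetical helper: rewrite checkpoint key prefixes so they match
    # the current model's module names, then load the remapped weights.
    state_dict = torch.load(ckpt_path, map_location='cpu')  # assumes a flat state_dict
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for model_prefix, ckpt_prefix in name_mapping.items():
            if key.startswith(ckpt_prefix):
                new_key = model_prefix + key[len(ckpt_prefix):]
                break
        remapped[new_key] = value
    # strict=False tolerates modules trained from scratch (e.g., a new head)
    return model.load_state_dict(remapped, strict=False)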
Below is a complete model configuration example:
model = dict(  # Model configuration dict
    type='LlavaVLA',  # Model type, using the LlavaVLA architecture
    pretrained_name_or_path=  # Pre-trained model name or local path
        '/path/to/models/GR00T-N1.5-3B',  # Pre-trained weight path (GR00T-N1.5-3B)
    vlm_backbone=dict(  # Vision-language model backbone configuration
        type='EagleBackbone',  # Backbone type, using the Eagle architecture
        vlm_path=  # VLM model path
            'limvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(  # VLA action head configuration
        type='FlowMatchingHead',  # Action head type, using the flow matching method
        state_dim=8,  # Robot state dimension (e.g., joint angles + gripper state)
        hidden_size=1024,  # Hidden layer dimension
        input_embedding_dim=1536,  # Input embedding dimension (feature dim from the VLM)
        num_layers=1,  # Number of Transformer layers
        num_heads=4,  # Number of attention heads
        num_inference_timesteps=4,  # Denoising steps at inference (more = more accurate but slower)
        traj_length=10,  # Predicted action trajectory length (future frames)
        action_dim=7),  # Action dimension (e.g., 6-DoF pose + gripper)
    freeze_vlm_backbone=False,  # Whether to freeze VLM backbone weights
    name_mapping={  # Weight name mapping (for loading pre-trained weights)
        'vlm_backbone.vlm': 'backbone.eagle_model',
        'vla_head': 'action_head'
    },
    freeze_projector=False)  # Whether to freeze projection layer weights
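The type= fields follow the registry pattern common to OpenMMLab-style codebases: a config dict is turned into an object by looking type up in a registry and passing the remaining keys as constructor arguments. Below is a minimal runnable sketch assuming an mmengine-style registry; the stand-in LlavaVLA class and its placeholder backbone are illustrative, not the project's real implementation.

import torch.nn as nn
from mmengine.registry import Registry

MODELS = Registry('model')  # a fresh registry just for this sketch

@MODELS.register_module()
class LlavaVLA(nn.Module):  # stand-in; the real constructor takes the fields shown above
    def __init__(self, freeze_vlm_backbone=False, **kwargs):
        super().__init__()
        self.vlm_backbone = nn.Linear(4, 4)  # placeholder for the VLM backbone
        if freeze_vlm_backbone:
            # freeze flags typically translate into disabling gradients
            self.vlm_backbone.requires_grad_(False)

cfg = dict(type='LlavaVLA', freeze_vlm_backbone=True)
model = MODELS.build(cfg)  # 'type' selects the class; remaining keys become kwargs
print(any(p.requires_grad for p in model.vlm_backbone.parameters()))  # -> False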
Inference model configuration (optional): when the architecture used at inference differs from the one used for training, or when certain weight loading should be skipped, use inference_model to specify the inference model separately. Its format is identical to that of model.
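For illustration, the sketch below reuses the architecture above and changes only inference-time settings. The specific values, and the idea that pretrained_name_or_path=None skips checkpoint loading, are assumptions made for this example, not confirmed project behavior.

inference_model = dict(  # built only for inference
    type='LlavaVLA',
    pretrained_name_or_path=None,  # assumption: skip loading the training checkpoint
    vlm_backbone=dict(
        type='EagleBackbone',
        vlm_path='limvla/models/third_party_models/eagle2_hg_model'),
    vla_head=dict(
        type='FlowMatchingHead',
        state_dim=8,
        hidden_size=1024,
        input_embedding_dim=1536,
        num_layers=1,
        num_heads=4,
        num_inference_timesteps=10,  # illustrative: more denoising steps at deployment
        traj_length=10,
        action_dim=7))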