Model Training#

(scripts/train.sh Usage Guide)

Overview#

scripts/train.sh is a shell script for launching distributed training. It wraps the torchrun command to facilitate model training in multi-node, multi-GPU environments.

Basic Usage#

bash scripts/train.sh [CONFIG] [WORK_DIR] [additional arguments...]

Pre-trained Weight Downloads#

You can use all of a pre-trained model's modules, only a subset of them, or assemble modules from different models in a plug-and-play fashion. Select the desired model weights from the tables below and download them to the ./checkpoints directory.

VLA Models#

| Model | Size | Download Link |
|-------|------|---------------|
| GR00T N1.5 | 3B | nvidia/GR00T-N1.5-3B |
| OpenVLA | 7B | openvla/openvla-7b-finetuned-libero-10 |
| PI0_base | 3B | yinchimaoliang/pi0_base |
| PI05_base | 3B | yinchimaoliang/pi05_base |
| PI05_libero | 3B | yinchimaoliang/pi05_libero |

Vision-Language Models (VLM)#

| Model | Size | Download Link |
|-------|------|---------------|
| Qwen2.5-VL | 3B | Qwen/Qwen2.5-VL-3B-Instruct |

Large Language Models (LLM)#

| Model | Size | Download Link |
|-------|------|---------------|
| Qwen 2.5 | 3B | Qwen/Qwen2.5-3B |
| Qwen 2.5 | 7B | Qwen/Qwen2.5-7B |
| Llama 2 | 7B | meta-llama/Llama-2-7b-hf |

Vision Backbones#

| Model | Download Link |
|-------|---------------|
| ViT-Large (DINOv2) | timm/vit_large_patch14_reg4_dinov2.lvd142m |
| ViT-SO400M (SigLIP) | timm/ViT-SO400M-14-SigLIP |
| SigLIP2 | google/siglip2-base-patch16-224 |
| PaliGemma | google/paligemma-3b-pt-224 |

Tip: Use huggingface-cli download <model-name> --local-dir ./checkpoints/<model-name> to download the desired model weights.
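For example, to fetch the Qwen2.5-VL backbone listed above (the target directory name is just a convention; any path under ./checkpoints works):

huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ./checkpoints/Qwen2.5-VL-3B-Instruct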

Script Parameters#

Positional Arguments#

| Argument | Position | Default Value | Description |
|----------|----------|---------------|-------------|
| CONFIG | $1 | configs/pi05/pi05_paligemma_libero10_full_finetune.py | Path to the training configuration file |
| WORK_DIR | $2 | work_dirs/pi05_paligemma_libero10_full_finetune | Working directory for logs and checkpoints |
| Additional arguments | $3+ | None | Passed directly to train.py |
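
Because both positional arguments have defaults, the two invocations below are equivalent:

# Uses the default CONFIG and WORK_DIR
bash scripts/train.sh

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune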

Environment Variables (Required for Distributed Training)#

These environment variables configure the distributed training parameters for torchrun:

| Environment Variable | Description | Example |
|----------------------|-------------|---------|
| MLP_WORKER_GPU | Number of GPUs per node | 8 |
| MLP_WORKER_NUM | Total number of nodes participating in training | 2 |
| MLP_ROLE_INDEX | Rank index of the current node (starting from 0) | 0 |
| MLP_WORKER_0_HOST | IP address or hostname of the master node (rank 0) | {MASTER_NODE_IP} |
| MLP_WORKER_0_PORT | Communication port of the master node | 29500 |
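
To make the mapping concrete, below is a rough sketch of the torchrun invocation that scripts/train.sh assembles from these variables. The flag names are standard torchrun options, but the exact wrapper logic (including how CONFIG, WORK_DIR, and extra arguments are forwarded to train.py) is defined in the script itself:

# Illustrative sketch only -- see scripts/train.sh for the actual command
torchrun \
    --nproc_per_node="${MLP_WORKER_GPU}" \
    --nnodes="${MLP_WORKER_NUM}" \
    --node_rank="${MLP_ROLE_INDEX}" \
    --master_addr="${MLP_WORKER_0_HOST}" \
    --master_port="${MLP_WORKER_0_PORT}" \
    train.py "$@"   # positional CONFIG/WORK_DIR plus any extra arguments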

Additional Arguments Supported by train.py#

The following arguments can be passed as additional arguments to train.py:

| Argument | Type | Description |
|----------|------|-------------|
| --cfg-options | key=value pairs | Override configuration file settings in the format xxx=yyy |
| --eval-after-train | flag | Automatically run evaluation after training completes |
| --resume-from | path | Resume training from the specified checkpoint file |

Configuring Weights & Biases (wandb)#

Weights & Biases is used for experiment tracking and visualization. Configuration steps are as follows:

1. Install wandb#

wandb is included in requirements.txt and can also be installed manually:

pip install wandb

2. Log in to wandb#

wandb login

3. Set Environment Variables#

export WANDB_PROJECT=fluxvla        # Project name (default: fluxvla)
export WANDB_ENTITY=your-team-name  # Team or username (default: None)
export WANDB_MODE=online            # online, offline, or disabled (default: online)
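
These variables can also be set inline for a single run, so a one-off offline experiment does not change your shell environment:

WANDB_MODE=offline bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune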

4. Disable wandb Logging#

To disable wandb logging during training, set:

export WANDB_MODE=disabled

Note: All wandb configurations are read from environment variables; no additional settings in the configuration file are required.

Usage Examples#

1. Single-Node Multi-GPU Training (8 GPUs)#

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

2. Volcengine Cloud Platform Training#

On the Volcengine MLP platform, the environment variables MLP_WORKER_GPU, MLP_WORKER_NUM, MLP_ROLE_INDEX, MLP_WORKER_0_HOST, and MLP_WORKER_0_PORT are automatically injected by the platform. No manual configuration is required; simply execute the command directly.

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

3. Multi-Node Distributed Training (Manual Configuration)#

If not running on the Volcengine platform, environment variables must be configured manually:

Node 0 (Master Node):

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

Node 1:

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=1
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

4. Using Additional Arguments#

# Override configuration parameters
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/custom_run \
    --cfg-options runner.max_epochs=100 runner.save_interval=500

# Resume training from a checkpoint
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_resume \
    --resume-from work_dirs/previous_run/checkpoint_epoch_5.pt

# Automatic evaluation after training
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_with_eval \
    --eval-after-train

5. Combining Multiple Additional Arguments#

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_full \
    --resume-from work_dirs/previous_run/latest.pt \
    --eval-after-train \
    --cfg-options runner.max_epochs=50 train_dataloader.batch_size=16

Output Artifacts#

Upon training completion, the following files are generated in the WORK_DIR directory:

  • config.yaml / config.json - Backup of the training configuration

  • checkpoint_*.pt - Model checkpoint files

  • dataset_statistics.json - Dataset statistics (if applicable)

  • {config_name}_{timestamp}.jsonl - Training log file recording loss curves and other training metrics

  • Tokenizer-related files - Generated if a tokenizer is employed
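
An illustrative layout after a run is shown below; the exact file names depend on the config, checkpointing settings, and timestamps (the checkpoint names follow the patterns used in the examples above):

work_dirs/pi05_paligemma_libero10_full_finetune/
├── config.yaml
├── dataset_statistics.json
├── checkpoint_epoch_5.pt
├── latest.pt
└── pi05_paligemma_libero10_full_finetune_20240101_120000.jsonl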

Important Notes#

  1. Environment Variables Must Be Set: All MLP_* environment variables must be correctly configured before executing the script; otherwise, torchrun will fail to initialize distributed training.

  2. Configuration File Format: Configuration files use the MMEngine Config format (Python files) and must conform to the project’s configuration specifications.

  3. Multi-Node Synchronization: During multi-node training, all nodes must use identical CONFIG and WORK_DIR parameters.

  4. Port Conflicts: Ensure that the port specified by MLP_WORKER_0_PORT is not occupied and is accessible through the firewall.

  5. NCCL Backend: NCCL is used as the default distributed backend. Ensure compatibility between CUDA and NCCL versions.
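
When distributed initialization fails (see notes 1, 4, and 5 above), NCCL's own diagnostics are often the quickest way to locate the problem. These are standard NCCL environment variables, not specific to this project:

# Print NCCL initialization and network details to the training log
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET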