Model Training#

(scripts/train.sh Usage Guide)

Overview#

scripts/train.sh is a shell script for launching distributed training. It wraps the torchrun command to facilitate model training in multi-node, multi-GPU environments.

Basic Usage#

bash scripts/train.sh [CONFIG] [WORK_DIR] [additional arguments...]

Pre-trained Weight Downloads#

You can use all of a pre-trained model's modules, only a subset of them, or assemble modules from different models in a plug-and-play fashion. Select the desired model weights from the tables below and download them to the ./checkpoints directory.

VLA Models#

| Model | Size | Download Link |
|-------|------|---------------|
| GR00T N1.5 | 3B | nvidia/GR00T-N1.5-3B |
| OpenVLA | 7B | openvla/openvla-7b-finetuned-libero-10 |
| PI0_base | 3B | yinchimaoliang/pi0_base |
| PI05_base | 3B | yinchimaoliang/pi05_base |
| PI05_libero | 3B | yinchimaoliang/pi05_libero |

Vision-Language Models (VLM)#

| Model | Size | Download Link |
|-------|------|---------------|
| Qwen2.5-VL | 3B | Qwen/Qwen2.5-VL-3B-Instruct |

Large Language Models (LLM)#

| Model | Size | Download Link |
|-------|------|---------------|
| Qwen 2.5 | 3B | Qwen/Qwen2.5-3B |
| Qwen 2.5 | 7B | Qwen/Qwen2.5-7B |
| Llama 2 | 7B | meta-llama/Llama-2-7b-hf |

Vision Backbones#

| Model | Download Link |
|-------|---------------|
| ViT-Large (DINOv2) | timm/vit_large_patch14_reg4_dinov2.lvd142m |
| ViT-SO400M (SigLIP) | timm/ViT-SO400M-14-SigLIP |
| SigLIP2 | google/siglip2-base-patch16-224 |
| PaliGemma | google/paligemma-3b-pt-224 |

Tip: Use huggingface-cli download <model-name> --local-dir ./checkpoints/<model-name> to download the desired model weights.
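For example, to fetch the Qwen2.5-VL backbone listed above (the target directory name is just a convention; any path under ./checkpoints works):

huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ./checkpoints/Qwen2.5-VL-3B-Instruct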

Script Parameters#

Positional Arguments#

| Argument | Position | Default Value | Description |
|----------|----------|---------------|-------------|
| CONFIG | $1 | configs/pi05/pi05_paligemma_libero10_full_finetune.py | Path to the training configuration file |
| WORK_DIR | $2 | work_dirs/pi05_paligemma_libero10_full_finetune | Working directory for logs and checkpoints |
| Additional arguments | $3+ | None | Passed directly to train.py |
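
Because both positional arguments have defaults, the two invocations below are equivalent:

# Uses the default CONFIG and WORK_DIR
bash scripts/train.sh

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune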

Environment Variables (Required for Distributed Training)#

These environment variables configure the distributed training parameters for torchrun:

| Environment Variable | Description | Example |
|----------------------|-------------|---------|
| MLP_WORKER_GPU | Number of GPUs per node | 8 |
| MLP_WORKER_NUM | Total number of nodes participating in training | 2 |
| MLP_ROLE_INDEX | Rank index of the current node (starting from 0) | 0 |
| MLP_WORKER_0_HOST | IP address or hostname of the master node (rank 0) | {MASTER_NODE_IP} |
| MLP_WORKER_0_PORT | Communication port of the master node | 29500 |
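
To make the mapping concrete, below is a rough sketch of the torchrun invocation that scripts/train.sh assembles from these variables. The flag names are standard torchrun options, but the exact wrapper logic (including how CONFIG, WORK_DIR, and extra arguments are forwarded to train.py) is defined in the script itself:

# Illustrative sketch only -- see scripts/train.sh for the actual command
torchrun \
    --nproc_per_node="${MLP_WORKER_GPU}" \
    --nnodes="${MLP_WORKER_NUM}" \
    --node_rank="${MLP_ROLE_INDEX}" \
    --master_addr="${MLP_WORKER_0_HOST}" \
    --master_port="${MLP_WORKER_0_PORT}" \
    train.py "$@"   # positional CONFIG/WORK_DIR plus any extra arguments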

Additional Arguments Supported by train.py#

The following arguments can be passed as additional arguments to train.py:

| Argument | Type | Description |
|----------|------|-------------|
| --cfg-options | key=value pairs | Override configuration file settings in the format xxx=yyy |
| --eval-after-train | flag | Automatically run evaluation after training completes |
| --resume-from | path | Resume training from the specified checkpoint file |

Configuring Weights & Biases (wandb)#

Weights & Biases is used for experiment tracking and visualization. Configuration steps are as follows:

1. Install wandb#

wandb is included in requirements.txt and can also be installed manually:

pip install wandb

2. Log in to wandb#

wandb login

3. Set Environment Variables#

export WANDB_PROJECT=fluxvla        # Project name (default: fluxvla)
export WANDB_ENTITY=your-team-name  # Team or username (default: None)
export WANDB_MODE=online            # online, offline, or disabled (default: online)
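
These variables can also be set inline for a single run, so a one-off offline experiment does not change your shell environment:

WANDB_MODE=offline bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune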

4. Disable wandb Logging#

To disable wandb logging during training, set:

export WANDB_MODE=disabled

Note: All wandb configurations are read from environment variables; no additional settings in the configuration file are required.

Usage Examples#

1. Single-Node Multi-GPU Training (8 GPUs)#

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

2. Volcengine Cloud Platform Training#

On the Volcengine MLP platform, the environment variables MLP_WORKER_GPU, MLP_WORKER_NUM, MLP_ROLE_INDEX, MLP_WORKER_0_HOST, and MLP_WORKER_0_PORT are automatically injected by the platform. No manual configuration is required; simply execute the command directly.

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

3. Multi-Node Distributed Training (Manual Configuration)#

If not running on the Volcengine platform, environment variables must be configured manually:

Node 0 (Master Node):

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

Node 1:

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=1
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune

4. Using Additional Arguments#

# Override configuration parameters
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/custom_run \
    --cfg-options runner.max_epochs=100 runner.save_interval=500

# Resume training from a checkpoint
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_resume \
    --resume-from work_dirs/previous_run/checkpoint_epoch_5.pt

# Automatic evaluation after training
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_with_eval \
    --eval-after-train

5. Combining Multiple Additional Arguments#

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_full \
    --resume-from work_dirs/previous_run/latest.pt \
    --eval-after-train \
    --cfg-options runner.max_epochs=50 train_dataloader.batch_size=16

Output Artifacts#

Upon training completion, the following files are generated in the WORK_DIR directory:

  • config.yaml / config.json - Backup of the training configuration

  • checkpoint_*.pt - Model checkpoint files

  • dataset_statistics.json - Dataset statistics (if applicable)

  • {config_name}_{timestamp}.jsonl - Training log file recording loss curves and other training metrics

  • Tokenizer-related files - Generated if a tokenizer is employed
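
An illustrative layout after a run is shown below; the exact file names depend on the config, checkpointing settings, and timestamps (the checkpoint names follow the patterns used in the examples above):

work_dirs/pi05_paligemma_libero10_full_finetune/
├── config.yaml
├── dataset_statistics.json
├── checkpoint_epoch_5.pt
├── latest.pt
└── pi05_paligemma_libero10_full_finetune_20240101_120000.jsonl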

Important Notes#

  1. Environment Variables Must Be Set: All MLP_* environment variables must be correctly configured before executing the script; otherwise, torchrun will fail to initialize distributed training.

  2. Configuration File Format: Configuration files use the MMEngine Config format (Python files) and must conform to the project’s configuration specifications.

  3. Multi-Node Synchronization: During multi-node training, all nodes must use identical CONFIG and WORK_DIR parameters.

  4. Port Conflicts: Ensure that the port specified by MLP_WORKER_0_PORT is not occupied and is accessible through the firewall.

  5. NCCL Backend: NCCL is used as the default distributed backend. Ensure compatibility between CUDA and NCCL versions.
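
When distributed initialization fails (see notes 1, 4, and 5 above), NCCL's own diagnostics are often the quickest way to locate the problem. These are standard NCCL environment variables, not specific to this project:

# Print NCCL initialization and network details to the training log
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET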