# Model Training

(`scripts/train.sh` Usage Guide)

## Overview

`scripts/train.sh` is a shell script for launching distributed training. It wraps the `torchrun` command to facilitate model training in multi-node, multi-GPU environments.
## Basic Usage

```bash
bash scripts/train.sh [CONFIG] [WORK_DIR] [additional arguments...]
```
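For reference, the launch logic amounts to mapping the `MLP_*` environment variables (documented below) onto `torchrun` flags. The following is a minimal sketch of that mapping, not the verbatim script; the `train.py` path and the `--work-dir` flag are assumptions based on the argument descriptions below:

```bash
#!/usr/bin/env bash
# Minimal sketch of scripts/train.sh (assumed, not verbatim).
CONFIG=$1      # $1: path to the training configuration file
WORK_DIR=$2    # $2: working directory for logs and checkpoints
shift 2        # everything else is forwarded to train.py

torchrun \
    --nproc_per_node="${MLP_WORKER_GPU}" \
    --nnodes="${MLP_WORKER_NUM}" \
    --node_rank="${MLP_ROLE_INDEX}" \
    --master_addr="${MLP_WORKER_0_HOST}" \
    --master_port="${MLP_WORKER_0_PORT}" \
    train.py "${CONFIG}" --work-dir "${WORK_DIR}" "$@"
```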
## Pre-trained Weight Downloads

You may use all or a subset of the modules from a pre-trained model, and you may also combine modules from different models in a modular fashion. Select the desired model weights from the tables below and download them to the `./checkpoints` directory.
### VLA Models

| Model | Size | Download Link |
|---|---|---|
| GR00T N1.5 | 3B | |
| OpenVLA | 7B | |
| PI0_base | 3B | |
| PI05_base | 3B | |
| PI05_libero | 3B | |
### Vision-Language Models (VLM)

| Model | Size | Download Link |
|---|---|---|
| Qwen2.5-VL | 3B | |
### Large Language Models (LLM)

| Model | Size | Download Link |
|---|---|---|
| Qwen 2.5 | 3B | |
| Qwen 2.5 | 7B | |
| Llama 2 | 7B | |
### Vision Backbones

| Model | Download Link |
|---|---|
| ViT-Large (DINOv2) | |
| ViT-SO400M (SigLIP) | |
| SigLIP2 | |
| paligemma | |
> **Tip**: Use `huggingface-cli download <model-name> --local-dir ./checkpoints/<model-name>` to download the desired model weights.
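For example, to fetch the Qwen2.5-VL weights listed above (assuming the Hugging Face repo id `Qwen/Qwen2.5-VL-3B-Instruct`):

```bash
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct \
    --local-dir ./checkpoints/Qwen2.5-VL-3B-Instruct
```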
## Script Parameters

### Positional Arguments

| Argument | Position | Default Value | Description |
|---|---|---|---|
| `CONFIG` | `$1` | | Path to the training configuration file |
| `WORK_DIR` | `$2` | | Working directory for logs and checkpoints |
| Additional arguments | `$3+` | None | Additional arguments passed directly to `train.py` |
### Environment Variables (Required for Distributed Training)

These environment variables configure the distributed training parameters for `torchrun`:

| Environment Variable | Description | Example |
|---|---|---|
| `MLP_WORKER_GPU` | Number of GPUs per node | `8` |
| `MLP_WORKER_NUM` | Total number of nodes participating in training | `2` |
| `MLP_ROLE_INDEX` | Rank index of the current node (starting from 0) | `0` |
| `MLP_WORKER_0_HOST` | IP address or hostname of the master node (rank 0) | `localhost` |
| `MLP_WORKER_0_PORT` | Communication port of the master node | `29500` |
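A quick pre-launch sanity check (a standalone snippet, not part of the repository) can catch a missing variable before `torchrun` fails with a less obvious error:

```bash
for v in MLP_WORKER_GPU MLP_WORKER_NUM MLP_ROLE_INDEX MLP_WORKER_0_HOST MLP_WORKER_0_PORT; do
    # ${!v} is bash indirect expansion: the value of the variable named by $v
    [ -n "${!v}" ] || { echo "ERROR: $v is not set" >&2; exit 1; }
done
```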
### Additional Arguments Supported by train.py

The following arguments can be passed as additional arguments to `train.py`:

| Argument | Type | Description |
|---|---|---|
| `--cfg-options` | key=value pairs | Override configuration file settings in the format `key=value` |
| `--eval-after-train` | flag | Automatically run evaluation after training completes |
| `--resume-from` | path | Resume training from a specified checkpoint file |
## Configuring Weights & Biases (wandb)

Weights & Biases is used for experiment tracking and visualization. Configuration steps are as follows:
### 1. Install wandb

wandb is included in `requirements.txt` and can also be installed manually:

```bash
pip install wandb
```
### 2. Log in to wandb

```bash
wandb login
```
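On non-interactive machines (e.g., cloud training jobs) where `wandb login` cannot prompt for credentials, wandb also accepts an API key from the environment:

```bash
export WANDB_API_KEY=<your-api-key>   # read by wandb at startup; skips the interactive prompt
```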
### 3. Set Environment Variables

```bash
export WANDB_PROJECT=fluxvla        # Project name (default: fluxvla)
export WANDB_ENTITY=your-team-name  # Team or username (default: None)
export WANDB_MODE=online            # online, offline, or disabled (default: online)
```
### 4. Disable wandb Logging

To disable wandb logging during training, set:

```bash
export WANDB_MODE=disabled
```
> **Note**: All wandb configuration is read from environment variables; no additional settings in the configuration file are required.
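If the training nodes lack internet access, a common pattern is to log offline and sync later from a machine with network access; this assumes wandb's default `wandb/` run directory layout:

```bash
export WANDB_MODE=offline            # record runs locally during training
# ... after training, from a connected machine:
wandb sync wandb/offline-run-*       # upload the recorded offline run(s)
```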
## Usage Examples

### 1. Single-Node Multi-GPU Training (8 GPUs)

```bash
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune
```
### 2. Volcengine Cloud Platform Training

On the Volcengine MLP platform, the environment variables `MLP_WORKER_GPU`, `MLP_WORKER_NUM`, `MLP_ROLE_INDEX`, `MLP_WORKER_0_HOST`, and `MLP_WORKER_0_PORT` are injected automatically by the platform. No manual configuration is required; simply run the command directly:

```bash
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune
```
### 3. Multi-Node Distributed Training (Manual Configuration)

If not running on the Volcengine platform, the environment variables must be configured manually.

**Node 0 (Master Node):**

```bash
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune
```

**Node 1:**

```bash
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=2
export MLP_ROLE_INDEX=1
export MLP_WORKER_0_HOST={MASTER_NODE_IP}
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_paligemma_libero10_full_finetune
```
### 4. Using Additional Arguments

```bash
# Override configuration parameters
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/custom_run \
    --cfg-options runner.max_epochs=100 runner.save_interval=500

# Resume training from a checkpoint
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_resume \
    --resume-from work_dirs/previous_run/checkpoint_epoch_5.pt

# Automatic evaluation after training
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_with_eval \
    --eval-after-train
```
### 5. Combining Multiple Additional Arguments

```bash
bash scripts/train.sh \
    configs/pi05/pi05_paligemma_libero10_full_finetune.py \
    work_dirs/pi05_full \
    --resume-from work_dirs/previous_run/latest.pt \
    --eval-after-train \
    --cfg-options runner.max_epochs=50 train_dataloader.batch_size=16
```
## Output Artifacts

Upon training completion, the following files are generated in the `WORK_DIR` directory:

- `config.yaml` / `config.json` - Backup of the training configuration
- `checkpoint_*.pt` - Model checkpoint files
- `dataset_statistics.json` - Dataset statistics (if applicable)
- `{config_name}_{timestamp}.jsonl` - Training log file recording loss curves and other training metrics
- Tokenizer-related files - Generated if a tokenizer is employed
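As a hypothetical illustration, a finished run directory matching the patterns above might look like:

```bash
$ ls work_dirs/pi05_paligemma_libero10_full_finetune
checkpoint_epoch_5.pt
checkpoint_epoch_10.pt
config.json
dataset_statistics.json
pi05_paligemma_libero10_full_finetune_20250101_120000.jsonl
```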
## Important Notes

- **Environment variables must be set**: All `MLP_*` environment variables must be correctly configured before executing the script; otherwise, `torchrun` will fail to initialize distributed training.
- **Configuration file format**: Configuration files use the MMEngine Config format (Python files) and must conform to the project's configuration specifications.
- **Multi-node synchronization**: During multi-node training, all nodes must use identical `CONFIG` and `WORK_DIR` parameters.
- **Port conflicts**: Ensure that the port specified by `MLP_WORKER_0_PORT` is not already in use and is reachable through the firewall (see the probe sketch after this list).
- **NCCL backend**: NCCL is used as the default distributed backend. Ensure compatibility between your CUDA and NCCL versions.
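For the port-conflict check above, the following sketch (using the standard `ss` and `nc` tools, if available on your nodes) probes both conditions:

```bash
# On the master node, before launch: no output means the port is free
ss -ltn "sport = :${MLP_WORKER_0_PORT}"

# From a worker node, once the master process is up: verify reachability
nc -zv "${MLP_WORKER_0_HOST}" "${MLP_WORKER_0_PORT}"
```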