# Training and Evaluation Script Interfaces

This page focuses on two frequently used entry points: scripts/train.sh and scripts/eval.sh.

## Purpose

  • scripts/train.sh: Launches distributed training, wrapping torchrun.

  • scripts/eval.sh: Launches distributed evaluation, wrapping torchrun.

## Training Entry Point (scripts/train.sh)

### Command Format

```shell
bash scripts/train.sh [CONFIG] [WORK_DIR] [EXTRA_ARGS...]
```

### Positional Arguments

| Argument | Position | Default | Description |
| --- | --- | --- | --- |
| `CONFIG` | `$1` | `configs/pi05/pi05_paligemma_libero10_full_finetune.py` | Training configuration file path |
| `WORK_DIR` | `$2` | `work_dirs/pi05_paligemma_libero10_full_finetune` | Log and checkpoint output directory |
| Additional arguments | `$3+` | None | Passed through to `train.py` |
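The defaulting behavior described in the table above can be sketched as follows. This is a hedged reconstruction, not the actual contents of `scripts/train.sh`; the variable names are assumptions.

```shell
#!/usr/bin/env bash
# Hedged sketch of the positional-argument defaulting implied by the
# table above; variable names are assumptions, not the script's own.
CONFIG=${1:-configs/pi05/pi05_paligemma_libero10_full_finetune.py}
WORK_DIR=${2:-work_dirs/pi05_paligemma_libero10_full_finetune}

# Drop up to two positionals; everything left over ($3 and beyond)
# would be forwarded verbatim to train.py.
shift $(( $# < 2 ? $# : 2 ))
EXTRA_ARGS=("$@")

echo "CONFIG=$CONFIG"
echo "WORK_DIR=$WORK_DIR"
```

Run with no arguments, both variables fall back to the defaults listed in the table and `EXTRA_ARGS` stays empty.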

### Distributed Environment Variables

| Environment Variable | Description |
| --- | --- |
| `MLP_WORKER_GPU` | Number of GPUs per node |
| `MLP_WORKER_NUM` | Total number of nodes |
| `MLP_ROLE_INDEX` | Rank of the current node |
| `MLP_WORKER_0_HOST` | Master node address |
| `MLP_WORKER_0_PORT` | Master node port |
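To make the mapping concrete, here is a hedged sketch of how a torchrun wrapper would typically consume these variables. The flag spellings are standard torchrun options, but the wrapper's actual invocation may differ; the snippet assembles and prints the command rather than running it.

```shell
# Single-node values, matching the Minimal Runnable Example below.
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

# Plausible torchrun invocation built from the MLP_* variables;
# stored as an array and echoed so it can be inspected.
LAUNCH=(torchrun
        --nproc_per_node="$MLP_WORKER_GPU"
        --nnodes="$MLP_WORKER_NUM"
        --node_rank="$MLP_ROLE_INDEX"
        --master_addr="$MLP_WORKER_0_HOST"
        --master_port="$MLP_WORKER_0_PORT"
        train.py)
echo "${LAUNCH[@]}"
```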

### Common Additional Arguments

| Argument | Type | Description |
| --- | --- | --- |
| `--cfg-options` | `key=value` pairs | Override configuration items |
| `--eval-after-train` | flag | Automatically evaluate after training |
| `--resume-from` | path | Resume from a checkpoint |
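As an illustration, a launch combining these options might look like the following. `optimizer.lr` is a placeholder key (valid keys depend on your configuration file), and the snippet assembles and prints the command instead of executing it.

```shell
# Build the command as an array so each argument stays a single word;
# optimizer.lr and the checkpoint filename are placeholders.
CMD=(bash scripts/train.sh
     configs/pi05/pi05_paligemma_libero10_full_finetune.py
     work_dirs/pi05_paligemma_libero10_full_finetune
     --cfg-options optimizer.lr=1e-4
     --resume-from work_dirs/pi05_paligemma_libero10_full_finetune/checkpoint_step_10000.pt)
echo "${CMD[@]}"   # inspect first; launch with: "${CMD[@]}"
```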

## Evaluation Entry Point (scripts/eval.sh)

### Command Format

```shell
bash scripts/eval.sh [CONFIG] [CKPT_PATH] [EXTRA_ARGS...]
```

### Positional Arguments

| Argument | Position | Description |
| --- | --- | --- |
| `CONFIG` | `$1` | Evaluation configuration file path |
| `CKPT_PATH` | `$2` | Checkpoint file path |
| Additional arguments | `$3+` | Passed through to `eval.py` |

### Common Additional Arguments

| Argument | Type | Description |
| --- | --- | --- |
| `--cfg-options` | `key=value` pairs | Override evaluation configuration items |

## Minimal Runnable Example

```shell
# Training
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

bash scripts/train.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune

# Evaluation
bash scripts/eval.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune/checkpoint_step_10000.pt
```
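The single-node example extends to multiple nodes by changing only the distributed variables. A hypothetical two-node setup (the host name is a placeholder) could look like this; export these on each node, changing only `MLP_ROLE_INDEX`, then launch `scripts/train.sh` exactly as above.

```shell
# Hypothetical two-node configuration; values are placeholders.
export MLP_WORKER_GPU=8                       # 8 GPUs on every node
export MLP_WORKER_NUM=2                       # two nodes in total
export MLP_ROLE_INDEX=0                       # 0 on the first node, 1 on the second
export MLP_WORKER_0_HOST=node0.example.com    # placeholder master address
export MLP_WORKER_0_PORT=29500
```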