# Training and Evaluation Script Interfaces

This page covers two frequently used entry points: `scripts/train.sh` and `scripts/eval.sh`.

## Purpose

- `scripts/train.sh`: Launches distributed training, wrapping `torchrun`.
- `scripts/eval.sh`: Launches distributed evaluation, wrapping `torchrun`.
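The internals of the wrappers are not reproduced on this page, but the pattern is straightforward: consume the positional arguments, map the `MLP_*` environment variables (documented under Distributed Environment Variables below) onto `torchrun`'s distributed flags, and forward everything else. A minimal sketch, assuming a `train.py` entry module and a `--work-dir` option (both hypothetical names, not the repository's actual internals):

```bash
#!/usr/bin/env bash
# Sketch of the wrapper pattern only -- not the repository's actual script.
# "train.py" and "--work-dir" are hypothetical names.
CONFIG=$1
WORK_DIR=$2
shift 2   # everything left in "$@" is passed through verbatim

torchrun \
  --nproc_per_node="${MLP_WORKER_GPU:-8}" \
  --nnodes="${MLP_WORKER_NUM:-1}" \
  --node_rank="${MLP_ROLE_INDEX:-0}" \
  --master_addr="${MLP_WORKER_0_HOST:-localhost}" \
  --master_port="${MLP_WORKER_0_PORT:-29500}" \
  train.py "$CONFIG" --work-dir "$WORK_DIR" "$@"
```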
## Training Entry Point (`scripts/train.sh`)

### Command Format

```bash
bash scripts/train.sh [CONFIG] [WORK_DIR] [EXTRA_ARGS...]
```
### Positional Arguments

| Argument | Position | Default | Description |
|---|---|---|---|
| `CONFIG` | 1 | | Training configuration file path |
| `WORK_DIR` | 2 | | Log and checkpoint output directory |
| Additional arguments | 3+ | None | Passed through to the wrapped training command |
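Concretely, anything after the second positional argument is forwarded without modification. In the invocation below, `--some-flag value` is a placeholder standing in for whatever pass-through options the training script accepts:

```bash
# CONFIG and WORK_DIR come first; "--some-flag value" is a placeholder,
# not an actual option of this repository.
bash scripts/train.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune \
  --some-flag value
```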
Distributed Environment Variables#
Environment Variable |
Description |
|---|---|
|
Number of GPUs per node |
|
Total number of nodes |
|
Rank of the current node |
|
Master node address |
|
Master node port |
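These variables are what make multi-node launches work: assuming the usual `torchrun` rendezvous convention, every node runs the same command and only `MLP_ROLE_INDEX` differs per node. An illustrative two-node setup (addresses and counts are example values):

```bash
# Node 0 (hosts the rendezvous) -- example values only
export MLP_WORKER_GPU=8            # 8 GPUs on each node
export MLP_WORKER_NUM=2            # 2 nodes in total
export MLP_ROLE_INDEX=0            # this node's rank
export MLP_WORKER_0_HOST=10.0.0.1  # reachable address of node 0
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh [CONFIG] [WORK_DIR]

# Node 1: identical environment and command, except the rank
export MLP_ROLE_INDEX=1
```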
Common Additional Arguments#
Argument |
Type |
Description |
|---|---|---|
|
|
Override configuration items |
|
|
Automatically evaluate after training |
|
path |
Resume from a checkpoint |
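As an illustration of how these options are supplied after the positional arguments (the flag spelling `--resume` below is hypothetical; this page does not fix the actual option names):

```bash
# Resume from an earlier checkpoint. "--resume" is a hypothetical
# spelling of the resume option, and the checkpoint path is illustrative.
bash scripts/train.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune \
  --resume work_dirs/pi05_paligemma_libero10_full_finetune/checkpoint_step_5000.pt
```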
## Evaluation Entry Point (`scripts/eval.sh`)

### Command Format

```bash
bash scripts/eval.sh [CONFIG] [CKPT_PATH] [EXTRA_ARGS...]
```
### Positional Arguments

| Argument | Position | Description |
|---|---|---|
| `CONFIG` | 1 | Evaluation configuration file path |
| `CKPT_PATH` | 2 | Checkpoint file path |
| Additional arguments | 3+ | Passed through to the wrapped evaluation command |
Common Additional Arguments#
Argument |
Type |
Description |
|---|---|---|
|
|
Override evaluation configuration items |
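Overrides are supplied the same way as for training, after the two positional arguments. Both the flag and the key below are hypothetical placeholders showing only where such arguments go on the command line:

```bash
# "--cfg-options" and "eval.batch_size" are hypothetical names; check the
# evaluation script's help output for the actual override syntax.
bash scripts/eval.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune/checkpoint_step_10000.pt \
  --cfg-options eval.batch_size=16
```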
## Minimal Runnable Example

```bash
# Train
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
bash scripts/train.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune

# Evaluate
bash scripts/eval.sh \
  configs/pi05/pi05_paligemma_libero10_full_finetune.py \
  work_dirs/pi05_paligemma_libero10_full_finetune/checkpoint_step_10000.pt
```