VLM Training and Evaluation Example#
This document provides a minimal VLM training and evaluation workflow. The commands below use realworldqa as the evaluation benchmark; switching to other benchmarks is straightforward — see the “Switching Benchmarks” section.
Environment Setup#
Navigate to the repository root:
cd /root/code/fluxvla
If you are running scripts/train.sh on a single node with multiple GPUs, set the distributed environment variables first (example for 8 GPUs on one node):
export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500
Note
scripts/train.sh internally invokes torchrun and relies on the environment variables above. When running on a managed platform (e.g., MLP), these variables are typically injected automatically.
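For reference, these variables correspond to torchrun's standard distributed-launch flags. The sketch below illustrates what the variables mean; it is not the exact command that scripts/train.sh runs, and the trailing arguments to scripts/train.py are placeholders:
torchrun \
--nnodes=$MLP_WORKER_NUM \
--nproc_per_node=$MLP_WORKER_GPU \
--node_rank=$MLP_ROLE_INDEX \
--master_addr=$MLP_WORKER_0_HOST \
--master_port=$MLP_WORKER_0_PORT \
scripts/train.py <config> <work_dir>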
1. Dataset Requirements and Custom Construction#
The configuration qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py uses LLaVAFormatDataset as the training dataset type. The example data is a LLaVA-style manifest from Cambrian-7M (the JSON files specified in data_sources).
1.1 Data Format Requirements#
Each sample must contain at least:
conversations: a list of dialogue turns, each in the form {"from": "human"|"gpt", "value": "..."}
Optional image/images: a single image path (string) or a list of image paths
Optional video/videos: a single video path (string) or a list of video paths
Two key constraints:
When using <image>/<video> placeholders in the human text, their count must match the number of files in images/videos
If paths are relative, set train_dataloader.dataset.data_root accordingly; absolute paths do not require this
1.2 Minimal Working Example#
[
{
"id": "sample-0001",
"images": ["images/0001.jpg"],
"conversations": [
{"from": "human", "value": "<image>\nDescribe the main object in the image and suggest a grasp strategy."},
{"from": "gpt", "value": "The main object is a mug. It is recommended to grasp it from the middle of the body, near the handle."}
]
},
{
"id": "sample-0002",
"conversations": [
{"from": "human", "value": "Translate the following to Chinese: Pick up the red block."},
{"from": "gpt", "value": "请拿起红色方块。"}
]
}
]
1.3 Building a Custom Dataset#
Recommended workflow:
Organize raw data (images/videos + instructions + ground-truth answers)
Convert to LLaVA-format json or jsonl (fields described in Section 1.1)
Point train_dataloader.dataset.data_sources to your manifest file(s)
If using relative paths, set train_dataloader.dataset.data_root to the data root directory
Start with a small subset (e.g., a few hundred samples) to verify formatting before full-scale training
A simple “raw annotation → LLaVA format” conversion example (adapt field names to your own data):
import json

raw_records = [
    # Replace with your own raw annotation format
    {
        "sample_id": "1",
        "image_rel_path": "images/0001.jpg",
        "instruction": "Describe the object in the image.",
        "answer": "There is a blue mug in the image."
    }
]

out = []
for r in raw_records:
    out.append({
        "id": f"custom-{r['sample_id']}",
        "images": [r["image_rel_path"]],  # Use with data_root for relative paths
        "conversations": [
            {"from": "human", "value": f"<image>\n{r['instruction']}"},
            {"from": "gpt", "value": r["answer"]},
        ],
    })

with open("custom_llava_train.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
Example configuration for custom data:
train_dataloader = dict(
# ...
dataset=dict(
type='LLaVAFormatDataset',
data_sources=['/path/to/custom_llava_train.json'],
processor_type='qwen3_vl',
model_path='/path/to/base_or_stage1_ckpt',
data_root='/path/to/your_dataset_root', # Set to None if using absolute paths
image_max_pixels=32 * 32 * 1280,
video_max_pixels=32 * 32 * 1280 * 2,
truncation_max_length=2000,
statistic_name='my_custom_vlm_sft',
),
)
Note
Common errors include mismatched placeholder counts vs. image/video file counts, or missing file paths. It is recommended to spot-check ~100 samples before training:
Each entry in conversations should contain at least one human/gpt turn;
The number of <image>/<video> placeholders must match the file list;
All paths must be accessible on the training machine.
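A minimal sketch that automates the spot checks above, assuming a LLaVA-style manifest as in Section 1.2 (the manifest path, data root, and sample limit are illustrative):
import json
import os
import random

manifest_path = "custom_llava_train.json"
data_root = "/path/to/your_dataset_root"  # mirrors train_dataloader.dataset.data_root

def as_list(x):
    # Normalize "single path or list of paths" to a list
    if x is None:
        return []
    return x if isinstance(x, list) else [x]

with open(manifest_path, encoding="utf-8") as f:
    samples = json.load(f)

for s in random.sample(samples, min(100, len(samples))):
    convs = s.get("conversations", [])
    assert any(t["from"] == "human" for t in convs), s.get("id")
    assert any(t["from"] == "gpt" for t in convs), s.get("id")

    human_text = "\n".join(t["value"] for t in convs if t["from"] == "human")
    images = as_list(s.get("images", s.get("image")))
    videos = as_list(s.get("videos", s.get("video")))

    # Placeholder counts must match the number of referenced files
    assert human_text.count("<image>") == len(images), s.get("id")
    assert human_text.count("<video>") == len(videos), s.get("id")

    # All paths must resolve on the training machine
    for p in images + videos:
        full_path = p if os.path.isabs(p) else os.path.join(data_root, p)
        assert os.path.exists(full_path), full_path

print("Spot check passed.")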
2. Launch Training#
bash scripts/train.sh \
/root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
/path_to_work_dir/
Training command argument reference:
scripts/train.sh: training entry script (internally calls scripts/train.py)
1st positional argument: config file path (here qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py)
2nd positional argument: work directory (logs, checkpoints, and config snapshots are stored here)
3rd and subsequent arguments: passed through to scripts/train.py. Common options:
--cfg-options key=value: override config values on the fly without modifying the file
--resume-from /path/to/ckpt.pt: resume training from a checkpoint
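For example, resuming from a checkpoint while also overriding a hyperparameter in the same run (the checkpoint filename below is illustrative; use an actual checkpoint_*.pt from your work directory):
bash scripts/train.sh \
/root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
/path_to_work_dir/ \
--resume-from /path_to_work_dir/checkpoint_1000.pt \
--cfg-options runner.learning_rate=1e-5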
3. Post-Training Evaluation#
python scripts/eval.py \
--config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
--ckpt-path /path_to_checkpoint/ \
--cfg-options eval.benchmarks=[realworldqa]
Evaluation command argument reference:
python scripts/eval.py: evaluation entry script
--config: evaluation config file (typically the same as the training config)
--ckpt-path: model checkpoint to evaluate (HF-format directory or a path supported by the evaluation runner)
--cfg-options: override config values; the example above limits evaluation to the realworldqa benchmark only
4. Switching Benchmarks#
realworldqa is just an example. To switch benchmarks, simply change eval.benchmarks:
# Evaluate on gqa only
python scripts/eval.py \
--config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
--ckpt-path /path_to_checkpoint/ \
--cfg-options eval.benchmarks=[gqa]
# Evaluate on multiple benchmarks at once
python scripts/eval.py \
--config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
--ckpt-path /path_to_checkpoint/ \
--cfg-options eval.benchmarks=[gqa,mmmu,docvqa]
Note
The current VLM evaluation runner supports a variety of benchmarks, including gqa, science_qa, textvqa, mmmu, seed, mathvista, ai2d, chartqa, docvqa, mmstar, realworldqa, ocrbench, among others. Simply list the desired names in eval.benchmarks=[...].
5. Switching Models for Training#
There are two common approaches:
Approach A: Use a Different Config File (Recommended)#
Replace the config path in the training command:
bash scripts/train.sh /path/to/another_vlm_config.py /path_to_work_dir/
Approach B: Switch the Backbone Within the Same Config#
Taking configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py as an example, the key fields are under model.vlm_backbone:
vlm_backbone_id: selects the VLM architecture/variant (e.g., different Qwen3-VL sizes)
vlm_path: initialization weights path (typically pretrained weights or an upstream stage checkpoint)
attn_implementation: attention implementation (e.g., flash_attention_2)
Also ensure that processor-related fields are consistent with the model:
train_dataloader.dataset.processor_type
train_dataloader.dataset.model_path
runner.tokenizer.model_path
eval.processor_type (if enabled in the config)
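A minimal sketch of such an in-config switch, keeping the processor and tokenizer fields above aligned with the new backbone (the path and backbone id are placeholders):
_new_vlm = '/path/to/another_qwen3_vl_checkpoint'

model = dict(
    # ...
    vlm_backbone=dict(
        # ...
        vlm_backbone_id='qwen3_vl_other_variant',  # placeholder; choose a variant the repo supports
        vlm_path=_new_vlm,
        attn_implementation='flash_attention_2',
    ),
)

train_dataloader = dict(
    # ...
    dataset=dict(
        # ...
        processor_type='qwen3_vl',
        model_path=_new_vlm,
    ),
)

runner = dict(
    # ...
    tokenizer=dict(model_path=_new_vlm),
)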
6. Assembling a New Model from a Vision Encoder + LLM (Stage 1 Example)#
configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage1_alignment.py demonstrates a typical paradigm:
Combine the Vision Encoder from Qwen3-VL-2B with the LLM from Qwen3-0.6B to form a new VLM, then perform Stage 1 alignment training (training only the projection/merger).
6.1 Core Configuration#
The key fields are under model.vlm_backbone:
vlm_backbone=dict(
type='Qwen3VL',
vlm_backbone_id='qwen3_2b_vl_pt',
vlm_path=_qwen3vl_2b,
vision_encoder_path=_qwen3vl_2b,
llm_backbone_path=_qwen3_0_6b,
use_projection=False,
attn_implementation='flash_attention_2',
)
Field descriptions:
type='Qwen3VL': uses the Qwen3VL backbone wrapper that supports modular loading
vlm_backbone_id: specifies the VLM configuration template family for assembly
vlm_path: main template path (typically points to a complete VLM; here the 2B VL model)
vision_encoder_path: source of the vision encoder (here the 2B VL model)
llm_backbone_path: source of the language model (here the 0.6B LLM)
attn_implementation: underlying attention implementation
6.2 What Happens Under the Hood#
When vision_encoder_path or llm_backbone_path is set, modular loading is triggered:
The Qwen3VL configuration is first constructed using vision_encoder_path (or vlm_path) as a template
If llm_backbone_path is set, the LLM's structural parameters are merged into text_config
vision_config.out_hidden_size is automatically aligned to the LLM's hidden size, ensuring visual features can be fed into the new LLM
Visual parameters are first loaded from the vision template, then LLM parameters are loaded from llm_backbone_path
If the final dimension of the merger layer does not match, a "crop-copy by output dimension" operation is performed for initialization
This is how a trainable new model is assembled from "2B vision + 0.6B language."
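A rough, illustrative sketch of the configuration-merging step described above. This is not the actual fluxvla implementation; it only mirrors the described behavior using standard transformers config objects, and the attribute names follow the description above:
from transformers import AutoConfig

def assemble_vlm_config(vision_encoder_path: str, llm_backbone_path: str):
    # 1) Use the vision/VL template as the starting Qwen3-VL configuration
    vlm_config = AutoConfig.from_pretrained(vision_encoder_path)
    # 2) Merge the LLM's structural parameters into text_config
    llm_config = AutoConfig.from_pretrained(llm_backbone_path)
    vlm_config.text_config = llm_config
    # 3) Align the vision branch's output width with the new LLM hidden size
    vlm_config.vision_config.out_hidden_size = llm_config.hidden_size
    return vlm_config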
6.3 Why Stage 1 Only Trains the Projection/Merger#
In this configuration:
freeze_vision_encoder=True
freeze_projection=False
freeze_backbone=True
This means:
The Vision Encoder is frozen
The LLM Backbone is frozen
Only the vision-language connection (projection/merger) is trained
The goal is to first achieve cross-module alignment, then unfreeze more parameters for fine-tuning in Stage 2.
6.4 Using Your Own Vision Encoder + LLM#
You can directly reuse this pattern. The minimal changes required are:
Set vision_encoder_path to your vision model path
Set llm_backbone_path to your LLM path
vlm_path should still point to a VLM template from the same family as your vision branch (used for configuration and processor)
train_dataloader.dataset.model_path and runner.tokenizer.model_path should typically remain consistent with the template family
Start with the Stage 1 freeze strategy to verify stability before proceeding to Stage 2
Reference configuration:
_my_vision = '/path/to/your_vision_encoder_or_vl_template'
_my_llm = '/path/to/your_llm'
model = dict(
type='VLMForSFT',
vlm_backbone=dict(
type='Qwen3VL',
vlm_backbone_id='qwen3_2b_vl_pt',
vlm_path=_my_vision,
vision_encoder_path=_my_vision,
llm_backbone_path=_my_llm,
use_projection=False,
attn_implementation='flash_attention_2',
),
freeze_vision_encoder=True,
freeze_projection=False,
freeze_backbone=True,
)
Note
To avoid compatibility issues, it is recommended to combine components within the same model family (e.g., Qwen3-VL series vision branch + Qwen3 series LLM). Cross-family combinations may introduce mismatches in tokenizer, special tokens, config fields, or weight mappings, requiring additional adaptation code.
7. Freezing Specific Modules During Training#
VLMForSFT supports three commonly used freeze switches (under model):
freeze_vision_encoder: whether to freeze the vision encoder
freeze_projection: whether to freeze the visual merger/projection
freeze_backbone: whether to freeze the language backbone (LLM)
Example combinations (suitable for quick experiments):
# Train projection only (similar to Stage 1 alignment)
model = dict(
# ...
freeze_vision_encoder=True,
freeze_projection=False,
freeze_backbone=True,
)
# Train projection + LLM, freeze vision (similar to Stage 2 SFT)
model = dict(
# ...
freeze_vision_encoder=True,
freeze_projection=False,
freeze_backbone=False,
)
# Full-parameter training (nothing frozen)
model = dict(
# ...
freeze_vision_encoder=False,
freeze_projection=False,
freeze_backbone=False,
)
You can also override freeze settings from the command line without modifying the config file:
bash scripts/train.sh \
/root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
/path_to_work_dir/ \
--cfg-options \
model.freeze_vision_encoder=True \
model.freeze_projection=False \
model.freeze_backbone=True
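For intuition, freeze switches like these are typically implemented by toggling requires_grad on the corresponding submodules. A minimal sketch under that assumption (the attribute names vision_encoder, projection, and llm are illustrative, not necessarily fluxvla's actual module names):
import torch.nn as nn

def apply_freeze(model: nn.Module, freeze_vision_encoder: bool,
                 freeze_projection: bool, freeze_backbone: bool) -> None:
    # Map each switch to its submodule and flip requires_grad accordingly
    for module, frozen in [
        (model.vision_encoder, freeze_vision_encoder),  # illustrative attribute names
        (model.projection, freeze_projection),
        (model.llm, freeze_backbone),
    ]:
        for p in module.parameters():
            p.requires_grad = not frozen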
8. Key Configuration Parameters#
The following parameters are listed roughly in order of how frequently they are modified and how strongly they affect results.
8.1 Model (model)#
_stage1_ckpt: initialization weights path for Stage 2 SFT (typically the HF checkpoint directory from Stage 1)
model.vlm_backbone.vlm_backbone_id: backbone variant selection (e.g., 2B or other sizes)
model.vlm_backbone.vlm_path: actual weights path for the backbone (usually the same as _stage1_ckpt)
model.vlm_backbone.attn_implementation: attention implementation, commonly flash_attention_2
model.freeze_vision_encoder: whether to freeze the vision encoder
model.freeze_projection: whether to freeze the visual merger/projection
model.freeze_backbone: whether to freeze the language backbone
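Taken together, a typical Stage 2 arrangement of these fields looks like the sketch below (paths are placeholders; the freeze combination matches the Stage 2 SFT example in Section 7):
_stage1_ckpt = '/path/to/stage1_hf_export'  # e.g. an HF-format directory exported from Stage 1

model = dict(
    # ...
    vlm_backbone=dict(
        # ...
        vlm_backbone_id='qwen3_2b_vl_pt',
        vlm_path=_stage1_ckpt,
        attn_implementation='flash_attention_2',
    ),
    freeze_vision_encoder=True,
    freeze_projection=False,
    freeze_backbone=False,
)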
8.2 Data (train_dataloader.dataset)#
data_sources: training data manifest(s) (supports multiple json/jsonl files for mixed training)
data_root: root directory for relative paths; set to None when images/videos use absolute paths
processor_type: data processor type, must match the model family (qwen3_vl in this case)
model_path: path for processor/tokenizer loading, typically the same as the current model weights path
image_max_pixels / video_max_pixels: maximum pixel budget for images/videos; increasing preserves more detail but consumes more GPU memory
truncation_max_length: text truncation length; increasing retains more context but consumes more GPU memory and training time
statistic_name: experiment identifier for logging/statistics
8.3 Training Hyperparameters (train_dataloader + runner)#
train_dataloader.per_device_batch_size: per-GPU batch size (affects GPU memory and throughput)
train_dataloader.per_device_num_workers: DataLoader worker count (affects data loading speed)
runner.max_epochs: number of training epochs
runner.learning_rate: learning rate (has the greatest impact on convergence speed and stability)
runner.weight_decay: weight decay (regularization strength)
runner.max_grad_norm: gradient clipping threshold (prevents gradient explosion)
runner.lr_scheduler_type / runner.warmup_ratio: learning rate schedule and warmup ratio
runner.enable_gradient_checkpointing: memory-saving option (typically adds some computational overhead)
runner.enable_mixed_precision_training + runner.mixed_precision_dtype: mixed-precision training configuration (commonly bf16)
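An illustrative combination of the hyperparameter fields above (the values are placeholders to adapt to your hardware and data scale, not recommended settings):
train_dataloader = dict(
    # ...
    per_device_batch_size=8,
    per_device_num_workers=4,
)

runner = dict(
    # ...
    max_epochs=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    max_grad_norm=1.0,
    lr_scheduler_type='cosine',
    warmup_ratio=0.03,
    enable_gradient_checkpointing=True,
    enable_mixed_precision_training=True,
    mixed_precision_dtype='bf16',
)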
8.4 Evaluation (eval)#
eval.benchmarks: list of benchmarks to evaluate (most frequently modified)
eval.max_samples_per_benchmark: per-benchmark sample limit (set to a small integer for quick validation)
eval.batch_size: evaluation batch size (limited by GPU memory)
eval.max_new_tokens: maximum generation length
eval.use_llm_judge: whether to use an LLM API as the evaluation judge (when disabled, typically falls back to exact matching)
eval.verbose: whether to print detailed inference output (useful for debugging)
eval.output_dir: evaluation output directory (uses default if not set)
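An illustrative eval block combining the fields above (values are placeholders; a small max_samples_per_benchmark is handy for quick validation runs):
eval = dict(
    # ...
    benchmarks=['gqa', 'realworldqa'],
    max_samples_per_benchmark=100,
    batch_size=8,
    max_new_tokens=256,
    use_llm_judge=False,  # fall back to exact matching
    verbose=False,
    output_dir='outputs/vlm_eval',
)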
8.5 Common Command-Line Override Examples#
# Override key training hyperparameters without modifying the config file
bash scripts/train.sh \
/root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
/path_to_work_dir/ \
--cfg-options \
train_dataloader.per_device_batch_size=8 \
runner.learning_rate=2e-5 \
runner.max_epochs=4 \
model.freeze_vision_encoder=True \
model.freeze_projection=False \
model.freeze_backbone=False
# Quick evaluation: change benchmarks + limit sample count
python scripts/eval.py \
--config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
--ckpt-path /path_to_checkpoint/ \
--cfg-options eval.benchmarks=[gqa,realworldqa] eval.max_samples_per_benchmark=100
9. Common Output Locations#
The training work directory /path_to_work_dir/ typically contains:
checkpoint_*.pt: training checkpoints
checkpoints/hf_step-* (if HF export is enabled): HF-format model directories
config.yaml / config.json: config snapshot at training time
*.jsonl: training logs
Evaluation logs are printed to the terminal and, depending on the configuration, written to the evaluation output directory (e.g., outputs/vlm_eval).