VLM Training and Evaluation Example#

This document provides a minimal VLM training and evaluation workflow. The commands below use realworldqa as the evaluation benchmark; switching to other benchmarks is straightforward — see the “Switching Benchmarks” section.

Environment Setup#

Navigate to the repository root:

cd /root/code/fluxvla

If you are running scripts/train.sh on a single node with multiple GPUs, set the distributed environment variables first (example for 8 GPUs on one node):

export MLP_WORKER_GPU=8
export MLP_WORKER_NUM=1
export MLP_ROLE_INDEX=0
export MLP_WORKER_0_HOST=localhost
export MLP_WORKER_0_PORT=29500

Note

scripts/train.sh internally invokes torchrun and relies on the environment variables above. When running on a managed platform (e.g., MLP), these variables are typically injected automatically.
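
For reference, these variables typically map onto torchrun's standard flags roughly as follows (a sketch of the conventional mapping, not necessarily the script's exact invocation):

torchrun \
    --nproc_per_node="${MLP_WORKER_GPU}" \
    --nnodes="${MLP_WORKER_NUM}" \
    --node_rank="${MLP_ROLE_INDEX}" \
    --master_addr="${MLP_WORKER_0_HOST}" \
    --master_port="${MLP_WORKER_0_PORT}" \
    scripts/train.py ...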

1. Dataset Requirements and Custom Construction#

The configuration qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py uses LLaVAFormatDataset as the training dataset type. The example data is a LLaVA-style manifest from Cambrian-7M (the JSON files specified in data_sources).

1.1 Data Format Requirements#

Each sample must contain at least:

  • conversations: a list of dialogue turns, each in the form {"from": "human"|"gpt", "value": "..."}

  • Optional image / images: a single image path (string) or a list of image paths

  • Optional video / videos: a single video path (string) or a list of video paths

Two key constraints:

  • When using <image> / <video> placeholders in the human text, their count must match the number of files in images / videos

  • If paths are relative, set train_dataloader.dataset.data_root accordingly; absolute paths do not require this

1.2 Minimal Working Example#

[
  {
    "id": "sample-0001",
    "images": ["images/0001.jpg"],
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the main object in the image and suggest a grasp strategy."},
      {"from": "gpt", "value": "The main object is a mug. It is recommended to grasp it from the middle of the body, near the handle."}
    ]
  },
  {
    "id": "sample-0002",
    "conversations": [
      {"from": "human", "value": "Translate the following to Chinese: Pick up the red block."},
      {"from": "gpt", "value": "请拿起红色方块。"}
    ]
  }
]

1.3 Building a Custom Dataset#

Recommended workflow:

  1. Organize raw data (images/videos + instructions + ground-truth answers)

  2. Convert to LLaVA-format json or jsonl (fields described in Section 1.1)

  3. Point train_dataloader.dataset.data_sources to your manifest file(s)

  4. If using relative paths, set train_dataloader.dataset.data_root to the data root directory

  5. Start with a small subset (e.g., a few hundred samples) to verify formatting before full-scale training

A simple “raw annotation → LLaVA format” conversion example (adapt field names to your own data):

import json

raw_records = [
    # Replace with your own raw annotation format
    {
        "sample_id": "1",
        "image_rel_path": "images/0001.jpg",
        "instruction": "Describe the object in the image.",
        "answer": "There is a blue mug in the image."
    }
]

out = []
for r in raw_records:
    out.append({
        "id": f"custom-{r['sample_id']}",
        "images": [r["image_rel_path"]],  # Use with data_root for relative paths
        "conversations": [
            {"from": "human", "value": f"<image>\n{r['instruction']}"},
            {"from": "gpt", "value": r["answer"]},
        ],
    })

with open("custom_llava_train.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
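
If you prefer jsonl (data_sources accepts json or jsonl, per Section 1.3), continue the script above by writing one record per line instead:

with open("custom_llava_train.jsonl", "w", encoding="utf-8") as f:
    for rec in out:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")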

Example configuration for custom data:

train_dataloader = dict(
    # ...
    dataset=dict(
        type='LLaVAFormatDataset',
        data_sources=['/path/to/custom_llava_train.json'],
        processor_type='qwen3_vl',
        model_path='/path/to/base_or_stage1_ckpt',
        data_root='/path/to/your_dataset_root',  # Set to None if using absolute paths
        image_max_pixels=32 * 32 * 1280,
        video_max_pixels=32 * 32 * 1280 * 2,
        truncation_max_length=2000,
        statistic_name='my_custom_vlm_sft',
    ),
)
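
Alternatively, the same dataset fields can be overridden from the command line without editing the config (illustrative paths; the list syntax mirrors the eval.benchmarks examples later in this document):

bash scripts/train.sh \
    configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    /path_to_work_dir/ \
    --cfg-options \
    train_dataloader.dataset.data_sources=[/path/to/custom_llava_train.json] \
    train_dataloader.dataset.data_root=/path/to/your_dataset_root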

Note

Common errors include placeholder counts that do not match the number of image/video files, and file paths that do not exist. It is recommended to spot-check ~100 samples before training (a minimal validation sketch follows this list):

  1. Each entry in conversations should contain at least one human/gpt turn;

  2. The number of <image>/<video> placeholders must match the file list;

  3. All paths must be accessible on the training machine.
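
A minimal validation sketch covering these three checks (the manifest and data-root paths below are placeholders; adapt field names to your data):

import json
import os
import random

MANIFEST = "custom_llava_train.json"      # your LLaVA-format manifest
DATA_ROOT = "/path/to/your_dataset_root"  # same value as train_dataloader.dataset.data_root

with open(MANIFEST, encoding="utf-8") as f:
    samples = json.load(f)

for sample in random.sample(samples, min(100, len(samples))):
    convs = sample.get("conversations", [])
    # 1. At least one human turn and one gpt turn per sample
    roles = {turn["from"] for turn in convs}
    assert {"human", "gpt"} <= roles, f"{sample.get('id')}: missing human/gpt turn"

    for tag, key in (("<image>", "images"), ("<video>", "videos")):
        # Accept both the plural (list) and singular (string) field variants
        files = sample.get(key) or sample.get(key[:-1]) or []
        if isinstance(files, str):
            files = [files]
        # 2. Placeholder count must match the file list
        n_tags = sum(t["value"].count(tag) for t in convs if t["from"] == "human")
        assert n_tags == len(files), f"{sample.get('id')}: {n_tags} {tag} tags vs {len(files)} files"
        # 3. All paths must be accessible on this machine
        for p in files:
            full = p if os.path.isabs(p) else os.path.join(DATA_ROOT, p)
            assert os.path.exists(full), f"{sample.get('id')}: missing {full}"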

2. Launch Training#

bash scripts/train.sh \
    /root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    /path_to_work_dir/

Training command argument reference:

  • scripts/train.sh: training entry script (internally calls scripts/train.py)

  • 1st positional argument: config file path (here qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py)

  • 2nd positional argument: work directory (logs, checkpoints, and config snapshots are stored here)

  • 3rd and subsequent arguments: passed through to scripts/train.py (see the example after this list). Common options:

    • --cfg-options key=value: override config values on the fly without modifying the file

    • --resume-from /path/to/ckpt.pt: resume training from a checkpoint
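
For example, resuming from a checkpoint while overriding a hyperparameter (the checkpoint name here is illustrative):

bash scripts/train.sh \
    /root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    /path_to_work_dir/ \
    --resume-from /path_to_work_dir/checkpoint_1000.pt \
    --cfg-options runner.learning_rate=1e-5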

3. Post-Training Evaluation#

python scripts/eval.py \
    --config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    --ckpt-path /path_to_checkpoint/ \
    --cfg-options eval.benchmarks=[realworldqa]

Evaluation command argument reference:

  • python scripts/eval.py: evaluation entry script

  • --config: evaluation config file (typically the same as the training config)

  • --ckpt-path: model checkpoint to evaluate (HF-format directory or a path supported by the evaluation runner)

  • --cfg-options: override config values; the example above limits evaluation to the realworldqa benchmark only

4. Switching Benchmarks#

realworldqa is just an example. To switch benchmarks, simply change eval.benchmarks:

# Evaluate on gqa only
python scripts/eval.py \
    --config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    --ckpt-path /path_to_checkpoint/ \
    --cfg-options eval.benchmarks=[gqa]
# Evaluate on multiple benchmarks at once
python scripts/eval.py \
    --config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    --ckpt-path /path_to_checkpoint/ \
    --cfg-options eval.benchmarks=[gqa,mmmu,docvqa]

Note

The current VLM evaluation runner supports a variety of benchmarks, including gqa, science_qa, textvqa, mmmu, seed, mathvista, ai2d, chartqa, docvqa, mmstar, realworldqa, ocrbench, among others. Simply list the desired names in eval.benchmarks=[...].

5. Switching Models for Training#

There are two common approaches: switch the backbone within the same config (described below), or assemble a new model from a separate vision encoder and LLM (see Section 6).

Switching the Backbone Within the Same Config#

Taking configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py as an example, the key fields are under model.vlm_backbone:

  • vlm_backbone_id: selects the VLM architecture/variant (e.g., different Qwen3-VL sizes)

  • vlm_path: initialization weights path (typically pretrained weights or an upstream stage checkpoint)

  • attn_implementation: attention implementation (e.g., flash_attention_2)

Also ensure that processor-related fields are consistent with the model (an override example follows this list):

  • train_dataloader.dataset.processor_type

  • train_dataloader.dataset.model_path

  • runner.tokenizer.model_path

  • eval.processor_type (if enabled in the config)
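
For instance, when pointing the backbone at different weights, the related fields can be kept in sync from the command line (illustrative paths):

bash scripts/train.sh \
    /root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    /path_to_work_dir/ \
    --cfg-options \
    model.vlm_backbone.vlm_path=/path/to/other_ckpt \
    train_dataloader.dataset.model_path=/path/to/other_ckpt \
    runner.tokenizer.model_path=/path/to/other_ckpt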

6. Assembling a New Model from a Vision Encoder + LLM (Stage 1 Example)#

configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage1_alignment.py demonstrates a typical paradigm: combine the vision encoder from Qwen3-VL-2B with the LLM from Qwen3-0.6B into a new VLM, then run Stage 1 alignment training (training only the projection/merger).

6.1 Core Configuration#

The key fields are under model.vlm_backbone:

vlm_backbone=dict(
    type='Qwen3VL',
    vlm_backbone_id='qwen3_2b_vl_pt',
    vlm_path=_qwen3vl_2b,
    vision_encoder_path=_qwen3vl_2b,
    llm_backbone_path=_qwen3_0_6b,
    use_projection=False,
    attn_implementation='flash_attention_2',
)

Field descriptions:

  • type='Qwen3VL': uses the Qwen3VL backbone wrapper that supports modular loading

  • vlm_backbone_id: specifies the VLM configuration template family for assembly

  • vlm_path: main template path (typically points to a complete VLM; here the 2B VL model)

  • vision_encoder_path: source of the vision encoder (here the 2B VL model)

  • llm_backbone_path: source of the language model (here the 0.6B LLM)

  • attn_implementation: underlying attention implementation

6.2 What Happens Under the Hood#

When vision_encoder_path or llm_backbone_path is set, modular loading is triggered:

  1. The Qwen3VL configuration is first constructed using vision_encoder_path (or vlm_path) as a template

  2. If llm_backbone_path is set, the LLM’s structural parameters are merged into text_config

  3. vision_config.out_hidden_size is automatically aligned to the LLM’s hidden size, ensuring visual features can be fed into the new LLM

  4. Visual parameters are first loaded from the vision template, then LLM parameters are loaded from llm_backbone_path

  5. If the final dimension of the merger layer does not match, a “crop-copy by output dimension” operation is performed for initialization

This is how it assembles a trainable new model from “2B vision + 0.6B language.”
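
As an illustration of step 5, "crop-copy by output dimension" amounts to copying the overlapping slice of the pretrained merger weight into the newly shaped layer (a conceptual sketch with made-up shapes, not the repository's actual code):

import torch

# Hypothetical shapes: the pretrained merger projects to the old LLM's hidden
# size (2048 here), the new merger to the new LLM's hidden size (1024 here).
pretrained_w = torch.randn(2048, 5120)  # [out_features, in_features]
new_w = torch.empty(1024, 5120)

# Fresh init everywhere, then copy the rows that overlap in the output dim.
torch.nn.init.normal_(new_w, std=0.02)
n = min(pretrained_w.shape[0], new_w.shape[0])
new_w[:n] = pretrained_w[:n]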

6.3 Why Stage 1 Only Trains the Projection/Merger#

In this configuration:

freeze_vision_encoder=True
freeze_projection=False
freeze_backbone=True

This means:

  • The Vision Encoder is frozen

  • The LLM Backbone is frozen

  • Only the vision-language connection (projection/merger) is trained

The goal is to first achieve cross-module alignment, then unfreeze more parameters for fine-tuning in Stage 2.
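
Conceptually, the freeze switches just toggle requires_grad on the corresponding parameter groups. A simplified sketch (the module attribute names are illustrative, not VLMForSFT's actual internals):

import torch.nn as nn

def apply_freeze(model: nn.Module,
                 freeze_vision_encoder: bool,
                 freeze_projection: bool,
                 freeze_backbone: bool) -> None:
    # Map each switch to a (hypothetical) submodule and freeze accordingly.
    groups = {
        "vision_encoder": freeze_vision_encoder,
        "projection": freeze_projection,
        "llm_backbone": freeze_backbone,
    }
    for name, frozen in groups.items():
        module = getattr(model, name, None)
        if module is None:
            continue
        for p in module.parameters():
            p.requires_grad = not frozen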

6.4 Using Your Own Vision Encoder + LLM#

You can directly reuse this pattern. The minimal changes required are:

  1. Set vision_encoder_path to your vision model path

  2. Set llm_backbone_path to your LLM path

  3. vlm_path should still point to a VLM template from the same family as your vision branch (used for configuration and processor)

  4. train_dataloader.dataset.model_path and runner.tokenizer.model_path should typically remain consistent with the template family

  5. Start with the Stage 1 freeze strategy to verify stability before proceeding to Stage 2

Reference configuration:

_my_vision = '/path/to/your_vision_encoder_or_vl_template'
_my_llm = '/path/to/your_llm'

model = dict(
    type='VLMForSFT',
    vlm_backbone=dict(
        type='Qwen3VL',
        vlm_backbone_id='qwen3_2b_vl_pt',
        vlm_path=_my_vision,
        vision_encoder_path=_my_vision,
        llm_backbone_path=_my_llm,
        use_projection=False,
        attn_implementation='flash_attention_2',
    ),
    freeze_vision_encoder=True,
    freeze_projection=False,
    freeze_backbone=True,
)

Note

To avoid compatibility issues, it is recommended to combine components within the same model family (e.g., Qwen3-VL series vision branch + Qwen3 series LLM). Cross-family combinations may introduce mismatches in tokenizer, special tokens, config fields, or weight mappings, requiring additional adaptation code.

7. Freezing Specific Modules During Training#

VLMForSFT supports three commonly used freeze switches (under model):

  • freeze_vision_encoder: whether to freeze the vision encoder

  • freeze_projection: whether to freeze the visual merger/projection

  • freeze_backbone: whether to freeze the language backbone (LLM)

Example combinations (suitable for quick experiments):

# Train projection only (similar to Stage 1 alignment)
model = dict(
    # ...
    freeze_vision_encoder=True,
    freeze_projection=False,
    freeze_backbone=True,
)
# Train projection + LLM, freeze vision (similar to Stage 2 SFT)
model = dict(
    # ...
    freeze_vision_encoder=True,
    freeze_projection=False,
    freeze_backbone=False,
)
# Full-parameter training (nothing frozen)
model = dict(
    # ...
    freeze_vision_encoder=False,
    freeze_projection=False,
    freeze_backbone=False,
)

You can also override freeze settings from the command line without modifying the config file:

bash scripts/train.sh \
    /root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    /path_to_work_dir/ \
    --cfg-options \
    model.freeze_vision_encoder=True \
    model.freeze_projection=False \
    model.freeze_backbone=True

8. Key Configuration Parameters#

The following parameters are listed in order of how frequently they are modified and their impact on results.

8.1 Model (model)#

  • _stage1_ckpt: initialization weights path for Stage 2 SFT (typically the HF checkpoint directory from Stage 1)

  • model.vlm_backbone.vlm_backbone_id: backbone variant selection (e.g., 2B or other sizes)

  • model.vlm_backbone.vlm_path: actual weights path for the backbone (usually the same as _stage1_ckpt)

  • model.vlm_backbone.attn_implementation: attention implementation, commonly flash_attention_2

  • model.freeze_vision_encoder: whether to freeze the vision encoder

  • model.freeze_projection: whether to freeze the visual merger/projection

  • model.freeze_backbone: whether to freeze the language backbone

8.2 Data (train_dataloader.dataset)#

  • data_sources: training data manifest(s) (supports multiple json/jsonl files for mixed training)

  • data_root: root directory for relative paths; set to None when images/videos use absolute paths

  • processor_type: data processor type, must match the model family (qwen3_vl in this case)

  • model_path: path for processor/tokenizer loading, typically the same as the current model weights path

  • image_max_pixels / video_max_pixels: maximum pixel budget for images/videos; increasing it preserves more detail but consumes more GPU memory (see the sizing sketch after this list)

  • truncation_max_length: text truncation length; increasing retains more context but consumes more GPU memory and training time

  • statistic_name: experiment identifier for logging/statistics
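
To put the pixel budget in perspective, 32 * 32 * 1280 is about 1.31 M pixels. A processor enforcing such a budget typically downscales oversized inputs while preserving aspect ratio, along these lines (an illustrative calculation, not the processor's exact resizing logic):

import math

image_max_pixels = 32 * 32 * 1280  # 1,310,720 pixels

h, w = 3000, 4000  # an oversized input image (12 M pixels)
if h * w > image_max_pixels:
    scale = math.sqrt(image_max_pixels / (h * w))
    h, w = int(h * scale), int(w * scale)
print(h, w)  # 991 1321 -> back within the budget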

8.3 Training Hyperparameters (train_dataloader + runner)#

  • train_dataloader.per_device_batch_size: per-GPU batch size (affects GPU memory and throughput)

  • train_dataloader.per_device_num_workers: DataLoader worker count (affects data loading speed)

  • runner.max_epochs: number of training epochs

  • runner.learning_rate: learning rate (has the greatest impact on convergence speed and stability)

  • runner.weight_decay: weight decay (regularization strength)

  • runner.max_grad_norm: gradient clipping threshold (prevents gradient explosion)

  • runner.lr_scheduler_type / runner.warmup_ratio: learning rate schedule and warmup ratio

  • runner.enable_gradient_checkpointing: memory-saving option (typically adds some computational overhead)

  • runner.enable_mixed_precision_training + runner.mixed_precision_dtype: mixed-precision training configuration (commonly bf16)

8.4 Evaluation (eval)#

  • eval.benchmarks: list of benchmarks to evaluate (most frequently modified)

  • eval.max_samples_per_benchmark: per-benchmark sample limit (set to a small integer for quick validation)

  • eval.batch_size: evaluation batch size (limited by GPU memory)

  • eval.max_new_tokens: maximum generation length

  • eval.use_llm_judge: whether to use an LLM API as the evaluation judge (when disabled, typically falls back to exact matching)

  • eval.verbose: whether to print detailed inference output (useful for debugging)

  • eval.output_dir: evaluation output directory (uses default if not set)

8.5 Common Command-Line Override Examples#

# Override key training hyperparameters without modifying the config file
bash scripts/train.sh \
    /root/code/fluxvla/configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    /path_to_work_dir/ \
    --cfg-options \
    train_dataloader.per_device_batch_size=8 \
    runner.learning_rate=2e-5 \
    runner.max_epochs=4 \
    model.freeze_vision_encoder=True \
    model.freeze_projection=False \
    model.freeze_backbone=False
# Quick evaluation: change benchmarks + limit sample count
python scripts/eval.py \
    --config configs/vlm/qwen3_vl_sft_vision2b_llm0_6b_stage2_sft.py \
    --ckpt-path /path_to_checkpoint/ \
    --cfg-options eval.benchmarks=[gqa,realworldqa] eval.max_samples_per_benchmark=100

9. Common Output Locations#

The training work directory /path_to_work_dir/ typically contains:

  • checkpoint_*.pt: training checkpoints

  • checkpoints/hf_step-* (if HF export is enabled): HF-format model directories

  • config.yaml / config.json: config snapshot at training time

  • *.jsonl: training logs

Evaluation logs are printed to the terminal and, depending on the configuration, written to the evaluation output directory (e.g., outputs/vlm_eval).
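
To skim the *.jsonl training logs, something along these lines usually suffices (the exact record schema depends on the trainer, so treat the fields you print as assumptions to verify):

import glob
import json

# Print the last few records from each training log in the work directory.
for path in glob.glob("/path_to_work_dir/*.jsonl"):
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    for rec in records[-3:]:
        print(path, rec)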