# Dataset Interfaces

This page consolidates the conventions for training and inference data interfaces, so that configuration items can be looked up quickly by field and by transform.

## Purpose

- Define training data sources (e.g., `ParquetDataset`)
- Define field mappings and statistic keys
- Organize transform pipelines
- Maintain consistency between training and inference data processing

## Core Parameters

### train_dataloader

| Field | Description |
| --- | --- |
| `per_device_batch_size` | Batch size per device |
| `per_device_num_workers` | Number of data-loading workers per device |
| `dataset.type` | Dataset wrapper type (e.g., `DistributedRepeatingDataset`) |
| `dataset.name_mappings` | Field name mappings |
| `dataset.statistic_keys` | Statistic field keys |
| `datasets[].data_root_path` | List of data root paths |
| `datasets[].transforms` | Data processing (transform) pipeline |
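
The Minimal Example below omits `name_mappings` and `statistic_keys`. As a sketch of how they might be filled in (the mapping keys/values and statistic keys here are illustrative placeholders, not names guaranteed to exist in any real dataset), a dataset block that renames raw columns and declares which fields feed statistics could look like:

```python
# Illustrative only: the mapped names and statistic keys below are assumed.
dataset = dict(
    type='DistributedRepeatingDataset',
    # Map raw on-disk column names to the field names the transforms expect.
    name_mappings=dict(observations='states', commands='actions'),
    # Fields whose statistics are tracked (e.g., for normalization).
    statistic_keys=['states', 'actions'],
    datasets=[...],  # per-dataset configs as in the Minimal Example below
)
```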

## Transforms and Their Common Fields

| Transform | Common Fields |
| --- | --- |
| `ProcessParquetInputs` | `parquet_keys`, `video_keys`, `embodiment_id` |
| `ProcessPromptsWithImage` | `max_len`, `num_images`, `tokenizer` |
| `ResizeImages` | `height`, `width` |
| `NormalizeImages` | `means`, `stds` |
| `NormalizeStatesAndActions` | `state_dim`, `action_dim`, `norm_type` |
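
Putting the table together, a pipeline with every common field spelled out might look like the sketch below. The field names come from the table above, but the concrete values (column names, tokenizer path, normalization statistics, `norm_type` value) are placeholders, not defaults shipped with the library:

```python
transforms = [
    # Column names and the embodiment identifier are dataset-specific.
    dict(type='ProcessParquetInputs',
         parquet_keys=['state', 'action'],  # placeholder column names
         video_keys=['rgb'],                # placeholder video column
         embodiment_id=0),
    dict(type='ProcessPromptsWithImage',
         max_len=256,                       # assumed prompt length budget
         num_images=3,
         tokenizer='./tokenizer'),          # placeholder tokenizer path
    dict(type='ResizeImages', height=224, width=224),
    # ImageNet statistics (0-1 scale) shown purely as an example; use your
    # dataset's own statistics in practice.
    dict(type='NormalizeImages',
         means=[0.485, 0.456, 0.406],
         stds=[0.229, 0.224, 0.225]),
    dict(type='NormalizeStatesAndActions',
         state_dim=64, action_dim=32,
         norm_type='mean_std'),             # assumed norm_type value
]
```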

## Minimal Example

```python
train_dataloader = dict(
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=['./datasets/your_dataset'],
                transforms=[
                    dict(type='ProcessParquetInputs'),
                    dict(type='ProcessPromptsWithImage', num_images=3),
                    dict(type='ResizeImages', height=224, width=224),
                    dict(type='NormalizeImages'),
                    dict(type='NormalizeStatesAndActions', state_dim=64, action_dim=32)
                ])
        ]))
```
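
Because inference should apply the same preprocessing as training (see Purpose), one workable pattern is to reuse the transform configs from `train_dataloader` verbatim when preparing inference inputs. The sketch below assumes a `build_transform` factory that turns one config dict into a callable; that name is hypothetical, so substitute whatever registry or builder your codebase actually provides:

```python
# Hypothetical sketch: reuse the training-time transform configs at inference.
# `build_transform` stands in for your project's actual transform factory.
def build_pipeline(transform_cfgs, build_transform):
    transforms = [build_transform(cfg) for cfg in transform_cfgs]

    def pipeline(sample):
        # Apply transforms in the same order as during training.
        for transform in transforms:
            sample = transform(sample)
        return sample

    return pipeline

# Pull the exact config list from train_dataloader so the two paths match.
infer_transform_cfgs = train_dataloader['dataset']['datasets'][0]['transforms']
```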