# Dataset Interfaces
This page consolidates the conventions for training and inference data interfaces, so that configuration items can be looked up quickly by field and by transform.
## Purpose

- Define training data sources (e.g., `ParquetDataset`)
- Define field mappings and statistic keys
- Organize transform pipelines
- Maintain consistency between training and inference data processing
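The last point can be made concrete: if the transform pipeline is defined once and referenced from both configs, the training and inference processing paths cannot drift apart. The sketch below is a hypothetical illustration; the `inference_pipeline` key and the shared-list pattern are assumptions, not a documented API of this page.

```python
# Hypothetical sketch: define the transform pipeline once and reference it
# from both the training and inference configs. The transform names mirror
# the minimal example on this page; `inference_pipeline` is an assumed key.
shared_transforms = [
    dict(type='ProcessParquetInputs'),
    dict(type='ResizeImages', height=224, width=224),
    dict(type='NormalizeImages'),
]

train_dataloader = dict(
    dataset=dict(type='ParquetDataset', transforms=shared_transforms))
inference_pipeline = dict(transforms=shared_transforms)

# Both configs now point at the same pipeline object, so editing the
# pipeline in one place keeps training and inference consistent.
assert train_dataloader['dataset']['transforms'] is inference_pipeline['transforms']
```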
## Core Parameters

### train_dataloader
| Field | Description |
|---|---|
| `per_device_batch_size` | Batch size per device |
| `per_device_num_workers` | Number of data loading workers per device |
| `dataset.type` | Dataset wrapper type (e.g., `DistributedRepeatingDataset`) |
| | Field name mappings |
| | Statistic fields |
| `data_root_path` | Data path list |
| `transforms` | Data processing pipeline |
### Highly Associated Transform Fields
| Transform | Common Fields |
|---|---|
| `ProcessParquetInputs` | |
| `ProcessPromptsWithImage` | `num_images` |
| `ResizeImages` | `height`, `width` |
| `NormalizeImages` | |
| `NormalizeStatesAndActions` | `state_dim`, `action_dim` |
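Transforms in pipelines like the one above are typically callables that receive a results dict, modify it, and return it, so they can be chained in order. The following is a minimal sketch under that assumption; the `results['images']` key, the callable-class pattern, and `run_pipeline` are illustrative, not this page's documented API.

```python
# Sketch of how a transform such as NormalizeImages might be implemented.
# The callable-class pattern and the 'images' key are assumptions.
class NormalizeImages:
    """Scale 8-bit pixel values into the [0, 1] range."""

    def __call__(self, results):
        results['images'] = [
            [pixel / 255.0 for pixel in row] for row in results['images']
        ]
        return results

def run_pipeline(results, transforms):
    """Chain transforms by passing the results dict through each in order."""
    for transform in transforms:
        results = transform(results)
    return results

out = run_pipeline({'images': [[0, 255]]}, [NormalizeImages()])
# out['images'] is now [[0.0, 1.0]]
```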
## Minimal Example
```python
train_dataloader = dict(
    per_device_batch_size=8,
    per_device_num_workers=4,
    dataset=dict(
        type='DistributedRepeatingDataset',
        datasets=[
            dict(
                type='ParquetDataset',
                data_root_path=['./datasets/your_dataset'],
                transforms=[
                    dict(type='ProcessParquetInputs'),
                    dict(type='ProcessPromptsWithImage', num_images=3),
                    dict(type='ResizeImages', height=224, width=224),
                    dict(type='NormalizeImages'),
                    dict(type='NormalizeStatesAndActions', state_dim=64, action_dim=32),
                ]),
        ]))
```
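Dict-style configs like this are usually consumed by a registry/builder: the `type` string selects a registered class and the remaining keys become constructor arguments. Here is a minimal sketch of that pattern; the `REGISTRY`, `register`, and `build` names, and the stub `ResizeImages` class, are illustrative stand-ins, not this project's actual implementation.

```python
# Hypothetical registry/builder sketch: `type` selects the class, the
# remaining dict keys become keyword arguments to its constructor.
REGISTRY = {}

def register(cls):
    """Register a class under its own name."""
    REGISTRY[cls.__name__] = cls
    return cls

def build(cfg):
    """Instantiate the class named by cfg['type'] with the other keys."""
    cfg = dict(cfg)  # copy so the original config is not mutated
    cls = REGISTRY[cfg.pop('type')]
    return cls(**cfg)

@register
class ResizeImages:
    """Illustrative stub; a real transform would also implement __call__."""
    def __init__(self, height, width):
        self.height, self.width = height, width

resize = build(dict(type='ResizeImages', height=224, width=224))
# resize is a ResizeImages instance with height=224, width=224
```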