# Real-Robot Data Preparation

## Overview

Training data collected from physical robots is commonly stored in HDF5 format. This document describes how to convert such data into the LeRobot Dataset v2.1 format for use with FluxVLA training.

The data conversion script is available from the project repository; refer to the project README for details.
## Data Format Requirements

### Input Data Format (HDF5)

HDF5 files should follow the `episode_*.hdf5` naming convention. The conversion script recursively searches the specified directory for all matching files.
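The recursive search described above can be sketched with `pathlib` (the function name is illustrative, not the converter's actual API):

```python
from pathlib import Path

def find_episode_files(root: str) -> list[Path]:
    """Recursively collect episode_*.hdf5 files, sorted for a stable ordering."""
    return sorted(Path(root).rglob("episode_*.hdf5"))
```

Sorting matters when episode indices in the output dataset are assigned in discovery order.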
#### Required Fields

##### /observations/qpos - Robot Joint Positions

- Data type: `float32` or `float64`
- Shape: `[num_frames, 14]` or `[num_frames, 16]` (16-dimensional data is automatically converted to 14-dimensional)
- Joint order: left arm (7 joints) + right arm (7 joints)

Format details:

- 16-dimensional format: gripper aperture is represented by the absolute positions of two gripper fingers (8 dimensions total)
- 14-dimensional format: gripper aperture is represented by a relative position normalized to `[0, 0.1]`
- Conversion formula: `gripper_value = (left_finger - right_finger) * (0.1 / 0.07)`
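The 16-to-14 conversion can be sketched as below. The per-arm layout (6 arm joints followed by two finger positions) is an assumption for illustration; the document only specifies the conversion formula and the total dimensions.

```python
import numpy as np

GRIPPER_SCALE = 0.1 / 0.07  # maps the finger difference into the [0, 0.1] range

def qpos_16_to_14(qpos: np.ndarray) -> np.ndarray:
    """Collapse each arm's two absolute finger positions into one relative
    gripper value. Assumes a per-arm layout of [6 joints, left_finger,
    right_finger]; that layout is an assumption, not confirmed by the doc."""
    assert qpos.shape[-1] == 16
    out = np.empty(qpos.shape[:-1] + (14,), dtype=np.float32)
    for arm in range(2):
        src, dst = 8 * arm, 7 * arm
        out[..., dst:dst + 6] = qpos[..., src:src + 6]       # arm joints
        left, right = qpos[..., src + 6], qpos[..., src + 7]
        out[..., dst + 6] = (left - right) * GRIPPER_SCALE   # gripper aperture
    return out
```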
##### /observations/images/<camera_name> - Camera Images

- Supported cameras: `head_cam`, `left_cam`, `right_cam`
- Format: either of the following
  - Uncompressed 4-dimensional numpy array: `[num_frames, height, width, channels]` (`uint8`)
  - JPEG-compressed byte stream: `[num_frames]` (automatically decoded to RGB)
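A sketch of handling both storage variants, distinguishing them by array rank; the use of Pillow for JPEG decoding is illustrative, not necessarily what the converter uses:

```python
import io
import numpy as np
from PIL import Image  # used only for the JPEG-compressed case

def frames_to_rgb(arr: np.ndarray) -> np.ndarray:
    """Normalize a camera dataset to [num_frames, H, W, 3] uint8 RGB frames."""
    if arr.ndim == 4:
        # Already uncompressed uint8 frames
        return arr.astype(np.uint8)
    # Otherwise: a 1-D array of per-frame JPEG byte strings
    decoded = [np.asarray(Image.open(io.BytesIO(buf)).convert("RGB")) for buf in arr]
    return np.stack(decoded)
```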
#### Optional Fields

##### /action - Robot Desired Joint Positions

- Data type: `float32` or `float64`
- Shape: `[num_frames, 14]` or `[num_frames, 16]` (16-dimensional data is automatically converted to 14-dimensional)
##### /observations/eepose - End-Effector Pose

- Data type: `float32` or `float64`
- Shape: `[num_frames, 14]`
- Description: contains position (x, y, z) and quaternion (qx, qy, qz, qw) for both the left and right end-effectors
##### /observations/images_depth/<camera_name>_depth - Depth Images

- Supported cameras: `head_cam`, `left_cam`, `right_cam`
- Data type: `uint16` (values in millimeters)
- Shape: `[num_frames, height, width]`
- Description: requires `add_infos = ["depth"]` to be set in the `DatasetConfig` for processing
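Since several input fields are optional, it can help to inspect an episode file before conversion. A minimal sketch with `h5py` (the helper and its key list are illustrative):

```python
import h5py

# Optional HDF5 paths listed in this section
OPTIONAL_KEYS = ["action", "observations/eepose", "observations/images_depth"]

def episode_optional_fields(path: str) -> dict:
    """Report which optional datasets/groups an episode file provides."""
    with h5py.File(path, "r") as f:
        return {key: key in f for key in OPTIONAL_KEYS}
```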
### Output Data Format (LeRobot v2.1)

The converted dataset adopts the LeRobot v2.1 format, stored in a HuggingFace Datasets-compatible directory structure:

    <output_dir>/<repo_id>/
    ├── data/
    │   ├── train/
    │   │   ├── episode_0.parquet
    │   │   └── ...
    │   └── video/
    │       ├── episode_0/
    │       │   ├── observation.images.head_cam.mp4
    │       │   └── ...
    │       └── ...
    ├── info.json
    └── meta.json
### Data Field Descriptions

Each episode's parquet file contains the following fields:
#### observation.state

- Type: `float32`
- Shape: `(14,)`
- Description: robot joint states; field ordering is identical to the input `qpos`
#### observation.images.<camera_name>

- Type: `VideoFrame` object
- Description: camera image reference containing the video file path and frame timestamp
- Supported cameras: `head_cam`, `left_cam`, `right_cam`
- Video specifications:
  - Frame rate: 30 FPS
  - Image dimensions: `(480, 640, 3)`
  - Storage location: MP4 files in the `video/` directory
#### action (Optional)

- Type: `float32`
- Shape: `(14,)`
- Description: robot actions; generated only when the input contains the `/action` field
#### observation.eepose (Optional)

- Type: `float32`
- Shape: `(14,)`
- Description: end-effector pose with field ordering identical to the input; generated only when the input contains `/observations/eepose`
#### observation.depth.<camera_name> (Optional)

- Type: `uint16`
- Shape: `(480, 640)`
- Description: depth images; generated only when the input contains depth images and `add_infos = ["depth"]` is configured
#### task

- Type: `string`
- Default value: `"pick up the yellow banana and put it on the pink plate"`
- Description: task label; customizable via the `init_task` parameter