Real-Robot Data Preparation#

Overview#

Training data collected from physical robots is commonly stored in HDF5 format. This document describes how to convert such data into the LeRobot Dataset v2.1 format for use with FluxVLA training.

The data conversion script is available from the project repository; refer to the project README for usage details.


Data Format Requirements#

Input Data Format (HDF5)#

HDF5 files should follow the episode_*.hdf5 naming convention. The conversion script recursively searches the specified directory for all matching files.
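
For illustration, the discovery step can be sketched as follows (the script's actual traversal logic may differ):

from pathlib import Path

def find_episode_files(data_dir):
    # Recursively collect every HDF5 file matching the episode_*.hdf5 convention.
    return sorted(Path(data_dir).rglob("episode_*.hdf5"))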

Required Fields#

/observations/qpos - Robot Joint Positions#
  • Data type: float32 or float64

  • Shape: [num_frames, 14] or [num_frames, 16] (16-dimensional data is automatically converted to 14-dimensional)

  • Joint order: Left arm (7 joints) + Right arm (7 joints)

  • Format details:

    • 16-dimensional format: Each gripper's aperture is represented by the absolute positions of its two fingers (8 dimensions per arm)

    • 14-dimensional format: Each gripper's aperture is represented by a single relative position normalized to [0, 0.1]

    • Conversion formula: gripper_value = (left_finger - right_finger) * (0.1 / 0.07), applied per arm (see the sketch below)
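
A minimal sketch of the 16-to-14-dimensional conversion, assuming each arm's 8-dimensional block is laid out as 6 arm joints followed by its two finger positions (the exact index layout is an assumption for illustration):

import numpy as np

def qpos_16_to_14(qpos16):
    # qpos16: [num_frames, 16]; assumed per-arm layout: 6 joints + 2 finger positions.
    # Collapses each arm's two absolute finger positions into a single relative
    # gripper value normalized to [0, 0.1].
    out = np.empty((qpos16.shape[0], 14), dtype=np.float32)
    for arm in range(2):                        # 0 = left arm, 1 = right arm
        src, dst = arm * 8, arm * 7
        out[:, dst:dst + 6] = qpos16[:, src:src + 6]           # arm joints
        left_f = qpos16[:, src + 6]
        right_f = qpos16[:, src + 7]
        out[:, dst + 6] = (left_f - right_f) * (0.1 / 0.07)    # gripper aperture
    return out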

/observations/images/<camera_name> - Camera Images#
  • Supported cameras: head_cam, left_cam, right_cam

  • Format (one of the following; see the sketch below):

    • An uncompressed 4-dimensional numpy array [num_frames, height, width, channels] (uint8)

    • An array of JPEG-compressed byte strings with shape [num_frames] (automatically decoded to RGB)
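
Both encodings can be read with a sketch like the following (assuming h5py and OpenCV for JPEG decoding; the script's actual decoder may differ):

import h5py
import numpy as np
import cv2

def load_frames(h5_path, camera="head_cam"):
    with h5py.File(h5_path, "r") as f:
        ds = f[f"/observations/images/{camera}"]
        if ds.ndim == 4:
            # Uncompressed [num_frames, H, W, 3] uint8 array.
            return ds[:]
        # JPEG-compressed: one byte string per frame; decode each to RGB.
        frames = []
        for buf in ds[:]:
            bgr = cv2.imdecode(np.frombuffer(buf, dtype=np.uint8), cv2.IMREAD_COLOR)
            frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
        return np.stack(frames)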

Optional Fields#

/action - Robot Desired Joint Positions#
  • Data type: float32 or float64

  • Shape: [num_frames, 14] or [num_frames, 16] (16-dimensional data is automatically converted to 14-dimensional)

/observations/eepose - End-Effector Pose#
  • Data type: float32 or float64

  • Shape: [num_frames, 14]

  • Description: Contains position (x, y, z) and quaternion (qx, qy, qz, qw) for the left and right end-effectors (7 values per arm; see the sketch below)
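
Assuming the same left-then-right ordering as qpos, a single frame can be unpacked like this (illustrative only):

def split_eepose(eepose_frame):
    # eepose_frame: shape (14,) = left [x, y, z, qx, qy, qz, qw] + right [same].
    left, right = eepose_frame[:7], eepose_frame[7:]
    return {"left":  {"pos": left[:3],  "quat": left[3:]},
            "right": {"pos": right[:3], "quat": right[3:]}}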

/observations/images_depth/<camera_name>_depth - Depth Images#
  • Supported cameras: head_cam, left_cam, right_cam

  • Data type: uint16 (values in millimeters)

  • Shape: [num_frames, height, width]

  • Description: Requires add_infos = ["depth"] to be set in the DatasetConfig for processing (see the sketch below)
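
A hedged sketch of that configuration (the DatasetConfig below is an illustrative stand-in; only the add_infos field is documented, and the real class lives in the conversion script):

from dataclasses import dataclass, field

@dataclass
class DatasetConfig:
    # Illustrative stand-in for the conversion script's config class;
    # only add_infos is documented here, other fields are omitted.
    add_infos: list = field(default_factory=list)

config = DatasetConfig(add_infos=["depth"])  # enables depth-image conversion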


Output Data Format (LeRobot v2.1)#

The converted dataset adopts the LeRobot v2.1 format, stored in a HuggingFace Datasets-compatible directory structure:

<output_dir>/<repo_id>/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ train/
β”‚   β”‚   β”œβ”€β”€ episode_0.parquet
β”‚   β”‚   └── ...
β”‚   └── video/
β”‚       β”œβ”€β”€ episode_0/
β”‚       β”‚   β”œβ”€β”€ observation.images.head_cam.mp4
β”‚       β”‚   └── ...
β”‚       └── ...
β”œβ”€β”€ info.json
└── meta.json
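
Once converted, the dataset can be opened with the lerobot library. A minimal sketch, assuming a recent lerobot release in which LeRobotDataset accepts a local root directory (the exact import path can vary between lerobot versions):

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# root points at the local directory holding the converted dataset.
dataset = LeRobotDataset(repo_id="<repo_id>", root="<output_dir>/<repo_id>")
sample = dataset[0]
print(sample["observation.state"].shape)  # expected: torch.Size([14])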

Data Field Descriptions#

Each episode’s parquet file contains the following fields:

observation.state#
  • Type: float32

  • Shape: (14,)

  • Description: Robot joint states; field ordering is identical to the input qpos

observation.images.<camera_name>#
  • Type: VideoFrame object

  • Description: Camera image reference containing the video file path and frame timestamp (see the sketch after this list)

  • Supported cameras: head_cam, left_cam, right_cam

  • Video specifications:

    • Frame rate: 30 FPS

    • Image dimensions: (480, 640, 3)

    • Storage location: MP4 files in the video/ directory
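
Conceptually, a VideoFrame reference resolves to pixels by seeking to its timestamp in the referenced MP4. A rough sketch with OpenCV, using the fixed 30 FPS noted above:

import cv2

def read_video_frame(video_path, timestamp, fps=30.0):
    # Map the stored timestamp (seconds) to a frame index at the dataset's 30 FPS.
    cap = cv2.VideoCapture(str(video_path))
    cap.set(cv2.CAP_PROP_POS_FRAMES, round(timestamp * fps))
    ok, bgr = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"no frame at t={timestamp}s in {video_path}")
    return cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)  # (480, 640, 3) uint8 RGB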

action (Optional)#
  • Type: float32

  • Shape: (14,)

  • Description: Robot actions; generated only when the input contains the /action field

observation.eepose (Optional)#
  • Type: float32

  • Shape: (14,)

  • Description: End-effector pose with field ordering identical to the input; generated only when the input contains /observations/eepose

observation.depth.<camera_name> (Optional)#
  • Type: uint16

  • Shape: (480, 640)

  • Description: Depth images; generated only when the input contains depth images and add_infos = ["depth"] is configured

task#
  • Type: string

  • Default value: "pick up the yellow banana and put it on the pink plate"

  • Description: Task label; customizable via the init_task parameter