RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

*Equal contribution; names are in random order

Abstract

Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning.

Method

RoSHI combines nine low-cost IMU trackers (~$30 each, BNO085-based, wireless at 100 Hz) with Project Aria glasses to capture synchronized 3D body pose, egocentric RGB video, and a globally consistent root trajectory. The two modalities are complementary: IMUs provide robustness to visual occlusion and high-speed motion, while egocentric SLAM anchors long-horizon global localization and stabilizes upper-body pose.
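To give a concrete sense of the synchronization step, the sketch below resamples the 100 Hz IMU orientation stream onto the Aria frame timestamps with SLERP. It assumes both streams have already been mapped onto a common clock; the function and variable names are illustrative, not the released API.

```python
# Minimal sketch: resampling 100 Hz IMU orientations onto Aria frame
# timestamps. Assumes timestamps share a common clock after network
# synchronization; names are illustrative, not the released API.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def resample_imu_to_frames(imu_t, imu_quat_xyzw, frame_t):
    """SLERP-interpolate IMU orientations (N, 4 xyzw) at Aria frame times."""
    slerp = Slerp(imu_t, Rotation.from_quat(imu_quat_xyzw))
    # Clamp frame times into the IMU record to avoid extrapolation.
    frame_t = np.clip(frame_t, imu_t[0], imu_t[-1])
    return slerp(frame_t).as_quat()  # (M, 4) xyzw

# Example: 100 Hz IMU stream resampled at ~30 fps RGB frame times.
imu_t = np.arange(0.0, 10.0, 0.01)                  # 100 Hz timestamps
imu_q = Rotation.random(len(imu_t)).as_quat()       # stand-in orientation data
frame_t = np.arange(0.0, 10.0, 1.0 / 30.0)          # ~30 fps frame times
frame_q = resample_imu_to_frames(imu_t, imu_q, frame_t)
```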

Calibration requires only a short 20–40 second video captured with an iPhone while wearing the suit. Each tracker carries a rigidly mounted AprilTag; combining the tag detections with SAM 3D Body estimates from the calibration video recovers the sensor-to-bone and cross-sensor heading alignments without a box-calibration procedure or prescribed poses. This also allows quick recalibration at any time without removing the IMUs.
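As an illustration of the alignment idea, the following sketch recovers a constant sensor-to-bone rotation by composing per-frame world-frame orientations of the tag-equipped sensor (from AprilTag detection) with those of the corresponding bone (from SAM 3D Body), then averaging the per-frame offsets. It is a hypothetical reconstruction under those assumptions, not the released calibration code.

```python
# Sketch of recovering a constant sensor-to-bone rotation from per-frame
# tag poses (IMU side) and SAM 3D Body bone orientations (vision side).
# Frames are assumed already time-synced; this is an illustration only.
import numpy as np
from scipy.spatial.transform import Rotation

def average_rotation(rotations):
    """Chordal-mean rotation via the dominant eigenvector of sum(q q^T)."""
    Q = np.stack([r.as_quat() for r in rotations])   # (N, 4), xyzw
    Q[Q[:, 3] < 0] *= -1                             # resolve quaternion sign
    _, eigvecs = np.linalg.eigh(Q.T @ Q)
    return Rotation.from_quat(eigvecs[:, -1])        # largest eigenvalue

def solve_sensor_to_bone(R_sensor_world, R_bone_world):
    """Per-frame offset with R_bone = R_sensor * R_s2b, averaged over frames."""
    offsets = [rs.inv() * rb for rs, rb in zip(R_sensor_world, R_bone_world)]
    return average_rotation(offsets)
```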

For body pose generation, we leverage the EgoAllo diffusion model conditioned on the 6-DoF headset trajectory from Aria SLAM. We guide diffusion using bone orientations derived from the IMU trackers and enforce three complementary constraints: (i) direct comparison of observable joint angles (elbow, hip, knee), (ii) relative orientation consistency between the pelvis and shoulders, and (iii) temporal smoothness of pelvis-joint rotations across consecutive frames.
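The three constraints can be written as differentiable losses on bone rotations and added to the diffusion guidance objective. The PyTorch sketch below is one plausible formulation; the rotation-matrix parameterization, index sets (`joints`, `pelvis`, `shoulders`), and weights are assumptions rather than the paper's exact implementation.

```python
# Illustrative sketch of the three IMU guidance terms as differentiable
# losses on per-frame global bone rotations (e.g., applied to intermediate
# diffusion samples). Parameterization and weights are assumptions.
import torch

def geodesic(R1, R2):
    """Geodesic angle between rotation matrices of shape (..., 3, 3)."""
    trace = (R1.transpose(-1, -2) @ R2).diagonal(dim1=-2, dim2=-1).sum(-1)
    return torch.arccos(((trace - 1.0) / 2.0).clamp(-1 + 1e-6, 1 - 1e-6))

def imu_guidance_loss(R_pred, R_imu, joints, pelvis, shoulders, w=(1.0, 0.5, 0.1)):
    """
    R_pred: (T, J, 3, 3) predicted global bone rotations
    R_imu:  (T, J, 3, 3) IMU-derived bone rotations (valid at tracked bones)
    joints: indices of directly observed joints (elbows, hips, knees)
    """
    # (i) direct agreement at observable joints
    l_joint = geodesic(R_pred[:, joints], R_imu[:, joints]).mean()
    # (ii) pelvis-shoulder relative orientation consistency
    rel_pred = R_pred[:, pelvis].transpose(-1, -2).unsqueeze(1) @ R_pred[:, shoulders]
    rel_imu = R_imu[:, pelvis].transpose(-1, -2).unsqueeze(1) @ R_imu[:, shoulders]
    l_rel = geodesic(rel_pred, rel_imu).mean()
    # (iii) temporal smoothness of the pelvis rotation across frames
    l_smooth = geodesic(R_pred[:-1, pelvis], R_pred[1:, pelvis]).mean()
    return w[0] * l_joint + w[1] * l_rel + w[2] * l_smooth
```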

RoSHI method pipeline.

Results

Qualitative

Localization

Video coming soon

Jumping Jack

Video coming soon

Agility

Video coming soon

Climbing

Video coming soon

Long Horizon — Coffee Making

Egocentric RGB + Human Pose + 3D Scene Reconstruction

Video coming soon

Quantitative

We evaluate RoSHI on 11 motion sequences across three datasets covering in-place motions, locomotion with global translation, and agile sports-like activities. RoSHI achieves the best mean per-joint position error (MPJPE) across all three datasets and the best joint angle error (JAE) on two of three, showing consistent improvements in both global joint localization and articulated pose reconstruction over egocentric baselines.

Method              Egocentric   Dataset 1              Dataset 2              Dataset 3
                                 MPJPE (cm)  JAE (deg)  MPJPE (cm)  JAE (deg)  MPJPE (cm)  JAE (deg)
SAM3D*              no           10.3        10.5       10.5        10.7       21.6        11.2
IMU-only (naive)    yes          16.7        12.6       18.8        12.2       16.1         8.9
IMU + EgoAllo root  yes          12.7        12.5       11.9        12.2       12.5         8.7
EgoAllo             yes          10.6        15.6       10.0        14.1       11.7        17.5
RoSHI (ours)        yes           9.6        12.0        9.9        11.0       10.3        15.6

MPJPE is computed in the OptiTrack world frame; JAE is computed from parent–child bone directions and is independent of global/root pose. *SAM3D relies on an external calibrated camera and is therefore listed for reference rather than as a directly comparable baseline.
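For concreteness, the two metrics as described above can be computed as follows; the joint layout and the `parents` array are illustrative, not the evaluation code itself.

```python
# Sketch of the two metrics: MPJPE as mean Euclidean joint error in the
# shared (OptiTrack) world frame, and JAE as the angle between predicted
# and ground-truth parent->child bone directions, which does not depend
# on the estimated global translation of the root.
import numpy as np

def mpjpe_cm(pred, gt):
    """pred, gt: (T, J, 3) joint positions in meters, same world frame."""
    return 100.0 * np.linalg.norm(pred - gt, axis=-1).mean()

def jae_deg(pred, gt, parents):
    """parents: (J,) parent indices with the root first (parents[0] unused)."""
    def bone_dirs(x):
        d = x[:, 1:] - x[:, parents[1:]]             # skip the root joint
        return d / np.linalg.norm(d, axis=-1, keepdims=True)
    cos = (bone_dirs(pred) * bone_dirs(gt)).sum(-1).clip(-1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```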

Dataset 1: In-Place Motions

Walk / March / Jog / Run

Stretch / Boxing / Bow / Wave

Jumping Jack / Squat / One-Leg Squat

Pick Up Box

Dataset 2: Locomotion with Global Translation

Walk / Say Hi / Walk

Pickup / Walk Around

Walk / Jog Back and Forth

Jump Around

Dataset 3: Agile Activities

Sliding

Tennis

Ball Throwing / Catching

Real Robot Experiments


Experiment 1

Video coming soon

Experiment 2

Video coming soon

Code & Resources

The RoSHI system is organized into modular repositories. All repositories will be publicly available.

RoSHI Core Algorithm

Core

Full pose estimation pipeline: IMU-guided EgoAllo diffusion, sensor fusion, calibration processing, and SMPL body model output.

RoSHI-App

Calibration

iOS app for calibrating the RoSHI wearable system. Captures RGB video with real-time AprilTag detection and synchronizes with 9 body-mounted IMU sensors over LAN.
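As a sketch of what the LAN side of such a capture might look like, the snippet below receives timestamped quaternion packets from the trackers over UDP. The packet layout (tracker id, microsecond timestamp, xyzw quaternion) and the port are assumptions for illustration, not the app's actual wire format.

```python
# Hypothetical receiver for tracker packets over LAN. The wire format
# here (uint8 id, uint64 microsecond timestamp, 4x float32 quaternion)
# is an assumption for illustration only.
import socket
import struct

PACKET = struct.Struct("<BQ4f")  # id, t_us, quaternion (xyzw)

def receive_imu_packets(port=9000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        data, _ = sock.recvfrom(PACKET.size)
        tracker_id, t_us, *quat = PACKET.unpack(data)
        yield tracker_id, t_us * 1e-6, quat  # seconds, xyzw

# Usage: iterate the generator and bucket samples by tracker_id.
```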

RoSHI-Hardware

Hardware

Hardware design files, 3D-printable enclosures, BOM, and ESP32 firmware for the 9 wireless IMU trackers (BNO085-based, 100 Hz).

Refer to the documentation for detailed setup instructions.

Acknowledgements

TODO: Add acknowledgements here.