RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning.
RoSHI combines nine low-cost IMU trackers (~$30 each, BNO085-based, wireless at 100 Hz) with Project Aria glasses to capture synchronized 3D body pose, egocentric RGB video, and a globally consistent root trajectory. The two modalities are complementary: IMUs provide robustness to visual occlusion and high-speed motion, while egocentric SLAM anchors long-horizon global localization and stabilizes upper-body pose.
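To make the synchronization step concrete, here is a minimal sketch of time-aligning the 100 Hz IMU orientation streams to the Aria camera timestamps via spherical interpolation. The function name and stream layout are assumptions for illustration, not the released API.

```python
# Sketch: align a 100 Hz IMU orientation stream to Aria camera timestamps.
# `align_imu_to_frames` and the array layout are illustrative assumptions.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def align_imu_to_frames(imu_t, imu_quats, frame_t):
    """imu_t: (N,) seconds; imu_quats: (N, 4) xyzw quaternions from one tracker;
    frame_t: (M,) camera timestamps on the shared clock.
    Returns (M, 4) tracker orientations interpolated at the camera timestamps."""
    slerp = Slerp(imu_t, Rotation.from_quat(imu_quats))
    # Clamp to the IMU record so out-of-range frame times do not extrapolate.
    t = np.clip(frame_t, imu_t[0], imu_t[-1])
    return slerp(t).as_quat()
```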
Calibration requires only a short 20–40 second video captured by an iPhone while wearing the suit. Each tracker has a rigidly mounted AprilTag; combined with SAM 3D Body estimates from the calibration video, this recovers sensor-to-bone and cross-sensor heading alignments without a box calibration or prescribed poses. It enables quick recalibration at any time without removing the IMUs.
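To illustrate the idea: if every calibration-video frame yields a world-frame rotation for a tracker's AprilTag and for the corresponding SAM 3D Body bone, the constant mounting offset can be estimated per frame and then averaged. The helper below is a hypothetical sketch of that step, not the released calibration code.

```python
# Sketch: recover the constant sensor-to-bone mounting rotation from
# per-frame world-frame estimates. `sensor_to_bone_offset` is a hypothetical
# helper; the released pipeline may solve this differently.
from scipy.spatial.transform import Rotation

def sensor_to_bone_offset(R_world_tag: Rotation, R_world_bone: Rotation) -> Rotation:
    """Both arguments are Rotation stacks of length N (one entry per frame).
    Because the mounting offset is rigid, each frame gives an estimate
    R_world_bone^-1 * R_world_tag; averaging over frames suppresses noise."""
    per_frame = R_world_bone.inv() * R_world_tag
    return per_frame.mean()  # chordal L2 mean over all frame estimates
```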
For body pose generation, we leverage the EgoAllo diffusion model conditioned on the 6-DoF headset trajectory from Aria SLAM. We guide diffusion using bone orientations derived from the IMU trackers and enforce three complementary constraints: (i) direct comparison of observable joint angles (elbow, hip, knee), (ii) relative orientation consistency between the pelvis and shoulders, and (iii) temporal smoothness of pelvis-joint rotations across consecutive frames.
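The sketch below shows one way these three guidance terms could be written as losses over per-frame bone rotation matrices. The joint names, equal weighting, and geodesic-distance formulation are our assumptions for illustration, not the exact implementation.

```python
# Sketch of the three guidance terms, assuming `pred` (diffusion sample) and
# `imu` (tracker-derived) are dicts of (T, 3, 3) rotation matrices keyed by
# joint name. Joint names and unit weights are illustrative assumptions.
import torch

def geodesic(Ra, Rb):
    """Geodesic angle (radians) between rotation batches of shape (T, 3, 3)."""
    trace = (Ra.transpose(-1, -2) @ Rb).diagonal(dim1=-2, dim2=-1).sum(-1)
    return torch.acos(((trace - 1) / 2).clamp(-1 + 1e-6, 1 - 1e-6))

def guidance_loss(pred, imu):
    # (i) directly observable joint orientations (elbows, hips, knees)
    observable = ["l_elbow", "r_elbow", "l_hip", "r_hip", "l_knee", "r_knee"]
    l_joint = sum(geodesic(pred[j], imu[j]).mean() for j in observable)

    # (ii) relative orientation consistency between pelvis and shoulders
    l_rel = 0.0
    for s in ["l_shoulder", "r_shoulder"]:
        rel_pred = pred["pelvis"].transpose(-1, -2) @ pred[s]
        rel_imu = imu["pelvis"].transpose(-1, -2) @ imu[s]
        l_rel = l_rel + geodesic(rel_pred, rel_imu).mean()

    # (iii) temporal smoothness of the pelvis rotation across frames
    l_smooth = geodesic(pred["pelvis"][1:], pred["pelvis"][:-1]).mean()

    return l_joint + l_rel + l_smooth
```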
Localization
Video coming soon
Jumping Jack
Video coming soon
Agility
Video coming soon
Climbing
Video coming soon
Long Horizon — Coffee Making
Egocentric RGB + Human Pose + 3D Scene Reconstruction
Video coming soon
We evaluate RoSHI on 11 motion sequences across three datasets covering in-place motions, locomotion with global translation, and agile sports-like activities. RoSHI achieves the best mean per-joint position error (MPJPE) across all three datasets and the best joint angle error (JAE) on two of three, showing consistent improvements in both global joint localization and articulated pose reconstruction over egocentric baselines.
| Method | Egocentric | Dataset 1 MPJPE (cm) | Dataset 1 JAE (deg) | Dataset 2 MPJPE (cm) | Dataset 2 JAE (deg) | Dataset 3 MPJPE (cm) | Dataset 3 JAE (deg) |
|---|---|---|---|---|---|---|---|
| SAM3D | ✗ | 10.3 | 10.5 | 10.5 | 10.7 | 21.6 | 11.2 |
| IMU-only (naive) | ✓ | 16.7 | 12.6 | 18.8 | 12.2 | 16.1 | 8.9 |
| IMU + EgoAllo root | ✓ | 12.7 | 12.5 | 11.9 | 12.2 | 12.5 | 8.7 |
| EgoAllo | ✓ | 10.6 | 15.6 | 10.0 | 14.1 | 11.7 | 17.5 |
| RoSHI (ours) | ✓ | 9.6 | 12.0 | 9.9 | 11.0 | 10.3 | 15.6 |
MPJPE is computed in the OptiTrack world frame; JAE is computed from parent–child bone directions (independent of global/root pose). SAM3D relies on an external calibrated camera and is therefore not a fair baseline (shown in gray).
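For reference, both metrics can be sketched as follows, assuming (T, J, 3) joint-position arrays in the OptiTrack world frame and a parent index per joint; the array layout and parent table are assumptions about the evaluation code.

```python
# Sketch of the two reported metrics; array shapes and the parent table are
# assumptions, not the actual evaluation code.
import numpy as np

def mpjpe_cm(pred, gt):
    """Mean per-joint position error in cm; pred, gt: (T, J, 3) in meters,
    both expressed in the OptiTrack world frame."""
    return 100.0 * np.linalg.norm(pred - gt, axis=-1).mean()

def jae_deg(pred, gt, parents):
    """Joint angle error in degrees from parent-child bone directions, which
    makes the metric independent of global/root pose. parents[j] is the index
    of joint j's parent, or -1 for the root."""
    errors = []
    for j, p in enumerate(parents):
        if p < 0:  # the root has no parent bone
            continue
        a = pred[:, j] - pred[:, p]
        b = gt[:, j] - gt[:, p]
        a /= np.linalg.norm(a, axis=-1, keepdims=True)
        b /= np.linalg.norm(b, axis=-1, keepdims=True)
        cos = np.clip((a * b).sum(-1), -1.0, 1.0)
        errors.append(np.degrees(np.arccos(cos)))
    return float(np.mean(np.concatenate(errors)))
```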
Walk / March / Jog / Run
Stretch / Boxing / Bow / Wave
Jumping Jack / Squat / One-Leg Squat
Pick Up Box
Walk / Say Hi / Walk
Pickup / Walk Around
Walk / Jog Back and Forth
Jump Around
Sliding
Tennis
Ball Throwing / Catching
TODO: Add real robot experiment videos here.
Experiment 1
Video coming soon
Experiment 2
Video coming soon
The RoSHI system is organized into modular repositories. All repositories will be publicly available.
RoSHI Core Algorithm
Full pose estimation pipeline: IMU-guided EgoAllo diffusion, sensor fusion, calibration processing, and SMPL body model output.
RoSHI-App
iOS app for calibrating the RoSHI wearable system. Captures RGB video with real-time AprilTag detection and synchronizes with the 9 body-mounted IMU sensors over LAN.
RoSHI-Hardware
Hardware design files, 3D-printable enclosures, BOM, and ESP32 firmware for the 9 wireless IMU trackers (BNO085-based, 100 Hz).
Refer to the documentation for detailed setup instructions.
TODO: Add acknowledgements here.