Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning
Yukai Ma 1,2 , Joe Lin 3 , Liu Liu 1,4 , Honglin He 1 , Lulu Ricketts 3
Brad Squicciarini 3 , Yong Liu 2 , Bolei Zhou 1,3
1 University of California, Los Angeles , 2 Zhejiang University , 3 Coco Robotics , 4 Massachusetts Institute of Technology
TL;DR
- WalkOCC is a hybrid ray-marching 3D semantic occupancy learning framework for sidewalk robots. It couples geometry grounding from limited paired LiDAR-RGB sequences with scalable learning from large-scale unpaired monocular images, improving robustness and generalization without costly 3D annotations.
🧭 It learns reliable sidewalk 3D occupancy from scarce paired sensor data by bootstrapping pseudo-3D supervision, stabilizing training compared to purely self-supervised pipelines.
🧩 It scales to diverse real-world appearances via mixed training on additional 2D-only images, strengthening cross-domain generalization beyond the paired-data distribution.
📦 We introduce Sidewalk3D, a large-scale, cross-domain sidewalk perception dataset with LiDAR-camera paired sequences across multiple locations and times, plus 3D semantic occupancy annotations for benchmarking.
Visualization of Cross-Embodiment Inference
Coco Delivery Robot. A wheeled robot with a front-facing fisheye camera, approximately 40 cm tall, primarily used for last-mile food and parcel delivery on sidewalks.
Navigation Planning with OCC supervision
Case 1. The Coco robot waits before the crosswalk until pedestrians and vehicles have passed.
Diverse Test Set Inference Visualization
Our proposed Sidewalk3D dataset captures diverse appearances across regions and time periods (daytime and nighttime), providing a challenging benchmark for urban sidewalk occupancy prediction.
Long-Horizon Inference Visualization
Long-horizon demo on a wheeled-legged robot dog. The robot runs along a sidewalk in a residential area in Los Angeles.
WalkOCC Model Architecture
We present WalkOCC, a hybrid ray-marching-based occupancy-learning framework for sidewalk occupancy prediction using a monocular RGB camera. Our approach consists of two key components: (i) a depth-aware lifting architecture that transforms front-view images into 3D semantic occupancy grids, and (ii) a hybrid training strategy that leverages both 2D and 3D supervision via a ray-marching-based 2D-3D consistency loss. Enforcing this consistency enables effective learning from large-scale 2D-only data while preserving geometric accuracy, which in turn improves prediction quality and cross-domain generalization.
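To make the ray-marching 2D-3D consistency idea concrete, here is a minimal NumPy sketch. This page does not specify the exact rendering or loss formulation, so the sketch assumes standard volume-rendering-style alpha compositing along a camera ray and a cross-entropy penalty against a 2D semantic label. The names `render_ray` and `consistency_loss`, the nearest-voxel lookup, and all parameters (`voxel_size`, `step`, `max_dist`) are hypothetical choices for illustration, not WalkOCC's actual implementation.

```python
import numpy as np

def render_ray(density, sem_probs, origin, direction,
               voxel_size=1.0, step=0.1, max_dist=10.0):
    """March one ray through a voxel grid and alpha-composite depth and semantics.

    density:   (X, Y, Z) non-negative per-voxel density (assumed representation)
    sem_probs: (X, Y, Z, C) per-voxel class probabilities
    Returns (expected depth, rendered class distribution, accumulated opacity).
    """
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    ts = np.arange(step / 2, max_dist, step)            # sample distances along the ray
    pts = np.asarray(origin, dtype=float) + ts[:, None] * direction
    idx = np.floor(pts / voxel_size).astype(int)        # nearest-voxel lookup, no interpolation
    inside = np.all((idx >= 0) & (idx < np.array(density.shape)), axis=1)
    ts, idx = ts[inside], idx[inside]

    sigma = density[idx[:, 0], idx[:, 1], idx[:, 2]]
    alpha = 1.0 - np.exp(-sigma * step)                 # opacity of each sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = trans * alpha

    depth = float(np.sum(weights * ts))
    sem = weights @ sem_probs[idx[:, 0], idx[:, 1], idx[:, 2]]
    return depth, sem, float(weights.sum())

def consistency_loss(rendered_sem, label_2d, eps=1e-8):
    """Cross-entropy between the rendered class distribution and a 2D pixel label."""
    return float(-np.log(rendered_sem[label_2d] + eps))

# Toy scene: empty 10x10x10 grid with a dense wall (class 2) at x index 5.
density = np.zeros((10, 10, 10))
sem_probs = np.zeros((10, 10, 10, 3))
sem_probs[..., 0] = 1.0                                 # class 0 everywhere by default
density[5] = 50.0
sem_probs[5] = np.array([0.0, 0.0, 1.0])

depth, sem, opacity = render_ray(density, sem_probs,
                                 origin=[0.5, 5.5, 5.5], direction=[1.0, 0.0, 0.0])
loss_match = consistency_loss(sem, 2)                   # 2D label agrees with the 3D grid
loss_mismatch = consistency_loss(sem, 1)                # 2D label disagrees
```

In this toy scene the ray terminates on the wall, so the rendered depth lands near the wall's front face and the loss is small when the 2D label matches the wall's class and large otherwise. Under these assumptions, the same rendering step lets 2D-only images supervise the 3D grid, since only a per-pixel label is needed on the image side.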
Reference
@article{ma2026monocular,
title={Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning},
author={Ma, Yukai and Lin, Joe and Liu, Liu and He, Honglin and Ricketts, Lulu and Squicciarini, Brad and Liu, Yong and Zhou, Bolei},
journal={arXiv preprint},
year={2026},
}