Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Yukai Ma 1,2 , Joe Lin 3 , Liu Liu 1,4 , Honglin He 1 , Lulu Ricketts 3
Brad Squicciarini 3 , Yong Liu 2 , Bolei Zhou 1,3
1 University of California, Los Angeles , 2 Zhejiang University , 3 Coco Robotics , 4 Massachusetts Institute of Technology

TL;DR

    WalkOCC is a hybrid, ray-marching-based 3D semantic occupancy learning framework for sidewalk robots. It couples geometry grounding from limited paired LiDAR-RGB sequences with scalable learning from large-scale unpaired monocular images, improving robustness and generalization without costly 3D annotations.

    🧭 It learns reliable sidewalk 3D occupancy from scarce paired sensor data by bootstrapping pseudo-3D supervision, stabilizing training compared to purely self-supervised pipelines.
    🧩 It scales to diverse real-world appearances via mixed training on additional 2D-only images, strengthening cross-domain generalization beyond the paired-data distribution.
    📦 We introduce Sidewalk3D, a large-scale, cross-domain sidewalk perception dataset with LiDAR-camera paired sequences across multiple locations and times, plus 3D semantic occupancy annotations for benchmarking.

Visualization of Cross-Embodiment Inference

Coco Delivery Robot. A wheeled robot with a front-facing fisheye camera, approximately 40 cm tall, primarily used for last-mile food and parcel delivery on sidewalks.

Case 1. The Coco robot waits for pedestrians and vehicles to pass before entering the crosswalk.

Diverse Test Set Inference Visualization

Our proposed Sidewalk3D dataset captures diverse appearances across regions and time periods (daytime and nighttime), providing a challenging benchmark for urban sidewalk occupancy prediction.

Long-Horizon Inference Visualization

Long-horizon demo on a wheeled-legged robot dog. The robot runs along a sidewalk in a residential area in Los Angeles.

WalkOCC Model Architecture

We present WalkOCC, a hybrid ray-marching-based occupancy-learning framework for sidewalk occupancy prediction using a monocular RGB camera. Our approach consists of two key components: (i) a depth-aware lifting architecture that transforms front-view images into 3D semantic occupancy grids, and (ii) a hybrid training strategy that leverages both 2D and 3D supervision via a ray-marching-based 2D-3D consistency loss. Enforcing this consistency enables effective learning from large-scale 2D-only data while preserving geometric accuracy, which in turn improves prediction quality and cross-domain generalization.
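
To make the idea of a ray-marching-based 2D-3D consistency loss more concrete, here is a minimal NumPy sketch for a single camera ray. It is not the WalkOCC implementation: the alpha-compositing formulation, the cross-entropy penalty, and the names `render_ray` and `consistency_loss` are illustrative assumptions, chosen because they are a common way to project per-voxel predictions back into the image so that a 2D label can supervise a 3D grid.

```python
import numpy as np

def render_ray(occ, sem_probs, depths):
    """Alpha-composite per-sample predictions along one camera ray.

    Hypothetical helper, not taken from the paper.
    occ:       (N,) occupancy probability in [0, 1] per sample, near to far
    sem_probs: (N, C) class probabilities per sample
    depths:    (N,) distance of each sample from the camera
    """
    # Probability that the ray reaches sample i without being blocked earlier.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - occ)[:-1]))
    # Probability that the ray terminates exactly at sample i.
    weights = transmittance * occ
    sem_2d = weights @ sem_probs    # (C,) rendered class scores for the pixel
    depth_2d = weights @ depths     # expected termination depth
    return sem_2d, depth_2d, weights

def consistency_loss(sem_2d, label, eps=1e-8):
    """Cross-entropy between rendered semantics and a 2D pixel label (assumed loss)."""
    # Renormalize, since the weights sum to less than 1 when the ray may pass through.
    p = sem_2d / (sem_2d.sum() + eps)
    return -np.log(p[label] + eps)

# Toy ray with three samples and two classes; only the middle sample is occupied.
occ = np.array([0.0, 1.0, 0.0])
sem_probs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
depths = np.array([1.0, 2.0, 3.0])
sem_2d, depth_2d, weights = render_ray(occ, sem_probs, depths)
loss = consistency_loss(sem_2d, label=1)
```

In this toy case the ray terminates at the second sample, so the rendered pixel takes that sample's class and depth, and the loss is near zero when the 2D label agrees. Under this assumed formulation, a 2D-only image can still shape the 3D grid because the gradient of the pixel loss flows back to every sample along the ray.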

Reference

@article{ma2026monocular,
         title={Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning},
         author={Ma, Yukai and Lin, Joe and Liu, Liu and He, Honglin and Ricketts, Lulu and Squicciarini, Brad and Liu, Yong and Zhou, Bolei},
         journal={arXiv preprint},
         year={2026},
}