
Framework Overview

BRIDGE pipeline. First, we train a depth-to-image model to synthesize millions of realistic RGB images with precise ground-truth depth, along with a teacher model for pseudo-labeling. A student model is then trained on this large synthetic dataset. Finally, the student is fine-tuned using mask-based refinement with the original ground-truth depth, for robust generalization and detailed depth capture.
Abstract
Monocular Depth Estimation (MDE) is a foundational task in computer vision. Traditional methods are limited by the scarcity and quality of training data, which hinders their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images from diverse source depth maps, each intrinsically paired with its ground-truth depth. We then train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground-truth depth for comprehensive and robust training. This data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches both quantitatively and in capturing detail in complex scenes, and yielding general and robust depth features.
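To make the hybrid supervision idea concrete, the sketch below shows one plausible way to combine the two label sources: ground-truth depth is the target wherever a validity mask marks it as reliable, and the teacher's pseudo-label is the target elsewhere. This is an illustrative assumption, not the loss used in BRIDGE. The function name `hybrid_l1_loss`, the per-pixel L1 penalty, and the way the mask is applied are all hypothetical, and any depth normalization is omitted.

```python
from typing import Sequence

DepthMap = Sequence[Sequence[float]]
Mask = Sequence[Sequence[bool]]


def hybrid_l1_loss(pred: DepthMap, pseudo: DepthMap,
                   gt: DepthMap, gt_valid: Mask) -> float:
    """Mean per-pixel L1 loss with a mask-selected target.

    Hypothetical sketch: where gt_valid is True the target is the
    ground-truth depth, otherwise it is the teacher pseudo-label.
    """
    total = 0.0
    count = 0
    for p_row, t_row, g_row, m_row in zip(pred, pseudo, gt, gt_valid):
        for p, t, g, m in zip(p_row, t_row, g_row, m_row):
            target = g if m else t  # pick the label source per pixel
            total += abs(p - target)
            count += 1
    if count == 0:
        raise ValueError("empty depth map")
    return total / count


# Toy 2x2 example: ground truth is trusted on the diagonal only.
pred = [[1.0, 2.0], [3.0, 4.0]]
pseudo = [[1.5, 2.0], [3.0, 5.0]]
gt = [[1.0, 0.0], [0.0, 4.0]]
gt_valid = [[True, False], [False, True]]
loss = hybrid_l1_loss(pred, pseudo, gt, gt_valid)
```

In this toy case the prediction matches the ground truth on the masked pixels and the pseudo-label on the rest, so the loss is zero. With an all-False mask the same call would fall back to supervision from the teacher alone.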
BibTeX
@misc{liu2025BRIDGE,
title={BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation},
author={Liu, Dingning and Guo, Haoyu and Zhou, Jingyi and He, Tong},
year={2025},
eprint={2509.25077},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.25077},
}