- Paper: Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
- Authors: Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu (UCLA)
- Venue: NeurIPS 2018
Introduction
Although humans can perceive a 3D scene rather effortlessly (within 200 ms; Potter, 1975, 1976; Schyns and Oliva, 1994; Thorpe et al., 1996), holistic 3D scene understanding remains a fundamental yet challenging problem in computer vision. The core difficulty is that a single RGB image carries a great deal of ambiguous 3D information. Holistic 3D scene understanding comprises several tasks:
- 3D camera pose estimation. The RGB image comes from a camera; knowing where the camera is and from what angle it took the picture helps improve the 2D-3D $consistency$.
- 3D scene layout estimation. This usually refers to indoor scenes; combined with the 3D camera pose, the layout reflects the $global~geometry$ of the scene.
- 3D object detection in the scene, which captures the $local~details$.
Problems this paper sets out to solve:
- 2D-3D consistency: agreement between the 2D image plane and the 3D world
- Cooperation: the human perceptual system is remarkably good at fusing different visual cues; algorithms should follow the same principle and let different modules work cooperatively
- Physically Plausible: the reconstructed 3D scene should be explainable in terms of the physical world
Existing methods are either inefficient or address only part of the problem.
Related Work
- Traditional Methods
- Abhinav Gupta, Martial Hebert, Takeo Kanade, and David M Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Conference on Neural Information Processing Systems (NIPS), 2010
- Yibiao Zhao and Song-Chun Zhu. Image parsing with stochastic scene grammar. In Conference on Neural Information Processing Systems (NIPS), 2011
- Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. Understanding indoor scenes using 3d geometric phrases. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013
- Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In European Conference on Computer Vision (ECCV), 2014
- Hamid Izadinia, Qi Shan, and Steven M Seitz. Im2cad. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
- Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, and Song-Chun Zhu. Holistic 3d scene parsing and reconstruction from a single rgb image. In European Conference on Computer Vision (ECCV), 2018
- Alexander G Schwing, Sanja Fidler, Marc Pollefeys, and Raquel Urtasun. Box in the box: Joint 3d layout and object reasoning from single images. In IEEE International Conference on Computer Vision (ICCV), 2013
- Deep Learning, training individual modules separately
- Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Košecká. 3d bounding box estimation using deep learning and geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
- Chen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. Roomnet: End-to-end room layout estimation. In IEEE International Conference on Computer Vision (ICCV), 2017
- Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
- Abhijit Kundu, Yin Li, and James M Rehg. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
- Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
- Chen Liu, Jimei Yang, Duygu Ceylan, Ersin Yumer, and Yasutaka Furukawa. Planenet: Piece-wise planar reconstruction from a single rgb image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
- Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A Efros, and Jitendra Malik. Factoring shape, pose, and layout from the 2d image of a 3d scene. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 (without modeling relations explicitly)
- Another stream of approaches, taking RGB-D images and camera poses as input
- Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3d object detection with rgbd cameras. In IEEE International Conference on Computer Vision (ICCV), 2013
- Shuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. In European Conference on Computer Vision (ECCV), 2014
- Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
- Zhuo Deng and Longin Jan Latecki. Amodal detection of 3d objects: Inferring 3d bounding boxes from 2d ones in rgb-depth images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
- Chuhang Zou, Zhizhong Li, and Derek Hoiem. Complete 3d scene parsing from single rgbd image. arXiv preprint arXiv:1710.09490, 2017
- Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
- Jean Lahoud and Bernard Ghanem. 2d-driven 3d object detection in rgb-d images. In IEEE International Conference on Computer Vision (ICCV), 2017
- Yinda Zhang, Mingru Bai, Pushmeet Kohli, Shahram Izadi, and Jianxiong Xiao. Deepcontext: Context-encoding neural pathways for 3d holistic scene understanding. In IEEE International Conference on Computer Vision (ICCV), 2017a
Method
- This part describes the $parametrization$ of 3D bounding boxes (quite simple, actually) and the neural networks used for 3D scene understanding:
- $global~geometric~network$ (GGN): 3D room layout, camera pose
- $local~object~network$ (LON): 3D objects
- Parametrization
- 3D objects
- $X^W \in \mathbb{R}^{3 \times 8}$ denotes a 3D object in the world coordinate system; the dimensions: a cuboid has 8 vertices, each a 3D vector
- The center $C^W \in \mathbb{R}^{3}$ is hard to estimate directly (an RGB image carries no depth information), so it is decomposed into:
- the 2D bbox center of the object on the image plane, $C^{I} \in \mathbb{R}^{2}$
- the distance $D$ from the camera center to the center of the 3D object
- the camera intrinsics $K \in \mathbb{R}^{3 \times 3}$
- the camera extrinsics $R(\phi, \psi) \in \mathbb{R}^{3 \times 3}$ and $T \in \mathbb{R}^{3}$, where $\phi$ and $\psi$ are the camera rotation angles (think $roll$ and $pitch$)
- As shown in the figure above, the projection of the 3D object center onto the image plane does not necessarily coincide with the 2D bbox center; this offset is denoted $\delta^I \in \mathbb{R}^{2}$
- Putting these together, $C^W$ can be computed with the formula below (it looks involved, but it is just the standard camera projection model):
- \[C^W = T + DR(\phi, \psi)^{-1} \frac{K^{-1}[C^I + \delta^I, 1]^{T}}{\parallel K^{-1}[C^I + \delta^I, 1]^{T} \parallel_{2}}\]
- When the camera coordinate frame coincides with the world frame, $T$ becomes $\overrightarrow{0}$ (my reading; the paper phrases it as: when the data is captured from a first-person view, $T$ becomes $\overrightarrow{0}$)
- Hence one can write $C^W = p(C^I, \delta^{I}, D, \phi, \psi, K)$, where $p$ is a differentiable $projection~function$
- Since the 3D box center $C^W$ is computed from the 2D object center $C^I$, this helps maintain 2D-3D consistency and reduces the variance of the 3D bbox estimate (the authors' view; not necessarily true). It also bakes in the camera pose, reflecting the $cooperative~promoting$ among components.
- Size $S^W \in \mathbb{R}^{3}$
- Orientation $R(\theta^{W}) \in \mathbb{R}^{3 \times 3}$, where $\theta^{W}$ is the heading angle about the $z$ axis
- Combining these, $X^W = h(C^W, R(\theta^{W}), S^W)$, where $h(\cdot)$ is the function that assembles the bounding box from its parameters (a sketch of $p$ and $h$ follows this list)
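To make the parametrization concrete, here is a minimal NumPy sketch of the projection function $p$ (with $T = \overrightarrow{0}$, per the first-person assumption above) and the box-composition function $h$. The function and variable names are mine, and the corner ordering is an arbitrary choice; this is a sketch of the math, not the authors' code.

```python
import numpy as np

def p(C_I, delta_I, D, R_cam, K):
    """C^W = D * R(phi, psi)^{-1} * normalize(K^{-1} [C^I + delta^I, 1]^T), with T = 0."""
    uv1 = np.append(C_I + delta_I, 1.0)     # homogeneous 2D point on the image plane
    ray = np.linalg.inv(K) @ uv1            # back-projected viewing ray (camera frame)
    ray /= np.linalg.norm(ray)              # unit direction
    return D * np.linalg.inv(R_cam) @ ray   # scale by distance, rotate into world frame

def h(C_W, theta_W, S_W):
    """Compose the 3x8 corner matrix X^W from center, heading angle, and size (np arrays)."""
    dx, dy, dz = S_W / 2.0
    corners = np.array([[sx * dx, sy * dy, sz * dz]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]).T
    c, s = np.cos(theta_W), np.sin(theta_W)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation about z
    return Rz @ corners + C_W[:, None]      # (3, 8)
```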
- 3D Room Layout
- Similar to 3D objects, $X^L \in \mathbb{R}^{3 \times 8}$
- Center $C^L \in \mathbb{R}^{3}$
- Size $S^L \in \mathbb{R}^{3}$
- Orientation $R(\theta^{L}) \in \mathbb{R}^{3 \times 3}$, where $\theta^{L}$ is the rotation angle
- Direct Estimations
- $global~geometric~network$ (GGN)
- Input: an RGB image
- Output: 3D room layout + 3D camera pose
- Both the 3D room layout and 3D camera pose predictions rely on global geometry features
- Loss function: \(\mathcal{L}_{GGN} = \mathcal{L}_{\phi} + \mathcal{L}_{\psi} + \mathcal{L}_{C^L} + \mathcal{L}_{S^L} + \mathcal{L}_{\theta^L}\)
- $local~object~network$ (LON)
- Input: 2D image patches (i.e., the crops given by the 2D bboxes)
- Output: distance $D$ + size $S^W$ + heading angle $\theta^{W}$ + 2D offsets $\delta^{I}$
- Loss function: \(\mathcal{L}_{LON} = \frac{1}{N} \sum_{j=1}^{N} (\mathcal{L}_{D_j} + \mathcal{L}_{\delta^{I}_j} + \mathcal{L}_{S^W_j} + \mathcal{L}_{\theta^W_j})\)
- $N$ is the number of objects in the scene
- Directly regressing an object attribute (e.g., the heading angle) is not a good approach, as it can produce large errors, so an alternative scheme is used (see the sketch after this list):
- predefine several templates (e.g., size templates)
- first classify the attribute (e.g., the heading angle) into one of the templates, then regress the residual within that template
- For an angle attribute such as $\phi$, \(\mathcal{L}_{\phi} = \mathcal{L}_{\phi\text{-}cls} + \mathcal{L}_{\phi\text{-}reg}\)
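A hedged PyTorch sketch of this hybrid classification + regression scheme for an angle attribute. The number of bins, the residual-to-bin-center convention, and the smooth-L1 choice are illustrative assumptions, not details from the paper.

```python
import math
import torch
import torch.nn.functional as F

NUM_BINS = 8                       # illustrative; the paper predefines templates
BIN_SIZE = 2 * math.pi / NUM_BINS

def angle_loss(cls_logits, reg_residuals, gt_angle):
    """cls_logits, reg_residuals: (B, NUM_BINS); gt_angle: (B,) in radians."""
    wrapped = gt_angle % (2 * math.pi)
    gt_bin = (wrapped / BIN_SIZE).long().clamp(max=NUM_BINS - 1)
    gt_res = wrapped - (gt_bin.float() + 0.5) * BIN_SIZE       # offset to bin center
    loss_cls = F.cross_entropy(cls_logits, gt_bin)             # classify into a template
    picked = reg_residuals.gather(1, gt_bin[:, None]).squeeze(1)
    loss_reg = F.smooth_l1_loss(picked, gt_res)                # regress residual in-bin
    return loss_cls + loss_reg                                 # L_cls + L_reg
```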
- Cooperative Estimations
- Psychology experiments suggest:
- Human scene perception relies on global information rather than local details (Oliva, 2005; Oliva and Torralba, 2006): the gist of the scene
- Human performance on a specific task often involves cooperation among multiple visual cues (Landy et al., 1995; Jacobs, 2002): e.g., depth perception
- The cooperative estimation here follows the same principle, letting the modules assist one another to promote the overall task (in practice this is just a weighted sum of the different modules' losses, a common technique that this paper makes a point of emphasizing). A sketch of the three cooperative losses follows this list.
- 3D Bounding Box Loss
- \[\mathcal{L}_{3D} = \frac{1}{N} \sum_{j=1}^{N} \parallel h(C^W_j, R(\theta^W_j), S^W_j) - X^{W*}_j \parallel_2^2\]
- $X^{W*}$ is the ground-truth 3D bbox in the world coordinate system
- 2D Projection Loss
- \[\mathcal{L}_{PROJ} = \frac{1}{N} \sum_{j=1}^{N} \parallel f(X^W_j,R,K) - X^{I*}_j \parallel_2^2\]
- $f(\cdot)$ is a differentiable projection function that projects a 3D bbox to a 2D bbox
- $X^{I*}_j \in \mathbb{R}^{2 \times 4}$ is the ground-truth 2D bbox (the detected 2D bboxes are treated as ground truth)
- Physical Loss
- \[\mathcal{L}_{PHY} = \frac{1}{N} \sum_{j=1}^{N} (ReLU(Max(X_j^W) - Max(X^L)) + ReLU(Min(X^L) - Min(X^W_j)))\]
- $ReLU$ is the activation function; $Max(\cdot)/Min(\cdot)$ take $X^W$ or $X^L$ as input and output the max/min values along the $x, y, z$ axes. The authors argue this ties the 3D layout and the 3D objects together.
- Total Loss
- \[\mathcal{L}_{Total} = \mathcal{L}_{GGN} + \mathcal{L}_{LON} + \lambda_{COOP}(\mathcal{L}_{3D} + \mathcal{L}_{PROJ} + \mathcal{L}_{PHY})\]
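A minimal PyTorch sketch of the three cooperative losses, under the shape assumptions $X^W$: (N, 3, 8) for object corners and $X^L$: (3, 8) for the layout; `f` stands in for the differentiable projection $f(\cdot)$ and is left abstract here. These are my renderings of the formulas above, not the authors' code.

```python
import torch

def loss_3d(X_W, X_W_gt):
    """L_3D: squared corner error, averaged over the N objects."""
    return ((X_W - X_W_gt) ** 2).sum(dim=(1, 2)).mean()

def loss_proj(X_W, R, K, X_I_gt, f):
    """L_PROJ: f projects 3D boxes to 2D boxes; X_I_gt is (N, 2, 4)."""
    return ((f(X_W, R, K) - X_I_gt) ** 2).sum(dim=(1, 2)).mean()

def loss_phy(X_W, X_L):
    """L_PHY: penalize object corners sticking out of the layout box along x, y, z."""
    obj_max, obj_min = X_W.max(dim=2).values, X_W.min(dim=2).values  # (N, 3)
    lay_max, lay_min = X_L.max(dim=1).values, X_L.min(dim=1).values  # (3,)
    violation = torch.relu(obj_max - lay_max) + torch.relu(lay_min - obj_min)
    return violation.sum(dim=1).mean()

# L_Total = L_GGN + L_LON + lambda_coop * (loss_3d(...) + loss_proj(...) + loss_phy(...))
```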
Implementation
- Both GGN and LON use a ResNet-34 backbone that encodes a 256x256 image into a 2048-D vector (the encoder)
- For prediction, two fully connected layers are appended to the encoder, producing an L-dimensional output vector (a sketch follows this list):
- 2048 $\times$ 1024
- 1024 $\times$ L
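A sketch of this prediction head; the note only gives the two layer sizes (2048 x 1024 and 1024 x L), so the ReLU in between is my assumption, and the ResNet-34 encoder itself is omitted.

```python
import torch.nn as nn

def make_head(out_dim):
    """Two FC layers on top of the 2048-D encoder feature."""
    return nn.Sequential(
        nn.Linear(2048, 1024),
        nn.ReLU(inplace=True),         # assumed nonlinearity between the FC layers
        nn.Linear(1024, out_dim),      # L-dimensional prediction vector
    )
```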
- Training proceeds in two steps:
- Fine-tune a 2D detector (Dai et al., 2017; Bodla et al., 2017) on 30 common object categories, then generate 2D bboxes to serve as supervision for the 3D bbox projection
- Train the two 3D estimation networks, GGN and LON: first train each on synthetic datasets (Song et al., 2017; Zhang et al., 2017b) to obtain initial weights, then train them jointly on SUN RGB-D
- Data augmentation:
- randomly flip images
- randomly shift the 2D bboxes together with their corresponding labels during cooperative training (??? first time I have seen this trick; a hedged sketch follows)
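A hedged sketch of what the 2D-bbox shifting might look like; the shift range and the uniform distribution are my guesses, not details from the paper.

```python
import random

def jitter_bbox(x1, y1, x2, y2, max_ratio=0.05):
    """Randomly shift a 2D bbox; the 5%-of-box-size range is illustrative."""
    w, h = x2 - x1, y2 - y1
    dx = random.uniform(-max_ratio, max_ratio) * w
    dy = random.uniform(-max_ratio, max_ratio) * h
    return x1 + dx, y1 + dy, x2 + dx, y2 + dy  # labels stay attached to the box
```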
Experimental Evaluation
- Qualitative Results
- Quantitative Results
- 3D layout estimation
- IoU threshold 15%
- $P_g$: geometric precision
- $R_g$: geometric recall
- $R_r$: semantic recall
- 3D object detection
- 3D box estimation
- camera pose estimation
- Holistic Scene Understanding & Ablation Analysis
- The model trained without supervision on the 3D object bounding box corners (w/o $\mathcal{L}_{3D}$, S1)
- The model trained without the 2D supervision (w/o $\mathcal{L}_{PROJ}$, S2)
- The model trained without the physical-constraint penalty (w/o $\mathcal{L}_{PHY}$, S3)
- The model trained in an unsupervised fashion, using only 2D supervision to estimate the 3D bounding boxes (w/o $\mathcal{L}_{3D} + \mathcal{L}_{GGN} + \mathcal{L}_{LON}$, S4)
- The model trained directly on SUN RGB-D without pre-training (S5)
- The model trained with 2D bounding boxes projected from the ground-truth 3D bounding boxes (S6)