CornerNet - Detecting Objects as Paired Keypoints: Paper Walkthrough

雨点打透心脏的1/2处 2022-02-16 16:16

paper:CornerNet: Detecting Objects as Paired Keypoints  
arXiv:[https://arxiv.org/abs/1808.01244][https_arxiv.org_abs_1808.01244]  
github:[https://github.com/princeton-vl/CornerNet][https_github.com_princeton-vl_CornerNet]  
conf: ECCV 2018 oral [https://www.youtube.com/watch?v=aJnvTT1-spc][https_www.youtube.com_watch_v_aJnvTT1-spc]  
intro: predict top-left and bottom-right point, corner pooling, hourglass network

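**Main idea:**  
In object detection, a bounding box is usually defined by its center (x, y) and its width and height (w, h), and the model regresses these four parameters. CornerNet instead reformulates bounding-box prediction as finding a pair of keypoints, the top-left and bottom-right corners, and detects an object by detecting that pair. Unlike anchor-based detectors such as Faster R-CNN and SSD, CornerNet does not need any anchor boxes to be configured.  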
![network overview][]  
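
To make the two box parameterizations concrete, here is a minimal sketch of converting between the (center, size) form and the corner-pair form. The function names are illustrative and do not come from the CornerNet code base.

```python
def center_to_corners(cx, cy, w, h):
    """Convert a (center, size) box to its top-left and bottom-right corners."""
    half_w, half_h = w / 2.0, h / 2.0
    top_left = (cx - half_w, cy - half_h)
    bottom_right = (cx + half_w, cy + half_h)
    return top_left, bottom_right


def corners_to_center(top_left, bottom_right):
    """Convert a corner pair back to (cx, cy, w, h)."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1


if __name__ == "__main__":
    tl, br = center_to_corners(50, 40, 20, 10)
    print(tl, br)
    print(corners_to_center(tl, br))
```

Both forms carry the same four degrees of freedom; what changes is what the network is asked to localize.
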
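**Backbone:**  
The backbone consists of two stacked [hourglass network][] modules, an architecture first used for human keypoint detection. An hourglass module first shrinks the feature map with convolution and pooling, then restores it to the original size with upsampling and convolution. Because pooling discards fine detail, the hourglass uses skip connections to add low-level detail features onto the upsampled feature map. With this design the hourglass captures global and local information at the same time.  
Before the hourglass modules, CornerNet reduces the image resolution by a factor of 4: a $7 \times 7$ convolution with stride 2 and 128 channels, followed by a residual block with stride 2 and 256 channels. Two hourglass modules follow, and they use strided convolutions instead of pooling to shrink the feature map. Each hourglass module reduces the feature map resolution 5 times while growing the channel count from 256 to 512 (256->384->384->384->512). On the way back up, each upsampling step applies two residual modules followed by nearest-neighbor upsampling. Every skip connection also contains two residual modules.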
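
The resolution schedule described above can be checked with a few lines of arithmetic. This is only a sketch: the 511-pixel input size, the padding values, and the 3x3 kernels for the residual block and the strided downsampling convolutions are assumptions made for illustration, not details stated in this post.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * pad - kernel) // stride + 1


def hourglass_resolutions(input_size=511, num_down=5):
    """Spatial sizes from the hourglass input down to its bottleneck.

    Assumed for illustration: 511-pixel input, padding 3 for the 7x7 conv,
    and 3x3 kernels with padding 1 for every other strided layer.
    """
    size = conv_out(input_size, kernel=7, stride=2, pad=3)  # 7x7 conv, stride 2
    size = conv_out(size, kernel=3, stride=2, pad=1)        # residual block, stride 2
    sizes = [size]
    for _ in range(num_down):                               # 5 strided reductions
        sizes.append(conv_out(sizes[-1], kernel=3, stride=2, pad=1))
    return sizes


if __name__ == "__main__":
    print(hourglass_resolutions())
```

Under these assumptions the stem maps 511 pixels to 128, and the five reductions take the feature map down to 4x4 at the bottleneck before it is upsampled back.
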

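**Detection network:**  
The backbone is followed by the detection network, which has two parts: first a residual block that contains the corner pooling layer, then a convolutional module that splits into three branches predicting the corner heatmaps, embeddings, and offsets.  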
![prediction module][]  
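
One way to see how the three outputs fit together is to decode a single candidate pair into a box. The sketch below is not the official decoding code: the `Corner` container, the stride of 4 (taken from the 4x resolution reduction in the backbone), and the embedding distance threshold of 0.5 are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Corner:
    """Hypothetical container for one corner picked from a heatmap peak."""
    col: int          # heatmap column of the peak
    row: int          # heatmap row of the peak
    dx: float         # predicted sub-cell offset along x
    dy: float         # predicted sub-cell offset along y
    embedding: float  # predicted 1-D embedding
    score: float      # heatmap score
    category: int     # class index of the heatmap channel


def pair_corners(tl: Corner, br: Corner, stride: float = 4.0,
                 max_embed_dist: float = 0.5
                 ) -> Optional[Tuple[float, float, float, float, float]]:
    """Return (x1, y1, x2, y2, score) in image pixels, or None if rejected."""
    if tl.category != br.category:
        return None  # corners of different classes never form a box
    if abs(tl.embedding - br.embedding) > max_embed_dist:
        return None  # embeddings too far apart: different objects
    # The offset refines the integer heatmap cell; the stride maps it to pixels.
    x1, y1 = (tl.col + tl.dx) * stride, (tl.row + tl.dy) * stride
    x2, y2 = (br.col + br.dx) * stride, (br.row + br.dy) * stride
    if x2 <= x1 or y2 <= y1:
        return None  # bottom-right must lie below and right of top-left
    return x1, y1, x2, y2, (tl.score + br.score) / 2.0
```

The heatmap says where corners are, the embedding says which corners go together, and the offset recovers the precision lost to downsampling.
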
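**Corner Pooling:**  
The top-left and bottom-right corners of a bounding box usually have no local visual evidence at their own position in the image that signals a corner, which is different from the box center. To predict a corner, the network therefore has to look along the row and the column that pass through it. For a top-left corner, this means looking right along the point's row to find the top boundary of the box, and looking down along the point's column to find the left boundary.  
Concretely, for each location, take the maximum over all values to its right in the same row and the maximum over all values below it in the same column, then add the row maximum and the column maximum to get the pooled result.  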
![corner pooling][]  
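
The pooling rule above can be written compactly with a reversed running maximum. This is a minimal NumPy sketch rather than the official implementation; it assumes two separate 2-D input feature maps (one scanned rightward, one scanned downward) and that each scan includes the current cell.

```python
import numpy as np


def top_left_corner_pool(f_right, f_down):
    """Top-left corner pooling on two 2-D feature maps of equal shape.

    f_right is scanned along each row (current cell and everything to its
    right); f_down is scanned along each column (current cell and everything
    below it). The two running maxima are summed.
    """
    f_right = np.asarray(f_right, dtype=float)
    f_down = np.asarray(f_down, dtype=float)
    # Flip, take a cumulative max, flip back: a right-to-left running max.
    right_max = np.maximum.accumulate(f_right[:, ::-1], axis=1)[:, ::-1]
    # Same trick along the vertical axis: a bottom-to-top running max.
    down_max = np.maximum.accumulate(f_down[::-1, :], axis=0)[::-1, :]
    return right_max + down_max


if __name__ == "__main__":
    f = np.array([[1, 3, 2],
                  [0, 5, 1],
                  [4, 0, 2]])
    print(top_left_corner_pool(f, f))
```

Bottom-right pooling would be the mirror image: running maxima taken toward the left and upward instead.
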
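**`Implementing corner pooling:`**  
(1) The implementation in the paper's released source code (C++): [https://github.com/princeton-vl/CornerNet/tree/master/models/py\_utils/\_cpools/src][https_github.com_princeton-vl_CornerNet_tree_master_models_py_utils_cpools_src]  
(2) A TensorFlow-based implementation: [https://github.com/tensorlayer/tensorlayer/issues/781][https_github.com_tensorlayer_tensorlayer_issues_781]  
The TensorFlow approach works as follows, taking the "top" part of the top-left corner as an example. First pad the feature map: the right side is padded by the feature map width minus 1, and the left, top, and bottom sides are padded by 0. Then apply max pooling with kernel\_size (feature\_map\_width, 1) and valid padding to the padded feature map. This yields the "top" value of the top-left corner, and the other three values are obtained in the same way.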
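
The padding trick can be reproduced without TensorFlow to see why it works. The sketch below uses NumPy only and is not the code from the linked issue; the function name is illustrative, and padding with negative infinity instead of zeros is an assumption made so the result is also correct for negative activations.

```python
import numpy as np


def right_max_via_padding(feat):
    """Max over the current cell and all cells to its right, per row.

    Pads the right side by (width - 1) and slides a window of length `width`
    along each row with stride 1 and no further padding ("valid").
    """
    feat = np.asarray(feat, dtype=float)
    _, w = feat.shape
    # Pad only on the right; -inf never wins a max (an assumption; zero
    # padding gives the same result when all activations are non-negative).
    padded = np.pad(feat, ((0, 0), (0, w - 1)), constant_values=-np.inf)
    # Window j covers padded columns j .. j + w - 1, i.e. original columns j .. w - 1.
    windows = np.lib.stride_tricks.sliding_window_view(padded, w, axis=1)
    return windows.max(axis=-1)


if __name__ == "__main__":
    f = np.array([[1, 3, 2],
                  [0, 5, 1],
                  [4, 0, 2]])
    print(right_max_via_padding(f))
```

Because the window is as wide as the feature map, window j always reaches the last real column, so each output is the maximum from position j to the end of the row. A reversed cumulative maximum gives the same values with less work; the padding form is convenient when only standard pooling ops are available.
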

**Loss function:**  
CornerNet predicts a class and a location for every corner. To pair up the two corners of the same bounding box, it also predicts a one-dimensional embedding for each corner and uses the distance between embeddings to decide whether two corners belong to the same group. The loss has three parts:  
$$L = L_{det} + \alpha L_{pull} + \beta L_{push} + \gamma L_{off}$$

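*  Detection loss ($L_{det}$): the classification loss, implemented as a focal loss.
*  Embedding loss ($L_{pull}$ and $L_{push}$): the pull loss makes the distance between the two corners of the same bounding box as small as possible, and the push loss makes the distance between corners of different bounding boxes as large as possible.
*  Offset loss ($L_{off}$): the coordinate offset loss, defined on the offset of the predicted top-left or bottom-right corner relative to the grid-cell vertex on the feature map, similar to the bounding-box center offset in YOLO.  
    The weights are $\alpha = 0.1$, $\beta = 0.1$, $\gamma = 1$.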
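
The embedding and offset terms listed above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the training code: the pull and push formulas follow my reading of the paper (squared distance to the pair's mean embedding, and a hinge with margin 1 between the means of different objects), and the function names and the stride of 4 are illustrative.

```python
import numpy as np


def pull_push_loss(tl_embed, br_embed, margin=1.0):
    """Pull/push embedding losses for N objects with 1-D corner embeddings."""
    tl = np.asarray(tl_embed, dtype=float)
    br = np.asarray(br_embed, dtype=float)
    n = tl.shape[0]
    mean = (tl + br) / 2.0                      # per-object mean embedding
    # Pull: both corners of an object move toward their shared mean.
    pull = np.mean((tl - mean) ** 2 + (br - mean) ** 2)
    if n < 2:
        return pull, 0.0
    # Push: means of different objects should be at least `margin` apart.
    dist = np.abs(mean[:, None] - mean[None, :])
    hinge = np.maximum(0.0, margin - dist)
    np.fill_diagonal(hinge, 0.0)                # skip the j == k terms
    push = hinge.sum() / (n * (n - 1))
    return pull, push


def corner_offset(x, y, stride=4):
    """Fractional part lost when a pixel coordinate is mapped to the heatmap."""
    return x / stride - x // stride, y / stride - y // stride


def total_loss(det, pull, push, off, alpha=0.1, beta=0.1, gamma=1.0):
    """Weighted sum with the weights quoted in the text."""
    return det + alpha * pull + beta * push + gamma * off


if __name__ == "__main__":
    pull, push = pull_push_loss([0.0, 1.0], [1.0, 1.0])
    print(pull, push, corner_offset(10, 7))
    print(total_loss(1.0, pull, push, 0.2))
```

Only the distances between embeddings matter, never their absolute values, which is why a single scalar per corner is enough for grouping.
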

**Model performance:**  
COCO AP: 42.4%  
Inference time: 244 ms per image (roughly 4 FPS)

[https_arxiv.org_abs_1808.01244]: https://arxiv.org/abs/1808.01244
[https_github.com_princeton-vl_CornerNet]: https://github.com/princeton-vl/CornerNet
[https_www.youtube.com_watch_v_aJnvTT1-spc]: https://www.youtube.com/watch?v=aJnvTT1-spc
[network overview]: /images/20220216/e89ff1c705b44da785292118140605b4.png
[hourglass network]: https://arxiv.org/abs/1603.06937
[prediction module]: /images/20220216/3bbc5267f3614b208e1050b2b9edbb9d.png
[corner pooling]: /images/20220216/93773cd499c143e6bd867e2d9d266c5b.png
[https_github.com_princeton-vl_CornerNet_tree_master_models_py_utils_cpools_src]: https://github.com/princeton-vl/CornerNet/tree/master/models/py_utils/_cpools/src
[https_github.com_tensorlayer_tensorlayer_issues_781]: https://github.com/tensorlayer/tensorlayer/issues/781
