[paper reading] Faster RCNN
GitHub:Notes of Classic Detection Papers
Problem to Solve Proposal with CNN | Architecture Region Proposal Network RoI Pooling Anchor | anchor/proposals/bbox Pyramids Positive & Negative Label Sampling Strategy 4-Step Alternating Training | Architecture Region Proposal Network RoI Pooling Anchor | Loss Function Coordinates Parametrization |
Problem to Solve
The performance bottleneck of Fast R-CNN and SPPnet lies in region proposals: computing them accounts for a large share of the running time.
Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck
Proposal with CNN
Faster R-CNN = Fast R-CNN + RPN, i.e. the proposals are computed directly by a CNN
- Fast R-CNN serves as the detection network
Conv layers
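- Input: image
- Output: shared feature map (shared by the subsequent RPN and detection network)
The convolutional layers use same-padding, so the spatial resolution is unchanged across each convolution; downsampling happens only in the pooling layers (each one scales height and width by $\frac12$) ==> every point of the feature map corresponds to one grid cell of the original image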
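The mapping from a feature-map point to an image grid cell can be sketched as follows. This is a minimal illustration that assumes a VGG-style backbone with four stride-2 pooling layers before the shared feature map (total stride 16); the function names are hypothetical and not taken from any implementation.

```python
from typing import Tuple


def total_stride(num_pool_layers: int, pool_stride: int = 2) -> int:
    """Overall downsampling factor when only the pooling layers reduce resolution."""
    return pool_stride ** num_pool_layers


def feature_to_image_cell(row: int, col: int, stride: int) -> Tuple[int, int, int, int]:
    """Return (x1, y1, x2, y2) of the image grid cell behind feature location (row, col)."""
    x1, y1 = col * stride, row * stride
    return x1, y1, x1 + stride, y1 + stride


# Assumption: 4 pooling layers, as in VGG-16 up to conv5_3.
stride = total_stride(num_pool_layers=4)
print(stride, feature_to_image_cell(row=2, col=3, stride=stride))
```

With these assumptions each feature-map point stands for a 16×16 cell of the input image.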
[Region Proposal Network](#Region Proposal Network)
- Input: shared feature map
- Output: proposals and regression loss
The RPN generates region proposals
- It decides whether each anchor is positive or negative
- It refines the anchors with bounding box regression to obtain accurate proposals
[Roi Pooling](#Roi Pooling)
- Input: shared feature maps and proposals
- Output: proposal feature maps
- Input: proposal feature maps
- Output: bounding box and score
Region Proposal Network
Convolutional Feature Sharing
By sharing full-image convolutional features with the detection network (i.e. sharing the convolutional layers), the Region Proposal Network makes region proposals nearly cost-free
In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals
“Attention” Mechanism
The RPN tells the network where to look
We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.
The RPN consists of a few convolutional layers; in essence it is an FCN (fully convolutional network) that generates proposals
we construct an RPN by adding a few additional convolutional layers
The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task for generating detection proposals.
This architecture is naturally implemented with a n×n convolutional layer followed by two sibling 1×1 convolutional layers (for reg and cls, respectively).
$n \times n$ convolutional layer
Slide a $3 \times 3$ window over the feature map output by the shared convolutional layers and map each window to a lower-dimensional feature.
256-d for ZF and 512-d for VGG, with ReLU following
Even with a small $n$, the window still corresponds to a large receptive field on the original image
We use n= 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively).
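The 228-pixel figure for VGG can be reproduced with the usual receptive-field recurrence. The sketch below assumes the standard VGG-16 layout up to conv5_3 (thirteen 3×3 convolutions and four 2×2 stride-2 poolings), which the notes do not spell out; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride). Returns (receptive_field, total_stride)."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf, jump


CONV, POOL = (3, 1), (2, 2)
# Assumed VGG-16 layout up to conv5_3: blocks of 2, 2, 3, 3, 3 convs, pooling between blocks.
vgg16_to_conv5_3 = (
    [CONV] * 2 + [POOL] + [CONV] * 2 + [POOL] + [CONV] * 3 + [POOL]
    + [CONV] * 3 + [POOL] + [CONV] * 3
)
rpn_window = [CONV]  # the n = 3 sliding window of the RPN

print(receptive_field(vgg16_to_conv5_3))               # (196, 16)
print(receptive_field(vgg16_to_conv5_3 + rpn_window))  # (228, 16)
```

The 3×3 window adds (3 - 1) × 16 = 32 pixels to the 196-pixel receptive field of conv5_3, which gives 228.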
box-classification layer & box-regression layer
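Two sibling 1x1 Conv branches output the objectness scores and the rectangular object proposals, respectively
This shows that a 1x1 Conv can also act as a classifier, and that convolutional layers, like FC layers, can behave as generators
- Upper branch: classifies the anchors with softmax into positive and negative
- Lower branch: predicts the offsets of the anchors relative to the ground-truth boxes, to obtain accurate proposals
- Proposal layer: positive anchors + bbox regression offsets ==> proposals (proposals that are too small or fall outside the image boundary are discarded here)
Because the RPN operates in a sliding-window fashion, each of the two 1x1 Conv branches shares its weights across all spatial locations.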
Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations
By the time the network reaches the Proposal layer, it has effectively completed object localization
RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios.
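To make the head structure concrete, here is a rough shape and parameter-count sketch. It assumes k = 9 anchors per location, a 512-d intermediate feature (the VGG setting quoted above), and a two-way softmax for objectness (hence 2k scores); the function names are made up for illustration.

```python
def rpn_head_shapes(feat_h, feat_w, k=9, mid_channels=512):
    """Output shapes (H, W, C) of the n x n conv and of the two sibling 1x1 convs."""
    return {
        "intermediate": (feat_h, feat_w, mid_channels),  # after the 3x3 conv + ReLU
        "cls_scores": (feat_h, feat_w, 2 * k),           # positive/negative score per anchor
        "bbox_deltas": (feat_h, feat_w, 4 * k),          # (tx, ty, tw, th) per anchor
    }


def rpn_output_layer_weights(k=9, mid_channels=512):
    """Weights (biases ignored) of the two 1x1 sibling layers together."""
    return mid_channels * (2 + 4) * k


print(rpn_head_shapes(38, 50))
print(rpn_output_layer_weights())
```

Because the sibling layers are 1x1 convolutions, their weight count does not depend on the feature-map size.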
Input & Output
shared feature map
positive + anchor offset = proposals
- box-classification layer ==> objectness scores (positive/negative)
- box-regression layer ==> rectangle object proposals
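A hedged sketch of the post-processing that turns scored boxes into proposals. Several choices here are assumptions rather than the paper's exact procedure: boxes crossing the image boundary are clipped instead of dropped, `min_size` and `top_n` are illustrative values, and non-maximum suppression is left out for brevity.

```python
def clip_box(box, img_w, img_h):
    """Clamp (x1, y1, x2, y2) to the image rectangle."""
    x1, y1, x2, y2 = box
    return (
        min(max(x1, 0.0), img_w),
        min(max(y1, 0.0), img_h),
        min(max(x2, 0.0), img_w),
        min(max(y2, 0.0), img_h),
    )


def filter_proposals(boxes, scores, img_w, img_h, min_size=16.0, top_n=300):
    """Clip boxes, drop the ones that are too small, keep the top_n by objectness score."""
    kept = []
    for box, score in zip(boxes, scores):
        x1, y1, x2, y2 = clip_box(box, img_w, img_h)
        if x2 - x1 >= min_size and y2 - y1 >= min_size:
            kept.append((score, (x1, y1, x2, y2)))
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:top_n]


boxes = [(-10, -10, 50, 50), (100, 100, 105, 105), (700, 500, 900, 700)]
scores = [0.9, 0.8, 0.95]
for score, box in filter_proposals(boxes, scores, img_w=800, img_h=600):
    print(score, box)
```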
RoI Pooling
RoI Pooling is essentially a resampling of the pixels/features inside each proposal region
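The resampling idea can be illustrated on a single-channel feature map: whatever the size of the RoI, it is split into a fixed grid of bins and each bin is max-pooled. This is a simplified sketch; the bin boundaries (floor/ceil, possibly overlapping) are an assumption and real implementations may quantize differently.

```python
import math


def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """feature: 2D list (H x W); roi: (x1, y1, x2, y2), inclusive, in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [r0, r1) of the feature map.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
print(roi_max_pool(feature, (1, 0, 3, 3)))  # 3 x 4 RoI -> 2 x 2 output
print(roi_max_pool(feature, (0, 0, 4, 3)))  # 5 x 4 RoI -> 2 x 2 output
```

Both RoIs have different sizes but produce outputs of the same fixed size, which is what lets the following fully-connected layers accept them.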
Idea & Purpose
That is, each location of the feature map has $3 \times 3 = 9$ anchors
Idea & Purpose
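Pyramids of reference boxes (predefined proposals)
Detect objects of multiple scales and aspect ratios while keeping the image and the filters at a single scale
At every sliding-window location of the feature map output by the shared convolutional layers (not of the image), place 9 anchors, corresponding to 3 scales and 3 aspect ratios, to handle multiple scales and multiple ratios
A small sliding window still corresponds to a large receptive field on the original image
For how anchors are kept or discarded during training, see [Sampling Strategy](#Sampling Strategy)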
$$\lceil 800/16 \rceil \times \lceil 600/16 \rceil \times 9 = 50 \times 38 \times 9 = 17100$$
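The anchor count can be checked by enumerating the anchors. The sketch assumes the paper's default scales (128, 256, 512) and aspect ratios (1:2, 1:1, 2:1), a stride of 16, and a feature-map size rounded up, which is the rounding that yields 50 × 38 locations and hence 17100 anchors for an 800×600 input. Centering each anchor on the middle of its grid cell is an illustrative choice.

```python
import math


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the (w, h) of each anchor shape; ratio = h / w and area = scale ** 2."""
    shapes = []
    for scale in scales:
        for ratio in ratios:
            shapes.append((scale / math.sqrt(ratio), scale * math.sqrt(ratio)))
    return shapes


def all_anchors(img_w, img_h, stride=16):
    """Place every base anchor at every feature-map location, as (x1, y1, x2, y2)."""
    feat_w, feat_h = math.ceil(img_w / stride), math.ceil(img_h / stride)
    boxes = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride  # assumed cell center
            for w, h in base_anchors():
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


print(len(base_anchors()), len(all_anchors(800, 600)))
```

Many of these anchors cross the image boundary, which is one reason anchors are filtered before training and proposals are post-processed.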
Key Element
=predefined=> anchor =refined by RPN=> proposals =post-processing=> bounding box
To handle multiple scales and multiple aspect ratios, there are 3 kinds of pyramids
Pyramids of images and feature maps
Use images at different scales (resize) to obtain feature maps at different scales
Pyramids of filters
Use filters of different sizes to obtain different receptive fields, and thus multi-scale feature maps
Pyramids of reference boxes
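Use anchors of different scales and aspect ratios as references, so that objects of different scales and ratios are covered on a single-scale feature map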
Positive & Negative Sample
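The positive/negative label indicates whether an anchor contains an object. An anchor is labeled positive if it satisfies either condition:
- ① the anchor has IoU > 0.7 with any ground-truth box
- ② the anchor has the highest IoU with some ground-truth box (even if that IoU is < 0.7)
In general ① is enough to determine the positive samples; ② handles the rare case in which ① finds no positive sample. (The paper lists ① and ② in the opposite order.)
One ground-truth box may correspond to several positive anchors ==> Faster R-CNN does not select among these boxes but keeps all of them (there is no "responsible" mechanism as in YOLO)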
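The labeling rules can be sketched as below. The 0.7 threshold comes from the notes above; the 0.3 negative threshold and the "ignored" label for anchors in between come from the paper rather than from these notes. Function names and the label encoding (1, 0, -1) are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    if not gt_boxes:
        return [0] * len(anchors)
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        labels.append(1 if best > pos_thresh else 0 if best < neg_thresh else -1)
    # Rule 2: the anchor(s) with the highest IoU for each ground-truth box are positive too.
    for g in range(len(gt_boxes)):
        column = [row[g] for row in ious]
        best = max(column)
        if best > 0:
            for i, value in enumerate(column):
                if value == best:
                    labels[i] = 1
    return labels


gt = [(0, 0, 100, 100)]
print(label_anchors([(0, 0, 100, 100), (0, 0, 100, 50), (200, 200, 300, 300)], gt))
print(label_anchors([(0, 0, 100, 60), (200, 200, 300, 300)], gt))
```

In the second call no anchor exceeds 0.7, so the best-matching anchor is promoted to positive by the highest-IoU rule.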
Sampling Strategy
4-Step Alternating Training
Alternate between fine-tuning for region proposal and fine-tuning for object detection
Initialize from an ImageNet-pre-trained model and fine-tune end-to-end for the region proposal task
This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task.
Train Fast R-CNN (the detection network) using the proposals from step-1
In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN.
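Note: in step-1 and step-2, the RPN and Fast R-CNN do not yet share convolutional layers
Fix the layers shared by the RPN and Fast R-CNN, and train only the layers unique to the RPN
Note: RPN training must be initialized from the Fast R-CNN detector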
In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN.
Note: from step-3 on, the RPN and Fast R-CNN share convolutional layers
Keep the shared convolutional layers fixed and fine-tune only the layers unique to Fast R-CNN
Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN.
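The four steps can be summarized as a small schedule table. This is only a bookkeeping sketch, not training code: the `Step` class and its field names are hypothetical, the ImageNet initialization of step 2 is taken from the paper rather than from the passages quoted above, and the initialization of step 4 is left as `None` because the quoted passages do not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    trains: str               # network whose weights are updated in this step
    init_from: Optional[str]  # weight initialization; None = not stated in the quoted text
    shared_conv_frozen: bool  # True once the shared conv layers are kept fixed


SCHEDULE = (
    Step(trains="RPN", init_from="ImageNet", shared_conv_frozen=False),
    Step(trains="Fast R-CNN", init_from="ImageNet", shared_conv_frozen=False),
    Step(trains="RPN", init_from="step-2 Fast R-CNN", shared_conv_frozen=True),
    Step(trains="Fast R-CNN", init_from=None, shared_conv_frozen=True),
)


def shares_conv_layers(step_number: int) -> bool:
    """The two networks share conv layers only from step 3 on (1-based step number)."""
    return SCHEDULE[step_number - 1].shared_conv_frozen


for number, step in enumerate(SCHEDULE, start=1):
    print(number, step.trains, step.init_from, shares_conv_layers(number))
```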
Loss Function
Multi-Task Loss of RPN (stage-1)
At each location on a regular grid, the RPN simultaneously regresses region bounds and objectness scores
simultaneously regress region bounds and objectness scores at each location on a regular grid.
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
- $N_{cls}$, $N_{reg}$: normalization terms
- $\lambda$: balancing weight, 10 by default, which makes the two terms roughly equally weighted
classification loss ==> a binary classification problem (positive vs. negative)
$$\sum_i L_{cls}(p_i, p_i^*)$$
- $i$: index of an anchor in the mini-batch
- $p_i$: predicted probability that anchor $i$ contains an object
- $p_i^*$: ground-truth label, 1 for a positive sample and 0 for a negative sample
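Object detection is in effect an exhaustive ("saturation") search: it generates far more negative samples than positive ones, so the binary classification loss suffers from extreme class imbalance (for a detailed description of the problem see [YOLO v1](./[paper reading] YOLO v1.md))
Faster R-CNN does not address this problem in the loss itself; the later YOLO made some improvements, and Focal Loss was designed mainly to solve it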
regression loss
$$\sum_i p_i^* L_{reg}(t_i, t_i^*)$$
- $t_i$: the 4 parameterized coordinates of the predicted bounding box (see [Coordinates Parametrization](#Coordinates Parametrization))
- $t_i^*$: the 4 parameterized coordinates of the ground-truth box
- $L_{reg}(t_i, t_i^*)$: in effect a smooth $L_1$ loss
$$\begin{aligned} L_{reg}(t_i, t_i^*) &= \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_{i,j} - t_{i,j}^*) \\ \mathrm{smooth}_{L_1}(x) &= \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \end{aligned}$$
- $p_i^*$: indicates that the regression loss is computed only for positive samples
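A plain-Python sketch of the multi-task loss may help tie the terms together. It assumes $L_{cls}$ is the binary log loss and takes $N_{cls}$, $N_{reg}$ as explicit arguments; the argument names and the small `eps` guard against `log(0)` are illustrative additions.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def rpn_loss(probs, labels, deltas, targets, n_cls, n_reg, lam=10.0, eps=1e-12):
    """probs[i] = p_i, labels[i] = p_i* in {0, 1}, deltas[i] = t_i, targets[i] = t_i*."""
    cls_sum, reg_sum = 0.0, 0.0
    for p, p_star, t, t_star in zip(probs, labels, deltas, targets):
        cls_sum += -(p_star * math.log(p + eps) + (1 - p_star) * math.log(1 - p + eps))
        # Multiplying by p_star switches the regression term off for negative anchors.
        reg_sum += p_star * sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls_sum / n_cls + lam * reg_sum / n_reg


# One positive and one negative anchor; only the positive one contributes to regression.
loss = rpn_loss(
    probs=[0.9, 0.2],
    labels=[1, 0],
    deltas=[(0.1, 0.0, 0.0, 2.0), (5.0, 5.0, 5.0, 5.0)],
    targets=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    n_cls=2,
    n_reg=1,
)
print(loss)
```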
Loss Function of Classification(stage-2)
Coordinates Parametrization
$$\begin{aligned} t_x &= (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a \\ t_w &= \log(w/w_a), \quad t_h = \log(h/h_a) \\ t_x^* &= (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a \\ t_w^* &= \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a) \end{aligned}$$
- $x, y$: center coordinates of the box
- $w, h$: width & height
- $x$, $x_a$, $x^*$: the predicted box, the anchor box, and the ground-truth box, respectively (likewise for $y, w, h$)
The center coordinates $x, y$ are parameterized as a translation, and the width and height $w, h$ as a scaling
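As a worked example of this parametrization, the sketch below encodes a box relative to an anchor and decodes it back. Boxes are given as corner coordinates (x1, y1, x2, y2) and converted to center form first; the helper names `to_center`, `encode`, and `decode` are illustrative.

```python
import math


def to_center(box):
    """(x1, y1, x2, y2) -> (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def encode(box, anchor):
    """Parameterize a box relative to an anchor: translation for x, y; log-scale for w, h."""
    x, y, w, h = to_center(box)
    xa, ya, wa, ha = to_center(anchor)
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(t, anchor):
    """Invert encode(): apply (tx, ty, tw, th) to an anchor to recover the box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = to_center(anchor)
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


anchor = (0, 0, 100, 100)
gt_box = (10, 20, 110, 220)
t_star = encode(gt_box, anchor)
print(t_star)                  # shift of 0.1 and 0.7 anchor sizes; height doubled (log 2)
print(decode(t_star, anchor))  # recovers gt_box up to floating-point error
```

Dividing the offsets by the anchor size and taking logs of the size ratios keeps the regression targets on a similar scale for small and large anchors.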
Use Yourself
For models that are not tightly integrated, there is usually a separate component for each function (e.g. in R-CNN and Fast R-CNN, selective search produces the proposals)
[Region Proposal Network](#Region Proposal Network)
==> the generator nature of neural networks
Structurally, the RPN is just a small FCN-like network whose two 1x1 Conv branches handle bbox regression and the objectness score, respectively
[RoI Pooling](#RoI Pooling)
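RoI Pooling is essentially a feature downsampling operation, which is why it is called "Pooling".
RoI Pooling solves the problem of downsampling features of variable size
Looking further back, the feature map fed to RoI Pooling is itself the output of the convolutional layers, so RoI Pooling in effect extracts the convolutional features of the earlier layers and is therefore tied to them as well
In other words, anchors are the cornerstone of anchor-based detectors
R-CNN essentially only plays the role of a "classifier": it classifies the proposals produced by Selective Search.
It does not predict bounding boxes; it only refines them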
The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background.
R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression).
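==> explicit grid vs. implicit grid on the image
Whereas YOLO explicitly divides the original image into a grid, Faster R-CNN implicitly divides it into a grid by placing anchors at every point of the shared feature map (the cell size is determined by the downsampling factor)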
