1. Introduction

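Conventional convolutional networks are inherently limited in modeling geometric transformations, because the sampling locations inside a convolution kernel are fixed. The natural idea is to let those sampling locations move, which increases the representational power of the CNN. To this end the paper proposes deformable convolution and deformable ROI pooling. Both rest on the same idea: augment the spatial sampling locations in the module with additional offsets, and learn the offsets from the target task without extra supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be trained end to end with standard backpropagation, which yields deformable convolutional networks.
In existing segmentation, classification, and detection networks, the ability to handle geometric transformations comes mostly from data augmentation, which does not capture geometric variation in images well. The paper therefore proposes the Deformable Convolutional Network, illustrated in the figure below: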
[Figure: sampling locations of (a) a standard convolution and of deformable convolutions]
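In the figure above, (a) is a standard convolution and the remaining panels are deformed convolutions. In the network, the offsets are learned from the preceding feature map by an additional convolutional layer. The deformation is therefore conditioned on the input features in a local, dense, and adaptive manner. This is how deformable convolution achieves complex spatial transformations end to end, simply and efficiently.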

2. Deformable Convolution

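A 2D convolution consists of two steps: 1) sampling the input feature map $x$ on a regular grid $R$; 2) summing the sampled values weighted by $w$. The grid $R$ defines the receptive field size and dilation. For example,

$$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

defines a $3 \times 3$ kernel with dilation 1. Entries such as $(-1,-1)$ are coordinates relative to the kernel center. For each location $p_0$ on the output feature map $y$, we have

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \tag{1}$$

where $p_n$ enumerates the locations (coordinates) in $R$; this is the ordinary convolution. In deformable convolution, the regular grid $R$ is augmented with offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, where $N = |R|$ is the number of elements in the kernel. Equation (1) becomes

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{2}$$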
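In other words, an offset is added on top of the ordinary convolution, so sampling takes place at the irregular, shifted locations $p_n + \Delta p_n$. Because the offset $\Delta p_n$ is typically fractional, the value used at a shifted location has to be interpolated. The paper uses bilinear interpolation:

$$x(p) = \sum_{q} G(q, p) \cdot x(q) \tag{3}$$

where $p$ denotes an arbitrary (fractional) location ($p = p_0 + p_n + \Delta p_n$ in Equation (2)), $q$ enumerates all integral spatial locations in the feature map $x$, and $G(\cdot, \cdot)$ is the bilinear interpolation kernel. Note that $G$ is two-dimensional. It separates into two one-dimensional kernels:

$$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y) \tag{4}$$

where $g(a, b) = \max(0, 1 - |a - b|)$. Equation (3) is fast to compute because $G(q, p)$ is nonzero only for a few $q$, namely the at most four integer neighbors of $p$.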
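The interpolation in Equations (3) and (4) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the repository's CUDA kernel: the names `g` and `bilinear_sample`, the (row, column) coordinate order, and the choice to treat out-of-range locations as zero are assumptions made for the example.

```python
import math

def g(a, b):
    """1-D bilinear kernel from Equation (4): max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def bilinear_sample(x, py, px):
    """Evaluate Equation (3) at the fractional location p = (py, px).

    x is a 2-D feature map given as a list of rows. G(q, p) is nonzero only
    for the (up to) four integer neighbours of p, so only those are visited.
    Assumption of this sketch: locations outside the map contribute zero.
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                value += g(qy, py) * g(qx, px) * x[qy][qx]
    return value

feature = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(feature, 0.5, 0.5)  # equal weight on all four corners
corner = bilinear_sample(feature, 1.0, 1.0)  # integer location: the stored value
```

At an integer location only one neighbour has a nonzero weight, so the sample reduces to the stored value and Equation (2) with zero offsets reduces to Equation (1).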
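The figure below illustrates deformable convolution. The offsets are obtained by applying a convolutional layer to the same input feature map. That convolution kernel has the same spatial resolution and dilation as the current convolutional layer. The output offset field has the same spatial resolution as the input feature map, and its channel dimension $2N$ corresponds to $N$ 2D offsets. During training, the kernels that generate the output features and the kernels that generate the offsets are learned simultaneously. To learn the offsets, gradients are backpropagated through the bilinear operations in Equations (3) and (4).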
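To make the backpropagation statement concrete, the gradient with respect to one offset can be written out from Equations (2) to (4) as defined in this section. Only the $n$-th term of the sum in Equation (2) depends on $\Delta p_n$, so for the single-channel form used here:

$$\frac{\partial y(p_0)}{\partial \Delta p_n} = w(p_n) \cdot \sum_{q} \frac{\partial G(q,\, p_0 + p_n + \Delta p_n)}{\partial \Delta p_n} \cdot x(q)$$

The term $\partial G / \partial \Delta p_n$ is a 2D vector. Its $x$ component follows from Equation (4) as $\frac{\partial g(q_x, p_x)}{\partial p_x} \cdot g(q_y, p_y)$, where $\frac{\partial g(a, b)}{\partial b} = \operatorname{sign}(a - b)$ for $0 < |a - b| < 1$ and $0$ for $|a - b| > 1$; the $y$ component is analogous. As in the forward pass, only the integer neighbors of the sampling location contribute.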
[Figure: illustration of deformable convolution]
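Putting the pieces together, Equation (2) for one output location of a single-channel feature map might be sketched as below. The function names, the row-major ordering of the grid $R$, the (row, column) coordinate order, and the zero treatment of out-of-range samples are assumptions of this sketch; a real layer also handles multiple channels, groups, and batches, and predicts the offsets with a convolution instead of taking them as arguments.

```python
import math

def bilinear_sample(x, py, px):
    """Equation (3): bilinear interpolation; out-of-range neighbours count as 0."""
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < len(x) and 0 <= qx < len(x[0]):
                weight = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += weight * x[qy][qx]
    return value

def regular_grid(kernel_size=3, dilation=1):
    """Grid R: positions relative to the kernel centre, in row-major order."""
    r = kernel_size // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def deformable_conv_at(x, weights, offsets, p0, kernel_size=3, dilation=1):
    """Equation (2) at output location p0 = (row, col).

    weights: N scalars w(p_n); offsets: N pairs (dy, dx), one per grid point.
    """
    total = 0.0
    grid = regular_grid(kernel_size, dilation)
    for w_n, (gy, gx), (oy, ox) in zip(weights, grid, offsets):
        total += w_n * bilinear_sample(x, p0[0] + gy + oy, p0[1] + gx + ox)
    return total

# 5x5 map whose value grows linearly with position: x[i][j] = 5*i + j
fmap = [[float(5 * i + j) for j in range(5)] for i in range(5)]
mean_kernel = [1.0 / 9] * 9

# Zero offsets reproduce the ordinary convolution of Equation (1).
plain = deformable_conv_at(fmap, mean_kernel, [(0.0, 0.0)] * 9, (2, 2))
# Shifting every sampling point half a pixel to the right.
shifted = deformable_conv_at(fmap, mean_kernel, [(0.0, 0.5)] * 9, (2, 2))
```

With a mean kernel on this linear map, the unshifted result equals the centre value and the half-pixel shift raises it by 0.5, which is a quick way to check that fractional offsets are interpolated as expected.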
Deformable ROI Pooling is built on a similar idea and is not covered in detail here.

3. Using Deformable Convolution in Caffe

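A Caffe implementation of DCN is already available in the Git repository Deformable-ConvNets-caffe, so all that is needed is to add the corresponding layer to Caffe.
All required files are in the deformable_conv_cxx/ folder. Copy its .cu and .cpp files to caffe/src/caffe/layers, and its .h header files to caffe/include/caffe/layers. Then edit caffe.proto and add the following at the appropriate places: the optional field belongs inside message LayerParameter (its ID must not collide with an existing field), and the message definition goes at the top level of the file.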

optional DeformableConvolutionParameter deformable_convolution_param = 999;
message DeformableConvolutionParameter {
  optional uint32 num_output = 1; 
  optional bool bias_term = 2 [default = true]; 
  repeated uint32 pad = 3; // The padding size; defaults to 0
  repeated uint32 kernel_size = 4; // The kernel size
  repeated uint32 stride = 6; // The stride; defaults to 1
  repeated uint32 dilation = 18; // The dilation; defaults to 1
  optional uint32 pad_h = 9 [default = 0]; // The padding height (2D only)
  optional uint32 pad_w = 10 [default = 0]; // The padding width (2D only)
  optional uint32 kernel_h = 11; // The kernel height (2D only)
  optional uint32 kernel_w = 12; // The kernel width (2D only)
  optional uint32 stride_h = 13; // The stride height (2D only)
  optional uint32 stride_w = 14; // The stride width (2D only)
  optional uint32 group = 5 [default = 4]; 
  optional uint32 deformable_group = 25 [default = 4]; 
  optional FillerParameter weight_filler = 7; // The filler for the weight
  optional FillerParameter bias_filler = 8; // The filler for the bias
  enum Engine {
    DEFAULT = 0;
    CAFFE = 1;
    CUDNN = 2;
  }
  optional Engine engine = 15 [default = DEFAULT];
  optional int32 axis = 16 [default = 1];
  optional bool force_nd_im2col = 17 [default = false];
}

The Deformable Convolution layer expects inputs with the following shapes:

bottom[0] (data): (batch_size, channel, height, width)
bottom[1] (offset): (batch_size, deformable_group * kernel[0] * kernel[1] * 2, height, width)

The spatial size of the layer output is:

out_height=f(height, kernel[0], pad[0], stride[0], dilate[0])
out_width=f(width, kernel[1], pad[1], stride[1], dilate[1])

where the function $f$ is defined as:

f(x,k,p,s,d) = floor((x+2*p-d*(k-1)-1)/s)+1
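The two shape rules above are easy to check numerically before writing the prototxt. The sketch below is plain Python, not part of Caffe; the helper names and the 56-pixel input size are hypothetical and chosen only for illustration.

```python
def conv_output_size(x, k, p, s, d):
    """f(x, k, p, s, d) = floor((x + 2*p - d*(k - 1) - 1) / s) + 1."""
    return (x + 2 * p - d * (k - 1) - 1) // s + 1

def offset_channels(deformable_group, kernel_h, kernel_w):
    """Channel count required for bottom[1]: two values (dy, dx) per kernel element."""
    return deformable_group * kernel_h * kernel_w * 2

# 3x3 kernel, pad 2, stride 1, dilation 2 on a hypothetical 56-pixel-wide input
out_size = conv_output_size(56, k=3, p=2, s=1, d=2)
# deformable_group = 4 with a 3x3 kernel
n_offset = offset_channels(4, 3, 3)
```

With these settings the spatial size is preserved, and the offset blob needs 72 channels, which is the value an offset layer for a 3x3 kernel with deformable_group 4 has to produce.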

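Use an ordinary convolution such as the following as the offset layer. Its num_output must equal deformable_group * kernel[0] * kernel[1] * 2 (here 4 * 3 * 3 * 2 = 72), and its kernel size, stride, padding, and dilation match those of the deformable layer so that the offset map has the required spatial size: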

layer {
  name: "offset"
  type: "Convolution"
  bottom: "pool1"
  top: "offset"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 72
    kernel_size: 3
    stride: 1
    dilation: 2
    pad: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}

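The Deformable Convolution layer takes the data blob as bottom[0] and the offset blob as bottom[1]. Because the offsets are computed from the same input feature map that is being convolved, the offset layer and this layer would normally read the same data blob: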

layer {
  name: "dec"
  type: "DeformableConvolution"
  bottom: "conv1"
  bottom: "offset"
  top: "dec"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  deformable_convolution_param {
    num_output: 512
    kernel_size: 3
    stride: 1
    pad: 2
    engine: 1
    dilation: 2
    deformable_group: 4
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}

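In use, the network structure should look like this:
[Figure: network structure with the offset layer feeding the Deformable Convolution layer]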
Troubleshooting
After adding the Deformable Convolution layer, the following error may appear:

Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: DeformableConvolution

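This means the layer type cannot be found. The cause is that the layer is not registered in its .cpp file, so the corresponding code there needs to be changed to: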

#ifdef CPU_ONLY
STUB_GPU(DeformableConvolutionLayer);
#endif
INSTANTIATE_CLASS(DeformableConvolutionLayer);
REGISTER_LAYER_CLASS(DeformableConvolution); // add register