Smart Object Detection 17: Building a RetinaNet Object Detection Platform with Keras

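Preface
What is the RetinaNet object detection algorithm
Source code download
RetinaNet implementation approach
- I. Prediction
  - 1. Backbone network
  - 2. Obtaining predictions from features
  - 3. Decoding the predictions
  - 4. Drawing on the original image
- II. Training
  - 1. Processing the ground-truth boxes
  - 2. Computing the loss from the processed ground-truth boxes and the predictions for the image
    - a. Controlling the weights of positive and negative samples
    - b. Controlling the weights of easy and hard samples
    - c. Combining the two weighting schemes
Training your own RetinaNet model
- I. Preparing the dataset
- II. Processing the dataset
- III. Starting training
- IV. Predicting with the trained model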

Preface

Let's walk through a Keras implementation of RetinaNet and train it on our own data along the way.

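What is the RetinaNet object detection algorithm

RetinaNet is a detector introduced together with Focal Loss in the paper "Focal Loss for Dense Object Detection" (Tsung-Yi Lin, Kaiming He, and co-authors) to demonstrate the effectiveness of Focal Loss.

One-stage detectors usually rely on anchor boxes (prior boxes) to improve prediction. A single image can produce thousands or even tens of thousands of candidate boxes, but only a small fraction of them contain an object. Boxes that contain an object are positive samples; boxes that do not are negative samples. This imbalance between positives and negatives is a main reason one-stage detectors have lagged behind two-stage detectors in accuracy.

Focal Loss is a loss designed to balance the positive and negative samples of one-stage detectors.

RetinaNet's structure is very simple, but it has a very large number of anchors. With a 600x600x3 input there are 67,995 anchors, most of which cover background, so negatives vastly outnumber positives. Training RetinaNet with Focal Loss balances positives and negatives and makes training effective.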
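The anchor count of 67,995 can be checked with a short calculation. The sketch below assumes the layout described later in this article: five pyramid levels (P3 to P7) with strides 8, 16, 32, 64, and 128, 'same' padding (so feature map sizes round up), and 9 anchors per grid cell. The function name `count_anchors` is only illustrative.

```python
import math

def count_anchors(input_size=600, strides=(8, 16, 32, 64, 128), anchors_per_cell=9):
    """Count anchors over all pyramid levels for a square input."""
    total = 0
    for stride in strides:
        # 'same' padding rounds the feature map side up: 75, 38, 19, 10, 5 for 600
        side = math.ceil(input_size / stride)
        total += side * side * anchors_per_cell
    return total

print(count_anchors())  # 67995
```

The P3 level alone contributes 75 x 75 x 9 = 50,625 of these anchors, which is why the shallow levels dominate the negative samples.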

Source code download

https://github.com/bubbliiiing/retinanet-keras
If you like it, feel free to give it a star.

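RetinaNet implementation approach

I. Prediction

1. Backbone network

RetinaNet uses ResNet as its backbone. For an introduction to ResNet, see my other post: https://blog.csdn.net/weixin_44791964/article/details/102790260.

This example assumes an input image of size 600x600x3.

ResNet50 is built from two basic blocks, the Conv Block and the Identity Block. The Conv Block's input and output dimensions differ, so it cannot be stacked back to back; its job is to change the dimensions of the network. The Identity Block's input and output dimensions are the same, so it can be stacked to deepen the network.
The structure of the Conv Block is as follows:
[Figure: Conv Block structure]
The structure of the Identity Block is as follows:
[Figure: Identity Block structure]

Both are residual structures.

With a 600x600x3 input, the shape changes and the overall network structure are as follows:
[Figure: ResNet50 shape changes and overall structure]
We take the feature maps whose height and width have been halved three, four, and five times (75x75, 38x38, and 19x19) to build the feature pyramid.

Implementation: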

#-------------------------------------------------------------#
#   The ResNet50 backbone
#-------------------------------------------------------------#
from keras import layers
from keras.layers import (Activation, BatchNormalization, Conv2D, Input,
                          MaxPooling2D, ZeroPadding2D)
from keras.models import Model
def identity_block(input_tensor, kernel_size, filters, stage, block):
    filters1, filters2, filters3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    x = Conv2D(filters1, (1, 1), name=conv_name_base + '2a',use_bias=False)(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x)
    x = Activation('relu')(x)
    x = Conv2D(filters2, kernel_size,padding='same', name=conv_name_base + '2b',use_bias=False)(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x)
    x = Activation('relu')(x)
    x = Conv2D(filters3, (1, 1), name=conv_name_base + '2c',use_bias=False)(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x)
    x = layers.add([x, input_tensor])
    x = Activation('relu')(x)
    return x
def conv_block(input_tensor, kernel_size, filters, stage, block, strides=(2, 2)):
    filters1, filters2, filters3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    x = Conv2D(filters1, (1, 1), strides=strides,
               name=conv_name_base + '2a',use_bias=False)(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x)
    x = Activation('relu')(x)
    x = Conv2D(filters2, kernel_size, padding='same',
               name=conv_name_base + '2b',use_bias=False)(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x)
    x = Activation('relu')(x)
    x = Conv2D(filters3, (1, 1), name=conv_name_base + '2c',use_bias=False)(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x)
    shortcut = Conv2D(filters3, (1, 1), strides=strides,
                      name=conv_name_base + '1',use_bias=False)(input_tensor)
    shortcut = BatchNormalization(name=bn_name_base + '1')(shortcut)
    x = layers.add([x, shortcut])
    x = Activation('relu')(x)
    return x
def ResNet50(inputs):
    #-----------------------------------------------------------#
    #   Assume the input image is 600,600,3
    #-----------------------------------------------------------#
    img_input = inputs
    x = ZeroPadding2D((3, 3))(img_input)
    # 600,600,3 -> 300,300,64
    x = Conv2D(64, (7, 7), strides=(2, 2), name='conv1',use_bias=False)(x)
    x = BatchNormalization(name='bn_conv1')(x)
    x = Activation('relu')(x)
    # 300,300,64 -> 150,150,64
    x = MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
    # 150,150,64 -> 150,150,256
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1))
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b')
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='c')
    # 150,150,256 -> 75,75,512
    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a')
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='b')
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='c')
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='d')
    y1 = x
    # 75,75,512 -> 38,38,1024
    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a')
    x = identity_block(x, 3, [256, 256, 1024], stage=4, block='b')
    x = identity_block(x, 3, [256, 256, 1024], stage=4, block='c')
    x = identity_block(x, 3, [256, 256, 1024], stage=4, block='d')
    x = identity_block(x, 3, [256, 256, 1024], stage=4, block='e')
    x = identity_block(x, 3, [256, 256, 1024], stage=4, block='f')
    y2 = x
    # 38,38,1024 -> 19,19,2048
    x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a')
    x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b')
    x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c')
    y3 = x
    return y1, y2, y3
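if __name__ == "__main__":
    inputs = Input(shape=(600, 600, 3))
    # ResNet50 returns three feature tensors, so wrap them in a Model before calling summary()
    model = Model(inputs, list(ResNet50(inputs)))
    model.summary()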

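2. Obtaining predictions from features

[Figure: RetinaNet overall structure]
As the schematic shows, the extracted features still have to pass through a feature pyramid (FPN). This structure fuses features from multiple scales and makes prediction more effective.

The detailed structure of the feature pyramid is as follows:
[Figure: feature pyramid structure]
The feature pyramid gives us five feature layers: P3, P4, P5, P6, and P7.
To distinguish them from ordinary feature layers, we call them effective feature layers. Passing these five effective feature layers through the class and box subnets yields the predictions.

The class subnet applies four 256-channel convolutions followed by one convolution with num_anchors x num_classes channels, where num_anchors is the number of anchors at each grid point of the feature layer and num_classes is the number of object classes the network detects.

The box subnet applies four 256-channel convolutions followed by one convolution with num_anchors x 4 channels, where num_anchors is the number of anchors at each grid point of the feature layer and 4 is the number of adjustment parameters for each anchor.

Note that every feature layer uses the same class subnet, and every feature layer uses the same box subnet.
[Figure: class and box subnets]
Here:
The num_anchors x 4 convolution predicts how each anchor at each grid point of the feature layer should be adjusted. (Why "adjusted"? Because RetinaNet's output has to be combined with the anchors to obtain the predicted boxes; the prediction is the adjustment applied to each anchor.)

The num_anchors x num_classes convolution predicts the class of each predicted box at each grid point of the feature layer.

The implementation is: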

import math
import keras
import keras.layers
import numpy as np
import tensorflow as tf
from nets.resnet import ResNet50
class UpsampleLike(keras.layers.Layer):
    def call(self, inputs, **kwargs):
        source, target = inputs
        target_shape = keras.backend.shape(target)
        return tf.image.resize_images(source, (target_shape[1], target_shape[2]), method=tf.image.ResizeMethod.NEAREST_NEIGHBOR, align_corners=False)
    def compute_output_shape(self, input_shape):
        return (input_shape[0][0],) + input_shape[1][1:3] + (input_shape[0][-1],)
class PriorProbability(keras.initializers.Initializer):
    def __init__(self, probability=0.01):
        self.probability = probability
    def get_config(self):
        return {'probability': self.probability}
    def __call__(self, shape, dtype=None):
        # set bias to -log((1 - p)/p) for foreground
        result = np.ones(shape, dtype=dtype) * -math.log((1 - self.probability) / self.probability)
        return result
#-----------------------------------------#
#   Retinahead: regression predictions
#   All feature layers share one Retinahead
#-----------------------------------------#
def make_last_layer_loc(num_anchors, pyramid_feature_size = 256):
    inputs = keras.layers.Input(shape=(None, None, pyramid_feature_size)) 
    options = {
        'kernel_size'        : 3,
        'strides'            : 1,
        'padding'            : 'same',
        'kernel_initializer' : keras.initializers.normal(mean=0.0, stddev=0.01, seed=None),
        'bias_initializer'   : 'zeros'
    }
    outputs = inputs
    #-----------------------------------------#
    #   Four convolutions, each with 256 channels
    #-----------------------------------------#
    for i in range(4):
        outputs = keras.layers.Conv2D(filters=256,activation='relu',name='pyramid_regression_{}'.format(i),**options)(outputs)
    #-----------------------------------------#
    #   Regression prediction, reshaped to (-1, 4)
    #-----------------------------------------#
    outputs     = keras.layers.Conv2D(num_anchors * 4, name='pyramid_regression', **options)(outputs)
    regression  = keras.layers.Reshape((-1, 4), name='pyramid_regression_reshape')(outputs)
    #-----------------------------------------#
    #   Wrap everything in a model
    #-----------------------------------------#
    regression_model = keras.models.Model(inputs=inputs, outputs=regression, name="regression_submodel")
    return regression_model
#-----------------------------------------#
#   Retinahead: classification predictions
#   All feature layers share one Retinahead
#-----------------------------------------#
def make_last_layer_cls(num_classes, num_anchors, pyramid_feature_size=256):
    inputs = keras.layers.Input(shape=(None, None, pyramid_feature_size))
    options = {
        'kernel_size' : 3,
        'strides'     : 1,
        'padding'     : 'same',
    }
    outputs = inputs
    #-----------------------------------------#
    #   Four convolutions, each with 256 channels
    #-----------------------------------------#
    for i in range(4):
        outputs = keras.layers.Conv2D(filters=256, activation='relu', name='pyramid_classification_{}'.format(i),
            kernel_initializer=keras.initializers.normal(mean=0.0, stddev=0.01, seed=None), bias_initializer='zeros', **options)(outputs)
    #-----------------------------------------#
    #   Classification prediction, reshaped to (-1, num_classes)
    #-----------------------------------------#
    outputs = keras.layers.Conv2D(filters=num_classes * num_anchors,
        kernel_initializer  = keras.initializers.normal(mean=0.0, stddev=0.01, seed=None),
        bias_initializer    = PriorProbability(probability=0.01),
        name='pyramid_classification',
        **options
    )(outputs)
    outputs         = keras.layers.Reshape((-1, num_classes), name='pyramid_classification_reshape')(outputs)
    #-----------------------------------------#
    #   Apply a sigmoid to turn the outputs into probabilities
    #-----------------------------------------#
    classification  = keras.layers.Activation('sigmoid', name='pyramid_classification_sigmoid')(outputs)
    #-----------------------------------------#
    #   Wrap everything in a model
    #-----------------------------------------#
    classification_model = keras.models.Model(inputs=inputs, outputs=classification, name="classification_submodel")
    return classification_model
def resnet_retinanet(input_shape, num_classes, num_anchors = 9, name='retinanet'):
    inputs = keras.layers.Input(shape=input_shape)
    #-----------------------------------------#
    #   Take three feature layers from the backbone: C3, C4, C5
    #   C3     75,75,512
    #   C4     38,38,1024
    #   C5     19,19,2048
    #-----------------------------------------#
    C3, C4, C5 = ResNet50(inputs)
    # 75,75,512 -> 75,75,256
    P3              = keras.layers.Conv2D(256, kernel_size=1, strides=1, padding='same', name='C3_reduced')(C3)
    # 38,38,1024 -> 38,38,256
    P4              = keras.layers.Conv2D(256, kernel_size=1, strides=1, padding='same', name='C4_reduced')(C4)
    # 19,19,2048 -> 19,19,256
    P5              = keras.layers.Conv2D(256, kernel_size=1, strides=1, padding='same', name='C5_reduced')(C5)
    # 19,19,256 -> 38,38,256
    P5_upsampled    = UpsampleLike(name='P5_upsampled')([P5, P4])
    # 38,38,256 + 38,38,256 -> 38,38,256
    P4              = keras.layers.Add(name='P4_merged')([P5_upsampled, P4])
    # 38,38,256 -> 75,75,256
    P4_upsampled    = UpsampleLike(name='P4_upsampled')([P4, P3])
    # 75,75,256 + 75,75,256 -> 75,75,256
    P3              = keras.layers.Add(name='P3_merged')([P4_upsampled, P3])
    # 75,75,256 -> 75,75,256
    P3              = keras.layers.Conv2D(256, kernel_size=3, strides=1, padding='same', name='P3')(P3)
    # 38,38,256 -> 38,38,256
    P4              = keras.layers.Conv2D(256, kernel_size=3, strides=1, padding='same', name='P4')(P4)
    # 19,19,256 -> 19,19,256
    P5              = keras.layers.Conv2D(256, kernel_size=3, strides=1, padding='same', name='P5')(P5)
    # 19,19,2048 -> 10,10,256
    P6              = keras.layers.Conv2D(256, kernel_size=3, strides=2, padding='same', name='P6')(C5)
    P7              = keras.layers.Activation('relu', name='C6_relu')(P6)
    # 10,10,256 -> 5,5,256
    P7              = keras.layers.Conv2D(256, kernel_size=3, strides=2, padding='same', name='P7')(P7)
    features        =  [P3, P4, P5, P6, P7]
    regression_model        = make_last_layer_loc(num_anchors)
    classification_model    = make_last_layer_cls(num_classes, num_anchors)
    regressions     = []
    classifications = []
    #----------------------------------------------------------#
    #   Pass P3, P4, P5, P6, P7 through the Retinahead
    #   to get the regression and classification predictions,
    #   then concatenate the predictions of all feature layers
    #----------------------------------------------------------#
    for feature in features:
        regression      = regression_model(feature)
        classification  = classification_model(feature)
        regressions.append(regression)
        classifications.append(classification)
    regressions     = keras.layers.Concatenate(axis=1, name="regression")(regressions)
    classifications = keras.layers.Concatenate(axis=1, name="classification")(classifications)
    model = keras.models.Model(inputs, [regressions, classifications], name=name)
    return model

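3. Decoding the predictions

Processing each feature layer gives us three things:

The num_anchors x 4 convolution, which predicts the adjustment of each anchor at each grid point of the feature layer.

The num_anchors x num_classes convolution, which predicts the class of each predicted box at each grid point of the feature layer.

The anchors of each effective feature layer, i.e. the 9 predefined boxes at each grid point of that layer.

We combine the output of the num_anchors x 4 convolution with the anchors of each effective feature layer to obtain the actual box positions.

The anchors of each effective feature layer work as follows:
Each effective feature layer divides the whole image into a grid matching its height and width; P3, for example, divides the image into 75x75 cells. Nine anchors are then placed at the center of each cell, for a total of 75x75x9 = 50,625 anchors.
[Figure: anchors placed on the feature-layer grid]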
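As a rough illustration of how nine anchors per cell can be laid out on the P3 grid, here is a NumPy sketch. The aspect ratios (0.5, 1, 2) and the three scales (2^0, 2^(1/3), 2^(2/3)) follow the usual RetinaNet convention and are an assumption here, as are the helper names `make_anchors` and `shift_anchors`; the repository's own anchor code may differ in detail.

```python
import numpy as np

def make_anchors(base_size, ratios=(0.5, 1.0, 2.0),
                 scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return (len(ratios) * len(scales), 4) anchors centred on the origin as x1, y1, x2, y2."""
    anchors = []
    for ratio in ratios:              # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feature_size, stride):
    """Copy the anchors to the centre of every cell of a feature_size x feature_size grid."""
    centres = (np.arange(feature_size) + 0.5) * stride
    cx, cy = np.meshgrid(centres, centres)
    shifts = np.stack([cx.ravel(), cy.ravel(), cx.ravel(), cy.ravel()], axis=1)
    # (cells, 1, 4) + (1, 9, 4) -> (cells, 9, 4) -> (cells * 9, 4)
    return (shifts[:, None, :] + anchors[None, :, :]).reshape(-1, 4)

p3_anchors = shift_anchors(make_anchors(32), feature_size=75, stride=8)
print(p3_anchors.shape)  # (50625, 4)
```

The base size of 32 for P3 matches the first entry of anchors_size = [32, 64, 128, 256, 512] used in train.py.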
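Although anchors carry some position and size information, they are limited and cannot represent arbitrary boxes, so they have to be adjusted. RetinaNet adjusts them with the output of the four 256-channel convolutions followed by the num_anchors x 4 convolution.

In num_anchors x 4, num_anchors is the number of anchors at the grid point, and 4 is the adjustment of the x and y coordinates of the top-left corner and of the bottom-right corner.

RetinaNet decoding shifts the top-left and bottom-right corners of each anchor; the result is the position of the predicted box.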
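A small worked example may make the corner adjustment concrete. This sketch assumes normalized coordinates in [0, 1] and a scaling factor (variance) of 0.2, matching the decode_boxes code in this article; the function name `decode` and the sample numbers are illustrative only.

```python
import numpy as np

def decode(deltas, anchors, variance=0.2):
    """Shift each anchor corner by delta * anchor size * variance, then clip to [0, 1]."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    sizes = np.stack([w, h, w, h], axis=1)
    return np.clip(anchors + deltas * sizes * variance, 0.0, 1.0)

anchor = np.array([[0.25, 0.25, 0.75, 0.75]])   # width = height = 0.5
delta = np.array([[0.5, -0.5, 1.0, 0.0]])       # network output for this anchor
print(decode(delta, anchor))  # approximately [0.30, 0.20, 0.85, 0.75]
```

Each unit of network output moves a corner by 0.2 times the anchor's width or height, so xmin moves by 0.5 x 0.5 x 0.2 = 0.05.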

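After obtaining the final predictions we still need score filtering and non-maximum suppression (NMS), a step shared by almost all object detectors:
1. Keep the boxes and scores of each class whose score exceeds confidence_threshold.
2. Run NMS using the box positions and scores.
The implementation is as follows: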

def decode_boxes(self, mbox_loc, anchors, variance=0.2):
    # Width and height of each anchor
    prior_width = anchors[:, 2] - anchors[:, 0]
    prior_height = anchors[:, 3] - anchors[:, 1]
    # Top-left and bottom-right corners of the predicted box
    decode_bbox_xmin = mbox_loc[:,0] * prior_width * variance + anchors[:, 0]
    decode_bbox_ymin = mbox_loc[:,1] * prior_height * variance + anchors[:, 1]
    decode_bbox_xmax = mbox_loc[:,2] * prior_width * variance + anchors[:, 2]
    decode_bbox_ymax = mbox_loc[:,3] * prior_height * variance + anchors[:, 3]
    # Stack the corners into (num_anchors, 4)
    decode_bbox = np.concatenate((decode_bbox_xmin[:, None],
                                    decode_bbox_ymin[:, None],
                                    decode_bbox_xmax[:, None],
                                    decode_bbox_ymax[:, None]), axis=-1)
    # Clip to the range [0, 1]
    decode_bbox = np.minimum(np.maximum(decode_bbox, 0.0), 1.0)
    return decode_bbox
def decode_box(self, predictions, anchors, image_shape, input_shape, letterbox_image, confidence=0.5):
    #---------------------------------------------------#
    #   Regression predictions
    #---------------------------------------------------#
    mbox_loc    = predictions[0]
    #---------------------------------------------------#
    #   Class confidences
    #---------------------------------------------------#
    mbox_conf   = predictions[1]
    results     = [None for _ in range(len(mbox_loc))]
    #----------------------------------------------------------------------------------------------------------------#
    #   Process each image. predict.py feeds a single image, so this loop runs only once there
    #----------------------------------------------------------------------------------------------------------------#
    for i in range(len(mbox_loc)):
        #--------------------------------#
        #   Decode the anchors with the regression output
        #--------------------------------#
        decode_bbox = self.decode_boxes(mbox_loc[i], anchors)
        class_conf  = np.expand_dims(np.max(mbox_conf[i], 1), -1)
        class_pred  = np.expand_dims(np.argmax(mbox_conf[i], 1), -1)
        #--------------------------------#
        #   Keep boxes whose confidence reaches the threshold
        #--------------------------------#
        conf_mask       = (class_conf >= confidence)[:, 0]
        #--------------------------------#
        #   Stack box, score, and class into one array
        #--------------------------------#
        detections      = np.concatenate((decode_bbox[conf_mask], class_conf[conf_mask], class_pred[conf_mask]), 1)
        unique_labels   = np.unique(detections[:,-1])
        #-------------------------------------------------------------------#
        #   Loop over the classes.
        #   NMS keeps the highest-scoring box of a class within a region,
        #   so looping over classes runs NMS separately for each class.
        #-------------------------------------------------------------------#
        for c in unique_labels:
            #------------------------------------------#
            #   All score-filtered predictions of this class
            #------------------------------------------#
            detections_class = detections[detections[:, -1] == c]
            #------------------------------------------#
            #   The built-in TensorFlow NMS is faster
            #------------------------------------------#
            idx             = self.sess.run(self.nms, feed_dict={
                self.boxes: detections_class[:, :4], self.scores: detections_class[:, 4]})
            max_detections  = detections_class[idx]
            # #------------------------------------------#
            # #   Hand-written alternative
            # #   All score-filtered predictions of this class
            # #------------------------------------------#
            # detections_class    = detections[detections[:, -1] == c]
            # scores              = detections_class[:, 4]
            # #------------------------------------------#
            # #   Sort this class by score, highest first
            # #------------------------------------------#
            # arg_sort            = np.argsort(scores)[::-1]
            # detections_class    = detections_class[arg_sort]
            # max_detections = []
            # while np.shape(detections_class)[0]>0:
            #     #-------------------------------------------------------------------------------------#
            #     #   Take the highest-scoring box each time, compute its overlap with all remaining
            #     #   boxes, and drop those that overlap too much.
            #     #-------------------------------------------------------------------------------------#
            #     max_detections.append(detections_class[0])
            #     if len(detections_class) == 1:
            #         break
            #     ious             = self.bbox_iou(max_detections[-1], detections_class[1:])
            #     detections_class = detections_class[1:][ious < self._nms_thresh]
            results[i] = max_detections if results[i] is None else np.concatenate((results[i], max_detections), axis = 0)
        if results[i] is not None:
            results[i] = np.array(results[i])
            box_xy, box_wh = (results[i][:, 0:2] + results[i][:, 2:4])/2, results[i][:, 2:4] - results[i][:, 0:2]
            results[i][:, :4] = self.efficientdet_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
    return results
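
The commented-out fallback in decode_box relies on a bbox_iou helper that is not shown in this article. The following is a minimal NumPy sketch of greedy per-class NMS under the same idea (sort by score, keep the best box, drop boxes that overlap it too much). The names `box_iou` and `nms` and the 0.3 threshold are assumptions for illustration, not the repository's code.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as x1, y1, x2, y2."""
    inter_w = np.maximum(np.minimum(box[2], boxes[:, 2]) - np.maximum(box[0], boxes[:, 0]), 0.0)
    inter_h = np.maximum(np.minimum(box[3], boxes[:, 3]) - np.maximum(box[1], boxes[:, 1]), 0.0)
    inter = inter_w * inter_h
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / np.maximum(area + areas - inter, 1e-9)

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too much
        ious = box_iou(boxes[best], boxes[rest])
        order = rest[ious < iou_threshold]
    return keep

boxes = np.array([[0.10, 0.10, 0.50, 0.50],
                  [0.12, 0.12, 0.52, 0.52],
                  [0.60, 0.60, 0.90, 0.90]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed; the third box does not overlap and is kept.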

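4. Drawing on the original image

Step 3 gives us the positions of the predicted boxes on the original image, already filtered. These boxes can be drawn directly on the image to produce the final result.

II. Training

1. Processing the ground-truth boxes

From the prediction part we know that, for each feature layer, the num_anchors x 4 convolution predicts the adjustment of each anchor at each grid point.

In other words, the raw output of the RetinaNet network is not the actual position of the predicted box on the image; it has to be decoded first.

During training we need to compute a loss, and this loss is defined relative to the RetinaNet output. We feed the image through the current network to obtain its predictions, and we also encode the ground-truth boxes, which converts their position information into the same format as the network output.

That is, for every ground-truth box in every training image we need to find the matching anchors and work out what the network would have to predict to produce that ground-truth box.

Going from predictions to boxes is called decoding; going from ground-truth boxes to prediction targets is called encoding.

So encoding is simply the decoding process run in reverse.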
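
Because encoding is the inverse of decoding, encoding a box against an anchor and decoding the result should give the box back. The round trip below is a sketch under the same assumptions as the article's code (corner offsets scaled by the anchor size and a variance of 0.2); the standalone `encode` and `decode` functions and the sample coordinates are illustrative.

```python
import numpy as np

VARIANCE = 0.2

def encode(box, anchor):
    """Turn a ground-truth box into the offsets the network should predict."""
    w, h = anchor[2] - anchor[0], anchor[3] - anchor[1]
    return (box - anchor) / np.array([w, h, w, h]) / VARIANCE

def decode(delta, anchor):
    """Turn predicted offsets back into a box."""
    w, h = anchor[2] - anchor[0], anchor[3] - anchor[1]
    return delta * np.array([w, h, w, h]) * VARIANCE + anchor

anchor = np.array([0.20, 0.20, 0.60, 0.60])
truth = np.array([0.25, 0.18, 0.66, 0.58])
target = encode(truth, anchor)
print(np.allclose(decode(target, anchor), truth))  # True
```
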

The implementation is as follows:

def encode_box(self, box, return_iou=True, variance=0.2):
    #---------------------------------------------#
    #   IoU between this ground-truth box and every anchor
    #---------------------------------------------#
    iou = self.iou(box)
    ignored_box         = np.zeros((self.num_anchors, 1))
    #---------------------------------------------------#
    #   Anchors whose IoU falls inside the ignore range
    #---------------------------------------------------#
    assign_mask_ignore  = (iou > self.ignore_threshold) & (iou < self.overlap_threshold)
    ignored_box[:, 0][assign_mask_ignore] = iou[assign_mask_ignore]
    encoded_box = np.zeros((self.num_anchors, 4 + return_iou))
    #---------------------------------------------#
    #   Anchors that overlap this ground-truth box strongly
    #---------------------------------------------#
    assign_mask = iou > self.overlap_threshold
    #---------------------------------------------#
    #   If no anchor exceeds self.overlap_threshold,
    #   take the anchor with the highest IoU as the positive sample
    #---------------------------------------------#
    if not assign_mask.any():
        assign_mask[iou.argmax()] = True
    #---------------------------------------------#
    #   Store the IoU of the assigned anchors
    #---------------------------------------------#
    if return_iou:
        encoded_box[:, -1][assign_mask] = iou[assign_mask]
    #---------------------------------------------#
    #   Select the assigned anchors
    #---------------------------------------------#
    assigned_anchors = self.anchors[assign_mask]
    #---------------------------------------------#
    #   Reverse encoding: convert the ground-truth box into
    #   the format of the RetinaNet output.
    #   First compute the width and height of the assigned anchors
    #---------------------------------------------#
    assigned_anchors_w = (assigned_anchors[:, 2] - assigned_anchors[:, 0])
    assigned_anchors_h = (assigned_anchors[:, 3] - assigned_anchors[:, 1])
    #------------------------------------------------#
    #   Work out the prediction RetinaNet should produce:
    #   corner offsets divided by the anchor size and the variance
    #------------------------------------------------#
    encoded_box[:,0][assign_mask] = (box[0] - assigned_anchors[:, 0])/assigned_anchors_w/variance
    encoded_box[:,1][assign_mask] = (box[1] - assigned_anchors[:, 1])/assigned_anchors_h/variance
    encoded_box[:,2][assign_mask] = (box[2] - assigned_anchors[:, 2])/assigned_anchors_w/variance
    encoded_box[:,3][assign_mask] = (box[3] - assigned_anchors[:, 3])/assigned_anchors_h/variance
    return encoded_box.ravel(), ignored_box.ravel()

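The code above finds, for one ground-truth box, all anchors with a high IoU, and computes the prediction target each of those anchors should have.

However, an image may contain several ground-truth boxes, and one anchor may overlap strongly with more than one of them. In that case we keep only the ground-truth box with the highest IoU for that anchor.

So one more filtering step is needed: among the targets computed above, each anchor keeps the target from the ground-truth box with the largest IoU.

assign_boxes then gives us the prediction the network should produce for the input image.

The implementation is as follows: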

def assign_boxes(self, boxes):
    #---------------------------------------------------#
    #   assignment has four parts:
    #   :4      regression targets the network should predict
    #   4       anchor state for the regression loss
    #           (1 positive, 0 background, -1 ignored)
    #   5:-1    class labels of the anchor, all zeros (background) by default
    #   -1      anchor state for the classification loss
    #---------------------------------------------------#
    assignment          = np.zeros((self.num_anchors, 4 + 1 + self.num_classes + 1))
    assignment[:, 4]    = 0.0
    assignment[:, -1]   = 0.0
    if len(boxes) == 0:
        return assignment
    #---------------------------------------------------#
    #   Compute the IoU and encoding for every ground-truth box
    #---------------------------------------------------#
    apply_along_axis_boxes = np.apply_along_axis(self.encode_box, 1, boxes[:, :4])
    encoded_boxes = np.array([apply_along_axis_boxes[i, 0] for i in range(len(apply_along_axis_boxes))])
    ignored_boxes = np.array([apply_along_axis_boxes[i, 1] for i in range(len(apply_along_axis_boxes))])
    #---------------------------------------------------#
    #   After the reshape, ignored_boxes has shape
    #   [num_true_box, num_anchors, 1], where the 1 is the IoU
    #---------------------------------------------------#
    ignored_boxes   = ignored_boxes.reshape(-1, self.num_anchors, 1)
    ignore_iou      = ignored_boxes[:, :, 0].max(axis=0)
    ignore_iou_mask = ignore_iou > 0
    assignment[:, 4][ignore_iou_mask] = -1
    assignment[:, -1][ignore_iou_mask] = -1
    #---------------------------------------------------#
    #   After the reshape, encoded_boxes has shape
    #   [num_true_box, num_anchors, 4+1]
    #   4 is the encoded box, 1 is the IoU
    #---------------------------------------------------#
    encoded_boxes   = encoded_boxes.reshape(-1, self.num_anchors, 5)
    #---------------------------------------------------#
    #   [num_anchors] For each anchor, find the ground-truth box with the highest IoU
    #---------------------------------------------------#
    best_iou        = encoded_boxes[:, :, -1].max(axis=0)
    best_iou_idx    = encoded_boxes[:, :, -1].argmax(axis=0)
    best_iou_mask   = best_iou > 0
    best_iou_idx    = best_iou_idx[best_iou_mask]
    #---------------------------------------------------#
    #   Number of anchors that were assigned a ground-truth box
    #---------------------------------------------------#
    assign_num      = len(best_iou_idx)
    # Take the encoded ground-truth boxes of the assigned anchors
    encoded_boxes   = encoded_boxes[:, best_iou_mask, :]
    assignment[:, :4][best_iou_mask] = encoded_boxes[best_iou_idx,np.arange(assign_num),:4]
    #----------------------------------------------------------#
    #   Index 4 is set to 1: these anchors match an object,
    #   so they count as positives for the regression loss
    #----------------------------------------------------------#
    assignment[:, 4][best_iou_mask]     = 1
    assignment[:, 5:-1][best_iou_mask]  = boxes[best_iou_idx, 4:]
    #----------------------------------------------------------#
    #   Index -1 marks whether the anchor has a matching object
    #----------------------------------------------------------#
    assignment[:, -1][best_iou_mask]    = 1
    # assignment now holds the prediction the network should produce for this image
    return assignment

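RetinaNet ignores anchors whose overlap is fairly high but not high enough: anchors whose IoU with a ground-truth box lies between 0.4 and 0.5 are usually ignored, meaning they count as neither positive nor negative.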
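
The three anchor states can be sketched as a small labeling function. This is a simplified illustration for a single ground-truth box, assuming the thresholds 0.4 and 0.5 mentioned above and the same state convention as the loss code (1 positive, 0 background, -1 ignored); `label_anchors` is a hypothetical name.

```python
import numpy as np

def label_anchors(iou, ignore_threshold=0.4, overlap_threshold=0.5):
    """Return 1 for positive, -1 for ignored, and 0 for background anchors."""
    state = np.zeros_like(iou)
    state[(iou > ignore_threshold) & (iou < overlap_threshold)] = -1
    state[iou > overlap_threshold] = 1
    # If no anchor is above the overlap threshold, the best one becomes positive
    if not (iou > overlap_threshold).any():
        state[iou.argmax()] = 1
    return state

print(label_anchors(np.array([0.1, 0.45, 0.8])).tolist())  # [0.0, -1.0, 1.0]
print(label_anchors(np.array([0.1, 0.45, 0.3])).tolist())  # [0.0, 1.0, 0.0]
```

The second call shows the fallback from encode_box: when no anchor clears 0.5, the anchor with the highest IoU is still used as a positive so that every ground-truth box gets at least one anchor.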

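2. Computing the loss from the processed ground-truth boxes and the predictions for the image

The loss has two parts:
1. Smooth L1 loss: the regression loss over the predictions of all positive anchors.
2. Focal loss: the cross-entropy-based classification loss over the predictions of all anchors that are not ignored.

During RetinaNet training, positives and negatives are extremely imbalanced: only a handful of anchors may match a ground-truth box, while tens of thousands match none. The loss from the negatives therefore dominates, which is why Focal Loss is introduced to balance positives and negatives.

Focal loss is the loss function introduced in the RetinaNet paper co-authored by Kaiming He. It has two important properties:

It controls the weights of positive and negative samples.
It controls the weights of easy and hard samples.

Positive and negative samples are defined as follows:
An image can generate thousands or tens of thousands of candidate boxes, but only a small fraction contain an object. Boxes with an object are positive samples; boxes without one are negative samples.

Easy and hard samples are defined as follows:
Suppose a binary classification problem where sample 1 belongs to class 1 with pt = 0.9 and sample 2 belongs to class 1 with pt = 0.6. The former is clearly more likely to be class 1, so it is an easy sample; the latter is only somewhat likely to be class 1, so it is a hard sample.

How are these weights controlled? Read on: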

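a. Controlling the weights of positive and negative samples

The standard cross-entropy loss, taking binary classification as an example, is:
$$CE(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$
We can simplify the cross-entropy loss with the following definition of $p_t$:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$
Then:
$$CE(p, y) = CE(p_t) = -\log(p_t)$$
To reduce the influence of negative samples, we can multiply the usual loss by a coefficient $\alpha_t$. Analogous to $p_t$, $\alpha_t = \alpha$ when label = 1 and $\alpha_t = 1 - \alpha$ otherwise, with $\alpha$ in the range 0 to 1. Setting $\alpha$ then controls how much positives and negatives contribute to the loss:
$$CE(p_t) = -\alpha_t \log(p_t)$$
where:
$$\alpha_t = \begin{cases} \alpha & \text{if } y = 1 \\ 1 - \alpha & \text{otherwise} \end{cases}$$
Expanded, this is:
$$CE(p, y, \alpha) = \begin{cases} -\alpha \log(p) & \text{if } y = 1 \\ -(1 - \alpha) \log(1 - p) & \text{otherwise} \end{cases}$$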
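b. Controlling the weights of easy and hard samples

Following the earlier example, in a binary classification sample 1 belongs to class 1 with pt = 0.9 and sample 2 belongs to class 1 with pt = 0.6. The higher the probability of the correct class, the easier the sample is to classify, so 1 - pt measures whether a sample is easy or hard.
The concrete form is:
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
where
$(1 - p_t)^{\gamma}$
is called the modulating factor.

1. When pt approaches 0, the modulating factor approaches 1 and the sample contributes a lot to the total loss. When pt approaches 1, the modulating factor approaches 0 and the sample contributes very little.
2. When γ = 0, focal loss reduces to the ordinary cross-entropy loss; adjusting γ changes the modulating factor.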

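c. Combining the two weighting schemes

The following formula controls both the weights of positive and negative samples and the weights of easy and hard samples:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$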
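
To see the effect of the two weights numerically, here is a NumPy sketch of the per-sample binary focal loss, evaluated on the easy (pt = 0.9) and hard (pt = 0.6) samples from the text. It is a standalone illustration with alpha = 0.25 and gamma = 2.0, not the Keras loss used for training; `focal_loss` is an illustrative name.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-sample binary focal loss; p is the predicted probability of class 1."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.6])   # easy sample, hard sample
y = np.array([1, 1])
easy, hard = focal_loss(p, y)
print(hard / easy)                 # ratio under focal loss
print(np.log(0.6) / np.log(0.9))   # ratio under plain cross-entropy
```

With these settings the hard sample's loss is roughly 78 times the easy sample's, compared with about 4.8 times under plain cross-entropy, which is how the modulating factor shifts the training signal toward hard samples.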

The implementation is as follows:

import tensorflow as tf
from keras import backend as K
def focal(alpha=0.25, gamma=2.0):
    def _focal(y_true, y_pred):
        #---------------------------------------------------#
        #   y_true [batch_size, num_anchor, num_classes+1]
        #   y_pred [batch_size, num_anchor, num_classes]
        #---------------------------------------------------#
        labels         = y_true[:, :, :-1]
        #---------------------------------------------------#
        #   -1 means ignore, 0 means background, 1 means object
        #---------------------------------------------------#
        anchor_state   = y_true[:, :, -1]  
        classification = y_pred
        # Keep the anchors that are not ignored (state != -1)
        indices        = tf.where(K.not_equal(anchor_state, -1))
        labels         = tf.gather_nd(labels, indices)
        classification = tf.gather_nd(classification, indices)
        # Compute the weight of each anchor
        alpha_factor = K.ones_like(labels) * alpha
        alpha_factor = tf.where(K.equal(labels, 1), alpha_factor, 1 - alpha_factor)
        focal_weight = tf.where(K.equal(labels, 1), 1 - classification, classification)
        focal_weight = alpha_factor * focal_weight ** gamma
        # Multiply the cross-entropy by the weight
        cls_loss = focal_weight * K.binary_crossentropy(labels, classification)
        # Normalizer: the number of positive anchors (at least 1)
        normalizer = tf.where(K.equal(anchor_state, 1))
        normalizer = K.cast(K.shape(normalizer)[0], K.floatx())
        normalizer = K.maximum(K.cast_to_floatx(1.0), normalizer)
        # Divide the summed loss by the number of positive anchors
        loss = K.sum(cls_loss) / normalizer
        return loss
    return _focal
def smooth_l1(sigma=3.0):
    sigma_squared = sigma ** 2
    def _smooth_l1(y_true, y_pred):
        #---------------------------------------------------#
        #   y_true [batch_size, num_anchor, 4+1]
        #   y_pred [batch_size, num_anchor, 4]
        #---------------------------------------------------#
        regression        = y_pred
        regression_target = y_true[:, :, :-1]
        anchor_state      = y_true[:, :, -1]
        # Keep only the positive anchors (state == 1)
        indices           = tf.where(K.equal(anchor_state, 1))
        regression        = tf.gather_nd(regression, indices)
        regression_target = tf.gather_nd(regression_target, indices)
        # Compute the smooth L1 loss
        regression_diff = regression - regression_target
        regression_diff = K.abs(regression_diff)
        regression_loss = tf.where(
            K.less(regression_diff, 1.0 / sigma_squared),
            0.5 * sigma_squared * K.pow(regression_diff, 2),
            regression_diff - 0.5 / sigma_squared
        )
        # Divide the summed loss by the number of positive anchors
        normalizer = K.maximum(1, K.shape(indices)[0])
        normalizer = K.cast(normalizer, dtype=K.floatx())
        return K.sum(regression_loss) / normalizer / 4
    return _smooth_l1
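
The piecewise form of the smooth L1 loss is easier to see outside of the Keras backend. The NumPy sketch below uses the same formula and sigma = 3.0 as the code above, applied elementwise to a few example differences; the standalone function `smooth_l1_value` is only for illustration.

```python
import numpy as np

def smooth_l1_value(diff, sigma=3.0):
    """Elementwise smooth L1: quadratic below 1 / sigma^2, linear above."""
    sigma_squared = sigma ** 2
    diff = np.abs(diff)
    return np.where(diff < 1.0 / sigma_squared,
                    0.5 * sigma_squared * diff ** 2,
                    diff - 0.5 / sigma_squared)

values = smooth_l1_value(np.array([0.05, 1.0 / 9.0, 1.0]))
print(values)  # approximately [0.01125, 0.05556, 0.94444]
```

At the switch point 1 / sigma^2 = 1/9 both branches give 1/18, so the loss is continuous; small errors are penalized quadratically and large errors only linearly, which keeps outliers from dominating the regression loss.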

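Training your own RetinaNet model

First download the repository from GitHub, unzip it, and open the folder in your editor or IDE.
Make sure the folder you open is the correct root directory; otherwise the relative paths will be wrong and the code will not run.
The root directory after opening must be the directory that contains the project files.

I. Preparing the dataset

This article trains on data in VOC format, so you need to prepare your dataset before training. If you do not have your own dataset, you can download the VOC12+07 dataset through the link on GitHub to try things out.
Before training, put the annotation files in the Annotations folder under VOCdevkit/VOC2007.
Before training, put the image files in the JPEGImages folder under VOCdevkit/VOC2007.

The dataset layout is now complete.

II. Processing the dataset

With the dataset in place, the next step is to generate 2007_train.txt and 2007_val.txt for training, using voc_annotation.py in the root directory.

voc_annotation.py has several parameters to set:
annotation_mode, classes_path, trainval_percent, train_percent, and VOCdevkit_path. For a first training run you only need to change classes_path.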

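'''
annotation_mode selects what this script computes when it runs
annotation_mode = 0: the whole annotation pipeline, i.e. the txt files in VOCdevkit/VOC2007/ImageSets plus 2007_train.txt and 2007_val.txt for training
annotation_mode = 1: only the txt files in VOCdevkit/VOC2007/ImageSets
annotation_mode = 2: only 2007_train.txt and 2007_val.txt for training
'''
annotation_mode     = 0
'''
Must be changed; used to write the object information into 2007_train.txt and 2007_val.txt
Keep it the same as the classes_path used for training and prediction
If the generated 2007_train.txt contains no object information,
the classes are not set correctly
Only used when annotation_mode is 0 or 2
'''
classes_path        = 'model_data/voc_classes.txt'
'''
trainval_percent sets the ratio of (training set + validation set) to test set; by default (training + validation):test = 9:1
train_percent sets the ratio of training set to validation set within (training + validation); by default training:validation = 9:1
Only used when annotation_mode is 0 or 1
'''
trainval_percent    = 0.9
train_percent       = 0.9
'''
Path to the folder that contains the VOC dataset
By default it points to the VOC dataset in the root directory
'''
VOCdevkit_path  = 'VOCdevkit'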
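
To make the two ratios concrete, the sketch below splits a list of sample ids the way trainval_percent and train_percent describe: first separate (train + val) from test, then split (train + val) into train and val. It is an assumption-based illustration, not the code of voc_annotation.py; `split_ids` and the fixed seed are hypothetical.

```python
import random

def split_ids(ids, trainval_percent=0.9, train_percent=0.9, seed=0):
    """Split sample ids into train, val, and test lists using the two ratios."""
    ids = list(ids)
    rng = random.Random(seed)
    num_trainval = int(len(ids) * trainval_percent)
    num_train = int(num_trainval * train_percent)
    trainval = rng.sample(ids, num_trainval)
    train = set(rng.sample(trainval, num_train))
    trainval_set = set(trainval)
    train_ids = [i for i in ids if i in train]
    val_ids = [i for i in ids if i in trainval_set and i not in train]
    test_ids = [i for i in ids if i not in trainval_set]
    return train_ids, val_ids, test_ids

train_ids, val_ids, test_ids = split_ids(range(1000))
print(len(train_ids), len(val_ids), len(test_ids))  # 810 90 100
```

With the default values, 1000 images end up as 810 for training, 90 for validation, and 100 for testing.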

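classes_path points to the txt file that lists the detection classes. For the VOC dataset, the txt we use is:
[Figure: contents of voc_classes.txt]

When training on your own dataset, you can create your own cls_classes.txt and list the classes you need to distinguish in it.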
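
A classes file of this kind is plain text with one class name per line. The sketch below writes a tiny example file and reads it back; the helper `get_classes`, the two class names, and the temporary path are all assumptions for illustration, and the repository's own loader may differ.

```python
import os
import tempfile

def get_classes(classes_path):
    """Read one class name per line, skipping blank lines."""
    with open(classes_path, encoding="utf-8") as f:
        class_names = [line.strip() for line in f if line.strip()]
    return class_names, len(class_names)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cls_classes.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("cat\ndog\n")
    print(get_classes(path))  # (['cat', 'dog'], 2)
```

The number of lines in this file determines num_classes, so the same file has to be used for annotation, training, and prediction.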

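III. Starting training

voc_annotation.py has generated 2007_train.txt and 2007_val.txt, so we can start training.
There are many training parameters; read the comments carefully after downloading the repository. The most important one is still classes_path in train.py.

classes_path points to the txt file that lists the detection classes, the same file used in voc_annotation.py. You must change it when training on your own dataset!
After changing classes_path, run train.py to start training. After several epochs, the weights are saved in the logs folder.
The other parameters work as follows: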

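#--------------------------------------------------------#
#   Always set classes_path before training so that it matches your own dataset
#--------------------------------------------------------#
classes_path    = 'model_data/voc_classes.txt'
#----------------------------------------------------------------------------------------------------------------------------#
#   See the README for the weight files (Baidu Netdisk download). Pretrained weights transfer across datasets because the features are generic.
#   Pretrained weights are needed in 99% of cases; without them the weights are too random, feature extraction is poor, and training results suffer.
#   A dimension-mismatch message when training on your own dataset is normal: the prediction targets differ, so the dimensions differ.
#
#   To resume training, set model_path to a weight file already saved in the logs folder.
#   When model_path = '', no weights are loaded for the whole model.
#
#   The weights here are for the whole model, so they are loaded in train.py.
#   To start from pretrained backbone weights, set model_path to the backbone weights; only the backbone is then loaded.
#   To train from scratch, set model_path = '' and Freeze_Train = False; training then starts from scratch with no backbone-freezing stage.
#   Training from scratch generally works poorly because the weights are too random and feature extraction is poor.
#----------------------------------------------------------------------------------------------------------------------------#
model_path      = 'model_data/resnet50_coco_best_v2.1.0.h5'
#------------------------------------------------------#
#   Input shape
#------------------------------------------------------#
input_shape     = [600, 600]
#---------------------------------------------------------------------#
#   Sets the anchor sizes; the default anchors_size works in most cases.
#   To detect small objects, you can change anchors_size.
#   Usually it is enough to shrink the anchors of the shallow layers, because they handle small objects.
#---------------------------------------------------------------------#
anchors_size    = [32, 64, 128, 256, 512]
#----------------------------------------------------#
#   Training has two stages: a frozen stage and an unfrozen stage.
#   Running out of GPU memory is unrelated to dataset size; if it happens, reduce batch_size.
#   Because of the BatchNorm layers, batch_size must be at least 2; it cannot be 1.
#----------------------------------------------------#
#----------------------------------------------------#
#   Frozen-stage training parameters
#   The backbone is frozen, so the feature extraction network does not change
#   Uses less GPU memory; only fine-tunes the network
#----------------------------------------------------#
Init_Epoch          = 0
Freeze_Epoch        = 50
Freeze_batch_size   = 8
Freeze_lr           = 1e-4
#----------------------------------------------------#
#   Unfrozen-stage training parameters
#   The backbone is no longer frozen, so the feature extraction network changes
#   Uses more GPU memory; all parameters of the network are updated
#----------------------------------------------------#
UnFreeze_Epoch      = 100
Unfreeze_batch_size = 4
Unfreeze_lr         = 1e-5
#------------------------------------------------------#
#   Whether to use frozen training; by default the backbone is frozen first and unfrozen afterwards.
#------------------------------------------------------#
Freeze_Train        = True
#------------------------------------------------------#
#   Whether to read data with multiple threads; 0 disables multithreading
#   Enabling it speeds up data loading but uses more memory
#   In Keras, multithreading is sometimes much slower
#   Enable it only when IO is the bottleneck, i.e. the GPU is much faster than image loading.
#------------------------------------------------------#
num_workers         = 0
#----------------------------------------------------#
#   Paths to the image lists and labels
#----------------------------------------------------#
train_annotation_path   = '2007_train.txt'
val_annotation_path     = '2007_val.txt'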