RetinaNet Examples: NVIDIA's One-Stop Solution for Training, Inference, and Model Conversion

[retinanet-examples][] is NVIDIA's object-detection reference project, optimized for end-to-end GPU processing:

* distributed training is accelerated with [apex.parallel.DistributedDataParallel][], built on Python multiprocessing;
* [apex.amp][] handles mixed-precision training;
* [NVIDIA DALI][] accelerates data preprocessing;
* inference runs on [TensorRT][].

The project recommends installing the [PyTorch NGC docker container][]:

    nvidia-docker run --rm --ipc=host -it nvcr.io/nvidia/pytorch:19.05-py3

Note, however, that newly released versions only support GPUs with compute capability 6.0 and above.

## [main][] ##

Flow: main → parse(args) → load\_model → worker → End

[parse][] parses the command-line arguments.

[load\_model][load_model] creates the model and loads its parameters.

[Module.share\_memory][Module.share_memory] is not listed in the documentation.

[worker][] is the function that does the actual work.

[torch.multiprocessing.spawn][] spawns `nprocs` processes that run `fn` with `args`. If one of the processes exits with a non-zero exit status, the remaining processes are killed and an exception is raised with the cause of termination. In the case an exception is caught in a child process, it is forwarded and its traceback is included in the exception raised in the parent process.

One process is created per device to run [worker][].

    'Entry point for the retinanet command'
    args = parse(args or sys.argv[1:])

    model, state = load_model(args, verbose=True)
    if model: model.share_memory()

    world = torch.cuda.device_count()
    if args.command == 'export' or world <= 1:
        worker(0, args, 1, model, state)
    else:
        torch.multiprocessing.spawn(worker, args=(args, world, model, state), nprocs=world)

## [parse][] ##

[ArgumentParser.add\_subparsers][ArgumentParser.add_subparsers]: many programs split their functionality into a number of sub-commands; for example, the `svn` program can invoke sub-commands like `svn checkout`, `svn update`, and `svn commit`. Splitting up functionality this way can be a particularly good idea when a program performs several different functions that require different kinds of command-line arguments. [ArgumentParser][] supports the creation of such sub-commands with the [add\_subparsers()][ArgumentParser.add_subparsers] method. The method is normally called with no arguments and returns a special action object. That object has a method, add\_parser(), which takes a command name and any [ArgumentParser][] constructor arguments and returns an [ArgumentParser][] object that can be modified as usual.

Parameter description:

* `title`: title for the sub-parser group in help output; by default "subcommands" if `description` is provided, otherwise the `title` argument is used.
* `description`: description for the sub-parser group in help output; `None` by default.
* `prog`: usage information that will be displayed with sub-command help; by default the name of the program and any positional arguments before the subparser argument.
* `parser_class`: class used to create sub-parser instances; by default the class of the current parser (e.g. [ArgumentParser][]).
* [action][]: the basic type of action to be taken when this argument is encountered at the command line.
* [dest][]: name of the attribute under which the sub-command name will be stored; by default no value is stored.
* [required][]: whether or not a sub-command must be provided; `False` by default.
* [help][]: help for the sub-parser group in help output; `None` by default.
* [metavar][]: string presenting the available sub-commands in help; by default `None`, which presents the sub-commands in the form \{cmd1, cmd2, …\}.

    parser = argparse.ArgumentParser(description='RetinaNet Detection Utility.')
    parser.add_argument('--master', metavar='address:port', type=str, help='Adress and port of the master worker', default='127.0.0.1:29500')

    subparsers = parser.add_subparsers(help='sub-command', dest='command')
    subparsers.required = True

    devcount = max(1, torch.cuda.device_count())
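As a minimal, standalone sketch of this sub-command pattern (the `demo`, `train`, and `infer` names below are invented for the illustration):

    import argparse

    parser = argparse.ArgumentParser(prog='demo')
    subparsers = parser.add_subparsers(help='sub-command', dest='command')
    subparsers.required = True

    # Each sub-command gets its own parser and its own argument set.
    parser_train = subparsers.add_parser('train', help='train a network')
    parser_train.add_argument('--lr', type=float, default=0.01)
    parser_infer = subparsers.add_parser('infer', help='run inference')
    parser_infer.add_argument('--batch', type=int, default=2)

    args = parser.parse_args(['train', '--lr', '0.02'])
    print(args.command, args.lr)  # -> train 0.02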
Training argument setup. The batch size defaults to two images per device.

    parser_train = subparsers.add_parser('train', help='train a network')
    parser_train.add_argument('model', type=str, help='path to output model or checkpoint to resume from')
    parser_train.add_argument('--annotations', metavar='path', type=str, help='path to COCO style annotations', required=True)
    parser_train.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
    parser_train.add_argument('--backbone', action='store', type=str, nargs='+', help='backbone model (or list of)', default=['ResNet50FPN'])
    parser_train.add_argument('--classes', metavar='num', type=int, help='number of classes', default=80)
    parser_train.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
    parser_train.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
    parser_train.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
    parser_train.add_argument('--jitter', metavar='min max', type=int, nargs=2, help='jitter size within range', default=[640, 1024])
    parser_train.add_argument('--iters', metavar='number', type=int, help='number of iterations to train for', default=90000)
    parser_train.add_argument('--milestones', action='store', type=int, nargs='*', help='list of iteration indices where learning rate decays', default=[60000, 80000])
    parser_train.add_argument('--schedule', metavar='scale', type=float, help='scale schedule (affecting iters and milestones)', default=1)
    parser_train.add_argument('--full-precision', help='train in full precision', action='store_true')
    parser_train.add_argument('--lr', metavar='value', help='learning rate', type=float, default=0.01)
    parser_train.add_argument('--warmup', metavar='iterations', help='numer of warmup iterations', type=int, default=1000)
    parser_train.add_argument('--gamma', metavar='value', type=float, help='multiplicative factor of learning rate decay', default=0.1)
    parser_train.add_argument('--override', help='override model', action='store_true')
    parser_train.add_argument('--val-annotations', metavar='path', type=str, help='path to COCO style validation annotations')
    parser_train.add_argument('--val-images', metavar='path', type=str, help='path to validation images')
    parser_train.add_argument('--post-metrics', metavar='url', type=str, help='post metrics to specified url')
    parser_train.add_argument('--fine-tune', metavar='path', type=str, help='fine tune a pretrained model')
    parser_train.add_argument('--logdir', metavar='logdir', type=str, help='directory where to write logs')
    parser_train.add_argument('--val-iters', metavar='number', type=int, help='number of iterations between each validation', default=8000)
    parser_train.add_argument('--with-dali', help='use dali for data loading', action='store_true')

Inference argument setup.

    parser_infer = subparsers.add_parser('infer', help='run inference')
    parser_infer.add_argument('model', type=str, help='path to model')
    parser_infer.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
    parser_infer.add_argument('--annotations', metavar='annotations', type=str, help='evaluate using provided annotations')
    parser_infer.add_argument('--output', metavar='file', type=str, help='save detections to specified JSON file', default='detections.json')
    parser_infer.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
    parser_infer.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
    parser_infer.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
    parser_infer.add_argument('--with-dali', help='use dali for data loading', action='store_true')
    parser_infer.add_argument('--full-precision', help='inference in full precision', action='store_true')

Model export arguments.

    parser_export = subparsers.add_parser('export', help='export a model into a TensorRT engine')
    parser_export.add_argument('model', type=str, help='path to model')
    parser_export.add_argument('export', type=str, help='path to exported output')
    parser_export.add_argument('--size', metavar='height width', type=int, nargs='+', help='input size (square) or sizes (h w) to use when generating TensorRT engine', default=[1280])
    parser_export.add_argument('--batch', metavar='size', type=int, help='max batch size to use for TensorRT engine', default=2)
    parser_export.add_argument('--full-precision', help='export in full instead of half precision', action='store_true')
    parser_export.add_argument('--int8', help='calibrate model and export in int8 precision', action='store_true')
    parser_export.add_argument('--opset', metavar='version', type=int, help='ONNX opset version')
    parser_export.add_argument('--calibration-batches', metavar='size', type=int, help='number of batches to use for int8 calibration', default=10)
    parser_export.add_argument('--calibration-images', metavar='path', type=str, help='path to calibration images to use for int8 calibration', default="")
    parser_export.add_argument('--calibration-table', metavar='path', type=str, help='path of existing calibration table to load from, or name of new calibration table', default="")
    parser_export.add_argument('--verbose', help='enable verbose logging', action='store_true')

    return parser.parse_args(args)
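For illustration, a hypothetical call to `parse()` (the checkpoint and annotation paths below are made up), mirroring how [main][] passes `sys.argv[1:]`:

    args = parse(['train', 'retinanet_rn50fpn.pth',
                  '--annotations', 'instances_train2017.json',
                  '--images', 'train2017/'])
    print(args.command)   # 'train'
    print(args.backbone)  # ['ResNet50FPN'] (the default)
    print(args.batch)     # 2 * max(1, torch.cuda.device_count())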
## [load\_model][load_model] ##

Check whether a model file was specified.

    if args.command != 'train' and not os.path.isfile(args.model):
        raise RuntimeError('Model file {} does not exist!'.format(args.model))

Parse the model's file extension.

    model = None
    state = {}
    _, ext = os.path.splitext(args.model)

In training mode, if no model file exists (or `--override` is set), create a [Model][] instance; [Model.initialize][] loads pre-trained parameters or performs initialization. [Model.load][] first creates the model from its backbone(s), then loads the parameters and fetches the training-state variables.

    if args.command == 'train' and (not os.path.exists(args.model) or args.override):
        if verbose: print('Initializing model...')
        model = Model(args.backbone, args.classes)
        model.initialize(args.fine_tune)
        if verbose: print(model)

    elif ext == '.pth' or ext == '.torch':
        if verbose: print('Loading model from {}...'.format(os.path.basename(args.model)))
        model, state = Model.load(args.model)
        if verbose: print(model)

    elif args.command == 'infer' and ext in ['.engine', '.plan']:
        model = None

    else:
        raise RuntimeError('Invalid model format "{}"!'.format(ext))

    state['path'] = args.model
    return model, state

## [worker][] ##

Flow: worker → train.train / infer.infer / model.export

The [worker][] function can perform three kinds of task: training, inference, and model export.

[os.environ][] is a [mapping][] object representing the string environment, available as a process parameter. For example, `environ['HOME']` is the pathname of your home directory (on some platforms), equivalent to `getenv("HOME")` in C. The mapping is captured the first time the [os][] module is imported, typically during Python startup as part of processing `site.py`. Changes to the environment made after this time are not reflected in [os.environ][], except for changes made by modifying [os.environ][] directly.

[torch.cuda.set\_device][torch.cuda.set_device] sets the current device.

[torch.distributed.init\_process\_group][torch.distributed.init_process_group] initializes the default distributed process group, which also initializes the distributed package. There are two main ways to initialize a process group:

* specify `store`, `rank`, and `world_size` explicitly;
* specify `init_method` (a URL string) indicating where/how to discover peers, optionally specifying `rank` and `world_size`, or encoding all required parameters in the URL and omitting them.

If neither is specified, `init_method` is assumed to be "env://".

    'Per-device distributed worker'
    if torch.cuda.is_available():
        os.environ.update({
            'MASTER_PORT': args.master.split(':')[-1],
            'MASTER_ADDR': ':'.join(args.master.split(':')[:-1]),
            'WORLD_SIZE':  str(world),
            'RANK':        str(rank),
            'CUDA_DEVICE': str(rank)
        })

        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')

        if args.batch % world != 0:
            raise RuntimeError('Batch size should be a multiple of the number of GPUs')
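A minimal, self-contained sketch of this env:// initialization (and of [torch.multiprocessing.spawn][] driving it), assuming a single machine with at least one CUDA device:

    import os
    import torch

    def init_worker(rank, world):
        # spawn() passes the process index as the first argument
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        os.environ['WORLD_SIZE'] = str(world)
        os.environ['RANK'] = str(rank)
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')
        print('rank {}/{} ready'.format(rank, world))

    if __name__ == '__main__':
        world = torch.cuda.device_count()
        torch.multiprocessing.spawn(init_worker, args=(world,), nprocs=world)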
The [train][] function takes a particularly long list of parameters.

    if args.command == 'train':
        train.train(model, state, args.images, args.annotations,
            args.val_images or args.images, args.val_annotations, args.resize, args.max_size, args.jitter,
            args.batch, int(args.iters * args.schedule), args.val_iters, not args.full_precision, args.lr,
            args.warmup, [int(m * args.schedule) for m in args.milestones], args.gamma,
            is_master=(rank == 0), world=world, use_dali=args.with_dali,
            metrics_url=args.post_metrics, logdir=args.logdir, verbose=(rank == 0))

For inference, [Engine::\_load][Engine_load] loads the model. The [Engine][] class wraps a TensorRT CUDA engine; the [PYBIND11\_MODULE][PYBIND11_MODULE] macro in [extensions.cpp][] exports the C++ symbols. [infer][] then runs the inference.

    elif args.command == 'infer':
        if model is None:
            if rank == 0: print('Loading CUDA engine from {}...'.format(os.path.basename(args.model)))
            model = Engine.load(args.model)

        infer.infer(model, args.images, args.output, args.resize, args.max_size, args.batch,
            annotations=args.annotations, mixed_precision=not args.full_precision,
            is_master=(rank == 0), world=world, use_dali=args.with_dali, verbose=(rank == 0))

For export, build the list of calibration images from the given path and shuffle it. [Model.export][] performs the calibration.

    elif args.command == 'export':
        onnx_only = args.export.split('.')[-1] == 'onnx'
        input_size = args.size * 2 if len(args.size) == 1 else args.size

        calibration_files = []
        if args.int8:
            # Get list of images to use for calibration
            if os.path.isdir(args.calibration_images):
                import glob
                file_extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']
                for ex in file_extensions:
                    calibration_files += glob.glob("{}/*{}".format(args.calibration_images, ex), recursive=True)
                # Only need enough images for specified num of calibration batches
                if len(calibration_files) >= args.calibration_batches * args.batch:
                    calibration_files = calibration_files[:(args.calibration_batches * args.batch)]
                else:
                    print('Only found enough images for {} batches. Continuing anyway...'.format(len(calibration_files) // args.batch))

            random.shuffle(calibration_files)

        precision = "FP32"
        if args.int8:
            precision = "INT8"
        elif not args.full_precision:
            precision = "FP16"

        exported = model.export(input_size, args.batch, precision, calibration_files, args.calibration_table, args.verbose, onnx_only=onnx_only, opset=args.opset)
        if onnx_only:
            with open(args.export, 'wb') as out:
                out.write(exported)
        else:
            exported.save(args.export)

## [train][] ##

Keep a reference to the model in `nn_model`. [convert\_fixedbn\_model][convert_fixedbn_model] replaces every [torch.nn.BatchNorm2d][] in the model with a [FixedBatchNorm2d][].

    'Train the model on the given dataset'
    # Prepare model
    nn_model = model
    stride = model.stride

    model = convert_fixedbn_model(model)
    if torch.cuda.is_available():
        model = model.cuda()

[apex.amp.initialize][] initializes the model, optimizer, and the Torch tensor and functional namespaces according to the chosen `opt_level` and overridden properties, if any. It should be called after the model and optimizer have been constructed, but before the model is sent to the [torch.nn.parallel.DistributedDataParallel][] wrapper; see [Distributed training][] in the ImageNet example. Currently, [apex.amp.initialize][] should be called only once, although it can handle an arbitrary number of models and optimizers (see the corresponding [Advanced Amp Usage topic][]). If you think your use case requires calling [apex.amp.initialize][] multiple times, contact NVIDIA. Any property keyword argument that is not `None` is interpreted as a manual override. To avoid having to rewrite anything else in the script, name the returned model/optimizer so that they replace the passed-in ones.

[apex.parallel.DistributedDataParallel][] is a module wrapper that enables easy multiprocess distributed data-parallel training, similar to [torch.nn.parallel.DistributedDataParallel][]. Parameters are broadcast across participating processes on initialization, and gradients are allreduced and averaged over processes during [backward()][backward]. [DistributedDataParallel][apex.parallel.DistributedDataParallel] is optimized for use with NCCL: it overlaps communication with computation during [backward()][backward] and buckets smaller gradient transfers to reduce the total number of transfers required, which improves performance. This is similar to the optimizations done by [NVCaffe][].

    # Setup optimizer and schedule
    optimizer = SGD(model.parameters(), lr=lr, weight_decay=0.0001, momentum=0.9)

    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level = 'O2' if mixed_precision else 'O0',
                                      keep_batchnorm_fp32 = True,
                                      loss_scale = 128.0,
                                      verbosity = is_master)

    if world > 1:
        model = DistributedDataParallel(model)
    model.train()

    if 'optimizer' in state:
        optimizer.load_state_dict(state['optimizer'])

    def schedule(train_iter):
        if warmup and train_iter <= warmup:
            return 0.9 * train_iter / warmup + 0.1
        return gamma ** len([m for m in milestones if m <= train_iter])
    scheduler = LambdaLR(optimizer, schedule)
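A standalone check of what `schedule()` produces with the default hyper-parameters (`--warmup 1000 --milestones 60000 80000 --gamma 0.1`); the returned value is the factor that the LambdaLR scheduler applies to the base learning rate:

    warmup, milestones, gamma = 1000, [60000, 80000], 0.1

    def schedule(train_iter):
        if warmup and train_iter <= warmup:
            return 0.9 * train_iter / warmup + 0.1
        return gamma ** len([m for m in milestones if m <= train_iter])

    for it in (0, 500, 1000, 30000, 60000, 85000):
        print(it, round(schedule(it), 4))
    # -> 0.1, 0.55, 1.0, 1.0 (gamma**0), 0.1, 0.01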
[DaliDataIterator][] loads the data with [NVIDIA DALI][].

    # Prepare dataset
    if verbose: print('Preparing dataset...')
    data_iterator = (DaliDataIterator if use_dali else DataIterator)(
        path, jitter, max_size, batch_size, stride,
        world, annotations, training=True)
    if verbose: print(data_iterator)

    if verbose:
        print('    device: {} {}'.format(
            world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
        print('     batch: {}, precision: {}'.format(batch_size, 'mixed' if mixed_precision else 'full'))
        print('Training model for {} iterations...'.format(iterations))

    # Create TensorBoard writer
    if logdir is not None:
        from tensorboardX import SummaryWriter
        if is_master and verbose:
            print('Writing TensorBoard logs to: {}'.format(logdir))
        writer = SummaryWriter(logdir=logdir)

[Profiler][] records timings for analysis.

`data` is deleted manually once the forward pass is done.

[apex.amp.scale\_loss][apex.amp.scale_loss]: on context manager entrance, it creates `scaled_loss = (loss.float()) * current loss scale` and yields `scaled_loss`, so that the user can call `scaled_loss.backward()`:

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

On context manager exit (if `delay_unscale=False`), the gradients are checked for infs/NaNs and unscaled, so that `optimizer.step()` can be called.

    profiler = Profiler(['train', 'fw', 'bw'])
    iteration = state.get('iteration', 0)
    while iteration < iterations:
        cls_losses, box_losses = [], []
        for i, (data, target) in enumerate(data_iterator):
            scheduler.step(iteration)

            # Forward pass
            profiler.start('fw')
            optimizer.zero_grad()
            cls_loss, box_loss = model([data, target])
            del data
            profiler.stop('fw')

            # Backward pass
            profiler.start('bw')
            with amp.scale_loss(cls_loss + box_loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()

            # Reduce all losses
            cls_loss, box_loss = cls_loss.mean().clone(), box_loss.mean().clone()
            if world > 1:
                torch.distributed.all_reduce(cls_loss)
                torch.distributed.all_reduce(box_loss)
                cls_loss /= world
                box_loss /= world
            if is_master:
                cls_losses.append(cls_loss)
                box_losses.append(box_loss)

            if is_master and not isfinite(cls_loss + box_loss):
                raise RuntimeError('Loss is diverging!\n{}'.format(
                    'Try lowering the learning rate.'))

            del cls_loss, box_loss
            profiler.stop('bw')

            iteration += 1

The master node prints and logs progress. [post\_metrics][post_metrics] posts the metrics with [requests][]. [ignore\_sigint][ignore_sigint] suppresses the interrupt signal while the checkpoint is written. [infer][] runs validation inference on the current model.

            profiler.bump('train')
            if is_master and (profiler.totals['train'] > 60 or iteration == iterations):
                focal_loss = torch.stack(list(cls_losses)).mean().item()
                box_loss = torch.stack(list(box_losses)).mean().item()
                learning_rate = optimizer.param_groups[0]['lr']
                if verbose:
                    msg = '[{:{len}}/{}]'.format(iteration, iterations, len=len(str(iterations)))
                    msg += ' focal loss: {:.3f}'.format(focal_loss)
                    msg += ', box loss: {:.3f}'.format(box_loss)
                    msg += ', {:.3f}s/{}-batch'.format(profiler.means['train'], batch_size)
                    msg += ' (fw: {:.3f}s, bw: {:.3f}s)'.format(profiler.means['fw'], profiler.means['bw'])
                    msg += ', {:.1f} im/s'.format(batch_size / profiler.means['train'])
                    msg += ', lr: {:.2g}'.format(learning_rate)
                    print(msg, flush=True)

                if logdir is not None:
                    writer.add_scalar('focal_loss', focal_loss, iteration)
                    writer.add_scalar('box_loss', box_loss, iteration)
                    writer.add_scalar('learning_rate', learning_rate, iteration)
                    del box_loss, focal_loss

                if metrics_url:
                    post_metrics(metrics_url, {
                        'focal loss': mean(cls_losses),
                        'box loss': mean(box_losses),
                        'im_s': batch_size / profiler.means['train'],
                        'lr': learning_rate
                    })

                # Save model weights
                state.update({
                    'iteration': iteration,
                    'optimizer': optimizer.state_dict(),
                    'scheduler': scheduler.state_dict(),
                })
                with ignore_sigint():
                    nn_model.save(state)

                profiler.reset()
                del cls_losses[:], box_losses[:]

            if val_annotations and (iteration == iterations or iteration % val_iterations == 0):
                infer(model, val_path, None, resize, max_size, batch_size, annotations=val_annotations,
                    mixed_precision=mixed_precision, is_master=is_master, world=world, use_dali=use_dali,
                    is_validation=True, verbose=False)
                model.train()

            if iteration == iterations:
                break

    if logdir is not None:
        writer.close()

## [infer][] ##

The execution backend is determined by the model's type.

    'Run inference on images from path'
    backend = 'pytorch' if isinstance(model, Model) or isinstance(model, DDP) else 'tensorrt'

    stride = model.module.stride if isinstance(model, DDP) else model.stride

[tempfile.mktemp][] is deprecated. [tempfile.mkstemp][] creates a temporary file in the most secure manner possible: there are no race conditions in the file's creation, assuming that the platform properly implements the [os.O\_EXCL][os.O_EXCL] flag for [os.open()][os.open]. The file is readable and writable only by the creating user ID, and if the platform uses permission bits to indicate whether a file is executable, the file is executable by no one. The file descriptor is not inherited by child processes.

    # Create annotations if none was provided
    if not annotations:
        annotations = tempfile.mktemp('.json')
        images = [{'id': i, 'file_name': f} for i, f in enumerate(os.listdir(path))]
        json.dump({'images': images}, open(annotations, 'w'))
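A sketch of the safer [tempfile.mkstemp][] equivalent of the `tempfile.mktemp` call above:

    import json
    import os
    import tempfile

    fd, annotations = tempfile.mkstemp(suffix='.json')   # returns (fd, path)
    with os.fdopen(fd, 'w') as f:
        json.dump({'images': []}, f)
    # ... use `annotations` ...
    os.remove(annotations)  # unlike TemporaryFile, mkstemp does not clean up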
    # TensorRT only supports fixed input sizes, so override input size accordingly
    if backend == 'tensorrt': max_size = max(model.input_size)

The data is loaded with either [DaliDataIterator][] or [DataIterator][].

    # Prepare dataset
    if verbose: print('Preparing dataset...')
    data_iterator = (DaliDataIterator if use_dali else DataIterator)(
        path, resize, max_size, batch_size, stride,
        world, annotations, training=False)
    if verbose: print(data_iterator)

Distinguish standalone inference from validation during training.

    # Prepare model
    if backend == 'pytorch':
        # If we are doing validation during training,
        # no need to register model with AMP again
        if not is_validation:
            if torch.cuda.is_available(): model = model.cuda()
            model = amp.initialize(model, None,
                                   opt_level = 'O2' if mixed_precision else 'O0',
                                   keep_batchnorm_fp32 = True,
                                   verbosity = 0)

        model.eval()

    if verbose:
        print('   backend: {}'.format(backend))
        print('    device: {} {}'.format(
            world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
        print('     batch: {}, precision: {}'.format(batch_size,
            'unknown' if backend == 'tensorrt' else 'mixed' if mixed_precision else 'full'))
        print('Running inference...')

The network outputs are collected in a list.

    results = []
    profiler = Profiler(['infer', 'fw'])
    with torch.no_grad():
        for i, (data, ids, ratios) in enumerate(data_iterator):
            # Forward pass
            profiler.start('fw')
            scores, boxes, classes = model(data)
            profiler.stop('fw')

            results.append([scores, boxes, classes, ids, ratios])

            profiler.bump('infer')
            if verbose and (profiler.totals['infer'] > 60 or i == len(data_iterator) - 1):
                size = len(data_iterator.ids)
                msg = '[{:{len}}/{}]'.format(min((i + 1) * batch_size, size), size, len=len(str(size)))
                msg += ' {:.3f}s/{}-batch'.format(profiler.means['infer'], batch_size)
                msg += ' (fw: {:.3f}s)'.format(profiler.means['fw'])
                msg += ', {:.1f} im/s'.format(batch_size / profiler.means['infer'])
                print(msg, flush=True)
                profiler.reset()

[torch.distributed.all\_gather][torch.distributed.all_gather] gathers tensors from the whole group into a list.

    # Gather results from all devices
    if verbose: print('Gathering results...')
    results = [torch.cat(r, dim=0) for r in zip(*results)]
    if world > 1:
        for r, result in enumerate(results):
            all_result = [torch.ones_like(result, device=result.device) for _ in range(world)]
            torch.distributed.all_gather(list(all_result), result)
            results[r] = torch.cat(all_result, dim=0)
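A minimal sketch of the same gathering pattern, assuming the process group is already initialized and `rank`/`world` are set as in [worker][]:

    # Each rank contributes one tensor; afterwards every rank holds them all.
    result = torch.full((2,), float(rank), device='cuda')
    all_result = [torch.ones_like(result) for _ in range(world)]
    torch.distributed.all_gather(all_result, result)
    merged = torch.cat(all_result, dim=0)  # pieces from ranks 0..world-1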
The master node copies the results back to the CPU, then collects, organizes, and evaluates them. [pycocotools.coco.COCO.getCatIds][] returns the list of category ids.

    if is_master:
        # Copy buffers back to host
        results = [r.cpu() for r in results]

        # Collect detections
        detections = []
        processed_ids = set()
        for scores, boxes, classes, image_id, ratios in zip(*results):
            image_id = image_id.item()
            if image_id in processed_ids:
                continue
            processed_ids.add(image_id)

            keep = (scores > 0).nonzero()
            scores = scores[keep].view(-1)
            boxes = boxes[keep, :].view(-1, 4) / ratios
            classes = classes[keep].view(-1).int()

            for score, box, cat in zip(scores, boxes, classes):
                x1, y1, x2, y2 = box.data.tolist()
                cat = cat.item()
                if 'annotations' in data_iterator.coco.dataset:
                    cat = data_iterator.coco.getCatIds()[cat]
                detections.append({
                    'image_id': image_id,
                    'score': score.item(),
                    'bbox': [x1, y1, x2 - x1 + 1, y2 - y1 + 1],
                    'category_id': cat
                })

[COCO.loadRes][] loads a result file and returns a result api object. A [COCOeval][] object is instantiated to evaluate the results.

    if detections:
        # Save detections
        if detections_file and verbose: print('Writing {}...'.format(detections_file))
        detections = {'annotations': detections}
        detections['images'] = data_iterator.coco.dataset['images']
        if 'categories' in data_iterator.coco.dataset:
            detections['categories'] = [data_iterator.coco.dataset['categories']]
        if detections_file:
            json.dump(detections, open(detections_file, 'w'), indent=4)

        # Evaluate model on dataset
        if 'annotations' in data_iterator.coco.dataset:
            if verbose: print('Evaluating model...')
            with redirect_stdout(None):
                coco_pred = data_iterator.coco.loadRes(detections['annotations'])
                coco_eval = COCOeval(data_iterator.coco, coco_pred, 'bbox')
                coco_eval.evaluate()
                coco_eval.accumulate()
                coco_eval.summarize()
    else:
        print('No detections!')

## [Model][] ##

[getattr][] returns the value of the named attribute of `object`. `name` must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute; for example, `getattr(x, 'foobar')` is equivalent to `x.foobar`. If the named attribute does not exist, `default` is returned if provided, otherwise [AttributeError][] is raised.

[torch.nn.ModuleDict][] holds submodules in a dictionary. [ModuleDict][torch.nn.ModuleDict] can be indexed like a regular Python dictionary, but the modules it contains are properly registered and will be visible to all [Module][] methods.

[ModuleDict][torch.nn.ModuleDict] is an ordered dictionary that respects

* the order of insertion, and
* in [update()][update], the order of the merged `OrderedDict` or of another [ModuleDict][torch.nn.ModuleDict] (the argument to [update()][update]).

Note that [update()][update] with other unordered mapping types (e.g., Python's plain dict) does not preserve the order of the merged mapping.

    'RetinaNet - https://arxiv.org/abs/1708.02002'
    def __init__(self, backbones='ResNet50FPN', classes=80, config={}):
        super().__init__()

        if not isinstance(backbones, list): backbones = [backbones]

        self.backbones = nn.ModuleDict({b: getattr(backbones_mod, b)() for b in backbones})
        self.name = 'RetinaNet'
        self.exporting = False

        self.ratios = [1.0, 2.0, 0.5]
        self.scales = [4 * 2**(i/3) for i in range(3)]
        self.anchors = {}
        self.classes = classes

        self.threshold = config.get('threshold', 0.05)
        self.top_n = config.get('top_n', 1000)
        self.nms = config.get('nms', 0.5)
        self.detections = config.get('detections', 100)

        self.stride = max([b.stride for _, b in self.backbones.items()])
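A standalone sketch of this name-to-class lookup (`TinyA`, `TinyB`, and the stand-in module object are invented for the demo):

    import types
    import torch.nn as nn

    class TinyA(nn.Module):
        def __init__(self):
            super().__init__()
            self.stride = 32

    class TinyB(nn.Module):
        def __init__(self):
            super().__init__()
            self.stride = 64

    # Stand-in for the repo's backbones module; getattr resolves names to classes.
    backbones_mod = types.SimpleNamespace(TinyA=TinyA, TinyB=TinyB)
    backbones = nn.ModuleDict({b: getattr(backbones_mod, b)() for b in ['TinyA', 'TinyB']})
    print(max(b.stride for _, b in backbones.items()))  # -> 64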
The classification and regression heads are each four convolutional layers plus an output layer. [FocalLoss][] is built on [torch.nn.functional.binary\_cross\_entropy\_with\_logits][torch.nn.functional.binary_cross_entropy_with_logits]; [SmoothL1Loss][] is implemented from scratch.

        # classification and box regression heads
        def make_head(out_size):
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(256, out_size, 3, padding=1)]
            return nn.Sequential(*layers)

        anchors = len(self.ratios) * len(self.scales)
        self.cls_head = make_head(classes * anchors)
        self.box_head = make_head(4 * anchors)

        self.cls_criterion = FocalLoss()
        self.box_criterion = SmoothL1Loss(beta=0.11)

### [\_\_repr\_\_][repr] ###

[object.\_\_repr\_\_][object._repr] is called by the [repr()][repr 1] built-in function to compute the "official" string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines [\_\_repr\_\_()][object._repr] but not [\_\_str\_\_()][str], then [\_\_repr\_\_()][object._repr] is also used when an "informal" string representation of instances of that class is required. This is typically used for debugging, so it is important that the representation be information-rich and unambiguous.

    return '\n'.join([
        '     model: {}'.format(self.name),
        '  backbone: {}'.format(', '.join([k for k, _ in self.backbones.items()])),
        '   classes: {}, anchors: {}'.format(self.classes, len(self.ratios) * len(self.scales)),
    ])

### [initialize][Model.initialize] ###

If a pre-trained checkpoint is given, load it while skipping `cls_head.8` (the class-output layer).

    if pre_trained:
        # Initialize using weights from pre-trained model
        if not os.path.isfile(pre_trained):
            raise ValueError('No checkpoint {}'.format(pre_trained))

        print('Fine-tuning weights from {}...'.format(os.path.basename(pre_trained)))
        state_dict = self.state_dict()
        chk = torch.load(pre_trained, map_location=lambda storage, loc: storage)
        ignored = ['cls_head.8.bias', 'cls_head.8.weight']
        weights = {k: v for k, v in chk['state_dict'].items() if k not in ignored}
        state_dict.update(weights)
        self.load_state_dict(state_dict)

        del chk, weights
        torch.cuda.empty_cache()

Otherwise call each backbone's own initialization method, then initialize the classification and regression heads.

    else:
        # Initialize backbone(s)
        for _, backbone in self.backbones.items():
            backbone.initialize()

        # Initialize heads
        def initialize_layer(layer):
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, std=0.01)
                if layer.bias is not None:
                    nn.init.constant_(layer.bias, val=0)
        self.cls_head.apply(initialize_layer)
        self.box_head.apply(initialize_layer)

The last layer of the classification head gets a special prior initialization: with `pi = 0.01`, the bias `b = -log((1 - pi) / pi) ≈ -4.6`, so every anchor initially predicts foreground with probability ≈ 0.01, which keeps the focal loss stable early in training.

    # Initialize class head prior
    def initialize_prior(layer):
        pi = 0.01
        b = - math.log((1 - pi) / pi)
        nn.init.constant_(layer.bias, b)
        nn.init.normal_(layer.weight, std=0.01)
    self.cls_head[-1].apply(initialize_prior)

### [forward][] ###

Flow: x → backbone → cls\_head / box\_head → training? yes: \_compute\_loss → End; no: cls\_head.sigmoid → exporting? yes: return heads; no: generate\_anchors → decode → nms

During training, the input `x` also contains the processed annotations. The backbone extracts features, which then pass through the classification and regression heads in parallel. In training mode, [\_compute\_loss][compute_loss] computes and returns the losses.

    if self.training:
        x, targets = x

    # Backbones forward pass
    features = []
    for _, backbone in self.backbones.items():
        features.extend(backbone(x))

    # Heads forward pass
    cls_heads = [self.cls_head(t) for t in features]
    box_heads = [self.box_head(t) for t in features]

    if self.training:
        return self._compute_loss(x, cls_heads, box_heads, targets.float())

When exporting, return the classification and regression outputs directly.

    cls_heads = [cls_head.sigmoid() for cls_head in cls_heads]

    if self.exporting:
        self.strides = [x.shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
        return cls_heads, box_heads

Otherwise run the inference post-processing. [generate\_anchors][generate_anchors] generates anchor coordinates from scales/ratios. [decode][] filters the outputs by score and decodes the bounding boxes. [nms][] filters the results further.

    # Inference post-processing
    decoded = []
    for cls_head, box_head in zip(cls_heads, box_heads):
        # Generate level's anchors
        stride = x.shape[-1] // cls_head.shape[-1]
        if stride not in self.anchors:
            self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)

        # Decode and filter boxes
        decoded.append(decode(cls_head, box_head, stride,
            self.threshold, self.top_n, self.anchors[stride]))

    # Perform non-maximum suppression
    decoded = [torch.cat(tensors, 1) for tensors in zip(*decoded)]
    return nms(*decoded, self.nms, self.detections)

### [\_extract\_targets][extract_targets] ###

Flow: \_extract\_targets → generate\_anchors → snap\_to\_anchors

[generate\_anchors][generate_anchors] generates anchor coordinates from scales/ratios. [snap\_to\_anchors][snap_to_anchors] builds the target tensors with respect to the anchors.

    cls_target, box_target, depth = [], [], []
    for target in targets:
        target = target[target[:, -1] > -1]
        if stride not in self.anchors:
            self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)
        snapped = snap_to_anchors(
            target, [s * stride for s in size[::-1]], stride,
            self.anchors[stride].to(targets.device), self.classes, targets.device)
        for l, s in zip((cls_target, box_target, depth), snapped): l.append(s)
    return torch.stack(cls_target), torch.stack(box_target), torch.stack(depth)
### [\_compute\_loss][compute_loss] ###

Flow: \_compute\_loss → \_extract\_targets → self.cls\_criterion / self.box\_criterion

[\_extract\_targets][extract_targets] provides the classification and regression targets. `depth` is the sample-selection mask.

    cls_losses, box_losses, fg_targets = [], [], []
    for cls_head, box_head in zip(cls_heads, box_heads):
        size = cls_head.shape[-2:]
        stride = x.shape[-1] / cls_head.shape[-1]

        cls_target, box_target, depth = self._extract_targets(targets, stride, size)
        fg_targets.append((depth > 0).sum().float().clamp(min=1))

        cls_head = cls_head.view_as(cls_target).float()
        cls_mask = (depth >= 0).expand_as(cls_target).float()
        cls_loss = self.cls_criterion(cls_head, cls_target)
        cls_loss = cls_mask * cls_loss
        cls_losses.append(cls_loss.sum())

        box_head = box_head.view_as(box_target).float()
        box_mask = (depth > 0).expand_as(box_target).float()
        box_loss = self.box_criterion(box_head, box_target)
        box_loss = box_mask * box_loss
        box_losses.append(box_loss.sum())

    fg_targets = torch.stack(fg_targets).sum()
    cls_loss = torch.stack(cls_losses).sum() / fg_targets
    box_loss = torch.stack(box_losses).sum() / fg_targets
    return cls_loss, box_loss

### [save][] ###

Save the backbone names, the number of classes, and the model weights; if present, also save `iteration`, `optimizer`, and `scheduler`.

    checkpoint = {
        'backbone': [k for k, _ in self.backbones.items()],
        'classes': self.classes,
        'state_dict': self.state_dict()
    }

    for key in ('iteration', 'optimizer', 'scheduler'):
        if key in state:
            checkpoint[key] = state[key]

    torch.save(checkpoint, state['path'])

### [load][Model.load] ###

Create the backbone(s), then load the saved parameters into them.

[torch.cuda.empty\_cache][torch.cuda.empty_cache] releases all unoccupied cached memory currently held by the caching allocator, so that it can be used by other GPU applications and becomes visible in nvidia-smi.

    @classmethod
    def load(cls, filename):
        if not os.path.isfile(filename):
            raise ValueError('No checkpoint {}'.format(filename))

        checkpoint = torch.load(filename, map_location=lambda storage, loc: storage)
        # Recreate model from checkpoint instead of from individual backbones
        model = cls(backbones=checkpoint['backbone'], classes=checkpoint['classes'])
        model.load_state_dict(checkpoint['state_dict'])

        state = {}
        for key in ('iteration', 'optimizer', 'scheduler'):
            if key in checkpoint:
                state[key] = checkpoint[key]

        del checkpoint
        torch.cuda.empty_cache()

        return model, state

### [export][Model.export] ###

If the ONNX opset version is below 9, define an `upsample_nearest2d` symbolic function.

    import torch.onnx.symbolic

    if opset is not None and opset < 9:
        # Override Upsample's ONNX export from old opset if required (not needed for TRT 5.1+)
        @torch.onnx.symbolic.parse_args('v', 'is')
        def upsample_nearest2d(g, input, output_size):
            height_scale = float(output_size[-2]) / input.type().sizes()[-2]
            width_scale = float(output_size[-1]) / input.type().sizes()[-1]
            return g.op("Upsample", input,
                scales_f=(1, 1, height_scale, width_scale),
                mode_s="nearest")
        torch.onnx.symbolic.upsample_nearest2d = upsample_nearest2d

[io.BytesIO][] is a stream implementation using an in-memory bytes buffer; it inherits [BufferedIOBase][]. The buffer is discarded when the [close()][close] method is called. The optional argument `initial_bytes` is a [bytes-like object][] containing the initial data. [torch.onnx.export][] exports the model. [io.BytesIO.getvalue][] returns [bytes][] containing the entire contents of the buffer.

    # Export to ONNX
    print('Exporting to ONNX...')
    self.exporting = True
    onnx_bytes = io.BytesIO()
    zero_input = torch.zeros([1, 3, *size]).cuda()
    extra_args = {'opset_version': opset} if opset else {}
    torch.onnx.export(self.cuda(), zero_input, onnx_bytes, **extra_args)
    self.exporting = False

    if onnx_only:
        return onnx_bytes.getvalue()
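A toy sketch of exporting into an in-memory buffer this way (the module and input size below are made up; `torch.onnx.export` writes to any file-like object with write/flush, such as a BytesIO):

    import io
    import torch

    net = torch.nn.Conv2d(3, 8, 3, padding=1)
    buf = io.BytesIO()
    torch.onnx.export(net, torch.zeros(1, 3, 64, 64), buf, opset_version=9)
    onnx_bytes = buf.getvalue()   # the serialized ONNX model
    print(len(onnx_bytes) > 0)    # True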
[generate\_anchors][generate_anchors] generates anchor coordinates from scales/ratios. An [Engine][] object is returned.

    # Build TensorRT engine
    model_name = '_'.join([k for k, _ in self.backbones.items()])
    anchors = [generate_anchors(stride, self.ratios,
        self.scales).view(-1).tolist() for stride in self.strides]
    return Engine(onnx_bytes.getvalue(), len(onnx_bytes.getvalue()), batch, precision, self.threshold,
        self.top_n, anchors, self.nms, self.detections, calibration_files, model_name, calibration_table, verbose)

## [DaliDataIterator][] ##

Flow: DaliDataIterator → COCOPipeline

[DaliDataIterator][] exposes the same interface as [torch.utils.data.DataLoader][], but implements data-parallel loading with [DALI][NVIDIA DALI]. The `batch_size` argument passed in is the total batch size; it is divided by `world` per device.

    'Data loader for data parallel using Dali'
    def __init__(self, path, resize, max_size, batch_size, stride, world, annotations, training=False):
        self.training = training
        self.resize = resize
        self.max_size = max_size
        self.stride = stride
        self.batch_size = batch_size // world
        self.mean = [255. * x for x in [0.485, 0.456, 0.406]]
        self.std = [255. * x for x in [0.229, 0.224, 0.225]]
        self.world = world
        self.path = path

[contextlib.redirect\_stdout][contextlib.redirect_stdout] is a context manager for temporarily redirecting [sys.stdout][] to another file or file-like object. This tool adds flexibility to existing functions or classes whose output is hardwired to stdout.

A [COCO][] instance is created. [COCOPipeline][] wraps [nvidia.dali.pipeline.Pipeline][]. [nvidia.dali.pipeline.Pipeline.build][] builds the pipeline; a pipeline needs to be built in order to run it standalone, while framework-specific plugins handle this step automatically.

        # Setup COCO
        with redirect_stdout(None):
            self.coco = COCO(annotations)
        self.ids = list(self.coco.imgs.keys())
        if 'categories' in self.coco.dataset:
            self.categories_inv = {k: i for i, k in enumerate(self.coco.getCatIds())}

        self.pipe = COCOPipeline(batch_size=self.batch_size, num_threads=2,
            path=path, coco=self.coco, training=training, annotations=annotations, world=world,
            device_id=torch.cuda.current_device(), mean=self.mean, std=self.std, resize=resize,
            max_size=max_size, stride=self.stride)

        self.pipe.build()

### [\_\_repr\_\_][repr 2] ###

Describes the loader.

    return '\n'.join([
        '    loader: dali',
        '    resize: {}, max: {}'.format(self.resize, self.max_size),
    ])

### [\_\_len\_\_][len] ###

    return ceil(len(self.ids) // self.world / self.batch_size)

### [\_\_iter\_\_][iter] ###

[nvidia.dali.pipeline.Pipeline.run][] runs the pipeline and returns the results. If the pipeline was created with `exec_pipelined` set to `True`, this function will also start prefetching the next iteration to speed up execution. It should not be mixed with [nvidia.dali.pipeline.Pipeline.schedule\_run()][nvidia.dali.pipeline.Pipeline.schedule_run], [nvidia.dali.pipeline.Pipeline.share\_outputs()][nvidia.dali.pipeline.Pipeline.share_outputs], and [nvidia.dali.pipeline.Pipeline.release\_outputs()][nvidia.dali.pipeline.Pipeline.release_outputs] in the same pipeline.

[ctypes.c\_void\_p][ctypes.c_void_p] represents the C `void *` type. The value is represented as an integer, and the constructor accepts an optional integer initializer.

[torch.Tensor.data\_ptr][orch.Tensor.data_ptr] returns the address of the first element of the tensor.

The results returned by `self.pipe.run()` are all of type [nvidia.dali.backend.TensorListCPU][] or [nvidia.dali.backend.TensorListGPU][].

[nvidia.dali.backend.TensorListCPU.copy\_to\_external][nvidia.dali.backend.TensorListCPU.copy_to_external] copies the contents of this `TensorList` to an external pointer (of type [ctypes.c\_void\_p][ctypes.c_void_p]) residing in CPU memory. This function is used internally by plugins to interface with tensors from supported deep learning frameworks.

[nvidia.dali.backend.TensorListGPU.as\_cpu][nvidia.dali.backend.TensorListGPU.as_cpu] returns a [TensorListCPU][nvidia.dali.backend.TensorListCPU] object that is a copy of this [TensorListGPU][nvidia.dali.backend.TensorListGPU].

    data, ratios, ids, num_detections = [], [], [], []
    dali_data, dali_boxes, dali_labels, dali_ids, dali_attrs, dali_resize_img = self.pipe.run()

    for l in range(len(dali_boxes)):
        num_detections.append(dali_boxes.at(l).shape[0])

    pyt_targets = -1 * torch.ones([len(dali_boxes), max(max(num_detections), 1), 5])

    for batch in range(self.batch_size):
        id = int(dali_ids.at(batch)[0])

        # Convert dali tensor to pytorch
        dali_tensor = dali_data.at(batch)
        tensor_shape = dali_tensor.shape()

        datum = torch.zeros(dali_tensor.shape(), dtype=torch.float, device=torch.device('cuda'))
        c_type_pointer = ctypes.c_void_p(datum.data_ptr())
        dali_tensor.copy_to_external(c_type_pointer)
        # Calculate image resize ratio to rescale boxes
        prior_size = dali_attrs.as_cpu().at(batch)
        resized_size = dali_resize_img.at(batch).shape()
        ratio = max(resized_size) / max(prior_size)

        if self.training:
            # Rescale boxes
            b_arr = dali_boxes.at(batch)
            num_dets = b_arr.shape[0]
            if num_dets != 0:
                pyt_bbox = torch.from_numpy(b_arr).float()

                pyt_bbox[:, 0] *= float(prior_size[1])
                pyt_bbox[:, 1] *= float(prior_size[0])
                pyt_bbox[:, 2] *= float(prior_size[1])
                pyt_bbox[:, 3] *= float(prior_size[0])
                # (l,t,r,b) -> (x,y,w,h) == (l,t, r-l, b-t)
                pyt_bbox[:, 2] -= pyt_bbox[:, 0]
                pyt_bbox[:, 3] -= pyt_bbox[:, 1]
                pyt_targets[batch, :num_dets, :4] = pyt_bbox * ratio

            # Arrange labels in target tensor
            l_arr = dali_labels.at(batch)
            if num_dets != 0:
                pyt_label = torch.from_numpy(l_arr).float()
                pyt_label -= 1  # Rescale labels to [0,79] instead of [1,80]
                pyt_targets[batch, :num_dets, 4] = pyt_label.squeeze()

        ids.append(id)
        data.append(datum.unsqueeze(0))
        ratios.append(ratio)

    data = torch.cat(data, dim=0)

    if self.training:
        pyt_targets = pyt_targets.cuda(non_blocking=True)
        yield data, pyt_targets
    else:
        ids = torch.Tensor(ids).int().cuda(non_blocking=True)
        ratios = torch.Tensor(ratios).cuda(non_blocking=True)
        yield data, ids, ratios

## [COCOPipeline][] ##

Flow: COCOPipeline → pipeline.Pipeline

See the documentation example [COCO Reader with augmentations][].

[nvidia.dali.ops.COCOReader][] is a CPU operator that reads data from a COCO dataset, which is composed of a directory of images and an annotation file. For an image with m bboxes, it returns the bboxes as an (m, 4) Tensor (m \* \[x, y, w, h\] or `m * [left, top, right, bottom]`) and the labels as an (m, 1) Tensor (m \* category\_id).

[nvidia.dali.ops.nvJPEGDecoderSlice][] is a "mixed" operator that partially decodes JPEG images with the [nvJPEG][] library, based on a cropping window of given size and anchor. The inputs must be supplied as three tensors in a specific order:

* `encoded_data` contains the encoded image data;
* `begin` contains the starting pixel coordinates of the crop in `(x,y)` format;
* `size` contains the pixel dimensions of the crop in `(w,h)` format.

For `begin` and `size`, coordinates must be in the interval `[0.0, 1.0]`. The decoder outputs images in `HWC` layout.

> Warning
> This operator is now deprecated. Use [ImageDecoderSlice][] instead.

[nvidia.dali.ops.RandomBBoxCrop][] is a CPU operator that performs a prospective crop of an image while keeping the bounding boxes and labels consistent. The inputs must be supplied as two tensors:

* `BBoxes` containing bounding boxes represented as `[l,t,r,b]` or `[x,y,w,h]`;
* `Labels` containing the corresponding label for each bounding box.

The resulting prospective crop is provided as two tensors:

* `Begin` containing the starting coordinates of the crop in `(x,y)` format;
* `Size` containing the dimensions of the crop in `(w,h)` format.

Bounding boxes are provided as an `(m*4)` tensor, where each bounding box is represented as `[l,t,r,b]` or `[x,y,w,h]`. Labels whose boxes overlap the crop with an intersection-over-union below the threshold are discarded. Note that when `allow_no_crop` is `False` and the thresholds do not include 0, it is best to increase `num_attempts`, otherwise the operator may loop for a long time.

[nvidia.dali.ops.BbFlip][] is a CPU/GPU operator that flips (mirrors) bounding boxes horizontally. The input is bounding-box coordinates in `[x, y, w, h]` or `[left, top, right, bottom]` format, with all coordinates in the image coordinate system (i.e. 0.0–1.0).

[nvidia.dali.ops.Flip][] is a CPU/GPU operator that flips images on the horizontal and/or vertical axes.

[nvidia.dali.ops.CoinFlip][] is a support operator that produces a tensor filled with 0s and 1s — the results of random coin flips — usable as an argument for select operations.

[nvidia.dali.ops.Uniform][] is a support operator that produces a tensor of uniformly distributed random numbers.

[nvidia.dali.ops.Resize][] is a CPU/GPU operator that resizes images.

[nvidia.dali.ops.Paste][] is a GPU operator that pastes the input image onto a larger canvas, where the canvas size equals input size \* ratio.

[nvidia.dali.ops.CropMirrorNormalize][] is a CPU/GPU operator that performs fused cropping, normalization, format conversion (NHWC to NCHW), and type casting if needed. It normalizes the input image and produces the output using the formula

    output = (input - mean) / std

Note that not providing any crop argument results in mirroring and normalization only. This operator allows sequence inputs.

    super().__init__(batch_size=batch_size, num_threads=num_threads,
        device_id=device_id, prefetch_queue_depth=num_threads, seed=42)
    self.path = path
    self.training = training
    self.coco = coco
    self.stride = stride
    self.iter = 0

    self.reader = ops.COCOReader(annotations_file=annotations, file_root=path, num_shards=world,
        shard_id=torch.cuda.current_device(), ltrb=True, ratio=True, shuffle_after_epoch=True,
        save_img_ids=True)

    self.decode_train = ops.nvJPEGDecoderSlice(device="mixed", output_type=types.RGB)
    self.decode_infer = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
    self.bbox_crop = ops.RandomBBoxCrop(device='cpu', ltrb=True, scaling=[0.3, 1.0],
        thresholds=[0.1, 0.3, 0.5, 0.7, 0.9])
    self.bbox_flip = ops.BbFlip(device='cpu', ltrb=True)
    self.img_flip = ops.Flip(device='gpu')
    self.coin_flip = ops.CoinFlip(probability=0.5)

    if isinstance(resize, list): resize = max(resize)
    self.rand_resize = ops.Uniform(range=[resize, float(max_size)])

    self.resize_train = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC, save_attrs=True)
    self.resize_infer = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC,
        resize_longer=max_size, save_attrs=True)

    padded_size = max_size + ((self.stride - max_size % self.stride) % self.stride)

    self.pad = ops.Paste(device='gpu', fill_value=0, ratio=1.1, min_canvas_size=padded_size, paste_x=0, paste_y=0)
    self.normalize = ops.CropMirrorNormalize(device='gpu', mean=mean, std=std, crop=padded_size,
        crop_pos_x=0, crop_pos_y=0)
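A worked example of the `padded_size` arithmetic above, which pads `max_size` up to the next multiple of the model stride so the feature maps divide evenly (128 is a plausible pyramid stride for an FPN that reaches P7; the real value comes from the backbone):

    max_size, stride = 1333, 128
    padded_size = max_size + ((stride - max_size % stride) % stride)
    print(padded_size)  # 1408: 1333 % 128 == 53, and 1333 + (128 - 53) == 1408
    # The outer "% stride" leaves already-aligned sizes unchanged:
    print(1280 + ((stride - 1280 % stride) % stride))  # 1280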
### [define\_graph][define_graph] ###

[nvidia.dali.pipeline.Pipeline.define\_graph][nvidia.dali.pipeline.Pipeline.define_graph] returns a list of output `EdgeReference`s. The user defines this function to construct the operation graph of the pipeline.

`self.reader()` reads data from the dataset. During training, images are decoded and augmented; during inference, they are only decoded and resized.

    images, bboxes, labels, img_ids = self.reader()

    if self.training:
        crop_begin, crop_size, bboxes, labels = self.bbox_crop(bboxes, labels)
        images = self.decode_train(images, crop_begin, crop_size)
        resize = self.rand_resize()
        images, attrs = self.resize_train(images, resize_longer=resize)

        flip = self.coin_flip()
        bboxes = self.bbox_flip(bboxes, horizontal=flip)
        images = self.img_flip(images, horizontal=flip)
    else:
        images = self.decode_infer(images)
        images, attrs = self.resize_infer(images)

    resized_images = images
    images = self.normalize(self.pad(images))

    return images, bboxes, labels, img_ids, attrs, resized_images

## [snap\_to\_anchors][snap_to_anchors] ##

[torch.Tensor.nelement][] is an alias for [torch.Tensor.numel][]. If `boxes` is empty, zero-filled tensors are returned directly.

    'Snap target boxes (x, y, w, h) to anchors'

    num_anchors = anchors.size()[0] if anchors is not None else 1
    width, height = (int(size[0] / stride), int(size[1] / stride))
    if boxes.nelement() == 0:
        return (torch.zeros([num_anchors, num_classes, height, width], device=device),
            torch.zeros([num_anchors, 4, height, width], device=device),
            torch.zeros([num_anchors, 1, height, width], device=device))

Broadcast the anchors to every spatial position of the output.

    boxes, classes = boxes.split(4, dim=1)

    # Generate anchors
    x, y = torch.meshgrid([torch.arange(0, size[i], stride, device=device, dtype=classes.dtype) for i in range(2)])
    xyxy = torch.stack((x, y, x, y), 2).unsqueeze(0)
    anchors = anchors.view(-1, 1, 1, 4).to(dtype=classes.dtype)
    anchors = (xyxy + anchors).contiguous().view(-1, 4)

`boxes` are converted from `[x, y, width, height]` to `[left, top, right, bottom]` to make the intersection-over-union easier to compute.

    # Compute overlap between boxes and anchors
    boxes = torch.cat([boxes[:, :2], boxes[:, :2] + boxes[:, 2:] - 1], 1)
    xy1 = torch.max(anchors[:, None, :2], boxes[:, :2])
    xy2 = torch.min(anchors[:, None, 2:], boxes[:, 2:])
    inter = torch.prod((xy2 - xy1 + 1).clamp(0), 2)
    boxes_area = torch.prod(boxes[:, 2:] - boxes[:, :2] + 1, 1)
    anchors_area = torch.prod(anchors[:, 2:] - anchors[:, :2] + 1, 1)
    overlap = inter / (anchors_area[:, None] + boxes_area - inter)

For each anchor, keep the best-matching target box. [box2delta][] converts the bounding boxes into deltas relative to the anchors; a sketch of the usual encoding follows below. [torch.ones\_like][torch.ones_like] returns a tensor filled with the scalar value 1, with the same size as `input`; `torch.ones_like(input)` is equivalent to `torch.ones(input.size(), dtype=input.dtype, layout=input.layout, device=input.device)`. `depth` is the sample-selection mask.
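For reference, a hedged sketch of the usual R-CNN-style box-to-delta encoding; the repository's [box2delta][] in box.py may differ in detail:

    import torch

    def box2delta_sketch(boxes, anchors):
        # boxes, anchors: [N, 4] in (x1, y1, x2, y2) form
        anchors_wh = anchors[:, 2:] - anchors[:, :2] + 1
        anchors_ctr = anchors[:, :2] + 0.5 * anchors_wh
        boxes_wh = boxes[:, 2:] - boxes[:, :2] + 1
        boxes_ctr = boxes[:, :2] + 0.5 * boxes_wh
        return torch.cat([
            (boxes_ctr - anchors_ctr) / anchors_wh,  # normalized center offsets
            torch.log(boxes_wh / anchors_wh)         # log width/height ratios
        ], 1)

    # A box identical to its anchor encodes to all zeros:
    b = torch.tensor([[10., 10., 49., 49.]])
    print(box2delta_sketch(b, b))  # tensor([[0., 0., 0., 0.]])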
    # Keep best box per anchor
    overlap, indices = overlap.max(1)
    box_target = box2delta(boxes[indices], anchors)
    box_target = box_target.view(num_anchors, 1, width, height, 4)
    box_target = box_target.transpose(1, 4).transpose(2, 3)
    box_target = box_target.squeeze().contiguous()

    depth = torch.ones_like(overlap) * -1
    depth[overlap < 0.4] = 0 # background
    depth[overlap >= 0.5] = classes[indices][overlap >= 0.5].squeeze() + 1 # objects
    depth = depth.view(num_anchors, width, height).transpose(1, 2).contiguous()

Generate the target classes; each class entry is 0 or 1. Anchors with overlap between 0.4 and 0.5 keep a depth of -1 and are ignored by the loss.

[torch.Tensor.scatter\_][torch.Tensor.scatter] writes all values from the tensor `src` into `self` at the indices specified in the `index` tensor. For each value in `src`, its output index is specified by its index in `src` for `dimension != dim` and by the corresponding value in `index` for `dimension = dim`. Here it one-hot encodes the matched class of each anchor.

    # Generate target classes
    cls_target = torch.zeros((anchors.size()[0], num_classes + 1), device=device, dtype=boxes.dtype)
    if classes.nelement() == 0:
        classes = torch.LongTensor([num_classes], device=device).expand_as(indices)
    else:
        classes = classes[indices].long()
    classes = classes.view(-1, 1)
    classes[overlap < 0.4] = num_classes # background has no class
    cls_target.scatter_(1, classes, 1)
    cls_target = cls_target[:, :num_classes].view(-1, 1, width, height, num_classes)
    cls_target = cls_target.transpose(1, 4).transpose(2, 3)
    cls_target = cls_target.squeeze().contiguous()

    return (cls_target.view(num_anchors, num_classes, height, width),
        box_target.view(num_anchors, 4, height, width),
        depth.view(num_anchors, 1, height, width))

## References ##

* [Multiprocessing best practices][]
* [Why is PyTorch so efficient and easy to use? A look inside the deep learning framework][PyTorch]
* [(Distributed Training) Single-machine multi-GPU done right (3): PyTorch][PyTorch 1]
* [PyTorch tutorial: Transfer Learning][Pytorch tutorial _Transfer Learning]
* [Python's Requests Library (Guide)][Python_s Requests Library _Guide]
* [Training AlexNet in 90 seconds! SenseTime sets a new record][90_AlexNet]
* [S9243 Fast and Accurate Object Detection][]
* [NumPy array slice using None][]
* [PyTorch multi-node multi-GPU distributed training][Pytorch]

[retinanet-examples]: https://github.com/NVIDIA/retinanet-examples
[apex.parallel.DistributedDataParallel]: https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel
[apex.amp]: https://nvidia.github.io/apex/amp.html?highlight=amp#apex-amp
[NVIDIA DALI]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html
[TensorRT]: https://developer.nvidia.com/tensorrt
[PyTorch NGC docker container]: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
[main]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L172
[parse]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L15
[load_model]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L76
[Module.share_memory]: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1108
[worker]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L104
[torch.multiprocessing.spawn]: https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn
[ArgumentParser.add_subparsers]: https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser.add_subparsers
[ArgumentParser]: https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser
[action]: https://docs.python.org/3/library/argparse.html#action
[dest]: https://docs.python.org/3/library/argparse.html#dest
[required]: https://docs.python.org/3/library/argparse.html#required
[help]: https://docs.python.org/3/library/argparse.html#help
[metavar]: https://docs.python.org/3/library/argparse.html#metavar
[Model]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L12
[Model.initialize]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L59
[Model.load]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L187
[os.environ]:
https://docs.python.org/3/library/os.html#os.environ [mapping]: https://docs.python.org/3/glossary.html#term-mapping [os]: https://docs.python.org/3/library/os.html#module-os [torch.cuda.set_device]: https://pytorch.org/docs/stable/cuda.html#torch.cuda.set_device [torch.distributed.init_process_group]: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group [train]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/train.py#L16 [Engine_load]: https://github.com/NVIDIA/retinanet-examples/blob/master/csrc/engine.cpp#L55 [Engine]: https://github.com/NVIDIA/retinanet-examples/blob/master/csrc/engine.h#L38 [PYBIND11_MODULE]: https://pybind11.readthedocs.io/en/stable/reference.html#c.PYBIND11_MODULE [extensions.cpp]: https://github.com/NVIDIA/retinanet-examples/blob/master/csrc/extensions.cpp#L122 [infer]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/infer.py#L16 [Model.export]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L206 [convert_fixedbn_model]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/backbones/layers.py#L18 [torch.nn.BatchNorm2d]: https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm2d [FixedBatchNorm2d]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/backbones/layers.py#L5 [apex.amp.initialize]: https://nvidia.github.io/apex/amp.html#apex.amp.initialize [torch.nn.parallel.DistributedDataParallel]: https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel [Distributed training]: https://github.com/NVIDIA/apex/tree/master/examples/imagenet#distributed-training [Advanced Amp Usage topic]: https://nvidia.github.io/apex/advanced.html#multiple-models-optimizers-losses [backward]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.backward [NVCaffe]: https://github.com/NVIDIA/caffe [DaliDataIterator]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L69 [Profiler]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/utils.py#L44 [apex.amp.scale_loss]: https://nvidia.github.io/apex/amp.html#apex.amp.scale_loss [post_metrics]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/utils.py#L73 [requests]: https://github.com/request/request [ignore_sigint]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/utils.py#L36 [tempfile.mktemp]: https://docs.python.org/3/library/tempfile.html#tempfile.mktemp [tempfile.mkstemp]: https://docs.python.org/3/library/tempfile.html#tempfile.mkstemp [os.open]: https://docs.python.org/3/library/os.html#os.open [os.O_EXCL]: https://docs.python.org/3/library/os.html#os.O_EXCL [DataIterator]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/data.py#L134 [torch.distributed.all_gather]: https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather [pycocotools.coco.COCO.getCatIds]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L157 [COCO.loadRes]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L297 [COCOeval]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py#L10 [getattr]: https://docs.python.org/3/library/functions.html#getattr [AttributeError]: https://docs.python.org/3/library/exceptions.html#AttributeError [torch.nn.ModuleDict]: https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleDict [Module]: https://pytorch.org/docs/stable/nn.html#torch.nn.Module [update]: 
https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleDict.update [FocalLoss]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/loss.py#L5 [torch.nn.functional.binary_cross_entropy_with_logits]: https://pytorch.org/docs/stable/nn.html#torch.nn.functional.binary_cross_entropy_with_logits [SmoothL1Loss]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/loss.py#L20 [repr]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L52 [object._repr]: https://docs.python.org/3/reference/datamodel.html#object.__repr__ [repr 1]: https://docs.python.org/3/library/functions.html#repr [str]: https://docs.python.org/3/reference/datamodel.html#object.__str__ [forward]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L98 [compute_loss]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L147 [generate_anchors]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L5 [decode]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L105 [nms]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L157 [extract_targets]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L135 [snap_to_anchors]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L48 [save]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L173 [torch.cuda.empty_cache]: https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache [io.BytesIO]: https://docs.python.org/3/library/io.html#io.BytesIO [BufferedIOBase]: https://docs.python.org/3/library/io.html#io.BufferedIOBase [close]: https://docs.python.org/3/library/io.html#io.IOBase.close [bytes-like object]: https://docs.python.org/3/glossary.html#term-bytes-like-object [torch.onnx.export]: https://pytorch.org/docs/stable/onnx.html#torch.onnx.export [io.BytesIO.getvalue]: https://docs.python.org/3/library/io.html#io.BytesIO.getvalue [bytes]: https://docs.python.org/3/library/stdtypes.html#bytes [torch.utils.data.DataLoader]: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader [contextlib.redirect_stdout]: https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout [sys.stdout]: https://docs.python.org/3/library/sys.html#sys.stdout [COCO]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L70 [COCOPipeline]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L11 [nvidia.dali.pipeline.Pipeline]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline [nvidia.dali.pipeline.Pipeline.build]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.build [repr 2]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L96 [len]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L102 [iter]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L105 [nvidia.dali.pipeline.Pipeline.run]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.run [nvidia.dali.pipeline.Pipeline.schedule_run]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.schedule_run [nvidia.dali.pipeline.Pipeline.share_outputs]: 
https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.share_outputs [nvidia.dali.pipeline.Pipeline.release_outputs]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.release_outputs [ctypes.c_void_p]: https://docs.python.org/3/library/ctypes.html#ctypes.c_void_p [orch.Tensor.data_ptr]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.data_ptr [nvidia.dali.backend.TensorListCPU]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListCPU [nvidia.dali.backend.TensorListGPU]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListGPU [nvidia.dali.backend.TensorListCPU.copy_to_external]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListCPU.copy_to_external [nvidia.dali.backend.TensorListGPU.as_cpu]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListGPU.as_cpu [COCO Reader with augmentations]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/detection_pipeline.html#COCO-Reader-with-augmentations [nvidia.dali.ops.COCOReader]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.COCOReader [nvidia.dali.ops.nvJPEGDecoderSlice]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.nvJPEGDecoderSlice [nvJPEG]: https://developer.nvidia.com/nvjpeg [ImageDecoderSlice]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.ImageDecoderSlice [nvidia.dali.ops.RandomBBoxCrop]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.RandomBBoxCrop [nvidia.dali.ops.BbFlip]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.BbFlip [nvidia.dali.ops.Flip]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Flip [nvidia.dali.ops.CoinFlip]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.CoinFlip [nvidia.dali.ops.Uniform]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Uniform [nvidia.dali.ops.Resize]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Resize [nvidia.dali.ops.Paste]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Paste [nvidia.dali.ops.CropMirrorNormalize]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.CropMirrorNormalize [define_graph]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L45 [nvidia.dali.pipeline.Pipeline.define_graph]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.define_graph [torch.Tensor.nelement]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.nelement [torch.Tensor.numel]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.numel [box2delta]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L19 [torch.ones_like]: https://pytorch.org/docs/stable/torch.html#torch.ones_like [torch.Tensor.scatter]: 
https://pytorch.org/docs/stable/tensors.html#torch.Tensor.scatter_ [Multiprocessing best practices]: https://pytorch.org/docs/stable/notes/multiprocessing.html#multiprocessing-best-practices [PyTorch]: https://www.jiqizhixin.com/articles/2018-03-13 [PyTorch 1]: https://zhuanlan.zhihu.com/p/74792767 [Pytorch tutorial _Transfer Learning]: https://www.cnblogs.com/king-lps/p/8665344.html [Python_s Requests Library _Guide]: https://realpython.com/python-requests/ [90_AlexNet]: https://www.jiqizhixin.com/articles/2019-02-21-14 [S9243 Fast and Accurate Object Detection]: https://developer.download.nvidia.cn/video/gputechconf/gtc/2019/presentation/s9243-fast-and-accurate-object-detection-with-pytorch-and-tensorrt.pdf [NumPy array slice using None]: https://stackoverflow.com/questions/1408311/numpy-array-slice-using-none [Pytorch]: https://zhuanlan.zhihu.com/p/68717029