RetinaNet Examples: NVIDIA's One-Stop Solution for Training, Inference, and Model Conversion

[retinanet-examples][] is NVIDIA's object-detection reference project, optimized for end-to-end GPU processing:

* distributed training is accelerated with [apex.parallel.DistributedDataParallel][], built on Python multiprocessing;
* [apex.amp][] handles mixed-precision training;
* [NVIDIA DALI][] accelerates data preprocessing;
* inference runs on [TensorRT][].

The project recommends installing the [PyTorch NGC docker container][]:

    nvidia-docker run --rm --ipc=host -it nvcr.io/nvidia/pytorch:19.05-py3

Note, however, that newly released versions only support GPUs with compute capability 6.0 and above.

## [main][] ##

Flow: main → parse(args) → load\_model → worker → End

[parse][] parses the command-line arguments.

[load\_model][load_model] creates the model and loads its parameters.

[Module.share\_memory][Module.share_memory] is not listed in the documentation.

[worker][] is the function that does the actual work.

[torch.multiprocessing.spawn][] spawns `nprocs` processes that run `fn` with `args`. If one of the processes exits with a non-zero exit status, the remaining processes are killed and an exception is raised with the cause of termination. In the case an exception is caught in a child process, it is forwarded and its traceback is included in the exception raised in the parent process.

One process is created per device to run [worker][].

    'Entry point for the retinanet command'
    args = parse(args or sys.argv[1:])

    model, state = load_model(args, verbose=True)
    if model: model.share_memory()

    world = torch.cuda.device_count()
    if args.command == 'export' or world <= 1:
        worker(0, args, 1, model, state)
    else:
        torch.multiprocessing.spawn(worker, args=(args, world, model, state), nprocs=world)

## [parse][] ##

[ArgumentParser.add\_subparsers][ArgumentParser.add_subparsers]: many programs split their functionality into a number of sub-commands; for example, the `svn` program can invoke sub-commands like `svn checkout`, `svn update`, and `svn commit`. Splitting up functionality this way can be a particularly good idea when a program performs several different functions that require different kinds of command-line arguments. [ArgumentParser][] supports the creation of such sub-commands with the [add\_subparsers()][ArgumentParser.add_subparsers] method. The method is normally called with no arguments and returns a special action object. That object has a method, add\_parser(), which takes a command name and any [ArgumentParser][] constructor arguments and returns an [ArgumentParser][] object that can be modified as usual.

Parameter description:

* `title`: title for the sub-parser group in help output; by default "subcommands" if `description` is provided, otherwise the `title` argument is used.
* `description`: description for the sub-parser group in help output; `None` by default.
* `prog`: usage information that will be displayed with sub-command help; by default the name of the program and any positional arguments before the subparser argument.
* `parser_class`: class used to create sub-parser instances; by default the class of the current parser (e.g. [ArgumentParser][]).
* [action][]: the basic type of action to be taken when this argument is encountered at the command line.
* [dest][]: name of the attribute under which the sub-command name will be stored; by default no value is stored.
* [required][]: whether or not a sub-command must be provided; `False` by default.
* [help][]: help for the sub-parser group in help output; `None` by default.
* [metavar][]: string presenting the available sub-commands in help; by default `None`, which presents the sub-commands in the form \{cmd1, cmd2, …\}.

    parser = argparse.ArgumentParser(description='RetinaNet Detection Utility.')
    parser.add_argument('--master', metavar='address:port', type=str, help='Adress and port of the master worker', default='127.0.0.1:29500')

    subparsers = parser.add_subparsers(help='sub-command', dest='command')
    subparsers.required = True

    devcount = max(1, torch.cuda.device_count())
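As a minimal, standalone sketch of this sub-command pattern (the `demo`, `train`, and `infer` names below are invented for the illustration):

    import argparse

    parser = argparse.ArgumentParser(prog='demo')
    subparsers = parser.add_subparsers(help='sub-command', dest='command')
    subparsers.required = True

    # Each sub-command gets its own parser and its own argument set.
    parser_train = subparsers.add_parser('train', help='train a network')
    parser_train.add_argument('--lr', type=float, default=0.01)
    parser_infer = subparsers.add_parser('infer', help='run inference')
    parser_infer.add_argument('--batch', type=int, default=2)

    args = parser.parse_args(['train', '--lr', '0.02'])
    print(args.command, args.lr)  # -> train 0.02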
Training argument setup. The batch size defaults to two images per device.

    parser_train = subparsers.add_parser('train', help='train a network')
    parser_train.add_argument('model', type=str, help='path to output model or checkpoint to resume from')
    parser_train.add_argument('--annotations', metavar='path', type=str, help='path to COCO style annotations', required=True)
    parser_train.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
    parser_train.add_argument('--backbone', action='store', type=str, nargs='+', help='backbone model (or list of)', default=['ResNet50FPN'])
    parser_train.add_argument('--classes', metavar='num', type=int, help='number of classes', default=80)
    parser_train.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
    parser_train.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
    parser_train.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
    parser_train.add_argument('--jitter', metavar='min max', type=int, nargs=2, help='jitter size within range', default=[640, 1024])
    parser_train.add_argument('--iters', metavar='number', type=int, help='number of iterations to train for', default=90000)
    parser_train.add_argument('--milestones', action='store', type=int, nargs='*', help='list of iteration indices where learning rate decays', default=[60000, 80000])
    parser_train.add_argument('--schedule', metavar='scale', type=float, help='scale schedule (affecting iters and milestones)', default=1)
    parser_train.add_argument('--full-precision', help='train in full precision', action='store_true')
    parser_train.add_argument('--lr', metavar='value', help='learning rate', type=float, default=0.01)
    parser_train.add_argument('--warmup', metavar='iterations', help='numer of warmup iterations', type=int, default=1000)
    parser_train.add_argument('--gamma', metavar='value', type=float, help='multiplicative factor of learning rate decay', default=0.1)
    parser_train.add_argument('--override', help='override model', action='store_true')
    parser_train.add_argument('--val-annotations', metavar='path', type=str, help='path to COCO style validation annotations')
    parser_train.add_argument('--val-images', metavar='path', type=str, help='path to validation images')
    parser_train.add_argument('--post-metrics', metavar='url', type=str, help='post metrics to specified url')
    parser_train.add_argument('--fine-tune', metavar='path', type=str, help='fine tune a pretrained model')
    parser_train.add_argument('--logdir', metavar='logdir', type=str, help='directory where to write logs')
    parser_train.add_argument('--val-iters', metavar='number', type=int, help='number of iterations between each validation', default=8000)
    parser_train.add_argument('--with-dali', help='use dali for data loading', action='store_true')

Inference argument setup.

    parser_infer = subparsers.add_parser('infer', help='run inference')
    parser_infer.add_argument('model', type=str, help='path to model')
    parser_infer.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
    parser_infer.add_argument('--annotations', metavar='annotations', type=str, help='evaluate using provided annotations')
    parser_infer.add_argument('--output', metavar='file', type=str, help='save detections to specified JSON file', default='detections.json')
    parser_infer.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
    parser_infer.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
    parser_infer.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
    parser_infer.add_argument('--with-dali', help='use dali for data loading', action='store_true')
    parser_infer.add_argument('--full-precision', help='inference in full precision', action='store_true')

Model export arguments.

    parser_export = subparsers.add_parser('export', help='export a model into a TensorRT engine')
    parser_export.add_argument('model', type=str, help='path to model')
    parser_export.add_argument('export', type=str, help='path to exported output')
    parser_export.add_argument('--size', metavar='height width', type=int, nargs='+', help='input size (square) or sizes (h w) to use when generating TensorRT engine', default=[1280])
    parser_export.add_argument('--batch', metavar='size', type=int, help='max batch size to use for TensorRT engine', default=2)
    parser_export.add_argument('--full-precision', help='export in full instead of half precision', action='store_true')
    parser_export.add_argument('--int8', help='calibrate model and export in int8 precision', action='store_true')
    parser_export.add_argument('--opset', metavar='version', type=int, help='ONNX opset version')
    parser_export.add_argument('--calibration-batches', metavar='size', type=int, help='number of batches to use for int8 calibration', default=10)
    parser_export.add_argument('--calibration-images', metavar='path', type=str, help='path to calibration images to use for int8 calibration', default="")
    parser_export.add_argument('--calibration-table', metavar='path', type=str, help='path of existing calibration table to load from, or name of new calibration table', default="")
    parser_export.add_argument('--verbose', help='enable verbose logging', action='store_true')

    return parser.parse_args(args)
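For illustration, a hypothetical call to `parse()` (the checkpoint and annotation paths below are made up), mirroring how [main][] passes `sys.argv[1:]`:

    args = parse(['train', 'retinanet_rn50fpn.pth',
                  '--annotations', 'instances_train2017.json',
                  '--images', 'train2017/'])
    print(args.command)   # 'train'
    print(args.backbone)  # ['ResNet50FPN'] (the default)
    print(args.batch)     # 2 * max(1, torch.cuda.device_count())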
## [load\_model][load_model] ##

Check whether a model file was specified.

    if args.command != 'train' and not os.path.isfile(args.model):
        raise RuntimeError('Model file {} does not exist!'.format(args.model))

Parse the model's file extension.

    model = None
    state = {}
    _, ext = os.path.splitext(args.model)

In training mode, if no model file exists (or `--override` is set), create a [Model][] instance; [Model.initialize][] loads pre-trained parameters or performs initialization. [Model.load][] first creates the model from its backbone(s), then loads the parameters and fetches the training-state variables.

    if args.command == 'train' and (not os.path.exists(args.model) or args.override):
        if verbose: print('Initializing model...')
        model = Model(args.backbone, args.classes)
        model.initialize(args.fine_tune)
        if verbose: print(model)

    elif ext == '.pth' or ext == '.torch':
        if verbose: print('Loading model from {}...'.format(os.path.basename(args.model)))
        model, state = Model.load(args.model)
        if verbose: print(model)

    elif args.command == 'infer' and ext in ['.engine', '.plan']:
        model = None

    else:
        raise RuntimeError('Invalid model format "{}"!'.format(ext))

    state['path'] = args.model
    return model, state

## [worker][] ##

Flow: worker → train.train / infer.infer / model.export

The [worker][] function can perform three kinds of task: training, inference, and model export.

[os.environ][] is a [mapping][] object representing the string environment, available as a process parameter. For example, `environ['HOME']` is the pathname of your home directory (on some platforms), equivalent to `getenv("HOME")` in C. The mapping is captured the first time the [os][] module is imported, typically during Python startup as part of processing `site.py`. Changes to the environment made after this time are not reflected in [os.environ][], except for changes made by modifying [os.environ][] directly.

[torch.cuda.set\_device][torch.cuda.set_device] sets the current device.

[torch.distributed.init\_process\_group][torch.distributed.init_process_group] initializes the default distributed process group, which also initializes the distributed package. There are two main ways to initialize a process group:

* specify `store`, `rank`, and `world_size` explicitly;
* specify `init_method` (a URL string) indicating where/how to discover peers, optionally specifying `rank` and `world_size`, or encoding all required parameters in the URL and omitting them.

If neither is specified, `init_method` is assumed to be "env://".

    'Per-device distributed worker'
    if torch.cuda.is_available():
        os.environ.update({
            'MASTER_PORT': args.master.split(':')[-1],
            'MASTER_ADDR': ':'.join(args.master.split(':')[:-1]),
            'WORLD_SIZE':  str(world),
            'RANK':        str(rank),
            'CUDA_DEVICE': str(rank)
        })

        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')

        if args.batch % world != 0:
            raise RuntimeError('Batch size should be a multiple of the number of GPUs')
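A minimal, self-contained sketch of this env:// initialization (and of [torch.multiprocessing.spawn][] driving it), assuming a single machine with at least one CUDA device:

    import os
    import torch

    def init_worker(rank, world):
        # spawn() passes the process index as the first argument
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        os.environ['WORLD_SIZE'] = str(world)
        os.environ['RANK'] = str(rank)
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')
        print('rank {}/{} ready'.format(rank, world))

    if __name__ == '__main__':
        world = torch.cuda.device_count()
        torch.multiprocessing.spawn(init_worker, args=(world,), nprocs=world)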
The [train][] function takes a particularly long list of parameters.

    if args.command == 'train':
        train.train(model, state, args.images, args.annotations,
            args.val_images or args.images, args.val_annotations, args.resize, args.max_size, args.jitter,
            args.batch, int(args.iters * args.schedule), args.val_iters, not args.full_precision, args.lr,
            args.warmup, [int(m * args.schedule) for m in args.milestones], args.gamma,
            is_master=(rank == 0), world=world, use_dali=args.with_dali,
            metrics_url=args.post_metrics, logdir=args.logdir, verbose=(rank == 0))

For inference, [Engine::\_load][Engine_load] loads the model. The [Engine][] class wraps a TensorRT CUDA engine; the [PYBIND11\_MODULE][PYBIND11_MODULE] macro in [extensions.cpp][] exports the C++ symbols. [infer][] then runs the inference.

    elif args.command == 'infer':
        if model is None:
            if rank == 0: print('Loading CUDA engine from {}...'.format(os.path.basename(args.model)))
            model = Engine.load(args.model)

        infer.infer(model, args.images, args.output, args.resize, args.max_size, args.batch,
            annotations=args.annotations, mixed_precision=not args.full_precision,
            is_master=(rank == 0), world=world, use_dali=args.with_dali, verbose=(rank == 0))

For export, build the list of calibration images from the given path and shuffle it. [Model.export][] performs the calibration.

    elif args.command == 'export':
        onnx_only = args.export.split('.')[-1] == 'onnx'
        input_size = args.size * 2 if len(args.size) == 1 else args.size

        calibration_files = []
        if args.int8:
            # Get list of images to use for calibration
            if os.path.isdir(args.calibration_images):
                import glob
                file_extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']
                for ex in file_extensions:
                    calibration_files += glob.glob("{}/*{}".format(args.calibration_images, ex), recursive=True)
                # Only need enough images for specified num of calibration batches
                if len(calibration_files) >= args.calibration_batches * args.batch:
                    calibration_files = calibration_files[:(args.calibration_batches * args.batch)]
                else:
                    print('Only found enough images for {} batches. Continuing anyway...'.format(len(calibration_files) // args.batch))

            random.shuffle(calibration_files)

        precision = "FP32"
        if args.int8:
            precision = "INT8"
        elif not args.full_precision:
            precision = "FP16"

        exported = model.export(input_size, args.batch, precision, calibration_files, args.calibration_table, args.verbose, onnx_only=onnx_only, opset=args.opset)
        if onnx_only:
            with open(args.export, 'wb') as out:
                out.write(exported)
        else:
            exported.save(args.export)

## [train][] ##

Keep a reference to the model in `nn_model`. [convert\_fixedbn\_model][convert_fixedbn_model] replaces every [torch.nn.BatchNorm2d][] in the model with a [FixedBatchNorm2d][].

    'Train the model on the given dataset'
    # Prepare model
    nn_model = model
    stride = model.stride

    model = convert_fixedbn_model(model)
    if torch.cuda.is_available():
        model = model.cuda()

[apex.amp.initialize][] initializes the model, optimizer, and the Torch tensor and functional namespaces according to the chosen `opt_level` and overridden properties, if any. It should be called after the model and optimizer have been constructed, but before the model is sent to the [torch.nn.parallel.DistributedDataParallel][] wrapper; see [Distributed training][] in the ImageNet example. Currently, [apex.amp.initialize][] should be called only once, although it can handle an arbitrary number of models and optimizers (see the corresponding [Advanced Amp Usage topic][]). If you think your use case requires calling [apex.amp.initialize][] multiple times, contact NVIDIA. Any property keyword argument that is not `None` is interpreted as a manual override. To avoid having to rewrite anything else in the script, name the returned model/optimizer so that they replace the passed-in ones.

[apex.parallel.DistributedDataParallel][] is a module wrapper that enables easy multiprocess distributed data-parallel training, similar to [torch.nn.parallel.DistributedDataParallel][]. Parameters are broadcast across participating processes on initialization, and gradients are allreduced and averaged over processes during [backward()][backward]. [DistributedDataParallel][apex.parallel.DistributedDataParallel] is optimized for use with NCCL: it overlaps communication with computation during [backward()][backward] and buckets smaller gradient transfers to reduce the total number of transfers required, which improves performance. This is similar to the optimizations done by [NVCaffe][].

    # Setup optimizer and schedule
    optimizer = SGD(model.parameters(), lr=lr, weight_decay=0.0001, momentum=0.9)

    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level = 'O2' if mixed_precision else 'O0',
                                      keep_batchnorm_fp32 = True,
                                      loss_scale = 128.0,
                                      verbosity = is_master)

    if world > 1:
        model = DistributedDataParallel(model)
    model.train()

    if 'optimizer' in state:
        optimizer.load_state_dict(state['optimizer'])

    def schedule(train_iter):
        if warmup and train_iter <= warmup:
            return 0.9 * train_iter / warmup + 0.1
        return gamma ** len([m for m in milestones if m <= train_iter])
    scheduler = LambdaLR(optimizer, schedule)
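A standalone check of what `schedule()` produces with the default hyper-parameters (`--warmup 1000 --milestones 60000 80000 --gamma 0.1`); the returned value is the factor that the LambdaLR scheduler applies to the base learning rate:

    warmup, milestones, gamma = 1000, [60000, 80000], 0.1

    def schedule(train_iter):
        if warmup and train_iter <= warmup:
            return 0.9 * train_iter / warmup + 0.1
        return gamma ** len([m for m in milestones if m <= train_iter])

    for it in (0, 500, 1000, 30000, 60000, 85000):
        print(it, round(schedule(it), 4))
    # -> 0.1, 0.55, 1.0, 1.0 (gamma**0), 0.1, 0.01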
[DaliDataIterator][] loads the data with [NVIDIA DALI][].

    # Prepare dataset
    if verbose: print('Preparing dataset...')
    data_iterator = (DaliDataIterator if use_dali else DataIterator)(
        path, jitter, max_size, batch_size, stride,
        world, annotations, training=True)
    if verbose: print(data_iterator)

    if verbose:
        print('    device: {} {}'.format(
            world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
        print('     batch: {}, precision: {}'.format(batch_size, 'mixed' if mixed_precision else 'full'))
        print('Training model for {} iterations...'.format(iterations))

    # Create TensorBoard writer
    if logdir is not None:
        from tensorboardX import SummaryWriter
        if is_master and verbose:
            print('Writing TensorBoard logs to: {}'.format(logdir))
        writer = SummaryWriter(logdir=logdir)

[Profiler][] records timings for analysis.

`data` is deleted manually once the forward pass is done.

[apex.amp.scale\_loss][apex.amp.scale_loss]: on context manager entrance, it creates `scaled_loss = (loss.float()) * current loss scale` and yields `scaled_loss`, so that the user can call `scaled_loss.backward()`:

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

On context manager exit (if `delay_unscale=False`), the gradients are checked for infs/NaNs and unscaled, so that `optimizer.step()` can be called.

    profiler = Profiler(['train', 'fw', 'bw'])
    iteration = state.get('iteration', 0)
    while iteration < iterations:
        cls_losses, box_losses = [], []
        for i, (data, target) in enumerate(data_iterator):
            scheduler.step(iteration)

            # Forward pass
            profiler.start('fw')
            optimizer.zero_grad()
            cls_loss, box_loss = model([data, target])
            del data
            profiler.stop('fw')

            # Backward pass
            profiler.start('bw')
            with amp.scale_loss(cls_loss + box_loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()

            # Reduce all losses
            cls_loss, box_loss = cls_loss.mean().clone(), box_loss.mean().clone()
            if world > 1:
                torch.distributed.all_reduce(cls_loss)
                torch.distributed.all_reduce(box_loss)
                cls_loss /= world
                box_loss /= world
            if is_master:
                cls_losses.append(cls_loss)
                box_losses.append(box_loss)

            if is_master and not isfinite(cls_loss + box_loss):
                raise RuntimeError('Loss is diverging!\n{}'.format(
                    'Try lowering the learning rate.'))

            del cls_loss, box_loss
            profiler.stop('bw')

            iteration += 1

The master node prints and logs progress. [post\_metrics][post_metrics] posts the metrics with [requests][]. [ignore\_sigint][ignore_sigint] suppresses the interrupt signal while the checkpoint is written. [infer][] runs validation inference on the current model.

            profiler.bump('train')
            if is_master and (profiler.totals['train'] > 60 or iteration == iterations):
                focal_loss = torch.stack(list(cls_losses)).mean().item()
                box_loss = torch.stack(list(box_losses)).mean().item()
                learning_rate = optimizer.param_groups[0]['lr']
                if verbose:
                    msg = '[{:{len}}/{}]'.format(iteration, iterations, len=len(str(iterations)))
                    msg += ' focal loss: {:.3f}'.format(focal_loss)
                    msg += ', box loss: {:.3f}'.format(box_loss)
                    msg += ', {:.3f}s/{}-batch'.format(profiler.means['train'], batch_size)
                    msg += ' (fw: {:.3f}s, bw: {:.3f}s)'.format(profiler.means['fw'], profiler.means['bw'])
                    msg += ', {:.1f} im/s'.format(batch_size / profiler.means['train'])
                    msg += ', lr: {:.2g}'.format(learning_rate)
                    print(msg, flush=True)

                if logdir is not None:
                    writer.add_scalar('focal_loss', focal_loss, iteration)
                    writer.add_scalar('box_loss', box_loss, iteration)
                    writer.add_scalar('learning_rate', learning_rate, iteration)
                    del box_loss, focal_loss

                if metrics_url:
                    post_metrics(metrics_url, {
                        'focal loss': mean(cls_losses),
                        'box loss': mean(box_losses),
                        'im_s': batch_size / profiler.means['train'],
                        'lr': learning_rate
                    })

                # Save model weights
                state.update({
                    'iteration': iteration,
                    'optimizer': optimizer.state_dict(),
                    'scheduler': scheduler.state_dict(),
                })
                with ignore_sigint():
                    nn_model.save(state)

                profiler.reset()
                del cls_losses[:], box_losses[:]

            if val_annotations and (iteration == iterations or iteration % val_iterations == 0):
                infer(model, val_path, None, resize, max_size, batch_size, annotations=val_annotations,
                    mixed_precision=mixed_precision, is_master=is_master, world=world, use_dali=use_dali,
                    is_validation=True, verbose=False)
                model.train()

            if iteration == iterations:
                break

    if logdir is not None:
        writer.close()

## [infer][] ##

The execution backend is determined by the model's type.

    'Run inference on images from path'
    backend = 'pytorch' if isinstance(model, Model) or isinstance(model, DDP) else 'tensorrt'

    stride = model.module.stride if isinstance(model, DDP) else model.stride

[tempfile.mktemp][] is deprecated. [tempfile.mkstemp][] creates a temporary file in the most secure manner possible: there are no race conditions in the file's creation, assuming that the platform properly implements the [os.O\_EXCL][os.O_EXCL] flag for [os.open()][os.open]. The file is readable and writable only by the creating user ID, and if the platform uses permission bits to indicate whether a file is executable, the file is executable by no one. The file descriptor is not inherited by child processes.

    # Create annotations if none was provided
    if not annotations:
        annotations = tempfile.mktemp('.json')
        images = [{'id': i, 'file_name': f} for i, f in enumerate(os.listdir(path))]
        json.dump({'images': images}, open(annotations, 'w'))
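A sketch of the safer [tempfile.mkstemp][] equivalent of the `tempfile.mktemp` call above:

    import json
    import os
    import tempfile

    fd, annotations = tempfile.mkstemp(suffix='.json')   # returns (fd, path)
    with os.fdopen(fd, 'w') as f:
        json.dump({'images': []}, f)
    # ... use `annotations` ...
    os.remove(annotations)  # unlike TemporaryFile, mkstemp does not clean up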
    # TensorRT only supports fixed input sizes, so override input size accordingly
    if backend == 'tensorrt': max_size = max(model.input_size)

The data is loaded with either [DaliDataIterator][] or [DataIterator][].

    # Prepare dataset
    if verbose: print('Preparing dataset...')
    data_iterator = (DaliDataIterator if use_dali else DataIterator)(
        path, resize, max_size, batch_size, stride,
        world, annotations, training=False)
    if verbose: print(data_iterator)

Distinguish standalone inference from validation during training.

    # Prepare model
    if backend == 'pytorch':
        # If we are doing validation during training,
        # no need to register model with AMP again
        if not is_validation:
            if torch.cuda.is_available(): model = model.cuda()
            model = amp.initialize(model, None,
                                   opt_level = 'O2' if mixed_precision else 'O0',
                                   keep_batchnorm_fp32 = True,
                                   verbosity = 0)

        model.eval()

    if verbose:
        print('   backend: {}'.format(backend))
        print('    device: {} {}'.format(
            world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
        print('     batch: {}, precision: {}'.format(batch_size,
            'unknown' if backend == 'tensorrt' else 'mixed' if mixed_precision else 'full'))
        print('Running inference...')

The network outputs are collected in a list.

    results = []
    profiler = Profiler(['infer', 'fw'])
    with torch.no_grad():
        for i, (data, ids, ratios) in enumerate(data_iterator):
            # Forward pass
            profiler.start('fw')
            scores, boxes, classes = model(data)
            profiler.stop('fw')

            results.append([scores, boxes, classes, ids, ratios])

            profiler.bump('infer')
            if verbose and (profiler.totals['infer'] > 60 or i == len(data_iterator) - 1):
                size = len(data_iterator.ids)
                msg = '[{:{len}}/{}]'.format(min((i + 1) * batch_size, size), size, len=len(str(size)))
                msg += ' {:.3f}s/{}-batch'.format(profiler.means['infer'], batch_size)
                msg += ' (fw: {:.3f}s)'.format(profiler.means['fw'])
                msg += ', {:.1f} im/s'.format(batch_size / profiler.means['infer'])
                print(msg, flush=True)
                profiler.reset()

[torch.distributed.all\_gather][torch.distributed.all_gather] gathers tensors from the whole group into a list.

    # Gather results from all devices
    if verbose: print('Gathering results...')
    results = [torch.cat(r, dim=0) for r in zip(*results)]
    if world > 1:
        for r, result in enumerate(results):
            all_result = [torch.ones_like(result, device=result.device) for _ in range(world)]
            torch.distributed.all_gather(list(all_result), result)
            results[r] = torch.cat(all_result, dim=0)
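A minimal sketch of the same gathering pattern, assuming the process group is already initialized and `rank`/`world` are set as in [worker][]:

    # Each rank contributes one tensor; afterwards every rank holds them all.
    result = torch.full((2,), float(rank), device='cuda')
    all_result = [torch.ones_like(result) for _ in range(world)]
    torch.distributed.all_gather(all_result, result)
    merged = torch.cat(all_result, dim=0)  # pieces from ranks 0..world-1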
The master node copies the results back to the CPU, then collects, organizes, and evaluates them. [pycocotools.coco.COCO.getCatIds][] returns the list of category ids.

    if is_master:
        # Copy buffers back to host
        results = [r.cpu() for r in results]

        # Collect detections
        detections = []
        processed_ids = set()
        for scores, boxes, classes, image_id, ratios in zip(*results):
            image_id = image_id.item()
            if image_id in processed_ids:
                continue
            processed_ids.add(image_id)

            keep = (scores > 0).nonzero()
            scores = scores[keep].view(-1)
            boxes = boxes[keep, :].view(-1, 4) / ratios
            classes = classes[keep].view(-1).int()

            for score, box, cat in zip(scores, boxes, classes):
                x1, y1, x2, y2 = box.data.tolist()
                cat = cat.item()
                if 'annotations' in data_iterator.coco.dataset:
                    cat = data_iterator.coco.getCatIds()[cat]
                detections.append({
                    'image_id': image_id,
                    'score': score.item(),
                    'bbox': [x1, y1, x2 - x1 + 1, y2 - y1 + 1],
                    'category_id': cat
                })

[COCO.loadRes][] loads a result file and returns a result api object. A [COCOeval][] object is instantiated to evaluate the results.

    if detections:
        # Save detections
        if detections_file and verbose: print('Writing {}...'.format(detections_file))
        detections = {'annotations': detections}
        detections['images'] = data_iterator.coco.dataset['images']
        if 'categories' in data_iterator.coco.dataset:
            detections['categories'] = [data_iterator.coco.dataset['categories']]
        if detections_file:
            json.dump(detections, open(detections_file, 'w'), indent=4)

        # Evaluate model on dataset
        if 'annotations' in data_iterator.coco.dataset:
            if verbose: print('Evaluating model...')
            with redirect_stdout(None):
                coco_pred = data_iterator.coco.loadRes(detections['annotations'])
                coco_eval = COCOeval(data_iterator.coco, coco_pred, 'bbox')
                coco_eval.evaluate()
                coco_eval.accumulate()
                coco_eval.summarize()
    else:
        print('No detections!')

## [Model][] ##

[getattr][] returns the value of the named attribute of `object`. `name` must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute; for example, `getattr(x, 'foobar')` is equivalent to `x.foobar`. If the named attribute does not exist, `default` is returned if provided, otherwise [AttributeError][] is raised.

[torch.nn.ModuleDict][] holds submodules in a dictionary. [ModuleDict][torch.nn.ModuleDict] can be indexed like a regular Python dictionary, but the modules it contains are properly registered and will be visible to all [Module][] methods.

[ModuleDict][torch.nn.ModuleDict] is an ordered dictionary that respects

* the order of insertion, and
* in [update()][update], the order of the merged `OrderedDict` or of another [ModuleDict][torch.nn.ModuleDict] (the argument to [update()][update]).

Note that [update()][update] with other unordered mapping types (e.g., Python's plain dict) does not preserve the order of the merged mapping.

    'RetinaNet - https://arxiv.org/abs/1708.02002'
    def __init__(self, backbones='ResNet50FPN', classes=80, config={}):
        super().__init__()

        if not isinstance(backbones, list): backbones = [backbones]

        self.backbones = nn.ModuleDict({b: getattr(backbones_mod, b)() for b in backbones})
        self.name = 'RetinaNet'
        self.exporting = False

        self.ratios = [1.0, 2.0, 0.5]
        self.scales = [4 * 2**(i/3) for i in range(3)]
        self.anchors = {}
        self.classes = classes

        self.threshold = config.get('threshold', 0.05)
        self.top_n = config.get('top_n', 1000)
        self.nms = config.get('nms', 0.5)
        self.detections = config.get('detections', 100)

        self.stride = max([b.stride for _, b in self.backbones.items()])
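A standalone sketch of this name-to-class lookup (`TinyA`, `TinyB`, and the stand-in module object are invented for the demo):

    import types
    import torch.nn as nn

    class TinyA(nn.Module):
        def __init__(self):
            super().__init__()
            self.stride = 32

    class TinyB(nn.Module):
        def __init__(self):
            super().__init__()
            self.stride = 64

    # Stand-in for the repo's backbones module; getattr resolves names to classes.
    backbones_mod = types.SimpleNamespace(TinyA=TinyA, TinyB=TinyB)
    backbones = nn.ModuleDict({b: getattr(backbones_mod, b)() for b in ['TinyA', 'TinyB']})
    print(max(b.stride for _, b in backbones.items()))  # -> 64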
The classification and regression heads are each four convolutional layers plus an output layer. [FocalLoss][] is built on [torch.nn.functional.binary\_cross\_entropy\_with\_logits][torch.nn.functional.binary_cross_entropy_with_logits]; [SmoothL1Loss][] is implemented from scratch.

        # classification and box regression heads
        def make_head(out_size):
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(256, out_size, 3, padding=1)]
            return nn.Sequential(*layers)

        anchors = len(self.ratios) * len(self.scales)
        self.cls_head = make_head(classes * anchors)
        self.box_head = make_head(4 * anchors)

        self.cls_criterion = FocalLoss()
        self.box_criterion = SmoothL1Loss(beta=0.11)

### [\_\_repr\_\_][repr] ###

[object.\_\_repr\_\_][object._repr] is called by the [repr()][repr 1] built-in function to compute the "official" string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines [\_\_repr\_\_()][object._repr] but not [\_\_str\_\_()][str], then [\_\_repr\_\_()][object._repr] is also used when an "informal" string representation of instances of that class is required. This is typically used for debugging, so it is important that the representation be information-rich and unambiguous.

    return '\n'.join([
        '     model: {}'.format(self.name),
        '  backbone: {}'.format(', '.join([k for k, _ in self.backbones.items()])),
        '   classes: {}, anchors: {}'.format(self.classes, len(self.ratios) * len(self.scales)),
    ])

### [initialize][Model.initialize] ###

If a pre-trained checkpoint is given, load it while skipping `cls_head.8` (the class-output layer).

    if pre_trained:
        # Initialize using weights from pre-trained model
        if not os.path.isfile(pre_trained):
            raise ValueError('No checkpoint {}'.format(pre_trained))

        print('Fine-tuning weights from {}...'.format(os.path.basename(pre_trained)))
        state_dict = self.state_dict()
        chk = torch.load(pre_trained, map_location=lambda storage, loc: storage)
        ignored = ['cls_head.8.bias', 'cls_head.8.weight']
        weights = {k: v for k, v in chk['state_dict'].items() if k not in ignored}
        state_dict.update(weights)
        self.load_state_dict(state_dict)

        del chk, weights
        torch.cuda.empty_cache()

Otherwise call each backbone's own initialization method, then initialize the classification and regression heads.

    else:
        # Initialize backbone(s)
        for _, backbone in self.backbones.items():
            backbone.initialize()

        # Initialize heads
        def initialize_layer(layer):
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, std=0.01)
                if layer.bias is not None:
                    nn.init.constant_(layer.bias, val=0)
        self.cls_head.apply(initialize_layer)
        self.box_head.apply(initialize_layer)

The last layer of the classification head gets a special prior initialization: with `pi = 0.01`, the bias `b = -log((1 - pi) / pi) ≈ -4.6`, so every anchor initially predicts foreground with probability ≈ 0.01, which keeps the focal loss stable early in training.

    # Initialize class head prior
    def initialize_prior(layer):
        pi = 0.01
        b = - math.log((1 - pi) / pi)
        nn.init.constant_(layer.bias, b)
        nn.init.normal_(layer.weight, std=0.01)
    self.cls_head[-1].apply(initialize_prior)

### [forward][] ###

Flow: x → backbone → cls\_head / box\_head → training? yes: \_compute\_loss → End; no: cls\_head.sigmoid → exporting? yes: return heads; no: generate\_anchors → decode → nms

During training, the input `x` also contains the processed annotations. The backbone extracts features, which then pass through the classification and regression heads in parallel. In training mode, [\_compute\_loss][compute_loss] computes and returns the losses.

    if self.training:
        x, targets = x

    # Backbones forward pass
    features = []
    for _, backbone in self.backbones.items():
        features.extend(backbone(x))

    # Heads forward pass
    cls_heads = [self.cls_head(t) for t in features]
    box_heads = [self.box_head(t) for t in features]

    if self.training:
        return self._compute_loss(x, cls_heads, box_heads, targets.float())

When exporting, return the classification and regression outputs directly.

    cls_heads = [cls_head.sigmoid() for cls_head in cls_heads]

    if self.exporting:
        self.strides = [x.shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
        return cls_heads, box_heads

Otherwise run the inference post-processing. [generate\_anchors][generate_anchors] generates anchor coordinates from scales/ratios. [decode][] filters the outputs by score and decodes the bounding boxes. [nms][] filters the results further.

    # Inference post-processing
    decoded = []
    for cls_head, box_head in zip(cls_heads, box_heads):
        # Generate level's anchors
        stride = x.shape[-1] // cls_head.shape[-1]
        if stride not in self.anchors:
            self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)

        # Decode and filter boxes
        decoded.append(decode(cls_head, box_head, stride,
            self.threshold, self.top_n, self.anchors[stride]))

    # Perform non-maximum suppression
    decoded = [torch.cat(tensors, 1) for tensors in zip(*decoded)]
    return nms(*decoded, self.nms, self.detections)

### [\_extract\_targets][extract_targets] ###

Flow: \_extract\_targets → generate\_anchors → snap\_to\_anchors

[generate\_anchors][generate_anchors] generates anchor coordinates from scales/ratios. [snap\_to\_anchors][snap_to_anchors] builds the target tensors with respect to the anchors.

    cls_target, box_target, depth = [], [], []
    for target in targets:
        target = target[target[:, -1] > -1]
        if stride not in self.anchors:
            self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)
        snapped = snap_to_anchors(
            target, [s * stride for s in size[::-1]], stride,
            self.anchors[stride].to(targets.device), self.classes, targets.device)
        for l, s in zip((cls_target, box_target, depth), snapped): l.append(s)
    return torch.stack(cls_target), torch.stack(box_target), torch.stack(depth)
### [\_compute\_loss][compute_loss] ###

Flow: \_compute\_loss → \_extract\_targets → self.cls\_criterion / self.box\_criterion

[\_extract\_targets][extract_targets] provides the classification and regression targets. `depth` is the sample-selection mask.

    cls_losses, box_losses, fg_targets = [], [], []
    for cls_head, box_head in zip(cls_heads, box_heads):
        size = cls_head.shape[-2:]
        stride = x.shape[-1] / cls_head.shape[-1]

        cls_target, box_target, depth = self._extract_targets(targets, stride, size)
        fg_targets.append((depth > 0).sum().float().clamp(min=1))

        cls_head = cls_head.view_as(cls_target).float()
        cls_mask = (depth >= 0).expand_as(cls_target).float()
        cls_loss = self.cls_criterion(cls_head, cls_target)
        cls_loss = cls_mask * cls_loss
        cls_losses.append(cls_loss.sum())

        box_head = box_head.view_as(box_target).float()
        box_mask = (depth > 0).expand_as(box_target).float()
        box_loss = self.box_criterion(box_head, box_target)
        box_loss = box_mask * box_loss
        box_losses.append(box_loss.sum())

    fg_targets = torch.stack(fg_targets).sum()
    cls_loss = torch.stack(cls_losses).sum() / fg_targets
    box_loss = torch.stack(box_losses).sum() / fg_targets
    return cls_loss, box_loss

### [save][] ###

Save the backbone names, the number of classes, and the model weights; if present, also save `iteration`, `optimizer`, and `scheduler`.

    checkpoint = {
        'backbone': [k for k, _ in self.backbones.items()],
        'classes': self.classes,
        'state_dict': self.state_dict()
    }

    for key in ('iteration', 'optimizer', 'scheduler'):
        if key in state:
            checkpoint[key] = state[key]

    torch.save(checkpoint, state['path'])

### [load][Model.load] ###

Create the backbone(s), then load the saved parameters into them.

[torch.cuda.empty\_cache][torch.cuda.empty_cache] releases all unoccupied cached memory currently held by the caching allocator, so that it can be used by other GPU applications and becomes visible in nvidia-smi.

    @classmethod
    def load(cls, filename):
        if not os.path.isfile(filename):
            raise ValueError('No checkpoint {}'.format(filename))

        checkpoint = torch.load(filename, map_location=lambda storage, loc: storage)
        # Recreate model from checkpoint instead of from individual backbones
        model = cls(backbones=checkpoint['backbone'], classes=checkpoint['classes'])
        model.load_state_dict(checkpoint['state_dict'])

        state = {}
        for key in ('iteration', 'optimizer', 'scheduler'):
            if key in checkpoint:
                state[key] = checkpoint[key]

        del checkpoint
        torch.cuda.empty_cache()

        return model, state

### [export][Model.export] ###

If the ONNX opset version is below 9, define an `upsample_nearest2d` symbolic function.

    import torch.onnx.symbolic

    if opset is not None and opset < 9:
        # Override Upsample's ONNX export from old opset if required (not needed for TRT 5.1+)
        @torch.onnx.symbolic.parse_args('v', 'is')
        def upsample_nearest2d(g, input, output_size):
            height_scale = float(output_size[-2]) / input.type().sizes()[-2]
            width_scale = float(output_size[-1]) / input.type().sizes()[-1]
            return g.op("Upsample", input,
                scales_f=(1, 1, height_scale, width_scale),
                mode_s="nearest")
        torch.onnx.symbolic.upsample_nearest2d = upsample_nearest2d

[io.BytesIO][] is a stream implementation using an in-memory bytes buffer; it inherits [BufferedIOBase][]. The buffer is discarded when the [close()][close] method is called. The optional argument `initial_bytes` is a [bytes-like object][] containing the initial data. [torch.onnx.export][] exports the model. [io.BytesIO.getvalue][] returns [bytes][] containing the entire contents of the buffer.

    # Export to ONNX
    print('Exporting to ONNX...')
    self.exporting = True
    onnx_bytes = io.BytesIO()
    zero_input = torch.zeros([1, 3, *size]).cuda()
    extra_args = {'opset_version': opset} if opset else {}
    torch.onnx.export(self.cuda(), zero_input, onnx_bytes, **extra_args)
    self.exporting = False

    if onnx_only:
        return onnx_bytes.getvalue()
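A toy sketch of exporting into an in-memory buffer this way (the module and input size below are made up; `torch.onnx.export` writes to any file-like object with write/flush, such as a BytesIO):

    import io
    import torch

    net = torch.nn.Conv2d(3, 8, 3, padding=1)
    buf = io.BytesIO()
    torch.onnx.export(net, torch.zeros(1, 3, 64, 64), buf, opset_version=9)
    onnx_bytes = buf.getvalue()   # the serialized ONNX model
    print(len(onnx_bytes) > 0)    # True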
[generate\_anchors][generate_anchors] generates anchor coordinates from scales/ratios. An [Engine][] object is returned.

    # Build TensorRT engine
    model_name = '_'.join([k for k, _ in self.backbones.items()])
    anchors = [generate_anchors(stride, self.ratios,
        self.scales).view(-1).tolist() for stride in self.strides]
    return Engine(onnx_bytes.getvalue(), len(onnx_bytes.getvalue()), batch, precision, self.threshold,
        self.top_n, anchors, self.nms, self.detections, calibration_files, model_name, calibration_table, verbose)

## [DaliDataIterator][] ##

Flow: DaliDataIterator → COCOPipeline

[DaliDataIterator][] exposes the same interface as [torch.utils.data.DataLoader][], but implements data-parallel loading with [DALI][NVIDIA DALI]. The `batch_size` argument passed in is the total batch size; it is divided by `world` per device.

    'Data loader for data parallel using Dali'
    def __init__(self, path, resize, max_size, batch_size, stride, world, annotations, training=False):
        self.training = training
        self.resize = resize
        self.max_size = max_size
        self.stride = stride
        self.batch_size = batch_size // world
        self.mean = [255. * x for x in [0.485, 0.456, 0.406]]
        self.std = [255. * x for x in [0.229, 0.224, 0.225]]
        self.world = world
        self.path = path

[contextlib.redirect\_stdout][contextlib.redirect_stdout] is a context manager for temporarily redirecting [sys.stdout][] to another file or file-like object. This tool adds flexibility to existing functions or classes whose output is hardwired to stdout.

A [COCO][] instance is created. [COCOPipeline][] wraps [nvidia.dali.pipeline.Pipeline][]. [nvidia.dali.pipeline.Pipeline.build][] builds the pipeline; a pipeline needs to be built in order to run it standalone, while framework-specific plugins handle this step automatically.

        # Setup COCO
        with redirect_stdout(None):
            self.coco = COCO(annotations)
        self.ids = list(self.coco.imgs.keys())
        if 'categories' in self.coco.dataset:
            self.categories_inv = {k: i for i, k in enumerate(self.coco.getCatIds())}

        self.pipe = COCOPipeline(batch_size=self.batch_size, num_threads=2,
            path=path, coco=self.coco, training=training, annotations=annotations, world=world,
            device_id=torch.cuda.current_device(), mean=self.mean, std=self.std, resize=resize,
            max_size=max_size, stride=self.stride)

        self.pipe.build()

### [\_\_repr\_\_][repr 2] ###

Describes the loader.

    return '\n'.join([
        '    loader: dali',
        '    resize: {}, max: {}'.format(self.resize, self.max_size),
    ])

### [\_\_len\_\_][len] ###

    return ceil(len(self.ids) // self.world / self.batch_size)

### [\_\_iter\_\_][iter] ###

[nvidia.dali.pipeline.Pipeline.run][] runs the pipeline and returns the results. If the pipeline was created with `exec_pipelined` set to `True`, this function will also start prefetching the next iteration to speed up execution. It should not be mixed with [nvidia.dali.pipeline.Pipeline.schedule\_run()][nvidia.dali.pipeline.Pipeline.schedule_run], [nvidia.dali.pipeline.Pipeline.share\_outputs()][nvidia.dali.pipeline.Pipeline.share_outputs], and [nvidia.dali.pipeline.Pipeline.release\_outputs()][nvidia.dali.pipeline.Pipeline.release_outputs] in the same pipeline.

[ctypes.c\_void\_p][ctypes.c_void_p] represents the C `void *` type. The value is represented as an integer, and the constructor accepts an optional integer initializer.

[torch.Tensor.data\_ptr][orch.Tensor.data_ptr] returns the address of the first element of the tensor.

The results returned by `self.pipe.run()` are all of type [nvidia.dali.backend.TensorListCPU][] or [nvidia.dali.backend.TensorListGPU][].

[nvidia.dali.backend.TensorListCPU.copy\_to\_external][nvidia.dali.backend.TensorListCPU.copy_to_external] copies the contents of this `TensorList` to an external pointer (of type [ctypes.c\_void\_p][ctypes.c_void_p]) residing in CPU memory. This function is used internally by plugins to interface with tensors from supported deep learning frameworks.

[nvidia.dali.backend.TensorListGPU.as\_cpu][nvidia.dali.backend.TensorListGPU.as_cpu] returns a [TensorListCPU][nvidia.dali.backend.TensorListCPU] object that is a copy of this [TensorListGPU][nvidia.dali.backend.TensorListGPU].

    data, ratios, ids, num_detections = [], [], [], []
    dali_data, dali_boxes, dali_labels, dali_ids, dali_attrs, dali_resize_img = self.pipe.run()

    for l in range(len(dali_boxes)):
        num_detections.append(dali_boxes.at(l).shape[0])

    pyt_targets = -1 * torch.ones([len(dali_boxes), max(max(num_detections), 1), 5])

    for batch in range(self.batch_size):
        id = int(dali_ids.at(batch)[0])

        # Convert dali tensor to pytorch
        dali_tensor = dali_data.at(batch)
        tensor_shape = dali_tensor.shape()

        datum = torch.zeros(dali_tensor.shape(), dtype=torch.float, device=torch.device('cuda'))
        c_type_pointer = ctypes.c_void_p(datum.data_ptr())
        dali_tensor.copy_to_external(c_type_pointer)
        # Calculate image resize ratio to rescale boxes
        prior_size = dali_attrs.as_cpu().at(batch)
        resized_size = dali_resize_img.at(batch).shape()
        ratio = max(resized_size) / max(prior_size)

        if self.training:
            # Rescale boxes
            b_arr = dali_boxes.at(batch)
            num_dets = b_arr.shape[0]
            if num_dets != 0:
                pyt_bbox = torch.from_numpy(b_arr).float()

                pyt_bbox[:, 0] *= float(prior_size[1])
                pyt_bbox[:, 1] *= float(prior_size[0])
                pyt_bbox[:, 2] *= float(prior_size[1])
                pyt_bbox[:, 3] *= float(prior_size[0])
                # (l,t,r,b) -> (x,y,w,h) == (l,t, r-l, b-t)
                pyt_bbox[:, 2] -= pyt_bbox[:, 0]
                pyt_bbox[:, 3] -= pyt_bbox[:, 1]
                pyt_targets[batch, :num_dets, :4] = pyt_bbox * ratio

            # Arrange labels in target tensor
            l_arr = dali_labels.at(batch)
            if num_dets != 0:
                pyt_label = torch.from_numpy(l_arr).float()
                pyt_label -= 1  # Rescale labels to [0,79] instead of [1,80]
                pyt_targets[batch, :num_dets, 4] = pyt_label.squeeze()

        ids.append(id)
        data.append(datum.unsqueeze(0))
        ratios.append(ratio)

    data = torch.cat(data, dim=0)

    if self.training:
        pyt_targets = pyt_targets.cuda(non_blocking=True)
        yield data, pyt_targets
    else:
        ids = torch.Tensor(ids).int().cuda(non_blocking=True)
        ratios = torch.Tensor(ratios).cuda(non_blocking=True)
        yield data, ids, ratios

## [COCOPipeline][] ##

Flow: COCOPipeline → pipeline.Pipeline

See the documentation example [COCO Reader with augmentations][].

[nvidia.dali.ops.COCOReader][] is a CPU operator that reads data from a COCO dataset, which is composed of a directory of images and an annotation file. For an image with m bboxes, it returns the bboxes as an (m, 4) Tensor (m \* \[x, y, w, h\] or `m * [left, top, right, bottom]`) and the labels as an (m, 1) Tensor (m \* category\_id).

[nvidia.dali.ops.nvJPEGDecoderSlice][] is a "mixed" operator that partially decodes JPEG images with the [nvJPEG][] library, based on a cropping window of given size and anchor. The inputs must be supplied as three tensors in a specific order:

* `encoded_data` contains the encoded image data;
* `begin` contains the starting pixel coordinates of the crop in `(x,y)` format;
* `size` contains the pixel dimensions of the crop in `(w,h)` format.

For `begin` and `size`, coordinates must be in the interval `[0.0, 1.0]`. The decoder outputs images in `HWC` layout.

> Warning
> This operator is now deprecated. Use [ImageDecoderSlice][] instead.

[nvidia.dali.ops.RandomBBoxCrop][] is a CPU operator that performs a prospective crop of an image while keeping the bounding boxes and labels consistent. The inputs must be supplied as two tensors:

* `BBoxes` containing bounding boxes represented as `[l,t,r,b]` or `[x,y,w,h]`;
* `Labels` containing the corresponding label for each bounding box.

The resulting prospective crop is provided as two tensors:

* `Begin` containing the starting coordinates of the crop in `(x,y)` format;
* `Size` containing the dimensions of the crop in `(w,h)` format.

Bounding boxes are provided as an `(m*4)` tensor, where each bounding box is represented as `[l,t,r,b]` or `[x,y,w,h]`. Labels whose boxes overlap the crop with an intersection-over-union below the threshold are discarded. Note that when `allow_no_crop` is `False` and the thresholds do not include 0, it is best to increase `num_attempts`, otherwise the operator may loop for a long time.

[nvidia.dali.ops.BbFlip][] is a CPU/GPU operator that flips (mirrors) bounding boxes horizontally. The input is bounding-box coordinates in `[x, y, w, h]` or `[left, top, right, bottom]` format, with all coordinates in the image coordinate system (i.e. 0.0–1.0).

[nvidia.dali.ops.Flip][] is a CPU/GPU operator that flips images on the horizontal and/or vertical axes.

[nvidia.dali.ops.CoinFlip][] is a support operator that produces a tensor filled with 0s and 1s — the results of random coin flips — usable as an argument for select operations.

[nvidia.dali.ops.Uniform][] is a support operator that produces a tensor of uniformly distributed random numbers.

[nvidia.dali.ops.Resize][] is a CPU/GPU operator that resizes images.

[nvidia.dali.ops.Paste][] is a GPU operator that pastes the input image onto a larger canvas, where the canvas size equals input size \* ratio.

[nvidia.dali.ops.CropMirrorNormalize][] is a CPU/GPU operator that performs fused cropping, normalization, format conversion (NHWC to NCHW), and type casting if needed. It normalizes the input image and produces the output using the formula

    output = (input - mean) / std

Note that not providing any crop argument results in mirroring and normalization only. This operator allows sequence inputs.

    super().__init__(batch_size=batch_size, num_threads=num_threads,
        device_id=device_id, prefetch_queue_depth=num_threads, seed=42)
    self.path = path
    self.training = training
    self.coco = coco
    self.stride = stride
    self.iter = 0

    self.reader = ops.COCOReader(annotations_file=annotations, file_root=path, num_shards=world,
        shard_id=torch.cuda.current_device(), ltrb=True, ratio=True, shuffle_after_epoch=True,
        save_img_ids=True)

    self.decode_train = ops.nvJPEGDecoderSlice(device="mixed", output_type=types.RGB)
    self.decode_infer = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
    self.bbox_crop = ops.RandomBBoxCrop(device='cpu', ltrb=True, scaling=[0.3, 1.0],
        thresholds=[0.1, 0.3, 0.5, 0.7, 0.9])
    self.bbox_flip = ops.BbFlip(device='cpu', ltrb=True)
    self.img_flip = ops.Flip(device='gpu')
    self.coin_flip = ops.CoinFlip(probability=0.5)

    if isinstance(resize, list): resize = max(resize)
    self.rand_resize = ops.Uniform(range=[resize, float(max_size)])

    self.resize_train = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC, save_attrs=True)
    self.resize_infer = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC,
        resize_longer=max_size, save_attrs=True)

    padded_size = max_size + ((self.stride - max_size % self.stride) % self.stride)

    self.pad = ops.Paste(device='gpu', fill_value=0, ratio=1.1, min_canvas_size=padded_size, paste_x=0, paste_y=0)
    self.normalize = ops.CropMirrorNormalize(device='gpu', mean=mean, std=std, crop=padded_size,
        crop_pos_x=0, crop_pos_y=0)
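A worked example of the `padded_size` arithmetic above, which pads `max_size` up to the next multiple of the model stride so the feature maps divide evenly (128 is a plausible pyramid stride for an FPN that reaches P7; the real value comes from the backbone):

    max_size, stride = 1333, 128
    padded_size = max_size + ((stride - max_size % stride) % stride)
    print(padded_size)  # 1408: 1333 % 128 == 53, and 1333 + (128 - 53) == 1408
    # The outer "% stride" leaves already-aligned sizes unchanged:
    print(1280 + ((stride - 1280 % stride) % stride))  # 1280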
### [define\_graph][define_graph] ###

[nvidia.dali.pipeline.Pipeline.define\_graph][nvidia.dali.pipeline.Pipeline.define_graph] returns a list of output `EdgeReference`s. The user defines this function to construct the operation graph of the pipeline.

`self.reader()` reads data from the dataset. During training, images are decoded and augmented; during inference, they are only decoded and resized.

    images, bboxes, labels, img_ids = self.reader()

    if self.training:
        crop_begin, crop_size, bboxes, labels = self.bbox_crop(bboxes, labels)
        images = self.decode_train(images, crop_begin, crop_size)
        resize = self.rand_resize()
        images, attrs = self.resize_train(images, resize_longer=resize)

        flip = self.coin_flip()
        bboxes = self.bbox_flip(bboxes, horizontal=flip)
        images = self.img_flip(images, horizontal=flip)
    else:
        images = self.decode_infer(images)
        images, attrs = self.resize_infer(images)

    resized_images = images
    images = self.normalize(self.pad(images))

    return images, bboxes, labels, img_ids, attrs, resized_images

## [snap\_to\_anchors][snap_to_anchors] ##

[torch.Tensor.nelement][] is an alias for [torch.Tensor.numel][]. If `boxes` is empty, zero-filled tensors are returned directly.

    'Snap target boxes (x, y, w, h) to anchors'

    num_anchors = anchors.size()[0] if anchors is not None else 1
    width, height = (int(size[0] / stride), int(size[1] / stride))
    if boxes.nelement() == 0:
        return (torch.zeros([num_anchors, num_classes, height, width], device=device),
            torch.zeros([num_anchors, 4, height, width], device=device),
            torch.zeros([num_anchors, 1, height, width], device=device))

Broadcast the anchors to every spatial position of the output.

    boxes, classes = boxes.split(4, dim=1)

    # Generate anchors
    x, y = torch.meshgrid([torch.arange(0, size[i], stride, device=device, dtype=classes.dtype) for i in range(2)])
    xyxy = torch.stack((x, y, x, y), 2).unsqueeze(0)
    anchors = anchors.view(-1, 1, 1, 4).to(dtype=classes.dtype)
    anchors = (xyxy + anchors).contiguous().view(-1, 4)

`boxes` are converted from `[x, y, width, height]` to `[left, top, right, bottom]` to make the intersection-over-union easier to compute.

    # Compute overlap between boxes and anchors
    boxes = torch.cat([boxes[:, :2], boxes[:, :2] + boxes[:, 2:] - 1], 1)
    xy1 = torch.max(anchors[:, None, :2], boxes[:, :2])
    xy2 = torch.min(anchors[:, None, 2:], boxes[:, 2:])
    inter = torch.prod((xy2 - xy1 + 1).clamp(0), 2)
    boxes_area = torch.prod(boxes[:, 2:] - boxes[:, :2] + 1, 1)
    anchors_area = torch.prod(anchors[:, 2:] - anchors[:, :2] + 1, 1)
    overlap = inter / (anchors_area[:, None] + boxes_area - inter)

For each anchor, keep the best-matching target box. [box2delta][] converts the bounding boxes into deltas relative to the anchors; a sketch of the usual encoding follows below. [torch.ones\_like][torch.ones_like] returns a tensor filled with the scalar value 1, with the same size as `input`; `torch.ones_like(input)` is equivalent to `torch.ones(input.size(), dtype=input.dtype, layout=input.layout, device=input.device)`. `depth` is the sample-selection mask.
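For reference, a hedged sketch of the usual R-CNN-style box-to-delta encoding; the repository's [box2delta][] in box.py may differ in detail:

    import torch

    def box2delta_sketch(boxes, anchors):
        # boxes, anchors: [N, 4] in (x1, y1, x2, y2) form
        anchors_wh = anchors[:, 2:] - anchors[:, :2] + 1
        anchors_ctr = anchors[:, :2] + 0.5 * anchors_wh
        boxes_wh = boxes[:, 2:] - boxes[:, :2] + 1
        boxes_ctr = boxes[:, :2] + 0.5 * boxes_wh
        return torch.cat([
            (boxes_ctr - anchors_ctr) / anchors_wh,  # normalized center offsets
            torch.log(boxes_wh / anchors_wh)         # log width/height ratios
        ], 1)

    # A box identical to its anchor encodes to all zeros:
    b = torch.tensor([[10., 10., 49., 49.]])
    print(box2delta_sketch(b, b))  # tensor([[0., 0., 0., 0.]])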
    # Keep best box per anchor
    overlap, indices = overlap.max(1)
    box_target = box2delta(boxes[indices], anchors)
    box_target = box_target.view(num_anchors, 1, width, height, 4)
    box_target = box_target.transpose(1, 4).transpose(2, 3)
    box_target = box_target.squeeze().contiguous()

    depth = torch.ones_like(overlap) * -1
    depth[overlap < 0.4] = 0 # background
    depth[overlap >= 0.5] = classes[indices][overlap >= 0.5].squeeze() + 1 # objects
    depth = depth.view(num_anchors, width, height).transpose(1, 2).contiguous()

Generate the target classes; each class entry is 0 or 1. Anchors with overlap between 0.4 and 0.5 keep a depth of -1 and are ignored by the loss.

[torch.Tensor.scatter\_][torch.Tensor.scatter] writes all values from the tensor `src` into `self` at the indices specified in the `index` tensor. For each value in `src`, its output index is specified by its index in `src` for `dimension != dim` and by the corresponding value in `index` for `dimension = dim`. Here it one-hot encodes the matched class of each anchor.

    # Generate target classes
    cls_target = torch.zeros((anchors.size()[0], num_classes + 1), device=device, dtype=boxes.dtype)
    if classes.nelement() == 0:
        classes = torch.LongTensor([num_classes], device=device).expand_as(indices)
    else:
        classes = classes[indices].long()
    classes = classes.view(-1, 1)
    classes[overlap < 0.4] = num_classes # background has no class
    cls_target.scatter_(1, classes, 1)
    cls_target = cls_target[:, :num_classes].view(-1, 1, width, height, num_classes)
    cls_target = cls_target.transpose(1, 4).transpose(2, 3)
    cls_target = cls_target.squeeze().contiguous()

    return (cls_target.view(num_anchors, num_classes, height, width),
        box_target.view(num_anchors, 4, height, width),
        depth.view(num_anchors, 1, height, width))

## References ##

* [Multiprocessing best practices][]
* [Why is PyTorch so efficient and easy to use? A look inside the deep learning framework][PyTorch]
* [(Distributed Training) Single-machine multi-GPU done right (3): PyTorch][PyTorch 1]
* [PyTorch tutorial: Transfer Learning][Pytorch tutorial _Transfer Learning]
* [Python's Requests Library (Guide)][Python_s Requests Library _Guide]
* [Training AlexNet in 90 seconds! SenseTime sets a new record][90_AlexNet]
* [S9243 Fast and Accurate Object Detection][]
* [NumPy array slice using None][]
* [PyTorch multi-node multi-GPU distributed training][Pytorch]

[retinanet-examples]: https://github.com/NVIDIA/retinanet-examples
[apex.parallel.DistributedDataParallel]: https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel
[apex.amp]: https://nvidia.github.io/apex/amp.html?highlight=amp#apex-amp
[NVIDIA DALI]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html
[TensorRT]: https://developer.nvidia.com/tensorrt
[PyTorch NGC docker container]: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
[main]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L172
[parse]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L15
[load_model]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L76
[Module.share_memory]: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1108
[worker]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/main.py#L104
[torch.multiprocessing.spawn]: https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn
[ArgumentParser.add_subparsers]: https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser.add_subparsers
[ArgumentParser]: https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser
[action]: https://docs.python.org/3/library/argparse.html#action
[dest]: https://docs.python.org/3/library/argparse.html#dest
[required]: https://docs.python.org/3/library/argparse.html#required
[help]: https://docs.python.org/3/library/argparse.html#help
[metavar]: https://docs.python.org/3/library/argparse.html#metavar
[Model]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L12
[Model.initialize]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L59
[Model.load]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L187
[os.environ]:
https://docs.python.org/3/library/os.html#os.environ [mapping]: https://docs.python.org/3/glossary.html#term-mapping [os]: https://docs.python.org/3/library/os.html#module-os [torch.cuda.set_device]: https://pytorch.org/docs/stable/cuda.html#torch.cuda.set_device [torch.distributed.init_process_group]: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group [train]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/train.py#L16 [Engine_load]: https://github.com/NVIDIA/retinanet-examples/blob/master/csrc/engine.cpp#L55 [Engine]: https://github.com/NVIDIA/retinanet-examples/blob/master/csrc/engine.h#L38 [PYBIND11_MODULE]: https://pybind11.readthedocs.io/en/stable/reference.html#c.PYBIND11_MODULE [extensions.cpp]: https://github.com/NVIDIA/retinanet-examples/blob/master/csrc/extensions.cpp#L122 [infer]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/infer.py#L16 [Model.export]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L206 [convert_fixedbn_model]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/backbones/layers.py#L18 [torch.nn.BatchNorm2d]: https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm2d [FixedBatchNorm2d]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/backbones/layers.py#L5 [apex.amp.initialize]: https://nvidia.github.io/apex/amp.html#apex.amp.initialize [torch.nn.parallel.DistributedDataParallel]: https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel [Distributed training]: https://github.com/NVIDIA/apex/tree/master/examples/imagenet#distributed-training [Advanced Amp Usage topic]: https://nvidia.github.io/apex/advanced.html#multiple-models-optimizers-losses [backward]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.backward [NVCaffe]: https://github.com/NVIDIA/caffe [DaliDataIterator]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L69 [Profiler]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/utils.py#L44 [apex.amp.scale_loss]: https://nvidia.github.io/apex/amp.html#apex.amp.scale_loss [post_metrics]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/utils.py#L73 [requests]: https://github.com/request/request [ignore_sigint]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/utils.py#L36 [tempfile.mktemp]: https://docs.python.org/3/library/tempfile.html#tempfile.mktemp [tempfile.mkstemp]: https://docs.python.org/3/library/tempfile.html#tempfile.mkstemp [os.open]: https://docs.python.org/3/library/os.html#os.open [os.O_EXCL]: https://docs.python.org/3/library/os.html#os.O_EXCL [DataIterator]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/data.py#L134 [torch.distributed.all_gather]: https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather [pycocotools.coco.COCO.getCatIds]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L157 [COCO.loadRes]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L297 [COCOeval]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py#L10 [getattr]: https://docs.python.org/3/library/functions.html#getattr [AttributeError]: https://docs.python.org/3/library/exceptions.html#AttributeError [torch.nn.ModuleDict]: https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleDict [Module]: https://pytorch.org/docs/stable/nn.html#torch.nn.Module [update]: 
https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleDict.update [FocalLoss]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/loss.py#L5 [torch.nn.functional.binary_cross_entropy_with_logits]: https://pytorch.org/docs/stable/nn.html#torch.nn.functional.binary_cross_entropy_with_logits [SmoothL1Loss]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/loss.py#L20 [repr]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L52 [object._repr]: https://docs.python.org/3/reference/datamodel.html#object.__repr__ [repr 1]: https://docs.python.org/3/library/functions.html#repr [str]: https://docs.python.org/3/reference/datamodel.html#object.__str__ [forward]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L98 [compute_loss]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L147 [generate_anchors]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L5 [decode]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L105 [nms]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L157 [extract_targets]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L135 [snap_to_anchors]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L48 [save]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/model.py#L173 [torch.cuda.empty_cache]: https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache [io.BytesIO]: https://docs.python.org/3/library/io.html#io.BytesIO [BufferedIOBase]: https://docs.python.org/3/library/io.html#io.BufferedIOBase [close]: https://docs.python.org/3/library/io.html#io.IOBase.close [bytes-like object]: https://docs.python.org/3/glossary.html#term-bytes-like-object [torch.onnx.export]: https://pytorch.org/docs/stable/onnx.html#torch.onnx.export [io.BytesIO.getvalue]: https://docs.python.org/3/library/io.html#io.BytesIO.getvalue [bytes]: https://docs.python.org/3/library/stdtypes.html#bytes [torch.utils.data.DataLoader]: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader [contextlib.redirect_stdout]: https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout [sys.stdout]: https://docs.python.org/3/library/sys.html#sys.stdout [COCO]: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/coco.py#L70 [COCOPipeline]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L11 [nvidia.dali.pipeline.Pipeline]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline [nvidia.dali.pipeline.Pipeline.build]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.build [repr 2]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L96 [len]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L102 [iter]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L105 [nvidia.dali.pipeline.Pipeline.run]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.run [nvidia.dali.pipeline.Pipeline.schedule_run]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.schedule_run [nvidia.dali.pipeline.Pipeline.share_outputs]: 
https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.share_outputs [nvidia.dali.pipeline.Pipeline.release_outputs]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.release_outputs [ctypes.c_void_p]: https://docs.python.org/3/library/ctypes.html#ctypes.c_void_p [orch.Tensor.data_ptr]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.data_ptr [nvidia.dali.backend.TensorListCPU]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListCPU [nvidia.dali.backend.TensorListGPU]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListGPU [nvidia.dali.backend.TensorListCPU.copy_to_external]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListCPU.copy_to_external [nvidia.dali.backend.TensorListGPU.as_cpu]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.backend.TensorListGPU.as_cpu [COCO Reader with augmentations]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/detection_pipeline.html#COCO-Reader-with-augmentations [nvidia.dali.ops.COCOReader]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.COCOReader [nvidia.dali.ops.nvJPEGDecoderSlice]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.nvJPEGDecoderSlice [nvJPEG]: https://developer.nvidia.com/nvjpeg [ImageDecoderSlice]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.ImageDecoderSlice [nvidia.dali.ops.RandomBBoxCrop]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.RandomBBoxCrop [nvidia.dali.ops.BbFlip]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.BbFlip [nvidia.dali.ops.Flip]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Flip [nvidia.dali.ops.CoinFlip]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.CoinFlip [nvidia.dali.ops.Uniform]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Uniform [nvidia.dali.ops.Resize]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Resize [nvidia.dali.ops.Paste]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.Paste [nvidia.dali.ops.CropMirrorNormalize]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.CropMirrorNormalize [define_graph]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/dali.py#L45 [nvidia.dali.pipeline.Pipeline.define_graph]: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/api.html#nvidia.dali.pipeline.Pipeline.define_graph [torch.Tensor.nelement]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.nelement [torch.Tensor.numel]: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.numel [box2delta]: https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/box.py#L19 [torch.ones_like]: https://pytorch.org/docs/stable/torch.html#torch.ones_like [torch.Tensor.scatter]: 
https://pytorch.org/docs/stable/tensors.html#torch.Tensor.scatter_ [Multiprocessing best practices]: https://pytorch.org/docs/stable/notes/multiprocessing.html#multiprocessing-best-practices [PyTorch]: https://www.jiqizhixin.com/articles/2018-03-13 [PyTorch 1]: https://zhuanlan.zhihu.com/p/74792767 [Pytorch tutorial _Transfer Learning]: https://www.cnblogs.com/king-lps/p/8665344.html [Python_s Requests Library _Guide]: https://realpython.com/python-requests/ [90_AlexNet]: https://www.jiqizhixin.com/articles/2019-02-21-14 [S9243 Fast and Accurate Object Detection]: https://developer.download.nvidia.cn/video/gputechconf/gtc/2019/presentation/s9243-fast-and-accurate-object-detection-with-pytorch-and-tensorrt.pdf [NumPy array slice using None]: https://stackoverflow.com/questions/1408311/numpy-array-slice-using-none [Pytorch]: https://zhuanlan.zhihu.com/p/68717029