This article summarizes and analyzes commonly used text detection models. Some of the models I have actually run; others are analyzed from their papers and reference code. Corrections are welcome.
The following sections analyze each model in detail.
CTPN in Detail
Code: https://github.com/xiaofengShi/CHINESE-OCR
CTPN is one of the most widely used detection models for printed text.
CTPN is adapted from Faster R-CNN; the table below compares the two.
Component | Faster R-CNN | CTPN |
---|---|---|
Base net | VGG16, VGG19, ResNet | VGG16; other CNN backbones are also possible |
RPN prediction | Generated by a conv predict layer on top of the base net | A bidirectional RNN followed by an FC layer on top of the base net |
ROI | Built for multi-class object detection: ROI stage with classification loss and box regression | Text detection is binary classification: no ROI stage; the objectness loss and box regression are computed at the RPN only |
Anchor | 9 anchor shapes: 3 aspect ratios x 3 scales | Fixed anchor width, 10 anchor heights |
Batch | Only one image per training step | Only one image per training step |
Given this design, CTPN typically uses a pretrained VGG net and detects only horizontal text, which suits standard printed documents. Adding an angle term to the box regression turns this into a rotated-text detector, as the EAST model does.
Code Analysis
Network Model
Let's go straight to the CTPN network code.
```python
# imports as in the repo's VGGnet_train.py
import tensorflow as tf
from .network import Network
from ..fast_rcnn.config import cfg


class VGGnet_train(Network):
    # Inherits from Network; see:
    # https://github.com/xiaofengShi/CHINESE-OCR/blob/master/ctpn/lib/networks/network.py
    def __init__(self, trainable=True):
        self.inputs = []
        self.data = tf.placeholder(tf.float32, shape=[None, None, None, 3], name='data')
        self.im_info = tf.placeholder(tf.float32, shape=[None, 3], name='im_info')
        self.gt_boxes = tf.placeholder(tf.float32, shape=[None, 5], name='gt_boxes')
        self.gt_ishard = tf.placeholder(tf.int32, shape=[None], name='gt_ishard')
        self.dontcare_areas = tf.placeholder(tf.float32, shape=[None, 4], name='dontcare_areas')
        self.keep_prob = tf.placeholder(tf.float32)
        self.layers = dict({'data': self.data, 'im_info': self.im_info,
                            'gt_boxes': self.gt_boxes, 'gt_ishard': self.gt_ishard,
                            'dontcare_areas': self.dontcare_areas})
        self.trainable = trainable
        self.setup()

    def setup(self):
        # for text proposals there are two classes: text and background
        n_classes = cfg.NCLASSES
        # base anchor size; the paper uses 16
        anchor_scales = cfg.ANCHOR_SCALES
        _feat_stride = [16, ]

        # base net is vgg16
        (self.feed('data')
         .conv(3, 3, 64, 1, 1, name='conv1_1')
         .conv(3, 3, 64, 1, 1, name='conv1_2')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool1')
         .conv(3, 3, 128, 1, 1, name='conv2_1')
         .conv(3, 3, 128, 1, 1, name='conv2_2')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool2')
         .conv(3, 3, 256, 1, 1, name='conv3_1')
         .conv(3, 3, 256, 1, 1, name='conv3_2')
         .conv(3, 3, 256, 1, 1, name='conv3_3')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool3')
         .conv(3, 3, 512, 1, 1, name='conv4_1')
         .conv(3, 3, 512, 1, 1, name='conv4_2')
         .conv(3, 3, 512, 1, 1, name='conv4_3')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool4')
         .conv(3, 3, 512, 1, 1, name='conv5_1')
         .conv(3, 3, 512, 1, 1, name='conv5_2')
         .conv(3, 3, 512, 1, 1, name='conv5_3'))

        # RPN: convolve the last feature map into a 512-channel map
        (self.feed('conv5_3').conv(3, 3, 512, 1, 1, name='rpn_conv/3x3'))
        # the feature map here has shape batch*h*w*512

        # the single-layer bidirectional LSTM
        (self.feed('rpn_conv/3x3').Bilstm(512, 128, 512, name='lstm_o'))
        # BiLSTM output shape is (N, H, W, 512)

        """
        As in Faster R-CNN's RPN, CTPN predicts objectness scores and box
        regressions, but from a bidirectional LSTM plus an FC layer rather
        than a convolution on the base net's last layer. The LSTM output is
        used to compute box deltas and the binary text/non-text probability.
        Input shape: (N, H, W, 512); output shape: (N, H, W, int(d_o)).
        This layer can be treated as the detector's final feature map.
        rpn_bbox_pred -- 4 box deltas per anchor at each of the h*w cells
        rpn_cls_score -- 2 confidence scores per anchor at each cell
        """
        (self.feed('lstm_o').lstm_fc(512, len(anchor_scales) * 10 * 4, name='rpn_bbox_pred'))
        (self.feed('lstm_o').lstm_fc(512, len(anchor_scales) * 10 * 2, name='rpn_cls_score'))

        # generating training labels on the fly
        # output: rpn_labels (HxWxA, 2), rpn_bbox_targets (HxWxA, 4),
        #         rpn_bbox_inside_weights, rpn_bbox_outside_weights
        # label each anchor and compute its regression target (also in
        # delta form), plus the inside and outside weights
        (self.feed('rpn_cls_score', 'gt_boxes', 'gt_ishard', 'dontcare_areas', 'im_info')
         .anchor_target_layer(_feat_stride, anchor_scales, name='rpn-data'))

        # shape is (1, H, W, Ax2) -> (1, H, WxA, 2)
        # softmax over the scores to get probabilities in [0, 1]
        (self.feed('rpn_cls_score')
         .spatial_reshape_layer(2, name='rpn_cls_score_reshape')
         .spatial_softmax(name='rpn_cls_prob'))

        '''
        # the part below is the ROI pooling head from Faster R-CNN,
        # unused in CTPN
        (self.feed('rpn_cls_prob').spatial_reshape_layer(
            len(anchor_scales) * 10 * 2, name='rpn_cls_prob_reshape'))
        (self.feed('rpn_cls_prob_reshape', 'rpn_bbox_pred', 'im_info')
         .proposal_layer(_feat_stride, anchor_scales, 'TRAIN', name='rpn_rois'))
        (self.feed('rpn_rois', 'gt_boxes').proposal_target_layer(n_classes, name='roi-data'))
        # ========= RCNN ============
        (self.feed('conv5_3', 'roi-data')
         .roi_pool(7, 7, 1.0 / 16, name='pool_5')
         .fc(4096, name='fc6').dropout(0.5, name='drop6')
         .fc(4096, name='fc7').dropout(0.5, name='drop7')
         .fc(n_classes, relu=False, name='cls_score').softmax(name='cls_prob'))
        (self.feed('drop7').fc(n_classes * 4, relu=False, name='bbox_pred'))
        '''
```
As the code shows, the CTPN network is derived from Faster R-CNN. A VGG net extracts image features; the final feature map of shape [N, H, W, C] is reshaped to [N*H, W, C] so that each row becomes a sequence, and a BLSTM produces an output of shape [N*H, W, 2D], where D is the number of hidden units in each RNN direction. That output is reshaped to [N*H*W, 2D], passed through a fully connected layer to [N*H*W, C'], and finally reshaped back to [N, H, W, C']. In this step the RNN connects the CNN features along the width of the feature map. The lstm_fc layers then predict, for each of the A anchors at every point of this feature map, an objectness class and a box regression: the classification head outputs 2A scores per cell, and the regression head outputs 4A box offsets per cell.
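To make this tensor bookkeeping concrete, here is a minimal NumPy sketch of the reshapes around the BLSTM. The shapes follow the description above, and the 128 hidden units per direction match this repo's `Bilstm(512, 128, 512)` call, but the zero tensors and weight matrices are placeholders of mine, not code from the repo:

```python
import numpy as np

N, H, W, C = 1, 38, 50, 512    # feature map after conv5_3 + rpn_conv/3x3
D = 128                        # hidden units of each LSTM direction
A = 10                         # anchors per cell

x = np.zeros((N, H, W, C), dtype=np.float32)

# treat each feature-map row as a sequence of length W
seq = x.reshape(N * H, W, C)              # (N*H, W, C) -> fed to the BLSTM

# a bidirectional LSTM returns 2*D features per time step
blstm_out = np.zeros((N * H, W, 2 * D), dtype=np.float32)

# the FC layer maps every spatial position independently
flat = blstm_out.reshape(N * H * W, 2 * D)
w_cls = np.zeros((2 * D, 2 * A), dtype=np.float32)  # text/non-text scores
w_box = np.zeros((2 * D, 4 * A), dtype=np.float32)  # box deltas

rpn_cls_score = (flat @ w_cls).reshape(N, H, W, 2 * A)
rpn_bbox_pred = (flat @ w_box).reshape(N, H, W, 4 * A)
```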
The overall network structure is shown below.
CTPN MODEL STRUCTURE
Anchor Generation and Filtering
The anchor-generation step (AnchorGen in the figure) deserves a detailed explanation: this is the famous RPN. The code below walks through it:
```python
# -*- coding:utf-8 -*-
import numpy as np
import numpy.random as npr

from ..fast_rcnn.config import cfg
from bbox import bbox_overlaps, bbox_intersections

DEBUG = False


# generate the base anchor boxes
def generate_basic_anchors(sizes, base_size=16):
    base_anchor = np.array([0, 0, base_size - 1, base_size - 1], np.int32)
    anchors = np.zeros((len(sizes), 4), np.int32)
    index = 0
    for h, w in sizes:
        anchors[index] = scale_anchor(base_anchor, h, w)
        index += 1
    return anchors


# scale the base anchor to the configured height and width
def scale_anchor(anchor, h, w):
    x_ctr = (anchor[0] + anchor[2]) * 0.5
    y_ctr = (anchor[1] + anchor[3]) * 0.5
    scaled_anchor = anchor.copy()
    scaled_anchor[0] = x_ctr - w / 2  # xmin
    scaled_anchor[2] = x_ctr + w / 2  # xmax
    scaled_anchor[1] = y_ctr - h / 2  # ymin
    scaled_anchor[3] = y_ctr + h / 2  # ymax
    return scaled_anchor


# generate the anchor boxes:
# CTPN uses a fixed width and ten different heights
def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
                     scales=2 ** np.arange(3, 6)):
    heights = [11, 16, 23, 33, 48, 68, 97, 139, 198, 283]
    widths = [16]
    sizes = []
    for h in heights:
        for w in widths:
            sizes.append((h, w))
    return generate_basic_anchors(sizes)


# transform between anchors and ground truth, as in the paper
def bbox_transform(ex_rois, gt_rois):
    """
    computes the distance from ground-truth boxes to the given boxes,
    normed by their size
    :param ex_rois: n * 4 numpy array, anchor boxes
    :param gt_rois: n * 4 numpy array, ground-truth boxes
    :return: deltas: n * 4 numpy array, ground-truth boxes
    """
    ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0   # anchor width
    ex_heights = ex_rois[:, 3] - ex_rois[:, 1] + 1.0  # anchor height
    ex_ctr_x = ex_rois[:, 0] + 0.5 * ex_widths        # anchor center x
    ex_ctr_y = ex_rois[:, 1] + 0.5 * ex_heights       # anchor center y

    assert np.min(ex_widths) > 0.1 and np.min(ex_heights) > 0.1, \
        'Invalid boxes found: {} {}'. \
        format(ex_rois[np.argmin(ex_widths), :], ex_rois[np.argmin(ex_heights), :])

    gt_widths = gt_rois[:, 2] - gt_rois[:, 0] + 1.0   # gt_box width
    gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0  # gt_box height
    gt_ctr_x = gt_rois[:, 0] + 0.5 * gt_widths        # gt_box center x
    gt_ctr_y = gt_rois[:, 1] + 0.5 * gt_heights       # gt_box center y

    targets_dx = (gt_ctr_x - ex_ctr_x) / ex_widths    # (gt_c_x - a_c_x)
    targets_dy = (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = np.log(gt_widths / ex_widths)
    targets_dh = np.log(gt_heights / ex_heights)

    targets = np.vstack(
        (targets_dx, targets_dy, targets_dw, targets_dh)).transpose()
    return targets


# assign labels and regression targets to the anchors
def anchor_target_layer(rpn_cls_score, gt_boxes, gt_ishard, dontcare_areas,
                        im_info, _feat_stride=[16, ], anchor_scales=[16, ]):
    """
    Assign anchors to ground-truth targets. Produces anchor classification
    labels and bounding-box regression targets.
    Parameters
    ----------
    rpn_cls_score: (1, H, W, Ax2) bg/fg scores of previous conv layer
    gt_boxes: (G, 5) vstack of [x1, y1, x2, y2, class]
    gt_ishard: (G, 1), 1 or 0 indicates difficult or not
    dontcare_areas: (D, 4), some areas may contain small objs but no
        labelling. D may be 0
    im_info: a list of [image_height, image_width, scale_ratios]
    _feat_stride: the downsampling ratio of feature map to the original input image
    anchor_scales: the scales to the basic_anchor (basic anchor is [16, 16])
    Returns
    ----------
    rpn_labels: (HxWxA, 1), for each anchor, 0 denotes bg, 1 fg, -1 dontcare
    rpn_bbox_targets: (HxWxA, 4), distances of the anchors to the gt_boxes
        (may contain some transform) that are the regression objectives
    rpn_bbox_inside_weights: (HxWxA, 4) weights of each box, mainly accepts
        hyper param in cfg
    rpn_bbox_outside_weights: (HxWxA, 4) used to balance the fg/bg, because
        the numbers of bgs and fgs may differ significantly
    """
    # anchors are [x_min, y_min, x_max, y_max]
    # generate the 10 basic anchors
    _anchors = generate_anchors(scales=np.array(anchor_scales))
    _num_anchors = _anchors.shape[0]  # 10 anchors

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0
    # original image info: height, width and scale
    im_info = im_info[0]

    # locate the anchors on the feature map and add the shifts to get
    # their true coordinates in the input image
    """
    Algorithm:
    for each (H, W) location i
        generate anchor boxes centered on cell i
        apply predicted bbox deltas at cell i to each of the anchors
    filter out-of-image anchors
    measure GT overlap
    """
    assert rpn_cls_score.shape[0] == 1, \
        'Only single item batches are supported'

    # map of shape (..., H, W): feature-map height and width
    height, width = rpn_cls_score.shape[1:3]

    # 1. Generate proposals from bbox deltas and shifted anchors
    shift_x = np.arange(0, width) * _feat_stride
    shift_y = np.arange(0, height) * _feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)  # in W H order
    # offsets between feature-map cells and image coordinates;
    # shifts form a grid of shape [height*width, 4]
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                        shift_x.ravel(), shift_y.ravel())).transpose()

    A = _num_anchors     # 10 anchors
    K = shifts.shape[0]  # width * height of the feature map
    # generate A anchors at every cell; shape is [K, A, 4]
    all_anchors = (_anchors.reshape((1, A, 4)) +
                   shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    all_anchors = all_anchors.reshape((K * A, 4))  # shape (K*A, 4)
    total_anchors = int(K * A)

    # only keep anchors inside the image:
    # since anchors come in many sizes, those near the border may extend
    # beyond the image; drop them and record the indices of the kept
    # anchors within all_anchors
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)    # height
    )[0]

    # keep only inside anchors
    anchors = all_anchors[inds_inside, :]
    # at this point the anchors are ready
    # --------------------------------------------------------------

    # label: 1 is positive, 0 is negative, -1 is dontcare
    labels = np.empty((len(inds_inside),), dtype=np.float32)
    labels.fill(-1)  # initialize all labels to -1

    # overlaps between the anchors and the gt boxes, used to label the
    # anchors: IoU = intersection area / union area
    # overlaps shape is [anchors.shape[0], gt_boxes.shape[0]]
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float),
        np.ascontiguousarray(gt_boxes, dtype=np.float))

    # for each anchor, the gt box with the largest overlap
    argmax_overlaps = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
    # for each gt box, the anchor with the largest overlap
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels first so that positive labels can clobber them:
        # anchors with overlap below 0.3 are negatives (label 0)
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # positives: anchors with IoU above 0.7, plus for each gt box the
    # anchor with the highest IoU
    # fg label: for each gt, the anchor with highest overlap
    labels[gt_argmax_overlaps] = 1
    # fg label: above threshold IoU
    labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1

    if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # preclude dontcare areas (not considered further here)
    if dontcare_areas is not None and dontcare_areas.shape[0] > 0:
        # intersec shape is D x A
        intersecs = bbox_intersections(
            np.ascontiguousarray(dontcare_areas, dtype=np.float),  # D x 4
            np.ascontiguousarray(anchors, dtype=np.float)          # A x 4
        )
        intersecs_ = intersecs.sum(axis=0)  # A x 1
        labels[intersecs_ > cfg.TRAIN.DONTCARE_AREA_INTERSECTION_HI] = -1

    # preclude hard samples that are highly occluded, truncated or
    # difficult to see (not considered further here)
    if cfg.TRAIN.PRECLUDE_HARD_SAMPLES and gt_ishard is not None \
            and gt_ishard.shape[0] > 0:
        assert gt_ishard.shape[0] == gt_boxes.shape[0]
        gt_ishard = gt_ishard.astype(int)
        gt_hardboxes = gt_boxes[gt_ishard == 1, :]
        if gt_hardboxes.shape[0] > 0:
            # H x A
            hard_overlaps = bbox_overlaps(
                np.ascontiguousarray(gt_hardboxes, dtype=np.float),  # H x 4
                np.ascontiguousarray(anchors, dtype=np.float))       # A x 4
            hard_max_overlaps = hard_overlaps.max(axis=0)  # (A)
            labels[hard_max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = -1
            max_intersec_label_inds = hard_overlaps.argmax(axis=1)  # H x 1
            labels[max_intersec_label_inds] = -1

    # subsample positive labels if we have too many:
    # cap the positives at 128 (dontcare excluded)
    # TODO this may need revisiting; with character fragments
    # there can be many positives
    num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1  # randomly drop some positives

    # subsample negative labels if we have too many:
    # 256 samples in total, at most 128 positives; if there are fewer
    # positives, fill the remainder with negatives
    num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1
        # print "was %s inds, disabling %s, now %s inds" % (
        #     len(bg_inds), len(disable_inds), np.sum(labels == 0))

    # labels are assigned; now compute the rpn-box regression targets
    # --------------------------------------------------------------
    bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # targets are the deltas between each anchor and its matched gt box
    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    # inside weights: 1 for foreground, 0 otherwise
    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    bbox_inside_weights[labels == 1, :] = np.array(
        cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
        # uniform weighting of examples (given non-uniform sampling):
        # positives get 1, negatives get 0
        # num_examples = np.sum(labels >= 0) + 1
        # positive_weights = np.ones((1, 4)) * 1.0 / num_examples
        # negative_weights = np.ones((1, 4)) * 1.0 / num_examples
        positive_weights = np.ones((1, 4))   # foreground: 1
        negative_weights = np.zeros((1, 4))  # background: 0
    else:
        assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
                (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
        positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT /
                            (np.sum(labels == 1)) + 1)
        negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) /
                            (np.sum(labels == 0)) + 1)
    # outside weights: foreground 1, background 0
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # map up to the original set of anchors:
    # the out-of-image anchors dropped earlier are added back here;
    # inds_inside holds their indices in the original anchor set
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)  # dontcare
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors,
                                 inds_inside, fill=0)
    bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors,
                                  inds_inside, fill=0)

    # reshape everything back to the feature-map layout
    labels = labels.reshape((1, height, width, A))
    rpn_labels = labels

    bbox_targets = bbox_targets.reshape((1, height, width, A * 4))
    rpn_bbox_targets = bbox_targets

    bbox_inside_weights = bbox_inside_weights.reshape((1, height, width, A * 4))
    rpn_bbox_inside_weights = bbox_inside_weights

    bbox_outside_weights = bbox_outside_weights.reshape((1, height, width, A * 4))
    rpn_bbox_outside_weights = bbox_outside_weights

    rpn_data = (rpn_labels, rpn_bbox_targets,
                rpn_bbox_inside_weights, rpn_bbox_outside_weights)
    return rpn_data


# map the kept (inside-image) anchors back into the full anchor set
def _unmap(data, count, inds, fill=0):
    """ Unmap a subset of items (data) back to the original set of
    items (of size count) """
    if len(data.shape) == 1:
        ret = np.empty((count,), dtype=np.float32)
        ret.fill(fill)
        ret[inds] = data
    else:
        ret = np.empty((count,) + data.shape[1:], dtype=np.float32)
        ret.fill(fill)
        ret[inds, :] = data
    return ret


# compute the box deltas between anchors and gt boxes
def _compute_targets(ex_rois, gt_rois):
    """Compute bounding-box regression targets for an image."""
    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 5

    return bbox_transform(ex_rois, gt_rois[:, :4]).astype(
        np.float32, copy=False)
```
The bbox utilities are written in Cython (a .pyx file):
```cython
import numpy as np
cimport numpy as np

DTYPE = np.float
ctypedef np.float_t DTYPE_t


# compute IoU
def bbox_overlaps(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    Parameters
    ----------
    boxes: (N, 4) ndarray of float, anchor boxes
    query_boxes: (K, 4) ndarray of float, ground-truth boxes,
        [x_min, y_min, x_max, y_max, class]
    Returns
    -------
    overlaps: (N, K) ndarray of overlap between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] overlaps = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            # horizontal intersection; iw is positive if they overlap
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                # vertical intersection
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    # if the boxes intersect, compute the union area
                    ua = float(
                        (boxes[n, 2] - boxes[n, 0] + 1) *
                        (boxes[n, 3] - boxes[n, 1] + 1) +
                        box_area - iw * ih
                    )
                    # intersection area / union area
                    overlaps[n, k] = iw * ih / ua
    return overlaps


# ratio of the box/gt intersection area to the gt area
def bbox_intersections(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    For each query box compute the intersection ratio covered by boxes
    ----------
    Parameters
    ----------
    boxes: (N, 4) ndarray of float
    query_boxes: (K, 4) ndarray of float
    Returns
    -------
    overlaps: (N, K) ndarray of intersec between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] intersec = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    intersec[n, k] = iw * ih / box_area
    return intersec
```
The comments in the code spell everything out. The anchor generation lives in anchor_target_layer.py.
Anchors
First, A anchors are generated at every cell of the feature map from the configured heights and fixed width. Some of these anchors extend past the image boundary, as the figure above shows; those are removed, and the indices of the kept anchors within the full anchor set are recorded. The kept anchors are then matched against the ground truth by IoU (if an anchor and a gt box intersect, the IoU is their intersection area over their union area). Two rules determine the positive anchors: an anchor whose IoU with a gt box exceeds the 0.7 threshold is positive; and for each gt box, the anchor with the highest IoU is positive, so this second rule selects as many anchors as there are gt boxes. This fixes the positive anchors; the remaining anchors are negatives, and the configured sample counts cap how many of each are actually used. Finally, the offsets between the sampled anchors and their gt boxes are computed and serve as the box-regression labels. The anchor-to-gt offset computation is illustrated below.
Anchor_groudtruth
In the figure, red is the ground truth and black is the anchor box. First compute both rectangles' center coordinates, widths, and heights; the regression targets are then
$$\begin{aligned} target_x &= (GT_x - AN_x) / AN_{width} \\ target_y &= (GT_y - AN_y) / AN_{height} \\ target_w &= \log(GT_{width} / AN_{width}) \\ target_h &= \log(GT_{height} / AN_{height}) \end{aligned}$$
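As a quick check, here is a minimal NumPy sketch of this transform (mirroring `bbox_transform` above) applied to one anchor/gt pair; the sample coordinates are made up for illustration:

```python
import numpy as np

anchor = np.array([0., 0., 15., 10.])   # [x_min, y_min, x_max, y_max]
gt     = np.array([2., 1., 17., 13.])

def to_ctr(box):
    # center x, center y, width, height (with the +1 pixel convention)
    w = box[2] - box[0] + 1.0
    h = box[3] - box[1] + 1.0
    return box[0] + 0.5 * w, box[1] + 0.5 * h, w, h

ax, ay, aw, ah = to_ctr(anchor)
gx, gy, gw, gh = to_ctr(gt)

target = np.array([(gx - ax) / aw,    # target_x
                   (gy - ay) / ah,    # target_y
                   np.log(gw / aw),   # target_w
                   np.log(gh / ah)])  # target_h
print(target)  # the regression label for this anchor
```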
The whole anchor-generation flow is shown below.
ctpn_anchor_gen
Summary
This concludes my reading of the CTPN architecture alongside its code. The model was proposed in 2016 and clearly shows Faster R-CNN's influence. CTPN has the following characteristics:
- It uses a VGG backbone, so ImageNet-pretrained weights generally make training faster and more stable
- The BiLSTM cannot be parallelized the way a CNN can, which limits the model's speed
- Anchors have a fixed width and varying heights, so they only suit horizontal text; changing the anchor design can make the model handle vertical text as well
- The anchor width is fixed at 16 pixels (see `widths = [16]` in the code above), so the detection granularity is bounded by this setting and box boundaries may be imprecise
- Because it uses the same anchor generation and delta prediction as Faster R-CNN, inference must apply the inverse transform to the predicted deltas to recover the boxes, as sketched below
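A minimal sketch of that inverse transform, the counterpart of `bbox_transform` above; the function and its name are my illustration, not code taken from the repo:

```python
import numpy as np

def bbox_transform_inv(anchors, deltas):
    """Apply predicted deltas (t_x, t_y, t_w, t_h) to anchors to recover
    box coordinates. anchors, deltas: (N, 4) arrays."""
    w = anchors[:, 2] - anchors[:, 0] + 1.0
    h = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_x = anchors[:, 0] + 0.5 * w
    ctr_y = anchors[:, 1] + 0.5 * h

    # invert each step of the forward transform
    pred_ctr_x = deltas[:, 0] * w + ctr_x
    pred_ctr_y = deltas[:, 1] * h + ctr_y
    pred_w = np.exp(deltas[:, 2]) * w
    pred_h = np.exp(deltas[:, 3]) * h

    boxes = np.empty_like(anchors, dtype=np.float32)
    boxes[:, 0] = pred_ctr_x - 0.5 * pred_w  # x_min
    boxes[:, 1] = pred_ctr_y - 0.5 * pred_h  # y_min
    boxes[:, 2] = pred_ctr_x + 0.5 * pred_w  # x_max
    boxes[:, 3] = pred_ctr_y + 0.5 * pred_h  # y_max
    return boxes
```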
EAST
Key Ideas of the Paper
- Proposes a two-stage text detection pipeline, FCN + NMS, removing the error that accumulates across multi-step pipelines and cutting detection time
- Detects at both the word level and the text-line level, and the predicted shape can be an arbitrary quadrangle or a rotated rectangle
- Uses Locality-Aware NMS to filter the predicted boxes
The network structure is shown below.
EAST Model
Pipeline
- A general-purpose network serves as the base net for feature extraction (the paper uses PVANet; in practice VGG16, ResNet, and others also work)
A note on PVANet: it is an improved VGG-style network designed for object detection, intended as a replacement for Faster R-CNN's base network. Its main building blocks are mCReLU, Inception, and Hyper-feature structures (PVAnet).
The paper uses PVANet's backbone as its base network; the exact parameters are shown below (PVAnetParam).
The mCReLU and Inception blocks are shown below; a sketch of the underlying CReLU idea follows the figure.
PVAnet mCReLU Inception
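For intuition, here is a minimal sketch of the CReLU idea that mCReLU builds on: concatenating the ReLU of a pre-activation and of its negation doubles the channels, so the preceding convolution needs only half the filters. This is my illustration of the general technique, not PVANet's exact mCReLU block (which PVANet further modifies, e.g. with scale/shift and residual connections):

```python
import tensorflow as tf

def crelu_conv(x, filters, kernel_size, stride=1):
    # one convolution computes `filters` channels; concat(relu(y), relu(-y))
    # yields 2*filters channels for roughly half the multiply cost
    y = tf.keras.layers.Conv2D(filters, kernel_size, strides=stride,
                               padding='same')(x)
    y = tf.keras.layers.BatchNormalization()(y)
    return tf.concat([tf.nn.relu(y), tf.nn.relu(-y)], axis=-1)
```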
- From this backbone, feature maps are extracted at several stages (at 1/32, 1/16, 1/8, and 1/4 of the input-image size), yielding features at multiple scales. This addresses the drastic scale variation of text lines: early stages (larger feature maps) can predict small text lines, while late stages (smaller feature maps) can predict large ones.
- A feature-merging branch merges the extracted features, following the U-Net approach: starting from the top of the feature extractor, feature maps are merged upward by fixed rules, progressively enlarging the feature map.
- The output layer produces a text score plus the text geometry, which is either RBOX or QUAD. For RBOX, each point predicts the distances to the four edges of the gt box together with the box's angle θ relative to the image's positive x-axis, 5 values in total: (d1, d2, d3, d4, θ). For QUAD, each point predicts the coordinates of the gt box's four corners, 8 values in total. The RBOX case is illustrated below.
EAST_RBOX
In the figure, $d_i$ is the distance from the current point to the $i$-th edge of the gt box. Knowing the distances from a fixed point to a rectangle's four edges pins down that rectangle's position and size, i.e. it determines the rectangle.
EAST_RBOX_QUAD
As the figure shows, RBOX yields 5 predicted values per location while QUAD yields 8.
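To make the RBOX parameterization concrete, here is a small sketch that recovers a box's four corners from one pixel's prediction. The helper and the edge ordering assumed here (distances to the top, right, bottom, and left edges) are my illustration, not code from the EAST paper or a reference implementation:

```python
import numpy as np

def rbox_to_corners(px, py, d, theta):
    """Recover the 4 corners of an RBOX from a pixel location (px, py),
    distances d = (d_top, d_right, d_bottom, d_left), and angle theta.
    Corner order: top-left, top-right, bottom-right, bottom-left."""
    d_top, d_right, d_bottom, d_left = d
    # axis-aligned rectangle around the pixel, pixel at the origin
    corners = np.array([[-d_left, -d_top],
                        [ d_right, -d_top],
                        [ d_right,  d_bottom],
                        [-d_left,   d_bottom]], dtype=np.float32)
    # rotate by theta around the pixel, then translate back
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]], dtype=np.float32)
    return corners @ rot.T + np.array([px, py], dtype=np.float32)
```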
The computations for layers g and h are given by the formulas in the figure:
- g is an unpooling step that doubles the feature-map size at each stage. The paper upsamples with bilinear interpolation rather than deconvolution, which reduces computation at some possible cost in expressive power
- The upsampled feature map is merged with the same-size feature map f from the downsampling path, and a 1x1 convolution reduces the merged channel count
- A 3x3 convolution then produces that stage's output feature map
- These steps repeat 3 times, and the final merged feature map has 32 channels
After feature merging, the prediction head emits the score map and, depending on the geometry type, 5 or 8 geometry values per location. The merging branch is sketched below.
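A minimal TF2-style sketch of the merging branch under the description above: bilinear 2x upsampling, concatenation with the same-size backbone map, a 1x1 then a 3x3 convolution, repeated three times, ending in 32 channels. The function name and intermediate channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import tensorflow as tf

def merge_branch(f):
    """f: list of backbone feature maps [f1, f2, f3, f4], ordered from
    coarsest (1/32 of the input) to finest (1/4). Returns the 32-channel
    merged map fed to the prediction head."""
    out_channels = [128, 64, 32]  # channels after each merge (assumed)
    h = f[0]
    for i in range(3):
        # g: bilinear 2x upsampling to the next (finer) map's size
        new_size = tf.shape(f[i + 1])[1:3]
        g = tf.image.resize(h, new_size, method='bilinear')
        # merge with the same-resolution feature map,
        # then reduce channels with 1x1 and refine with 3x3
        h = tf.concat([g, f[i + 1]], axis=-1)
        h = tf.keras.layers.Conv2D(out_channels[i], 1, padding='same',
                                   activation='relu')(h)
        h = tf.keras.layers.Conv2D(out_channels[i], 3, padding='same',
                                   activation='relu')(h)
    # final 3x3 conv producing the 32-channel output feature map
    return tf.keras.layers.Conv2D(32, 3, padding='same',
                                  activation='relu')(h)
```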
Loss Computation
The total loss is the sum of a classification (score) loss and a geometry loss:
$$L = L_s + \lambda_g L_g$$
For the classification loss the paper uses a class-balanced cross-entropy:
$$L_s = \text{balanced-xent}(\hat{Y}, Y) = -\beta Y \log \hat{Y} - (1 - \beta)(1 - Y)\log(1 - \hat{Y}), \quad \text{where } \beta = 1 - \frac{\sum_{y \in Y} y}{|Y|}$$
Here $\hat{Y}$ is the prediction and $Y$ is the label. Compared with ordinary cross-entropy, the balanced version evens out the contributions of positive and negative samples.
For the geometry loss $L_g$: RBOX consists of the 5 predicted values $(d_1, d_2, d_3, d_4, \theta)$, giving
$$L_g = L_{AABB} + \lambda_\theta L_\theta, \quad \text{where} \quad L_{AABB} = -\log \mathrm{IoU}(\hat{R}, R^*) = -\log \frac{|\hat{R} \cap R^*|}{|\hat{R} \cup R^*|}, \qquad L_\theta = 1 - \cos(\hat{\theta} - \theta^*)$$
For the IoU term, the paper computes the width and height of the intersection region as
$$w_i = \min(\hat{d}_2, d^*_2) + \min(\hat{d}_4, d^*_4), \qquad h_i = \min(\hat{d}_1, d^*_1) + \min(\hat{d}_3, d^*_3)$$
This computation is actually flawed, as the following analysis shows.
east_iou
In the figure above, red is the gt box and blue is the prediction. If the angle is ignored, the formula above is correct; but once the angle is taken into account, this intersection-area formula breaks down.
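For reference, here is a minimal NumPy sketch of the loss as written above, evaluated per image. This is my illustration, not the paper's implementation: it keeps the paper's approximate intersection (which, as just noted, ignores the angle), assumes the edge ordering (top, right, bottom, left), and masks the geometry loss to text pixels; `lam_theta = 10` follows the paper's stated setting while $\lambda_g$ is taken as 1:

```python
import numpy as np

def east_rbox_loss(y_hat, y, d_hat, d, theta_hat, theta,
                   lam_theta=10.0, eps=1e-6):
    """Per-image EAST loss for RBOX geometry.
    y_hat, y: predicted / gt score maps in [0, 1], shape (H, W)
    d_hat, d: predicted / gt edge distances, shape (H, W, 4)
    theta_hat, theta: predicted / gt angles, shape (H, W)"""
    # balanced cross-entropy for the score map: beta = 1 - mean(y)
    beta = 1.0 - y.mean()
    l_s = -(beta * y * np.log(y_hat + eps)
            + (1 - beta) * (1 - y) * np.log(1 - y_hat + eps)).mean()

    # IoU loss with the paper's intersection approximation
    area_hat = (d_hat[..., 0] + d_hat[..., 2]) * (d_hat[..., 1] + d_hat[..., 3])
    area_gt = (d[..., 0] + d[..., 2]) * (d[..., 1] + d[..., 3])
    w_i = np.minimum(d_hat[..., 1], d[..., 1]) + np.minimum(d_hat[..., 3], d[..., 3])
    h_i = np.minimum(d_hat[..., 0], d[..., 0]) + np.minimum(d_hat[..., 2], d[..., 2])
    inter = w_i * h_i
    union = area_hat + area_gt - inter
    l_aabb = -np.log((inter + eps) / (union + eps))

    # angle loss
    l_theta = 1.0 - np.cos(theta_hat - theta)

    # geometry loss averaged over text pixels only
    l_g = ((l_aabb + lam_theta * l_theta) * y).sum() / (y.sum() + eps)
    return l_s + l_g  # lambda_g taken as 1
```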
Reference
- Survey
- Text detection
- CTPN
- EAST
- SegLink
- PixelLink
- TextBoxes
Paper notes: TextBoxes++: A Single-Shot Oriented Scene Text Detector
- Corner localization
- Text recognition
- ASTER
- TextSpotter
- Mask TextSpotter
Original post: https://www.cnblogs.com/ZFJ1094038955/p/12070441.html