Paper: “YOLOv3: An Incremental Improvement”
Improvements over previous versions
Feature extraction network
Darknet-19 is replaced by Darknet-53. The deeper network is slower, but it is still much faster than ResNet.
YOLOv3 produces feature maps at three different scales and predicts three boxes on each of them. The anchor boxes are still obtained by clustering: nine anchors in total, divided among the three feature maps so that each scale gets three. Each scale therefore outputs a tensor of size N × N × [3 × (4 + 1 + 80)], where N is the grid size (13 × 13, 26 × 26 or 52 × 52), 3 is the number of bounding boxes predicted per grid cell, 4 is the number of box parameters (x, y, w, h), 1 is the confidence (whether the box contains an object and how well it is positioned), and 80 is the number of classes (the COCO dataset has 80).
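The arithmetic behind that tensor size can be checked with a few lines of Python. This is only a sketch: the 416 × 416 input and the strides of 32, 16 and 8 are assumptions that are not stated above (they are the usual settings that produce the 13, 26 and 52 grids), and head_shape is a hypothetical helper name.

```python
# Sketch: output tensor shapes of the three YOLOv3 detection scales.
# Assumption: a 416x416 input and strides of 32, 16 and 8, which give
# the 13, 26 and 52 grids mentioned in the text.

def head_shape(grid_size, boxes_per_cell=3, num_classes=80):
    """Return (N, N, channels) for one detection scale."""
    # Each box carries 4 coordinates + 1 confidence + one score per class.
    channels = boxes_per_cell * (4 + 1 + num_classes)
    return (grid_size, grid_size, channels)

INPUT_SIZE = 416        # assumed input resolution
STRIDES = (32, 16, 8)   # assumed downsampling factors

shapes = [head_shape(INPUT_SIZE // s) for s in STRIDES]
# Three boxes are predicted in every cell of every grid.
total_boxes = sum(n * n * 3 for n, _, _ in shapes)

for shape in shapes:
    print(shape)
print("total predicted boxes:", total_boxes)
```

With these assumptions each scale has 3 × 85 = 255 output channels, and the three grids together predict 10647 boxes per image.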
As the figure shows, YOLOv3 uses layers 82, 94 and 106 to detect targets at different scales.
The feature map at layer 82 is small (low resolution) and has a large receptive field, so it can detect the larger targets in the image;
The feature map at layer 94 is medium-sized and has a medium receptive field, so it can detect medium-sized targets;
The feature map at layer 106 is large (high resolution) and has the smallest receptive field, so it can detect the smaller targets in the image.
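The pairing of scales and anchors can be sketched as follows. The nine anchor sizes are the COCO values listed in the YOLOv3 paper; the 416 × 416 input, the strides, and the helper name assign_anchors are assumptions made for illustration. The coarsest grid (layer 82) receives the three largest anchors, and the finest grid (layer 106) the three smallest.

```python
# Sketch: splitting nine anchors across the three detection scales.
# Anchor sizes are (width, height) in pixels, as listed in the YOLOv3
# paper for COCO. The input size and strides are assumptions.

ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

def assign_anchors(anchors, strides=(32, 16, 8), input_size=416):
    """Map each grid size to three anchors, largest anchors on the coarsest grid."""
    by_area = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)
    assignment = {}
    for i, stride in enumerate(strides):
        grid = input_size // stride
        assignment[grid] = by_area[3 * i: 3 * i + 3]
    return assignment

assignment = assign_anchors(ANCHORS)
for grid, boxes in assignment.items():
    print(f"{grid}x{grid} grid -> anchors {boxes}")
```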
Therefore, if the output of one of these layers turns out to be NaN during training, it only means that this layer did not detect any target object. As long as at least one of the three layers outputs normal numbers, training is behaving normally.
The figure also shows that, in order to learn deep and shallow features at the same time, the feature maps feeding layers 82 and 94 are upsampled and concatenated (a concat operation) with feature maps from earlier layers. As the paper explains, this approach yields more meaningful semantic information from the upsampled features and finer-grained (texture) information from the earlier feature maps.
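The upsample-and-concatenate step can be illustrated with NumPy. This is a minimal sketch rather than the Darknet implementation: the channel counts (256 and 512) and the nearest-neighbour upsampling are assumptions chosen for illustration, and upsample2x is a hypothetical helper.

```python
import numpy as np

def upsample2x(feature_map):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

# Random stand-ins for real feature maps (channel counts are assumed).
deep = np.random.rand(13, 13, 256)     # coarse map with strong semantic features
shallow = np.random.rand(26, 26, 512)  # finer map taken from an earlier layer

# After upsampling, both maps are 26x26 and can be joined along the channel axis.
merged = np.concatenate([upsample2x(deep), shallow], axis=-1)
print(merged.shape)
```

The merged map keeps the spatial resolution of the shallow features while carrying the channels of both inputs, which is what lets the later detection layers use both kinds of information.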
Because each bounding box may carry several labels (multi-label classification), the authors do not use softmax. Instead they use independent logistic regression classifiers to predict each class. This lets YOLOv3 be trained on more complex datasets such as Open Images, which contains many overlapping labels (for example, a person in an image may be labeled both “woman” and “person”).
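The difference between the two choices shows up in a small numeric example. The labels and scores below are made up for illustration, and the 0.5 threshold is an assumption. Softmax forces the classes to compete for a total probability of 1, while independent sigmoids let several overlapping labels score highly at once.

```python
import math

def sigmoid(x):
    """Logistic function: an independent probability for one class."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Softmax: probabilities that compete and sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["person", "woman", "car"]   # hypothetical overlapping labels
logits = [4.0, 3.5, -5.0]             # made-up raw class scores

independent = [sigmoid(z) for z in logits]
competing = softmax(logits)

# With independent classifiers, every label above the threshold is kept.
predicted = [name for name, p in zip(labels, independent) if p > 0.5]

print("sigmoid:", [round(p, 3) for p in independent])
print("softmax:", [round(p, 3) for p in competing])
print("predicted labels:", predicted)
```

Here the sigmoid outputs keep both "person" and "woman", whereas under softmax the "woman" score falls below 0.5 because it has to share probability mass with "person".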
YOLOv3 performance comparison chart:
YOLOv1 proposes an end-to-end object detection model: given an input image, a single pass through the network yields both the positions and the classes of the targets in the image.
Compared with other models, YOLOv1 is faster and already reaches real-time detection. It also generalizes better: even when trained on natural images, it performs well when tested on artwork.
The drawbacks of YOLOv1 are also clear: the predicted target positions are not accurate enough; small targets (such as flocks of birds) are hard to detect; and it generalizes poorly to objects with new or unusual aspect ratios.
YOLOv2 improves on speed, accuracy, small-target detection, and detection on images of different scales;
YOLOv3 uses a deeper network to extract features, which raises accuracy by more than 2 percentage points. It is slower as a result, but still much faster than other models;
YOLOv3 uses cross-scale prediction with an FPN-like (feature pyramid network) structure that combines deep and shallow features, so that both position information and semantic information are predicted more accurately;
YOLOv3 can be trained on complex datasets such as Open Images, because the authors use independent logistic regression classifiers instead of softmax to predict each class.
OK, that covers the principles of each YOLO version and the differences between them. If you want more detail, I suggest reading the original papers yourself and then working through the hands-on object detection articles I wrote earlier in this series. Together they should give you a better understanding.