

We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For \(300 \times 300\) input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for \(512 \times 512\) input, SSD achieves 76.9 % mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size.
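
The default boxes referred to above can be pictured as a fixed grid of candidate boxes tiled over each feature map. As a rough sketch (not code from the paper), the following Python snippet generates such boxes for a single feature map; the map size, scale value, and aspect ratio set are assumptions chosen only for illustration, and an aspect ratio \(a\) at scale \(s\) is realized here as width \(s\sqrt{a}\) and height \(s/\sqrt{a}\).

```python
# Illustrative sketch: tile default boxes of several aspect ratios and one
# scale at every location of a single square feature map.
import numpy as np

def default_boxes(feature_map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return default boxes as (cx, cy, w, h) in relative [0, 1] coordinates."""
    boxes = []
    f = feature_map_size
    for i in range(f):              # row index of the feature map cell
        for j in range(f):          # column index of the feature map cell
            cx, cy = (j + 0.5) / f, (i + 0.5) / f   # center of this cell
            for a in aspect_ratios:
                # width/height ratio is a, area is roughly scale**2
                w, h = scale * np.sqrt(a), scale / np.sqrt(a)
                boxes.append((cx, cy, w, h))
    return np.array(boxes)

# Example (assumed values): an 8x8 map at scale 0.4 yields 8 * 8 * 3 = 192 boxes.
print(default_boxes(8, 0.4).shape)  # (192, 4)
```

At prediction time, the per-box category scores and box adjustments are interpreted relative to these fixed boxes, so no boxes ever need to be proposed or resampled.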

Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work through the current leading results on PASCAL VOC, COCO, and ILSVRC detection, all based on Faster R-CNN, albeit with deeper features. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sect. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.

This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP 74.3 % on VOC2007 test, vs. Faster R-CNN at 7 FPS with mAP 73.2 % or YOLO at 45 FPS with mAP 63.4 %). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this, but by adding a series of improvements we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications, especially using multiple layers for prediction at different scales, we can achieve high accuracy using relatively low-resolution input, further increasing detection speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from 63.4 % mAP for YOLO to 74.3 % mAP for our SSD. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful.
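
To make the multi-scale prediction scheme concrete, the following PyTorch sketch (not the authors' implementation) applies a small \(3 \times 3\) convolutional class predictor and a \(3 \times 3\) convolutional offset predictor to each of several feature maps and concatenates the results into one set of per-box predictions. The channel counts, the number of boxes per location, the number of classes, and the feature map shapes are assumed values for illustration.

```python
# Illustrative sketch of multi-scale convolutional predictors (assumed sizes).
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    def __init__(self, channels=(512, 1024, 512), boxes_per_loc=4, num_classes=21):
        super().__init__()
        self.num_classes = num_classes
        # One class predictor and one box-offset predictor per source feature map.
        self.cls = nn.ModuleList(
            nn.Conv2d(c, boxes_per_loc * num_classes, kernel_size=3, padding=1)
            for c in channels)
        self.loc = nn.ModuleList(
            nn.Conv2d(c, boxes_per_loc * 4, kernel_size=3, padding=1)
            for c in channels)

    def forward(self, feature_maps):
        scores, offsets = [], []
        for fmap, cls, loc in zip(feature_maps, self.cls, self.loc):
            n = fmap.size(0)
            # (N, k*C, H, W) -> (N, H*W*k, C): one row per default box.
            scores.append(cls(fmap).permute(0, 2, 3, 1).reshape(n, -1, self.num_classes))
            offsets.append(loc(fmap).permute(0, 2, 3, 1).reshape(n, -1, 4))
        # Concatenate predictions from all scales into a single set of boxes.
        return torch.cat(scores, dim=1), torch.cat(offsets, dim=1)

# Example with three feature maps of decreasing resolution (assumed shapes).
maps = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19), torch.randn(1, 512, 10, 10)]
scores, offsets = MultiScaleHead()(maps)
print(scores.shape, offsets.shape)  # per-box class scores and box offsets
```

Because each predictor is only a small convolution over an existing feature map, handling an additional scale adds a pair of filters rather than a separate proposal or resampling stage.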
