Object Detection using Deep Neural Network. Wan-Ru, Lin 2016/10/27. Outline. Introduction Background R-CNN (2014) SPPnet (2014) – speedup R-CNN Fast R-CNN (2015) Faster R-CNN (2015) YOLO (2015). Introduction. Object detection has long been an interesting task in computer vision
Object Detection using Deep Neural Network Wan-Ru, Lin 2016/10/27
Outline • Introduction • Background • R-CNN (2014) • SPPnet (2014) – speedup R-CNN • Fast R-CNN (2015) • Faster R-CNN (2015) • YOLO (2015)
Introduction • Object detection has long been an interesting task in computer vision • Location (x,y,w,h) • Classification
Introduction • Before Fast R-CNN (2015)… • After Fast R-CNN… [Figure: two pipeline diagrams, each built from region proposal, feature extraction, and classifier stages and producing the label "cat"] [R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015]
Introduction • Timeline: R-CNN (2014) • Fast R-CNN (2015) • Faster R-CNN (2015) • YOLO (2015)
Background • Convolutional Neural Network (CNN) • Convolution • Nonlinearity (sigmoid, ReLU) • Pooling • Structure: feature extractor followed by a classifier
Background • Pooling • Reduces the spatial size • Makes features more translation invariant • Loss function • Error backpropagation
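The two pooling properties above can be illustrated with a minimal sketch. It assumes 2x2 max pooling with stride 2 on a single-channel feature map stored as nested lists; the function name `max_pool2d` and the toy inputs are hypothetical, not taken from the slides.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max-pool a 2-D list of numbers with a square window."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect every value under the window and keep the largest.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


fmap = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]
# Same map with the activation "1" shifted one pixel to the right,
# still inside the same 2x2 pooling window.
shifted = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]

pooled = max_pool2d(fmap)             # 4x4 input becomes 2x2
pooled_shifted = max_pool2d(shifted)  # identical despite the shift
```

The invariance only holds for shifts that stay inside one pooling window; larger shifts do change the pooled output.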
Background • PASCAL VOC • Location • Class • 20 classes • Person: person • Animal: bird, cat, cow, dog, horse, sheep • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
Background • Pre-training • ILSVRC dataset, ~1.2 million images • Fine-tuning • PASCAL VOC 2012
R-CNN • Multi-stage pipeline [Figure labels: Selective Search, SVM]
R-CNN • Selective Search • Generate possible object locations
R-CNN • Training • Supervised pre-training: ILSVRC 2012 • Domain-specific fine-tuning: • warp input • output number: 1000 -> 20 + 1 (background) • SVM • Separate data with a hyperplane
R-CNN • Disadvantages of R-CNN • Distortion due to warping • Training is a multi-stage pipeline • Training is expensive in space and time • Object detection is slow • With VGG16, detection takes 47 s/image
SPPnet • Share feature map • Fixed-length feature • Assume bins • ROI size : • Pooling window size = • Avoid image warping
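To make the fixed-length idea concrete, here is a minimal sketch of spatial pyramid pooling on a single-channel ROI. It assumes each pyramid level splits the ROI into n x n bins and max-pools each bin; the bin boundaries (floor for the start, ceiling for the end), the levels (1, 2, 4), and the helper names are illustrative assumptions, not the slide's exact formulas.

```python
import math


def spp_level(roi, n):
    """Max-pool a 2-D ROI (list of rows) into n x n bins, row-major."""
    h, w = len(roi), len(roi[0])
    out = []
    for i in range(n):
        # Bin i covers rows [r0, r1); floor/ceil keeps every bin non-empty.
        r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
        for j in range(n):
            c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
            out.append(max(roi[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
    return out


def spp(roi, levels=(1, 2, 4)):
    """Concatenate the pooled bins of every pyramid level."""
    features = []
    for n in levels:
        features.extend(spp_level(roi, n))
    return features


def make_roi(h, w):
    """Toy ROI whose values increase left to right, top to bottom."""
    return [[r * w + c for c in range(w)] for r in range(h)]


small = spp(make_roi(5, 7))    # 5x7 ROI
large = spp(make_roi(13, 9))   # 13x9 ROI
# Both vectors have 1 + 4 + 16 = 21 entries, whatever the ROI size.
```

Because the number of bins is fixed and only the window size changes with the ROI, no image warping is needed before the fully connected layers. A real network applies this per channel of the shared feature map.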
SPPnet • Sharing feature maps speeds up R-CNN • Achieves mAP comparable to R-CNN
Fast R-CNN • Selective Search: ~2K region proposals • RoI pooling as a 1-scale SPP layer (7x7) • Single-stage training • Training can update all network layers
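Fast R-CNN • Multi-task loss • Output: class probabilities p and bounding-box regression offsets t • v: ground-truth bounding-box regression target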
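As a sketch of the multi-task loss, the Fast R-CNN paper combines a classification term and a localization term: L = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v), where u is the ground-truth class (0 is background), L_cls = -log p_u, and L_loc sums a smooth L1 penalty over the four box coordinates. The function names and toy numbers below are assumptions for illustration only.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Classification log loss plus box regression loss for one RoI.

    p   : class probabilities, index 0 is background
    u   : ground-truth class index
    t_u : predicted box offsets for class u (4 values)
    v   : ground-truth box regression target (4 values)
    lam : weight of the localization term (lambda)
    """
    cls_loss = -math.log(p[u])
    if u >= 1:
        loc_loss = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v))
    else:
        # Background RoIs have no ground-truth box, so no box loss.
        loc_loss = 0.0
    return cls_loss + lam * loc_loss


# Background RoI: only the classification term counts.
background = multi_task_loss([0.5, 0.25, 0.25], 0, [0.0] * 4, [9.0] * 4)
# Foreground RoI: both terms contribute.
foreground = multi_task_loss([0.25, 0.5, 0.25], 1,
                             [0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```

Training both terms through one loss is what allows single-stage training instead of separate classifier and regressor stages.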
Fast R-CNN • Contributions • Higher mAP than R-CNN and SPPnet • Training is single-stage, using multi-task loss • Training can update all network layers • No disk storage is required for feature caching
Faster R-CNN • Selective Search takes up much of the running time • Faster R-CNN = Fast R-CNN + region proposal network (RPN)
Faster R-CNN • Region proposal network (RPN) • Pick the top-ranked 100 proposals at test time
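The test-time selection step can be sketched as a plain top-k over objectness scores. This is a simplified illustration: the helper name `top_indices` and the synthetic boxes and scores are hypothetical, and the non-maximum suppression that usually precedes this step is left out.

```python
def top_indices(scores, k=100):
    """Return the indices of the k highest-scoring proposals."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# 300 synthetic proposals as (x_min, y_min, x_max, y_max) with distinct scores.
boxes = [(i, i, i + 10, i + 10) for i in range(300)]
scores = [((37 * i) % 300) / 300 for i in range(300)]

keep = top_indices(scores, k=100)
kept_boxes = [boxes[i] for i in keep]  # proposals passed on to the detector
```

Keeping fewer proposals reduces the work in the detection head, at the risk of dropping low-scoring true objects.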
Faster R-CNN • Timing(ms)
Faster R-CNN • Contributions • Presents RPNs for efficient and accurate region proposal generation • Shares convolutional features between region proposal and object detection
YOLO • Uses features from the entire image to predict each bounding box • A single neural network covers: • Region proposal • Feature extraction • Classification • Bounding box regression
YOLO • Divide the input image into a grid • Each grid cell • predicts 2 bounding boxes (x, y, w, h) • predicts a confidence score for each bounding box • predicts class probabilities
YOLO • [Figure: example box pairs with IOU = 0.8 and IOU = 0.3] • Output number =
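A short sketch of the two quantities on this slide: intersection over union (IOU) between two boxes, and the size of the network output. The output size assumes the YOLO paper's setup of a 7 x 7 grid (S = 7), B = 2 boxes per cell with 5 values each (x, y, w, h, confidence), and C = 20 PASCAL VOC classes; the function names are hypothetical, and boxes are given as corner coordinates for simplicity.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def yolo_output_size(s=7, b=2, c=20):
    """Number of output values: S * S * (B * 5 + C)."""
    return s * s * (b * 5 + c)


good_overlap = iou((0, 0, 10, 10), (0, 0, 10, 8))   # 80 / 100
no_overlap = iou((0, 0, 10, 10), (20, 20, 30, 30))  # disjoint boxes
outputs = yolo_output_size()                        # 7 * 7 * 30
```

Under these assumptions the network emits a 7 x 7 x 30 tensor, and each box confidence is trained to reflect the IOU between the predicted box and the ground truth.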
YOLO • VOC 2007
YOLO • Limitations • Struggles with small objects that appear in groups • Struggles to generalize to objects in new or unusual aspect ratios or configurations
References
[1] R. Girshick et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[2] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. "Selective search for object recognition." International Journal of Computer Vision, 104(2):154–171, 2013.
[3] R. B. Girshick. "Fast R-CNN." CoRR, abs/1504.08083, 2015.
[4] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." arXiv preprint arXiv:1506.01497, 2015.
[5] J. Redmon et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640, 2015.
[6] K. He et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision, Springer International Publishing, 2014.