AdaScale: Towards Real-time Video Object Detection using Adaptive Scaling

SysML 2019 • AdaScale: Towards Real-time Video Object Detection using Adaptive Scaling • Ting-Wu (Rudy) Chin*, Ruizhuo Ding*, Diana Marculescu • ECE Dept., Carnegie Mellon University

Video object detection is one of the key tasks in various emerging applications • Autonomous cars [1] • Autonomous drones [2] • Household robots [3]
1. https://medium.com/udacity/how-the-udacity-self-driving-car-works-575365270a40
2. https://software.intel.com/en-us/articles/object-detection-on-drone-videos-using-caffe-framework
3. Loghmani, Mohammad Reza, Barbara Caputo, and Markus Vincze. "Recognizing objects in-the-wild: Where do we stand?" 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.

Prior art uses image scales to trade speed for accuracy • RetinaNet [5] • YOLOv2 [6]
5. Lin, Tsung-Yi, et al. "Focal loss for dense object detection." ICCV. 2017.
6. Redmon, Joseph, and Ali Farhadi. "YOLO9000: better, faster, stronger." CVPR. 2017.

How to determine which image to scale, and by how much? A regression problem • Scaling down an image (reducing its resolution) may sometimes help • Down-sampling can reduce noise, which in turn reduces false positives

Outline • Motivation • From images to scales: a regression problem • AdaScale methodology • Results

Regressing scales from input images • To generate target labels • Choose a set of discrete scales to broadly cover the scales of interest. • For each image in the training set, evaluate every scale with a metric to identify the best scale.
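The label-generation step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `best_scale_labels`, the `metric` callback, and the convention that a lower metric value is better are all assumptions, and the scale values in the usage note are placeholders.

```python
from typing import Callable, Dict, Hashable, Iterable, Sequence


def best_scale_labels(
    image_ids: Iterable[Hashable],
    scales: Sequence[int],
    metric: Callable[[Hashable, int], float],
) -> Dict[Hashable, int]:
    """Return, for each image, the scale with the lowest metric value.

    `metric(image_id, scale)` is a hypothetical callback that runs the
    detector on the image resized to `scale` and returns a score where
    lower means better (an assumption of this sketch).
    """
    labels = {}
    for image_id in image_ids:
        # Evaluate every candidate scale and keep the best one.
        labels[image_id] = min(scales, key=lambda s: metric(image_id, s))
    return labels
```

For example, calling `best_scale_labels(train_ids, [600, 480, 360, 240], metric)` would return one discrete target scale per training image, which then serves as the regression target.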

Current loss function favors extreme scales • A predicted bounding box will not introduce regression loss if it is assigned to the background (Figure legend: predicted bounding box, ground-truth bounding box)
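The slide's own equation did not survive extraction. As a hedged reconstruction, assuming the usual Fast R-CNN-style multi-task detection loss (the symbols below are that formulation's, not necessarily the authors'), the per-box loss for predicted class scores p, ground-truth class u (with u = 0 meaning background), predicted box t, and ground-truth box v is:

```latex
L(p, u, t, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{reg}}(t, v)
```

The indicator [u ≥ 1] is zero for background boxes, so they contribute only the classification term. Under this assumption, a scale that yields fewer foreground boxes accumulates less total loss, which makes the raw loss an unfair metric for comparing scales.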

Our proposal: only consider the foreground boxes • Sort by loss (Figure legend: foreground bounding box, background bounding box)
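One plausible reading of "only consider the foreground boxes" and "sort by loss" is sketched below. The names `foreground_scores` and `pick_scale` are hypothetical, and truncating every scale to the same number of lowest-loss foreground boxes is an assumption of this sketch, made so that scales with different foreground counts are compared on an equal footing.

```python
from typing import Dict, List


def foreground_scores(losses_per_scale: Dict[int, List[float]]) -> Dict[int, float]:
    """Score each scale using only its foreground-box losses.

    `losses_per_scale` maps a scale to the per-box losses of the
    predicted boxes assigned to the foreground at that scale.
    """
    # Smallest number of foreground boxes found at any scale.
    n_min = min(len(losses) for losses in losses_per_scale.values())
    # Sort each scale's losses ascending and sum the n_min smallest,
    # so every scale is judged on the same number of boxes.
    return {
        scale: sum(sorted(losses)[:n_min])
        for scale, losses in losses_per_scale.items()
    }


def pick_scale(losses_per_scale: Dict[int, List[float]]) -> int:
    """Return the scale with the lowest foreground score.

    Ties (including the case n_min == 0) resolve to the first scale
    in the dictionary's insertion order.
    """
    scores = foreground_scores(losses_per_scale)
    return min(scores, key=scores.get)
```

With losses `{600: [0.2, 0.9, 0.4], 240: [0.5, 0.3]}`, both scales are compared on their two lowest-loss boxes, and `pick_scale` selects 600.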

The overall flow of AdaScale
Training • Fine-tune object detectors with multi-scale training • Generate labels for the scale regressor • Multi-scale training for the scale regressor (freezing the object detector)
Testing • Frame t is scaled and passed through the backbone CNN • The object detector produces detections for frame t • The scale regressor outputs a real-valued scale that is used to scale frame t+1
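The test-time flow can be sketched as a simple loop in which the scale regressed on frame t is applied to frame t+1. All names here (`run_adascale`, `resize`, `detect_and_regress`) are hypothetical stand-ins for the real components, and clamping the regressed scale to a fixed range is an assumption of this sketch.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_adascale(
    frames: Iterable[Any],
    resize: Callable[[Any, float], Any],
    detect_and_regress: Callable[[Any], Tuple[list, float]],
    initial_scale: float,
    min_scale: float,
    max_scale: float,
) -> List[Tuple[float, list]]:
    """Run detection on a video, adapting the input scale frame by frame.

    `detect_and_regress(image)` stands in for the shared backbone followed
    by the object detector and the scale regressor; it returns the
    detections for the current frame and a real-valued scale for the next.
    """
    scale = initial_scale
    outputs = []
    for frame in frames:
        resized = resize(frame, scale)
        detections, next_scale = detect_and_regress(resized)
        outputs.append((scale, detections))
        # Keep the regressed scale inside the supported range (assumption).
        scale = max(min_scale, min(max_scale, next_scale))
    return outputs
```

Because the regressor shares the backbone with the detector, the scale for the next frame comes almost for free once the current frame has been processed.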

AdaScale on ImageNet VID • SS/SS: Single-scale Training, Single-scale Testing • MS/SS: Multi-scale Training, Single-scale Testing • MS/AdaScale: Multi-scale Training, AdaScale Testing (Figure: results for MS/AdaScale, SS/SS, and MS/SS)

Ablation study: multi-scale fine-tuning (Figure: regressed scales)

Qualitative analysis: dynamics of AdaScale

Qualitative analysis: comparison with baseline (Figure panels: SS/SS, MS/AdaScale)

Conclusions • We propose AdaScale, which uses image scaling to improve both speed and accuracy in video object detection instead of trading one for the other. • Our results demonstrate improvements of 1.3 and 2.7 mAP on the ImageNet VID and mini-YoutubeBB datasets, with 1.6x and 1.8x speedups, respectively. • Combined with a state-of-the-art video object detection acceleration technique (i.e., Deep Feature Flow), AdaScale pushes the speedup a further 1.25x with slightly better mAP.

Q & A • Thank you
