Programme • 2pm Introduction • Andrew Zisserman, Chris Williams • 2.10pm Overview of the challenge and results • Mark Everingham (Oxford) • 2.40pm Session 1: The Classification Task • Frederic Jurie presenting work by • Jianguo Zhang (INRIA) 20 mins • Frederic Jurie (INRIA) 20 mins • Thomas Deselaers (Aachen) 20 mins • Jason Farquhar (Southampton) 20 mins • 4-4.30pm Coffee break • 4.30pm Session 2: The Detection Task • Stefan Duffner/Christophe Garcia (France Telecom) 30 mins • Mario Fritz (Darmstadt) 30 mins • 5.30pm Discussion • Lessons learnt, and future challenges
The PASCAL Visual Object Classes Challenge • Mark Everingham • Luc Van Gool • Chris Williams • Andrew Zisserman
Challenge • Four object classes • Motorbikes • Bicycles • People • Cars • Classification • Predict object present/absent • Detection • Predict bounding boxes of objects
Competitions • Train on any (non-test) data • How well do state-of-the-art methods perform on these problems? • Which methods perform best? • Train on supplied data • Which methods perform best given specified training data?
Data sets • train, val, test1 • Sampled from the same distribution of images • Images taken from PASCAL image databases • “Easier” challenge • test2 • Freshly collected for the challenge (mostly Google Images) • “Harder” challenge
Training and first test set • [Slide shows example images from train+val and test1]
Second test set • [Slide shows example images from test2]
Annotation for training • Object class present/absent • Sub-class labels (partial) • Car side, Car rear, etc. • Bounding boxes • Segmentation masks (partial)
Issues in ground truth • What objects should be considered detectable? • Subjective judgement by size in image, level of occlusion, detection without ‘inference’ • Disagreements cause noise in evaluation, i.e. incorrectly judged false positives • “Errors” in training data • Unannotated objects • Requires machine learning algorithms robust to noise on class labels • Inaccurate bounding boxes • Hard to specify for some instances, e.g. bicycles • Detection threshold was set “liberally”
Methods • Interest points (LoG/Harris) + patches/SIFT • Histogram of clustered descriptors • SVM: INRIA: Dalal, INRIA: Zhang • Log-linear model: Aachen • Logistic regression: Edinburgh • Other: METU • No clustering step • SVM with other kernels: MPITuebingen, Southampton • Additional features • Color: METU, moments: Southampton
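The dominant pipeline above (interest points → descriptors → clustered codebook → histogram → classifier) can be sketched in a few lines. This is a toy illustration with made-up 2-D "descriptors" and a 3-word codebook, not any entrant's actual features or codebook:

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Quantize local descriptors against a learned codebook and return
    a normalized bag-of-visual-words histogram (one bin per cluster)."""
    # distance from every descriptor to every cluster center
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = d.argmin(axis=1)  # nearest-center assignment
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()  # L1-normalize so image size does not matter

# toy example: 2-D "descriptors" drawn near codebook word 1
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
desc = rng.normal(loc=centers[1], scale=0.1, size=(20, 2))
h = bow_histogram(desc, centers)
```

The resulting fixed-length histogram is what the SVM, log-linear, or logistic-regression classifiers listed above consume.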
Methods • Image segmentation and region features: HUT • MPEG-7 color, shape, etc. • Self organizing map • Classification by detection: Darmstadt • Generalized Hough transform/SVM verification
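The Generalized Hough Transform used for classification-by-detection can be sketched as Implicit-Shape-Model-style voting: each local feature matched to a codebook word votes for object centers via that word's stored offsets. This is a simplified illustration with hand-picked offsets, not the Darmstadt implementation:

```python
import numpy as np

def ght_votes(features, offsets, shape):
    """Cast Hough votes for object centers: each (x, y, word) feature votes
    with the feature-to-center offsets stored for its codebook word."""
    acc = np.zeros(shape)
    for (fx, fy, word) in features:
        for (dx, dy) in offsets.get(word, []):
            cx, cy = fx + dx, fy + dy
            if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
                acc[cy, cx] += 1  # accumulate evidence for a center at (cx, cy)
    return acc

# two features matched to different words, both consistent with a center at (5, 5)
offsets = {0: [(2, 2)], 1: [(-2, -2)]}
features = [(3, 3, 0), (7, 7, 1)]
acc = ght_votes(features, offsets, (10, 10))
```

Peaks in the accumulator are candidate detections; a verification stage (the SVM mentioned above) then rescoring them.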
Evaluation • Receiver Operating Characteristic (ROC) • Equal Error Rate (EER) • Area Under Curve (AUC)
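Both measures come from the ROC curve. A minimal sketch of how they might be computed from classifier scores, assuming (as the result slides suggest) that EER is reported as a high-is-good accuracy at the operating point where true-positive rate equals true-negative rate:

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC points from scores (higher score = more confident positive)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y == 1) / max((y == 1).sum(), 1)])
    fpr = np.concatenate([[0.0], np.cumsum(y == 0) / max((y == 0).sum(), 1)])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve (trapezoidal rule)."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

def eer_point(fpr, tpr):
    """Accuracy at the operating point where TPR ~= 1 - FPR."""
    i = int(np.argmin(np.abs(tpr - (1.0 - fpr))))
    return float((tpr[i] + 1.0 - fpr[i]) / 2.0)

# perfectly separable toy scores: both measures should reach 1.0
fpr, tpr = roc_curve([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```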
Competition 1: train+val/test1 • 1.1: Motorbikes • Max EER: 0.977 (INRIA: Jurie)
Competition 1: train+val/test1 • 1.2: Bicycles • Max EER: 0.930 (INRIA: Jurie, INRIA: Zhang)
Competition 1: train+val/test1 • 1.3: People • Max EER: 0.917 (INRIA: Jurie, INRIA: Zhang)
Competition 1: train+val/test1 • 1.4: Cars • Max EER: 0.961 (INRIA: Jurie)
Competition 2: train+val/test2 • 2.1: Motorbikes • Max EER: 0.798 (INRIA: Zhang)
Competition 2: train+val/test2 • 2.2: Bicycles • Max EER: 0.728 (INRIA: Zhang)
Competition 2: train+val/test2 • 2.3: People • Max EER: 0.719 (INRIA: Zhang)
Competition 2: train+val/test2 • 2.4: Cars • Max EER: 0.720 (INRIA: Zhang)
Classes and test1 vs. test2 • Mean EER of ‘best’ results across classes • test1: 0.946, test2: 0.741
Conclusions? • Interest points + SIFT + clustering (histogram) + SVM did ‘best’ • Log-linear model (Aachen) a close second • Results with SVM (INRIA) significantly better than with logistic regression (Edinburgh) • The detection-based method (Darmstadt) did less well • Cannot exploit image context (= unintended bias?) • Used a subset of the training data, but is able to localize
Competitions 3 & 4 • Classification • Any (non-test) training data to be used • No entries submitted
Methods • Generalized Hough Transform • Interest points, clustered patches/descriptors, GHT • Darmstadt: (SVM verification stage), side views with segmentation mask used for training • INRIA: Dorko: SIFT features, semi-supervised clustering, single detection per image • “Sliding window” classifiers • Exhaustive search over translation and scale • FranceTelecom: Convolutional neural network • INRIA: Dalal: SVM with SIFT-based input representation
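The "sliding window" strategy above can be sketched: slide a fixed-size window over the image at several scales, score each window with a classifier, and keep high-scoring boxes. The sub-sampling rescale and mean-brightness scorer here are toy stand-ins, not the FranceTelecom or INRIA classifiers:

```python
import numpy as np

def sliding_window_detect(image, score_fn, win=24, scales=(1.0, 0.5),
                          stride=8, thresh=0.5):
    """Exhaustive search over translation and scale; returns
    (x, y, w, h, score) boxes whose score exceeds the threshold."""
    detections = []
    for s in scales:
        step = int(round(1.0 / s))   # crude rescaling by sub-sampling
        small = image[::step, ::step]
        H, W = small.shape
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                score = score_fn(small[y:y + win, x:x + win])
                if score > thresh:
                    # map the window back to original-image coordinates
                    detections.append((x * step, y * step,
                                       win * step, win * step, float(score)))
    return detections

# toy image: a bright 24x24 "object" at (8, 8); the scorer is mean brightness
img = np.zeros((64, 64))
img[8:32, 8:32] = 1.0
dets = sliding_window_detect(img, np.mean, scales=(1.0,), thresh=0.9)
```

Real systems add non-maximum suppression so overlapping windows yield one detection per object, which matters under the evaluation rule below that penalizes duplicates.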
Methods • Baselines: Edinburgh • Detection confidence • class prior probability • Whole-image classifier (SIFT + logistic regression) • Bounding box • Entire image • Scale-normalized mean bounding box from training data • Bounding box of all interest points • Bounding box of interest points weighted by ‘class purity’
Evaluation • Correct detection: ≥50% overlap between predicted and ground-truth bounding boxes • Multiple detections of one object counted as (one true +) false positives • Precision/Recall • Average Precision (AP) as defined by TREC • Mean of precision interpolated at recall = 0, 0.1, …, 0.9, 1
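The two ingredients above are easy to state in code: an area-of-intersection over area-of-union overlap test, and the TREC-style interpolated AP. A minimal sketch:

```python
def overlap(a, b):
    """Intersection-over-union for (x1, y1, x2, y2) boxes; a correct
    detection requires this to reach the 50% threshold."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(recall, precision):
    """TREC-style AP: mean of interpolated precision (the maximum precision
    at or beyond each recall level) at recall 0, 0.1, ..., 1."""
    ap = 0.0
    for t in (i / 10.0 for i in range(11)):
        above = [p for r, p in zip(recall, precision) if r >= t]
        ap += (max(above) if above else 0.0) / 11.0
    return ap
```

A method that never passes recall 0.5 is thus capped near AP 0.5 however precise it is, which is relevant to the side-view-only entries discussed later.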
Competition 5: train+val/test1 • 5.1: Motorbikes • Max AP: 0.886 (Darmstadt)
Competition 5: train+val/test1 • 5.2: Bicycles • Max AP: 0.119 (Edinburgh)
Competition 5: train+val/test1 • 5.3: People • Max AP: 0.013 (INRIA: Dalal)
Competition 5: train+val/test1 • 5.4: Cars • Max AP: 0.613 (INRIA: Dalal)
Competition 6: train+val/test2 • 6.1: Motorbikes • Max AP: 0.341 (Darmstadt)
Competition 6: train+val/test2 • 6.2: Bicycles • Max AP: 0.113 (Edinburgh)
Competition 6: train+val/test2 • 6.3: People • Max AP: 0.021 (INRIA: Dalal)
Competition 6: train+val/test2 • 6.4: Cars • Max AP: 0.304 (INRIA: Dalal)
Classes and test1 vs. test2 • Mean AP of ‘best’ results across classes • test1: 0.408, test2: 0.195
Conclusions? • GHT (Darmstadt) method did ‘best’ on classes entered • SVM verification stage effective • Limited to lower recall (by use of only side views) • SVM (INRIA: Dalal) comparable for cars, better on test2 • Smaller objects?, higher recall • Performance on bicycles, people was ‘poor’ • “Non-solid” objects, articulation?
Competition 7: any train/test1 • One entry: 7.3: people (INRIA: Dalal) • AP: 0.416 • Use of own training data improved results dramatically (vs. AP: 0.013 with the supplied training data)
Competition 8: any train/test2 • One entry: 8.3: people (INRIA: Dalal) • AP: 0.438 • Use of own training data improved results dramatically (vs. AP: 0.021 with the supplied training data)