Detecting Cartoons: A Case Study in Automatic Video-Genre Classification

Tzvetanka Ianeva, Arjen de Vries, Hein Röhrig

Outline • Goal: remove cartoons from search results in TREC-2002 video track • Our Approach: extract Image Descriptors & SVM Machine Learning • Related work • Novel Descriptors from Granulometry • SVM Learning • Experimental Results

TREC-2002 video track • TREC: workshops for large-scale evaluation of information retrieval technology • CWI participation: Probabilistic Multimedia Retrieval Model • Does not sufficiently distinguish “cartoons”

Example of undesirable ‘cartoon’ results • Query • Best matches returned

Related work • M. Roach et al., Motion-based classification of cartoons (2001) • B. T. Truong et al., Automatic genre identification for content-based video categorization (2000) • J. R. Smith et al., Searching for images and videos on the World Wide Web • N. C. Rowe et al., Automatic caption localization for photographs on WWW pages • V. Athitsos et al. [ASF], Distinguishing photographs and graphics on the WWW

Cartoons • What is a cartoon? • Cartoons do not contain any photographic material • Photos: taken with a photographic camera • Finding cartoons appears easy • Typical cues: few, simple, strong colors; patches of uniform color; strong black edges; text

Quiz: Cartoon or Photo?

Examples not so Typical

Photos like cartoons

“Cartoons” like photos

Artificial photos

Small cues

Overlapping Frames

Mixed

Shadow & Sparkle

Image Descriptors • Input image (240x352x3) -> vector of 148 image descriptors (components 1, 2, …, 148) • greater correlation • normalized • Example (avg. sat., thresh. brightness): 0.6231, 0.9266 for one image; 0.2880, 0.4125 for the other

Overview of all our image descriptors (descriptor: dimension) • average saturation: 1 • threshold brightness: 1 • color histogram: 45 • edge-direction histogram: 40 • compression ratio: 1 • multi-scale pattern spectrum: 60 • Total: 148

Brightness and Saturation • HSV color model • Cartoons brighter => use % pixels with Value > 0.4 • Cartoons have strong colors => use average Saturation
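The two scalar descriptors above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes pixels are given as (r, g, b) tuples in [0, 1], and the function name and the two tiny example "images" are hypothetical.

```python
import colorsys

def brightness_saturation(pixels, value_threshold=0.4):
    """Return (average saturation, fraction of pixels with V > threshold).

    `pixels` is a flat list of (r, g, b) tuples with components in [0, 1].
    """
    if not pixels:
        raise ValueError("image has no pixels")
    total_saturation = 0.0
    bright_pixels = 0
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r, g, b)  # HSV color model
        total_saturation += s
        if v > value_threshold:
            bright_pixels += 1
    n = len(pixels)
    return total_saturation / n, bright_pixels / n

# Hypothetical 4-pixel examples: saturated bright colors vs. dull dark ones.
cartoon_like = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
photo_like = [(0.3, 0.25, 0.2), (0.2, 0.2, 0.2), (0.5, 0.45, 0.4), (0.1, 0.1, 0.1)]
```

Calling `brightness_saturation(cartoon_like)` and `brightness_saturation(photo_like)` gives one (saturation, brightness) pair per image, which is the shape of the two-number examples shown on the descriptor slides.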

Saturation in cartoon and photo images • Panels: RGB image and S channel (HSV) for each image • Average saturation values shown: 0.6231, 0.2880

Brightness in cartoon and photo images • Panels: RGB image and V channel (HSV) for each image • Threshold brightness values shown: 0.9266, 0.4125

Histograms • Image $I : X \times Y \to \mathbb{R}^c$ • Filter $F : I \mapsto I'$ • Bins $B_k$: partition of $\mathbb{R}^c$ • $h_k = \#\{ (x,y) : I'(x,y) \in B_k \}$ • E.g. brightness metric: $I$ grayscale, $c = 1$, $B_1 = [0, 0.4]$, $B_2 = (0.4, 1]$, return $h_2$
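The generic histogram scheme above (filter the image, then count pixels per bin) might look as follows. This is a sketch under assumptions: images are lists of rows, the filter is a function from image to image, and bins are described by a hypothetical `bin_index` function. The value 0.4 itself is assigned to the first bin, matching the "Value > 0.4" rule used for the brightness descriptor.

```python
def histogram(image, image_filter, bin_index, num_bins):
    """Return [h_1, ..., h_K]: number of filtered pixels falling in each bin."""
    filtered = image_filter(image)          # I' = F(I)
    counts = [0] * num_bins
    for row in filtered:
        for value in row:
            counts[bin_index(value)] += 1   # pixel belongs to bin B_k
    return counts

def identity(image):
    """Trivial filter: leave the image unchanged."""
    return image

def brightness_bin(value, threshold=0.4):
    """Bin 0 holds values up to the threshold, bin 1 holds values above it."""
    return 1 if value > threshold else 0

# Hypothetical 2x3 grayscale image with intensities in [0, 1].
gray = [[0.1, 0.5, 0.9], [0.4, 0.3, 0.8]]
counts = histogram(gray, identity, brightness_bin, 2)
h2 = counts[1]  # the brightness metric returns the count of the second bin
```

Dividing `h2` by the number of pixels turns the count into the percentage used by the threshold brightness descriptor.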

Color Histogram • More general than brightness & saturation • Again HSV color space • Partition HSV into 3x3x5 = 45 bins • Cartoons have fewer colors => col. hist. desc.
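A 45-bin HSV histogram can be sketched as below. The slide does not say which of the three channels receives the 5 bins, so the default `bins=(3, 3, 5)` in the order (hue, saturation, value) is an assumption, as are the function name and the normalization by pixel count.

```python
import colorsys

def hsv_color_histogram(pixels, bins=(3, 3, 5)):
    """Normalized HSV histogram; `bins` = (hue, saturation, value) bin counts."""
    if not pixels:
        raise ValueError("image has no pixels")
    h_bins, s_bins, v_bins = bins
    counts = [0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        # Clamp so that a component equal to 1.0 falls into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        counts[(hi * s_bins + si) * v_bins + vi] += 1
    n = len(pixels)
    return [c / n for c in counts]

# Hypothetical example: three pure-red pixels and one dark gray pixel.
example_pixels = [(1.0, 0.0, 0.0)] * 3 + [(0.1, 0.1, 0.1)]
color_hist = hsv_color_histogram(example_pixels)
used_bins = sum(1 for value in color_hist if value > 0)
```

An image with few distinct colors concentrates its mass in a handful of bins (`used_bins` is small), which is the property the slide attributes to cartoons.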

Color histogram in the 45-bin HSV partition

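Edge detection • Cartoons have strong black edges => edge-direction histogram • Approx. total derivative of intensity: $\nabla I(x,y) = \left( \frac{\partial I}{\partial x}(x,y),\ \frac{\partial I}{\partial y}(x,y) \right)$ • Approx. magnitude $\lVert \nabla I \rVert$ and angle $\theta$ • Histogram of $(\theta, \lVert \nabla I \rVert)$ • 5 intervals for $\lVert \nabla I \rVert$: $0 \dots \sqrt{20}$ • 8 intervals for $\theta$: $0 \dots 2\pi$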
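The joint angle/magnitude histogram can be sketched as follows. The slides do not name the derivative approximation; the sketch assumes a 3x3 Sobel operator, which is consistent with the stated magnitude range because sqrt(20) is the largest Sobel gradient magnitude possible for intensities in [0, 1]. Skipping border pixels and normalizing by pixel count are further assumptions.

```python
import math

def sobel_edge_histogram(gray, mag_bins=5, angle_bins=8):
    """Joint histogram of gradient angle and magnitude (Sobel approximation).

    `gray` is a list of rows with intensities in [0, 1]. Border pixels are
    skipped. Returns a normalized list of angle_bins * mag_bins values.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    max_mag = math.sqrt(20.0)  # largest Sobel magnitude for inputs in [0, 1]
    counts = [0] * (angle_bins * mag_bins)
    total = 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical Sobel responses at (x, y).
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)  # map to [0, 2*pi)
            m = min(int(magnitude / max_mag * mag_bins), mag_bins - 1)
            a = min(int(angle / (2 * math.pi) * angle_bins), angle_bins - 1)
            counts[a * mag_bins + m] += 1
            total += 1
    return [c / total for c in counts]

# Hypothetical 4x4 image with a sharp vertical black/white edge.
step_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edge_hist = sobel_edge_histogram(step_edge)
```

With the default 5 magnitude bins and 8 angle bins the result has 40 entries, matching the dimension listed for the edge-direction histogram.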

Edge angles & edge magnitudes

Edge histogram

Compressibility • Example compression ratios shown: 0.23365, 0.13548 • Cartoons: simpler composition • Detect complexity by measuring compression ratio • Theory: “Kolmogorov complexity” • Our application: use lossless PNG compression • Lossy JPEG not useful
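The idea of using a lossless compressor as a complexity estimate can be illustrated without an image library. The sketch below substitutes raw zlib (DEFLATE, the algorithm PNG uses internally) for an actual PNG encoder, so it omits PNG's filtering step. Defining the ratio as compressed size over raw size is an assumption, and the two byte strings are hypothetical stand-ins for image data.

```python
import random
import zlib

def compression_ratio(raw_bytes, level=9):
    """Compressed size divided by raw size (smaller means simpler content)."""
    if not raw_bytes:
        raise ValueError("no data to compress")
    return len(zlib.compress(raw_bytes, level)) / len(raw_bytes)

rng = random.Random(0)  # fixed seed so the example is reproducible
flat_patches = bytes([200] * 2048 + [30] * 2048)   # two uniform areas
noisy_texture = bytes(rng.randrange(256) for _ in range(4096))

ratio_flat = compression_ratio(flat_patches)
ratio_noisy = compression_ratio(noisy_texture)
```

Data made of uniform patches compresses far better than noise-like data, so `ratio_flat` is much smaller than `ratio_noisy`.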

Granulometries • Idea: measure size distribution of objects • How? Openings by a structuring element of growing scale • Normalized size distribution • Derivative = pattern spectrum

Openings • Opening = erosion followed by dilation with the same structuring element (SE)

Structuring Elements • Non-flat parabola better(?) than flat disk • Parabola: efficient computation, symmetry

Small-scale pattern spectrum descriptors • SE: disk of radius $r_i = i$, $i = 1, \dots, 20$
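A granulometry with flat disk structuring elements can be sketched in plain Python for a single-channel image. This illustrates the technique only: the border handling (pixels outside the image are ignored), the discrete disk definition, and normalizing by total intensity are assumptions, and the non-flat parabolic SE mentioned earlier is not covered.

```python
def disk_offsets(radius):
    """Offsets (dy, dx) of a flat discrete disk structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]

def _morph(image, offsets, pick):
    """Take min (erosion) or max (dilation) over the SE at every pixel,
    ignoring SE positions that fall outside the image."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [image[y + dy][x + dx] for dy, dx in offsets
                      if 0 <= y + dy < height and 0 <= x + dx < width]
            row.append(pick(values))
        out.append(row)
    return out

def opening(image, radius):
    """Erosion followed by dilation with the same disk."""
    offsets = disk_offsets(radius)
    return _morph(_morph(image, offsets, min), offsets, max)

def pattern_spectrum(image, max_radius):
    """Normalized drop in total intensity between successive openings."""
    total = sum(map(sum, image))
    if total == 0:
        raise ValueError("image has zero total intensity")
    volumes = [total] + [sum(map(sum, opening(image, r)))
                         for r in range(1, max_radius + 1)]
    return [(volumes[i] - volumes[i + 1]) / total for i in range(max_radius)]

# Hypothetical 9x9 image: one isolated pixel and one radius-1 disk (a plus).
image = [[0] * 9 for _ in range(9)]
image[1][1] = 1
for dy, dx in disk_offsets(1):
    image[5 + dy][5 + dx] = 1
spectrum = pattern_spectrum(image, 2)
```

In the example the isolated pixel disappears at radius 1 and the small disk at radius 2, so the spectrum records how much image content lives at each scale.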

SVM Learning • Simplest case: a linear separator exists • SVM finds hyperplane with largest margin • Closest points = Support Vectors

SVM Learning: nonseparable • Noisy data: no separating hyperplane at all! • Solution: penalty C for points inside the margin • C-SVMs

SVM = quadratic programming SVM task: Equivalent dual problem:
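The formulas on this slide did not survive extraction. For reference, the standard textbook soft-margin (C-SVM) formulation reads as follows; it is presumably close to what was shown, but the notation ($x_i$ training vectors, $y_i \in \{-1, +1\}$ labels, $\xi_i$ slack variables, $\alpha_i$ dual variables) is an assumption.

```latex
\begin{aligned}
\text{Primal:}\quad
  & \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
  \quad\text{s.t.}\quad y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\quad \xi_i \ge 0 \\
\text{Dual:}\quad
  & \max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
    - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,(x_i\cdot x_j)
  \quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0
\end{aligned}
```

Both problems have a quadratic objective with linear constraints, which is why the slide title equates SVM training with quadratic programming.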

SVM with kernels SVM task: Equivalent dual problem:

SVM kernels • RBF kernels • Polynomial kernels
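The two kernel families can be written down directly. The kernel formulas on the slide were lost, so the exact parameterization is an assumption: the RBF kernel below uses exp(-||x - y||^2 / (2 σ²)), and the polynomial kernel uses an illustrative degree and offset.

```python
import math

def rbf_kernel(x, y, sigma2):
    """exp(-||x - y||^2 / (2 * sigma2)); the factor 2 is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma2))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    """(x . y + offset) ** degree; degree and offset are illustrative."""
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree

# Identical vectors give the maximum RBF value of 1.
same = rbf_kernel([0.2, 0.7], [0.2, 0.7], sigma2=0.1)
```

A smaller `sigma2` makes the RBF kernel fall off faster with distance, which is the parameter tuned per descriptor in the experiments.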

SVM with kernels: decision function SVM task: Equivalent dual problem: Decision function:
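The formulas for the kernelized case are also missing from the extracted slides. In the standard formulation, which is assumed here rather than recovered from the source, the inner product $x_i \cdot x_j$ of the dual is replaced by a kernel $K(x_i, x_j)$, and a new point $x$ is classified from the support vectors alone:

```latex
\begin{aligned}
\text{Dual:}\quad
  & \max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
    - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
  \quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0 \\
\text{Decision function:}\quad
  & f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i, x) + b\Bigr)
\end{aligned}
```

Only training points with $\alpha_i > 0$ (the support vectors) contribute to the sum in $f(x)$.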

Experimental Data • Key frames from TREC 2002 Video Track • 13,026 photographic images • 1,620 cartoons • Manually classified • Experiments 1-3: train on 3,908 photos and 486 cartoons chosen at random

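Experiment 1: individual performance • Total error: $E_t = \frac{|p|}{|p|+|c|} E_p + \frac{|c|}{|p|+|c|} E_c$ ($E_p$, $E_c$: error on photos and cartoons; $|p|$, $|c|$: number of photos and cartoons) • $\sigma^2$ values listed in the results table: $\sigma^2 = 0.1$; $0.05 < \sigma^2 < 0.5$; $\sigma^2 = 0.07$; $0.05 < \sigma^2 < 0.5$; $0.05 < \sigma^2 < 0.5$; $\sigma^2 = 0.07$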
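The total error on this slide appears to be the class-size-weighted average of the photo and cartoon error rates. A minimal sketch, with a hypothetical function name and made-up numbers used purely to show the arithmetic:

```python
def total_error(error_photos, error_cartoons, num_photos, num_cartoons):
    """Class-size-weighted total error E_t from per-class error rates."""
    n = num_photos + num_cartoons
    return (num_photos * error_photos + num_cartoons * error_cartoons) / n

# Hypothetical values, not results from the experiments:
# 10% error on 300 photos and 20% error on 100 cartoons.
example_total = total_error(0.10, 0.20, 300, 100)
```

Because photos far outnumber cartoons in the data set, the total error is dominated by the photo error rate, which is why the per-class errors are worth reporting separately.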

Experiment 2: “convergence” of SVM learning (Pattern spectrum)

Experiment 3: combined performance • $\sigma^2 = 0.06$

Experiment 4: web-image classifier on our data • Test set: random 1,000 photos and 1,000 cartoons

Experiment 5: Performance on web images • Comparison with 14,039 photographic and 9,512 graphical images harvested from the WWW • Train on 4,239 photographic and 2,826 graphical images chosen at random • Additional features: image dimensions and file type

Conclusions • Hard task: good classifier • Use dynamics/spatio-temporal relations? • Semantic gap? • Combine classifiers? • Granulometry not enough
