A Hybrid Diagnosis to Real-time Image De-duplication. Global Media – Photo track Hong-Ming Chen hmchen@yahoo-inc.com. Image duplication. Sometimes it is good for art. But it is annoying for most of the other time …. Case 1: Yahoo! News .
A Hybrid Diagnosis to Real-time Image De-duplication Global Media – Photo track Hong-Ming Chen hmchen@yahoo-inc.com
Image duplication • Sometimes it is good for art. • But most of the time it is annoying …
Our Recipe for Yahoo! • A hybrid real-time de-duplication system • Submitted to Yahoo! Tech Pulse 2012 • Will be in production soon
Concerns and solutions • Users' concerns: 1. Short response time (faster than fast!) 2. Good de-dup result (sweep off all the duplicates, keep everything else) • Solutions: 1. Fast approach: "fingerprint" comparison per image pair; not accurate enough 2. Accurate approach: sophisticated image matching; impossible in real time
Difficulty and limitation 1/2 • Huge computation vs. real time • Pair-wise comparison • # = C(N, 2) = N(N-1)/2, where N is the total number of images. • Computation grows quadratically with the size of the image set. • N = 10, # = 45 • N = 20, # = 190 • N = 100, # = 4950
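The pair counts on this slide follow directly from C(N, 2) = N(N-1)/2. A minimal check is sketched below; the function name `pair_count` is only illustrative and does not come from the slides.

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered image pairs among n images, i.e. C(n, 2)."""
    return comb(n, 2)


# Reproduce the three set sizes quoted on the slide.
for n in (10, 20, 100):
    print(f"N = {n}, # = {pair_count(n)}")
```

Doubling N roughly quadruples the number of comparisons, which is why the per-pair cost has to be very small for a real-time system.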
Difficulty and limitation 2/2 • Limited storage space • Photos are described by limited information. Photo CCM (meta-data) Name: URL: Created date: Info for de-dup: …
Proposed Solution • Hybrid referral system: • First consultation: • Fast approach • Subsequent consultation: • Accurate approach, examines the ambiguous pairs
Fast consultation: Grid Color Moment • Discovers statistical properties of the image • 5x5 grid • HSV color space • 3 moments per grid cell and color channel: mean, variance, skewness • Feature extraction produces the image descriptor, elements 1 … 225 • Vector length: 5x5x3x3 = 225
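The descriptor layout above (5x5 cells, 3 HSV channels, 3 moments) can be sketched in plain Python. This is an illustration under stated assumptions, not the system's actual implementation: the input is assumed to be a list of rows of (r, g, b) floats in [0, 1] and at least 5x5 pixels, skewness is assumed to be the standardized third central moment, and the names `moments` and `grid_color_moments` are hypothetical.

```python
import colorsys


def moments(values):
    """Return (mean, variance, skewness) of a non-empty list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    # Assumption: standardized skewness; a (near-)flat cell is given 0.
    skew = third / var ** 1.5 if var > 1e-12 else 0.0
    return mean, var, skew


def grid_color_moments(rgb_rows, grid=5):
    """Build a grid*grid*3*3 descriptor from rows of (r, g, b) in [0, 1]."""
    height, width = len(rgb_rows), len(rgb_rows[0])
    # Convert every pixel to HSV with the standard-library colorsys module.
    hsv_rows = [[colorsys.rgb_to_hsv(*px) for px in row] for row in rgb_rows]
    descriptor = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [hsv_rows[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # Three moments for each of the H, S and V channels of this cell.
            for channel in range(3):
                descriptor.extend(moments([px[channel] for px in cell]))
    return descriptor


# Usage: a uniform gray 10x10 image yields a 225-element descriptor.
gray = [[(0.5, 0.5, 0.5)] * 10 for _ in range(10)]
print(len(grid_color_moments(gray)))
```

Each image is reduced to a fixed-length vector regardless of its resolution, which suits the limited per-photo storage described earlier.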
Fast consultation: Grid Color Moment • Feature extraction is applied to each image of a pair, giving two image descriptors (elements 1 … 225) • The difference between the two descriptors gives the similarity of the pair
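The slides do not say which metric turns the descriptor difference into a single number. The sketch below assumes a Euclidean distance purely for illustration; `gcm_distance` is a hypothetical name.

```python
import math


def gcm_distance(desc_a, desc_b):
    """Euclidean distance between two equal-length descriptors (assumed metric)."""
    if len(desc_a) != len(desc_b):
        raise ValueError("descriptors must have the same length")
    return math.dist(desc_a, desc_b)


# Usage: identical descriptors are at distance 0.
same = [0.25] * 225
print(gcm_distance(same, same))
```

Whatever the metric, one comparison is a single pass over 225 numbers, which is consistent with a fingerprint-style check being very cheap per pair.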
Concerns and solutions • Fast approach, comparing time: ~1 µs/pair, i.e. 1,000,000 pairs/sec.
How about accuracy? More than 99.6% on average!
How about accuracy? Not high enough?
Result and Observation • Non-duplicated image pairs: 460 • Duplicated image pairs: 257,454 • Pairs located in [T1, T2]: 1,770 • Pairs located outside [T1, T2]: 256,144 • On average, only 1,770 of the 257,914 pairs (about 0.7%) need to be re-examined. • For a set of 50 images, only about 8 of the 1,225 pairs need to be re-examined.
[Figure: number of pairs plotted against GCM distance for non-duplicated and duplicated image pairs, with the thresholds [T1, T2] = [5, 25] marked.]
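The referral rule implied by the thresholds can be sketched as follows. Only the band [T1, T2] = [5, 25] and the pair counts come from the slide; reading a distance below T1 as "duplicate" and above T2 as "distinct" is an assumption (a smaller distance meaning more similar images), and the name `triage` is hypothetical.

```python
from math import comb

T1, T2 = 5, 25  # ambiguous band taken from the slide


def triage(distance, t1=T1, t2=T2):
    """Decide a pair in the fast round or refer it to the accurate round."""
    if distance < t1:
        return "duplicate"   # assumption: small distance means near-identical
    if distance > t2:
        return "distinct"    # assumption: large distance means different images
    return "re-examine"      # inside [T1, T2]: pass to the accurate approach


# Share of pairs that fell inside [T1, T2] in the reported experiment.
ambiguous_share = 1770 / (1770 + 256144)

# Expected number of re-examinations for a 50-image set.
expected = ambiguous_share * comb(50, 2)
print(f"{ambiguous_share:.2%} of pairs, about {round(expected)} of {comb(50, 2)}")
```

Only the small ambiguous fraction ever reaches the expensive matcher, which is how the hybrid scheme keeps the overall response time close to that of the fast approach.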
Accurate Consultation: LIPM – Local Interest Point Matching • Local interest points are described by SURF features.
The system provides: • Fast 1st-round de-duplication • Accurate 2nd-round de-duplication (optional) • Similarity scores for: • Removing duplicates • Providing clues to rearrange the photo layout and increase diversity