
Internet-scale MM retrieval



Presentation Transcript


  1. Internet-scale MM retrieval
     RNDr. Jakub Lokoč, Ph.D., Siret Research Group (www.siret.cz), Department of SW Engineering, Faculty of Mathematics and Physics, Charles University in Prague
     BIG DATA security seminar, Police Academy of the Czech Republic in Prague

  2. What does "internet-scale" mean? Statistics for 2011 from http://royal.pingdom.com:
     • 2.1 billion Internet users worldwide
     • 3.146 billion email accounts worldwide
     • 800+ million users on Facebook
     • 555 million websites (+300 million in 2011)
     • 1 trillion video playbacks on YouTube
     • 48 hours of video uploaded to YouTube every minute
     • 100 billion photos on Facebook (estimated)
     • 4.5 million photos uploaded to Flickr each day
     MM data: the video and photo statistics.

  3. Many problems to solve…

  4. Searching huge MM collections
     • Text-based techniques
       • Advantage: scalable retrieval by inverted files
       • Problem: missing or misleading annotations
     • Content-based techniques
       • Advantage: no annotation needed, visual similarity
       • Problem: slow retrieval for complex similarity models
     • Hybrid techniques
       • Text-based query + content-based reranking/exploration
       • Content-based query + text-based filtering
       • Adapting content-based data for inverted files

  5. Text-based retrieval
     • Document vector model
       • User issues a keyword query (Google, Bing, …)
       • Efficient query evaluation using inverted files
     • Problems
       • Manual annotation is feasible only for small data
       • Subjectivity of the annotation
       • Homonyms, etc.
     • Automatic annotation
       • Surrounding text + linguistic methods + ontologies
       • Content-based keyword assignment
       • Still a lot of problems to solve…
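
The inverted-file idea from this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular search engine: the function names `build_inverted_file` and `keyword_query`, the AND semantics, and the toy annotations are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_file(annotations):
    """Map each keyword to the set of image ids whose annotation contains it."""
    index = defaultdict(set)
    for image_id, text in annotations.items():
        for term in text.lower().split():
            index[term].add(image_id)
    return index

def keyword_query(index, query):
    """Return ids of images annotated with every query keyword (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Only the posting lists of the query terms are touched, not the whole collection.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical annotations; "jaguar" is a homonym (car vs. animal).
annotations = {
    "img1": "red apple on table",
    "img2": "apple logo",
    "img3": "jaguar car",
    "img4": "jaguar in jungle",
}
index = build_inverted_file(annotations)
print(keyword_query(index, "jaguar"))
print(keyword_query(index, "jaguar car"))
```

The single-keyword query returns both the car and the animal, which is the homonym problem listed above; adding a second keyword narrows the result only if the annotation happens to contain it.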

  6. Example: www.google.com
     • Text-based retrieval

  7. Content-based retrieval
     • All objects are transformed into a similarity model
       • Objects are represented by descriptors (histograms, signatures)
       • Descriptors are compared by a distance measure d (Lp, SQFD, EMD)
       • User issues an example object as a query q
       • Objects x are sorted according to the visual similarity d(q, x)
     • How to solve the efficiency problem?
       • Hybrid techniques: not the whole DB is searched in the CB way
       • Distance-based indexes
       • Distributed architectures needed (storage, throughput, …)
     [Figure labels: query object, feature extraction, similarity evaluation]
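
The query-by-example pipeline on this slide can be sketched as follows. The sketch assumes fixed-length histogram descriptors and an Lp distance (one of the measures the slide names); SQFD and EMD are not shown. The names `lp_distance` and `knn_query` and the three toy descriptors are hypothetical.

```python
def lp_distance(x, y, p=2.0):
    """Minkowski (Lp) distance between two equally long descriptors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_query(database, q, k, distance=lp_distance):
    """Sort all objects x by d(q, x) and return the ids of the k closest.

    This is a sequential scan: one distance computation per database object,
    which is exactly the efficiency problem raised on the slide.
    """
    ranked = sorted(database.items(), key=lambda item: distance(q, item[1]))
    return [object_id for object_id, _ in ranked[:k]]

# Hypothetical 3-bin color histograms produced by some feature extraction step.
database = {
    "sunset": [0.7, 0.2, 0.1],
    "forest": [0.1, 0.8, 0.1],
    "ocean": [0.1, 0.2, 0.7],
}
query = [0.6, 0.3, 0.1]  # descriptor of the example object q
print(knn_query(database, query, k=2))
```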

  8. Example: www.google.com
     • Hybrid techniques: reranking, page 1

  9. Example: www.google.com
     • Hybrid techniques: reranking, page 2
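
A minimal sketch of the "text-based query + content-based reranking" combination, under stated assumptions: the text stage is a plain tag filter, the content stage is a Euclidean distance on toy two-dimensional descriptors, and `hybrid_search` and the sample records are hypothetical. It does not describe how Google implements reranking.

```python
import math

def hybrid_search(images, keyword, example_descriptor, k):
    """Filter by keyword first, then rerank only the survivors by visual distance."""
    # Text-based stage: cheap, and it shrinks the candidate set.
    candidates = [image for image in images if keyword in image["tags"]]
    # Content-based stage: distance computations only on the candidates.
    candidates.sort(key=lambda image: math.dist(example_descriptor, image["descriptor"]))
    return [image["id"] for image in candidates[:k]]

images = [
    {"id": "a", "tags": {"jaguar", "car"}, "descriptor": [0.9, 0.1]},
    {"id": "b", "tags": {"jaguar", "animal"}, "descriptor": [0.2, 0.8]},
    {"id": "c", "tags": {"jaguar", "cat"}, "descriptor": [0.4, 0.6]},
    {"id": "d", "tags": {"car"}, "descriptor": [0.25, 0.75]},
]
print(hybrid_search(images, "jaguar", [0.25, 0.75], k=2))
```

Image "d" is visually identical to the example but is never considered because its annotation lacks the keyword, which shows how a hybrid scheme inherits the annotation problems of the text stage.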

  10. Example: siret.ms.mff.cuni.cz/sir
      • Hybrid techniques: exploration
      • J. Lokoč, T. Grošup, T. Skopal. Image Exploration using Online Feature Extraction and Reranking. ICMR, 2012, Hong Kong, China, ACM.
      • J. Lokoč, T. Grošup, T. Skopal. SIR: The Smart Image Retrieval Engine. SISAP, 2012, Toronto, Canada, Springer.

  11. Distance-based indexing
      • MM objects are organized into clusters according to their similarity
      • Effectiveness depends on the similarity model
      • P. Zezula, G. Amato, V. Dohnal, M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006.
      • J. Lokoč, P. Čech, J. Novák, T. Skopal. Cut-region: A Compact Building Block for Hierarchical Metric Indexing. SISAP, 2012, Toronto, Canada, Springer.
      • D. Novak, M. Batko, P. Zezula. Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, Elsevier.
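
One common way to exploit such clusters is ball pruning with the triangle inequality. The sketch below is a flat, single-level illustration under stated assumptions: Euclidean distance, hand-picked centers, and hypothetical helper names `build_clusters` and `range_query`. It is not the Cut-region or M-Index structure cited above.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def build_clusters(objects, centers, distance=euclidean):
    """Assign each object to its nearest center and record covering radii."""
    clusters = [{"center": c, "radius": 0.0, "members": []} for c in centers]
    for obj in objects:
        best = min(clusters, key=lambda cl: distance(obj, cl["center"]))
        best["members"].append(obj)
        best["radius"] = max(best["radius"], distance(obj, best["center"]))
    return clusters

def range_query(clusters, q, r, distance=euclidean):
    """Return (objects within distance r of q, number of distance computations)."""
    results, computations = [], 0
    for cl in clusters:
        d_center = distance(q, cl["center"])
        computations += 1
        # Triangle inequality: every member x satisfies
        # d(q, x) >= d(q, center) - radius, so the whole cluster can be skipped.
        if d_center - cl["radius"] > r:
            continue
        for obj in cl["members"]:
            computations += 1
            if distance(q, obj) <= r:
                results.append(obj)
    return results, computations

objects = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
clusters = build_clusters(objects, centers=[(0, 0), (10, 10)])
results, computations = range_query(clusters, q=(0.5, 0.5), r=1.0)
print(results, computations)
```

The far cluster is discarded after a single distance computation to its center. The pruning is only correct when d is a metric, which is one reason the choice of similarity model matters for indexing.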

  12. Example: Mufin
      • Content-based search in 100 million Flickr images

  13. Example: Mufin
      • MPEG-7 descriptors are used: efficient, but effective?

  14. Distance-based indexing
      • Effective measure
        • Often complex and expensive
      • Efficiency
        • Depends on the index performance
        • Depends also on the data "indexability"

  15. Distance-based indexing
      • Indexability depends on the distance distribution of the distance space used
      • E. Chávez, G. Navarro, R. Baeza-Yates, J. L. Marroquín. Searching in Metric Spaces. ACM Computing Surveys, 2001.
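
The survey cited on this slide quantifies indexability by the intrinsic dimensionality ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the distance distribution; a higher ρ means distances are more concentrated and the space is harder to index. The sketch below estimates ρ from all pairwise distances of a sample. Whether the slide's figure shows exactly this measure is an assumption, and the helper names and the uniform random data are hypothetical.

```python
import itertools
import math
import random
import statistics

def intrinsic_dimensionality(objects, distance):
    """Estimate rho = mu^2 / (2 * sigma^2) from all pairwise distances."""
    distances = [distance(x, y) for x, y in itertools.combinations(objects, 2)]
    mean = statistics.fmean(distances)
    variance = statistics.pvariance(distances, mu=mean)
    return mean ** 2 / (2 * variance)

def random_vectors(n, dim, rng):
    """Sample n points uniformly from the unit cube of the given dimension."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
low = intrinsic_dimensionality(random_vectors(200, 2, rng), math.dist)
high = intrinsic_dimensionality(random_vectors(200, 32, rng), math.dist)
print(f"rho for 2-D data: {low:.1f}, rho for 32-D data: {high:.1f}")
```

With a concentrated distance distribution, lower bounds derived from the triangle inequality rarely exceed the query radius, so an index can discard few objects.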

  16. Facing bad indexability
      • Centralized computing
        • Approximate search
        • Parallel processing
      • Distributed computing
        • Peer-to-peer architecture

  17. Approximate search
      • Based on various ideas
        • Early termination once the results are good enough
        • Reducing the query radius
        • Stopping when a time limit elapses
        • Accessing only a given % of the DB
        • Also distance modifications
      • However, for fast retrieval, the quality deteriorates rapidly
      • P. Zezula, G. Amato, V. Dohnal, M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006.
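
The "accessing % of DB" strategy can be sketched as a k-NN search that visits clusters nearest-first and stops after a fixed budget of accessed objects. This is an illustration under stated assumptions (Euclidean distance, precomputed clusters given as `(center, members)` pairs, hypothetical function name `approximate_knn`), not one of the specific algorithms from the cited book.

```python
import heapq
import math

def approximate_knn(clusters, q, k, max_fraction):
    """Return (k best objects found, number of objects accessed).

    clusters is a list of (center, members) pairs. Clusters are visited in
    order of their center's distance to q, and the scan stops as soon as
    max_fraction of all objects has been accessed.
    """
    total = sum(len(members) for _, members in clusters)
    budget = max(1, int(total * max_fraction))
    ordered = sorted(clusters, key=lambda cluster: math.dist(q, cluster[0]))
    candidates, accessed = [], 0
    for _, members in ordered:
        for obj in members:
            if accessed >= budget:
                break
            candidates.append((math.dist(q, obj), obj))
            accessed += 1
        if accessed >= budget:
            break  # early termination: remaining clusters are never touched
    best = heapq.nsmallest(k, candidates, key=lambda pair: pair[0])
    return [obj for _, obj in best], accessed

clusters = [
    ((0, 0), [(0, 0), (1, 0), (0, 1)]),
    ((10, 10), [(10, 10), (11, 10), (10, 11)]),
]
neighbors, accessed = approximate_knn(clusters, q=(9, 10), k=2, max_fraction=0.5)
print(neighbors, accessed)
```

Here half of the database is enough because the query falls well inside one cluster. When true neighbors are spread over many clusters, a small budget misses them, which is the quality deterioration noted on the slide.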

  18. Parallel processing
      • Multi-core CPUs are cheap and available
      • Intel Xeon Phi coprocessor
      • GPU cards with thousands of cores
      • Amdahl's and Gustafson's laws
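
For reference, the two laws named on this slide in their usual form, with p the parallelizable fraction of the work and N the number of processing units. The numbers in the worked example (p = 0.95, N = 1000) are chosen for illustration and do not come from the talk.

```latex
% Amdahl's law: fixed problem size, speedup limited by the serial part.
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad \lim_{N \to \infty} S_{\mathrm{Amdahl}}(N) = \frac{1}{1 - p}

% Gustafson's law: problem size grows with N, p measured on the parallel run.
S_{\mathrm{Gustafson}}(N) = (1 - p) + pN

% Worked example with p = 0.95 and N = 1000:
S_{\mathrm{Amdahl}}(1000) = \frac{1}{0.05 + 0.00095} \approx 19.6,
\qquad
S_{\mathrm{Gustafson}}(1000) = 0.05 + 950 = 950.05
```

Under Amdahl's assumption a thousand cores yield less than a 20-fold speedup on a single fixed-size query, while under Gustafson's assumption the same hardware pays off when the collection (or the number of queries processed together) grows with the core count.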

  19. Distributed indexes
      • Peer-to-peer architecture
        • Chord protocol (efficient routing)
      • M-Chord, M-Index
        • Map objects to real domain R
        • Use Chord protocol for object distribution
        • Query causes interval queries, results merged
      • D. Novak, P. Zezula. M-Chord: A Scalable Distributed Similarity Search Structure. InfoScale, 2006, ACM.
      • D. Novak, M. Batko, P. Zezula. Large-scale similarity data management with distributed Metric Index. Information Processing & Management.
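
The three steps on this slide (map to a real domain, distribute by key, answer a query with interval queries and merge) can be sketched in a single process. This is a toy loosely inspired by M-Chord, not the published algorithm: the key is assumed to be `i * c + d(p_i, x)` for the nearest pivot `p_i`, Chord routing is replaced by a static equal-width split of the key domain, and the class name `ToyMChord` and all parameters are hypothetical.

```python
import bisect
import math

class ToyMChord:
    def __init__(self, pivots, separation, num_peers):
        self.pivots = pivots
        self.c = separation  # assumed larger than any distance in the data
        self.peer_width = len(pivots) * separation / num_peers
        # Each "peer" is a sorted list of (key, object) pairs for its key interval.
        self.peers = [[] for _ in range(num_peers)]

    def key(self, obj):
        """Map an object to the real domain: nearest pivot index and distance to it."""
        i = min(range(len(self.pivots)), key=lambda j: math.dist(obj, self.pivots[j]))
        return i * self.c + math.dist(obj, self.pivots[i])

    def peer_of(self, key):
        return min(int(key / self.peer_width), len(self.peers) - 1)

    def insert(self, obj):
        k = self.key(obj)
        bisect.insort(self.peers[self.peer_of(k)], (k, obj))

    def range_query(self, q, r):
        """Return (objects within r of q, ids of the peers that were contacted)."""
        results, contacted = [], set()
        for i, pivot in enumerate(self.pivots):
            d_qp = math.dist(q, pivot)
            # Triangle inequality: matches in cluster i have keys in [low, high].
            low = i * self.c + max(0.0, d_qp - r)
            high = i * self.c + min(self.c, d_qp + r)
            if low > high:
                continue
            for peer_id in range(self.peer_of(low), self.peer_of(high) + 1):
                contacted.add(peer_id)
                store = self.peers[peer_id]
                start = bisect.bisect_left(store, (low,))
                for k, obj in store[start:]:
                    if k > high or k >= (i + 1) * self.c:
                        break  # left the interval or entered the next cluster
                    if math.dist(q, obj) <= r:
                        results.append(obj)  # merge partial results
        return results, contacted

index = ToyMChord(pivots=[(0, 0), (10, 10)], separation=100.0, num_peers=4)
for obj in [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]:
    index.insert(obj)
results, contacted = index.range_query(q=(0.5, 0.5), r=1.0)
print(results, contacted)
```

The query touches two of the four peers, one per pivot interval, and only the first returns objects. In a real deployment the mapping from key interval to peer would be resolved by Chord lookups rather than by the arithmetic in `peer_of`.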

  20. And all together

  21. Thanks for your attention… Any questions?
