Matching Images to Words for Large-Scale Datasets • Rong Yan • Research Scientist, Facebook • Email: rongyan@facebook.com
Outline • State-of-the-art: Large-Scale Image-Word Matching • Vocabulary Design • Visual Concept Detection • Scalability • Facebook projects for Images / Videos • Photo Suggestion • Haystack • Facebook Video
Scale of Facebook Images (#1 Photo Site) • 20 billion images in 4 resolutions = 80 billion images • 80 billion photos, printed out and placed side by side, would wrap around the earth more than 10 times (82,240 miles / 132,352 km; Earth's diameter: 12,742 km; source: Wikipedia, http://en.wikipedia.org/wiki/Earth) Photo by woodley wonderworks at http://www.flickr.com/photos/wwworks/2222523978/, used under a Creative Commons license
Speed of Image Sharing in Facebook • 1,200,000 images per second Photo by Eole at http://www.flickr.com/photos/eole/2193801804// and used under a Creative Commons license
Make Images Searchable by Their Content • Search beyond filenames and textual meta-data (Screen mock-up based on the Facebook interface; no relation to any real product plan)
Matching Images to Words • One picture is worth one thousand words [Fred Barnard, 1921] • Translate images into documents of "words", or "visual concepts" • Retrieve images by matching queries against visual concepts (Example concepts for one photo: People: Tourist; Scene: Outdoors; Objects: Building; Organization: HSBC; Landmark: Lion Statue; Event: Summer Travel; Meta-data: High Quality)
Why Matching Images to Words? • Keyword query is the primary input method for visual search • Comparison to "image-retrieval-by-example": • Difficult to find appropriate query examples as initial queries • Harder to scale, and lacks semantic meaning "[Users expect to] type in a few words at most, then expect the engine to bring back the perfect results. More than 95 percent of us never use the advanced search features most engines include, …" – The Search, J. Battelle, 2005
Matching Videos to Words • Visual concepts can be extracted from video • Multimodal content (audio, text, and visual) is available • Temporal relations between video shots can be exploited Scene: Studio People: Kofi Annan Event: UN Meeting Objects: Tank, Jet (ASR transcript: "... fray between the United States and Iraq ... tanks were training in the sands of Kuwait ... U.N. secretary general Kofi Annan will go to Baghdad, where he will meet president Saddam Hussein ...")
Examples of Visual Concepts: Large Number and Wide Coverage fireworks parade flag burning earthquake combat launch fire Abandoned bag flood shoplifting wreckage bridge mountains waterfront traffic cityscape buildings street scene monument face team photo couple person with baby glasses few crowd soldiers helicopter airplane ferry police car humvee vehicle military ship bus truck 10’s 100’s 1K 10K 100K Typical Visual Objects Large-scale Visual Concepts activities scenes people objects specialized generalized # categories 9
Applications of Visual Concepts • Keyword-based Visual Retrieval • User Summarization and Profiling (e.g., Vehicles: Air Vehicle, Car; Sport; Nature: Outdoors, Waterscape) • Visual CAPTCHA Decryption (e.g., "Find the following items: head of person, back of chair, raised hand") • Content-based Advertisement (matching ads to visual content such as cars, chefs, and food in commercials and advertising clips) • Connect Computer Vision to many other areas: Information Retrieval, Human-Computer Interaction, Computational Advertising, Large-Scale Systems
Three Main Issues: Matching Images to Visual Concepts • A. Vocabulary Design: design the concept vocabulary; taxonomy analysis • B. Concept Extraction: manual annotation; automatic detection; context modeling (e.g., "Nature, Day, Outdoors") • C. Scalability: scale computation to billions of images; utilize massive training data from users (Figure: an example broadcast-news concept taxonomy. Location: studio, outdoor, urban, waterscape, mountain, desert, building, sky, snow, vegetation, office, court, road. People and Roles: crowd, face, person, government leader, corporate leader, police/security, military, prisoner. Objects: flag-US, animal, computer, vehicle, airplane, car, boat/ship, bus, truck. Activities & Events: meeting, walk/run, march, explosion/fire, natural disaster.)
A. Concept Vocabulary Design
Design Principles of Concept Vocabulary Dimensions in evaluating/designing concept vocabularies: • Detectability: observable from data (e.g., not abstract like “happy”) • Utility: useful for retrieval, categorization or other applications • Generality: sufficiently frequent across data collections • Specificity: not too frequent (e.g., present in most of the data) • Clarity: no definition ambiguity, no need for context (e.g., no “bank”) • Domains: applicable/adaptable to multiple data domains
Example: ImageNet (2009) [Deng, Fei-Fei, et al., 2009] http://www.image-net.org/ • Populate 80,000 WordNet synsets with ~500-1000 images per synset • Collection methods: • Collect images from search engines • Leverage Amazon Mechanical Turk to clean candidate images • Collection statistics so far: 14,847 synsets, 9.3 million images
Other Large-Scale Concept Vocabularies • NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) • 269,648 Flickr images with 5,018 unique tags • Annotated with 81 concepts using 3,000 man-hours • LabelMe (http://labelme.csail.mit.edu/) • Crowd-sourced object annotation for web images • 43,244 annotated images and 242,249 labeled objects • Tiny Image (http://people.csail.mit.edu/torralba/tinyimages/) • 80 million 32 x 32 low-resolution images • Submitted every WordNet word to image search engines; 10 - 20% accuracy • LIBSCOM (http://lastlaugh.inf.cs.cmu.edu/libscom/) • NSF-funded shared community resource • ~50,000 YouTube videos, 3,705 weak semantic keywords • MSR-MM (http://research.microsoft.com/enus/um/people/xshua/imm2009/dataset.html) • 1 million images and 23 thousand videos; 50,000 images annotated with 100 concepts Future trends: (1) larger scale; (2) crowd-sourcing
Recent Theoretical Advances on Concept Vocabulary • How many concepts are needed for an effective retrieval system? • [Hauptmann et al., TMM 2007] 3,000 – 4,000 concepts are needed to achieve performance comparable to current text retrieval systems • Can we detect a sufficient number of concepts from video? • [Hauptmann et al., CIVR 2007] Zipf's law suggests it is feasible to find the required number of concepts • Are current concept vocabularies (e.g., LSCOM) mature? • [Kender et al., ICME'05] They follow Zipf's law, but are too generic and too sparse • Which semantic concepts have small semantic gaps? • [Lu et al., CVPR'08] A small number of concepts are identified as easy to model
B. Concept Extraction
Landscape of Visual Concept Extraction (Figure: methods laid out along two axes. The supervision axis runs from manual, through supervised learning, to unsupervised; the context axis runs from context-insensitive, through multi-modal / multi-concept, to spatial-temporal. Manual methods include tagging, browsing, gaming, ZoneTag, and brain-signal interfaces; supervised methods include standard generative and discriminative models, object recognition, image annotation, active learning, multi-instance, semi-supervised, cross-domain, multi-view, multimodal fusion, joint text-image, speech, multi-concept modeling, scene understanding, and context models, discriminative (e.g., CRF) or generative (e.g., HMM); unsupervised methods include clustering, unsupervised learning with side information, window-based context, and temporal mining.)
Landscape of Visual Concept Extraction - Manual (the same landscape figure, highlighting the manual-supervision region: tagging, browsing, gaming, ZoneTag, brain signal)
Manual Concept Extraction – Approaches • Gaming with a purpose • ESP Game: labeling images as a game [von Ahn, SIGCHI'04] • Two people see the same image and type keywords until they match • Tagging / Social Tagging • Associate a single image / video at a time with multiple keywords • Social tagging with millions of users • The crowd would be able to label all Google images in months
Manual Concept Extraction – Advanced Approaches • Explore information beyond multimedia content for concept extraction: social networks, semantic networks, location, and even … mind-reading • Social Network [Special Session in MM'07] • Semantic Network (http://www.expertsystem.net) • Location: Yahoo! ZoneTag • Brain-Computer Interface [CMU, VideOlympics'2008] • Hybrid Tag-Browse Labeler [Yan et al., CVPR'2008]
Limitations of Manual Approaches • Time-consuming and labor-intensive • Tagging: 5 – 6 seconds per keyword [Yan et al., CVPR'08] • Browsing: 1.5 seconds per relevant keyword, 0.2 seconds per irrelevant one [Yan et al., CVPR'08] • ESP Game: 15 seconds per keyword [Von Ahn et al., SIGCHI'04] • Subjective and inaccurate for social tagging [Chang, 08] (Example: inconsistent New York landmark labels on Flickr)
Landscape of Semantic Concept Extraction - Automatic (the same landscape figure, now highlighting the supervised and unsupervised regions)
General Approach for Automatic Concept Detection • Example: learn the "car" concept using statistical learning approaches • Pipeline: positive and negative examples → feature extraction (e.g., color, texture, SIFT, …) → features → learning → "car" concept detector (a minimal code sketch of this pipeline follows)
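To make the pipeline concrete, here is a minimal sketch in Python with scikit-learn (a library choice assumed for illustration; the talk does not prescribe one). The random arrays stand in for real color/texture/SIFT descriptors extracted from labeled images.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: 200 positive ("car") and 200 negative examples,
# each a 128-dimensional descriptor (e.g., an aggregated SIFT histogram).
pos = rng.normal(loc=0.5, scale=1.0, size=(200, 128))
neg = rng.normal(loc=-0.5, scale=1.0, size=(200, 128))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)  # 1 = "car", 0 = not "car"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn a binary concept detector; probability=True lets us rank
# test images by their confidence of containing a "car".
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]       # confidence of "car" per image
print("held-out accuracy:", clf.score(X_test, y_test))
print("most car-like test images:", np.argsort(-scores)[:5])
```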
Image Annotation: Hierarchical Aspect Models [Barnard et al., JMLR 2005] • Images are clustered based on hierarchical priors over concepts • Learns localized concept models from global annotations • Addresses the image-blob correspondence problem • Assumption: concept models generate both words and blobs (Figure: a hierarchy from general words and blobs at the root to specific words at the leaves, e.g., "sun", "sky", "water", "waves") Slide courtesy of Kobus Barnard
Image Annotation: Hierarchical Aspect Models (Cont.) • A generative model for assembling image data sets from multimodal clusters: 1. Choose an image cluster with probability p(c) 2. Choose multimodal concept clusters with probability p(l|c) 3. From each multimodal cluster, sample a Gaussian for blob features, p(b|l), and a multinomial for words, p(w|l), for a given image-blob correspondence (a toy sketch of this sampling process follows) Barnard et al., JMLR 2005
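A toy sketch of this generative process, with made-up parameters standing in for learned ones (the cluster prior, concept distributions, word multinomials, and blob Gaussians below are illustrative assumptions, not values from Barnard et al.):

```python
import numpy as np

rng = np.random.default_rng(1)

p_c = np.array([0.6, 0.4])                      # image-cluster prior p(c)
p_l_given_c = np.array([[0.7, 0.3],             # concept clusters p(l|c)
                        [0.2, 0.8]])
vocab = ["sun", "sky", "water", "waves"]
p_w_given_l = np.array([[0.5, 0.3, 0.1, 0.1],   # word multinomials p(w|l)
                        [0.1, 0.2, 0.4, 0.3]])
blob_means = np.array([[2.0, 0.0], [-1.0, 1.5]])  # Gaussian means for p(b|l)

c = rng.choice(2, p=p_c)                  # 1. choose an image cluster
l = rng.choice(2, p=p_l_given_c[c])       # 2. choose a concept cluster
blob = rng.normal(blob_means[l], 0.5)     # 3. sample blob features from p(b|l)
word = vocab[rng.choice(4, p=p_w_given_l[l])]  # ... and a word from p(w|l)
print(f"cluster={c}, concept={l}, blob={blob.round(2)}, word='{word}'")
```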
(Figure: example annotation results from Barnard et al., JMLR 2005, ranging from good, to acceptable, to bad)
Active Learning • Motivation: reduce manual supervision by keeping a human decision in the loop • Incremental learning framework with selective manual supervision: a selection strategy picks unlabeled examples, the user labels them, and the model is re-learned • Active learning with support vector machines (SVMs) [Tong et al., MM'01] • Multi-label active learning for object detection [Yan et al., ICCV'03] • Collaborative annotation using active learning [Quénot et al., TRECVID'07] • Observations: • Consistently achieves near-maximum performance with a small number of manual labels (<15%) • Smaller batch sizes typically perform better, but beware of higher re-learning cost and more frequent user context switches (a minimal uncertainty-sampling sketch follows)
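A minimal uncertainty-sampling sketch in the spirit of SVM active learning (Tong et al., MM'01), with synthetic data and an automatic "oracle" standing in for the human labeler; the batch size and seed-set size are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 16))
y_true = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)  # hidden "oracle" labels

# Seed set: a few labeled examples from each class.
pos_idx = np.where(y_true == 1)[0][:5]
neg_idx = np.where(y_true == 0)[0][:5]
labeled = list(np.concatenate([pos_idx, neg_idx]))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    clf = SVC(kernel="linear").fit(X[labeled], y_true[labeled])
    # Selection strategy: pick the unlabeled points closest to the margin.
    margins = np.abs(clf.decision_function(X[unlabeled]))
    batch = np.argsort(margins)[:20]
    for j in sorted(batch, reverse=True):
        labeled.append(unlabeled.pop(j))        # the "user" supplies the label
    print(f"round {round_}: {len(labeled)} labels, "
          f"accuracy = {clf.score(X, y_true):.3f}")
```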
Cross-Domain Adaptation / Transfer Learning • Motivation: visual data distributions are sensitive to domain change • Adapt classifiers from auxiliary domains to a target domain using very few new training examples • Data-level adaptation [Wu et al., ICML'04] [Liao et al., ICML'05] • Parameter-level adaptation [Zhang et al., NIPS'06] [Datta et al., MM'07] • Function-level adaptation [Yang et al., MM'07] [Jiang et al., ICIP'07] • Semantic graph adaptation [Jiang et al., ICCV'09] • Observations: • A rapidly developing and challenging area • Reasonable improvement for related domains • When to adapt remains an open question (a simplified adaptation sketch follows)
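One very simple way to sketch function-level adaptation: feed the auxiliary classifier's output score into a target-domain model trained on the few target labels, so the target model learns a correction on top of f_aux. This is an illustrative simplification under synthetic data, not the exact formulation of the cited papers:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Auxiliary domain: plenty of labeled data.
Xa = rng.normal(size=(2000, 8))
ya = (Xa[:, 0] > 0).astype(int)
# Target domain: same concept, shifted distribution, only 30 labels.
Xt = rng.normal(loc=0.4, size=(30, 8))
yt = (Xt[:, 0] > 0.4).astype(int)

aux = SVC(kernel="linear").fit(Xa, ya)   # f_aux learned on the auxiliary domain

def augment(X):
    # Append f_aux(x) so the target model learns a correction on top of it.
    return np.hstack([X, aux.decision_function(X)[:, None]])

target = LogisticRegression().fit(augment(Xt), yt)
X_new = rng.normal(loc=0.4, size=(5, 8))
print("adapted predictions:", target.predict(augment(X_new)))
```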
Towards Total Scene Understanding [Li, Socher, Fei-Fei, CVPR'2009] (Figure: one model jointly performs classification (class: polo), annotation (athlete, horse, grass, trees, sky, saddle), and segmentation (image regions labeled sky, tree, athlete, horse, grass))
How Do Large-Scale Collections Change the Landscape? • Richer concepts and richer context • Efficient learning methods required • Imperfect labels • Advanced computational platforms
Concept Detection on Large-Scale Collections I: Richer Concepts and Context Enable More Applications • Estimate locations using over 6M GPS-tagged images [Hays et al., CVPR'09] • Mining reliable tags from 19M Flickr images [Kennedy et al., WSMC'09] • Identify news perspectives with a large-scale video ontology [Lin et al., MM'09]
Concept Detection on Large-Scale Collections II: Efficient Learning Algorithms Become More Critical • Standard state-of-the-art concept detectors do not scale well • SVM training takes ~7 days to learn from 100K images for 40 concepts on a 2GHz CPU • Prediction takes ~3.5 days for 1M testing images for 40 concepts • Efficient optimization and prediction for large-scale collections: • Data down-sampling and compression (e.g., random sampling) • Ensembles of simple learners (e.g., boosting + NN [M. Cooper, LS-MMRM'09]) • Online / incremental learning (e.g., stochastic gradient descent; see the sketch below) • Sparse parameter regularization (e.g., L1 regularization) • Distributed computing (e.g., MapReduce, see next) • Observation: these achieve up to 100x – 1000x speedups
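As an example of the online-learning strategy above, here is a sketch that streams mini-batches through scikit-learn's SGDClassifier via partial_fit, so the full collection never has to fit in memory (the data sizes and features are toy stand-ins):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
clf = SGDClassifier(alpha=1e-4)                # linear model trained by SGD

for batch in range(100):                       # stream 100 mini-batches
    X = rng.normal(size=(1000, 64))            # stand-in for image features
    y = (X[:, 0] - X[:, 1] > 0).astype(int)    # stand-in concept labels
    clf.partial_fit(X, y, classes=[0, 1])      # one cheap pass per batch

X_test = rng.normal(size=(500, 64))
y_test = (X_test[:, 0] - X_test[:, 1] > 0).astype(int)
print("streamed-model accuracy:", clf.score(X_test, y_test))
```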
Concept Detection on Large-Scale Collections III: Learning with Imperfect Annotations • Learn from annotations that can be obtained automatically but are not fully accurate (e.g., manual tags, video-level tags) • Probabilistic EM-like algorithm to identify relevant frames using video-level annotations from 230 hours of YouTube videos [Ulges et al., CIVR'08] (a toy sketch follows this list) • Combine kernel learning, incremental updates, and human interaction to cleanse millions of weakly-tagged images [Gao et al., ICME'09] • Obtain negative examples from 6.5M social-tagged images [Li et al., ACM MM'09] • Exploit four families of metadata tags: scene brightness, flash light, subject distance, focal length [Boutell and Luo, Pattern Recognition 2005] • Observation: useful, but not yet as good as perfect labels
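A toy EM-style sketch in the spirit of the Ulges et al. idea: videos carry only a video-level tag, and we alternate between estimating which frames of the positive videos are actually relevant and retraining a frame-level classifier on those estimates. All data and parameters below are synthetic assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
# Frames from a "positive" video: only 8 of 20 truly show the concept;
# frames from negative videos are all irrelevant.
pos_frames = np.vstack([rng.normal(1.0, 1, (8, 16)),    # relevant frames
                        rng.normal(-1.0, 1, (12, 16))]) # junk frames
neg_frames = rng.normal(-1.0, 1, (20, 16))

relevance = np.full(len(pos_frames), 0.5)               # initial guess
for it in range(5):
    # M-step: fit a frame model, weighting positive frames by relevance.
    X = np.vstack([pos_frames, neg_frames])
    y = np.array([1] * len(pos_frames) + [0] * len(neg_frames))
    w = np.concatenate([relevance, np.ones(len(neg_frames))])
    clf = LogisticRegression().fit(X, y, sample_weight=w)
    # E-step: re-estimate each positive frame's probability of relevance.
    relevance = clf.predict_proba(pos_frames)[:, 1]

print("estimated relevant frames:", (relevance > 0.5).sum(), "of", len(pos_frames))
```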
Recap: Matching Images / Videos to Words • An important but challenging research direction • Manual extraction: effective but time-consuming • Automatic extraction: reduces manual effort but is not as robust • Context: helps improve performance and robustness • Advanced learning: active learning, cross-domain, semi-supervised, … • Large-scale collections: new opportunities and challenges
Future Challenges and Opportunities • Statistical Learning • Insufficiency: what if the amount of positive data is insufficient? • Noisiness: what if the data labels are inaccurate / incomplete? • Visual Recognition • Variance: what if concept appearance is highly variant? • Occlusion: what if target objects are occluded by other objects? • Semantic Management • Construction: how to automatically construct an ontology for semantic concepts? • Utilization: how to leverage ontology knowledge to improve detection performance? • Context Information • Context: how to harness social context, geographical info, and photo metadata? • Large-Scale Visual Content • Annotation: how to obtain accurate annotations? • Scalability: what if there are too many training data / concepts?
Outline • State-of-the-art: Large-Scale Image-Word Matching • Vocabulary Design • Visual Concept Detection • Scalability • Facebook projects for Images / Videos • Photo Suggestion • Haystack • Facebook Video
Project I: Photo Recommendation • Recommend the best photos to Facebook users • Extend friend suggestion to multimedia data types
Key technical challenges • How to combine the following factors to rank photos? • Relevance of photo content • Friendship with the photo owner • Number of photo comments • Photo upload time We are working on it …
Project II: Haystack - Image / Video Storage • Haystack: a new generation of scalable storage designed for images / videos • Challenges of traditional disk-based storage approaches: • File systems aren't good at supporting very large numbers of files • Metadata is too large to fit in memory, so each file read requires many disk operations • Limited by I/O, not storage density
Then we optimize • Cachr: cache the high-volume smaller images to offload the main storage systems • Only 300M images in 3 resolutions • Distribute these through a CDN to reduce network latency • Cache them in memory for scalability, redundancy, and performance (a toy LRU-cache sketch follows) • NFS file handle cache: eliminates some of the NFS storage tier's metadata overhead • We can do better
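A toy sketch of the caching idea: keep the hottest small images in an in-memory LRU cache so most reads never touch the storage tier. The class name, capacity, and fetch function are illustrative assumptions, not Cachr's actual design:

```python
from collections import OrderedDict

class LRUImageCache:
    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch                     # fallback to the storage tier
        self.cache = OrderedDict()

    def get(self, image_id):
        if image_id in self.cache:
            self.cache.move_to_end(image_id)   # hit: mark as recently used
            return self.cache[image_id]
        data = self.fetch(image_id)            # miss: read from storage
        self.cache[image_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least-recently-used
        return data

# Usage: wrap a (stand-in) storage read.
cache = LRUImageCache(capacity=2, fetch=lambda i: f"bytes-of-{i}".encode())
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")
assert "b" not in cache.cache     # "b" was evicted as least recently used
```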
3rd Generation System: Haystack • Traditional filesystem read path: directory inode → directory data (inode #, filename) → file inode (owner info, size, timestamps, blocks) → data • Haystack read path: id → data, served over the network (a toy sketch of the id → (offset, size) index idea follows)
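A toy sketch of the core Haystack idea as described here: append many photos into one large store file and keep a small in-memory index of photo-id → (offset, size), so a read costs one in-memory lookup and one disk read instead of several metadata operations. All names and the file layout are illustrative assumptions, not the real implementation:

```python
import os

class TinyHaystack:
    def __init__(self, path):
        self.path = path
        self.index = {}            # photo_id -> (offset, size), kept in RAM
        open(path, "wb").close()   # start with an empty store file

    def put(self, photo_id, data):
        with open(self.path, "ab") as f:
            offset = f.tell()      # append mode positions at end of file
            f.write(data)          # append-only: no per-photo inode needed
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]   # one in-memory lookup ...
        with open(self.path, "rb") as f:
            f.seek(offset)                    # ... and one disk read
            return f.read(size)

store = TinyHaystack("photos.dat")
store.put("p1", b"bytes of photo 1")
store.put("p2", b"bytes of photo 2")
assert store.get("p2") == b"bytes of photo 2"
os.remove("photos.dat")
```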
The efficiency of Haystack (Chart: comparison across the 1st generation solution, the optimized file system, and Haystack)
Project III: Facebook Videos • Then: implemented by two engineers in a Hackathon • Now: serves billions of videos per month, one of the top-10 video sites
Video for Facebook Video • http://www.facebook.com/careers/life.php
Stay tuned! More Image / Video Applicationsare Coming to Facebook!