JIGSAW (Joint search with ImaGe, Speech And Words): Interactive Mobile Visual Search with Multimodal Queries
Yang Wang, University of Science and Technology of China
Yang Wang, Jingdong Wang, Houqiang Li, Shipeng Li, Tao Mei
Search for images On The Go ACM Ubicomp 2011
Search for images On The Go? ACM Multimedia 2011
Existing image search methods on mobile phones
Formulate an intent; present the results
Drawbacks
• Clear goals
• Explicit intent
• See and search
• Product, landmark, …
Motivations: partial object
Mind, Description, Image, Information
• The user has no exact entity name or instant photo.
• The user has only visual descriptions:
  • Can only describe an oil painting without its name. (“Find an oil painting of a man with a straw hat.”)
  • Wants to find a local business with only a description. (“Find a restaurant with a red gate, two stone lions in front, and two red lanterns on top.”)
  • …
Our approach
• Multi-modal + Multi-touch = Visual intent
• Benefits:
  • Make the search intent explicit
  • Express visual intent
  • Natural interface
• Describe a visual scenario with speech: “Find a picture with sky and grass …”
• Extract entity words from the speech: “sky”, “grass”
• Exemplary images
• Compose a visual query
• Search results
Existing research -> A solution for mobile -> Using exemplar images -> Using region-based matching
• Hao Xu, et al. Image Search by Concept Map. SIGIR ’10.
• Changhu Wang, et al. MindFinder: Image Search by Interactive Sketching and Tagging. WWW ’10.
Flow chart (figure with steps ①, ②, ③, ④)
Speech recognition & entity extraction ①
Voice, passed through a commercial speech recognition engine, becomes a natural sentence: “Find an iron tower under grass”
• Nouns from WordNet* (117,798 nouns)
• Can be represented by images in ImageNet*
• Ignoring prepositions, verbs, adjectives, etc.
• 22,117 entities
Extracted entities: “tower”, “grass”
* http://wordnet.princeton.edu
* http://www.image-net.org
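The entity-extraction step can be sketched as a simple lexicon filter. This is a minimal illustration, not the system's actual code: `ENTITY_LEXICON` is a tiny hypothetical stand-in for the 22,117 WordNet nouns that have ImageNet images, and the regular-expression tokenizer is an assumption.

```python
import re

# Hypothetical stand-in for the 22,117 WordNet nouns backed by ImageNet images.
ENTITY_LEXICON = {"tower", "grass", "sky", "restaurant", "gate", "lion", "lantern"}

def extract_entities(sentence, lexicon=ENTITY_LEXICON):
    """Return lexicon words in order of first appearance; all other words are ignored."""
    entities = []
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in lexicon and token not in entities:
            entities.append(token)
    return entities

print(extract_entities("Find an iron tower under grass"))  # ['tower', 'grass']
```

A real lexicon would also need to handle plurals and multi-word nouns, which this sketch does not attempt.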
Exemplary image generation ②
• Obtain the top 500 images for each text query and extract features
• Cluster the images and keep the cluster centers
• The user can choose one exemplary image for each entity (such as I1 & I2)
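The clustering step can be sketched with plain k-means. The slides do not name the clustering algorithm, so k-means, the first-k initialization, and the choice of the image nearest each center as the exemplar are all assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_exemplars(features, k, iterations=10):
    """Cluster feature vectors and return, per non-empty cluster,
    the index of the image closest to the cluster center."""
    dim = len(features[0])
    centers = [list(f) for f in features[:k]]  # assumed init: first k vectors
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each image joins its nearest center.
        clusters = [[] for _ in range(k)]
        for idx, f in enumerate(features):
            nearest = min(range(k), key=lambda c: squared_distance(f, centers[c]))
            clusters[nearest].append(idx)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(features[i][d] for i in members) / len(members)
                              for d in range(dim)]
    return [min(members, key=lambda i: squared_distance(features[i], centers[c]))
            for c, members in enumerate(clusters) if members]

# Toy 2-D features standing in for real image descriptors.
toy = [[0, 0], [0, 1], [10, 10], [10, 11], [1, 0]]
print(kmeans_exemplars(toy, k=2))
```

Returning image indices rather than raw centers keeps the output something a user can actually tap on.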
Composite visual query ③
• Component:
  • C_k = {T_k, I_k, R_k}
• Visual query:
  • {C_k}
• Component:
  • C_k = {T_k, I_k}
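One way to hold the composite query in code is a small record per component. This is a sketch under assumptions: the slide does not spell out the symbols, so reading T_k as the entity word, I_k as the chosen exemplary image, and R_k as a rectangle on the canvas is an interpretation, and the field names and image identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Component:
    text: str          # T_k: entity word (assumed meaning)
    exemplar_id: str   # I_k: identifier of the chosen exemplary image (assumed meaning)
    # R_k: (x, y, width, height) on a unit canvas; None when no region is given.
    region: Optional[Tuple[float, float, float, float]] = None

# A visual query is the set {C_k}, kept here as a list.
VisualQuery = List[Component]

query: VisualQuery = [
    Component("tower", "tower_03", (0.3, 0.1, 0.4, 0.6)),
    Component("grass", "grass_11", (0.0, 0.7, 1.0, 0.3)),
]
print(len(query), query[0].text)
```

Making `region` optional covers both forms on the slide, C_k = {T_k, I_k, R_k} and C_k = {T_k, I_k}.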
Search ④
Visual matching
Implementation
• Local pattern: SIFT
  • 6000-D bag of words
• Color information: color histogram
  • 192-D HSV
• Shape information: gradient histogram
  • 64-D
• Normalized and combined into a single vector.
• Similarity is calculated with an idf weight.
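The combination step can be sketched as follows. The slides say only that the descriptors are normalized, combined, and compared with an idf weight, so the L2 normalization, the log(N / df) form of idf, and the weighted dot product below are assumptions. The toy vectors are far shorter than the real 6000 + 192 + 64 = 6260 dimensions.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def combine(bow, color, gradient):
    """Normalize each descriptor separately, then concatenate them."""
    return l2_normalize(bow) + l2_normalize(color) + l2_normalize(gradient)

def idf_weights(doc_freq, num_images):
    """Assumed classic idf = log(N / df) per visual word (df must be > 0)."""
    return [math.log(num_images / df) for df in doc_freq]

def weighted_similarity(a, b, weights):
    """Weighted dot product of two combined feature vectors."""
    return sum(w * x * y for w, x, y in zip(weights, a, b))

# Toy example: 2 visual words, 2 color bins, 2 gradient bins.
v = combine([3, 4], [1, 0], [0, 2])
# idf applies to the bag-of-words part; color and shape dimensions get weight 1 (assumption).
weights = idf_weights([1, 10], num_images=10) + [1.0] * 4
print(weighted_similarity(v, v, weights))
```

A visual word that appears in every image gets idf 0 here, so it contributes nothing to the similarity.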
Application UI design
• Windows Phone 7
• Two-step interaction
• User interface
Experiments
• Settings
  • One million images from a commercial search engine
• Objective evaluations
  • 100 test queries
  • Normalized Discounted Cumulative Gain (NDCG)
  • Response time
• User study
  • Usability
NDCG
• Compared JIGSAW, Concept Map*, and text search
• More efficient than text search
• Better performance than Concept Map
* Hao Xu, et al. Image Search by Concept Map. SIGIR ’10.
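For reference, NDCG can be computed as below. The slides do not state which variant was used; this sketch assumes the common form with gain 2^rel - 1 and a log2(rank + 1) discount, and the relevance grades are made up for illustration.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results (ranks start at 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades of the top results returned for one query.
print(ndcg([3, 2, 1], k=3))  # 1.0, already in ideal order
print(ndcg([1, 3], k=2))     # below 1.0, the best result is ranked second
```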
System response time
System response time in searching (on the phone)
• Given n keywords:
  • (500n candidate images) × (n scores calculated per image) = O(n²)
• Pruned by early abort
  • ~O(n)
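The slides do not say which abort rule is used. One plausible reading, sketched here as an assumption, is to stop scoring a candidate as soon as its best possible average can no longer beat the current top-k; the bound relies on every component score being at most 1.

```python
import heapq

def rank_with_early_abort(candidates, score_fns, top_k):
    """candidates: image ids; score_fns: one function per component, each
    returning a score of at most 1. Returns (ranked results, evaluations made)."""
    n = len(score_fns)
    heap = []          # min-heap holding the current top-k (score, id) pairs
    evaluations = 0
    for cand in candidates:
        partial = 0.0
        aborted = False
        for done, fn in enumerate(score_fns, start=1):
            partial += fn(cand)
            evaluations += 1
            # Upper bound on the final average if all remaining scores were 1.
            best_possible = (partial + (n - done)) / n
            if len(heap) == top_k and best_possible <= heap[0][0]:
                aborted = True   # cannot enter the top-k, skip remaining components
                break
        if not aborted:
            final = partial / n
            if len(heap) < top_k:
                heapq.heappush(heap, (final, cand))
            else:
                heapq.heapreplace(heap, (final, cand))
    return sorted(heap, reverse=True), evaluations
```

In the best case most candidates are rejected after a single component score, which is consistent with the roughly linear behavior reported on the slide, though the actual saving depends on the data.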
User’s time
The time distribution for different users to complete a task
• Single component: 30 sec
• Failed trial: +20 sec
• Extra component: +20 sec
Demo
Number of interactions (tap & drag)
• Multi-touch takes up only 5% of all operations because the exemplary images are always too small on the screen.
Visual results (three slides of result figures)
Discussions
• Contributions
  • Introduce a new interactive visual search system on mobile
  • Propose a visual search method for this application
  • Deploy the system on a WP7 mobile phone
• Future work
  • Improve the efficiency of visual search
  • Handle relative positions between objects
Similarity between image and exemplar
Similarity between image and exemplar
• We index the features in 9x9 cells
• Multiple cells are combined to approximate the region to be compared
Similarity between image and exemplar
• Save features in M x M cells
• Combine features covered by the region
• Slide the window around
• Calculate similarity e(i, j)
• Calculate similarity e
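The sliding-window matching can be sketched as below. The similarity measure is not given on the slides, so cosine similarity is an assumption, as is indexing e(i, j) by the window's top-left cell; features are pooled by summing the cell vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def window_similarities(cells, exemplar, win_h, win_w):
    """cells[r][c] is the feature vector of one grid cell (M x M grid).
    Returns e[i][j], the similarity of the window whose top-left cell is (i, j)."""
    m = len(cells)
    dim = len(exemplar)
    e = []
    for i in range(m - win_h + 1):
        row = []
        for j in range(m - win_w + 1):
            # Combine (sum) the features of all cells covered by the window.
            pooled = [0.0] * dim
            for r in range(i, i + win_h):
                for c in range(j, j + win_w):
                    for d in range(dim):
                        pooled[d] += cells[r][c][d]
            row.append(cosine(pooled, exemplar))
        e.append(row)
    return e

# Toy 3 x 3 grid with 2-D cell features; the exemplar matches the top-left 2 x 2 block.
grid = [[[1, 0], [1, 0], [0, 1]],
        [[1, 0], [1, 0], [0, 1]],
        [[0, 1], [0, 1], [0, 1]]]
print(window_similarities(grid, [1, 0], win_h=2, win_w=2))
```

Because each window is rebuilt from per-cell features, the image only needs to be indexed once, whatever region size the query asks for.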
Consider more positions
• Draw the desired distribution from a Gaussian shape centered at the desired position
• Compare e_k^J(i, j) with the desired distribution D_k(i, j) of the k-th component
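A Gaussian-shaped desired distribution over the grid can be generated as below. The slides give neither the spread nor the comparison formula, so `sigma`, the normalization to sum 1, and the weighted-sum comparison in `match_score` are all assumptions for illustration.

```python
import math

def desired_distribution(m, center, sigma):
    """D(i, j) on an m x m grid: Gaussian bump centered at `center`, normalized to sum 1."""
    ci, cj = center
    grid = [[math.exp(-((i - ci) ** 2 + (j - cj) ** 2) / (2 * sigma ** 2))
             for j in range(m)] for i in range(m)]
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def match_score(e, d):
    """Assumed comparison: similarities e(i, j) weighted by the desired distribution."""
    return sum(e[i][j] * d[i][j] for i in range(len(e)) for j in range(len(e[0])))

d = desired_distribution(9, center=(4, 4), sigma=1.5)
print(round(sum(sum(row) for row in d), 6))  # 1.0
```

With this weighting, a strong match near the position the user chose counts more than an equally strong match far from it.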
Penalty
• For cells outside R(k), the penalty is pooled by:
• Instead of e, the feature of a single cell, o(i, j), is used to speed up matching.
• The spatial relevance score between the candidate image and the k-th component:
Fusion
• Scores: count the average score
• Divergence penalty:
Ranking
• -1 < score(J) < 1
• Rank the candidate images by their scores in descending order
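The final ranking is a sort by fused score. A minimal sketch, assuming the fused scores are already available as a mapping from image id to score(J); the ids and values below are made up.

```python
def rank_candidates(scores):
    """scores: dict mapping image id -> fused score in (-1, 1).
    Returns the image ids ordered from highest to lowest score."""
    return sorted(scores, key=lambda image_id: scores[image_id], reverse=True)

# Hypothetical fused scores for three candidate images.
print(rank_candidates({"img_a": 0.12, "img_b": 0.87, "img_c": -0.40}))
# ['img_b', 'img_a', 'img_c']
```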