Windows Live Image Search. Hugh Williams Senior Software Design Engineer Windows Live Search Microsoft Corporation. Overview. Windows Live Image Search Problem Definition and Background User Interface Architecture Why is it a beta? Questions?. Introduction.
Windows Live Image Search Hugh Williams Senior Software Design Engineer Windows Live Search Microsoft Corporation
Overview • Windows Live Image Search • Problem Definition and Background • User Interface • Architecture • Why is it a beta? • Questions?
Introduction • Windows Live Image Search is new: • Released in Beta form on March 8, 2006 • Architected, designed, and engineered in Redmond • Close relative of MSN/Windows Live web search • Microsoft’s image search is available only at Windows Live • The MSN Image Search solution is provided by a third party • Strong partnership between the Windows Live Search product team and: • Microsoft Research, Cambridge UK • Microsoft Research, Asia (Beijing, China) • Microsoft Research, Redmond
Problem Definition • Find thumbnail images using a text query • There are no web-scale image search engines based on CBIR (content-based image retrieval) • All modern image search engines share fundamentals with AltaVista’s original PhotoFinder (1998) • The thumbnail images represent web pages “containing” the original image • We crawl web pages and images • More than a billion images • Pages and images regularly refreshed • Large numbers of images enter and leave the collection daily • More later…
Queries • From an MSN Search sample drawn from a month: • Most frequent: 65,000+ occurrences • Median: 2 occurrences • Most queries are 1 to 3 words in length • Most popular queries: lindsay lohan, scarlett johansson, angelina jolie, sex, jessica simpson, kate beckinsale, paris hilton, britney spears, shakira, sexy, jessica alba, jennifer lopez • Random queries: bridge, rodolfo font, playboy, douwe egberts, jesus, tanning, beauty, oakenfold, priyanka chopra, actors • Around 60 of the top 100 queries are adult or celebrity • Other popular scenarios are places, animals, or objects
More On Queries… • In the US, around 10% are spelling errors • Less in some languages, more in others • Word forms are extremely common • Tom’s Diner, Toms Diner, Tom Diner • Lots of weirdness: • Math.abs • 3/4” Ply • 103,5 versus 103.5 • www cnn.com • Every conceivable spelling of “Britney” • Navigational queries
How Users Click Through • Around 75% of Web search result page views are page one • For image search it is 43%, and the 75% threshold in image search is reached around page eight
Searching And Ranking • Our ranking process matches queries to documents • So, what is a document? • We refer to our documents as nodules • A nodule is created for each link between an HTML document and an image (where we have retrieved both) • The alternative is a nodule per image, or a nodule per page • A nodule typically contains: • The thumbnail of the image • Text and headers from the HTML page • Image metadata
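A minimal sketch of the nodule-per-link idea described on this slide. The `Nodule` class, its field names, and `build_nodules` are hypothetical names chosen for illustration, not the production data model; the only rule taken from the slide is that a nodule exists for each page-to-image link where both ends were retrieved.

```python
from dataclasses import dataclass, field


@dataclass
class Nodule:
    """One (HTML page, image) link. All field names are illustrative."""
    page_url: str
    image_url: str
    page_text: str = ""
    image_metadata: dict = field(default_factory=dict)


def build_nodules(links, fetched_pages, fetched_images):
    """Create one nodule per link whose page AND image were both retrieved.

    links:          iterable of (page_url, image_url) pairs
    fetched_pages:  dict page_url -> extracted page text
    fetched_images: dict image_url -> metadata dict
    """
    nodules = []
    for page_url, image_url in links:
        if page_url in fetched_pages and image_url in fetched_images:
            nodules.append(
                Nodule(
                    page_url=page_url,
                    image_url=image_url,
                    page_text=fetched_pages[page_url],
                    image_metadata=fetched_images[image_url],
                )
            )
    return nodules
```

Under this model, the same image linked from two pages yields two nodules, each carrying the text of its own page, which is what distinguishes it from the nodule-per-image and nodule-per-page alternatives.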
Background: Ranking • So, how do we rank? • We rank using: • Static Rank: Query Independent value • Image and page properties, web link analysis, junk page probability, and so on • Dynamic Rank: Query Dependent value • TF-IDF, BM25, and so on • The overall rank is a combination of Static and Dynamic Rank • Broad answer: we compute the similarity between selected nodules and a query, and order the results by decreasing similarity • The selected nodules are those that contain all query terms (Boolean AND to find a filter set, then similarity-based ordering of the filter set)
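The filter-then-order process above can be sketched as follows. This is an illustration under stated assumptions, not the production ranker: the slide names BM25 as one Dynamic Rank ingredient, but the linear combination with Static Rank, the `weight` parameter, the `k1` and `b` defaults, and whitespace tokenization are all assumptions made here.

```python
import math
from collections import Counter


def rank(query, nodules, static_rank, weight=0.5, k1=1.2, b=0.75):
    """Return nodule ids ordered by decreasing combined score.

    nodules:     dict nodule_id -> text
    static_rank: dict nodule_id -> query-independent score (assumed scale)
    """
    terms = query.lower().split()
    tokens = {nid: text.lower().split() for nid, text in nodules.items()}
    n = len(tokens)
    if n == 0 or not terms:
        return []
    avg_len = sum(len(toks) for toks in tokens.values()) / n

    # Document frequency of every term across the collection.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    # Boolean AND: the filter set holds nodules containing all query terms.
    filter_set = [nid for nid, toks in tokens.items()
                  if all(t in toks for t in terms)]

    scores = {}
    for nid in filter_set:
        tf = Counter(tokens[nid])
        length_norm = k1 * (1 - b + b * len(tokens[nid]) / avg_len)
        dynamic = 0.0
        for t in terms:
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            dynamic += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        # Assumed combination: BM25 plus a weighted static score.
        scores[nid] = dynamic + weight * static_rank.get(nid, 0.0)

    return sorted(filter_set, key=lambda nid: scores[nid], reverse=True)
```

When two nodules have identical text, their Dynamic Rank is equal and the Static Rank term decides the order.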
Algorithmic Search • Traditional Information Retrieval focuses on Intelligence • Recall • Long queries • Well-formed documents • Small (low millions) index • Image search focuses on • Precision • Short queries • Poor documents • Billions of nodules in the index
Nodule Text • Nodules represent the link between an HTML page and an image • Nodule text includes elements such as: • The HTML page <title> • Text from the HTML page • Text from near the image is a good start… • ALT or anchor text from the image • Images can be embedded in a page using the <img> tag or linked-to using the <a> tag
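As a sketch of how the text elements listed above might be pulled from a page, the following uses Python's standard `html.parser`. The `ImageLinkParser` class, the `IMAGE_EXTENSIONS` tuple, and the rule that an `<a>` counts as an image link when its `href` ends in an image extension are assumptions for illustration; the slides do not say how linked images are recognized.

```python
from html.parser import HTMLParser

# Assumed heuristic: an <a> whose href ends in one of these links to an image.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class ImageLinkParser(HTMLParser):
    """Collect the page <title> plus embedded (<img>) and linked (<a>) images."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []          # dicts with keys: url, kind, text
        self._in_title = False
        self._open_anchor = None  # image-link <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            self.images.append(
                {"url": attrs["src"], "kind": "img", "text": attrs.get("alt") or ""}
            )
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(IMAGE_EXTENSIONS):
            self._open_anchor = {"url": attrs["href"], "kind": "a", "text": ""}
            self.images.append(self._open_anchor)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._open_anchor = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._open_anchor is not None:
            # Anchor text of an <a> that links to an image.
            self._open_anchor["text"] += data
```

Feeding a page to `ImageLinkParser().feed(html)` leaves the title in `.title` and one entry per image in `.images`, with ALT text for `<img>` and anchor text for `<a>`.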
Image Metadata • Ranking uses text and image properties (the latter are exclusively for image search) • These include: • AspectRatio (the ratio of the X dimension to the Y dimension) • Pixels (the product of X and Y dimensions) • PhotoGraphic (whether an image is a photograph or a graphic) • …
Throwing Out Junk • The Web is full of balls, lines, and Amazon logos • Right now, we ignore very small images • Some we don’t fetch (HTML width and height attributes help us), many we drop after fetching • Junk properties help us in ranking: • We lower the rank of images with extreme aspect ratios • We lower the rank of images with few pixels
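The two junk defenses on this slide, skipping tiny images before fetching and lowering rank afterwards, might look like the sketch below. Every threshold (`min_side`, `max_aspect`, `min_pixels`) and the 0.5 penalty steps are invented for illustration; the slides give no actual values.

```python
def should_fetch(width_attr, height_attr, min_side=32):
    """Decide from HTML width/height attributes whether to fetch an image.

    Attributes may be missing or non-numeric (e.g. "100%"); in that case we
    fetch and leave the decision to the post-fetch checks.
    """
    try:
        width, height = int(width_attr), int(height_attr)
    except (TypeError, ValueError):
        return True
    return width >= min_side and height >= min_side


def junk_penalty(width, height, max_aspect=4.0, min_pixels=10_000):
    """Return a rank penalty in [0, 1] for a fetched image.

    The aspect ratio is taken as longer side / shorter side so that very
    wide and very tall images are treated alike (an assumption; the slide
    defines AspectRatio as X / Y).
    """
    if width <= 0 or height <= 0:
        return 1.0
    aspect = max(width, height) / min(width, height)
    pixels = width * height
    penalty = 0.0
    if aspect > max_aspect:
        penalty += 0.5  # extreme aspect ratio: likely a line or banner
    if pixels < min_pixels:
        penalty += 0.5  # few pixels: likely a bullet, icon, or spacer
    return penalty
```

For example, a 1x1 spacer declared in the HTML is never fetched, while a fetched 500x10 rule receives both penalties.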
Duplicates And Near Duplicates • Duplication is problematic, particularly for logos, products, and posters • We compute a hash of every image • At query time, all exact duplicates except the highest-ranked one are removed from the filter set • We are working on techniques for removing near duplicates
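Exact-duplicate removal at query time can be sketched as a single pass over the ranked filter set. The choice of SHA-256 is an assumption (the slide does not name the hash function), and `drop_exact_duplicates` is a hypothetical helper name. A byte-level hash like this one only catches exact duplicates, not near duplicates.

```python
import hashlib


def image_hash(image_bytes):
    """Hash the raw image bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(image_bytes).hexdigest()


def drop_exact_duplicates(ranked):
    """Keep only the highest-ranked nodule for each image hash.

    ranked: list of (nodule_id, image_hash) in decreasing rank order.
    Returns the surviving nodule ids, order preserved.
    """
    seen = set()
    kept = []
    for nodule_id, digest in ranked:
        if digest not in seen:
            seen.add(digest)
            kept.append(nodule_id)
    return kept
```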
User Interface • The Windows Live image search user interface has five new features: • “Infinite scroll” or “smart scroll” • Thumbnail size slider • Film strip results view • Show full image • Metadata grow experience
Infinite Or Smart Scroll • Results are presented in a single page • Removes others’ paging model • Smooths the click curve • Improves browsability • Motivated by click data • As discussed previously, only 43% of users stay on page one • Many sessions show very deep click behaviors • Same motivation for the thumbnail size slider
Other Features… • Motivated and reinforced by usability studies • Film Strip Results View: • Improve results navigation • Remove unnecessary click actions • Make it easy to find a page or image • Show full image feature: • Helps locate original image • Particularly useful for <a> links • Metadata grow • Most users don’t use metadata • Reduce clutter, improve browse experience
Architecture And Design • Crawl and index over a billion nodules every two weeks • Crawl 750 nodules per second • Answer queries in less than 250ms, with most answered in less than 50ms • Serve several million queries per day • Peak load of 150+ queries per second • Serve 10,000+ thumbnails per second at peak • Manage several petabytes of raw storage
Indexing: Selection And Crawl • Only way into Search is via our Crawler • We used to have “paid inclusion” but abandoned it • Google doesn’t have it, Yahoo! does • Crawl is partly prioritized by Static Rank • We crawl the top few billion pages • Biggest issue with crawling: politeness
Distributed Searching I: Single Box • Monolithic Model (AltaVista, WebCrawler): the index goes on a single (big) box • Advantages: • Easy to scale query volume: just buy more web server frontends and Big Boxes • Full visibility on results while ranking • Disadvantages: • Hard to scale index size: limited by CPU and Memory • Reliability
Distributed Searching II: Word-Striping • Stripe the index by term across index servers • Have a central box send the query terms to appropriate servers • Merge the results • Advantages: • Only boxes that have answers get used per query • Have full visibility of results while ranking • Disadvantages: • Some boxes are likely to be more loaded than others • It turns out this creates significant network traffic
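A toy sketch of word-striping, assuming terms are routed to servers by a CRC32 hash (the slides do not say how terms are assigned). The central merge step receives whole posting lists from the term servers, and the `shipped` counter makes that network cost visible.

```python
import zlib


def server_for(term, num_servers):
    """Route a term to one index server (hash routing is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_servers


def build_term_striped_index(docs, num_servers):
    """docs: dict doc_id -> text. Each server maps its terms to posting sets."""
    servers = [dict() for _ in range(num_servers)]
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            server = servers[server_for(term, num_servers)]
            server.setdefault(term, set()).add(doc_id)
    return servers


def search(servers, query):
    """Return (matching doc ids, number of postings shipped to the merger)."""
    terms = query.lower().split()
    if not terms:
        return set(), 0
    postings = []
    shipped = 0
    for term in terms:
        plist = servers[server_for(term, len(servers))].get(term, set())
        shipped += len(plist)  # every posting crosses the network
        postings.append(plist)
    # Boolean AND, performed centrally after collecting the posting lists.
    return set.intersection(*postings), shipped
```

Even in this tiny case, a query with one matching document can ship several postings, and a popular term's server is hit by every query that contains it.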
Distributed Searching III: Document Striping • Stripe documents randomly across boxes • Send query to all boxes • Merge the results from all boxes • Advantages: • Scales with both index size and query traffic volume • Minimal network traffic, aggregation is easy • Disadvantage: • No visibility on all results while ranking
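Document striping can be sketched in the same toy style: each box indexes a slice of the documents, answers every query with its local top results, and a merger combines them. The CRC32-based assignment stands in for "random" striping, and the term-count score is a placeholder for the real ranker; both are assumptions.

```python
import heapq
import zlib


def build_doc_striped_index(docs, num_shards):
    """docs: dict doc_id -> text. Assign each document to one shard."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = zlib.crc32(doc_id.encode("utf-8")) % num_shards
        shards[shard][doc_id] = text
    return shards


def search_shard(shard, terms, k):
    """Local top-k on one box, using only that box's documents."""
    hits = []
    for doc_id, text in shard.items():
        tokens = text.lower().split()
        if all(t in tokens for t in terms):
            score = sum(tokens.count(t) for t in terms)  # placeholder score
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search(shards, query, k=10):
    """Send the query to all shards and merge their local top-k lists."""
    terms = query.lower().split()
    if not terms:
        return []
    partial = [search_shard(shard, terms, k) for shard in shards]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]
```

Each box returns at most `k` small (score, id) pairs, which is why aggregation is cheap; the price is that a box scores its documents without seeing what the other boxes hold.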
Why Is It A Beta? • We are working on multiple features • Continuous improvement of ranking and relevance • Internationalization and accessibility • Scaling and reliability • Adult filtering • New, thought-leading features • Many of these involve colleagues in Microsoft Research
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.