Using HTML Metadata to Retrieve Relevant Images from the World Wide Web • Ethan V. Munson • University of Wisconsin-Milwaukee
Why is image search important? • The Web is becoming the world’s primary information source • Images are one of the Web’s key features • Few WWW image search engines exist currently • Using textual search engines to find images manually is laborious
A Requirement for Web Image Search • We need an efficient method of discovering and indexing image content. • Two main sources of information about image content: • image processing • associated text • text content • markup
Related work • QBIC (IBM Almaden Research Center) • indexes and retrieves images according to: • shape • color • texture • object layout • queries are formulated through visual examples • a sample image • user-provided sketches
QBIC: Advantages and Disadvantages • Advantages • well-developed visual query language • interesting GUI • queries are based on image appearance • Disadvantages • works only at the primitive feature level (color, texture, shape) • doesn’t recognize the semantics of an image • very sensitive to camera viewpoint • doesn’t scale up to the Web
Related work • WebSeek (J. Smith & S. Chang, Columbia University) • performs a semi-automated classification of the images • automatically extracts keywords from image file names • computes the keyword histogram • manually creates a subject hierarchy • manually maps the images into the subject hierarchy • Users can • browse the categories • search the categories by keyword • search the database using image features • color content
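The two automated steps on this slide (keywords from image file names, then a keyword histogram) can be illustrated with a minimal Python sketch. It is not WebSeek's code: the tokenization rule (split on non-letters, drop one-letter tokens) and the function names `filename_keywords` and `keyword_histogram` are assumptions made for illustration.

```python
import posixpath
import re
from collections import Counter
from urllib.parse import unquote, urlparse


def filename_keywords(image_url):
    """Split the file name of an image URL into lowercase keyword candidates."""
    path = unquote(urlparse(image_url).path)
    stem, _extension = posixpath.splitext(posixpath.basename(path))
    # Assumed rule: break on anything that is not a letter (digits, "_", "-"),
    # and ignore one-letter fragments.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", stem) if len(t) > 1]


def keyword_histogram(image_urls):
    """Count how often each file-name keyword occurs across a set of images."""
    counts = Counter()
    for url in image_urls:
        counts.update(filename_keywords(url))
    return counts


if __name__ == "__main__":
    urls = [
        "http://example.com/pics/golden_gate-bridge01.jpg",
        "http://example.com/img/Bridge_night.GIF",
    ]
    print(keyword_histogram(urls).most_common(3))
```

Frequent keywords in such a histogram are natural candidates for the manually built subject hierarchy the slide describes.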
WebSeek: Advantages/Disadvantages • Advantages • Large index of Web images • Supports both text and image search • Disadvantages • Not clear that database can scale up • Manual categorization is very expensive • Relevance feedback mechanism is computationally expensive
Related work • WebSeer (M. Swain et al., The University of Chicago) • uses associated text and markup to supplement information derived from analyzing image content • uses multiple kinds of metadata • image file names • alternate text • text of a hyperlink • decides which images are photographs, portraits, or computer-generated drawings • research emphasized categorization, not metadata-based search
Why seek new image retrieval methods? • The number of WWW documents is growing rapidly, and the documents themselves change constantly • We need fast and efficient methods for finding images • Image processing is • complex • computationally expensive • limited (misses true image semantics) • unnecessary
Research Goals • Show that images can be found using HTML “metadata” • textual content • HTML tag structure • attribute values • Determine which metadata features are the best clues to image content
The URL Filter • assembles a list of URLs from the results returned by Alta Vista • parses the first page returned by Alta Vista • follows the URLs of results pages, retrieves these pages, and parses them • extracts list of URLs from the results pages
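As a rough illustration of the URL-extraction step, the sketch below collects links from the HTML of one search-results page using only the Python standard library. The slides do not say how hits are told apart from the engine's own navigation links; treating off-site `http`/`https` links as hits is an assumption, and `LinkCollector` and `extract_result_urls` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_result_urls(results_html, base_url):
    """Return off-site http(s) links in page order, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(results_html)
    own_host = urlparse(base_url).netloc
    seen, urls = set(), []
    for link in collector.links:
        parts = urlparse(link)
        # Assumed heuristic: links back to the search engine are navigation,
        # everything else on http(s) is a search hit.
        if parts.scheme in ("http", "https") and parts.netloc != own_host:
            if link not in seen:
                seen.add(link)
                urls.append(link)
    return urls
```

The same-host links that this sketch discards are where a fuller version would look for the further results pages that the filter follows.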
The Crawler • retrieves the pages • saves each page’s HTML source code in a separate file
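A crawler of this kind can be sketched in a few lines of Python. The file-naming scheme (a SHA-1 hash of the URL), the timeout, and the policy of silently skipping unreachable pages are assumptions for illustration; the slides only state that each page is saved to its own file.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, fetch_page=fetch):
    """Save each page's HTML source in a separate file; return {url: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for url in urls:
        try:
            body = fetch_page(url)
        except OSError:
            # urllib's URLError, HTTPError and socket timeouts are all OSErrors.
            continue
        # Assumed naming scheme: hashing the URL gives a unique, filesystem-safe name.
        path = out_dir / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html")
        path.write_bytes(body)
        saved[url] = path
    return saved
```

Passing the download function as the `fetch_page` parameter keeps the sketch testable without network access.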
“Tidy” • converts arbitrary and probably ill-formed HTML into XHTML
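One way to run this step from a script is to pipe each saved page through the HTML Tidy command-line tool. The sketch below assumes a `tidy` executable is on the PATH; the option set shown is a plausible choice, not necessarily the one used in the original system, and `tidy_command` and `html_to_xhtml` are hypothetical helper names.

```python
import shutil
import subprocess


def tidy_command(tidy_path="tidy"):
    """Build a Tidy command line that reads HTML on stdin and writes XHTML."""
    return [
        tidy_path,
        "-asxhtml",               # emit well-formed XHTML
        "-numeric",               # numeric character references, friendlier to XML parsers
        "-quiet",                 # suppress non-essential output
        "-utf8",                  # treat input and output as UTF-8
        "--force-output", "yes",  # still produce output when errors are found
    ]


def html_to_xhtml(html_text):
    """Return the XHTML produced by Tidy, or None if Tidy is missing or gives no output."""
    tidy_path = shutil.which("tidy")
    if tidy_path is None:
        return None
    result = subprocess.run(
        tidy_command(tidy_path),
        input=html_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if result.returncode > 1 and not result.stdout:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```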
XHTML Parser • parses an XHTML document • builds an XHTML parse tree
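Because XHTML is well-formed XML, any XML parser can build the parse tree. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for the parser named on the slide, which is not identified further; `local_name`, `parse_xhtml`, `title_text`, and `outline` are illustrative helpers. It assumes the input contains no named HTML entities such as `&nbsp;`, which ElementTree rejects unless they are declared.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def parse_xhtml(xhtml_text):
    """Parse an XHTML document and return the root of its parse tree."""
    return ET.fromstring(xhtml_text)


def title_text(root):
    """Return the text of the document's TITLE element, or None if absent."""
    return root.findtext(f".//{{{XHTML_NS}}}title")


def outline(element, depth=0):
    """List the element names of the tree, indented by nesting depth."""
    lines = ["  " * depth + local_name(element)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Cats</title></head>'
        '<body><p>A <img src="cat.jpg" alt="cat"/></p></body></html>'
    )
    print("\n".join(outline(parse_xhtml(doc))))
```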
The Document Analyzer • scans the parse tree for image URLs • an image URL appears in either an image or anchor element • converts relative URLs into absolute URLs • uses various heuristics to determine which URLs point to relevant images
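The first three steps of the analyzer (scan the tree, look at image and anchor elements, make URLs absolute) can be sketched as follows. The relevance heuristics are not specified on the slide and are not reproduced here; the only filter shown, an assumed list of image file extensions applied to anchor targets, is there to tell image links from ordinary page links.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Assumed set of extensions that mark an anchor target as an image.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def looks_like_image(url):
    """True if the path part of the URL ends in a known image extension."""
    return urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)


def image_urls(xhtml_text, page_url):
    """Return absolute image URLs found in IMG and A elements, in document order."""
    root = ET.fromstring(xhtml_text)
    found = []
    for element in root.iter():
        name = local_name(element)
        if name == "img":
            ref = element.get("src")
        elif name == "a":
            ref = element.get("href")
            if ref and not looks_like_image(ref):
                ref = None  # an ordinary hyperlink, not a link to an image
        else:
            ref = None
        if ref:
            absolute = urljoin(page_url, ref)  # resolve relative URLs against the page
            if absolute not in found:
                found.append(absolute)
    return found
```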
Search Strategies • Image’s file name • Textual content of the TITLE element • Value of the ALT attribute of IMG elements • Textual content of anchor elements • Value of the title attribute of anchor elements • Textual content of the paragraph surrounding an image • Textual content of any paragraph located within the same center element as the image • Textual content of heading elements
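To make the list above concrete, the sketch below gathers the text from most of these sources for each IMG element and reports which sources contain a query term. It is an illustration under assumptions, not the system's implementation: matching is plain case-insensitive substring search, the center-element strategy is omitted, only enclosing anchors and paragraphs are considered, and the source labels and function names are invented for the example.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def text_of(element):
    """All text inside an element, with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


def image_clues(xhtml_text):
    """Map each IMG src to a dict of {metadata source: text}."""
    root = ET.fromstring(xhtml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    title = next((text_of(e) for e in root.iter() if local_name(e) == "title"), "")
    headings = " ".join(text_of(e) for e in root.iter() if local_name(e) in HEADINGS)
    clues = {}
    for img in root.iter():
        if local_name(img) != "img":
            continue
        src = img.get("src", "")
        entry = {
            "file name": posixpath.basename(urlparse(src).path),
            "TITLE": title,
            "ALT": img.get("alt", ""),
            "headings": headings,
        }
        # Walk up the ancestors to find the nearest enclosing anchor and paragraph.
        node = parent.get(img)
        while node is not None:
            name = local_name(node)
            if name == "p" and "paragraph" not in entry:
                entry["paragraph"] = text_of(node)
            elif name == "a" and "anchor text" not in entry:
                entry["anchor text"] = text_of(node)
                entry["anchor title"] = node.get("title", "")
            node = parent.get(node)
        clues[src] = entry
    return clues


def matching_sources(clues, query):
    """For each image, list the metadata sources whose text contains the query."""
    term = query.lower()
    return {
        src: sorted(name for name, text in entry.items() if term in text.lower())
        for src, entry in clues.items()
    }
```

Keeping the sources separate, rather than merging them into one bag of words, is what allows a comparison of which features are the best clues to image content.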
Experimental Questions • Which HTML features reveal the most information about an image? • Do particular patterns of HTML structure carry useful information? • Do image search results depend on the type of query?