Bringing Order to the Web: Automatically Categorizing Search Results

Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen, CS Division, UC Berkeley Susan Dumais, Microsoft Research ACM:CHI April 4, 2000

Organizing Search Results • Query: jaguar • List Organization • Category Organization (SWISH)

Outline • Background • Using category structure to organize information • SWISH System: Searching With Information Structured Hierarchically • Text classification • User interface • User Study • Future Work

Using Category Structure • To Organize Information • Superbook, Cat-a-Cone, etc. • To Help Web Search • Yahoo!, Northern Light • What’s New in SWISH? • Automatic categorization of new documents • User interface that tightly couples hierarchical category structure with search results • User study for the new user interface

SWISH System • Combines the Advantages of • Manually crafted & easily understood directory structure • Broad coverage from search engines • System Components • Text classification models • User interface

Text Classification • Text Classification • Assign documents to one or more of a predefined set of categories • E.g., News feeds, Email - spam/no-spam, Web data • Manually vs. automatically • Inductive Learning for Classification • Training set: A set of manually classified documents • Learning: Learn classification models from the training set • Classification: Use the models to automatically classify new documents
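The three inductive-learning steps above (training set, learning, classification) can be sketched in a few lines of Python. This is a minimal illustration only: it uses a small multinomial Naive Bayes classifier as a stand-in, not the SVM that SWISH uses, and the four training documents and the helper names `tokenize`, `train`, and `classify` are invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()


def train(labeled_docs):
    """Learning step: collect per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Classification step: return the category with the highest log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # category prior
        for token in tokenize(text):
            # Laplace smoothing so unseen words do not zero out a category
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Training set: a handful of manually classified documents (invented examples).
training_set = [
    ("used car dealer honda parts", "Automotive"),
    ("motorcycle repair and auto parts", "Automotive"),
    ("windows software downloads for pc", "Computers & Internet"),
    ("web hosting provider for users", "Computers & Internet"),
]

model = train(training_set)
print(classify(model, "honda motorcycle parts"))
print(classify(model, "pc software hosting"))
```

The same train-once, classify-many split is what lets new documents be categorized automatically after the manual labeling effort has been spent on the training set.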

Training Set: LookSmart Web Directory • Category Structure (spring 99) • 13 top-level categories • 150 second-level categories • Training Set • ~50k web pages; chosen randomly from all categories • Top-level Categories • Automotive • Business & Finance • Computers & Internet • Entertainment & Media • Health & Fitness • Hobbies & Interests • Home & Family • People & Chat • Reference & Education • Shopping & Services • Society & Politics • Sports & Recreation • Travel & Vacations

Learning & Classification • Support Vector Machine (SVM) • Accurate and efficient for text classification (Dumais et al., Joachims) • Model = weighted vector of words • “Automobile” = motorcycle, vehicle, parts, automobile, harley, car, auto, honda, porsche … • “Computers & Internet” = rfc, software, provider, windows, user, users, pc, hosting, os, downloads ... • Hierarchical Models • 1 model for N top level categories • N models for second level categories • Very useful in conjunction w/ user interaction
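A rough sketch of what "model = weighted vector of words" and the hierarchical arrangement could look like in code. Everything numeric here is an assumption made for illustration: the weights and biases are invented rather than learned, and the second-level category names ("Motorcycles", "Cars", "Software", "Hosting") are hypothetical, since the slides do not list the second-level categories. The sketch scores a page with a linear model per category, picks the best top-level category, then applies only that category's second-level models.

```python
def score(model, tokens):
    """Linear score: sum of the weights of the words present, plus a bias."""
    return sum(model["weights"].get(t, 0.0) for t in tokens) + model["bias"]


def best_category(models, tokens):
    """Return the name of the highest-scoring model."""
    return max(models, key=lambda name: score(models[name], tokens))


# Top-level models (weights are invented, not learned values).
TOP_LEVEL = {
    "Automotive": {
        "weights": {"motorcycle": 1.2, "vehicle": 1.0, "parts": 0.8,
                    "car": 1.1, "harley": 0.9},
        "bias": -0.5,
    },
    "Computers & Internet": {
        "weights": {"software": 1.3, "windows": 1.0, "pc": 0.9,
                    "hosting": 0.8, "downloads": 0.7},
        "bias": -0.5,
    },
}

# One group of second-level models per top-level category
# (category names below are hypothetical).
SECOND_LEVEL = {
    "Automotive": {
        "Motorcycles": {"weights": {"motorcycle": 1.5, "harley": 1.2}, "bias": -0.5},
        "Cars": {"weights": {"car": 1.5, "porsche": 1.2, "honda": 0.6}, "bias": -0.5},
    },
    "Computers & Internet": {
        "Software": {"weights": {"software": 1.5, "downloads": 1.0, "windows": 0.8},
                     "bias": -0.5},
        "Hosting": {"weights": {"hosting": 1.5, "provider": 1.0}, "bias": -0.5},
    },
}


def classify_hierarchical(text):
    """Pick a top-level category first, then a second-level category within it."""
    tokens = text.lower().split()
    top = best_category(TOP_LEVEL, tokens)
    second = best_category(SECOND_LEVEL[top], tokens)
    return top, second


print(classify_hierarchical("harley motorcycle parts"))
print(classify_hierarchical("windows software downloads"))
```

One consequence of this arrangement is that second-level models only need to be evaluated for the chosen top-level category, which fits an interface where users expand one category at a time.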

SWISH Architecture (diagram) • Train (offline): manually classified web pages; SVM model • Classify (online): web search results; local search results

Interface Characteristics • Problems • Large amount of information to display • Search results • Category structure • Limited screen real estate • Solutions • Information overlay • Distilled information display

Information Overlay • Use tooltips to show • Summaries of web pages • Category hierarchy

Expansion of Category Structure

Expansion of Web Page List

User Study - Conditions • Category Interface • List Interface

User Study

User Study • Participants • 18 intermediate Web users • Tasks • 30 search tasks, e.g., “Find home page for Seattle Art Museum” • Search terms are fixed for each task • Experimental Design • Category/List – within subjects • 15 search tasks with each interface • Order (Category/List First) – counterbalanced between subjects • Both Subjective and Objective Measures
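The experimental design above can be laid out as a small assignment table. The sketch below is one possible reading, not the authors' actual procedure: alternating participants between Category-first and List-first, and giving the first 15 tasks to the first interface, are assumptions, and `build_study_plan` is a hypothetical helper name.

```python
def build_study_plan(num_participants=18, num_tasks=30):
    """Within-subjects plan: every participant uses both interfaces,
    with interface order counterbalanced across participants."""
    tasks_per_interface = num_tasks // 2
    plan = []
    for participant in range(num_participants):
        # Assumption: alternate the starting interface between participants.
        if participant % 2 == 0:
            order = ("Category", "List")
        else:
            order = ("List", "Category")
        # Assumption: tasks 0-14 form the first block, tasks 15-29 the second.
        first_block = list(range(tasks_per_interface))
        second_block = list(range(tasks_per_interface, num_tasks))
        plan.append({
            "participant": participant,
            "blocks": [(order[0], first_block), (order[1], second_block)],
        })
    return plan


plan = build_study_plan()
category_first = sum(1 for p in plan if p["blocks"][0][0] == "Category")
print(category_first, len(plan) - category_first)  # 9 9
```

With 18 participants this gives 9 who start with each interface, and each participant completes 15 tasks per interface.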

Subjective Results • 7-point rating scale (1=disagree; 7=agree) • Questions:

Use of Interface Features • Average Number of Uses of Feature per Task

Search Time • Category: 56 secs • List: 85 secs • p < .002 • 50% faster with Category interface
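The relationship between the two search times can be checked with a short calculation. Only the 56 and 85 second figures come from the slide; the variable names and the two ways of expressing the difference are illustrative. The "50% faster" summary corresponds to the first ratio (extra time needed with the List interface).

```python
category_secs = 56  # mean search time, Category interface
list_secs = 85      # mean search time, List interface

# Extra time needed with the List interface, relative to Category.
extra_time_on_list = (list_secs - category_secs) / category_secs

# Time saved with the Category interface, relative to List.
time_saved_with_category = (list_secs - category_secs) / list_secs

print(f"{extra_time_on_list:.0%}")        # 52%
print(f"{time_saved_with_category:.0%}")  # 34%
```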

Search Time by Query Difficulty • Top20: 57 secs • NotTop20: 98 secs • No reliable interaction between query difficulty and interface condition • Category interface is helpful for both easy and difficult queries

Summary • Text Classification • Organize search results • Use hierarchical category models • Classify new web pages on-the-fly • User Interface • Tightly couple search results with category structure • Allow manipulation of presentation of category structure • User Study • Suggests strong preference and performance advantages for categorically organized presentation of search results

Open Issues • Improve Accuracy of Classification Algorithms • Enhance User Interface • Heuristics for selecting categories and pages to display • Query_Match: rank of page, and sometimes match score • Categ_Match: p(category for each page) • Integration with non-content information • Conduct End-to-end User Study • More info: • http://research.microsoft.com/~sdumais

Searching With Information Structured Hierarchically SWISH
