Maximizing Search Precision and Recall: Informal Interface and Clustering

SIMS 296a-3:UI Background Marti Hearst Fall ‘98

Interface Topics Today • (Other topics will be covered later) • Supporting the Dynamic Continuing Process of Search • Search Starting Points Marti Hearst UCB SIMS, Fall 98

Human Information Seeking Behavior

Standard Model • Assumptions: • Maximizing precision and recall simultaneously • The information need remains static • The value is in the resulting document set
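As a reminder of what the standard model tries to maximize, here is a minimal sketch of the two measures computed over sets of document ids. The function name and the ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is why maximizing both at once is a strong assumption.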

[Diagram: the standard retrieval model. Labels: User’s Information Need, text input, Parse, Query, Query Reformulation, Rank or Match, Index, Pre-process, Collections]

“Berry-Picking” as an Information Seeking Strategy (Bates 90) • Standard IR model • The information need remains the same throughout the search session. • Goal is to produce a perfect set of relevant docs. • Berry-picking model • The query is continually shifting. • Users may move through a variety of sources. • New information may yield new ideas and new directions. • The value of search is in the bits and pieces picked up along the way.

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 90) [Figure labels: Q0, Q1, Q2, Q3, Q4, Q5]

Implications • Interfaces should make it easy to store intermediate results • Interfaces should make it easy to follow trails with unanticipated results • Difficulties with evaluation

Supporting the Information Seeking Process • Two recent similar approaches that focus on supporting the process • SketchTrieve (Hendry & Harper 97) • DLITE (Cousins 97)

Informal Interface • Informal does not mean less useful • Show how the search is • unfolding or evolving • expanding or contracting • Prompt the user to • reformulate and abandon plans • backtrack to points of task deferral • make side-by-side comparisons • define and discuss problems

SketchTrieve: An Informal Interface (Hendry & Harper 96, 97) • A “spreadsheet” for information access • Make use of layout, space, and locality • comprehension and explanation • search planning • A data-flow notation for information seeking • link sources to queries • link both to retrieved documents • align results in space for comparison

SketchTrieve: Connecting Results with Next Query

DLITE (Cousins 97) • Drag and Drop interface • Reify queries, sources, retrieval results • Animation to keep track of activity

Starting Points for Search • Faced with a prompt or an empty entry form … how to start? • Lists of sources • Overviews • Clusters • Category Hierarchies/Subject Codes • Co-citation Links • Examples • Automatic source selection

List of Sources • Have to guess based on the name • Requires prior exposure/experience

Overviews in the User Interface • Unsupervised Groupings • Clustering • Kohonen Feature Maps • Supervised Categories • Yahoo! • Superbook • HiBrowse • Cat-a-Cone • Combinations • DynaCat • SONIA

Text Clustering • Finds overall similarities among groups of documents • Finds overall similarities among groups of tokens • Picks out some themes, ignores others

Text Clustering • Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [Plot axes: Term 1, Term 2]

Document/Document Matrix

Agglomerative Clustering [Figure: items A, B, C, D, E, F, G, H, I]
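The slide shows only the picture, so the following is a minimal sketch of bottom-up (agglomerative) grouping rather than the specific variant used in the lecture. It assumes single-link merging and Euclidean distance over small hypothetical 2D points; both choices are illustrative.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, k, dist=euclidean):
    """Merge the two closest clusters repeatedly until only k remain.

    Assumption: single link, i.e. the distance between two clusters is the
    distance between their closest pair of members.
    Returns a list of clusters, each a list of indices into `points`.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so i is unaffected
    return clusters


# Hypothetical points, for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
groups = agglomerate(points, 3)
print(groups)
```

Each pass compares every pair of clusters, so this naive version is only practical for small inputs such as a sample of the collection.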

K-Means Clustering • 1 Create a pair-wise similarity measure • 2 Find K centers using agglomerative clustering • take a small sample • group bottom up until K groups found • 3 Assign each document to nearest center, forming new clusters • 4 Repeat step 3 as necessary
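The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: documents are hypothetical dense term-weight vectors, the pair-wise measure is cosine similarity, groups are compared by their centroids during seeding, and the names `seed_centers`, `k_means`, `sample_size`, and `rounds` are invented for the example.

```python
import math
import random


def cosine(a, b):
    """Step 1: pair-wise similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def seed_centers(sample, k):
    """Step 2: group the sample bottom-up until k groups remain."""
    groups = [[v] for v in sample]
    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups)))
        # Merge the two groups whose centroids are most similar.
        i, j = max(pairs, key=lambda p: cosine(centroid(groups[p[0]]),
                                               centroid(groups[p[1]])))
        groups[i] += groups[j]
        del groups[j]
    return [centroid(g) for g in groups]


def k_means(docs, k, sample_size, rounds=10, seed=0):
    """Return (assignment, centers); assignment[i] is the cluster of docs[i]."""
    sample = random.Random(seed).sample(docs, sample_size)
    centers = seed_centers(sample, k)
    assignment = None
    for _ in range(rounds):
        # Step 3: assign each document to its most similar center.
        new_assignment = [
            max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # Step 4: stop repeating once no document moves.
        assignment = new_assignment
        for c in range(len(centers)):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # keep the old center if a cluster emptied out
                centers[c] = centroid(members)
    return assignment, centers


# Hypothetical three-term document vectors, for illustration only.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0],
        [0, 0.9, 0.1], [0, 0, 1], [0.1, 0, 0.9]]
assignment, centers = k_means(docs, k=3, sample_size=6)
print(assignment)
```

Seeding from a small sample keeps the expensive bottom-up step cheap, while the assignment passes touch every document only once per round.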

The Cluster Hypothesis “Closely associated documents tend to be relevant to the same requests.” van Rijsbergen 1979 “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.” van Rijsbergen 1979

Clustering as Categorization “In a traditional library environment … the items are classified first into subject areas, and a search is restricted to items within a few chosen subject classes. The same device can also be used … [to construct] groups of related documents and confining the search to certain groups only.” Salton 71

Clustering as Categorization “… In experiments we often want to vary the cluster representatives at search time. … Of course, were we to design an operational classification, the cluster representatives would be constructed once and for all at cluster time.” van Rijsbergen 79

Scatter/Gather (Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95) • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”
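To make the "topical terms and typical titles" display concrete, here is a minimal sketch of summarizing one cluster. It is not the Scatter/Gather implementation: it assumes raw term counts from whitespace tokenization, takes the most frequent terms as "topical", and takes as "typical" the title of the document most similar to the cluster's summed term vector. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def summarize(cluster, n_terms=3):
    """cluster: list of (title, text) pairs.

    Returns (topical_terms, typical_title).
    """
    vectors = [Counter(text.lower().split()) for _, text in cluster]
    total = Counter()
    for v in vectors:
        total.update(v)
    # Most frequent terms first; ties broken alphabetically for stability.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    terms = [t for t, _ in ranked[:n_terms]]
    # Cosine ignores scale, so the summed vector stands in for the centroid.
    best = max(range(len(cluster)), key=lambda i: cosine(vectors[i], total))
    return terms, cluster[best][0]


# Hypothetical documents, for illustration only.
cluster = [
    ("Stellar evolution", "star star galaxy nebula"),
    ("Galaxy formation", "galaxy star galaxy cluster"),
    ("Telescopes", "telescope star lens"),
]
terms, title = summarize(cluster, n_terms=2)
print(terms, title)
```

A practical version would at least remove stop words and weight terms, since raw counts favor words that are common everywhere.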

[Diagram labels: query, Collection, Rank, Cluster]

S/G Example: query on “star” • Encyclopedia text • Clusters (size, theme): 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous • Clustering and re-clustering are entirely automated

Two Queries: Two Clusterings • AUTO, CAR, ELECTRIC: 8 control drive accident …; 25 battery california technology …; 48 import j. rate honda toyota …; 16 export international unit japan; 3 service employee automatic … • AUTO, CAR, SAFETY: 6 control inventory integrate …; 10 investigation washington …; 12 study fuel death bag air …; 61 sale domestic truck import …; 11 japan export defect unite … • The main differences are the clusters that are central to the query

Publication History of Scatter/Gather (Publication timing may lag significantly behind when the work was done) • 1991 Patents Filed • SIGIR 92 Initial Algorithm Introduced • SIGIR 93 Optimizations Presented • AAAIFS 95 Examples of Use on Retrieval Results • TREC 95 Use in Interactive Track Experiments • CHI 96 Experiments providing evidence that users learn collection structure • SIGIR 96 Evidence that clustering can improve ranking for TREC-like scenario

Another use of clustering • Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. • “Project” these onto a 2D graphical representation:

Clustering Multi-Dimensional Document Space (image from Wise et al 95)

Concept “Landscapes” • Built using Kohonen Feature Maps (Xia Lin, H.C. Chen) [Figure labels: Disease, Pharmacology, Anatomy, Legal, Hospitals]

Visualization of Clusters • Huge 2D maps may be an inappropriate focus for information retrieval • Can’t see what documents are about • Documents forced into one position in semantic space • Space is difficult to use for IR purposes • Hard to view titles • Perhaps more suited for pattern discovery • problem: often only one view on the space

Using Clustering in Document Ranking • Cluster entire collection • Find cluster centroid that best matches the query • This has been explored extensively • it is expensive • it doesn’t work well
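The centroid-matching step can be sketched as below, assuming the collection has already been clustered. Documents and the query are hypothetical dense term-weight vectors compared by cosine similarity; ranking the members of the chosen cluster against the query is an added assumption, and `retrieve_by_cluster` is an invented name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def retrieve_by_cluster(query, clusters, docs):
    """clusters: lists of indices into docs, computed ahead of time.

    Returns the members of the best-matching cluster, most similar first.
    """
    centroids = [centroid([docs[i] for i in c]) for c in clusters]
    # Pick the cluster whose centroid best matches the query.
    best = max(range(len(clusters)), key=lambda c: cosine(query, centroids[c]))
    # Assumption: order that cluster's members by similarity to the query.
    return sorted(clusters[best], key=lambda i: cosine(query, docs[i]),
                  reverse=True)


# Hypothetical two-term vectors, for illustration only.
docs = [[1, 0], [0.8, 0.2], [0, 1], [0.1, 0.9]]
clusters = [[0, 1], [2, 3]]
result = retrieve_by_cluster([0.2, 1.0], clusters, docs)
print(result)
```

The sketch hides the expensive part: clustering the whole collection has to happen before any query arrives, and relevant documents in other clusters are never seen.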

Using Clustering in Interfaces • Alternative (scatter/gather): • cluster top-ranked documents • show cluster summaries to user • Seems useful • experiments show relevant docs tend to end up in the same cluster • users seem able to interpret and use the cluster summaries some of the time • More computationally feasible

Clustering • Advantages: • Sometimes discover meaningful themes • Data-driven, so reflect emphases present in the collection of documents • Can differentiate heterogeneous collections • Domain independent • Disadvantages • Variability in quality of results • Only one view on documents’ themes • Not good at differentiating homogeneous collections • Require interpretation • May mis-match users’ interests

Incorporating Categories into the Interface • Yahoo is the standard method • Problems: • Hard to search, meant to be navigated. • Only one category per document (usually)

Integrated Browsing & Search • Search for category labels • Browse category labels • Search within document collection • Browse resulting documents in book

Example: MeSH and MedLine • MeSH Category Hierarchy • ~18,000 labels • manually assigned • ~8 labels/article on average • avg depth: 4.5, max depth 9 • Top Level Categories: anatomy, diagnosis, related disc, animals, psych, technology, disease, biology, humanities, drugs, physics

Large Category Sets • Problems for User Interfaces • Too many categories to browse • Too many docs per category • Docs belong to multiple categories • Need to integrate search • Need to show the documents • We’ll discuss this more next week.

Category Labels • Advantages: • Interpretable • Capture summary information • Describe multiple facets of content • Domain dependent, and so descriptive • Disadvantages • Do not scale well (for organizing documents) • Domain dependent, so costly to acquire • May mis-match users’ interests

Other Starting Points Approaches • Co-citation Links • Examples, Guided Tours
