
Maximizing Search Precision and Recall: Informal Interface and Clustering

Explore human information seeking behavior and how SketchTrieve and DLITE support the dynamic search process. Learn about text clustering techniques for better categorization.



Presentation Transcript


  1. SIMS 296a-3: UI Background • Marti Hearst • Fall ’98

  2. Interface Topics Today • Supporting the Dynamic, Continuing Process of Search • Search Starting Points • (Other topics will be covered later)

  3. Human Information Seeking Behavior

  4. Standard Model • Assumptions: • Maximizing precision and recall simultaneously • The information need remains static • The value is in the resulting document set

  5. [Diagram: the standard IR pipeline. The user’s information need becomes text input, which is parsed into a query; the query is matched or ranked against a pre-processed, indexed collection; query reformulation feeds back into the loop.]
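To make the pipeline concrete, here is a minimal sketch (not from the slides) of the loop in the diagram. The parse and match_and_rank helpers and the toy index are invented stand-ins, not any particular system’s API.

```python
# A minimal sketch of the standard-model loop from the diagram above.
def parse(text_input):
    return text_input.lower().split()          # tokenize the user's text input

def match_and_rank(query, index):
    # Score each doc by how many query terms it contains (toy matcher).
    scores = {doc: sum(t in terms for t in query) for doc, terms in index.items()}
    return sorted(scores, key=scores.get, reverse=True)

index = {"d1": {"star", "galaxy"}, "d2": {"star", "film"}}  # pre-processed collection
query = parse("star galaxy")
print(match_and_rank(query, index))  # user inspects ranking, then reformulates
```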

  6. “Berry-Picking” as an Information Seeking Strategy (Bates 90) • Standard IR model • The information need remains the same throughout the search session. • Goal is to produce a perfect set of relevant docs. • Berry-picking model • The query is continually shifting. • Users may move through a variety of sources. • New information may yield new ideas and new directions. • The value of search lies in the bits and pieces picked up along the way.

  7. A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 90) [Diagram: a winding path through a shifting sequence of queries Q0 through Q5]

  8. Implications • Interfaces should make it easy to store intermediate results • Interfaces should make it easy to follow trails with unanticipated results • Difficulties with evaluation

  9. Supporting the Information Seeking Process • Two recent similar approaches that focus on supporting the process • SketchTrieve (Hendry & Harper 97) • DLITE (Cousins 97)

  10. Informal Interface • Informal does not mean less useful • Show how the search is • unfolding or evolving • expanding or contracting • Prompt the user to • reformulate and abandon plans • backtrack to points of task deferral • make side-by-side comparisons • define and discuss problems

  11. SketchTrieve: An Informal Interface (Hendry & Harper 96, 97) • A “spreadsheet” for information access • Make use of layout, space, and locality • comprehension and explanation • search planning • A data-flow notation for information seeking • link sources to queries • link both to retrieved documents • align results in space for comparison

  12. SketchTrieve: Connecting Results with Next Query

  13. DLITE (Cousins 97) • Drag and Drop interface • Reify queries, sources, retrieval results • Animation to keep track of activity

  14. Starting Points for Search • Faced with a prompt or an empty entry form … how to start? • Lists of sources • Overviews • Clusters • Category Hierarchies/Subject Codes • Co-citation Links • Examples • Automatic source selection

  15. List of Sources • Have to guess based on the name • Requires prior exposure/experience

  16. [Screenshot slide]

  17. Overviews in the User Interface • Unsupervised Groupings • Clustering • Kohonen Feature Maps • Supervised Categories • Yahoo! • Superbook • HiBrowse • Cat-a-Cone • Combinations • DynaCat • SONIA

  18. Text Clustering • Finds overall similarities among groups of documents • Finds overall similarities among groups of tokens • Picks out some themes, ignores others
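One standard way to make “overall similarity” concrete is cosine similarity between term-frequency vectors. A minimal sketch, with invented toy documents:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)          # missing terms count as 0
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = Counter("star galaxy star astronomy".split())
d2 = Counter("star film tv star".split())
print(cosine(d1, d2))   # similarity driven by the shared term "star"
```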

  19. Text Clustering • Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [Scatter plot: documents plotted in a two-term space, Term 1 vs. Term 2]


  21. Document/Document Matrix
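A sketch of how such a matrix can be computed: normalize each document vector, then a single matrix product yields all pairwise cosine similarities (the three toy vectors are invented for illustration).

```python
import numpy as np

# Rows of X are toy document term-frequency vectors; the
# document/document matrix holds pairwise cosine similarities.
X = np.array([[2., 1., 0.],     # doc A
              [1., 2., 0.],     # doc B
              [0., 0., 3.]])    # doc C
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T                  # sim[i, j] = cosine(doc i, doc j)
print(np.round(sim, 2))
```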

  22–24. Agglomerative Clustering [Three build slides: a dendrogram forms bottom-up over documents A–I as the closest documents and groups are merged step by step.]
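A sketch of that bottom-up merging using SciPy’s hierarchical clustering. The nine random vectors stand in for the documents A–I on the slides, and cutting the dendrogram at three groups mimics stopping the merges early.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
docs = rng.random((9, 5))        # toy stand-ins for documents A..I
labels = list("ABCDEFGHI")

# Bottom-up (agglomerative) merging: each step joins the two closest groups.
Z = linkage(docs, method="average", metric="cosine")
for label, cluster_id in zip(labels, fcluster(Z, t=3, criterion="maxclust")):
    print(label, "-> cluster", cluster_id)
```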

  25. K-Means Clustering • 1 Create a pair-wise similarity measure • 2 Find K centers using agglomerative clustering • take a small sample • group bottom up until K groups found • 3 Assign each document to nearest center, forming new clusters • 4 Repeat 3 as necessary
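A sketch of the recipe above, with the agglomerative step used only to seed the K centers. Assumptions: Euclidean distance over random toy vectors; kmeans_buckshot is an invented name, not the original implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def kmeans_buckshot(X, k, iters=10, sample=50, seed=0):
    """K-means seeded by agglomeratively clustering a small sample,
    following the four steps on the slide (a sketch, not the original)."""
    rng = np.random.default_rng(seed)
    # Step 2: group a small sample bottom-up until k groups are found.
    s = X[rng.choice(len(X), size=min(sample, len(X)), replace=False)]
    groups = fcluster(linkage(s, method="average"), t=k, criterion="maxclust")
    centers = np.array([s[groups == g].mean(axis=0) for g in range(1, k + 1)])
    for _ in range(iters):
        # Step 3: assign each document to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 4: recompute centers and repeat (keep old center if a cluster empties).
        centers = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return assign

X = np.random.default_rng(1).random((200, 5))  # toy document vectors
print(np.bincount(kmeans_buckshot(X, k=4)))    # sizes of the 4 clusters
```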

  26. The Cluster Hypothesis • “Closely associated documents tend to be relevant to the same requests.” van Rijsbergen 1979 • “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.” van Rijsbergen 1979

  27. Clustering as Categorization • “In a traditional library environment … the items are classified first into subject areas, and a search is restricted to items within a few chosen subject classes. The same device can also be used … [to construct] groups of related documents and confining the search to certain groups only.” Salton 71

  28. Clustering as Categorization • “… In experiments we often want to vary the cluster representatives at search time. … Of course, were we to design an operational classification, the cluster representatives would be constructed once and for all at cluster time.” van Rijsbergen 79

  29. Scatter/Gather (Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95) • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”
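A sketch of the cluster “digest” idea: summarize a cluster by the highest-weight terms of its centroid (topical terms) and the member documents nearest the centroid (typical titles). All data and the helper name here are invented for illustration, not the Scatter/Gather code.

```python
import numpy as np

def cluster_digest(X, titles, vocab, assign, cluster_id, n_terms=2, n_titles=2):
    """Summarize one cluster by topical terms and typical titles (a sketch)."""
    members = np.flatnonzero(assign == cluster_id)
    centroid = X[members].mean(axis=0)
    # Topical terms: highest-weight terms in the cluster centroid.
    top_terms = [vocab[i] for i in np.argsort(centroid)[::-1][:n_terms]]
    # Typical titles: member documents closest to the centroid.
    dists = np.linalg.norm(X[members] - centroid, axis=1)
    top_titles = [titles[members[i]] for i in np.argsort(dists)[:n_titles]]
    return top_terms, top_titles

vocab = ["star", "film", "galaxy", "music"]
X = np.array([[3., 0., 2., 0.], [2., 0., 3., 0.], [0., 3., 0., 1.]])
titles = ["Stellar evolution", "Galaxies and stars", "Movie stars"]
assign = np.array([0, 0, 1])
print(cluster_digest(X, titles, vocab, assign, cluster_id=0))
```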

  30. [Diagram: query → Collection → Rank → Cluster]

  31. S/G Example: query on “star” (Encyclopedia text) • 14 sports • 8 symbols • 47 film, tv • 68 film, tv (p) • 7 music • 97 astrophysics • 67 astronomy (p) • 12 stellar phenomena • 10 flora/fauna • 49 galaxies, stars • 29 constellations • 7 miscellaneous • Clustering and re-clustering is entirely automated

  32. Two Queries: Two Clusterings • Query AUTO, CAR, ELECTRIC: • 8 control drive accident … • 25 battery california technology … • 48 import j. rate honda toyota … • 16 export international unit japan • 3 service employee automatic … • Query AUTO, CAR, SAFETY: • 6 control inventory integrate … • 10 investigation washington … • 12 study fuel death bag air … • 61 sale domestic truck import … • 11 japan export defect unite … • The main differences are the clusters that are central to the query

  33. Publication History of Scatter/Gather (Publication timing may lag significantly behind when the work was done) • 1991 Patents Filed • SIGIR 92 Initial Algorithm Introduced • SIGIR 93 Optimizations Presented • AAAIFS 95 Examples of Use on Retrieval Results • TREC 95 Use in Interactive Track Experiments • CHI 96 Experiments providing evidence that users learn collection structure • SIGIR 96 Evidence that clustering can improve ranking for TREC-like scenario

  34. Another use of clustering • Use clustering to map the entire huge multidimensional document space into a large number of small clusters. • “Project” these onto a 2D graphical representation:
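A sketch of the projection step under a simplifying assumption: use a plain SVD (principal components) to drop the cluster centroids from the term space down to 2D for a map-style display. Systems such as the one pictured next used their own projection and layout methods.

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.random((500, 300))   # toy: 500 cluster centroids, 300-dim term space

# Center the data, then project onto the top two principal directions.
centered = centroids - centroids.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
xy = centered @ vt[:2].T             # 2D coordinates for plotting the "map"
print(xy.shape)                      # (500, 2)
```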

  35. Clustering Multi-Dimensional Document Space (image from Wise et al 95)

  36. Concept “Landscapes” [Map regions labeled: Disease, Pharmacology, Anatomy, Legal, Hospitals] • Built using Kohonen Feature Maps • Xia Lin, H.C. Chen

  37. Visualization of Clusters • Huge 2D maps may be an inappropriate focus for information retrieval • Can’t see what documents are about • Documents are forced into one position in semantic space • Space is difficult to use for IR purposes • Hard to view titles • Perhaps more suited for pattern discovery • problem: often only one view on the space

  38. Using Clustering in Document Ranking • Cluster entire collection • Find cluster centroid that best matches the query • This has been explored extensively • it is expensive • it doesn’t work well
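For concreteness, a sketch of the centroid-matching step being critiqued: pick the cluster whose centroid is most cosine-similar to the query vector. The data is toy and nearest_cluster is an invented helper, not a reference implementation.

```python
import numpy as np

def nearest_cluster(query_vec, centroids):
    """Index of the cluster centroid most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

centroids = np.array([[0.9, 0.1], [0.1, 0.9]])           # toy two-cluster collection
print(nearest_cluster(np.array([1.0, 0.2]), centroids))  # -> 0
```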

  39. Using Clustering in Interfaces • Alternative (scatter/gather): • cluster top-ranked documents • show cluster summaries to user • Seems useful • experiments show relevant docs tend to end up in the same cluster • users seem able to interpret and use the cluster summaries some of the time • More computationally feasible
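A sketch of that cheaper alternative: take only the top-ranked documents for a query and cluster just those. The vectors, scores, and the helper name are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_top_ranked(X, scores, n_top=50, k=5):
    """Cluster only the top-ranked documents for a query (a sketch).
    Cheap, because only n_top documents are clustered, not the collection."""
    top = np.argsort(scores)[::-1][:n_top]
    groups = fcluster(linkage(X[top], method="average"), t=k, criterion="maxclust")
    return top, groups

rng = np.random.default_rng(2)
X = rng.random((1000, 20))            # toy document vectors
scores = rng.random(1000)             # toy retrieval scores
top, groups = cluster_top_ranked(X, scores)
print(np.bincount(groups)[1:])        # sizes of the k clusters over the top docs
```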

  40. Clustering • Advantages: • Sometimes discover meaningful themes • Data-driven, so reflect emphases present in the collection of documents • Can differentiate heterogeneous collections • Domain independent • Disadvantages: • Variability in quality of results • Only one view on documents’ themes • Not good at differentiating homogeneous collections • Require interpretation • May mis-match users’ interests

  41. Incorporating Categories into the Interface • Yahoo! is the standard approach • Problems: • Hard to search; meant to be navigated. • Only one category per document (usually)

  42. [Screenshot slide]

  43. Integrated Browsing & Search • Search for category labels • Browse category labels • Search within document collection • Browse resulting documents in book

  44. Example: MeSH and MedLine • MeSH Category Hierarchy • ~18,000 labels • manually assigned • ~8 labels/article on average • avg depth: 4.5, max depth: 9 • Top-level categories: anatomy, diagnosis, related disc, animals, psych, technology, disease, biology, humanities, drugs, physics

  45. Large Category Sets • Problems for User Interfaces • Too many categories to browse • Too many docs per category • Docs belong to multiple categories • Need to integrate search • Need to show the documents • We’ll discuss this more next week.

  46. Category Labels • Advantages: • Interpretable • Capture summary information • Describe multiple facets of content • Domain dependent, and so descriptive • Disadvantages • Do not scale well (for organizing documents) • Domain dependent, so costly to acquire • May mis-match users’ interests

  47. Other Starting-Point Approaches • Co-citation Links • Examples, Guided Tours
