Hierarchical Summaries for Search By: Dawn J. Lawrie University of Massachusetts, Amherst
The Problem

Possible Solution

Possible Solution

Solution: Automatic Hierarchies

Strengths of Automatic Hierarchies
• Word-based summary
• Focus on topics of the documents
• Allows users to navigate through the results
• Easy to understand
• Bonus: Useful for summarizing documents

Example
• Hand-generated hierarchy of 50 documents
• Query: “Endangered Species (Mammals)”
[Figure: hand-generated hierarchy; nesting not recoverable from the extracted text]
Terms as extracted: Endangered Animals (2910), Endangered plants (70), mammals (1710), marine (128), fish (70), whales (74), marine mammals (188), birds (30), sea lions (22), permits (102), insects (30), jaguars (20), amphibians (10), Critical Habitat (160), deer (11), Endangered Species Act (10), Hawaii (30), Melicope Species (10), manatees (11), California (20), Waianae Plant Cluster Recovery Plan (10), Threatened (10), legislation (64), rats (10), Utah (10), habitat protection (11), Ecosystem Management (20), Waianae Mountains (10), Virginia (10)

Proposed Framework
• Document Set → Language Model → Term Selection Algorithm → Hierarchy
• “Term” = word or phrase

Challenges
• Selecting terms for the hierarchy
• Displaying the hierarchy
• Showing that it works

Outline
• Introduction
• Description of framework for creating hierarchies
• Examples
• Methods of evaluation
• Future Improvements

Methodology
• Build probabilistic word model of documents
• Find “best” terms
  • On topic
  • Predictive
• Recursive definition creates hierarchy

Term characteristics
• Why topicality?
  • Distinguish topic terms from the rest of the vocabulary
  • Example text: “The Secretary of Interior listed bald eagles south of the 40th parallel as endangered under the Endangered Species Preservation Act of 1966.”
• Why predictiveness?
  • Topic words can be strongly related
  • Represent different facets of the vocabulary
  • Example: P(“Endangered” | “Steller sea lions”) = 1.00

Statistical Model
• A_T refers to topicality with respect to topic T
  • Find if the word w is in set T
• B refers to predictiveness
  • Precondition for other terms to occur
  • Find if word w is in set P

Probabilistic Word Model
• Captures statistical information about text
• Called a “language model” in speech recognition
• Provides basis for estimation of probabilities

Estimating Topicality
• Use term’s contribution to relative entropy
• Compares two models using K-L divergence
  • Model of documents in hierarchy
  • Model of general English

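The topicality estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the additive smoothing, the toy token lists, and the function names `unigram_model` and `kl_contributions` are all assumptions made for the example. Each term's score is its summand p(w) · log(p(w) / q(w)) in the K-L divergence between the document-set model p and the general-English model q.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram probabilities over a fixed vocabulary.

    The smoothing constant alpha is an assumption of this sketch.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_contributions(p_docs, p_general):
    """Per-term summand p(w) * log(p(w) / q(w)) of D(p_docs || p_general)."""
    return {w: p_docs[w] * math.log(p_docs[w] / p_general[w]) for w in p_docs}

# Toy data (hypothetical): a retrieved document set and a "general English" sample.
docs_tokens = "endangered species marine mammals endangered species the of".split()
general_tokens = "the of the of and the species of a the".split()
vocab = set(docs_tokens) | set(general_tokens)

p = unigram_model(docs_tokens, vocab)
q = unigram_model(general_tokens, vocab)
contrib = kl_contributions(p, q)

# Terms with the largest positive contribution are the most "on topic".
ranked = sorted(contrib, key=contrib.get, reverse=True)
kl_total = sum(contrib.values())
```

Words that are more frequent in the document set than in general English get positive scores; common function words such as "the" get negative scores and fall to the bottom of the ranking.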
KL Example
[Figure: terms shown are endangered, marine, species, fishery, mammal]

Estimating Predictiveness
• Relates the vocabulary to a set of candidate topic terms
• Use conditional probability P_x(t|v)
  • x is the maximum distance between t and v
[Table: P(t|v) with t and v each drawn from mammal, species, fishery, marine; cell layout not recoverable from the extracted text. Values as extracted: .98, .31, .35, .99, .31, .35, .50, .65, .65, .04, .03, .01]

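One way to estimate P_x(t|v) from text is sketched below. The slides do not spell out the estimator, so this is an assumption: the probability is taken as the fraction of occurrences of v that have t within x tokens on either side. The function name `predictiveness` and the toy documents are hypothetical, and only single-word terms are handled.

```python
def predictiveness(docs, t, v, x):
    """Estimate P_x(t | v) as the share of occurrences of v with t within x tokens.

    docs is a list of token lists. This windowed estimator is an assumption
    of the sketch, not a definition taken from the slides.
    """
    v_count = 0
    hits = 0
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok != v:
                continue
            v_count += 1
            # Up to x tokens before and x tokens after position i.
            window = tokens[max(0, i - x): i] + tokens[i + 1: i + 1 + x]
            if t in window:
                hits += 1
    return hits / v_count if v_count else 0.0

# Toy documents (hypothetical).
docs = [
    "endangered steller sea lions were listed".split(),
    "the endangered marine mammal population".split(),
    "marine fishery permits".split(),
]
p_endangered_given_mammal = predictiveness(docs, "endangered", "mammal", x=3)
p_mammal_given_marine = predictiveness(docs, "mammal", "marine", x=3)
```

Here "mammal" occurs once and "endangered" is within three tokens of it, so the first estimate is 1.0; "marine" occurs twice but only one occurrence has "mammal" nearby, so the second is 0.5.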
Dominating Set Approximation
• Interpret predictive language model as graph
  • Edges weighted by the conditional probability
• Finds terms that are connected to lots of terms with a high weight
• Chooses topic terms until vocabulary is dominated (predicted)

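The selection loop described above resembles the standard greedy approximation for dominating set. The sketch below is one possible reading, under stated assumptions: a vocabulary word v counts as dominated by t when P(t|v) meets a threshold, and each round picks the candidate with the largest total edge weight to still-undominated words. The threshold rule, the scoring, the function name `select_topic_terms`, and the probabilities in the example are illustrative and not taken from the slides.

```python
def select_topic_terms(cond_prob, candidates, vocabulary, threshold=0.5):
    """Greedily pick terms until every coverable vocabulary word is dominated.

    cond_prob maps (t, v) -> P(t | v). A word v counts as dominated by t when
    P(t | v) >= threshold (an assumed rule for this sketch).
    Returns (chosen terms in selection order, words left undominated).
    """
    undominated = set(vocabulary)
    chosen = []

    def gain(t):
        # Total weight of edges from t to words it would newly dominate.
        return sum(
            cond_prob.get((t, v), 0.0)
            for v in undominated
            if cond_prob.get((t, v), 0.0) >= threshold
        )

    while undominated:
        remaining = [t for t in candidates if t not in chosen]
        if not remaining:
            break
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # nothing left can be dominated at this threshold
        chosen.append(best)
        undominated -= {
            v for v in undominated
            if cond_prob.get((best, v), 0.0) >= threshold
        }
    return chosen, undominated

# Hypothetical conditional probabilities P(t | v).
cond_prob = {
    ("species", "mammal"): 0.9,
    ("species", "marine"): 0.6,
    ("marine", "fishery"): 0.7,
    ("marine", "mammal"): 0.5,
    ("fishery", "permit"): 0.3,
}
chosen, left = select_topic_terms(
    cond_prob,
    candidates=["species", "marine", "fishery"],
    vocabulary=["mammal", "marine", "fishery", "permit"],
)
```

With these numbers "species" is chosen first (it covers "mammal" and "marine"), then "marine" (it covers "fishery"); "permit" stays undominated because no candidate predicts it above the threshold.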
Term Selection Example
[Figure: graph with edges between v and t weighted by P(t|v)]

Generating a Summary
• 4-step process
  (1) Preprocess document set
  (2) Generate a language model
  (3) Select the terms
  (4) Create a hierarchy (recursive)

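The recursive step (4) can be illustrated as follows. This sketch assumes the documents are already preprocessed into token lists, and it replaces the language-model-based term selection of steps (2) and (3) with a deliberately simple stand-in (the k terms that appear in the most documents). The names `select_terms` and `build_hierarchy` and the parameters `k` and `depth` are hypothetical.

```python
from collections import Counter

def select_terms(docs, exclude, k):
    """Stand-in selector: the k terms appearing in the most documents.

    A real system would use the topicality and predictiveness estimates here.
    """
    df = Counter(w for doc in docs for w in set(doc) if w not in exclude)
    # Sort by document frequency (descending), then alphabetically for stable output.
    ordered = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ordered[:k]]

def build_hierarchy(docs, exclude=frozenset(), k=2, depth=2):
    """Return {term: (doc_count, sub_hierarchy)}, recursing on each term's documents."""
    if depth == 0 or not docs:
        return {}
    tree = {}
    for term in select_terms(docs, exclude, k):
        subset = [doc for doc in docs if term in doc]
        # Terms already used on the path from the root are excluded below it.
        tree[term] = (len(subset), build_hierarchy(subset, exclude | {term}, k, depth - 1))
    return tree

# Toy preprocessed documents (hypothetical).
docs = [
    ["marine", "mammals", "permit"],
    ["marine", "mammals", "whales"],
    ["marine", "fish"],
    ["birds", "habitat"],
]
tree = build_hierarchy(docs)
```

Each node stores the number of documents containing the term, mirroring the counts shown next to terms in the example hierarchies, and each child level is built only from the documents under its parent term.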
Outline
• Introduction
• Description of framework for creating hierarchies
• Examples
• Methods of evaluation
• Future Improvements

Example Hierarchies
• Generated from 50 documents retrieved for the query: Endangered Species - Mammals
• Demonstrate the difference between using different topic models
• Web hierarchy using same query

Uniform Topic Model Hierarchy
[Figure: generated hierarchy; nesting not recoverable from the extracted text]
Terms as extracted: species (439), marine mammals (187), plan (192), marine (187), amended (154), endangered (86), regulations (124), Act (41), fish (117), State (32), permit (146), Committee (43), number (93), address (85), bill (51), operations (43), Secretary (73), incidental take (42), research (105), NMFS (64), population (32), commercial fishing operations (42)

KL-Topic Model Hierarchy
[Figure: generated hierarchy; nesting not recoverable from the extracted text]
Terms as extracted: marine mammals (187), species (439), marine (187), Marine Mammal Protection Act (73), management plan (51), mammals (126), Endangered Species Act (294), marine mammal stocks (20), endangered species (204), marine mammal species (42), habitat (283), fishery (53), Marine Mammal Commission (21), Secretary (42), fish (277), NMFS (83), National Marine Fisheries Service (113), stock (51), fish species (32), Act (313), MMPA (51), permit (164), incidental (74), protection (244), research (63)

Web Hierarchies
• Submit query to a web search engine
• Gather titles and snippets of documents
• Each title-and-snippet text is treated as a document
• Documents are about 30 words long

Example of Web Hierarchy
[Figure: generated hierarchy; nesting not recoverable from the extracted text. Repeated terms belong to different branches.]
Terms as extracted: marine (76), Endangered Species (440), endangered (491), mammals (600), marine species (4), marine mammals (91), marine mammals (97), terrestrial mammals (2), animal species (1), birds (114), Endangered Mammals (22), Critically Endangered Mammals (2), endangered marine species (2), Endangered Mammals (13), threatened (144), Endangered Species Act (8), species of marine mammals (1), birds (140), threatened (78), species of mammals (27), Animal Info (2), Species Management (2), Listed Species (1), species of marine mammals (1), Mammals species (4), Ecosystems (2), Species Information (1), listing of species (1), Scientists (2), Endangered Species Coalition (2), Canadian Endangered Species (3), protected species (2), Protected Resources (2), native species (1), small mammals (13), Endangered Spaces (2), endangered mammal species (4), Candidate species (2), large mammals (12), sea otter (2), 100 species (1), British mammals (4), new species (1), dolphins (7), whales (13), List of Endangered Species (5), federal Endangered Species (1), Cetaceans (2)

Outline
• Introduction
• Description of framework for creating hierarchies
• Examples
• Methods of evaluation
• Future Improvements

Evaluations
• Summary Evaluation
  • Tests how well the topic terms chosen predict the vocabulary
• Access Evaluation
  • Compare number of documents a user can find
• Relevance Evaluation
  • Path length to find all relevant documents

Automatic Evaluation Test Set
• Use 50 standard queries
• Document sets
  • 500 documents retrieved from TREC volumes 4 and 5 (have relevance judgments)
  • 200 documents retrieved from a news database
  • 1000 titles and snippets retrieved using Google™ Search Engine

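Evaluating Hypotheses
• TREC Collection and News Documents
• Legend (marker symbols lost in extraction): one marker denotes that an evaluation confirmed the hypothesis; the other denotes that the evaluation showed no significant difference
• Hypotheses: Use KL-topic model; Use sub-collections
• Evaluations: Summary, Access, Relevance
[Table: hypotheses by evaluations; cell markers not recoverable from the extracted text]
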
Web Document Evaluation
• Results completely different
• Best hierarchy uses the uniform topic model
• Hierarchies do not look as good on human inspection

User Study
• Include 12 to 16 users
• Compare ranked list and hierarchy to ranked list alone
• Users asked to find all instances that are relevant to the query
  • Only have to identify one document about a particular instance
• Study includes 10 queries

Future Work
• Complete user study
• Failure Analysis
• Explore the use of topic hierarchies in other organizational tasks
  • Personal collections of documents
  • E-mails

Conclusions
• Developed a formal framework for topic hierarchies
• Created hierarchies from full text and snippets of documents
• Verified intuition concerning hierarchies generated from full text

Questions?
Demo: http://www-ciir.cs.umass.edu/~lawrie/categories/google-qry/