
Castanet: Using WordNet to Build Facet Hierarchies

This study focuses on using WordNet to create hierarchical faceted metadata to enhance search and navigation on websites. Explore how facets can improve browsing efficiency and information organization across various subjects.

chambersb

Presentation Transcript


  1. Castanet: Using WordNet to Build Facet Hierarchies • Emilia Stoica and Marti Hearst, School of Information, Berkeley

  2. Focus: Search and Navigation of Large Collections • Shopping Sites • Digital Libraries • E-Government Sites • Image Collections

  3. Problems with Site Search • Study by Vividence in 2001 on 69 Sites • 70% eCommerce • 31% Service • 21% Content • 2% Community • Poorly organized search results • Frustration and wasted time • Poor information architecture • Confusion • Dead ends • "back and forthing" • Forced to search

  4. The Problem With Hierarchy • Most things can be classified in more than one way. • Most organizational systems do not handle this well. • Example: Animal Classification [Figure: the animals otter, penguin, robin, salmon, wolf, cobra, bat, and seal grouped three different ways: by Skin Covering, by Locomotion, and by Diet]

  5. The Problem With Hierarchy [Figure: a single tree that branches from start into swim, fly, run, slither; repeats fur, scales, feathers under each of those branches; repeats fish, rodents, insects under each of those; and places animals such as salmon, bat, robin, and wolf at the leaves]

  6. The Idea of Facets • Facets are a way of labeling data • A kind of Metadata (data about data) • Can be thought of as properties of items • Facets vs. Categories • Items are placed INTO a category system • Multiple facet labels are ASSIGNED TO items

  7. The Idea of Facets • Example item: Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot • Facet labels assigned to it: Fruit > Apricot; Flavor > gingerroot; Vegetables > pepper; Meat > Chicken

  8. Using Facets • Now there are multiple ways to get to each item • Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze • Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan) • Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple • Example label sets: Fruit > Pineapple, Dessert > Cake, Preparation > Bake; and Dessert > Dairy > Sherbet, Fruit > Berries > Strawberries, Preparation > Freeze
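
The idea that several facet labels are assigned to one item, so that the item can be reached along any of them, can be sketched in a few lines of Python. This is only an illustration: the two item names and the helper names (`ITEMS`, `has_label`, `browse`) are hypothetical, and only the facet paths themselves come from the slide.

```python
from typing import Dict, List, Set, Tuple

FacetPath = Tuple[str, ...]

# Item names are hypothetical; the facet paths are the ones listed on the slide.
ITEMS: Dict[str, Set[FacetPath]] = {
    "pineapple cake": {
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    },
    "strawberry sherbet": {
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    },
}


def has_label(labels: Set[FacetPath], query: FacetPath) -> bool:
    """True if any assigned label equals the query path or lies beneath it."""
    return any(label[:len(query)] == query for label in labels)


def browse(items: Dict[str, Set[FacetPath]], *queries: FacetPath) -> List[str]:
    """Return the items that satisfy every selected facet path, in name order."""
    return sorted(
        name
        for name, labels in items.items()
        if all(has_label(labels, query) for query in queries)
    )
```

Calling `browse(ITEMS, ("Fruit",))` reaches both items through the Fruit facet, while `browse(ITEMS, ("Dessert", "Dairy"))` narrows to the sherbet; the same item is reachable from Preparation as well, which is the point of the slide.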

  9. Castanet • Semi-automatic algorithm for creating hierarchical faceted metadata • Carves out a structure from the hypernym (IS-A) relations within WordNet • Produces surprisingly good results for a wide range of subjects • e.g., arts, medicine, recipes, math, news, bibliographical records

  10. WordNet Challenges • A word may have more than one sense [Figure: WordNet senses of "tuna" (labeled #1 and #2), with nodes cactus, food fish, fish, and bony fish] • Fine granularity of word sense distinctions, e.g., newspaper (#1): daily publication on folded sheets; newspaper (#3): physical object • Ambiguity for the same sense

  11. WordNet Challenges (cont.) • The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes) • Sparse coverage of proper names and noun phrases (not addressed)

  12. Algorithm Goals • Build a set of facet hierarchies • Balance depth and breadth • Avoid “skinny” paths • Don’t go too deep or too broad • Choose understandable labels • Disambiguate words • Currently a word can take on only one sense

  13. Our Approach [Figure: pipeline that takes Documents and WordNet as input: Select terms, Build core tree, Augment core tree, Compress tree, Remove top level categories, Divide into facets]

  14. Step 1: Select Terms • Select well-distributed terms from the collection • Eliminate stopwords • Retain only those terms with a distribution higher than a threshold (default: top 10%)
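
A minimal sketch of this term-selection step, under several assumptions that the slide does not confirm: "distribution" is read as the number of documents that contain a term, tokenization is a simple lowercase letter match, the stopword list is a tiny illustrative one, and ties are broken alphabetically. The function name `select_terms` is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Iterable, List, Set

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS: Set[str] = {"a", "an", "and", "the", "of", "with", "in", "to", "for"}


def select_terms(
    docs: Iterable[str],
    top_fraction: float = 0.10,
    stopwords: Set[str] = STOPWORDS,
) -> List[str]:
    """Keep the top fraction of non-stopword terms, ranked by document count."""
    doc_freq: Counter = Counter()
    for doc in docs:
        # Count each term once per document (assumed meaning of "distribution").
        tokens = set(re.findall(r"[a-z]+", doc.lower()))
        doc_freq.update(tokens - stopwords)
    if not doc_freq:
        return []
    keep = max(1, math.ceil(len(doc_freq) * top_fraction))
    # Highest document count first; alphabetical order breaks ties.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:keep]]
```

The default of 0.10 mirrors the "top 10%" on the slide; whether the original system applies the cutoff to the ranked vocabulary in exactly this way is an assumption.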

  15. Step 2: Build Core Tree • Build a “backbone” • Create paths from unambiguous terms only • Bias the structure towards appropriate senses of words • Get hypernym path if term: has only one sense, or matches a pre-selected WordNet domain • Adding a new term increases a count at each node on its path by # of docs with the term. [Figure: two hypernym paths, entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae, and entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet]

  16. Step 2: Build Core Tree (cont.) • Merge hypernym paths to build a tree [Figure: the sundae and sherbet paths share entity > substance, matter > nutriment > dessert > frozen dessert and are merged into one tree that branches below frozen dessert into ice cream sundae > sundae and sherbet, sorbet > sherbet]
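
Merging hypernym paths into a tree while updating per-node counts can be sketched as follows. The two paths are the ones shown on the slides; the `Node` class, the `add_path` helper, and the document counts used in the usage note are hypothetical, and the paths are hard-coded here rather than looked up in WordNet.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass
class Node:
    label: str
    count: int = 0  # number of documents assigned at or below this node
    children: Dict[str, "Node"] = field(default_factory=dict)


def add_path(root: Node, path: Sequence[str], doc_count: int) -> None:
    """Merge one root-to-term hypernym path into the tree.

    Every node on the path has its count increased by the number of
    documents that contain the term, as described on the slide.
    """
    if not path or path[0] != root.label:
        raise ValueError("path must start at the root label")
    root.count += doc_count
    node = root
    for label in path[1:]:
        # Reuse an existing child so shared prefixes are merged, not duplicated.
        node = node.children.setdefault(label, Node(label))
        node.count += doc_count


SUNDAE_PATH = ("entity", "substance, matter", "nutriment", "dessert",
               "frozen dessert", "ice cream sundae", "sundae")
SHERBET_PATH = ("entity", "substance, matter", "nutriment", "dessert",
                "frozen dessert", "sherbet, sorbet", "sherbet")
```

If, say, 12 documents mention sundae and 5 mention sherbet (made-up numbers), adding both paths to `Node("entity")` leaves a single shared spine down to frozen dessert with a count of 17, and two branches below it.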

  17. Step 3: Augment Core Tree • Attach to the core tree the terms with more than one sense • Favor the more common path over other alternatives

  18. Step 3: Augment Core Tree (cont.) • Example: two candidate paths for "date" • Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date • Date (p2): abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date • Choose p1 since it has more items assigned (78 vs. 18)
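
One way to express "favor the path with more items assigned" in code is to score each candidate path by the count already stored in the core tree at its closest existing ancestor. That scoring rule is an assumption made for this sketch (the slide only shows counts at the immediate parents), and `path_score`, `choose_path`, and the `core_counts` mapping are hypothetical names.

```python
from typing import Dict, Sequence, Tuple

Path = Tuple[str, ...]

P1: Path = ("entity", "substance, matter", "food, nutrient", "nutriment",
            "food", "edible fruit", "date")
P2: Path = ("abstraction", "measure, quantity", "fundamental quality",
            "time period", "calendar day", "date")


def path_score(path: Path, core_counts: Dict[Path, int]) -> int:
    """Count at the closest ancestor of the term that exists in the core tree."""
    # Walk upward from the term's parent toward the root.
    for depth in range(len(path) - 1, 0, -1):
        prefix = path[:depth]
        if prefix in core_counts:
            return core_counts[prefix]
    return 0  # no ancestor of this sense is in the core tree


def choose_path(candidates: Sequence[Path], core_counts: Dict[Path, int]) -> Path:
    """Pick the candidate sense path whose existing ancestor has the most items."""
    return max(candidates, key=lambda path: path_score(path, core_counts))
```

With the counts from the slide, 78 at edible fruit and 18 at calendar day, `choose_path` returns the food path for "date".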

  19. Optional Step: Domains • To disambiguate, use Domains • WordNet has 212 Domains • medicine, mathematics, biology, chemistry, linguistics, soccer, etc. • A better collection has been developed by Magnini 2000 • Assigns a domain to every noun synset • Automatically scan the collection to see which domains apply • The user selects which of the suggested domains to use or may add their own • Paths for terms that match the selected domains are added to the core tree

  20. Using Domains • dip glosses: Sense 1: A depression in an otherwise level surface; Sense 2: The angle that a magnetic needle makes with the horizon; Sense 3: Tasty mixture into which bite-size foods are dipped • dip hypernyms: Sense 1: solid => concave shape => depression; Sense 2: shape, form => space => angle; Sense 3: food => ingredient, fixings => flavorer • Given domain “food”, choose sense 3
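
The domain-based choice of a sense can be sketched as a simple filter. The glosses and hypernym chains are taken from the slide; the `Sense` class, the `pick_sense` helper, and the domain labels given to senses 1 and 2 are illustrative assumptions (only the "food" domain for sense 3 is implied by the slide).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Sense:
    number: int
    gloss: str
    hypernyms: Tuple[str, ...]
    domain: str  # illustrative label, standing in for a per-synset domain


DIP: Tuple[Sense, ...] = (
    Sense(1, "a depression in an otherwise level surface",
          ("solid", "concave shape", "depression"), "geography"),
    Sense(2, "the angle that a magnetic needle makes with the horizon",
          ("shape, form", "space", "angle"), "physics"),
    Sense(3, "tasty mixture into which bite-size foods are dipped",
          ("food", "ingredient, fixings", "flavorer"), "food"),
)


def pick_sense(senses: Sequence[Sense], selected_domains: Set[str]) -> Optional[Sense]:
    """Return the only sense whose domain was selected, else None.

    None means the term stays ambiguous (no match, or more than one match).
    """
    matching = [sense for sense in senses if sense.domain in selected_domains]
    return matching[0] if len(matching) == 1 else None
```

With `{"food"}` selected, `pick_sense(DIP, {"food"})` yields sense 3, whose hypernym path could then be added to the core tree.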

  21. Step 4: Compress Tree • Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist [Figure: before, dessert > frozen dessert, with children ice cream sundae > sundae, parfait, and sherbet, sorbet > sherbet; after, dessert > frozen dessert, with children sundae, parfait, sherbet]

  22. Step 4: Compress Tree (cont.) • Rule 2: Eliminate a child whose name appears within the parent’s name [Figure: before, dessert > frozen dessert, with children sundae, parfait, sherbet; after, dessert with children sundae, parfait, sherbet]
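
The two compression rules can be sketched together as one bottom-up pass. Several choices here are assumptions rather than the authors' method: "eliminate" is taken to mean removing the node and promoting its children; `maxdist` is passed in by the caller; Rule 2 follows the slide's example, in which frozen dessert is dropped under dessert because the two names overlap; and leaf terms are never dropped by Rule 2. All names and the counts in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    count: int = 0
    children: List["Node"] = field(default_factory=list)


def names_overlap(parent: str, child: str) -> bool:
    """True if one name occurs as a whole-word phrase inside the other."""
    def names(label: str) -> List[str]:
        return [name.strip().lower() for name in label.split(",")]

    def within(inner: str, outer: str) -> bool:
        return f" {inner} " in f" {outer} "

    return any(within(p, c) or within(c, p)
               for p in names(parent) for c in names(child))


def compress(node: Node, k: int, maxdist: float, is_root: bool = True) -> List[Node]:
    """Compress the subtree and return the node(s) that take its place."""
    compressed: List[Node] = []
    for child in node.children:
        compressed.extend(compress(child, k, maxdist, is_root=False))
    node.children = compressed

    # Rule 1: a non-root parent with fewer than k children and a small
    # distribution is removed; its children move up one level.
    if not is_root and 0 < len(node.children) < k and node.count <= 0.1 * maxdist:
        return node.children

    # Rule 2 (as read from the slide's example): drop an internal child whose
    # name overlaps with this node's name, promoting the grandchildren.
    kept: List[Node] = []
    for child in node.children:
        if child.children and names_overlap(node.label, child.label):
            kept.extend(child.children)
        else:
            kept.append(child)
    node.children = kept
    return [node]
```

On the slide's dessert example with k = 2 and made-up counts, ice cream sundae and sherbet, sorbet each have a single child and are removed by Rule 1, and frozen dessert is then removed by Rule 2, leaving dessert with the children sundae, parfait, and sherbet.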

  23. Step 5: Divide into Facets

  24. Step 5: Divide into Facets (Remove top levels) • Rule 1: Manually eliminate the top t levels (t = 4 for recipe collection). • Rule 2: For each resulting tree, test if it has more than n children (n = 2). If yes, stop. If not, delete the root and test again. [Figure: example path entity > substance, matter > food, nutriment > food stuff, food product > ingredient, fixings > flavorer, with herb (parsley, oregano) and sweetening (sugar, syrup) under flavorer, shown before and after the top levels are removed]
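
A sketch of the two facet-division rules on the flavorer example follows. The helper names are hypothetical, and one behavior is an assumption: a root with no children is kept rather than deleted, so that no term is lost.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def strip_top_levels(root: Node, t: int) -> List[Node]:
    """Rule 1: remove the top t levels and return the subtrees left below them."""
    level = [root]
    for _ in range(t):
        level = [child for node in level for child in node.children]
    return level


def split_until_broad(tree: Node, n: int) -> List[Node]:
    """Rule 2: delete roots with at most n children, testing each child in turn."""
    if not tree.children or len(tree.children) > n:
        return [tree]  # broad enough, or a single term (kept by assumption)
    facets: List[Node] = []
    for child in tree.children:
        facets.extend(split_until_broad(child, n))
    return facets


def divide_into_facets(root: Node, t: int, n: int = 2) -> List[Node]:
    return [facet
            for subtree in strip_top_levels(root, t)
            for facet in split_until_broad(subtree, n)]


def chain(labels: Sequence[str], bottom: Node) -> Node:
    """Build a single-child chain of nodes ending at `bottom`."""
    node = bottom
    for label in reversed(labels):
        node = Node(label, [node])
    return node


FLAVORER = Node("flavorer", [
    Node("herb", [Node("parsley"), Node("oregano")]),
    Node("sweetening", [Node("sugar"), Node("syrup")]),
])
ROOT = chain(["entity", "substance, matter", "food, nutriment",
              "food stuff, food product", "ingredient, fixings"], FLAVORER)
```

The toy tree is far smaller than a real collection's, so calling `divide_into_facets(ROOT, t=4, n=1)` is what yields flavorer as a facet root here; with the slide's n = 2 the same toy tree keeps splitting down to individual terms, which would not normally happen on a tree with many more children per node.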

  25. Example: Recipes (3500 docs)

  26. Castanet Output (shown in Flamenco)

  27. Castanet Output

  28. Castanet Output

  29. Castanet Output

  30. Castanet Output

  31. Castanet Evaluation • This is a tool for information architects, so information architects did the evaluation • We compared output on • Recipes • Biomedical journal titles • We compared to two state-of-the-art algorithms • LDA (Blei et al. 04) • Subsumption (Sanderson & Croft ’99)

  32. Subsumption Output

  33. Subsumption Output

  34. Subsumption Output

  35. Subsumption Output

  36. LDA Output

  37. LDA Output

  38. LDA Output

  39. Evaluation Method • Information architects assessed the category systems • For each of 2 systems’ output: • Examined and commented on top-level • Examined and commented on two sub-levels • Then commented on overall properties • Meaningful? • Systematic? • Likely to use in your work?

  40. Evaluation (cont.) • Sample questions for top level categories: - Would you add/remove/rename any category? - Did this category match your expectations? • Sample questions for a specific category: - Would you add/move/remove any sub-categories? - Would you promote any sub-category to top level? • General questions: - Would you use Castanet? - Would you use LDA? - Would you use Subsumption? - Would you use a list of most frequent terms?

  41. Evaluation Results • Results on recipes collection for “Would you use this system in your work?” • Number answering “yes in some cases” or “yes, definitely”: • Castanet: 29/34 • LDA: 0/18 • Subsumption: 6/16 • Baseline: 25/34 • Average response to questions about quality (4 = “strongly agree”)
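
Because the denominators differ across systems, the counts are easier to compare as percentages. The short calculation below only restates the figures from the slide; the names `RESPONSES`, `percent_yes`, and `PERCENTAGES` are introduced here for illustration.

```python
from typing import Dict, Tuple

# (number answering yes, number of responses), copied from the slide.
RESPONSES: Dict[str, Tuple[int, int]] = {
    "Castanet": (29, 34),
    "LDA": (0, 18),
    "Subsumption": (6, 16),
    "Baseline": (25, 34),
}


def percent_yes(yes: int, total: int) -> float:
    """Share of positive answers, as a percentage rounded to one decimal."""
    return round(100 * yes / total, 1)


PERCENTAGES: Dict[str, float] = {
    name: percent_yes(yes, total) for name, (yes, total) in RESPONSES.items()
}
```

This gives roughly 85% for Castanet, 74% for the baseline, 38% for Subsumption, and 0% for LDA.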

  42. Evaluation Results • Average responses for top-level categories • 4 = no changes, 1 = change many • Average responses for 2 subcategories

  43. Needed Improvements • Take spelling variations and morphological variants into account • Use verbs and adjectives, not just nouns • Normalize noun phrases • Allow terms to have more than one sense • Improve algorithm for assigning documents to categories.

  44. Opportunities for Tagging • New opportunity: Tagging, folksonomies • (Flickr, del.icio.us) • People are creating facets in a decentralized manner • They are assigning multiple facets to items • This is done on a massive scale • This leads naturally to meaningful associations

  45. Conclusions • Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet. • The method has been tested on various domains: • medicine, recipes, math, news, arts, bibliographical records • Usability study shows: • Castanet is preferred to other state-of-the-art solutions. • Information architects want to use the tool in their work.

  46. Learn More • Funding • This work was supported in part by NSF (IIS-9984741) • For more information: • Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007 • See http://flamenco.berkeley.edu

  47. Motivation • Want to assign labels from multiple hierarchies

  48. The Problem with Hierarchy • Inflexible • Force the user to start with a particular category • What if I don’t know the animal’s diet, but the interface makes me start with that category? • Wasteful • Have to repeat combinations of categories • Makes for extra clicking and extra coding • Difficult to modify • To add a new category type, must duplicate it everywhere or change things everywhere
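
The "wasteful" point can be made concrete with a little arithmetic on the earlier animal example, which has 4 locomotion values, 3 skin coverings, and 3 diets. The sketch below counts the category nodes needed when every combination is spelled out in one hierarchy, compared with the number of labels needed when each facet is kept separate. The function names are hypothetical.

```python
from itertools import accumulate
from operator import mul
from typing import Sequence


def single_hierarchy_nodes(facet_sizes: Sequence[int]) -> int:
    """Category nodes in one tree that nests every facet under the previous one.

    Level 1 has s1 nodes, level 2 has s1*s2, level 3 has s1*s2*s3, and so on.
    """
    return sum(accumulate(facet_sizes, mul))


def faceted_labels(facet_sizes: Sequence[int]) -> int:
    """Labels needed when each facet is an independent, flat list."""
    return sum(facet_sizes)


ANIMAL_FACETS = [4, 3, 3]  # locomotion, skin covering, diet
```

For the animal example the single hierarchy needs 4 + 12 + 36 = 52 category nodes, while the faceted version needs 4 + 3 + 3 = 10 labels, and adding a new facet multiplies the former but only adds to the latter.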
