
Helping Editors Choose Better Seed Sets for Entity Set Expansion


Presentation Transcript


  1. Helping Editors Choose Better Seed Sets for Entity Set Expansion Vishnu Vyas, Patrick Pantel, Eric Crestan CIKM’09 Speaker: Hsin-Lan, Wang Date: 2010/05/10

  2. Outline • Introduction • Impact of Seed Sets • Systems • Prototype Removal • Clustering • Minimum Overlap Criterion (MOC) • Experiment • Conclusions

  3. Introduction • Collections of named entities are used in many commercial and research applications. • Semi-supervised methods (set expansion): • Pattern-based techniques • Distributional techniques

  4. Introduction • Problem: • The quality of the expansion can vary greatly depending on the nature of the concept and on the seed set. • Human editors generate widely varying seed sets, which often lead to poor expansion quality.

  5. Introduction • In this paper: • Employ a seed set expansion system to study the impact of different seed sets. • Propose several algorithms for improving the seed sets provided by human editors. • Identify three factors of seed set composition that affect expansion quality.

  6. Impact of Seed Sets • Seed Set Composition

  7. Impact of Seed Sets • Do humans generate good seed sets?

  8. Impact of Seed Sets • Factors in Seed Set Composition • Prototypicality • superordinate concept • {dog, cat} -> pets (not animal) • Ambiguity • polysemy • {mercury} -> elements and planets • Coverage • how much the seed set shares in common with the semantic space • {iron, boron, nitrogen} vs. {helium, argon, xenon}

  9. Systems • Prototype Removal • Clustering • Minimum Overlap Criterion (MOC)

  10. Prototype Removal • A prototype is a common and unambiguous instance of a concept. • Sort the seeds by their prototypicality score. • Remove the most prototypical seeds.
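
The slides do not say how the prototypicality score is computed, so the following is a minimal Python sketch under an assumption: prototypicality is approximated by a seed's average cosine similarity to the other seeds' distributional vectors, and the k most prototypical seeds are dropped (k = 3 matches the tuned setting on slide 17). The `vectors` argument and the scoring function are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def remove_prototypes(seeds, vectors, k=3):
    """Drop the k most prototypical seeds from a seed set.

    Assumption: the prototypicality of a seed is approximated by its average
    cosine similarity to the other seeds. `vectors` maps each seed to a
    NumPy array holding its distributional feature vector.
    """
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    scores = {}
    for s in seeds:
        others = [t for t in seeds if t != s]
        scores[s] = sum(cosine(vectors[s], vectors[t]) for t in others) / len(others)

    # Sort by prototypicality score (most prototypical first) and remove the top k.
    ranked = sorted(seeds, key=lambda s: scores[s], reverse=True)
    return ranked[k:]
```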

  11. Clustering • Ambiguous seed instances belong to more than one concept. • They tend to be less similar to any particular concept than their non-ambiguous counterparts.

  12. Clustering • Represent each seed by a distributional feature vector, weighted by pointwise mutual information (PMI). • Apply average-link clustering to the seeds. • Choose the tightest cluster as the candidate seed set.

  13. Clustering • PMI(w) = (pmi_w1, pmi_w2, …, pmi_wm), where pmi_wf = log( (c_wf × N) / (Σ_j c_wj × Σ_i c_if) ) • c_wf: the frequency of feature f occurring for term w • n: the number of unique terms (index i) • m: the number of unique features (index j) • N: the total number of features for all terms
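
A minimal Python sketch of slides 12–13, under assumptions: cosine distance between PMI vectors is used as the dissimilarity measure (the slides only specify PMI-weighted distributional features and average-link clustering), SciPy's hierarchical clustering stands in for the paper's implementation, k = 2 matches the tuned setting on slide 17, and the "tightest" cluster is taken to be the one with the lowest mean pairwise distance.

```python
import numpy as np
from math import log
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def pmi_vectors(counts):
    """Build PMI-weighted distributional vectors from co-occurrence counts.

    `counts` is a {term: {feature: frequency}} dict (c_wf in slide 13).
    Returns the term order and a matrix with one PMI vector per term.
    """
    terms = sorted(counts)
    features = sorted({f for c in counts.values() for f in c})
    N = sum(sum(c.values()) for c in counts.values())          # total feature count
    term_totals = {w: sum(counts[w].values()) for w in terms}  # sum_j c_wj
    feat_totals = {f: sum(counts[w].get(f, 0) for w in terms) for f in features}  # sum_i c_if

    vecs = np.zeros((len(terms), len(features)))
    for i, w in enumerate(terms):
        for j, f in enumerate(features):
            c = counts[w].get(f, 0)
            if c > 0:
                vecs[i, j] = log((c * N) / (term_totals[w] * feat_totals[f]))
    return terms, vecs

def tightest_cluster(seeds, vecs, k=2):
    """Average-link clustering of the seed vectors; return the tightest cluster.

    `vecs` holds one row per seed, in the same order as `seeds`. Tightness is
    measured here as the mean pairwise cosine distance within a cluster.
    """
    dist = pdist(vecs, metric="cosine")
    labels = fcluster(linkage(dist, method="average"), t=k, criterion="maxclust")
    square = squareform(dist)

    best, best_score = None, float("inf")
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if len(idx) < 2:
            continue
        score = np.mean([square[i, j] for i in idx for j in idx if i < j])
        if score < best_score:
            best, best_score = [seeds[i] for i in idx], score
    return best
```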

  14. Minimum Overlap Criterion • Seeds that best represent a concept C provide: • maximum information • minimum redundancy • Represent the concept with the set of features that are shared by at least two seeds in the seed set (a code sketch follows slide 15).

  15. Minimum Overlap Criterion • joint information
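
A minimal Python sketch of the representation described on slide 14: the concept is modelled by the features shared by at least two seeds. The `joint_information` helper below, which simply sums assumed per-feature PMI weights over that shared set, is a hypothetical stand-in for the joint-information formula of slide 15, which is not reproduced in this transcript.

```python
from itertools import combinations

def concept_features(seeds, feature_sets):
    """Concept representation from slide 14: the set of features that are
    shared by at least two seeds. `feature_sets` maps each seed to the set
    of its distributional features."""
    shared = set()
    for a, b in combinations(seeds, 2):
        shared |= feature_sets[a] & feature_sets[b]
    return shared

def joint_information(seeds, feature_sets, pmi_weight):
    """Hypothetical joint-information score: the total PMI weight of the
    shared features (a stand-in for the formula on slide 15).
    `pmi_weight` maps a feature to its PMI weight."""
    return sum(pmi_weight[f] for f in concept_features(seeds, feature_sets))
```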

  16. Datasets and Baseline • Selected nine lists from Wikipedia's "List of" pages that were considered complete; these were treated as the gold standard. • Three of the lists were designated as the development set. • The remaining six lists were used to test the expansion performance of the seed sets generated by the three methods.

  17. Experimental Setup • Created trial seed sets from the original seed sets provided to us by the editors: 1024 trial seed sets for each list, for a total of 9216 trials. • Training the parameters: • prototype removal: remove 3 seeds • MOC: remove 4 seeds • clustering: k = 2

  18. Experimental Results • Overall Analysis

  19. Experimental Results • Intrinsic Analysis of Prototype Removals, Clustering and MOC

  20. Experimental Results

  21. Experimental Results • MOC's high performance: • Compared with prototype removal: MOC minimizes semantic overlap between the seeds in the seed set, whereas prototypical seeds tend to overlap semantically with almost all other seeds. • Compared with ambiguous seeds: ambiguous words do not share many highly informative distributional features with the concept.

  22. Conclusions • Showed that the composition of a seed set can significantly affect the performance of set expansion. • Showed that an average editor does not produce seed sets that result in high-quality expansions.

  23. Conclusions • Identified three important factors in seed set composition: prototypicality, ambiguity and coverage. • Proposed three algorithms, each one tackling a different factor affecting seed set composition.
