Helping Editors Choose Better Seed Sets for Entity Set Expansion

Vishnu Vyas, Patrick Pantel, Eric Crestan • CIKM’09 • Speaker: Hsin-Lan, Wang • Date: 2010/05/10

Outline • Introduction • Impact of Seed Sets • Systems • Prototype Removal • Clustering • Minimum Overlap Criterion (MOC) • Experiment • Conclusions

Introduction • Collections of named entities are used in many commercial and research applications. • Semi-supervised methods (set expansion): • Pattern based techniques • Distributional techniques

Introduction • Problem: • The quality of the expansion can vary greatly depending on the nature of the concept and on the seed set. • Seed sets written by human editors vary widely and often yield poor expansion quality.

Introduction • In this paper: • Employ a seed set expansion system to study the impact of different seed sets. • Identify three factors of seed set composition that affect the expansion quality. • Propose several algorithms for improving the seed sets produced by human editors.

Impact of Seed Sets • Seed Set Composition

Impact of Seed Sets • Do humans generate good seed sets?

Impact of Seed Sets • Factors in Seed Set Composition • Prototypicality • superordinate concept • {dog, cat} -> pets (not animal) • Ambiguity • polysemy • {mercury} -> elements and planets • Coverage • how much of the concept's semantic space the seed set covers • {iron, boron, nitrogen} vs. {helium, argon, xenon}

Systems • Prototype Removal • Clustering • Minimum Overlap Criterion (MOC)

Prototype Removal • A prototype is a common and unambiguous instance of a concept. • sort: rank the seeds by prototypicality score • remove: drop the most prototypical seeds
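
The sort-and-remove step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how the prototypicality score is computed, so the `scores` dictionary and its values are hypothetical inputs, and `remove_prototypes` is a name chosen here for the example.

```python
def remove_prototypes(scores, k):
    """Drop the k seeds with the highest prototypicality score.

    scores: dict mapping seed -> prototypicality score (hypothetical values;
            the scoring function itself is not specified on the slide).
    Returns (kept, removed), with kept in the original input order.
    """
    # Sort seeds from most to least prototypical.
    ranked = sorted(scores, key=scores.get, reverse=True)
    removed = ranked[:k]
    kept = [seed for seed in scores if seed not in removed]
    return kept, removed


# Hypothetical scores for a "pets" seed set.
example_scores = {"dog": 0.9, "cat": 0.8, "ferret": 0.3, "iguana": 0.2}
kept, removed = remove_prototypes(example_scores, k=2)
print(kept, removed)
```

With these made-up scores, the two most prototypical seeds (`dog`, `cat`) are removed and the less typical ones remain.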

Clustering • Ambiguous seed instances belong to more than one concept. • They tend to be less similar to any particular concept than their non-ambiguous counterparts.

Clustering • distributional feature vector • weight: point-wise mutual information • average-link clustering • Choose the tightest cluster as the candidate seed set.
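
A rough sketch of the clustering step, under several assumptions that are not stated on the slide: similarity between feature vectors is taken to be cosine similarity, "tightest" is taken to mean the highest mean pairwise similarity inside a cluster, and singleton clusters are given a tightness of 0. The feature weights in the example are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_link(a, b, vectors):
    """Mean similarity over all cross-cluster pairs (average linkage)."""
    total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(vectors, k):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[term] for term in vectors]
    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # i < j, so index i is unaffected
    return clusters


def tightness(members, vectors):
    """Mean pairwise similarity inside a cluster (assumed definition)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # assumption: a singleton is never the tightest cluster
    return sum(cosine(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)


def tightest_cluster(vectors, k=2):
    """Return the cluster with the highest internal similarity."""
    return max(cluster(vectors, k), key=lambda c: tightness(c, vectors))


# Invented feature weights: three noble gases and one ambiguous seed.
seed_vectors = {
    "helium": {"gas": 2.0, "noble": 1.5},
    "argon": {"gas": 1.8, "noble": 1.7},
    "xenon": {"gas": 1.5, "noble": 2.0},
    "mercury": {"planet": 2.0, "metal": 1.0},
}
print(sorted(tightest_cluster(seed_vectors, k=2)))
```

In this toy input the ambiguous seed `mercury` shares no features with the others, so it ends up alone and the three gases form the candidate seed set.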

Clustering • PMI(w) = (pmi_{w1}, pmi_{w2}, …, pmi_{wm}) • c_{wf}: the frequency of feature f occurring for term w • n: the number of unique terms • N: the total number of features for all terms
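
For reference, one standard definition of each component that is consistent with the symbols c_{wf}, n, and N listed on this slide is given below. It is an assumption that this is the exact form the authors use; m is taken here to be the number of features, matching the length of the vector PMI(w).

```latex
\mathrm{pmi}_{wf} = \log
  \frac{\dfrac{c_{wf}}{N}}
       {\dfrac{\sum_{i=1}^{n} c_{if}}{N} \times \dfrac{\sum_{j=1}^{m} c_{wj}}{N}},
\qquad
N = \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij}
```

The numerator is the joint probability of term w and feature f; the denominator is the product of the marginal probability of f (summed over all n terms) and the marginal probability of w (summed over all m features).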

Minimum Overlap Criterion • Seeds that best represent a concept C provide: • maximum information • minimum redundancy • Represent the concept with the set of features that are shared by at least two seeds in the seed set.
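
The concept representation described above (features shared by at least two seeds) can be sketched as below. This covers only that representation step, not the full MOC seed-selection procedure; the function name `shared_features` and the example features are hypothetical.

```python
from collections import Counter


def shared_features(seed_features, min_seeds=2):
    """Return the features that occur for at least `min_seeds` seeds.

    seed_features: dict mapping seed -> iterable of feature names
                   (hypothetical input format).
    """
    # Count each feature once per seed, even if it is listed repeatedly.
    counts = Counter(
        feature for feats in seed_features.values() for feature in set(feats)
    )
    return {feature for feature, count in counts.items() if count >= min_seeds}


# Invented features for a chemical-elements seed set.
seeds = {
    "iron": ["metal", "element", "ore"],
    "boron": ["element", "metalloid"],
    "mercury": ["element", "metal", "planet"],
}
print(sorted(shared_features(seeds)))
```

Features that belong to only one seed, such as `planet` for the ambiguous seed `mercury`, drop out of the representation, while features common to several seeds remain.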

Minimum Overlap Criterion • joint information

Datasets and Baseline • Selected nine lists from Wikipedia’s “List of” pages that were considered complete; these were treated as the gold standard. • Three of the lists were designated as the development set. • The remaining six lists were used to test the expansion performance of the seed sets generated by the three methods.

Experimental Setup • Created trial seed sets from the original seed sets provided by the editors. • 1024 trial seed sets for each list • total of 9216 trials • Trained parameters: • prototype removal: remove 3 seeds • MOC: remove 4 seeds • clustering: k=2

Experimental Results • Overall Analysis

Experimental Results • Intrinsic Analysis of Prototype Removals, Clustering and MOC

Experimental Results

Experimental Results • MOC’s high performance: • With respect to prototypes: • MOC minimizes semantic overlap among the seeds in a seed set. • Prototypical seeds tend to overlap semantically with almost all other seeds in a seed set. • With respect to ambiguous words: • Ambiguous words do not share many highly informative distributional features with the concept.

Conclusions • Showed that the composition of seed sets can significantly affect the performance of set expansion. • Showed that an average editor does not produce seed sets that result in high-quality expansions.

Conclusions • Identified three important factors in seed set composition: prototypicality, ambiguity, and coverage. • Proposed three algorithms, each one tackling a different factor affecting seed set composition.
