An Integrated Approach for Relation Extraction from Wikipedia Texts
Yulan Yan, Yutaka Matsuo, Mitsuru Ishizuka
The University of Tokyo
WWW 2009
Abstract
• Relation extraction from Wikipedia texts
  • A novel distance function
  • A linear clustering algorithm
• Wikipedia texts
  • High-quality texts
  • Heavily cross-linked articles
  • Sentence -> dependency tree
• Web texts
  • Frequency information
  • Relation terms
  • Sentence -> surface pattern
• Experiments on two different domains
  • American chief executives
  • Companies
Problem definition
• Relation extraction between
  • the entitled concept (ec) of an article and
  • one of its related concepts (rc)
• There is a salient semantic relation r between p and p’ l(p)
Problem definition
• Concept pairs: (Eric E. Schmidt, Google), (Eric E. Schmidt, Compiler), (Eric E. Schmidt, Atherton, California), …, (Bill Gates, Microsoft), …
• Clustering
• Evaluation
Overview of the Approach
• Text preprocessor
  • Concept pair collection
  • Sentence filtering
• Web context collector
  • A set of ranked relational terms
  • A set of surface patterns
• Dependency pattern modeling
  • Linguistic information
• Linear clustering algorithm
  • Local clustering
  • Global clustering
1. Text Preprocessor - Relation Candidate Generation
• Process Wikipedia article texts to get
  • relation candidates
  • the corresponding sentences
• All hyperlinked concepts in the article are treated as related concepts, which may share a semantic relationship with the entitled concept
  • Concept pairs
• Apply a linguistic parser to split the article text into sentences
  • for the dependency pattern modeling module
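The candidate generation step above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessor: it assumes raw wikitext with `[[Target]]` / `[[Target|anchor]]` links as input, and the names `concept_pairs` and `split_sentences` are hypothetical. The regex-based sentence splitter is only a stand-in for the linguistic parser mentioned on the slide and will mis-split abbreviations.

```python
import re

# Matches [[Target]], [[Target#Section]] and [[Target|anchor text]];
# only the target title is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def concept_pairs(title, wikitext):
    """Pair the entitled concept with each distinct hyperlinked concept."""
    related = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target and target != title and target not in related:
            related.append(target)
    return [(title, rc) for rc in related]


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Toy input written for this sketch.
text = "Schmidt is the CEO of [[Google]]. He lives in [[Atherton, California|Atherton]]."
pairs = concept_pairs("Eric E. Schmidt", text)
sentences = split_sentences(text)
```

Each pair can then be matched against the sentences that contain its link, which is one plausible reading of the "sentence filtering" step in the overview.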
2. Web Context Collection
• Querying with a concept pair
• Hypothesis
  • The Web contains key terms and patterns that provide clues to the relation the concept pair holds
• Two kinds of relational information
  • a set of ranked relational terms as keywords
  • a set of surface patterns
2. Web Context Collection - Relational Term Ranking (1/2)
• Collect relational terms as indicators for each concept pair
  • Verbs, nouns
  • Such as “CEO”, “founder”
• Entropy-based feature ranking algorithm
  • Chen et al., 2005 (IJCNLP)
• After the ranking
  • The relational term list Tcp is sorted by rank
  • A keyword kcp is selected from the terms that appear in both the term list Tcp and the corresponding Wikipedia sentence
Entropy-based Feature Ranking
• J. Chen, D. Ji, C.L. Tan, and Z. Niu. 2005. Unsupervised Feature Selection for Relation Extraction. In Proceedings of IJCNLP-2005.
• Local context vectors of co-occurrences of entity pair E1 and E2
  • P = {p1, p2, …, pN}
• The words occurring in P
  • W = {w1, w2, …, wM}
• Goal: select a subset of important features from W
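The slide does not give the ranking formula, so the following is only a sketch of one common way to rank features by entropy without labels. It assumes a similarity-based entropy over pairwise distances between the context vectors in P, and it scores each word in W by the entropy of the data after that word's dimension is removed (a higher entropy after removal is read as a more important word). The function names and the toy data are hypothetical and are not taken from Chen et al.

```python
import math


def entropy(vectors):
    """Similarity-based entropy of a set of vectors (lower means more clustered)."""
    n = len(vectors)
    dists = [math.dist(vectors[i], vectors[j])
             for i in range(n) for j in range(i + 1, n)]
    mean = sum(dists) / len(dists) if dists else 0.0
    if mean == 0.0:
        return 0.0
    alpha = math.log(2) / mean  # similarity equals 0.5 at the mean distance
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        if 0.0 < s < 1.0:  # s == 1 contributes zero entropy
            total -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return total


def rank_features(vectors, words):
    """Order words by the entropy of the data once each word is removed."""
    scores = {}
    for k, word in enumerate(words):
        reduced = [v[:k] + v[k + 1:] for v in vectors]
        scores[word] = entropy(reduced)
    return sorted(words, key=lambda w: scores[w], reverse=True)


# Toy context vectors: the first dimension separates two groups, the second is noise.
P = [(0, 0.5), (0, 0.4), (1, 0.5), (1, 0.6)]
W = ["ceo", "the"]
ranking = rank_features(P, W)
```

In this toy case the discriminative word is ranked first, because removing it leaves only the noisy dimension and the data becomes less structured.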
2. Web Context Collection - Surface Pattern Generation (2/2)
• Content words (CWs)
  • ec (entitled concept), rc (related concept), keyword kcp
• Function words
• Bag of words: look for verbs, nouns, and coordinating conjunctions
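A surface pattern built from content words and function words might be generated along these lines. This is a sketch under stated assumptions: the function-word list, the `*` wildcard for all other words, and the collapsing of consecutive wildcards are choices made for this illustration, and `surface_pattern` is a hypothetical name.

```python
# Small illustrative function-word list (assumption, not from the slides).
FUNCTION_WORDS = {"of", "the", "is", "at", "in", "and", "a", "by", "for"}


def surface_pattern(sentence, ec, rc, keyword, function_words=FUNCTION_WORDS):
    """Turn a sentence into a pattern over EC, RC, the keyword and function words."""
    # Replace the two concepts with placeholders before tokenizing.
    text = sentence.replace(ec, " EC ").replace(rc, " RC ")
    pattern = []
    for token in text.split():
        if token in ("EC", "RC"):
            pattern.append(token)
            continue
        word = token.strip(".,;:!?").lower()
        if not word:  # token was pure punctuation
            continue
        if word == keyword.lower() or word in function_words:
            pattern.append(word)
        elif not pattern or pattern[-1] != "*":
            pattern.append("*")  # any other word becomes a single wildcard
    return " ".join(pattern)


sp = surface_pattern("Eric E. Schmidt is the CEO of Google.",
                     ec="Eric E. Schmidt", rc="Google", keyword="CEO")
```

Here `sp` is `"EC is the ceo of RC"`. Sentences retrieved from the Web for the same concept pair would yield a set of such patterns.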
3. Dependency Pattern Modeling
• Dependency patterns for relation clustering
  • selected sentences
  • each containing the entitled concept and one of the related concepts
  • parsed into dependency structures
• R. Bunescu and R. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.
• M. Zhang, J. Zhang, J. Su and G. Zhou. 2006. A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features. In Proceedings of ACL-2006.
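One way to obtain a dependency pattern between the two concepts, in the spirit of the cited shortest-path work, is to take the shortest path between them in the dependency tree. The slides do not state that the authors use exactly this, so treat the following as an assumption. The edge list is hand-written for illustration; in practice it would come from a dependency parser.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over an undirected view of the dependency tree."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None  # the two words are not connected


# Hand-written (head, dependent) edges for "Schmidt is the CEO of Google".
edges = [("CEO", "Schmidt"), ("CEO", "is"), ("CEO", "the"),
         ("CEO", "of"), ("of", "Google")]
path = shortest_path(edges, "Schmidt", "Google")
```

For this toy tree the path is `["Schmidt", "CEO", "of", "Google"]`, which drops the words "is" and "the" that do not lie between the two concepts.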
4. Linear Clustering Algorithm - Distance Function & Centroid Selection (1/2)
• All concept pairs are grouped by their keywords tcp
• Let G = {G1, G2, …, Gn}
  • The concept pairs in Gi = {cpi1, cpi2, …} share the same keyword tcp
• A centroid ci is selected for each group Gi
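The grouping and centroid selection can be illustrated as below. The slide does not say how the centroid is chosen, so this sketch assumes a medoid (the member with the smallest total distance to the rest of its group) and uses a Jaccard distance over toy pattern sets as a stand-in for the authors' distance function. All names and data here are hypothetical.

```python
from collections import defaultdict


def group_by_keyword(keyword_of):
    """Group concept pairs that share the same keyword."""
    groups = defaultdict(list)
    for pair, keyword in keyword_of.items():
        groups[keyword].append(pair)
    return dict(groups)


def jaccard_distance(a, b):
    """Stand-in distance between two pattern sets (assumption)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def select_centroid(group, patterns):
    """Medoid: the pair whose patterns are closest to all others in the group."""
    return min(group, key=lambda c: sum(
        jaccard_distance(patterns[c], patterns[o]) for o in group))


# Toy data: keyword and pattern set per concept pair.
keyword_of = {("A", "X"): "ceo", ("B", "Y"): "ceo", ("C", "Z"): "ceo",
              ("D", "W"): "founder"}
patterns = {("A", "X"): {"p1", "p2"}, ("B", "Y"): {"p1", "p2", "p3"},
            ("C", "Z"): {"p3"}, ("D", "W"): {"p4"}}

G = group_by_keyword(keyword_of)
centroid = select_centroid(G["ceo"], patterns)
```

With this toy data the "ceo" group's centroid is `("B", "Y")`, since its pattern set overlaps with both of the other members.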
4. Linear Clustering Algorithm - Distance Function & Centroid Selection (2/2)
• Cost function cost(sp1i, sp2j)
• B. Rosenfeld and R. Feldman. 2006. URES: an Unsupervised Web Relation Extraction System. In Proceedings of COLING/ACL-2006.
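The formula for the cost between two surface patterns did not survive on this slide. As a rough illustration only, a cost of this kind can be computed as a token-level edit distance with unit costs; this is an assumption and may differ from the cost actually used by the authors or in the cited URES system. The name `pattern_cost` is hypothetical.

```python
def pattern_cost(sp1, sp2):
    """Token-level edit distance between two surface patterns (unit costs)."""
    a, b = sp1.split(), sp2.split()
    previous = list(range(len(b) + 1))  # cost of building b[:j] from nothing
    for i, token_a in enumerate(a, 1):
        current = [i]  # cost of deleting the first i tokens of a
        for j, token_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # delete token_a
                current[j - 1] + 1,                     # insert token_b
                previous[j - 1] + (token_a != token_b)  # match or substitute
            ))
        previous = current
    return previous[-1]


cost = pattern_cost("EC is the ceo of RC", "EC is ceo of RC")
```

Here `cost` is 1, because the two patterns differ by a single token. A distance between two concept pairs could then aggregate such costs over their pattern sets.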
4. Linear Clustering Algorithm - Local Dependency Pattern Clustering
4. Linear Clustering Algorithm - Global Surface Pattern Clustering
Experiments
• Wikipedia dump of 03/12/2008
• Two categories
  • American chief executives
    • 526 articles, 7310 concept pairs
    • 1/3, 1/3 for Dl and Dg, 18 groups
  • Companies
    • 434 articles, 4935 concept pairs
    • 1/3, 1/3 for Dl and Dg, 28 groups
• Compared with
  • B. Rosenfeld and R. Feldman. 2007. Clustering for Unsupervised Relation Identification. In Proceedings of CIKM-2007.
  • surface feature
Conclusions
• A novel distance function
• A linear clustering algorithm
• Combination of two kinds of patterns
  • Dependency patterns
  • Surface patterns
References
• J. Chen, D. Ji, C.L. Tan, and Z. Niu. 2005. Unsupervised Feature Selection for Relation Extraction. In Proceedings of IJCNLP-2005.
• R. Bunescu and R. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.
• M. Zhang, J. Zhang, J. Su and G. Zhou. 2006. A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features. In Proceedings of ACL-2006.
• B. Rosenfeld and R. Feldman. 2006. URES: an Unsupervised Web Relation Extraction System. In Proceedings of COLING/ACL-2006.