The Problem
• Finding information about people in huge text collections or in online repositories on the Web is a common activity
• Person names, however, are highly ambiguous: in the US, 90,000 different names are shared by 100 million people
• Cross-document coreference resolution is the task of deciding whether two mentions of the same (or a similar) name in different sources refer to the same individual
• Solving this problem is important not only for better access to information but also in practical applications
SemEval 2007 Web People Search Task
• A search engine user types in a person name as a query
• Instead of ranking the Web pages, an ideal system should organize the results into as many clusters as there are different individuals sharing the name among the returned pages
• The system receives a set of documents matching a person name and returns clusters, where each cluster refers to a single individual
SemEval 2007 Web People Search Data
• Training data (100 documents per person name)
• 10 person names from the European Conference on Digital Libraries; 7 person names from Wikipedia; 32 person names from a previous study (Gideon&Mann’03)
• Testing: 30 person names; pages returned by Yahoo!
• System output is compared to a gold standard produced by humans
Examples
[Figure: example pages; labels: STATE AGENT, SCIENTIST, ACTOR, SPORTMAN; TRAINING, TESTING]
Clustering
Given a set of documents and a threshold:
1. Initially there are as many clusters as documents
2. All clusters are compared using a similarity metric
3. At each iteration the two most similar clusters are merged if their similarity is greater than the threshold (otherwise stop and return the clusters)
4. Go to step 2
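The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the slides do not define the cluster similarity, so average-link over pairwise document similarities is assumed here, and the names `agglomerative_cluster`, `sim_d`, and `toy_sim` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Set


def agglomerative_cluster(
    n_docs: int,
    sim_d: Callable[[int, int], float],
    threshold: float,
) -> List[Set[int]]:
    """Merge the two most similar clusters until no pair exceeds the threshold."""
    # Step 1: one cluster per document (documents are identified by index).
    clusters: List[Set[int]] = [{i} for i in range(n_docs)]

    def sim_c(a: Set[int], b: Set[int]) -> float:
        # ASSUMPTION: average-link cluster similarity; the slides only say
        # that simC exists, not how it is computed.
        return sum(sim_d(i, j) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Step 2: compare all pairs of clusters and keep the most similar one.
        best_sim, (x, y) = max(
            (sim_c(clusters[x], clusters[y]), (x, y))
            for x, y in combinations(range(len(clusters)), 2)
        )
        # Step 3: merge only if the similarity is greater than the threshold.
        if best_sim <= threshold:
            break
        clusters[x] = clusters[x] | clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
        # Step 4: loop back to step 2.
    return clusters


# Toy document-similarity matrix (made-up values, for illustration only).
toy_sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
result = agglomerative_cluster(4, lambda i, j: toy_sim[i][j], threshold=0.5)
print(result)
```

With this toy matrix, documents 0 and 1 end up in one cluster and documents 2 and 3 in another. A higher threshold yields more, smaller clusters.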
Document Representation
• term frequency (tf) of term t in document d = the number of times t occurs in d
• inverted document frequency (idf) of term t in collection c = the number of documents in c containing t
• Bag-of-words approach = terms are words
  • text = (word1=w1….)
• Semantic-based approach = terms are named entities (person, location, organization, date, address)
  • text = (ne1=w1….)
• Two approaches to extract terms:
  • terms belong to the full document (full document condition)
  • terms belong to personal summaries (summary condition)
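As a rough illustration of the two counts defined above, the following sketch computes tf for one document and the per-term document count for a small collection. The function names and toy documents are hypothetical; the terms could equally be words or named-entity strings.

```python
from collections import Counter
from typing import List


def term_frequencies(terms: List[str]) -> Counter:
    """tf: how many times each term occurs in one document."""
    return Counter(terms)


def document_counts(docs: List[List[str]]) -> Counter:
    """Per-term count of documents in the collection that contain the term.

    This is the quantity the slides call idf.
    """
    counts: Counter = Counter()
    for terms in docs:
        counts.update(set(terms))  # each document counts at most once per term
    return counts


# Toy collection: each document is already reduced to a list of terms.
docs = [
    ["darpa", "mit", "darpa"],
    ["mit", "aaai"],
]
tf_doc0 = term_frequencies(docs[0])
doc_count = document_counts(docs)
print(tf_doc0, doc_count)
```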
Examples of terms
• Organization: DARPA; MIT Press; Artificial Intelligence Center; AAAI; Department of Computer Science; etc.
• Person: Douglas E. Appelt; David J. Israel; Jean-Claude Martin; etc.
• Location: Menlo Park; Las Palmas; Clearwater Beach; etc.
• Date: 1995-2007; 15 February 2007; 20:34; etc.
• Address: http://acl.ldc.upenn.edu/J/J87/; Los Angeles Area; ontherecord@foxnews.com; 105 Chamber Street; etc.
Implementation Details
• local IDF tables are computed for each set of documents
• weights are tf*log(N/idf) – N is the size of the document set
• simC is the cluster similarity; simD is the document similarity, which is the cosine metric
• the threshold is estimated over training data
  • the algorithm was run over the ECDL training data and the similarity value giving the optimal f-score was recorded for each instance
  • the threshold for testing is set to the average of the optimal thresholds (computed separately for the word-based and the semantic-based representation)
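The weighting, the cosine document similarity, and the threshold averaging can be sketched as follows. This is an illustrative reading of the bullets above, assuming natural logarithms and sparse dictionary vectors; `weight_vector`, `cosine`, `average_threshold`, and all the numbers in the example are hypothetical.

```python
import math
from collections import Counter
from statistics import mean
from typing import Dict, List


def weight_vector(terms: List[str], doc_count: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Weight each term by tf * log(N / idf), with idf as the document count."""
    tf = Counter(terms)
    return {t: f * math.log(n_docs / doc_count[t]) for t, f in tf.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (simD in the slides)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


def average_threshold(optimal_thresholds: List[float]) -> float:
    """Testing threshold = mean of the per-name optimal training thresholds."""
    return mean(optimal_thresholds)


# Toy document set for one person name (local IDF table, N = 3).
docs = [["mit", "darpa"], ["mit", "aaai"], ["mit", "darpa", "darpa"]]
doc_count = Counter(t for d in docs for t in set(d))
vecs = [weight_vector(d, doc_count, len(docs)) for d in docs]

print(cosine(vecs[0], vecs[2]))  # both documents reduce to the term "darpa"
print(cosine(vecs[0], vecs[1]))  # only "mit" is shared, and its weight is 0
print(average_threshold([0.2, 0.4]))  # made-up per-name optima
```

Note that a term occurring in every document of the set ("mit" here) gets weight log(1) = 0, so it contributes nothing to the similarity.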
[Figure: system architecture diagram; labels: WEB SEARCH ENGINE, PERSON NAME, GATE ANNIE SYSTEM, SUMMARIZATION TOOLKIT, DOCS, IDF TABLES, PERSONAL SUMMARIES, ANNOTATED DOCS, THRESHOLD, VECTOR EXTRACTOR, VECS, CLUSTERING, CLUSTERS]
NLP Components
• Use the ANNIE system, a GATE information extraction system (http://gate.ac.uk)
  • Tokeniser
  • Sentence splitter
  • Gazetteer list lookup
  • Regular expressions over annotations (JAPE)
  • Part-of-speech tagging
  • Coreference resolution
• Use the in-house Summarization Toolkit (http://www.dcs.shef.ac.uk/~saggion)
  • term frequency statistics; Vector Space Model representation; IDF table computation
  • Personal summaries
Personal Summaries
• Coreference chains are identified (in each document)
• All elements in a coreference chain containing the target person are marked
• Sentences containing a marked person name are selected for the summary
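A minimal sketch of this selection step, assuming the coreference chains have already been produced by an upstream annotator. The `Mention` structure and `personal_summary` function are hypothetical and do not reflect the GATE or Summarization Toolkit APIs. Treating every mention in a matching chain as "marked" is one reading of the bullets above.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    sentence: int  # index of the sentence that contains the mention
    text: str      # surface form of the mention


def personal_summary(
    sentences: List[str],
    chains: List[List[Mention]],
    target_name: str,
) -> List[str]:
    """Keep the sentences holding any mention from a chain that names the target."""
    selected = set()
    for chain in chains:
        # A chain "contains the target person" if one of its mentions
        # includes the queried name (case-insensitive substring match).
        if any(target_name.lower() in m.text.lower() for m in chain):
            selected.update(m.sentence for m in chain)
    return [sentences[i] for i in sorted(selected)]


# Toy document (invented text) with two pre-computed coreference chains.
sentences = [
    "John Smith joined the lab in 1995.",
    "The lab is located near the coast.",
    "Smith later moved to another department.",
]
chains = [
    [Mention(0, "John Smith"), Mention(2, "Smith")],
    [Mention(0, "the lab"), Mention(1, "The lab")],
]
summary = personal_summary(sentences, chains, "John Smith")
print(summary)
```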
SemEval Results
• 4 configurations for SemEval 2007
  • System 1 = full document & words
  • System 2 = full document & NEs (the system submitted for official evaluation)
  • System 3 = summary & words
  • System 4 = summary & NEs
• the best system obtained f-score = 0.78; our system ranked 5th out of 16 participants; all our system configurations obtained f-scores above that of the average system
The Effect of Semantic Information
• Post-SemEval experiments studied the effect of each type of information: vectors were created for each type of NE and the documents were re-clustered
[Figure: results panels FULL TEXT CONDITION and SUMMARY CONDITION]
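Building one vector per NE type might look like the sketch below. The `(type, text)` annotation format and the `vectors_by_type` name are assumptions made for illustration; raw counts are used, which could then be weighted and clustered separately for each type.

```python
from collections import Counter
from typing import Dict, List, Tuple

Annotation = Tuple[str, str]  # (NE type, surface text); hypothetical format


def vectors_by_type(annotations: List[Annotation]) -> Dict[str, Counter]:
    """Split a document's NE annotations into one term-count vector per type."""
    vectors: Dict[str, Counter] = {}
    for ne_type, text in annotations:
        vectors.setdefault(ne_type, Counter())[text] += 1
    return vectors


# Toy annotations for one document.
doc_annotations = [
    ("Organization", "DARPA"),
    ("Organization", "MIT Press"),
    ("Organization", "DARPA"),
    ("Location", "Menlo Park"),
    ("Person", "David J. Israel"),
]
per_type = vectors_by_type(doc_annotations)
print(per_type["Organization"])
```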
Conclusions
• Presented an approach to cross-document coreference based on available robust extraction and summarization technology
• The approach is largely unsupervised; it needs only some training data to set its parameters
• The system demonstrated good performance in the SemEval 2007 Web People Search Task
• Special attention should be given to the type of information used for representing vectors in order to achieve optimal performance