The Problem • Finding information about people in huge text collections or in on-line repositories on the Web is a common activity • Person names, however, are highly ambiguous: in the US, 90,000 different names are shared by 100 million people • Cross-document coreference resolution is the task of deciding whether two mentions of the same (or a similar) name in different sources refer to the same individual • Solving this problem is important not only for better access to information but also in practical applications
SemEval 2007 Web People Search Task • A search engine user types in a person name as a query • Instead of simply ranking the Web pages, an ideal system should organise the results into as many clusters as there are distinct individuals sharing that name among the returned pages • The system receives a set of documents matching a person name and returns clusters, where each cluster contains the documents referring to the same individual
SemEval 2007 Web People Search Data • Training data (100 documents per person name) • 10 person names from the European Conference on Digital Libraries (ECDL); 7 person names from Wikipedia; 32 person names from a previous study (Gideon Mann, 2003) • Testing: 30 person names; Web pages returned by Yahoo! • System output is compared against a gold standard produced by human annotators
Examples [slide figure]: sample name bearers in the training and testing data include a state agent, a scientist, an actor, and a sportsman
Clustering Given a set of documents and a threshold: • Initially, there are as many clusters as documents • All pairs of clusters are compared using a similarity metric • At each iteration, the two most similar clusters are merged if their similarity is greater than the threshold (otherwise stop and return the clusters) • Go back to the comparison step
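A minimal sketch of this agglomerative procedure in Python, assuming a cluster-level `similarity` function is supplied; the function and variable names are illustrative, not taken from the original system:

```python
def agglomerative_cluster(docs, similarity, threshold):
    """Greedy agglomerative clustering with a stopping threshold.

    `docs` is a list of document representations; `similarity` compares
    two clusters (lists of documents) and returns a number.
    """
    # Step 1: start with one singleton cluster per document.
    clusters = [[d] for d in docs]
    while len(clusters) > 1:
        # Step 2: compare all pairs of clusters and find the most similar pair.
        best_pair, best_sim = None, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = similarity(clusters[i], clusters[j])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        # Step 3: stop when even the closest pair falls below the threshold.
        if best_sim <= threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```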
Document Representation • term frequency (tf) of term t in document d = the number of times t occurs in d • inverse document frequency (idf) of term t in collection c, derived from the number of documents in c that contain t (the document frequency, df) • Bag-of-words approach = terms are words • text = (word1 = w1, …) • Semantic-based approach = terms are named entities (person, location, organization, date, address) • text = (ne1 = w1, …) • Two approaches to extract terms: • terms are taken from the full document (full-document condition) • terms are taken from personal summaries (summary condition)
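As an illustration of the two representations, the sketch below builds a raw term-frequency vector either from word tokens or from named-entity strings; the entity list is assumed to come from the information extraction step, and the whitespace tokenisation is a simplification, not the original preprocessing:

```python
from collections import Counter

def term_vector(text, entities=None, use_entities=False):
    """Raw term-frequency vector (term -> tf) for one document.

    Bag-of-words condition: terms are lowercased word tokens.
    Semantic condition: terms are named-entity strings (person, location,
    organization, date, address) supplied by the caller, e.g. from the IE step.
    """
    if use_entities:
        terms = entities or []           # e.g. ["MIT Press", "Menlo Park", ...]
    else:
        terms = text.lower().split()     # naive whitespace tokenisation
    return Counter(terms)
```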
Examples of terms Organization: DARPA; MIT Press; Artificial Intelligence Center; AAAI; Department of Computer Science; etc. Person: Douglas E. Appelt; David J. Israel; Jean-Claude Martin; etc. Location: Menlo Park; Las Palmas; Clearwater Beach; etc. Date: 1995-2007; 15 February 2007; 20:34; etc. Address: http://acl.ldc.upenn.edu/J/J87/; Los Angeles Area; ontherecord@foxnews.com; 105 Chamber Street; etc.
Implementation Details • local idf tables are computed for each set of documents • weights are tf * log(N/df), where N is the size of the document set and df is the number of documents containing the term • simC is the cluster similarity; simD is the document similarity, computed with the cosine metric • the threshold is estimated over the training data • the algorithm was run over the ECDL training data and the similarity value giving the optimal F-score was recorded for each instance • the threshold for testing is set to the average of these optimal thresholds (one for the word-based representation and one for the semantic-based representation)
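A sketch of the weighting and document similarity described above, assuming one raw tf vector per document; it follows the tf * log(N/df) scheme and the cosine metric, but is not the original implementation:

```python
import math
from collections import Counter

def weight_vectors(tf_vectors):
    """Turn raw tf vectors into tf * log(N/df) weighted vectors, where N is
    the size of the (local) document set and df is the number of documents
    in that set containing the term."""
    n_docs = len(tf_vectors)
    df = Counter()
    for vec in tf_vectors:
        df.update(vec.keys())
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in vec.items()}
            for vec in tf_vectors]

def cosine(u, v):
    """Cosine similarity (simD) between two sparse weighted vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

The cluster similarity simC could then be, for instance, the average or maximum pairwise cosine between the two clusters' documents; the slides do not specify which aggregation was used.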
System architecture [slide diagram]: a Web search engine retrieves documents for the person name; the GATE ANNIE system produces annotated documents; the Summarization Toolkit builds IDF tables and personal summaries; a vector extractor turns the annotated documents into vectors; clustering with a threshold produces the output clusters
NLP Components • ANNIE, the information extraction system distributed with GATE (http://gate.ac.uk): • Tokeniser • Sentence splitter • Gazetteer list lookup • Regular expressions over annotations (JAPE) • Part-of-speech tagging • Coreference resolution • In-house Summarization Toolkit (http://www.dcs.shef.ac.uk/~saggion): • term frequency statistics; Vector Space Model representation; IDF table computation • personal summaries
Personal Summaries • Coreference chains are identified in each document • All elements of any coreference chain containing the target person name are marked • Sentences containing a marked mention are selected for the summary
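A simplified sketch of this summary-selection step, assuming the coreference step yields the set of mention strings belonging to the target chain; the actual system presumably works on annotation offsets rather than on the substring matching used here:

```python
def personal_summary(sentences, chain_mentions):
    """Select the sentences that contain a mention from the target person's
    coreference chain and concatenate them into a personal summary.

    `chain_mentions` is the set of surface strings in the chain,
    e.g. {"John Smith", "Smith"}.
    """
    selected = [s for s in sentences
                if any(m in s for m in chain_mentions)]
    return " ".join(selected)
```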
SemEval Results • 4 configurations for SemEval 2007 • System 1 = full document & words • System 2 = full document & NEs (the system submitted for official evaluation) • System 3 = summary & words • System 4 = summary & NEs • the best participating system obtained an F-score of 0.78; our system ranked 5th out of 16 participants; all of our configurations obtained an F-score above the average system
The Effect of Semantic Information • Post-SemEval experiments studied the effect of each type of information: vectors were created for each type of named entity and the documents were re-clustered • Results are reported for the full-text condition and the summary condition [slide tables]
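One way to run such a per-entity-type study, assuming the annotations are available as (string, type) pairs, is to restrict the vector to a single type and re-run the same clustering; this helper is hypothetical, not the original code:

```python
from collections import Counter

def entity_type_vector(annotations, entity_type):
    """Term-frequency vector restricted to one entity type,
    e.g. "Person", "Organization", "Location", "Date" or "Address".

    `annotations` is assumed to be a list of (entity_string, entity_type)
    pairs produced by the information extraction step."""
    return Counter(s for s, t in annotations if t == entity_type)
```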
Conclusions • Presented an approach to cross-document coreference based on available, robust extraction and summarization technology • The approach is largely unsupervised: some training data is needed only to set parameters (the clustering threshold) • The system demonstrated good performance in the SemEval 2007 Web People Search Task • Special attention should be given to the type of information used to represent the vectors in order to achieve optimal performance