The Problem


1. The Problem
• Finding information about people in huge text collections or in on-line repositories on the Web is a common activity
• Person names, however, are highly ambiguous -- in the US, 90,000 different names are shared by 100 million people
• Cross-document coreference resolution is the task of identifying whether two mentions of the same (or a similar) name in different sources refer to the same individual
• Solving this problem is important not only for better access to information but also in practical applications

2. SemEval 2007 Web People Search Task
• A search engine user types in a person name as a query
• Instead of ranking the Web pages, an ideal system should organize the results into as many clusters as there are different individuals sharing the name among the returned pages
• Systems receive a set of documents matching a person name and return clusters, each cluster grouping the documents that refer to one individual

3. SemEval 2007 Web People Search Data
• Training data (100 documents per person name)
• 10 person names from the European Conference on Digital Libraries; 7 person names from Wikipedia; 32 person names from a previous study (Gideon Mann, 2003)
• Testing: 30 person names; pages returned by Yahoo!
• System outputs compared to a gold standard produced by human annotators

4. Examples
[Slide figure: example result pages from the training and testing data, grouped by individual and labelled by type: state agent, scientist, actor, sportsman]

  5. Metrics
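The body of this slide is not preserved in the transcript. For reference (an assumption based on the official WePS-1 evaluation, not recovered from the slide), the task scored clusterings with purity, inverse purity, and their harmonic mean, where C_i are system clusters, L_j gold-standard clusters, and n the number of documents:

```latex
\mathrm{Purity} = \sum_{i} \frac{|C_i|}{n}\,\max_{j}\frac{|C_i \cap L_j|}{|C_i|}
\qquad
\mathrm{Inverse\ Purity} = \sum_{j} \frac{|L_j|}{n}\,\max_{i}\frac{|L_j \cap C_i|}{|L_j|}
\qquad
F_{\alpha=0.5} = \frac{1}{0.5/\mathrm{Purity} + 0.5/\mathrm{Inverse\ Purity}}
```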

6. Clustering
Given a set of documents and a threshold:
• Initially there are as many clusters as documents
• All cluster pairs are compared using a similarity metric
• At each iteration the two most similar clusters are merged if their similarity is greater than the threshold (otherwise stop and return the clusters)
• Go to step 2 (a sketch of this loop follows)
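A minimal Python sketch of this agglomerative loop; the names and the `sim` function are illustrative, not the authors' code (the slides leave the cluster similarity simC open):

```python
def cluster(docs, sim, threshold):
    clusters = [[d] for d in docs]           # step 1: one cluster per document
    while len(clusters) > 1:
        # step 2: compare all cluster pairs with the similarity metric
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sim(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        # step 3: merge the most similar pair, or stop below the threshold
        if best <= threshold:
            break
        i, j = pair
        clusters[i].extend(clusters.pop(j))  # step 4: go back to step 2
    return clusters
```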

7. Document Representation
• term frequency (tf) of term t in document d = the number of times t occurs in d
• document frequency (df) of term t in collection c = the number of documents in c containing t (its inverse is used for weighting; the slides call the resulting tables "IDF tables")
• Bag-of-words approach = terms are words
• text = (word1=w1, …)
• Semantic-based approach = terms are named entities (person, location, organization, date, address)
• text = (ne1=w1, …)
• Two approaches to extract terms (a sketch of both representations follows):
• terms belong to the full document (full-document condition)
• terms belong to personal summaries (summary condition)
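A sketch of the two representations, assuming each document arrives as raw text plus a list of (type, string) named-entity annotations -- a hypothetical structure standing in for the ANNIE output of slide 11:

```python
from collections import Counter

def word_vector(text):
    # bag-of-words condition: terms are lower-cased word tokens
    return Counter(text.lower().split())

def ne_vector(entities):
    # semantic condition: terms are named entities, keyed by type and string,
    # e.g. [("Organization", "MIT Press"), ("Location", "Menlo Park")]
    return Counter((t, s.lower()) for t, s in entities)
```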

8. Examples of terms
Organization: DARPA; MIT Press; Artificial Intelligence Center; AAAI; Department of Computer Science; etc.
Person: Douglas E. Appelt; David J. Israel; Jean-Claude Martin; etc.
Location: Menlo Park; Las Palmas; Clearwater Beach; etc.
Date: 1995-2007; 15 February 2007; 20:34; etc.
Address: http://acl.ldc.upenn.edu/J/J87/; Los Angeles Area; ontherecord@foxnews.com; 105 Chamber Street; etc.

9. Implementation Details
• local document-frequency ("IDF") tables are computed for each set of documents
• weights are tf * log(N/df) -- N is the size of the document set
• simC is the cluster similarity; simD is the document similarity, which is the cosine metric
• threshold estimated over training data:
• the algorithm was run over the ECDL training data and the similarity value yielding the optimal f-score was recorded for each instance
• the threshold for testing is set to the average of the optimal thresholds (one for the word-based representation, one for the semantic-based representation)
A sketch of the weighting and similarity computation follows.
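Illustrative Python code for these steps (not the authors' implementation; it reuses the term-frequency Counters from the earlier sketch):

```python
import math
from collections import Counter

def weight(tf_vectors):
    """tf_vectors: list of term-frequency Counters, one per document in the set."""
    N = len(tf_vectors)
    df = Counter(t for v in tf_vectors for t in v)  # local document-frequency ("IDF") table
    return [{t: tf * math.log(N / df[t]) for t, tf in v.items()}
            for v in tf_vectors]

def cosine(u, v):
    """simD: cosine similarity between two weighted vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def average_threshold(optimal_values):
    # testing threshold = mean of the per-name similarity values that
    # gave the optimal f-score on the ECDL training data
    return sum(optimal_values) / len(optimal_values)
```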

10. [System architecture diagram: PERSON NAME query → WEB SEARCH ENGINE → DOCS → GATE ANNIE SYSTEM → ANNOTATED DOCS → SUMMARIZATION TOOLKIT → PERSONAL SUMMARIES → VECTOR EXTRACTOR (with IDF TABLES) → VECS → CLUSTERING (with THRESHOLD) → CLUSTERS]

11. NLP Components
• Use the ANNIE system -- a GATE information extraction system (http://gate.ac.uk)
• Tokeniser
• Sentence splitter
• Gazetteer list lookup
• Regular expressions over annotations (JAPE)
• Part-of-speech tagging
• Coreference resolution
• Use the in-house Summarization Toolkit (http://www.dcs.shef.ac.uk/~saggion)
• term frequency statistics; Vector Space Model representation; IDF table computation
• Personal summaries
(An illustrative extraction sketch follows.)
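ANNIE is a Java/GATE component; purely to illustrate the same extraction step, here is a sketch using spaCy as a stand-in (spaCy is not what the system used, and the label mapping is an assumption):

```python
import spacy

# small English pipeline: tokeniser, sentence splitter, tagger, NER
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    # keep entity types roughly matching those on slide 8
    keep = {"PERSON", "ORG", "GPE", "LOC", "DATE"}
    return [(ent.label_, ent.text) for ent in doc.ents if ent.label_ in keep]
```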

  12. Analysed Document

13. Personal Summaries
• Coreference chains are identified in each document
• All elements of any coreference chain containing the target person are marked
• Sentences containing a marked person name are selected for the summary
(A sketch follows.)
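An illustrative sketch, assuming the coreference resolver returns chains as lists of (sentence_index, mention_string) pairs -- a hypothetical format, not the toolkit's actual API:

```python
def personal_summary(sentences, chains, target_name):
    # mark every sentence holding an element of a chain that
    # contains the target person name
    marked = set()
    for chain in chains:
        if any(target_name.lower() in m.lower() for _, m in chain):
            marked.update(i for i, _ in chain)
    # the summary is the sequence of sentences with a marked mention
    return [s for i, s in enumerate(sentences) if i in marked]
```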

14. SemEval Results
• 4 configurations for SemEval 2007:
• System 1 = full document & words
• System 2 = full document & NEs -- the system submitted for the official evaluation
• System 3 = summary & words
• System 4 = summary & NEs
• the best system obtained f-score = 0.78; our system ranked 5th out of 16 participants; all four of our configurations obtained an f-score above the average system

15. The Effect of Semantic Information
• Post-SemEval experiments studied the effect of each type of information: vectors were created for each type of NE and the documents were re-clustered (a sketch follows)
[Charts: per-NE-type results under the full-text condition and the summary condition]
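A sketch of the per-type vector construction, reusing the hypothetical (type, string) entity lists from the earlier sketches:

```python
from collections import Counter

def vector_for_type(entities, ne_type):
    # keep only one NE type (e.g. "Organization") in the term vector
    # before re-clustering the document set
    return Counter((t, s.lower()) for t, s in entities if t == ne_type)
```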

16. Conclusions
• Presented an approach to cross-document coreference based on available robust extraction and summarization technology
• The approach is largely unsupervised -- some training data is needed to set the parameters
• The system demonstrated good performance in the SemEval 2007 Web People Search Task
• Special attention should be given to the type of information used for representing vectors in order to achieve optimal performance
