Connecting the Docs: Integrating Information from Multiple Documents
Presentation to ASIS&T PNC Annual Meeting
Mark Wasson
Senior Architect, Research Scientist
LexisNexis New Technology Research
mark.wasson@lexisnexis.com
May 14, 2004
Talk Outline
• Introduction
• Search and retrieval, classification and indexing
• Clustering and summarization
• Extraction and aggregation
• Record linkage
• Analysis, visualization and discovery
• Closing remarks, Q&A
• References and related materials
Connecting the Docs - Mark Wasson
Introduction
What is Information Integration?
• Pull together an appropriate amount of information about some subject matter (company, person, topic, product, event, etc.) into a single information product
• Key steps
  • Target some subject matter
  • Find relevant information across all relevant sources
  • Focus on the particularly useful information
  • Connect information about the target found in different documents, sources
  • Eliminate redundant information
  • Package the information
Search and Retrieval, Classification and Indexing
Search and Retrieval
• Search basics
  • Choose sources, search tools
  • Formulate query
  • Submit search
  • Review results
  • Refine and repeat as appropriate
• The result is generally a set of documents
Search and Retrieval
• Accuracy – all over the place
  • Recall (completeness)
  • Precision (correctness)
• What impacts results?
  • What you are searching for
  • Ambiguity, synonymy, variants
  • Source size and focus
  • Search functionality
  • Search engine algorithms, coverage
  • Data annotations and enhancements
  • Searcher’s skills, knowledge of the topic
• User still must analyze search results
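The two accuracy measures named on this slide can be made concrete with a small sketch. The `precision_recall` helper and the document IDs below are hypothetical, made up for illustration; the "relevant" set stands in for whatever human judgment defines the right answers for a query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search returned
    relevant:  IDs judged relevant (illustrative ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the returned documents is correct.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant documents was found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 relevant, 3 in common.
p, r = precision_recall(["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.75 recall=0.60"
```

A search can score high on one measure and low on the other, which is one reason a single "accuracy" figure is hard to interpret.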
Google “Mark Wasson”
Google “Mark Wasson” Results
• 57 references in Top 100 (April 22, 2004)
  • About me
  • My papers
  • My pictures
  • Conference programs and attendee lists
  • Cites to my papers
  • Links to my site and pictures
• Using the retrieval results
  • Need to know a lot about me to select, connect the 57
  • Look at most to get a fairly complete profile
  • Look at more than a few to get a solid introduction (unless you turn up a really good page early on)
Categorization and Indexing
Map documents to a taxonomy of topics
• Current state of the technology
  • State of the art at 90-95% accuracy (recall, precision)
  • Many at 80-85% accuracy
  • Often designed to work with human editors
  • Academic research community skeptical
• Big commercial applications
  • Inxight/Factiva
    • Machine learning technology/editorial hybrid
  • LexisNexis SmartIndexing
    • Knowledge-based approach
  • Thomson-West CaRE (used in West km)
    • Machine learning-based approach
Categorization and Indexing Pros and Cons
• Pros
  • Creates sets of related documents
  • Higher accuracy (recall and precision)
  • With good organization and UI, can support ease of search, retrieval
• Cons
  • Coverage gaps
  • Incompatible scopes
  • Different recall, precision priorities
And you’re still dealing with documents
Clustering and Summarization
Statistical Document Clustering
• Find sets of potentially related documents
• Create a feature representation for each document
  • Words, phrases, equivalences, variants, frequencies
  • Classifications
  • Publication attributes
• Compare, score feature similarity
• Cluster most similar documents together
• You’re still working with documents
  • Select most representative documents, one or more of those closest to a cluster’s centroid
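The clustering steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the features are plain word counts (no phrases, classifications or publication attributes), similarity is cosine similarity, and the single-pass grouping with a `threshold` of 0.3 is a simple stand-in rather than the method of any system mentioned in the talk. All function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words term frequencies (a stand-in for richer features)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term-weight vector of a cluster."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def cluster(vecs, threshold=0.3):
    """Single pass: join the most similar cluster, or start a new one."""
    clusters = []  # each cluster is a list of document indexes
    for i, v in enumerate(vecs):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(v, centroid([vecs[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def representative(members, vecs):
    """Index of the cluster member closest to the cluster centroid."""
    c = centroid([vecs[j] for j in members])
    return max(members, key=lambda j: cosine(vecs[j], c))


docs = [
    "Acme Corp reports record quarterly earnings",
    "Record earnings reported by Acme Corp this quarter",
    "City council votes on new transit plan",
    "Transit plan approved by city council vote",
]
vecs = [features(d) for d in docs]
for members in cluster(vecs):
    print(members, "->", docs[representative(members, vecs)])
```

The output is still a set of documents: each cluster is represented by the member nearest its centroid, not by any new text.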
Clusters and Centroids
• Dots are documents
• Ovals are clusters
• Xs are centroids
Picture from CS5604 – Information Storage and Retrieval class notes, Ed Fox, Virginia Tech, http://ei.cs.vt.edu/~cs5604/
Google News
• Integrates information at the document level
  • Finds, retrieves, organizes, presents today’s news
  • Enough information is provided to give a nice overview
  • Links are provided for those who want the details
• Beginning to go beyond documents
  • Sub-document
    • Headlines
    • Leading sentences
    • Pictures
  • Across documents
    • Story ranking based on cluster attributes
    • Representative documents are selected
The Information Unit
• Information takes lots of forms
  • Documents
  • Paragraphs
  • Sentences
  • Sentence fragments
  • Headlines, other document components
  • Tables
  • Databases
  • Directories
  • Lists
  • Facts
  • Ideas
  • Relationships (within, across documents)
Multidocument Summarization
• Identify related documents and create a single summary that captures their highlights
  • Document classification and clustering
  • Statistical sentence analysis
  • Extract key sentences, sentence fragments
  • Recombine the extracted information
  • Natural language analysis and generation to improve readability
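The "statistical sentence analysis" and "extract key sentences" steps can be illustrated with a toy extractive summarizer. This sketch assumes the documents are already known to be related, scores each sentence by the average frequency of its words across the whole set, and skips sentences that largely repeat one already chosen. The stopword list, the `max_overlap` cutoff and the sample texts are hypothetical, and nothing here reflects how Columbia Newsblaster or any other system actually works; there is no language generation step.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for", "by", "its"}


def words(text):
    """Lowercased content words of a sentence."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(documents, max_sentences=2, max_overlap=0.6):
    """Pick high-scoring, non-redundant sentences from related documents."""
    sentences = [
        s.strip()
        for d in documents
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    summary = []
    for s in sorted(sentences, key=score, reverse=True):
        if len(summary) >= max_sentences:
            break
        ws = set(words(s))
        # Jaccard word overlap with each sentence already in the summary.
        redundant = any(
            len(ws & set(words(t))) / max(len(ws | set(words(t))), 1) > max_overlap
            for t in summary
        )
        if not redundant:
            summary.append(s)
    return summary


docs = [
    "Acme bought Widget Co. The deal closes in June.",
    "Acme bought Widget Co. Analysts praised the Acme deal.",
]
print(summarize(docs))
```

The redundancy check is what makes this a multidocument technique: the same fact reported by two sources should appear in the summary once.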
Columbia Newsblaster Daily Page
Columbia Newsblaster Summary, Links
Extraction and Aggregation
Extraction and Aggregation
• Find related pieces of information across a document collection and package those pieces into a single information product
  • Information can be spread across lots of sources
  • Information can be found in lots of formats
  • Information is not always explicitly linked
LexisNexis Company Dossiers
• Users want good information about companies
• Company information is found in numerous news, directory, financial, government, legal and other sources
  • Literally dozens of searches needed to find everything
• Company names are not always used consistently across sources
  • Need ability to create a common search key across content, e.g., normalized form of company names
• Information is presented in free text, lists, tables, databases and directory entry formats
  • Need ability to find and extract important information
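A normalized company name used as a common search key might look something like the sketch below. It is an assumption-laden illustration, not the LexisNexis approach: the suffix list is a small hypothetical sample, and the rules (lowercase, spell out "&", drop punctuation, drop a leading "the" and trailing legal-form words) would miss many real variants such as abbreviations, initialisms and former names.

```python
import re

# Hypothetical sample of legal-form suffixes; a real list would be far longer.
SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc",
}


def company_key(name):
    """Reduce a company name variant to a normalized search key."""
    name = name.lower().replace("&", " and ")
    tokens = re.findall(r"[a-z0-9]+", name)  # drops punctuation
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()  # strip trailing legal-form words
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
    return " ".join(tokens)


# Invented name variants that should share one key.
for variant in [
    "Acme Widget & Supply Co., Inc.",
    "The Acme Widget and Supply Company",
    "ACME WIDGET AND SUPPLY",
]:
    print(company_key(variant))
```

With one key per company, a single lookup can stand in for the dozens of source-specific searches the slide mentions.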
Company Dossier
Record Linkage
Record Linkage
• Record linkage techniques are used to connect related records when there is no explicit key
  • Data lacks explicit keys, such as ID numbers, normalized company names, etc.
  • Data lacks consistent features, such as unique names, presence of address or phone number, etc.
• Combine feature extraction and analysis
  • Identify, extract, normalize features as evidence
  • Compare features across records, looking for a preponderance of evidence of relatedness
  • Apply other heuristics, e.g., top-ranked, score threshold
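One way to picture "a preponderance of evidence" is a weighted tally over whatever features two records happen to share. The sketch below is an assumption, not a published method: the feature names, the weights and the rule that a conflict subtracts as much as an agreement adds are all hypothetical choices made for illustration.

```python
# Hypothetical weights: how much agreement on each feature counts as evidence.
WEIGHTS = {"last_name": 4.0, "first_name": 2.0, "city": 1.5, "phone": 3.0}


def normalize(value):
    """Lowercase and keep only letters and digits so trivial variants compare equal."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def evidence_score(rec_a, rec_b):
    """Add weight for agreeing features, subtract for conflicts, skip missing ones."""
    score = 0.0
    for feature, weight in WEIGHTS.items():
        a, b = rec_a.get(feature), rec_b.get(feature)
        if not a or not b:
            continue  # a missing feature is evidence neither for nor against
        score += weight if normalize(a) == normalize(b) else -weight
    return score


# Invented records with no shared ID and inconsistent features.
news_mention = {"last_name": "O'Neil", "first_name": "Pat", "city": "Seattle"}
directory_1 = {"last_name": "ONEIL", "first_name": "Pat", "phone": "206-555-0100"}
directory_2 = {"last_name": "Oneil", "first_name": "Chris", "city": "Tacoma"}

print(evidence_score(news_mention, directory_1))  # prints 6.0
print(evidence_score(news_mention, directory_2))  # prints 0.5
```

Skipping missing features matters because, as the slide notes, the data lacks consistent features: one record may carry a phone number and the other only a city.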
Westlaw Profiler-related Research
• Users want background information on attorneys, judges and expert witnesses
  • Information about attorneys and judges is found in case law, jury verdicts, directories, etc.
  • Information about expert witnesses is found in jury verdicts, medical publications, news, websites, etc.
• People names are problematic
  • Many people share the same name
  • Variation is common
  • But the set of attorneys and judges is somewhat defined by directories
Westlaw Profiler-related Research (cont.)
• Link judges, attorneys between case law and West Legal Directory (Dozier & Haschart, 2000)
• Case law feature extraction
  • Find critical sections within cases
  • For each attorney, attempt to extract first name, middle name, last name, name suffix, firm name, city, state
  • For each judge, attempt to extract first name, middle name, last name, name suffix, court, date
  • Package features into Template Records
• West Legal Directory feature extraction
  • Extract similar features from directory entries for judges and attorneys
  • Package features into Biography Records
Westlaw Profiler-related Research (cont.)
• Match Template Records to Biography Records
  • Attempt to match normalized features between pairs of records to create a “match probability score”
  • For a given attorney or judge Template Record, the Biography Record with the highest match probability score is likely the correct match
• Additional heuristics
  • The dates must be compatible
  • Highest match probability score must exceed threshold
  • No match is made if a tie score occurs
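The three heuristics on this slide can be read as a decision rule applied after scoring. The sketch below takes the match probability scores as given and is not the Dozier & Haschart implementation: the `Candidate` fields, the `threshold` value of 0.8 and the choice to apply the date check before looking for ties are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    biography_id: str       # hypothetical directory record identifier
    score: float            # match probability score from feature comparison
    dates_compatible: bool  # e.g. the case date fits the person's career dates


def pick_match(candidates, threshold=0.8):
    """Return the biography_id of the accepted match, or None if no match is made."""
    eligible = [c for c in candidates if c.dates_compatible]
    if not eligible:
        return None
    eligible.sort(key=lambda c: c.score, reverse=True)
    best = eligible[0]
    if best.score <= threshold:
        return None  # highest score must exceed the threshold
    if len(eligible) > 1 and eligible[1].score == best.score:
        return None  # tie for the top score: make no match
    return best.biography_id


# Invented candidates for one Template Record.
candidates = [
    Candidate("bio-17", 0.95, True),
    Candidate("bio-42", 0.97, False),  # best score, but dates rule it out
    Candidate("bio-63", 0.60, True),
]
print(pick_match(candidates))  # prints bio-17
```

Declining to match on a low score or a tie trades recall for precision, which fits a setting where a wrong link is more costly than a missing one.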
Westlaw Profiler-related Research (cont.)
• Attorney match accuracy
  • 99% precision, 92% recall
• Judge match accuracy
  • 98% precision, 90% recall
• Common causes of errors
  • Marriage-based name changes
  • Spelling errors in the data
  • Gaps in the directory, such as past positions
• See Dozier et al. (2003) for similar work with expert witness-related information
Analysis, Visualization and Discovery
From Integration to Exploration and Discovery
• Analytical, visualization and discovery tool uses
  • Summarize key information in a document set
  • Find and explain interesting facts, relationships and patterns in a document set
  • Discover previously unknown information
• Key components
  • Extract entities, co-occurrence patterns, subject-verb-object relationships
  • Coreference resolution, name variant linkage
  • Statistical analysis
  • Link analysis
  • Report generation tools
  • Data visualization tools
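Co-occurrence patterns are the simplest of the key components to sketch. The example below counts how many documents mention each pair of entities together, which yields the weighted edges a link-analysis or graph-drawing tool could start from. It assumes the entity list is already known and uses case-insensitive substring matching as a crude stand-in for entity extraction, coreference resolution and name variant linkage; the function name and sample texts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents, entities):
    """Count the documents in which each pair of entities appears together."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Substring matching only; real entity extraction is much harder.
        present = sorted(e for e in entities if e.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts


# Invented documents and entity list.
docs = [
    "Acme acquired Widget Co last year.",
    "Jane Doe, CEO of Acme, spoke at the conference.",
    "Acme and Widget Co named Jane Doe to lead the merged firm.",
]
edges = cooccurrences(docs, ["Acme", "Widget Co", "Jane Doe"])
for (a, b), n in edges.most_common():
    print(f"{a} -- {b}: {n}")
```

Each count is an edge weight between two entities; pairs that co-occur often but are not obviously related are candidates for the "previously unknown information" the slide mentions.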
Insightful’s InFact Concept Graph
Example from Insightful website
ClearForest’s ClearResearch Relations Map
Example from ClearForest website
Closing Remarks
Closing Thoughts
“We have solved the information overload problem!”
• Content has exploded
  • Web: 0 pages → 1 billion pages → 6 billion pages?
  • Subscription services: Elsevier, Factiva, LexisNexis, Westlaw, lots of others
  • Deep web: 500 times bigger than surface web
• Even if we solve retrieval, classification, indexing
  • Amount of highly relevant material often overwhelming
Closing Thoughts
• Information integration is coming (some is here!)
  • Information retrieval
  • Document categorization and indexing
  • Document clustering
  • Entity identification
  • Information extraction
  • Relationship extraction
  • Information aggregation
  • Record linkage
  • Multidocument summarization
  • Analytical tools
  • Data visualization
  • Knowledge discovery
The End
Any questions?
Mark Wasson
mark.wasson@lexisnexis.com
http://www.emarkwasson.com
(206) 728-7109
Product and service names are trademarks or registered trademarks of their holders.
References and Related Materials
• ClearForest
  • ClearForest, http://www.clearforest.com
  • ClearResearch, http://www.clearforest.com/Products/Analytics/ClearResearch.asp
• Columbia
  • Columbia Natural Language Processing Group, http://www.cs.columbia.edu/nlp/
  • Columbia Newsblaster, http://newsblaster.cs.columbia.edu/
  • Schiffman et al. (2002). Experiments in Multidocument Summarization. 2002 Human Language Technology Conference.
  • McKeown et al. (2003). Columbia's Newsblaster: New Features and Future Directions. 2003 Human Language Technology-North American Association for Computational Linguistics Conference.
• Google
  • Google, http://www.google.com
  • Google News, http://news.google.com
• Insightful
  • Insightful, http://www.insightful.com
  • Insightful InFact, http://www.insightful.com/products/infact/
• Inxight
  • Inxight, http://www.inxight.com
  • Inxight classification, http://www.inxight.com/products/smartdiscovery/
  • Hersey (2003). Factiva Reaps Benefits from Automatic Text Classification – An End User Case Study. 3rd Workshop on Operational Text Classification Systems.
• LexisNexis
  • LexisNexis, http://www.lexisnexis.com
  • LexisNexis Company Dossier, http://www.lexisnexis.com/companydossier/
  • Wasson (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Language Technology Joint Conference: ANLP-NAACL 2000.
• Thomson-West
  • Thomson-West, http://west.thomson.com
  • Westlaw Profiler, http://west.thomson.com/store/product.asp?product%5Fid=Westlaw+Profiler&catalog%5Fname=wgstore
  • Dozier & Haschart (2000). Automatic Extraction and Linking of Person Names in Legal Text. RIAO-2000.
  • Dozier et al. (2003). Creation of an Expert Witness Database Through Text Mining. 9th International Conference on Artificial Intelligence and Law.
  • Dabney et al. (2003). West km 2.0 – Classifying Document Collections with CaRE. Thomson-West white paper.