
Connecting the Docs: Integrating Information from Multiple Documents

Presentation to ASIS&T PNC Annual Meeting. Mark Wasson, Senior Architect, Research Scientist, LexisNexis New Technology Research, mark.wasson@lexisnexis.com. May 14, 2004.


Presentation Transcript


  1. Connecting the Docs: Integrating Information from Multiple Documents
      Presentation to ASIS&T PNC Annual Meeting
      Mark Wasson, Senior Architect, Research Scientist
      LexisNexis New Technology Research
      mark.wasson@lexisnexis.com
      May 14, 2004

  2. Talk Outline
      • Introduction
      • Search and retrieval, classification and indexing
      • Clustering and summarization
      • Extraction and aggregation
      • Record linkage
      • Analysis, visualization and discovery
      • Closing remarks, Q&A
      • References and related materials
      Connecting the Docs - Mark Wasson

  3. Introduction

  4. What is Information Integration?
      • Pull together an appropriate amount of information about some subject matter (company, person, topic, product, event, etc.) into a single information product
      • Key steps
        • Target some subject matter
        • Find relevant information across all relevant sources
        • Focus on the particularly useful information
        • Connect information about the target found in different documents, sources
        • Eliminate redundant information
        • Package the information

  5. Search and Retrieval, Classification and Indexing

  6. Search and Retrieval
      • Search basics
        • Choose sources, search tools
        • Formulate query
        • Submit search
        • Review results
        • Refine and repeat as appropriate
      • The result is generally a set of documents

  7. Search and Retrieval
      • Accuracy – all over the place
        • Recall (completeness)
        • Precision (correctness)
      • What impacts results?
        • What you are searching for
        • Ambiguity, synonymy, variants
        • Source size and focus
        • Search functionality
        • Search engine algorithms, coverage
        • Data annotations and enhancements
        • Searcher’s skills, knowledge of the topic
      • User still must analyze search results
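
The two accuracy measures on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the document IDs are hypothetical, and it assumes the full set of relevant documents is known, which is rarely true for a real search.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs the search returned
    relevant:  document IDs that truly answer the query (assumed known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of returned documents that are correct.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 truly relevant, 3 in common.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d1", "d2", "d3", "d7", "d9"]
precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four returned documents are relevant (precision 0.75), but two relevant documents were missed (recall 0.60).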

  8. Google “Mark Wasson”

  9. Google “Mark Wasson” Results
      • 57 references in Top 100 (April 22, 2004)
        • About me
        • My papers
        • My pictures
        • Conference programs and attendees lists
        • Cites to my papers
        • Links to my site and pictures
      • Using the retrieval results
        • Need to know a lot about me to select, connect the 57
        • Look at most to get a fairly complete profile
        • Look at more than a few to get a solid introduction (unless you turn up a really good page early on)

  10. Categorization and Indexing
      Map documents to a taxonomy of topics
      • Current state of the technology
        • State of the art at 90-95% accuracy (recall, precision)
        • Many at 80-85% accuracy
        • Often designed to work with human editors
        • Academic research community skeptical
      • Big commercial applications
        • Inxight/Factiva
          • Machine learning technology/editorial hybrid
        • LexisNexis SmartIndexing
          • Knowledge-based approach
        • Thomson-West CaRE (used in West km)
          • Machine learning-based approach

  11. Categorization and Indexing Pros and Cons
      • Pros
        • Creates sets of related documents
        • Higher accuracy (recall and precision)
        • With good organization and UI, can support ease of search, retrieval
      • Cons
        • Coverage gaps
        • Incompatible scopes
        • Different recall, precision priorities
      And you’re still dealing with documents

  12. Clustering and Summarization

  13. Statistical Document Clustering
      • Find sets of potentially related documents
        • Create a feature representation for each document
          • Words, phrases, equivalences, variants, frequencies
          • Classifications
          • Publication attributes
        • Compare, score feature similarity
        • Cluster most similar documents together
      • You’re still working with documents
        • Select most representative documents, one or more of those closest to a cluster’s centroid
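
The clustering steps above can be sketched in a few lines. This is a minimal illustration rather than the method of any product named in the talk: it assumes plain word counts as the only features, cosine similarity as the score, a simple single-pass clustering loop, and a hypothetical similarity threshold of 0.3. The sample headlines are invented.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words term frequencies (words only, a deliberate simplification)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of the term-frequency vectors in a cluster."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return Counter({term: count / len(vectors) for term, count in total.items()})


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    vectors = [features(d) for d in docs]
    clusters = []  # each cluster is a list of indexes into docs
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, centroid([vectors[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters, vectors


def representative(members, vectors):
    """Index of the member document closest to the cluster centroid."""
    center = centroid([vectors[j] for j in members])
    return max(members, key=lambda j: cosine(vectors[j], center))


docs = [
    "merger talks between the two banks continue",
    "banks continue merger talks this week",
    "storm closes airports along the coast",
    "coast airports closed as storm arrives",
]
clusters, vectors = cluster(docs)
for members in clusters:
    print(members, "representative:", representative(members, vectors))
```

The output is still a set of documents: each cluster is a list of document indexes, and the representative is simply the member nearest the centroid.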

  14. Clusters and Centroids
      • Dots are documents
      • Ovals are clusters
      • Xs are centroids
      Picture from CS5604 – Information Storage and Retrieval class notes, Ed Fox, Virginia Tech, http://ei.cs.vt.edu/~cs5604/

  15. Google News

  16. Google News
      • Integrates information at the document level
        • Finds, retrieves, organizes, presents today’s news
        • Enough info is provided to give a nice overview
        • Links are provided for those who want the details
      • Beginning to go beyond documents
        • Sub-document
          • Headlines
          • Leading sentences
          • Pictures
        • Across documents
          • Story ranking based on cluster attributes
          • Representative documents are selected

  17. The Information Unit
      • Information takes lots of forms
        • Documents
        • Paragraphs
        • Sentences
        • Sentence fragments
        • Headlines, other document components
        • Tables
        • Databases
        • Directories
        • Lists
        • Facts
        • Ideas
        • Relationships (within, across documents)

  18. Multidocument Summarization
      • Identify related documents and create a single summary that captures their highlights
        • Document classification and clustering
        • Statistical sentence analysis
        • Extract key sentences, sentence fragments
        • Recombine the extracted information
        • Natural language analysis and generation to improve readability
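
The statistical sentence analysis and extraction steps can be illustrated with a frequency-based sketch. It is not how Columbia Newsblaster or any other system mentioned here works. It assumes sentences are scored by the average corpus frequency of their words, uses a tiny illustrative stopword list, drops sentences that mostly repeat one already chosen (the 0.5 overlap limit is a hypothetical value), and skips the language generation step entirely. The sample documents are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for", "by", "as", "at"}


def words(text):
    """Lowercased content words of a sentence."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(documents, max_sentences=2, max_overlap=0.5):
    """Pick high-scoring sentences across documents, skipping redundant ones."""
    sentences = [
        s.strip()
        for d in documents
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    summary = []
    for s in sorted(sentences, key=score, reverse=True):
        s_words = set(words(s))
        # Redundant if most of its content words appear in a chosen sentence.
        redundant = any(
            len(s_words & set(words(t))) / max(len(s_words), 1) > max_overlap
            for t in summary
        )
        if not redundant:
            summary.append(s)
        if len(summary) == max_sentences:
            break
    return summary


documents = [
    "The company announced a merger on Monday. Shares rose after the merger news.",
    "A merger was announced by the company. Regulators will review the deal.",
]
summary = summarize(documents)
print(" ".join(summary))
```

The second document's restatement of the merger announcement is dropped as redundant, which is the "eliminate redundant information" step from the earlier integration slide in miniature.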

  19. Columbia Newsblaster Daily Page

  20. Columbia Newsblaster Summary, Links

  21. Extraction and Aggregation

  22. Extraction and Aggregation
      • Find related pieces of information across a document collection and package those pieces into a single information product
      • Information can be spread across lots of sources
      • Information can be found in lots of formats
      • Information is not always explicitly linked

  23. LexisNexis Company Dossiers
      • Users want good information about companies
      • Company information is found in numerous news, directory, financial, government, legal and other sources
        • Literally dozens of searches needed to find everything
      • Company names are not always used consistently across sources
        • Need ability to create a common search key across content, e.g., normalized form of company names
      • Information is presented in free text, lists, tables, databases and directory entry formats
        • Need ability to find and extract important information
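
A normalized company name used as a common search key might look like the sketch below. This is an assumption-laden illustration, not the LexisNexis implementation: the suffix list is a small hypothetical sample, the company names are invented, and real name normalization also has to handle abbreviations, subsidiaries and former names.

```python
import re

# Illustrative list of legal-form suffixes only; a production list would be far longer.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc",
}


def company_key(name):
    """Reduce a company name to a normalized key for matching across sources."""
    text = name.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9\s]", "", text)  # drop punctuation
    tokens = text.split()
    # Strip trailing legal forms such as "Co., Inc."
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    # Drop a leading article.
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
    return " ".join(tokens)


# Hypothetical variants of one company name as they might appear in different sources.
variants = ["Acme Widget Corp.", "ACME WIDGET CORPORATION", "The Acme Widget Co., Inc."]
for v in variants:
    print(repr(v), "->", repr(company_key(v)))
```

All three variants reduce to the same key, so a single lookup on that key can gather records from sources that spell the name differently.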

  24. Company Dossier

  25. Company Dossier (cont.)

  26. Company Dossier (cont.)

  27. Company Dossier (cont.)

  28. Company Dossier (cont.)

  29. Record Linkage

  30. Record Linkage
      • Record linkage techniques are used to connect related records when there is no explicit key
        • Data lacks explicit keys, such as ID numbers, normalized company names, etc.
        • Data lacks consistent features, such as unique names, presence of address or phone number, etc.
      • Combine feature extraction and analysis
        • Identify, extract, normalize features as evidence
        • Compare features across records, looking for a preponderance of evidence of relatedness
        • Apply other heuristics, e.g., top-ranked, score threshold

  31. Westlaw Profiler-related Research
      • Users want background information on attorneys, judges and expert witnesses
      • Information about attorneys and judges found in case law, jury verdicts, directories, etc.
      • Information about expert witnesses found in jury verdicts, medical publications, news, websites, etc.
      • People names are problematic
        • Many people with same names
        • Variation is common
        • But set of attorneys, judges is somewhat defined by directories.

  32. Westlaw Profiler-related Research (cont.)
      • Link judges, attorneys between case law and West Legal Directory (Dozier & Haschart, 2000)
      • Case law feature extraction
        • Find critical sections within cases
        • For each attorney, attempt to extract first name, middle name, last name, name suffix, firm name, city, state
        • For each judge, attempt to extract first name, middle name, last name, name suffix, court, date
        • Package features into Template Records
      • West Legal Directory feature extraction
        • Extract similar features from directory entries for judges and attorneys
        • Package features into Biography Records

  33. Westlaw Profiler-related Research (cont.)
      • Match Template Records to Biography Records
        • Attempt to match normalized features between pairs of records to create a “match probability score”
        • For given attorney or judge Template Record, the match to Biography Record with highest match probability score is likely correct match
      • Additional heuristics
        • The dates must be compatible
        • Highest match probability score must exceed threshold
        • No match is made if a tie score occurs
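
The matching heuristics on this slide can be sketched as follows. This is not the Dozier & Haschart model: the feature weights, the additive score and the 0.7 threshold are all hypothetical stand-ins for their match probability score, the records are invented, and the date compatibility check is left out for brevity.

```python
# Hypothetical feature weights; the cited work derives match probabilities differently.
WEIGHTS = {"last": 0.4, "first": 0.25, "middle": 0.1, "firm": 0.15, "city": 0.05, "state": 0.05}


def match_score(template, biography):
    """Sum the weights of normalized features that agree between two records."""
    score = 0.0
    for feature, weight in WEIGHTS.items():
        a, b = template.get(feature), biography.get(feature)
        if a and b and a.strip().lower() == b.strip().lower():
            score += weight
    return score


def link(template, biographies, threshold=0.7):
    """Return the best-matching Biography Record, or None if no safe match exists."""
    scored = sorted(
        ((match_score(template, bio), bio) for bio in biographies),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] <= threshold:
        return None  # the highest score must exceed the threshold
    if len(scored) > 1 and abs(scored[0][0] - scored[1][0]) < 1e-9:
        return None  # no match is made if a tie score occurs
    return scored[0][1]


# Invented records for illustration.
template = {"first": "Jane", "last": "Doe", "firm": "Doe & Roe", "city": "Seattle", "state": "WA"}
biographies = [
    {"id": "B1", "first": "Jane", "last": "Doe", "firm": "Doe & Roe", "city": "Seattle", "state": "WA"},
    {"id": "B2", "first": "John", "last": "Doe", "firm": "Other Firm", "city": "Seattle", "state": "WA"},
]
match = link(template, biographies)
print(match["id"] if match else "no match")
```

Returning no match on a tie or a low score trades recall for precision, which is consistent with the precision-heavy accuracy figures reported for this work.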

  34. Westlaw Profiler-related Research (cont.)
      • Attorney match accuracy
        • 99% precision, 92% recall
      • Judge match accuracy
        • 98% precision, 90% recall
      • Common causes of errors
        • Marriage-based name changes
        • Spelling errors in the data
        • Gaps in the directory, such as past positions
      • See Dozier et al. (2003) for similar work with expert witness-related information

  35. Analysis, Visualization and Discovery

  36. From Integration to Exploration and Discovery
      • Analytical, visualization and discovery tool uses
        • Summarize key information in a document set
        • Find and explain interesting facts, relationships and patterns in a document set
        • Discover previously unknown information
      • Key components
        • Extract entities, co-occurrence patterns, subject-verb-object relationships
        • Coreference resolution, name variant linkage
        • Statistical analysis
        • Link analysis
        • Report generation tools
        • Data visualization tools
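
Co-occurrence patterns, one of the key components listed above, reduce to counting which entities appear together. The sketch below assumes entity extraction and name variant linkage have already happened, so each document arrives as a list of canonical entity names; the names are invented. The pair counts are the kind of edge weights a link analysis or relations map would draw on.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(doc_entities):
    """Count how many documents mention each pair of entities together."""
    pairs = Counter()
    for entities in doc_entities:
        # Sort so each pair has one canonical order; set() ignores repeat mentions.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical per-document entity lists, already extracted and normalized.
doc_entities = [
    ["Acme Corp", "Jane Doe"],
    ["Acme Corp", "Jane Doe", "Widget Inc"],
    ["Widget Inc", "Acme Corp"],
]
pairs = cooccurrences(doc_entities)
for (a, b), count in pairs.most_common():
    print(f"{a} <-> {b}: {count}")
```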

  37. Insightful’s InFact Concept Graph
      Example from Insightful website

  38. ClearForest’s ClearResearch Relations Map
      Example from ClearForest website

  39. Closing Remarks

  40. Closing Thoughts
      “We have solved the information overload problem!”
      • Content has exploded
        • Web: 0 pages > 1 billion pages > 6 billion pages?
        • Subscription services: Elsevier, Factiva, LexisNexis, Westlaw, lots of others
        • Deep web: 500 times bigger than surface web
      • Even if we solve retrieval, classification, indexing
        • Amount of highly relevant material often overwhelming

  41. Closing Thoughts
      • Information integration is coming (some is here!)
        • Information retrieval
        • Document categorization and indexing
        • Document clustering
        • Entity identification
        • Information extraction
        • Relationship extraction
        • Information aggregation
        • Record linkage
        • Multidocument summarization
        • Analytical tools
        • Data visualization
        • Knowledge discovery

  42. The End
      Any questions?
      Mark Wasson
      mark.wasson@lexisnexis.com
      http://www.emarkwasson.com
      (206) 728-7109
      Product and service names are trademarks or registered trademarks of their holders.

  43. References and Related Materials

  44. References and Related Materials
      • ClearForest
        • ClearForest, http://www.clearforest.com
        • ClearResearch, http://www.clearforest.com/Products/Analytics/ClearResearch.asp
      • Columbia
        • Columbia Natural Language Processing Group, http://www.cs.columbia.edu/nlp/
        • Columbia Newsblaster, http://newsblaster.cs.columbia.edu/
        • Schiffman et al. (2002). Experiments in Multidocument Summarization. 2002 Human Language Technology Conference.
        • McKeown et al. (2003). Columbia's Newsblaster: New Features and Future Directions. 2003 Human Language Technology-North American Association for Computational Linguistics Conference.

  45. References and Related Materials
      • Google
        • Google, http://www.google.com
        • Google News, http://news.google.com
      • Insightful
        • Insightful, http://www.insightful.com
        • Insightful InFact, http://www.insightful.com/products/infact/
      • Inxight
        • Inxight, http://www.inxight.com
        • Inxight classification, http://www.inxight.com/products/smartdiscovery/
        • Hersey (2003). Factiva Reaps Benefits from Automatic Text Classification – An End User Case Study. 3rd Workshop on Operational Text Classification Systems.

  46. References and Related Materials
      • LexisNexis
        • LexisNexis, http://www.lexisnexis.com
        • LexisNexis Company Dossier, http://www.lexisnexis.com/companydossier/
        • Wasson (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Language Technology Joint Conference: ANLP-NAACL 2000.

  47. References and Related Materials
      • Thomson-West
        • Thomson-West, http://west.thomson.com
        • Westlaw Profiler, http://west.thomson.com/store/product.asp?product%5Fid=Westlaw+Profiler&catalog%5Fname=wgstore
        • Dozier & Haschart (2000). Automatic Extraction and Linking of Person Names in Legal Text. RIAO-2000.
        • Dozier et al. (2003). Creation of an Expert Witness Database Through Text Mining. 9th International Conference on Artificial Intelligence and Law.
        • Dabney et al. (2003). West km 2.0 – Classifying Document Collections with CaRE. Thomson-West white paper.
