Concepts, Semantics and Syntax in E-Discovery

David Eichmann, Institute for Clinical and Translational Science, The University of Iowa

Presentation Transcript


  1. Concepts, Semantics and Syntax in E-Discovery
     David Eichmann, Institute for Clinical and Translational Science, The University of Iowa

  2. Our Approach
     • Analyze the human-generated metadata available for document collections for organizational and individual interactions
     • Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata
     • Explore the concept space generated by the previous step and its correspondence to Boolean predicate specification in discovery

  3. Our Target Corpus
     • The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0
     • Derived from the tobacco Master Settlement Agreement
     • Comprises 6,910,192 ‘documents’, or more properly the OCR output from those documents
     • Two merged XML tag sets of metadata, with overlapping content
       • <A>
       • <LTDLWOCR>

  4. Metadata Entity Frequencies

  5. Metadata Entity Frequencies

  6. Metadata Entity Frequencies

  7. Metadata Entity Frequencies

  8. Database Schema
     • We map the XML structure to a set of relational database tables
     • Non-recurring fields are collected in a table named ‘document’
       • docid
       • title
       • description
       • OCR text
     • Recurring elements each get a table
       • docid
       • value
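The mapping on the schema slide can be sketched roughly as follows. This is a minimal illustration using SQLite and the standard-library XML parser, not the schema actually used for CDIP. The element names (`record`, `ocr_text`, `author`, `organization`) and the sample record are hypothetical stand-ins; the real tag sets differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical element names chosen for illustration only.
SINGLE_FIELDS = ("title", "description", "ocr_text")
RECURRING_FIELDS = ("author", "organization")


def create_schema(conn):
    """Create the 'document' table plus one (docid, value) table per recurring element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document ("
        "docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr_text TEXT)"
    )
    for name in RECURRING_FIELDS:
        # Table names come from the fixed tuple above, never from user input.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} (docid TEXT NOT NULL, value TEXT NOT NULL)"
        )


def load_record(conn, xml_text):
    """Insert one XML record: single-valued fields go to 'document', repeats to their own tables."""
    root = ET.fromstring(xml_text)
    docid = root.findtext("docid")
    singles = [root.findtext(name) for name in SINGLE_FIELDS]
    conn.execute("INSERT INTO document VALUES (?, ?, ?, ?)", [docid, *singles])
    for name in RECURRING_FIELDS:
        for element in root.findall(name):
            conn.execute(f"INSERT INTO {name} VALUES (?, ?)", (docid, element.text or ""))
    return docid


# Made-up sample record, not taken from the collection.
SAMPLE = """<record>
  <docid>doc-001</docid>
  <title>Memo</title>
  <description>Internal memo</description>
  <ocr_text>Sample OCR text.</ocr_text>
  <author>Example,A</author>
  <author>Example,B</author>
  <organization>Example Org</organization>
</record>"""

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    load_record(connection, SAMPLE)
    print(connection.execute("SELECT docid, value FROM author").fetchall())
```

Keeping each recurring element in its own two-column table means a document with any number of authors or organizations needs no schema change, and joins on `docid` recover the full record.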

  9. Identifying an Individual

  10. How Many Reininghaus?
      • Reininghaus,R
      • Reininghaus,W

  11. Co-mention Connections

  12. Co-mention Connections

  13. Co-mention Connections

  14. Co-mention Affiliations

  15. Semantics and Structure
      • Our analysis of content involves the following phases:
        • Lexical analysis
        • Sentence boundary detection
        • Named entity recognition
        • Sentence parsing
        • Relationship extraction
      • The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
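To make the point about OCR noise concrete, here is a deliberately naive sketch of the first two phases (lexical analysis and sentence boundary detection) built on regular expressions. It is an assumption-laden toy, not the analyzer used in this work: the function names and the splitting rule (sentence-final punctuation, whitespace, then a capital letter) are illustrative choices only.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace and an uppercase letter.
# This rule is an illustrative assumption; it mishandles abbreviations such as "Dr. Smith".
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A token is a run of word characters or a single punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Return the sentences found by the naive boundary rule."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Return word and punctuation tokens for one sentence."""
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    clean = "The memo was sent. It was received."
    # Simulated OCR damage: a period misread as a comma, and a stray period inside a word.
    noisy = "The memo was sent , It was rece.ived."
    for label, text in (("clean", clean), ("noisy", noisy)):
        sentences = split_sentences(text)
        print(label, len(sentences), [tokenize(s) for s in sentences])
```

On the clean string the rule finds two sentences; on the damaged string it finds one, and the tokenizer splits the damaged word into three tokens. Errors of this kind propagate into every later phase, since entity recognition and parsing both consume sentence and token boundaries.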

  16. CDIP Parse Tree Complexity

  17. Clean Text Parse Tree Complexity

  18. Next Steps
      • Experiment with custom lexical analysis of the OCR
      • Start with simple white space detection
      • Construct a lexicon and look for out-of-band vocabulary as OCR error candidates
      • Rewrite the analyzer to support OCR error correction
      • Run sentence boundary detection and parsing over the full corpus
      • Generate entity relationships using our question answering framework
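The lexicon step above might look something like the following sketch. It assumes, purely for illustration, that the lexicon is built from the corpus itself by keeping whitespace-delimited tokens that occur at least `min_count` times; the slide does not say how the lexicon is constructed, and an external dictionary would serve equally well. All function names are hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]"


def whitespace_tokens(text):
    """Simple white space tokenization, the starting point named on the slide."""
    return text.split()


def normalize(token):
    """Strip surrounding punctuation and lowercase so 'Settlement.' matches 'settlement'."""
    return token.strip(PUNCTUATION).lower()


def build_lexicon(documents, min_count=2):
    """Keep normalized tokens seen at least min_count times (the threshold is an assumption)."""
    counts = Counter(
        normalize(token) for doc in documents for token in whitespace_tokens(doc)
    )
    counts.pop("", None)  # tokens that were pure punctuation normalize to the empty string
    return {token for token, n in counts.items() if n >= min_count}


def error_candidates(text, lexicon):
    """Return the original tokens whose normalized form is absent from the lexicon."""
    candidates = []
    for token in whitespace_tokens(text):
        word = normalize(token)
        if word and word not in lexicon:
            candidates.append(token)
    return candidates


if __name__ == "__main__":
    corpus = [
        "the tobacco settlement",
        "the tobacco agreement settlement",
        "agreement",
    ]
    lexicon = build_lexicon(corpus)
    print(error_candidates("The tobaceo settlement.", lexicon))
```

A frequency threshold is a crude filter: a rare but legitimate word is flagged alongside genuine OCR errors, so the output is best read as a candidate list for later correction rather than a list of confirmed errors.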

  19. And Beyond That…
      • Return to the document images and analyze document layout
      • Regenerate OCR to include token coordinates
      • Use our PDF structure extraction framework to generate logical document structure
      • Generate a set of document models based upon similar layout
      • Use the document models to map OCR text to metadata elements

  20. For Example

  21. For Example
