Concepts, Semantics and Syntax in E-Discovery David Eichmann Institute for Clinical and Translational Science The University of Iowa
Our Approach • Analyze the human-generated metadata available for document collections for organizational and individual interactions • Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata • Explore the concept space generated by the previous step and its correspondence to boolean predicate specification in discovery
Our Target Corpus • The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0 • Derived from the tobacco Master Settlement Agreement • Comprises 6,910,192 ‘documents’ • Or, more properly, the OCR output from those documents • Two merged XML tag sets of metadata, with overlapping content • <A> • <LTDLWOCR>
Database Schema • We map the XML structure to a set of relational database tables • Non-recurring fields are collected in a table named ‘document’ • docid • title • description • OCR text • Recurring elements each get a table • docid • value
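A minimal sketch of this kind of mapping, assuming SQLite and a made-up record layout. The element names in `SAMPLE` (`record`, `ocr`, `author`) and the helper names are hypothetical and are not taken from the IIT CDIP tag sets; only the split between a `document` table and per-element `docid`/`value` tables comes from the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; the tag names are illustrative, not the real IIT CDIP tags.
SAMPLE = """
<record>
  <docid>doc001</docid>
  <title>Memo</title>
  <description>Internal memo</description>
  <ocr>Text of the memo</ocr>
  <author>Reininghaus,R</author>
  <author>Reininghaus,W</author>
</record>
"""

SINGLE = ("docid", "title", "description", "ocr")  # non-recurring fields
RECURRING = ("author",)  # each recurring element gets its own table


def create_schema(conn):
    """Create the 'document' table plus one (docid, value) table per recurring element."""
    conn.execute(
        "CREATE TABLE document "
        "(docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr TEXT)"
    )
    for name in RECURRING:
        # Table names come from the fixed RECURRING tuple, never from user input.
        conn.execute(f"CREATE TABLE {name} (docid TEXT, value TEXT)")


def load_record(conn, xml_text):
    """Insert one XML record: single-valued fields in 'document', the rest row by row."""
    root = ET.fromstring(xml_text)
    row = [root.findtext(field) for field in SINGLE]
    conn.execute("INSERT INTO document VALUES (?, ?, ?, ?)", row)
    docid = row[0]
    for name in RECURRING:
        for elem in root.findall(name):
            conn.execute(f"INSERT INTO {name} VALUES (?, ?)", (docid, elem.text))


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
authors = [
    r[0]
    for r in conn.execute(
        "SELECT value FROM author WHERE docid = ? ORDER BY value", ("doc001",)
    )
]
```

Keeping each recurring element in its own narrow table means a document with many values for the same element needs no schema change, at the cost of a join when the values are read back.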
How Many Reininghaus? • Reininghaus,R • Reininghaus,W
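One way to surface such variants is to group the metadata values by normalized surname. This is only a sketch: it assumes values follow a `Surname,Initials` pattern, the function name is hypothetical, and the third input value is an invented variant added for illustration. It cannot decide whether two entries refer to the same person.

```python
from collections import defaultdict


def group_by_surname(values):
    """Map each lowercased surname to the set of initials seen with it."""
    groups = defaultdict(set)
    for value in values:
        surname, _, initials = value.partition(",")
        groups[surname.strip().lower()].add(initials.strip().upper())
    return groups


# The last entry is a made-up spelling variant, not taken from the corpus.
names = ["Reininghaus,R", "Reininghaus,W", "reininghaus, r"]
groups = group_by_surname(names)
```

Here `groups["reininghaus"]` holds the initials `R` and `W`, which narrows the question to whether those two entries denote one individual or two.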
Semantics and Structure • Our analysis of content involves the following phases: • Lexical analysis • Sentence boundary detection • Named entity recognition • Sentence parsing • Relationship extraction • The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
Next Steps • Experiment with custom lexical analysis of the OCR • Start with simple white space detection • Construct a lexicon and look for out-of-vocabulary tokens as OCR error candidates • Rewrite the analyzer to support OCR error correction • Detect sentence boundaries in, and parse, the full corpus • Generate entity relationships using our question answering framework
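The first lexical-analysis steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the analyzer described in the slides: the function names, the punctuation set, the `min_count` threshold, and the three-sentence toy corpus are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Split on white space, strip surrounding punctuation, and lowercase."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()[]").lower()
        if token:
            tokens.append(token)
    return tokens


def build_lexicon(texts, min_count=2):
    """Treat tokens seen at least min_count times across the texts as known words."""
    counts = Counter(t for text in texts for t in tokenize(text))
    return {t for t, c in counts.items() if c >= min_count}


def error_candidates(text, lexicon):
    """Return tokens absent from the lexicon, in order of appearance."""
    return [t for t in tokenize(text) if t not in lexicon]


# Toy corpus; the third line simulates OCR confusing 'o' with '0'.
corpus = [
    "The committee reviewed the report.",
    "The report was sent to the committee.",
    "The c0mmittee reviewed the rep0rt.",
]
lexicon = build_lexicon(corpus)
candidates = error_candidates(corpus[2], lexicon)
```

A frequency threshold is a blunt instrument: rare but legitimate words fall outside the lexicon too, so the output is a list of candidates for review or correction, not a list of confirmed OCR errors.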
And Beyond That… • Return to the document images and analyze document layout • Regenerate OCR to include token coordinates • Use our PDF structure extraction framework to generate logical document structure • Generate a set of document models based upon similar layout • Use the document models to map OCR text to metadata elements