Enhancing Legal Discovery with Linguistic Processing
Daniel G. Bobrow, Research Fellow, Palo Alto Research Center Inc.
with Tracy King and Lawrence Lee
June 4, 2007
The problems in Legal Discovery
• Recall
   • Nothing relevant left behind
• Precision
   • Very little irrelevant to ignore
• Scalability
   • Need to handle more and more
• Privacy
   • What they see is only what they should get
Today: negotiated keyword search protocol
• All documents discussing or referencing scientific research on the effects of secondhand smoking published prior to 1985.
   Defendant’s Initial Proposal: “secondhand smok!” and (finding or science or or research) and (1985 or 1984 or 1983 or 1982 or 1981 or 1980 or 197! or 196! or 195!)
   Plaintiffs’ Rejoinder: ((find! or result! or effect!) w/page (secondhand or “second hand”)) or (other! w/5 smok!)
• All documents relating to destruction of records under defendants’ records retention policies and practices.
   Defendant’s Initial Proposal: “records” and “destruction”
   Plaintiffs’ Counterproposal: destr! or elim! or dispos! or purg! or recycl! or retain! or reten!
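To make the gap between the two record-destruction proposals concrete, here is a minimal sketch of how such queries might be evaluated. It assumes that a trailing "!" means "any word ending", as in common legal search syntaxes; the helper names and the sample document are invented for illustration and are not part of the original protocol.

```python
import re

def term_to_regex(term):
    """Compile one query term; a trailing '!' is read as 'any word ending'."""
    if term.endswith("!"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.compile(pattern, re.IGNORECASE)

def matches_all(terms, text):
    """AND-query: every term must occur in the text."""
    return all(term_to_regex(t).search(text) for t in terms)

def matches_any(terms, text):
    """OR-query: at least one term must occur in the text."""
    return any(term_to_regex(t).search(text) for t in terms)

# Terms taken from the slide; the document below is a made-up example.
defendant = ["records", "destruction"]
plaintiff = ["destr!", "elim!", "dispos!", "purg!", "recycl!", "retain!", "reten!"]

doc = "Old files were purged and shredded last week."
defendant_hit = matches_all(defendant, doc)
plaintiff_hit = matches_any(plaintiff, doc)
print(defendant_hit, plaintiff_hit)
```

The narrow AND-query misses a document that never uses the words "records" or "destruction", while the broad truncated OR-query catches it, at the likely cost of many irrelevant hits.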
Linguistic enhancement of keyword queries
• Inflexional morphology: forms of verbs
   • destroy → destroys, destroyed, destroying, …
   • comply → complies, complied, complying
• Derivational morphology: verbs → nouns
   • destroy → destruction, destroyer, ..
   • comply → compliance, …
   • retain → retention, …
• Word taxonomy (e.g. WordNet)
   • result → consequence, effect, outcome, result, event, issue, upshot
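The expansion idea above can be sketched with a toy lexicon. The tables below simply copy the slide's examples; a real system would presumably use a morphological analyzer and WordNet rather than hand-written dictionaries, and the function name `expand` is hypothetical.

```python
# Toy lexicon built from the slide's examples (an assumption for illustration;
# not the resources the authors actually used).
INFLECTIONS = {
    "destroy": ["destroys", "destroyed", "destroying"],
    "comply": ["complies", "complied", "complying"],
}
DERIVATIONS = {
    "destroy": ["destruction", "destroyer"],
    "comply": ["compliance"],
    "retain": ["retention"],
}
SYNONYMS = {
    "result": ["consequence", "effect", "outcome", "event", "issue", "upshot"],
}

def expand(keyword):
    """Return the keyword plus every related form found in the toy lexicon."""
    forms = [keyword]
    for table in (INFLECTIONS, DERIVATIONS, SYNONYMS):
        for form in table.get(keyword, []):
            if form not in forms:
                forms.append(form)
    return forms

print(expand("destroy"))
# ['destroy', 'destroys', 'destroyed', 'destroying', 'destruction', 'destroyer']
```

Unlike the truncation pattern "destr!", this kind of expansion only adds forms that are actually related to the keyword, and it can reach forms that share no prefix, such as "retain" and "retention".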
Processing the collection rather than the queries
ASKER: A Semantically-indexed Knowledge Repository
[Diagram. Indexing path: Intelligence Source Documents → Text Passages → Normalize to AKR (using inference-sensitive lexical resources) → Passage, AKR + index terms → ASKER Knowledge repository (Passages + AKR with semantic index). Query path: Query → Query AKR → Expand / Simplify → Query index terms → Retrieved passages + AKR → Entailment & Contradiction Detection → Filtered answers.]
Normalize to Semantic Representation
• Syntactic Normalization
   • morphological:
      • bought → buy +past
   • structural:
      • the file was lost by Mary → Mary lost the file
   • derivational:
      • the destruction of the memo by the CEO → the CEO destroyed the memo
• Semantic normalization
   • word to list of WordNet synsets
      • buy → [buy, purchase, …] [ …]
   • Connect predicate and arguments
      • Pred: destroy Agent: CEO Theme: memo
   • Fill in implicit arguments
      • Ed was easy to please → Ed was pleased
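As a rough illustration of structural normalization, the sketch below maps a simple active clause and its passive counterpart to the same predicate-argument frame. It is a toy pattern matcher under strong assumptions (a three-verb lexicon, one clause per sentence); the names `Frame`, `LEMMAS` and `normalize` are invented here, and a real pipeline would rely on a full parser.

```python
import re
from typing import NamedTuple, Optional

class Frame(NamedTuple):
    pred: str
    agent: str
    theme: str

# Hypothetical mini-lexicon mapping surface verb forms to lemmas.
LEMMAS = {"lost": "lose", "destroyed": "destroy", "bought": "buy"}
VERBS = "|".join(LEMMAS)

PASSIVE = re.compile(rf"^(?P<theme>.+?) was (?P<verb>{VERBS}) by (?P<agent>.+)$")
ACTIVE = re.compile(rf"^(?P<agent>.+?) (?P<verb>{VERBS}) (?P<theme>.+)$")

def normalize(sentence: str) -> Optional[Frame]:
    """Map a simple active or passive clause to one predicate-argument frame."""
    text = sentence.strip().rstrip(".")
    for pattern in (PASSIVE, ACTIVE):  # passive first: it is the more specific pattern
        m = pattern.match(text)
        if m:
            return Frame(LEMMAS[m["verb"]], m["agent"], m["theme"])
    return None

print(normalize("the file was lost by Mary"))
print(normalize("Mary lost the file"))
```

Both calls yield a frame with predicate "lose", agent "Mary" and theme "the file", so an index built on frames treats the two surface forms as the same fact.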
Improved Recall (Google and Asker on Wikipedia)
Query: How many terrorists have died?
Google:
• In addition to the 19 hijackers, 2973 people died in the terrorist attack ...
• Although there were security alerts at many locations, no other terrorist incidents occurred outside central London.
• This is a list of sportspeople who have died …
Asker:
• The encounter resulted in the deaths of two terrorists of the Al Omar Tanzeem
• In blazing gunfire, five of the insurgents perished…
• “…see to it that those terrorists die and are broken”
Improved Precision (Using argument roles for relevance test)
Query: What terrorists have been killed?
Google:
• .. not include most people killed in big terrorist bombings
• …act of terrorism in which 93 innocent people have been killed or are missing in the ruins
Asker:
• During a two-hour gun battle in Mdantsane, police kill a terrorist or freedom fighter
• All the three terrorists killed in this incident have been identified as Pakistani Nationals.
• … the former Socialist government carried out a covert campaign in which 27 suspected Basque terrorists were killed.
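The difference between keyword matching and an argument-role test can be sketched as follows. The frames are written by hand and stand in for the output of a semantic parser; the field names and both matching functions are assumptions made for this example, not the Asker implementation.

```python
# Hand-written frames (an assumption): in a real system these would be
# produced by parsing and semantic normalization of each passage.
passages = [
    {"text": "93 innocent people have been killed in an act of terrorism",
     "pred": "kill", "theme": "people"},
    {"text": "police kill a terrorist",
     "pred": "kill", "agent": "police", "theme": "terrorist"},
    {"text": "27 suspected Basque terrorists were killed",
     "pred": "kill", "theme": "terrorist"},
]

def keyword_match(passage, stems):
    """Keyword baseline: every query stem is a prefix of some token."""
    tokens = passage["text"].lower().split()
    return all(any(tok.startswith(s) for tok in tokens) for s in stems)

def role_match(passage, pred, theme):
    """Role test: the queried entity must fill the theme role of the predicate."""
    return passage["pred"] == pred and passage.get("theme") == theme

keyword_hits = [p["text"] for p in passages if keyword_match(p, ["terror", "kill"])]
role_hits = [p["text"] for p in passages if role_match(p, "kill", "terrorist")]

print(len(keyword_hits), len(role_hits))
```

The keyword baseline accepts all three passages, including the one in which the people killed are victims rather than terrorists; the role test rejects that passage because "people", not "terrorist", fills the theme role.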
Scalability (Cost of doing linguistic processing at scale)
• Linguistic processing time: < 1 CPU sec/sentence
   • parsing, semantic normalization, indexing
• Assumptions:
   • Average collection size: 100 million documents
   • Document size: 25 sentences (so at most 25 CPU sec/document)
   • 8-core processor: $6K, or $250/month (depreciated and housed for 3 years)
   • 2.5 million seconds/month = 100,000 documents/core/month
• Cost for handling 100 million documents/month
   • 1000 cores = 125 processors × $250 = $31,250 (about $32,000)
• Use human review: query costs are in the noise
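The cost estimate can be reproduced step by step. The sketch below uses only the slide's figures; the variable names are invented, and the 1 CPU second per sentence is the slide's upper bound rather than a measured value.

```python
import math

# Figures from the slide.
SECONDS_PER_SENTENCE = 1          # upper bound ("< 1 CPU sec/sentence")
SENTENCES_PER_DOCUMENT = 25
SECONDS_PER_MONTH = 2_500_000     # the slide's round figure, about 29 days
DOCUMENTS = 100_000_000
CORES_PER_PROCESSOR = 8
COST_PER_PROCESSOR_MONTH = 250    # dollars, depreciated and housed

seconds_per_document = SECONDS_PER_SENTENCE * SENTENCES_PER_DOCUMENT
docs_per_core_month = SECONDS_PER_MONTH // seconds_per_document
cores = math.ceil(DOCUMENTS / docs_per_core_month)
processors = math.ceil(cores / CORES_PER_PROCESSOR)
monthly_cost = processors * COST_PER_PROCESSOR_MONTH

print(docs_per_core_month, cores, processors, monthly_cost)
# 100000 1000 125 31250
```

The exact product is $31,250, which the slide rounds to $32,000.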
Privacy
• Identify sensitive content by entity type and relationship (linguistic processing)
   • e.g. Phone numbers of people
• Encrypt content to make it unreadable (PARC security technology)
• Provide content-specific keys for those people with a need to know specific information
• Additional PARC security technologies can identify additional content to be redacted to mitigate inference channels
   • Can redacted information be discovered based on what is available?
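As a much simpler stand-in for the first step, the sketch below finds phone numbers with a regular expression and replaces them with a placeholder. This is a swapped-in technique, not the entity-and-relationship analysis or the PARC encryption described above; the pattern assumes North American number formats, and `redact_phones` is a hypothetical helper.

```python
import re

# Assumed pattern: North American numbers such as 650-555-0100 or (650) 555-0100.
PHONE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")

def redact_phones(text, placeholder="[PHONE]"):
    """Replace every phone number matched by the assumed pattern."""
    return PHONE.sub(placeholder, text)

print(redact_phones("Call Ed at 650-555-0100 or (650) 555-0199."))
# Call Ed at [PHONE] or [PHONE].
```

A pattern like this knows nothing about whose number it is; tying a number to a person is where the linguistic processing of entities and relationships would come in.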
Linguistic processing can be useful in legal discovery
With good Recall, Precision, Scalability, Privacy
Thank you