Tsuen-Wan “Johnny” Ngan Symantec Research Labs Reducing E-Discovery Cost by Filtering Included Emails
The E-Discovery Problem • Email has become a core part of communications • Storage is a pain • Problem worsened by legislation like SOx • E-discovery: discovery of evidence from electronic data in civil litigation • Emails are manually reviewed by lawyers • Time-consuming and expensive • Reduce this cost by reviewing fewer emails
The E-Discovery Process Electronic Discovery Reference Model • Identification • Preservation • Collection • Processing • Review • Analysis • Production • Presentation • Some stages are done once; others are done once per litigation • As data moves through the stages, volume decreases and relevance increases
How to Filter? • Must be careful not to remove valuable evidence • The last email in an email thread often contains the whole conversation • (More?) likely in a corporate environment, between executives • The other emails in the thread can then be ignored without affecting accuracy • Grouping "similar" emails can also expedite review
Basic Unit to Compare Emails • When is an email included? • The whole email verbatim? • All sentences in any order? • The paragraph is chosen as a midpoint • "Idea" is usually preserved • Usually unmodified after quotation • Fewer of them for efficient comparisons
System Overview • Target use in a live email archive system • Emails arrive over time • Need to find inclusion in both directions • Does it include any other email? • Is it included by any other email? • Given an email: • Find candidate emails by finding shared paragraphs • Bottleneck: Some paragraphs are shared by many emails • "Hi", "Thanks", "John", ads, disclaimers
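The two directions of inclusion can be illustrated with a brute-force baseline. This is a minimal sketch, assuming that an email is "included" when its set of paragraphs is a subset of another email's paragraphs; the function name `inclusion_relations` and the dictionary-based archive are hypothetical, and the system described in the slides narrows the comparison to candidate emails rather than scanning the whole archive.

```python
from typing import Dict, FrozenSet, List, Tuple


def inclusion_relations(
    new_id: str,
    new_paras: FrozenSet[str],
    archive: Dict[str, FrozenSet[str]],
) -> Tuple[List[str], List[str]]:
    """Return (includes, included_by) for a newly arrived email.

    Assumption: inclusion means "paragraph set is a subset".
    This scans every archived email, so it is only a baseline.
    """
    includes: List[str] = []      # archived emails contained in the new one
    included_by: List[str] = []   # archived emails that contain the new one
    for other_id, other_paras in archive.items():
        if other_id == new_id:
            continue
        if other_paras <= new_paras:
            includes.append(other_id)
        if new_paras <= other_paras:
            included_by.append(other_id)
    return sorted(includes), sorted(included_by)


archive = {
    "a": frozenset({"p1"}),
    "b": frozenset({"p1", "p2", "p3"}),
}
includes, included_by = inclusion_relations("c", frozenset({"p1", "p2"}), archive)
print(includes, included_by)
```

Under this assumption, an email that is included by some other email is the one a reviewer could skip.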
Popular vs. Unpopular Paragraphs • Build two inverted indices • Unpopular paragraphs to emails • Popular paragraphs to emails • For emails with unpopular paragraphs • Only use these unpopular paragraphs to find candidates • For emails with only popular paragraphs • Need to compare with many candidates • But this is extremely rare!
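The two-index idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `TwoLevelIndex`, the `popular_threshold` parameter, and the rule that a paragraph becomes popular once that many emails share it are all hypothetical.

```python
from typing import Dict, Iterable, Set


class TwoLevelIndex:
    """Two inverted indices (paragraph -> email ids), split by popularity."""

    def __init__(self, popular_threshold: int = 3) -> None:
        # Assumed rule: a paragraph shared by at least this many emails is popular.
        self.popular_threshold = popular_threshold
        self.unpopular: Dict[str, Set[str]] = {}
        self.popular: Dict[str, Set[str]] = {}

    def add(self, email_id: str, paragraphs: Iterable[str]) -> None:
        for p in paragraphs:
            if p in self.popular:
                self.popular[p].add(email_id)
                continue
            ids = self.unpopular.setdefault(p, set())
            ids.add(email_id)
            if len(ids) >= self.popular_threshold:
                # Move the posting list to the popular index.
                self.popular[p] = self.unpopular.pop(p)

    def candidates(self, email_id: str, paragraphs: Iterable[str]) -> Set[str]:
        paras = set(paragraphs)
        rare = [p for p in paras if p not in self.popular]
        # Use only unpopular paragraphs when the email has any;
        # fall back to popular ones only when there is nothing else.
        keys = rare if rare else list(paras)
        found: Set[str] = set()
        for p in keys:
            found |= self.unpopular.get(p, set()) | self.popular.get(p, set())
        found.discard(email_id)
        return found


index = TwoLevelIndex(popular_threshold=3)
index.add("e1", ["Hi", "Budget is approved"])
index.add("e2", ["Hi", "Budget is approved", "Great news"])
index.add("e3", ["Hi", "Lunch?"])
print(index.candidates("e2", ["Hi", "Budget is approved", "Great news"]))
```

The returned candidates would still need a subset check. In this sketch an archived email made up only of popular paragraphs is not reachable through the unpopular index, which corresponds to the rare case the slide mentions.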
Bloom Filters to Compare Subsets • A space-efficient data structure to test set membership • Extended to test for subsets • Fast way to filter false positives
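A Bloom-filter subset test can be sketched as below. The idea is that if every paragraph of one email also appears in another, then every bit set in the first email's filter is also set in the second's, so a single bitwise check can rule out most non-inclusions before an exact comparison. The filter size `M_BITS`, the hash count `K_HASHES`, and the use of SHA-256 are illustrative assumptions; the slides do not state the parameters used.

```python
import hashlib
from typing import Iterable

M_BITS = 256    # filter size in bits (illustrative value)
K_HASHES = 3    # bit positions set per item (illustrative value)


def bloom(items: Iterable[str]) -> int:
    """Build a Bloom filter for a set of strings, stored as a Python int."""
    bits = 0
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(K_HASHES):
            # Derive K_HASHES positions from disjoint 4-byte slices of the digest.
            chunk = digest[4 * i : 4 * i + 4]
            bits |= 1 << (int.from_bytes(chunk, "big") % M_BITS)
    return bits


def maybe_subset(small: int, big: int) -> bool:
    """False means definitely not a subset; True means possibly a subset."""
    return (small & ~big) == 0


reply = bloom(["Can we ship Friday?", "No", "Agreed."])
original = bloom(["Can we ship Friday?", "No"])
print(maybe_subset(original, reply))
```

A `True` result can be a false positive, so it would be followed by an exact paragraph comparison; a `False` result lets the candidate be discarded immediately.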
Experiment Result Highlights • Data Sets: • Enron email trace (517k emails at 961MB) • Mailing list discussion groups (487k emails at 680MB) • Duplicated emails are removed in advance • ~20% of emails can be filtered • Processing speed: 2 to 4MB/s on commodity hardware • Scales reasonably well • Processing the last 1% of emails is only 40 to 50% slower than the first 1%
Summary • Observation: Emails usually contain unpopular paragraphs • Experiments showed a 20% reduction in emails • Huge cost saving for reviews • Computation is fast enough for practical usage • Dividing paragraphs into popular and unpopular is a special case • Could potentially divide into more levels • Benefit from finer granularity left as future work
Email Threads • Cannot simply use thread ids to find all threads • They may not always be available • They may not be compatible • Threads != Inclusion • Emails in the same thread may not include each other • Emails in different threads may include each other • Still need to do all comparisons
Implementation Highlights • Remove text generated by email software • Divide email into paragraphs • Hash alphanumerical characters in each paragraph • Remove formatting characters • Use Bloom filters for fast approximate subset test • Inverted index built (paragraph -> email) • Popular paragraphs become a bottleneck • Handle popular/unpopular paragraphs differently
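The paragraph-splitting and hashing steps can be sketched as follows. This is a minimal sketch under stated assumptions: a line with no alphanumeric characters (blank, or a bare ">" quote marker) is treated as a paragraph separator, SHA-1 stands in for whatever hash the system uses, and removal of text generated by email software is not covered. The helper names `split_paragraphs`, `paragraph_key`, and `email_keys` are hypothetical.

```python
import hashlib
from typing import List, Set


def split_paragraphs(body: str) -> List[str]:
    """Split an email body into paragraphs.

    Assumption: any line without alphanumeric characters ends a paragraph.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    for line in body.splitlines():
        if any(ch.isalnum() for ch in line):
            current.append(line)
        elif current:
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


def paragraph_key(paragraph: str) -> str:
    """Hash only the alphanumeric characters, dropping formatting such as '> '."""
    kept = "".join(ch for ch in paragraph if ch.isalnum())
    return hashlib.sha1(kept.encode("utf-8")).hexdigest()


def email_keys(body: str) -> Set[str]:
    """Return the set of paragraph hashes for one email body."""
    return {paragraph_key(p) for p in split_paragraphs(body)}


original = "Can we ship Friday?\n\nNo"
reply = "Agreed.\n\n> Can we ship Friday?\n>\n> No"
print(email_keys(original) <= email_keys(reply))
```

Because quote markers and whitespace are dropped before hashing, a quoted paragraph produces the same key as the original, and the short paragraph "No" keeps its own key instead of being discarded.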
Cannot Ignore Short Paragraphs • A short paragraph like "No" can carry important meaning • Ignoring them could lose important evidence