
Reducing E-Discovery Cost by Filtering Included Emails

Tsuen-Wan “Johnny” Ngan, Symantec Research Labs. The e-discovery problem: email has become a core part of communications, storage is a pain, and the problem is worsened by legislation like SOx.


Presentation Transcript


  1. Tsuen-Wan “Johnny” Ngan • Symantec Research Labs • Reducing E-Discovery Cost by Filtering Included Emails

  2. The E-Discovery Problem • Email has become a core part of communications • Storage is a pain • The problem is worsened by legislation like Sarbanes-Oxley (SOX) • E-discovery: discovery of evidence from electronic data in civil litigation • Emails are manually reviewed by lawyers • Time-consuming and expensive • Reduce this cost by reviewing fewer emails

  3. The E-Discovery Process: Electronic Discovery Reference Model • Identification • Preservation • Collection • Processing • Review • Analysis • Production • Presentation

  4. The E-Discovery Process: Electronic Discovery Reference Model • Identification • Preservation • Collection • Processing • Review • Analysis • Production • Presentation • Brace annotations grouping the stages: "Done once" and "Once per litigation"

  5. The E-Discovery Process: Electronic Discovery Reference Model • Identification • Preservation • Collection • Processing • Review • Analysis • Production • Presentation • Annotation: volume decreases and relevance increases as the process advances

  6. The E-Discovery Process: Electronic Discovery Reference Model • Identification • Preservation • Collection • Processing • Review • Analysis • Production • Presentation

  7. How to Filter? • Must be careful not to remove valuable evidence • The last email in an email thread often contains the whole conversation • (More?) likely in a corporate environment, between executives • The other emails can then be ignored without affecting accuracy • Grouping "similar" emails can also expedite review

  8. Basic Unit to Compare Emails • When is an email included? • The whole email verbatim? • All sentences in any order? • The paragraph is chosen as a midpoint • The "idea" is usually preserved • Usually unmodified after quotation • Fewer of them than sentences, so comparisons are efficient

  9. System Overview • Target use is in a live email archive system • Emails arrive over time • Need to find inclusion in both directions • Does the email include any other email? • Is it included by any other email? • Given an email: • Find candidate emails by finding shared paragraphs • Bottleneck: some paragraphs are shared by many emails • e.g. "Hi", "Thanks", "John", ads, disclaimers

  10. Popular vs. Unpopular Paragraphs • Build two inverted indices: • Unpopular paragraphs to emails • Popular paragraphs to emails • For emails with unpopular paragraphs: • Only use these unpopular paragraphs to find candidates • For emails with only popular paragraphs: • Need to compare with many candidates • But this is extremely rare!
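
The two-index idea on this slide can be sketched as follows. This is only an illustrative sketch, not the presenter's implementation: the class name `ParagraphIndex`, the `threshold` cutoff for calling a paragraph "popular", and the rule for promoting a paragraph from one index to the other are all assumptions made for the example.

```python
from collections import defaultdict

POPULAR_THRESHOLD = 3  # hypothetical cutoff; the slides do not give a value


class ParagraphIndex:
    """Two inverted indices: paragraph key -> set of email ids."""

    def __init__(self, threshold=POPULAR_THRESHOLD):
        self.threshold = threshold
        self.unpopular = defaultdict(set)
        self.popular = defaultdict(set)

    def add(self, email_id, paragraphs):
        for p in set(paragraphs):
            if p in self.popular:
                self.popular[p].add(email_id)
                continue
            self.unpopular[p].add(email_id)
            # Assumed promotion rule: move a paragraph to the popular
            # index once more than `threshold` emails share it.
            if len(self.unpopular[p]) > self.threshold:
                self.popular[p] = self.unpopular.pop(p)

    def candidates(self, email_id, paragraphs):
        paras = set(paragraphs)
        rare = [p for p in paras if p in self.unpopular]
        if rare:
            # Emails with unpopular paragraphs: use only those paragraphs.
            source, keys = self.unpopular, rare
        else:
            # Rare case: only popular paragraphs, so many candidates.
            source, keys = self.popular, [p for p in paras if p in self.popular]
        found = set()
        for p in keys:
            found |= source[p]
        found.discard(email_id)
        return found
```

Each candidate returned this way still needs a full inclusion check; the index only narrows down which emails are worth comparing.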

  11. Bloom Filters to Compare Subsets • A space-efficient data structure to test set membership • Extended to test for subsets • Fast way to filter false positives
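
One common way to extend a Bloom filter to subset testing is to build one bit signature per email and check that every bit set in the smaller signature is also set in the larger one. The sketch below assumes that approach; the slides do not spell out the exact construction, and `NUM_BITS`, `NUM_HASHES`, `bloom_signature`, and `may_be_subset` are hypothetical names and parameters chosen for illustration.

```python
import hashlib

NUM_BITS = 256   # assumed signature size
NUM_HASHES = 3   # assumed number of hash functions


def bloom_signature(paragraph_hashes, num_bits=NUM_BITS, num_hashes=NUM_HASHES):
    """Fold a set of paragraph hashes into one integer used as a bit array."""
    bits = 0
    for item in paragraph_hashes:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % num_bits
            bits |= 1 << position
    return bits


def may_be_subset(small_sig, large_sig):
    """True if every bit of small_sig is set in large_sig.

    A True result can be a false positive and needs an exact check;
    a False result proves that the first set is not a subset.
    """
    return (small_sig & ~large_sig) == 0
```

Because a true subset always passes this test, it can safely discard non-matching candidates before the more expensive exact paragraph comparison.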

  12. Experiment Result Highlights • Data sets: • Enron email trace (517k emails at 961 MB) • Mailing list discussion groups (487k emails at 680 MB) • Duplicated emails are removed in advance • ~20% of emails can be filtered • Processing speed: 2 to 4 MB/s on commodity hardware • Scales reasonably well • Processing the last 1% is only 40 to 50% slower than the first 1%

  13. Summary • Observation: emails usually contain unpopular paragraphs • Experiments showed a 20% reduction in emails • Huge cost saving for reviews • Computation is fast enough for practical usage • Dividing paragraphs into popular and unpopular is a special case • Could potentially divide into more levels • Benefit from finer granularity is left as future work

  14. Thank You!

  15. Backup slides

  16. Email Threads • Cannot simply use thread IDs to find all threads • They may not always be available • They may not be compatible • Threads != Inclusion • Emails in the same thread may not include each other • Emails in different threads may include each other • Still need to do all comparisons

  17. Implementation Highlights • Remove text generated by email software • Divide each email into paragraphs • Hash the alphanumeric characters in each paragraph • Remove formatting characters • Use Bloom filters for a fast approximate subset test • Inverted index built (paragraph -> email) • Popular paragraphs become the bottleneck • Handle popular/unpopular paragraphs differently
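
The paragraph-splitting and hashing steps listed on this slide might look roughly like the sketch below. It is a minimal illustration under stated assumptions: paragraphs are taken to be separated by blank lines, `>` is treated as the quote marker, case is preserved, SHA-1 is an arbitrary choice of hash, and the function name `paragraph_hashes` is hypothetical. Removal of text generated by email software is not shown.

```python
import hashlib
import re


def paragraph_hashes(body):
    """Return one hash per paragraph, computed over alphanumeric characters only."""
    # Strip leading quote markers and whitespace so that a quoted blank
    # line (">") also acts as a paragraph separator.
    lines = [re.sub(r"^[>\s]+", "", line).rstrip() for line in body.splitlines()]
    hashes = []
    current = []
    for line in lines + [""]:  # trailing "" flushes the last paragraph
        if line:
            current.append(line)
        elif current:
            # Keep only alphanumeric characters, dropping formatting.
            key = "".join(ch for ch in "".join(current) if ch.isalnum())
            if key:
                hashes.append(hashlib.sha1(key.encode("utf-8")).hexdigest())
            current = []
    return hashes
```

With this normalization, a paragraph that has been quoted and re-wrapped in a reply hashes to the same value as the original, and a short paragraph such as "No" still gets its own hash.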

  18. Cannot Ignore Short Paragraphs • A short paragraph like "No" can carry important meaning • Ignoring them could lose important evidence
