
Analysis 360: Blurring the line between EDA and PC

Andrea Gibson, Product Director, Kroll Ontrack. March 27, 2014. Discussion overview: Pushing the Boundaries of Early Data Analysis (EDA); Examining Traditional EDA Tools; Leveraging Predictive Coding (PC) for Analysis.



Presentation Transcript


  1. Analysis 360: Blurring the line between EDA and PC Andrea Gibson, Product Director, Kroll Ontrack March 27, 2014

  2. Discussion Overview • Pushing the Boundaries of Early Data Analysis (EDA) • Examining Traditional EDA Tools • Leveraging Predictive Coding (PC) for Analysis • Using PC in an EDA Environment

  3. Pushing the Boundaries of EDA

  4. EDA | an acronym worth defining • Early Data Analysis (EDA) aids fact-finding and narrows the data scope by helping attorneys understand their datasets • Triage data into critical and non-critical groupings • Identify and reduce the number of key players • Test search terms • Identify critical case arguments • Categorize documents as efficiently as possible for production • A true methodology – technology fuels human decisions

  5. Traditional EDA | Overview • Workflow, from collection to production: Identify, Collect & Process; Export to Review Platform; Import & Perform Early Analysis; Document Review • Processing • Ensure portability of groups and tags • Ensure production/search capabilities of review platform • Analysis • Log • Route • Report • Test • QC • Filter • Search • Cluster • Search • Tag • Redact

  6. Where does Predictive Coding fit in? • Workflow, from collection to production: Identify, Collect & Process; Export to Review Platform; Import & Perform Analysis; Document Review • Processing • Ensure portability of groups and tags • Ensure production/search capabilities of review platform • Analysis • Log • Route • Report • Test • QC • Filter • Search • Cluster • Search • Tag • Redact • Predictive Coding!

  7. Traditional EDA | How efficient is it? • Workflow, from collection to production: Identify, Collect & Process; Export to Review Platform; Import & Perform Analysis; Review • Ensure portability of groups and tags • Ensure production/search capabilities of review platform • The Bermuda Triangle of ediscovery • PC is massively underused • The tools used during analysis and review overlap substantially • Pointless inefficiencies are created by jockeying data between two standalone platforms • Analysis • Log • Route • Report • Test • QC • Search • Tag • Redact • Filter • Search • Cluster • Predictive Coding!

  8. EDA + Review | Could it look like this? • Workflow, from collection to production: Identify, Collect & Process; Analyze and Review • Test • QC • Route • Report • Tag • Process • PC • Filter • Search • Cluster

  9. Examining Traditional EDA Tools

  10. Keyword Search & Concept Search • Keyword search: uses search terms and Boolean operators (&, or, not) to retrieve documents that contain those exact terms • Standard practice • Generally accepted in the courts • Example: “baseball & field” • Concept search: a technology alternative • Allows reviewers to find conceptually similar documents even if they do not contain the exact search terms • Seldom used for filtering; increasingly used for review • Example: “baseball” → diamond, MLB, hit, out
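The exact-term behavior of keyword search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document corpus and the helper names `tokenize` and `boolean_and` are hypothetical and are not taken from any review platform.

```python
import re

def tokenize(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_and(docs, *terms):
    """Return ids of documents that contain every one of the given terms."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokenize(text)]

# Hypothetical corpus, for illustration only.
docs = {
    "d1": "The baseball field was flooded.",
    "d2": "He hit a home run at the diamond.",
    "d3": "Baseball season opens Monday.",
}

# Equivalent of the query "baseball & field".
hits = boolean_and(docs, "baseball", "field")
print(hits)
```

Document `d2` is about baseball but contains neither search term, so an exact-term query never returns it. That is the gap concept search is meant to close.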

  11. Topic Grouping & Language Identification • Topic grouping: documents are automatically grouped by theme without human input • Language identification: identifies all languages in a document • Used to group and sort documents for review by multilingual reviewers • Example document labels shown: Contract, 非披露協議 (non-disclosure agreement), كتيب الموظف (employee handbook), Finance

  12. Email Threading & Near Deduplication • Email threading: identifies and groups e-mail conversations based on content, from the start-point email through RE: and FWD: messages to the end-point email • Near deduplication: reviewers can quickly identify and compare documents that are very similar to one another but are not exact duplicates
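One common way to score "very similar but not exact duplicates" is Jaccard similarity over word shingles. The slides do not say which method any particular product uses, so the sketch below is an assumption, and the sample texts and function names are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two texts: shared shingles / all distinct shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical documents, for illustration only.
draft_1 = "please review the attached draft contract before friday"
draft_2 = "please review the attached draft contract before monday"
unrelated = "lunch menu for the cafeteria this week"

print(jaccard(draft_1, draft_2))    # high score: near duplicates
print(jaccard(draft_1, unrelated))  # no shared shingles
```

A reviewer-facing tool would then group documents whose score exceeds a chosen threshold; the right threshold depends on the matter and is not specified in the slides.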

  13. Finding a Common Thread • At their core, these tools help attorneys learn more about their data • Does PC fit the bill? • Analytical tools shown: Keyword Search, Concept Search, Topic Grouping, Language ID, Email Threading, Dedupe, Predictive Coding

  14. Leveraging PC for Analysis

  15. Predictive Coding for Production

  16. Predictive Coding For Analysis • PC has been praised for its ability to reduce the number of documents manually reviewed during first pass • But at least three critical components of PC empower attorneys with unrivaled knowledge about their case: • Prioritization • Categorization • Active Learning

  17. The Prioritization Component • Learns from reviewer decisions and escalates documents based on two binary categories • Responsive or nonresponsive • Works after a modest amount of training • Increases the ratio of responsive documents that get routed to reviewers • Chart labels: Responsive, Non-responsive; values shown: 480,000 and 74,000

  18. The Prioritization Component • How does this help attorneys analyze their case? • When attorneys ‘check out’ documents to review, they see the documents most likely to be responsive • For the same reasons this speeds up production, attorneys who put eyes on these richly relevant documents will know more about their case earlier – driving arguments and filling knowledge gaps • It runs in the background; you don’t need to carve into billable hours to test keywords • Diagram: a requested batch drawn from the entire corpus
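The "check out a batch" step can be sketched as a simple ranking over predicted scores. The scores, document ids, and the `next_batch` helper below are hypothetical; a real system would refresh the scores as reviewers code more documents.

```python
def next_batch(scores, reviewed, size):
    """Return the `size` highest-scoring document ids not yet reviewed."""
    pending = [doc_id for doc_id in scores if doc_id not in reviewed]
    pending.sort(key=lambda doc_id: scores[doc_id], reverse=True)
    return pending[:size]

# Hypothetical predicted-responsiveness scores, for illustration only.
scores = {"a": 0.91, "b": 0.12, "c": 0.67, "d": 0.88, "e": 0.40}

# Document "a" has already been reviewed, so the batch starts with "d".
batch = next_batch(scores, reviewed={"a"}, size=2)
print(batch)
```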

  19. The Categorization Component • Learns from trainer decisions and suggests coding on multiple categories for an entire collection of documents • Assigns a predicted responsiveness score • Examples shown: 89% predicted Responsive; 75% predicted Non-responsive; 67% predicted Privileged • Improves speed and quality of categorization decisions

  20. The Categorization Component • How does this help attorneys analyze their case? • Allows attorneys to segregate data at user-defined predicted responsiveness ratings after modest training • Empowers attorneys to route certain categories of documents (e.g. “hot” docs) to certain sub-groups within the team • Figure: Post Round One Categorization Results (65% cutoff), showing counts of 9,522 docs and 1,427 docs on a 0% to 100% likelihood-to-be-responsive scale, with the note “To: Brief-writer Bryan Re: Good Luck on the first draft!”
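Segregating documents at a user-defined cutoff amounts to partitioning them by predicted score. The sketch below uses the slide's 65% cutoff as the default; the scores, ids, and the `split_at_cutoff` name are hypothetical, and whether a document exactly at the cutoff counts as above it is an assumption made here.

```python
def split_at_cutoff(scores, cutoff=0.65):
    """Partition document ids into (at or above cutoff, below cutoff)."""
    above = sorted(d for d, s in scores.items() if s >= cutoff)
    below = sorted(d for d, s in scores.items() if s < cutoff)
    return above, below

# Hypothetical predicted-responsiveness scores, for illustration only.
scores = {"doc1": 0.89, "doc2": 0.25, "doc3": 0.65, "doc4": 0.64}

likely_responsive, remainder = split_at_cutoff(scores)
print(likely_responsive)  # candidates to route to a sub-group, e.g. the brief writer
print(remainder)
```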

  21. The Active Learning Component • Key component of any true PC solution • Automatically escalates focus documents for training (as opposed to just handpicked, or just randomly selected training documents) • Focus Documents: • Come from grey areas in the classifier because the machine is currently uncertain whether they are responsive or not responsive • Ideal candidates to improve machine learning • Not random, but queried • Figure: score scale from 0% (non-responsive) to 100% (responsive)
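Picking documents from the classifier's grey area is often done with uncertainty sampling: select the documents whose predicted score is closest to 50%. The slides do not name the query strategy behind focus documents, so treating it as uncertainty sampling is an assumption, and the scores and the `focus_documents` helper are hypothetical.

```python
def focus_documents(scores, n):
    """Return the n document ids whose scores are closest to 0.5 (most uncertain)."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))[:n]

# Hypothetical predicted-responsiveness scores, for illustration only.
scores = {"a": 0.95, "b": 0.52, "c": 0.08, "d": 0.41, "e": 0.70}

# "b" (52%) and "d" (41%) sit nearest the 50% line, so they are queried first.
focus = focus_documents(scores, 2)
print(focus)
```

Confidently scored documents such as "a" and "c" are skipped because a human label on them would teach the classifier comparatively little.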

  22. The Active Learning Component • How does this help attorneys analyze their case? • Introduces attorneys to the documents on the fringe of relevancy • These could be case-changing documents that the machine just doesn’t know enough about yet • Most effective way to boost metrics and improve results between early training rounds • Reduces false positives; improves accuracy of machine’s concept of relevancy • Chart: recall and precision for training rounds TR1 and TR2

  23. Additional Efficiencies • Production • Can easily transition into production whether leveraging PC, or not • Most practical form of PC for EDA • Reporting • Even if just one or two training rounds are performed, metrics will show where you stand • In this vein, no other EDA tool comes close to PC’s automatic reporting • There’s a reason courts often ask for recall and precision: these indicate whether your understanding of the data set is accurate
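Recall and precision follow standard definitions: precision is the share of documents predicted responsive that truly are, and recall is the share of truly responsive documents that the prediction found. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for sets of predicted and truly responsive ids."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical ids, for illustration only.
predicted_responsive = {1, 2, 3, 4}
actually_responsive = {2, 3, 4, 5, 6}

precision, recall = precision_recall(predicted_responsive, actually_responsive)
print(precision, recall)
```

Here three of the four predicted documents are truly responsive (precision 0.75), and three of the five responsive documents were found (recall 0.6).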

  24. Additional Efficiencies • Other EDA tools complement predictive coding • Predictive coding requires reviewing a few thousand documents in training • Most PC solutions also come equipped with all other EDA tools available • This helps you navigate documents during training as well as during review • Intra-team quality control • Can compare reviewer-machine agreement rates side-by-side • Identify points of disagreement and inconsistency
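A reviewer-machine agreement rate can be computed as the fraction of a reviewer's coding decisions that match the machine's suggestion. The sketch below assumes decisions are available as (reviewer, human call, machine call) tuples; that layout, the reviewer names, and the `agreement_rates` helper are all hypothetical.

```python
from collections import defaultdict

def agreement_rates(decisions):
    """Return {reviewer: share of decisions matching the machine's call}."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for reviewer, human_call, machine_call in decisions:
        total[reviewer] += 1
        agree[reviewer] += int(human_call == machine_call)
    return {reviewer: agree[reviewer] / total[reviewer] for reviewer in total}

# Hypothetical coding decisions: R = responsive, NR = non-responsive.
decisions = [
    ("ann", "R", "R"),
    ("ann", "NR", "R"),
    ("bob", "R", "R"),
    ("bob", "NR", "NR"),
]

rates = agreement_rates(decisions)
print(rates)
```

A low rate does not by itself show who is wrong; it flags where to look for disagreement and inconsistency.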

  25. Additional Efficiencies • The small case conundrum • The analytical value from PC is greater where the subject-matter expert who trains the system is the same attorney who forms case strategy • This is most likely true in small-to-medium cases where one attorney may be in charge of a case through trial • The production value from using PC to aid review is greater where high upfront costs can be recouped by applying the machine’s logic to a large number of documents • Traditionally, this has been true only in large cases

  26. Additional Efficiencies • This is all changing • The “portfolio approach” to ediscovery • Pay yearly for PC (and everything that preceded it) in all your cases for a data hosting fee (process on the vendor’s side) • Upload on day one, train on day one, see a list of documents ranked by relevancy on day one

  27. Using PC in an EDA Environment

  28. Overview • It’s not that crazy • EDA tools let you learn more about your data—so does PC • Many of the tools discussed today (e.g. de-duplication, concept searching) already exist in standalone “PC solutions” • Aggressive culling via keywords can have an impact on training in PC • Any search strategy must be well designed according to the matter at hand • The producing party has substantial deference in conducting its search

  29. Pre-PC Keyword Cull? • In re Biomet • Defendant’s search strategy: a keyword cull reduced 19.5 million documents to 3 million documents, and PC was then applied before production • Plaintiffs argued: the defendant should have used PC on the whole 19.5 million document corpus; the keywords tainted the training. We want joint review of training docs. • Court held: defendant’s search was reasonable

  30. Parting Thoughts • There are many ways to learn about data • Different tools on the same belt; multi-modal search • Solutions are emerging that offer all of these tools in one location • No more data jockeying • More information for better decisions • Quality control is essential whenever you use one of these tools to remove documents from production
