Analysis 360: Blurring the line between EDA and PC. Andrea Gibson, Product Director, Kroll Ontrack. March 27, 2014. Discussion Overview: Pushing the Boundaries of Early Data Analysis (EDA); Examining Traditional EDA Tools; Leveraging Predictive Coding (PC) for Analysis.
Analysis 360: Blurring the line between EDA and PC
Andrea Gibson, Product Director, Kroll Ontrack
March 27, 2014
Discussion Overview
• Pushing the Boundaries of Early Data Analysis (EDA)
• Examining Traditional EDA Tools
• Leveraging Predictive Coding (PC) for Analysis
• Using PC in an EDA Environment
EDA | an acronym worth defining
• Early Data Analysis (EDA) aids fact-finding and narrows the data scope by helping attorneys understand their datasets
• Triage data into critical and non-critical groupings
• Identify key players and reduce their number
• Test search terms
• Identify critical case arguments
• Categorize documents as efficiently as possible for production
• A true methodology: technology fuels human decisions
Traditional EDA | Overview
[Workflow diagram, Collection to Production: Identify, Collect & Process; Export to Review Platform; Import & Perform Early Analysis; Document Review]
• Ensure portability of groups and tags
• Ensure production/search capabilities of review platform
• Diagram labels: Processing; Analysis; Log, Route, Report, Test, QC; Filter, Search, Cluster; Search, Tag, Redact
Where does Predictive Coding fit in?
[Workflow diagram, Collection to Production: Identify, Collect & Process; Export to Review Platform; Import & Perform Analysis; Document Review, with a "Predictive Coding!" callout]
• Ensure portability of groups and tags
• Ensure production/search capabilities of review platform
• Diagram labels: Processing; Analysis; Log, Route, Report, Test, QC; Filter, Search, Cluster; Search, Tag, Redact
Traditional EDA | How efficient is it?
[Workflow diagram, Collection to Production: Identify, Collect & Process; Export to Review Platform; Import & Perform Analysis; Review, with a "Predictive Coding!" callout]
• Ensure portability of groups and tags
• Ensure production/search capabilities of review platform
The Bermuda Triangle of ediscovery
• PC is massively underused
• The tools used during analysis and review overlap substantially
• Pointless inefficiencies are created by jockeying data between two standalone platforms
• Diagram labels: Analysis; Log, Route, Report, Test, QC; Search, Tag, Redact; Filter, Search, Cluster
EDA + Review | Could it look like this?
[Workflow diagram, Collection to Production: Identify, Collect & Process; Analyze and Review]
• Diagram labels under Analyze and Review: Test, QC, Route, Report, Tag, Process, PC, Filter, Search, Cluster
Keyword Search & Concept Search
Keyword Search
• Uses search terms and Boolean operators (&, or, not) to retrieve documents that contain those exact terms
• Standard practice
• Generally accepted in the courts
• Example: “baseball & field”
Concept Search
• Technology alternative
• Allows reviewers to find documents with conceptually similar terms even if they do not contain the exact search terms
• Seldom used for filtering; increasingly used for review
• Example: “baseball” also surfaces diamond, MLB, hit, out
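The exact-term behavior of keyword search can be sketched in a few lines. This is a minimal illustration only, not any review platform's search engine: the `tokens` and `boolean_and` helpers and the three sample documents are hypothetical, and only the AND operator is modeled.

```python
import re

def tokens(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_and(docs, *terms):
    """Return ids of documents that contain every one of the exact terms."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokens(text)]

# Hypothetical sample documents (not from the presentation).
docs = {
    "d1": "The baseball field was flooded before the game.",
    "d2": "He hit a home run at the diamond; MLB scouts watched.",
    "d3": "Field notes from the site visit.",
}

# Equivalent of the slide's "baseball & field" query.
hits = boolean_and(docs, "baseball", "field")
print(hits)
```

Document `d2` is about baseball but contains neither search term, so an exact-term query misses it. That gap is what concept search is meant to close.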
Topic Grouping & Language Identification
Topic Grouping
• Documents automatically grouped by theme without human input
Language Identification
• Identify all languages in a document
• Used to group and sort documents for review by multilingual reviewers
• Example labels shown on slide: Contract; 非披露協議 (Chinese: non-disclosure agreement); كتيب الموظف (Arabic: employee handbook); Finance
Email Threading & Near Deduplication
Near Deduplication
• Reviewers can quickly identify and compare documents that are very similar to one another but are not exact duplicates
Email Threading
• Identifies and groups e-mail conversations based on content
• [Diagram: Start-Point Email; RE:; FWD:; End-Point Email]
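One common way to score how similar two documents are, shown here only as a sketch, is Jaccard similarity over overlapping word sequences ("shingles"). The slides do not say which technique any particular product uses; the function names, the shingle size `k=3`, and the sample sentences are assumptions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences that appear in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

# Hypothetical near-duplicates: only the last word differs.
draft_1 = "please review the attached draft agreement before friday"
draft_2 = "please review the attached draft agreement before monday"

similarity = jaccard(draft_1, draft_2)
print(round(similarity, 3))
```

A score of 1.0 means the two texts share every shingle; a reviewer-facing tool would typically group documents whose score exceeds a chosen threshold, which is a tuning decision and not something the slides specify.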
Finding a Common Thread
• At their cores, these tools help attorneys learn more about their data
• Does PC fit the bill?
• [Diagram: "Analytical Tools" shown with Keyword Search, Concept Search, Topic Grouping, Language ID, Email Threading, Dedupe, Predictive Coding]
Predictive Coding For Analysis
• PC has been praised for its ability to reduce the number of documents manually reviewed during first pass
• But at least three critical components of PC empower attorneys with unrivaled knowledge about their case:
• Prioritization
• Categorization
• Active Learning
The Prioritization Component
• Learns from reviewer decisions and escalates documents based on a binary classification: responsive or non-responsive
• Works after only a modest amount of training
• Increases the ratio of responsive documents that get routed to reviewers
• [Figure labels: 480,000; 74,000; Responsive; Non-responsive]
The Prioritization Component
• How does this help attorneys analyze their case?
• When attorneys ‘check out’ documents to review, they see the documents most likely to be responsive
• For the same reasons this speeds up production, attorneys who put eyes on these richly relevant documents will know more about their case earlier, driving arguments and filling knowledge gaps
• It runs in the background, so you don’t need to carve into billable hours to test keywords
• [Diagram: Request batch; Entire Corpus]
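The ‘check out’ step can be pictured as a ranking problem. The sketch below assumes the classifier exposes a responsiveness score between 0 and 1 for each document; the `scores` mapping, the `next_batch` helper, and the document ids are hypothetical and do not describe any vendor's implementation.

```python
def next_batch(scores, reviewed, batch_size):
    """Return the unreviewed documents most likely to be responsive."""
    pending = [(score, doc_id) for doc_id, score in scores.items()
               if doc_id not in reviewed]
    # Highest predicted responsiveness first.
    pending.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in pending[:batch_size]]

# Hypothetical scores produced by an earlier training round.
scores = {"a": 0.20, "b": 0.90, "c": 0.60, "d": 0.75}
already_reviewed = {"b"}

batch = next_batch(scores, already_reviewed, batch_size=2)
print(batch)
```

Because the batch is drawn from the top of the ranking, each reviewer sees a higher share of responsive documents than a random draw from the entire corpus would give.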
The Categorization Component
• Learns from trainer decisions and suggests coding on multiple categories for an entire collection of documents
• Assigns a predicted responsiveness score
• Improves speed and quality of categorization decisions
• [Figure: Responsive, 89% predicted; Non-responsive, 75% predicted; Privileged, 67% predicted]
The Categorization Component
• How does this help attorneys analyze their case?
• Allows attorneys to segregate data at user-defined predicted responsiveness ratings after modest training
• Empowers attorneys to route certain categories of documents (e.g. “hot” docs) to certain sub-groups within the team
• [Figure: Post Round One Categorization Results (65% cutoff); axis: % likelihood to be responsive, 0% to 100%; labels: 9,522 docs; 1,427 docs; mock message "To: Brief-writer Bryan. Re: Good Luck on the first draft!"]
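Segregating data at a user-defined rating amounts to a threshold split. The sketch below uses the slide's 65% cutoff as the default; the `split_by_cutoff` helper, the scores, and the document ids are hypothetical, and the document counts from the slide are not reproduced.

```python
def split_by_cutoff(scores, cutoff=0.65):
    """Split document ids into (at or above cutoff, below cutoff)."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be between 0 and 1")
    at_or_above = sorted(d for d, s in scores.items() if s >= cutoff)
    below = sorted(d for d, s in scores.items() if s < cutoff)
    return at_or_above, below

# Hypothetical predicted responsiveness scores after round one.
round_one = {"memo-1": 0.91, "memo-2": 0.65, "memo-3": 0.40}

likely_responsive, remainder = split_by_cutoff(round_one)
print(likely_responsive, remainder)
```

The first group could then be routed to a sub-team such as the brief writer, while the remainder stays in the general review queue.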
The Active Learning Component
• Key component of any true PC solution
• Automatically escalates focus documents for training (as opposed to just handpicked or just randomly selected training documents)
• Focus documents:
• Come from grey areas in the classifier, where the machine is currently uncertain whether they are responsive or non-responsive
• Are ideal candidates to improve machine learning
• Are not random, but queried
• [Figure: scale of predicted responsiveness from 0% (non-responsive) to 100% (responsive), in 10% steps centered on 50%]
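Querying documents from the grey area is often described as uncertainty sampling: pick the documents whose predicted score sits closest to 50%. The sketch below is a generic illustration of that idea, not a description of any specific PC product; `focus_documents`, the scores, and the ids are hypothetical.

```python
def focus_documents(scores, n):
    """Return the n documents whose scores are closest to 0.5 (most uncertain)."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))[:n]

# Hypothetical predicted responsiveness scores.
scores = {"a": 0.05, "b": 0.48, "c": 0.95, "d": 0.60, "e": 0.30}

queried = focus_documents(scores, n=2)
print(queried)
```

Documents `a` and `c` are never selected here because the machine is already confident about them; trainer time goes to the documents whose labels would teach the classifier the most.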
The Active Learning Component
• How does this help attorneys analyze their case?
• Introduces attorneys to the documents on the fringe of relevancy
• These could be case-changing documents that the machine just doesn’t know enough about yet
• Most effective way to boost metrics and improve results between early training rounds
• Reduces false positives; improves accuracy of the machine’s concept of relevancy
• [Figure: recall and precision at TR1 and TR2]
Additional Efficiencies
Production
• Can easily transition into production, whether leveraging PC or not
• Most practical form of PC for EDA
Reporting
• Even if just one or two training rounds are performed, metrics will show where you stand
• In this vein, no other EDA tool comes close to PC’s automatic reporting
• There’s a reason courts often ask for recall and precision: these indicate whether your understanding of the data set is accurate
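For reference, recall and precision have standard definitions: precision is the share of documents predicted responsive that truly are, and recall is the share of truly responsive documents that the prediction found. The sketch below computes both from sets of document ids; the helper name and the sample sets are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for sets of predicted and truly responsive ids."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical validation sample: machine predictions vs. attorney coding.
predicted_responsive = {1, 2, 3, 4}
truly_responsive = {2, 3, 5}

precision, recall = precision_recall(predicted_responsive, truly_responsive)
print(precision, round(recall, 3))
```

Here two of the four predicted documents are truly responsive (precision 0.5), and two of the three truly responsive documents were found (recall about 0.667).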
Additional Efficiencies
Other ECA tools complement predictive coding
• Predictive coding requires reviewing a few thousand documents in training
• Most PC solutions also come equipped with all other available EDA tools
• These tools help you navigate documents during training as well as during review
Intra-team quality control
• Can compare reviewer-machine agreement rates side-by-side
• Identify points of disagreement and inconsistency
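A reviewer-machine agreement rate can be sketched as the fraction of commonly coded documents on which the reviewer's call matches the machine's suggestion. The slides do not define the metric precisely, so this is one plausible reading; the function names, labels, and sample data are hypothetical.

```python
def agreement_rate(reviewer_calls, machine_calls):
    """Fraction of shared documents where reviewer and machine agree, or None."""
    shared = reviewer_calls.keys() & machine_calls.keys()
    if not shared:
        return None
    agreed = sum(reviewer_calls[d] == machine_calls[d] for d in shared)
    return agreed / len(shared)

def disagreements(reviewer_calls, machine_calls):
    """Sorted ids of shared documents where the two calls differ."""
    shared = reviewer_calls.keys() & machine_calls.keys()
    return sorted(d for d in shared if reviewer_calls[d] != machine_calls[d])

# Hypothetical coding decisions ("R" responsive, "N" non-responsive).
machine = {"d1": "R", "d2": "N", "d3": "R", "d4": "N"}
reviewer_a = {"d1": "R", "d2": "N", "d3": "N", "d4": "N"}

rate = agreement_rate(reviewer_a, machine)
print(rate, disagreements(reviewer_a, machine))
```

Computing the same rate for each reviewer gives the side-by-side comparison, and the disagreement list points to the documents worth a second look.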
Additional Efficiencies
The small case conundrum
• The analytical value of PC is greater where the subject-matter expert who trains the system is also the attorney who forms case strategy
• This is most likely true in small to medium-sized cases, where one attorney may be in charge of a case through trial
• The production value of using PC to aid review is greater where high upfront costs can be recouped by applying the machine’s logic to a large number of documents
• Traditionally, this has been true only in large cases
Additional Efficiencies
This is all changing
• The “portfolio approach” to ediscovery
• Pay yearly for PC (and everything that preceded it) in all your cases for a data hosting fee (process on the vendor’s side)
• Upload on day one, train on day one, see a list of documents ranked by relevancy on day one
Overview
It’s not that crazy
• EDA tools let you learn more about your data, and so does PC
• Many of the tools discussed today (e.g. de-duplication, concept searching) already exist in standalone “PC solutions”
• Aggressive culling via keywords can have an impact on training in PC
• Any search strategy must be well designed according to the matter at hand
• The producing party has substantial deference in conducting its search
Pre-PC Keyword Cull?
In re Biomet
• Defendant’s search strategy: keyword culling reduced 19.5 million documents to 3 million documents, and PC was then applied to that set before production
• Plaintiffs argued: the defendant should have used PC on the whole 19.5 million document corpus; the keywords tainted the training. We want joint review of training docs.
• Court held: defendant’s search was reasonable
Parting Thoughts
• There are many ways to learn about data
• Different tools on the same belt; multi-modal search
• Solutions are emerging that offer all of these tools in one location
• No more data jockeying
• More information for better decisions
• Quality control is essential whenever you use one of these tools to remove documents from production