
How Can Lawyers Benefit From Using Visual Analytics in E-Discovery?

VASS 2013: 3rd UKVAC International Visual Analytics Summer School, Middlesex University, London, U.K., July 18, 2013. Jason R. Baron, Director of Litigation, Office of General Counsel, U.S. National Archives and Records Administration.


Presentation Transcript


  1. How Can Lawyers Benefit From Using Visual Analytics in E-Discovery?

    VASS 2013: 3rd UKVAC International Visual Analytics Summer School, Middlesex University, London, U.K., July 18, 2013. Jason R. Baron, Director of Litigation, Office of General Counsel, U.S. National Archives and Records Administration, Washington, D.C.
  2. Overview: What is e-discovery and e-disclosure in the law? The "As Is" Model of performing "searches". Case example: U.S. v. Philip Morris. Moving Beyond Boolean: the emergence of clustering algorithms and supervised learning alternatives to keyword searching. The Promise of Visual Analytics. Strategic challenges.
  3. Searching the Haystack….
  4. to find relevant needles…
  5. ends up like searching in a maze…
  6. Email is still the 800 lb. gorilla of e-discovery (whether in the clouds or not)
  7. Reality: The era of Big Data in Litigation has just begun….

    Lehman Brothers Investigation -- 350 billion page universe (3 petabytes) -- Examiner narrowed collection by selecting key custodians, using dozens of Boolean searches -- Reviewed 5 million docs (40 million pages using 70 contract attorneys) Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11 Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5, at http://lehmanreport.jenner.com/.
  8. U.S. v. Philip Morris et al. Civil lawsuit brought by Clinton Administration against tobacco companies in 1999 Racketeering allegation that companies have conspired since 1953 to defraud the American public as to the true health effects of smoking 1,726 Requests to Produce from tobacco companies for tobacco-related records (including email) from 30 federal agencies 32 million Clinton-era email records held by National Archives
  9. Case Study: U.S. v. Philip Morris (cont'd) – Employing a limited feedback loop. Original set of 12 keywords searched unilaterally. After informal negotiations, additional terms explored. Sampling against the database to find "noisy" terms generating too many false positives (Marlboro, PMI, TI, etc.). Report back and consensus on what additional terms would be included in the search protocol.
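The sampling step can be sketched in a few lines. This is a hypothetical illustration, not the protocol actually used in the case: `hits` stands for the documents matching a candidate term, and `label_fn` for a reviewer's relevance call on a single document.

```python
import random

def estimate_precision(hits, label_fn, sample_size=100, seed=1):
    """Estimate a search term's precision by labelling a random sample
    of the documents it retrieves."""
    random.seed(seed)
    sample = random.sample(hits, min(sample_size, len(hits)))
    relevant = sum(1 for doc in sample if label_fn(doc))
    return relevant / len(sample)

# A term like "marlboro" that mostly pulls in "Upper Marlboro, Maryland"
# email will score low and becomes a candidate for qualification,
# e.g. (marlboro AND NOT upper marlboro).
```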
  10. Query Terms, Rounds 1 and 2: tobacco, cigarette, smoking, <tar>, nicotine, smokeless, Synar Amendment, Philip Morris, R.J. Reynolds, BAT Industries, Liggett Group, Brown and Williamson, Liggett, PMI (Philip Morris Institute), MSA (Master Settlement Agreement), ETS (Environmental Tobacco Smoke), B&W (Brown & Williamson), TI (Tobacco Institute), …
  11. [Slide diagram: relevant smoking-policy emails vs. false positives across offices, including White House Counsel, VP Chief of Staff Ron Klain, OMB, and the Office of the U.S. Trade Rep.]
  12. Suppressing False Positives: Marlboro vs. Upper Marlboro, Maryland; Philip Morris Institute vs. Presidential Management Intern (PMI) program; Master Settlement Agreement vs. Medical Savings Accounts and Metropolitan Standard Areas; Tobacco Institute (TI) vs. do re mi….
  13. Example of Boolean search string from U.S. v. Philip Morris (((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR ("brown and williamson") OR ("brown & williamson") OR bat industries OR liggett group)
  14. National Archives search request, Clinton White House, Tobacco Policy: 20 million emails; 200,000 docs with "hits"; 100,000 relevant docs, with 20,000 privileged; hired 25 persons for 6 months.
  15. A Hypothetical: 1 billion emails, 25% with attachments, reviewed at 50 per hour. Would take 100 people, 10 hrs per day, 7 days a week, 52 weeks a year …. 54 YEARS TO COMPLETE. At $100/hr, $2 billion in cost. Even 1% (10 million docs) … 28 weeks and $20 million in cost …..
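The slide's arithmetic is easy to reproduce; the following back-of-the-envelope check is added here for illustration and is not part of the original deck.

```python
DOCS          = 1_000_000_000   # 1 billion emails
RATE          = 50              # documents reviewed per person-hour
REVIEWERS     = 100
HOURS_PER_DAY = 10
DAYS_PER_YEAR = 7 * 52          # 7 days a week, 52 weeks a year
COST_PER_HOUR = 100             # dollars

person_hours = DOCS / RATE                                   # 20,000,000 hours
years = person_hours / (REVIEWERS * HOURS_PER_DAY * DAYS_PER_YEAR)
cost = person_hours * COST_PER_HOUR

print(f"{years:.1f} years, ${cost:,.0f}")                    # 54.9 years, $2,000,000,000
```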
  16. Judge Grimm writing for the U.S. District Court for the District of Maryland “[W]hile it is universally acknowledged that keyword searches are useful tools for search and retrieval of ESI, all keyword searches are not created equal; and there is a growing body of literature that highlights the risks associated with conducting an unreliable or inadequate keyword search or relying on such searches for privilege review.” Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008); see id., text accompanying nn. 9 & 10 (citing to Sedona Search Commentary & TREC Legal Track research project)
  17. Judge Facciola writing for the U.S. District Court for the District of Columbia "Whether search terms or 'keywords' will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics. See George L. Paul & Jason R. Baron, 'Information Inflation: Can the Legal System Adapt?', 13 RICH. J.L. & TECH. 10 (2007) * * * Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread." -- U.S. v. O'Keefe, 537 F. Supp. 2d 14, 24 (D.D.C. 2008).
  18. Boolean Retrieval

    The Model: documents are represented by descriptors; descriptors were originally manually assigned concepts from a controlled vocabulary, while modern implementations generally use the words in the text as descriptors. The information need is represented by descriptors structured with Boolean operators; modern implementations include more operators than just AND, OR, NOT. A match occurs if and only if the document satisfies the Boolean expression; "fuzzy match" systems use descriptor weights and relax the strict binary interpretation. Pros and cons: good, transparency (it is clear exactly why a document was retrieved); bad, little control over retrieved-set size, no ranking, and searchers must learn the query language. From Ellen Voorhees, Georgetown 2009.
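As a minimal sketch of the model (not drawn from the slides), Boolean retrieval over an inverted index reduces to set operations on posting lists; the three-document "corpus" below is made up.

```python
from collections import defaultdict

# Toy collection standing in for an email corpus (hypothetical text).
docs = {
    1: "philip morris master settlement agreement",
    2: "upper marlboro maryland office picnic",
    3: "marlboro advertising budget memo",
}

# Inverted index: descriptor (here, a word) -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Boolean operators become set operations on posting lists, so it is always
# transparent why a document was (or was not) retrieved.
print(index["marlboro"] - index["upper"])          # marlboro AND NOT upper -> {3}
print(index["marlboro"] | index["settlement"])     # marlboro OR settlement -> {1, 2, 3}
```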
  19. Vector Space Model

    The Model: documents are represented as vectors in N-dimensional space, where N is the number of 'terms' in the document set; a term is usually a word (stem), but might be a phrase or thesaurus class, and terms are weighted based on the frequency and distribution of their occurrences. The information need is natural-language text mapped into the same space. Matching is the similarity between the query and document vectors (an example similarity is the cosine of the angle between the vectors), which allows documents to be ranked by decreasing similarity. Pros and cons: good, less brittle than pure Boolean; bad, less transparency, since depending on the weights a document with few query terms can be ranked higher than a document with many. From Ellen Voorhees, Georgetown 2009.
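A toy illustration of vector-space ranking, using raw term frequencies rather than full tf-idf weighting and a few invented snippets; it is a sketch of the idea, not production code.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "settlement agreement with philip morris",
    "upper marlboro maryland picnic",
    "marlboro advertising and the master settlement agreement",
]
query = tf_vector("master settlement agreement")

# Rank documents by decreasing cosine similarity to the query vector.
for d in sorted(docs, key=lambda d: cosine(query, tf_vector(d)), reverse=True):
    print(f"{cosine(query, tf_vector(d)):.2f}  {d}")
```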
  20. Vector Similarities

    Document-document similarity: documents are similar to the extent they contain the same terms; document pairs with maximal similarity detect duplicates. Document clustering rests on the cluster hypothesis: "Closely associated documents tend to be relevant to the same requests." Thus retrieval can be done by returning whole clusters, since there is usually much more information in a doc-doc comparison than in a doc-query comparison. Term-term similarity: terms are similar to the extent they occur in the same documents; term clustering supports query expansion and provides a bottom-up description of the document set. [Slide also shows an example term-document matrix with terms T1, T2, T3, T4, … and documents D1 through D6 ….] From Ellen Voorhees, Georgetown 2009.
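Document-document similarity is what near-duplicate detection and clustering build on. A small sketch, assuming scikit-learn is available and using invented snippets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "master settlement agreement draft",
    "master settlement agreement draft v2",   # near-duplicate of the first
    "upper marlboro maryland office picnic",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)        # document-document similarity matrix

# Pairs with very high similarity are likely duplicates or near-duplicates.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] > 0.8:
            print(f"possible duplicate: doc {i} and doc {j} ({sim[i, j]:.2f})")
```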
  21. Emerging New Strategies: "Predictive Analytics". Improved review and case assessment: cluster docs through use of software, with minimal human intervention at the front end to code a "seeded" data set. Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.
  22. Judicial endorsement of predictive analytics in document review by Judge Peck in Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb. 24, 2012)

    This opinion appears to be the first in which a Court has approved of the use of computer-assisted review. . . . What the Bar should take away from this Opinion is that computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review. Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review . . . Computer-assisted review can now be considered judicially-approved for use in appropriate cases.
  23. The da Silva Moore Protocol: supervised learning; random sampling; establishment of a seed set; issue tags; iteration; random sampling of docs deemed irrelevant
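A hypothetical sketch of one iteration of such a supervised-learning (predictive coding) loop, assuming scikit-learn; the function name, batch size, and data structures are illustrative, not the protocol's actual specification.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def next_review_batch(all_texts, labeled_idx, labels, batch_size=500):
    """One round of a supervised-learning review loop: train on the documents
    coded so far (the seed set plus earlier rounds), score the rest, and
    return the indices reviewers should code next, highest-scoring first.
    The coded set must contain both relevant and not-relevant examples."""
    X = TfidfVectorizer().fit_transform(all_texts)
    model = LogisticRegression(max_iter=1000).fit(X[labeled_idx], labels)

    labeled = set(labeled_idx)
    unlabeled = [i for i in range(len(all_texts)) if i not in labeled]
    scores = model.predict_proba(X[unlabeled])[:, 1]      # P(relevant)
    ranked = np.argsort(scores)[::-1][:batch_size]
    return [unlabeled[i] for i in ranked]
```

In the protocol, rounds like this repeat, and a random sample of the documents the model deems irrelevant is itself reviewed to check how many relevant documents are being left behind.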
  24. The e-discovery world of tomorrow: optimizing visual analytics….

  25. Visual Analysis Examples (Presentation by Dr. Victoria Lemieux, Univ. of British Columbia, at the Society of American Archivists Annual Mtg. 2010, Washington, D.C.) With acknowledgments to Jeffrey Heer, Exploring Enron, http://hci.stanford.edu/jheer/projects/enron/, Adam Perer, Contrasting Portraits, http://hcil.cs.umd.edu/trs/2006-08/2006-08.pdf, and Fernanda Viegas, Email Conversations, http://fernandaviegas.com/email.html
  26. Showing Data changes over time

    Source: Pacific Northwest National Laboratory.
  27. Another ‘River’ chart The colors represent word frequency. Susan Havre. Pacific Northwest National Laboratory.
  28. Bar charts Jeffrey Heer: researcher at Berkeley. The Enron Emails. http://hci.stanford.edu/jheer/projects/enron/v2/ Documents can be viewed as a function of time, and then organized by color based on author, topic, keywords, etc.
  29. Where keywords can't go: networks Source: Carsten Görg and John Stasko, Georgia Institute of Technology. http://eprints.ucl.ac.uk/9136/1/9136.pdf
  30. Social Graphs Have the Potential to Provide Greater Insight into Corporate Relationships: Who is Talking To Whom?
  31. Important Individuals To Zoom in on For Evidentiary Purposes
  32. Network maps cont’d
  33. Links of near infinite complexity
  34. More links examples
  35. Concentrating on message stream between specific individuals
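As a sketch of how such a "who talks to whom" graph might be built, assuming the networkx library and made-up sender/recipient pairs pulled from email headers:

```python
import networkx as nx

# Hypothetical (sender, recipient) pairs extracted from email headers.
messages = [
    ("alice@corp.com", "bob@corp.com"),
    ("alice@corp.com", "bob@corp.com"),
    ("bob@corp.com", "carol@corp.com"),
    ("carol@corp.com", "alice@corp.com"),
]

G = nx.DiGraph()
for sender, recipient in messages:
    if G.has_edge(sender, recipient):
        G[sender][recipient]["weight"] += 1   # message volume between the pair
    else:
        G.add_edge(sender, recipient, weight=1)

# Centrality scores suggest which custodians to zoom in on first.
print(nx.degree_centrality(G))
```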
  36. Wikipedia and philosophy connections: Simon Raper, UK Statistician and Bayesian modeling/VA hobbyist. http://drunks-and-lampposts.com/2012/06/13/graphing-the-history-of-philosophy/
  37. Treemap distribution in data repository: file sizes and distribution in directories Credit: Maria Esteva et al., Texas Adv. Computing Center
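The (directory, total size) pairs that feed such a treemap can be gathered with a short walk over the file system; a sketch, with the root path purely hypothetical:

```python
import os
from collections import defaultdict

def directory_sizes(root):
    """Aggregate total file size per directory under `root` -- the kind of
    (path, size) data a treemap visualization is drawn from."""
    sizes = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes[dirpath] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass   # unreadable or vanished file; skip it
    return sizes

# for path, size in sorted(directory_sizes("/data/repository").items(),
#                          key=lambda kv: kv[1], reverse=True)[:10]:
#     print(f"{size:>12,}  {path}")
```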
  38. Where are electronic messages coming from inside a corporation?
  39. Types of documents found in given repository
  40. Hypothetical: noticing spikes in communications between 2 units in a corporation
  41. Spikes in conversations about specific topics of evidentiary interest
  42. Hypothetical: Conversations on Email Per Hour on Topic X
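A crude sketch of how such hourly spikes might be surfaced programmatically, assuming a list of datetime stamps for messages that matched topic X (a hypothetical input):

```python
from collections import Counter

def hourly_spikes(timestamps, factor=3.0):
    """Flag hours whose message count exceeds `factor` times the mean hourly
    count -- a rough way to surface bursts of traffic about a topic.
    `timestamps` are datetime objects for emails whose text matched topic X."""
    counts = Counter(ts.replace(minute=0, second=0, microsecond=0)
                     for ts in timestamps)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(hour for hour, n in counts.items() if n > factor * mean)

# The first element of the returned list is a candidate answer to
# "when was the first spike in communications on topic X?"
```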
  43. Challenges for VA in E-Discovery/E-Disclosure Cases

    Ease of use; analytical reasoning techniques; data input; visual representation and interpretation; scalability; tech-savviness of lawyers and judges (or lack thereof); defensibility in the context of an adversary system
  44. Question A research topic of interest: how can use of Visual Analytics change the kind of questions asked in e-discovery cases? Example: From "Produce all documents on Topic X…" TO interrogatories asking: "What was the first time that there was a spike in communications on topic X…?" "How many messages were recorded and between whom?" "Who were the most senior individuals who knew on date A that topic X was being discussed?" "How did the knowledge of Topic X spread through the institution?"
  45. Strategic Challenges

    Convincing the legal community that advanced forms of automated search, including visual analytics, are not just desirable but necessary in response to the large volumes of documents to be searched.
  46. Challenges (cont.)

    Designing an overall review process which maximizes the potential to find responsive documents in a large data collection (no matter which search tool is used), and using sampling and other visual analytic techniques to test hypotheses early on.
  47. Challenges (cont.)

    Having parties understand that the use of visual analytics does not guarantee all responsive documents will be identified in a large data collection.
  48. Challenges (cont.)

    Being open to using new and evolving search and information retrieval methods and tools, including advances in visual analytics based on increasingly powerful data mining and AI techniques.
  49. Problem: Innovative Thinking
  50. References

    Topic Classification and Visualization Articles:
    V. Lemieux and J. Baron, "Overcoming the Digital Tsunami in E-Discovery: Is Visual Analysis the Answer?," Canadian J. of Law & Tech. (Spring 2011), http://ediscovery.umiacs.umd.edu/fall11rg/LemieuxAndBaron2011.pdf
    D. Hillard et al., "Computer-Assisted Topic Classification for Mixed-Methods Social Science Research," J. of Info. Tech. 4:4 (2007), available at http://jitp.haworthpress.com
    C. Görg & J. Stasko, "Jigsaw: Investigative Analysis on Text Document Collections through Visualization," available at http://eprints.ucl.ac.uk/9136/1/9136.pdf
    Background Law Review Referencing Autocategorization & Advanced Search:
    J. Baron, "Law in the Age of Exabytes: Some Further Thoughts on 'Information Inflation' and Current Issues in E-Discovery Search," 17 Richmond J. Law & Technology (2011), see http://law.richmond.edu
    "Predictive Coding" Report:
    N. Pace & L. Zakaras, "Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery," RAND Report (2012), available at http://www.rand.org/pubs/monographs/MG1208.html
  51. Jason R. Baron Director of Litigation Office of General Counsel National Archives and Records Administration 8601 Adelphi Road # 3110 College Park, MD 20740 (301) 837-1499 Email: jason.baron@nara.gov