
Mass Declassification What If?

Explore innovative approaches for mass declassification using contextual accumulation systems. Learn how predictions and data points play a crucial role in handling vast document volumes efficiently.


Presentation Transcript


  1. Mass Declassification What If? • Jeff Jonas, IBM Distinguished Engineer • Chief Scientist, IBM Entity Analytics • JeffJonas@us.ibm.com • September 23, 2010

  2. The Ask • What emerging technology or innovative approaches come to mind … which may have applicability to this task? • Use your imagination. What if? • Not talking about any specific products • Not focusing on the widely available COTS/GOTS technologies (OCR, document management, case management, workflow, etc.)

  3. The Problem at Hand • Volumes may be beyond human brute-force review (@5min/ea = 18,382 FTEs) • Necessitates some form of machine triage • Red: A disclosure risk • Yellow: A possible disclosure risk • Green: No disclosure risk • Reliable machine triage requires substantially better prediction systems • Even then, advanced means for humans to deal with the remaining large volumes of “possibles” are still required

  4. Background • Early 80’s: Founded Systems Research & Development (SRD), a custom software consultancy • 1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA) • 2001/2003: Funded by In-Q-Tel • 2005: IBM acquires SRD • Cumulatively: I have had a hand in a number of systems with multi-billions of rows describing 100’s of millions of entities • Affiliations: • Member, Markle Foundation Task Force on National Security in the Information Age • Senior Associate, Center for Strategic and International Studies (CSIS) • Distinguished Research Faculty (adjunct), Singapore Management University, School of Information Systems • Member, EPIC advisory board • Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body

  5. In Today’s Session • Intro to context accumulating systems • Predictions and data points needed for mass declassification • Strawman architecture • Challenges • Q&A

  6. Context Accumulating Systems

  7. Contextualization: From Pixels to Pictures to Insight (diagram labels: Observations • Context • Relevance • Consumer: an analyst, a system, the sensor itself, etc.)

  8. Context, definition of: Better understanding something by taking into account the things around it.

  9. scrila34@msn.com Without Context

  10. Consequences • Algorithms flat-lining (e.g., alert queues) • Enterprise amnesia on the rise • Overwhelmed by false positives and false negatives? You have seen nothing yet • Not enough humans to fix this with brute force • Risk assessment becomes the risk

  11. Context Accumulation (diagram: scrila34@msn.com connected to Job Applicant • Trusted Supplier • Known Terrorist • Stolen Identity)

  12. Puzzle Metaphor Primer • Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors • What it represents is unknown – there is no picture on hand • Is it one puzzle, 15 puzzles, or 1,500 puzzles? • Some pieces are duplicates and some are missing • Some pieces are incomplete, low quality, or have been misinterpreted • Some pieces may even be professionally fabricated lies • Until you take the pieces to the table, you don’t know what you are dealing with

  13. How Context Accumulates • With each new observation … one of three assertions is made: 1) un-associated; 2) near like neighbors; or 3) connections • Asserted connections must favor the false negative • New observations sometimes reverse earlier assertions • Some observations produce novel discovery • As the working space expands, computational effort increases • The emerging picture helps focus collection interests • Given sufficient observations, there can come a tipping point • Thereafter, confidence improves while computational effort decreases!!!!

  14. False Negatives Overstate The Universe (chart labels: Unique Identities • True Population • Observations)

  15. Counting Is Difficult • Two records, held in File 1 and File 2 • Mark R Smith, (707) 433-0000, DL: 00001234 • Mark Smith, 6/12/1978, 443-43-0000

  16. The Rise and Fall of a Population (chart labels: Unique Identities • True Population • Observations)

  17. Data Triangulation • New record: Mark Randy Smith, 443-43-0000, DL: 00001234 • Earlier records (File 1 and File 2): Mark R Smith, (707) 433-0000, DL: 00001234 • Mark Smith, 6/12/1978, 443-43-0000
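The triangulation on slides 15 to 17 can be sketched in a few lines of Python. This is only an illustration, not the method used in any IBM product: it assumes that two records describe the same person when they share a strong identifier, that 443-43-0000 is a Social Security number, and that names alone never link records. The function name `count_entities` and the field names `ssn` and `dl` are hypothetical.

```python
def count_entities(records, keys=("ssn", "dl")):
    """Count distinct entities, linking records that share a strong identifier.

    Assumption for illustration: sharing any value under `keys` is enough
    to assert that two records are the same person.
    """
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen = {}  # (key, value) -> index of the first record carrying that value
    for i, rec in enumerate(records):
        for key in keys:
            value = rec.get(key)
            if value is None:
                continue
            j = seen.setdefault((key, value), i)
            parent[find(i)] = find(j)  # merge the two records' entities

    return len({find(i) for i in range(len(records))})


# Records as shown on the slides (identifier types are assumed).
records = [
    {"name": "Mark Smith", "dob": "6/12/1978", "ssn": "443-43-0000"},
    {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"},
    {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"},
]

# Entity count after each new observation arrives.
counts = [count_entities(records[:n]) for n in range(1, len(records) + 1)]
print(counts)  # [1, 2, 1]
```

The count rises to two when the second record arrives, because nothing yet connects it to the first, and falls back to one when the third record supplies both shared identifiers. That is the rise and fall of a population in miniature: the earlier false negative is reversed by a later observation.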

  18. Increasing Accuracy and Performance (chart labels: Unique Identities • True Population • Observations)

  19. “Expert Counting” is Fundamental to Prediction • Is it 5 people each with 1 account … or is it 1 person with 5 accounts? • If one cannot count … one cannot estimate vector or velocity (direction and speed). • Without vector and velocity … prediction is nearly impossible. • Therefore, if you can’t count, you can’t predict.

  20. Mass Declassification Predictions

  21. Mass Declassification Predictions • Whose equity is it? • Machine triage – disposition • Queue prioritization

  22. Using What Data Points? FOR EXAMPLE: • 450M target documents • Dirty words • Previous declassifications • Previous declassification denials • FOIAs • Intellipedia • Wikipedia • WikiLeaks • Deceased persons • Publicly available accounts/facts

  23. Open Source Discovery/Scoring • “Height of Pakistan’s Mufasa missile.” • What is 15.5 meters? • New York Times, Sept 21, 2010, C3 “Pakistan unveils Mufasa 7 Warhead” • Wikipedia: Mufasa_7_Warhead

  24. Context Accumulation (diagram: Mufasa 7 Warhead connected to Open Source Reference • FOIA March 2010 • Classified – Asserted • Dirty Word)

  25. Context Accumulation + Statistics
      Document Element              | Total | Declass | Class-Default | Class-Asserted
      Author: “Billy K”             | 4503  | 1600    | 403           | 0
      Codeword: “Tomatoe”           | 4818  | 4600    | 218           | 0
      Classification: “SI/TK/001”   | 23    | 22      | 1             | 0
      Actors: “Salam Ahmed”         | 782   | 700     | 82            | 0
      Declassification dispositions … becoming a force multiplier. The more human dispositions, the more automated dispositions.
      Humans    | Auto Triage
      5,000     | 20
      10,000    | 4,000
      100,000   | 65,000
      1,000,000 | 17,000,000
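One way the per-element statistics on slide 25 might feed the Red/Yellow/Green triage from slide 3 is sketched below. The counts are copied from the slide as printed; everything else is an assumption made for illustration, including the `triage` function, the rule that any asserted classification forces Red, the rule that unseen elements fall to Yellow, and the 90% Green threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementStats:
    """Historical human dispositions for one document element."""
    total: int
    declass: int
    class_default: int
    class_asserted: int

    @property
    def declass_rate(self) -> float:
        return self.declass / self.total if self.total else 0.0


# Counts as printed on the slide.
HISTORY = {
    ("author", "Billy K"): ElementStats(4503, 1600, 403, 0),
    ("codeword", "Tomatoe"): ElementStats(4818, 4600, 218, 0),
    ("classification", "SI/TK/001"): ElementStats(23, 22, 1, 0),
    ("actor", "Salam Ahmed"): ElementStats(782, 700, 82, 0),
}

GREEN_THRESHOLD = 0.90  # hypothetical cut-off, not from the presentation


def triage(elements, history=HISTORY, green_threshold=GREEN_THRESHOLD):
    """Return "Red", "Yellow" or "Green" for a document's extracted elements."""
    stats = [history.get(element) for element in elements]
    # Any element a human has explicitly asserted as classified: disclosure risk.
    if any(s is not None and s.class_asserted > 0 for s in stats):
        return "Red"
    # No elements, or an element with no history: leave it to a human reviewer.
    if not stats or any(s is None for s in stats):
        return "Yellow"
    # Green only if every element has been declassified often enough before.
    if min(s.declass_rate for s in stats) >= green_threshold:
        return "Green"
    return "Yellow"


print(triage([("codeword", "Tomatoe"), ("classification", "SI/TK/001")]))
print(triage([("author", "Billy K"), ("codeword", "Tomatoe")]))
```

Under these assumed rules the first document comes out Green and the second Yellow, because the author's historical declassification rate is well below the threshold. Every new human disposition updates the counts, which is how more human dispositions could produce more automated ones.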

  26. Policy Questions • What related information is already available in the public domain? • Evidence: Exists in open source • What damage might conceivably result from disclosure, and what benefits might ensue? • Evidence: Same text already released (by same equity holder)

  27. Strawman Architecture

  28. Strawman Architecture (diagram components) • 450M Docs • Predictions(*) • Feature Extraction & Classification • Historical Dispositions • Context Accumulation • Dirty Words • Workflow System • Dispositions • Etc. • (*) Recommendations: Equity of, Disposition, Priority

  29. Another Idea: Crowd Sourcing • Can you predict specific people with privileges and knowledge … to whom selected documents can be routed for evaluation? • Can you publish machine-triage recommendations to a wiki or other form of internal broadcast for community crowd sourcing?

  30. Another Idea: Better Classification • Using the overall declassification platform to assist in proper classification (real-time) • And, better pre-tagging to assist in future auto-declassification

  31. Challenges

  32. Challenges • Entity extraction is imperfect • Predictions may still not be good enough, often enough • Not in English • The user work surface and its distribution • Consequences of an inappropriate release • With super access and super tools, this may call for stronger audit and insider-threat protections • Your contracting cycle and the creation of the system might take until mid-2011 or 2012 or 2013

  33. Closing Thoughts

  34. Closing Thoughts • Contextualization is essential to better prediction • There are not enough humans to ask every question every day • “Human attention directing” systems are critical to the mission • The data must find the data, the relevance must find the user

  35. Worst Case Scenario • Rich context enables better hints for users, results in faster dispositions • Rich context enables improved sequencing of the work

  36. Related Blog Posts • Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems • Data Finds Data • Puzzling: How Observations Are Accumulated Into Context • The Fast Last Puzzle Piece • Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel • How to Use a Glue Gun to Catch a Liar • It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You • Smart Systems Flip-Flop

  37. Questions? • Blogging at: www.JeffJonas.TypePad.com • Information Management, Privacy, National Security and Triathlons

  38. Mass Declassification What If? • Jeff Jonas, IBM Distinguished Engineer • Chief Scientist, IBM Entity Analytics • JeffJonas@us.ibm.com • September 23, 2010

  39. The Problem at Hand • 450M documents • × 5 min/document • = 2.25B minutes • / 60 = 37.5M hours • / 2,040 = 18,382 FTEs
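The arithmetic on this slide can be reproduced with a short Python check. The document count, minutes per document, and the divisor of 2,040 come from the slide; treating 2,040 as working hours per FTE is an interpretation the slide implies but does not state, and the function name `review_effort` is hypothetical.

```python
DOCUMENTS = 450_000_000      # 450M target documents (from the slide)
MINUTES_PER_DOCUMENT = 5     # assumed review time per document (from the slide)
HOURS_PER_FTE = 2_040        # slide's divisor, read here as working hours per FTE


def review_effort(documents, minutes_each, hours_per_fte):
    """Return (total minutes, total hours, FTEs) for brute-force human review."""
    total_minutes = documents * minutes_each
    total_hours = total_minutes / 60
    ftes = total_hours / hours_per_fte
    return total_minutes, total_hours, ftes


total_minutes, total_hours, ftes = review_effort(
    DOCUMENTS, MINUTES_PER_DOCUMENT, HOURS_PER_FTE
)

print(f"{total_minutes:,} minutes")  # 2,250,000,000 minutes
print(f"{total_hours:,.0f} hours")   # 37,500,000 hours
print(f"{int(ftes):,} FTEs")         # 18,382 FTEs (fraction truncated, as on the slide)
```

Changing `MINUTES_PER_DOCUMENT` shows how sensitive the estimate is: every additional minute of review per document adds roughly 3,700 FTEs under the same assumptions.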
