Mass Declassification: What If? Jeff Jonas, IBM Distinguished Engineer, Chief Scientist, IBM Entity Analytics, JeffJonas@us.ibm.com, September 23, 2010
The Ask • What emerging technology or innovative approaches come to mind … which may have applicability to this task? • Use your imagination. What if? • Not talking about any specific products • Not focusing on the widely available COTS/GOTS technologies (OCR, document management, case management, workflow, etc.)
The Problem at Hand • Volumes may be beyond human, brute-force review (@5min/ea = 18,382 FTEs) • Necessitates some form of machine triage • Red: A disclosure risk • Yellow: A possible disclosure risk • Green: No disclosure risk • Reliable machine triage requires substantially better prediction systems • Even then, advanced means for humans to deal with the remaining large volumes of “possibles” are still required
Background • Early 80’s: Founded Systems Research & Development (SRD), a custom software consultancy • 1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA) • 2001/2003: Funded by In-Q-Tel • 2005: IBM acquires SRD • Cumulatively: I have had a hand in a number of systems with multi-billions of rows describing 100’s of millions of entities • Affiliations: • Member, Markle Foundation Task Force on National Security in the Information Age • Senior Associate, Center for Strategic and International Studies (CSIS) • Distinguished Research Faculty (adjunct), Singapore Management University, School of Information Systems • Member, EPIC advisory board • Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body
In Today’s Session • Intro to context accumulating systems • Predictions and data points needed for mass declassification • Strawman architecture • Challenges • Q&A
Contextualization From Pixels to Pictures to Insight Relevance Observations Consumer (An analyst, a system, the sensor itself, etc.) Context
Context, definition of: Better understanding something by taking into account the things around it.
Without Context • scrila34@msn.com
Consequences • Algorithms flat-lining (e.g., alert queues) • Enterprise amnesia on the rise • Overwhelmed by false positives and false negatives? You have seen nothing yet • Not enough humans to fix this with brute force • Risk assessment becomes the risk
Context Accumulation • scrila34@msn.com • Job Applicant • Trusted Supplier • Known Terrorist • Stolen Identity
Puzzle Metaphor Primer • Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors • What it represents is unknown – there is no picture on hand • Is it one puzzle, 15 puzzles, or 1,500 puzzles? • Some pieces are duplicates and some are missing • Some pieces are incomplete, low quality, or have been misinterpreted • Some pieces may even be professionally fabricated lies • Until you take the pieces to the table, you don’t know what you are dealing with
How Context Accumulates • With each new observation … one of three assertions is made: 1) un-associated; 2) near like neighbors; or 3) connections • Asserted connections must favor the false negative • New observations sometimes reverse earlier assertions • Some observations produce novel discovery • As the working space expands, computational effort increases • The emerging picture helps focus collection interests • Given sufficient observations, there can come a tipping point • Thereafter, confidence improves while computational effort decreases!!!!
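The three assertions can be illustrated with a toy rule. This is only a sketch, not the method behind the slides: the split into strong and weak attributes, the field names, and the sample values are all assumptions made for illustration. Requiring a strong identifier before asserting a connection is one simple way to favor the false negative.

```python
# Assumed split of attributes; the slides do not specify one.
STRONG = {"ssn", "dl", "email"}   # hypothetical strong identifiers
WEAK = {"name", "dob", "phone"}   # hypothetical weak attributes


def assert_observation(new, known):
    """Return one of the three assertions for `new` relative to `known`."""
    # Attributes present in both records with exactly equal values.
    shared = {k for k in new.keys() & known.keys() if new[k] == known[k]}
    if shared & STRONG:
        return "connection"
    # A weak match alone never becomes a connection (favor the false negative).
    if shared & WEAK:
        return "near like neighbor"
    return "un-associated"


# Hypothetical records, for illustration only.
known = {"name": "Pat Example", "dob": "1/1/1970", "ssn": "000-00-0001"}
print(assert_observation({"name": "Pat Example", "phone": "555-0100"}, known))
print(assert_observation({"name": "P Example", "ssn": "000-00-0001"}, known))
print(assert_observation({"name": "Ann Other", "dl": "00009999"}, known))
```

A real system would also need to revisit earlier assertions as new observations arrive, which this sketch does not attempt.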
False Negatives Overstate The Universe (Chart: Unique Identities and True Population plotted against Observations)
Counting Is Difficult • Mark R Smith, (707) 433-0000, DL: 00001234 • Mark Smith, 6/12/1978, 443-43-0000 • The two records come from different files (File 1 and File 2)
The Rise and Fall of a Population (Chart: Unique Identities and True Population plotted against Observations)
Data Triangulation • New record: Mark Randy Smith, 443-43-0000, DL: 00001234 • Mark R Smith, (707) 433-0000, DL: 00001234 • Mark Smith, 6/12/1978, 443-43-0000 • The two earlier records come from File 1 and File 2
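The triangulation on this slide can be sketched in a few lines. The records are the ones shown on the slide; the field names (`ssn`, `dl`, and so on), the choice of which identifiers count as strong, and the union-find merge are assumptions for illustration, not the author's implementation. Before the new record arrives, the two files share no identifier and count as two identities; the new record shares one identifier with each and collapses them into one.

```python
# Records from the slide; field names are illustrative assumptions.
RECORD_A = {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"}
RECORD_B = {"name": "Mark Smith", "dob": "6/12/1978", "ssn": "443-43-0000"}
NEW_RECORD = {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"}

STRONG_KEYS = ("ssn", "dl")  # assumed: identifiers strong enough to connect records


def count_identities(records):
    """Count unique identities by merging records that share a strong identifier."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (key, value) -> index of the first record carrying it
    for i, rec in enumerate(records):
        for key in STRONG_KEYS:
            if key in rec:
                j = first_seen.setdefault((key, rec[key]), i)
                parent[find(i)] = find(j)  # merge the two groups
    return len({find(i) for i in range(len(records))})


before = count_identities([RECORD_A, RECORD_B])
after = count_identities([RECORD_A, RECORD_B, NEW_RECORD])
print(before, after)
```

The count falls even though an observation was added, which is the "rise and fall" shape the surrounding charts describe.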
Increasing Accuracy and Performance (Chart: Unique Identities and True Population plotted against Observations)
“Expert Counting” is Fundamental to Prediction • Is it 5 people each with 1 account … or is it 1 person with 5 accounts? • If one cannot count … one cannot estimate vector or velocity (direction and speed). • Without vector and velocity … prediction is nearly impossible. • Therefore, if you can’t count, you can’t predict.
Mass Declassification Predictions • Whose equity is it? • Machine triage – disposition • Queue prioritization
Using What Data Points? FOR EXAMPLE: • 450M target documents • Dirty words • Previous declassifications • Previous declassification denials • FOIA requests • Intellipedia • Wikipedia • WikiLeaks • Deceased persons • Publicly available accounts/facts
Open Source Discovery/Scoring • “Height of Pakistan’s Mufasa missile.” • What is 15.5 meters? • New York Times, Sept 21, 2010, C3 “Pakistan unveils Mufasa 7 Warhead” • Wikipedia: Mufasa_7_Warhead
Context Accumulation • Mufasa 7 Warhead • Open Source Reference • FOIA March 2010 • Classified – Asserted • Dirty Word
Context Accumulation + Statistics
Document Element | Total | Declass | Class-Default | Class-Asserted
Author: “Billy K” | 4503 | 1600 | 403 | 0
Codeword: “Tomatoe” | 4818 | 4600 | 218 | 0
Classification: “SI/TK/001” | 23 | 22 | 1 | 0
Actors: “Salam Ahmed” | 782 | 700 | 82 | 0
• Declassification dispositions … becoming a force multiplier. • The more human dispositions, the more automated dispositions.
Humans | Auto Triage
5,000 | 20
10,000 | 4,000
100,000 | 65,000
1,000,000 | 17,000,000
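One way per-element disposition counts like these could feed a red/yellow/green recommendation is sketched below. The rule and the 90% threshold are assumptions chosen for illustration; the slides do not say how statistics map to a triage color. The rows are copied as they appear on the slide.

```python
# (element, total, declass, class_default, class_asserted), as shown on the slide.
ROWS = [
    ("Author: Billy K", 4503, 1600, 403, 0),
    ("Codeword: Tomatoe", 4818, 4600, 218, 0),
    ("Classification: SI/TK/001", 23, 22, 1, 0),
    ("Actors: Salam Ahmed", 782, 700, 82, 0),
]

GREEN_THRESHOLD = 0.90  # assumed cut-off, not taken from the slides


def triage(total, declass, class_default, class_asserted):
    """Toy rule: any asserted classification is red; a high declass rate is green."""
    # class_default is carried along for completeness but unused by this toy rule.
    if class_asserted > 0:
        return "red"
    if total > 0 and declass / total >= GREEN_THRESHOLD:
        return "green"
    return "yellow"


for label, *counts in ROWS:
    print(f"{label}: {triage(*counts)}")
```

A document carries many elements, so a fuller sketch would combine the per-element results, for example by taking the most cautious color. Each new human disposition updates the counts, which is how more human dispositions could yield more automated ones.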
Policy Questions • What related information is already available in the public domain? • Evidence: Exists in open source • What damage might conceivably result from disclosure and what benefits might ensue? • Evidence: Same text already released (by same equity holder)
Strawman Architecture 450M Docs Predictions(*) Feature Extraction & Classification Historical Dispositions Context Accumulation DirtyWords Workflow System Dispositions Etc. (*) Recommendations: Equity of, Disposition, Priority
Another Idea: Crowd Sourcing • Can you predict specific people with privileges and knowledge … to whom selected documents can be routed for evaluation? • Can you publish machine-triage recommendations to a wiki or other form of internal broadcast for community crowd sourcing?
Another Idea: Better Classification • Using the overall declassification platform to assist in proper classification (real-time) • And, better pre-tagging to assist in future auto-declassification
Challenges • Entity extraction is imperfect • Predictions may still not be good enough, often enough • Not in English • The user work surface and its distribution • Consequences of an inappropriate release • With super access and super tools, this may call for stronger audit and insider-threat protections • Your contracting cycle and the creation of the system might take until mid-2011 or 2012 or 2013
Closing Thoughts • Contextualization is essential to better prediction • There are not enough humans to ask every question every day • “Human attention directing” systems are critical to the mission • The data must find the data, the relevance must find the user
Worst Case Scenario • Rich context enables better hints for users, results in faster dispositions • Rich context enables improved sequencing of the work
Related Blog Posts • Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems • Data Finds Data • Puzzling: How Observations Are Accumulated Into Context • The Fast Last Puzzle Piece • Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel • How to Use a Glue Gun to Catch a Liar • It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You • Smart Systems Flip-Flop
Questions? Blogging At: www.JeffJonas.TypePad.com • Information Management, Privacy, National Security and Triathlons
The Problem at Hand • 450M documents • × 5 min/document • = 2.25B minutes • / 60 = 37.5M hours • / 2,040 = 18,382 FTEs
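The arithmetic on this slide can be checked directly. The inputs are the slide's own figures; reading the divisor 2,040 as working hours per FTE is an interpretation, and the variable names are illustrative.

```python
DOCUMENTS = 450_000_000
MINUTES_PER_DOC = 5
HOURS_PER_FTE = 2040  # divisor used on the slide, read here as hours per FTE

total_minutes = DOCUMENTS * MINUTES_PER_DOC   # 2.25 billion minutes
total_hours = total_minutes / 60              # 37.5 million hours
ftes = total_hours / HOURS_PER_FTE            # roughly 18,382

print(f"{total_minutes:,} minutes, {total_hours:,.0f} hours, {ftes:,.0f} FTEs")
```

Changing `MINUTES_PER_DOC` shows how sensitive the staffing estimate is to review time per document.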