
Protecting Information in the NIEM Lifecycle Using Synthetic Data

Protecting Information in the NIEM Lifecycle Using Synthetic Data. 15 December 2009. Nykia Jackson, Barbara Shapter, Kim Sterret-Day. Johns Hopkins University Applied Physics Laboratory. Barbara.shapter@jhuapl.edu. Introduction: JHU/APL for DHS S&T CCI in support of DHS EDMO.


Presentation Transcript


  1. Protecting Information in the NIEM Lifecycle Using Synthetic Data

     15 December 2009. Nykia Jackson, Barbara Shapter, Kim Sterret-Day. Johns Hopkins University Applied Physics Laboratory. Barbara.shapter@jhuapl.edu
  2. Introduction
      - JHU/APL for DHS S&T CCI in support of DHS EDMO
      - Objective: to help EDMO and the NIEM community
        - By learning of needs and gaps
        - By exploring technologies that fill critical gaps in the NIEM lifecycles of model management, IEPD development, and implementation support
      - Slide footer: NIEM Blue Team Tools Day
  3. Filling a Gap
      - Objective: to develop a proof-of-concept capability to create test data for an information exchange using synthetic data
      - At the practitioner level: need a tool to aid the testing of an implementation by generating “safe” test data
  4. Vision for Solution
      - [Diagram labels: Synthetic Data, SYNINGEN]
  5. Design
      - Proof-of-concept for one IEPD
      - Synthetic Instance Generator (SYNINGEN)
        - Synthetic data source
        - Embedded database
        - Dynamic insertion of “controlled” erroneous data
      - [Diagram labels: IEPD Schemas, Pre-process, Schema Binding, Synthetic Data, SYNINGEN, Test Records]
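
The "dynamic insertion of controlled erroneous data" idea from the Design slide can be illustrated with a small sketch. This is not SYNINGEN's code: the corruption strategies, the `GeneratedRecord` type, and the `make_test_record` function are hypothetical names introduced here. The field names and sample values are taken from the synthetic fields and sample bio slides, and the date string format is an assumption.

```python
import random
from dataclasses import dataclass, field


# Hypothetical corruption strategies; the slides do not list the ones
# SYNINGEN actually applies.
def blank(value: str) -> str:
    return ""


def overlong(value: str) -> str:
    return value * 20


def placeholder(value: str) -> str:
    return "NOT_A_VALUE"


CORRUPTIONS = (blank, overlong, placeholder)


@dataclass
class GeneratedRecord:
    fields: dict                                    # field name -> value
    corrupted: list = field(default_factory=list)   # fields made invalid on purpose


def make_test_record(good: dict, error_rate: float, rng: random.Random) -> GeneratedRecord:
    """Copy a known-good record, corrupting each field with probability error_rate."""
    record = GeneratedRecord(fields={})
    for name, value in good.items():
        if rng.random() < error_rate:
            record.fields[name] = rng.choice(CORRUPTIONS)(value)
            record.corrupted.append(name)  # remember which fields are "bad"
        else:
            record.fields[name] = value
    return record


# Known-good values drawn from the sample bio; the DOB string format is assumed.
GOOD = {
    "GivenName": "Eladio",
    "FamilyName": "Berstis",
    "AddressCity": "Lansing",
    "DOB": "1974-06-08",
}

if __name__ == "__main__":
    print(make_test_record(GOOD, error_rate=0.25, rng=random.Random(2009)))
```

Recording which fields were corrupted is what keeps the errors "controlled": a test harness can then check that the receiving implementation rejects exactly those fields. Passing in a seeded `random.Random` makes a test run repeatable.
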
  6. IEPD Selection
      - Requirements
        - Consists of a cross-section of commonly used data fields
        - Contains a minimal amount of domain-specific data, to utilize the capabilities of the previously developed synthetic data generator
        - Provides a concrete method for assessing IEPD implementation / test data via development of a web service (WS)
      - Selection: CONNECT Driver License Search IEPD
        - Designed to facilitate the effective exchange of criminal justice information among the CONNECT states (Alabama, Nebraska, Wyoming, Tennessee, and Kansas)
        - Defines the driver license search parameters, driver license search results (summary), and driver license details
  7. Demonstration
      - Generate Test Data
        - Good values
        - “Bad” values
      - Test Web Service Client
      - [Diagram labels: IEPD Schemas, Pre-process, Schema Binding, Synthetic Data, SYNINGEN, Web Service, Test Records, Driver License Query Client]
  8. Synthetic Data Generation
      - [Diagram labels: Synthetic Data, SYNINGEN]
  9. Why Use Synthetic Data?
      - Need for large-scale, high-quality synthetic datasets to support DHS Test and Evaluation (T&E) activities
        - Designing
        - Modeling
        - Testing (including usability)
        - Training
        - Tool studies
      - This need poses privacy protection challenges due to lack of access to actual data, i.e., Personally Identifiable Information (PII), and other access limitations
      - Four possible data methods are available to address data needs (with limitations)
        - Use of actual data
        - Sanitized or anonymized data
        - Manually created fictitious data
        - Machine-generated large-scale datasets from real-world models, algorithms, or reference statistical patterns
  10. Synthetic Data Generator (SDG)
      - Synthetic data: datasets composed entirely of fictitious data that can be used in a given context (or situation) instead of directly measurable or accessible actual data
      - Prototype developed for the Department of Homeland Security (DHS) Science and Technology (S&T) Command, Control, and Interoperability (CCI) Division
      - Automatic capability to produce robust datasets composed of entities (e.g., people with behaviors/“footprints” over time)
      - Creates synthetic test data that models a community with highly connected social networks of entities and relationships
      - Data reflects typical daily activities in which people travel, communicate, and spend money in ways that are normally expected in a reasonable world
      - Datasets are in simple delimited text format
  11. SDG Data City
      - JHU/APL developed the concept and rules that characterize the “reasonable world”
      - Categories of interest
        - Demographics (including immigration)
        - Social networks
        - Communication patterns
        - Travel
        - Financial transactions (including consumer spending)
      - Produce datasets that are consistent in time and space
      - [Diagram labels: Person, Credit Card Number, Credit Card Transaction, Phone Number, Call Transcript; relationship labels: travels, purchases, communicates using, caller, receiver]
  12. Current Synthetic Fields
      - PHONE_NUMBER: PersonID, Type (Landline or Mobile), Number
      - CREDIT_CARD_TRANSACTION: PersonID, TransactionNumber, CreditCardNumber, PurchaseCity, Date, Amount, Company, Industry
      - PERSON: EthnicityCode, EthnicityText, EyeColorCode, EyeColorText, GenderCode, GenderText, HairColorCode, HairColorText, HeightInches, WeightPounds, AddressStreetNumber, AddressStreetName, AddressCity, AddressCounty*, AddressState, AddressPostalCode, AddressPostalExtensionCode
      - PERSON …: PersonID, BinaryBase64Object, BinaryDescriptionText, BinaryFormatID, BinaryFormtStandardText, BinaryCategoryText, GivenName, FamilyName, MiddleName, Suffix, Citizenship, PassportNumber, DriverLicenseNumber, DriverLicenseState, DriverLicenseExpirationDate, DriverLicenseIssueDate, DOB
      - PHONE_CALL: PersonID, Date, DurationSeconds, Type, FromCityID, FromNumber, ToCityID, ToNumber
      - TRAVEL: PersonID, FromCityID, ToCityID, Date
      - CITY: CityID, City, State, Region, Country
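
The field lists suggest that the record types are linked through `PersonID`. A minimal sketch of that linkage follows, using a small subset of the PERSON and PHONE_NUMBER fields. The Python types, the class layout, the `numbers_for` helper, the `PersonID` values, and Soto's phone number are all assumptions made for illustration. Eladio's details come from the sample bio.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Person:
    person_id: int        # PersonID (integer type is an assumption)
    given_name: str       # GivenName
    family_name: str      # FamilyName
    address_city: str     # AddressCity
    address_state: str    # AddressState
    dob: date             # DOB


@dataclass(frozen=True)
class PhoneNumber:
    person_id: int        # PersonID, links back to Person
    type: str             # "Landline" or "Mobile"
    number: str           # Number


def numbers_for(person: Person, phone_numbers: list) -> list:
    """Return the numbers whose PersonID matches the given person."""
    return [p.number for p in phone_numbers if p.person_id == person.person_id]


# PersonID values are hypothetical; names, cities, and Eladio's numbers
# follow the sample bio.
eladio = Person(1, "Eladio", "Berstis", "Lansing", "Michigan", date(1974, 6, 8))
soto = Person(2, "Soto", "Berstis", "Providence", "Rhode Island", date(1970, 1, 1))  # DOB is hypothetical

phones = [
    PhoneNumber(1, "Landline", "517-513-5528"),
    PhoneNumber(1, "Landline", "517-567-1171"),
    PhoneNumber(2, "Mobile", "401-555-0100"),  # hypothetical number for Soto
]

if __name__ == "__main__":
    print(numbers_for(eladio, phones))
```

The same key would join PHONE_CALL, TRAVEL, and CREDIT_CARD_TRANSACTION rows to a person, which is one way the generated datasets could be checked for consistency across files.
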
  13. Synthetic Data Sample
      - Bio: Eladio Berstis, a USA citizen, lives in Lansing, Michigan, and was born on June 8, 1974. He subscribes to two landline telephone numbers: 517-513-5528 and 517-567-1171. He shares these numbers with family members who live with him. Eladio has a relative, Soto Berstis, who lives in Providence, Rhode Island. Eladio calls Soto regularly. Eladio owns two MasterCard credit cards.
  14. Utility
      - Developed prototype web portal interface to SDG
        - User specifies characteristic attributes for a dataset through this interface
      - Has been extended to generate other domain-specific data
      - Applications
        - North American Threat (NAT) Dataset for intelligence analysis
        - Privacy Protection Technology
        - NIEM Test Data
        - Suspicious Activity Reports (SARs)
      - Datasets have been generated and distributed to research institutions and agencies
  15. Feedback
      - CY2010 Q1: Delivery of SYNINGEN software to DHS EDMO
        - Independent of IEPD
      - Definition of fields desired in a synthetic dataset for NIEM
        - What is useful?
        - What level of fidelity is desired for the “reasonable world”?
      - Feedback welcomed: Barbara.shapter@jhuapl.edu, Nykia.Jackson@jhuapl.edu
  16. Backup Slides
  17. How reasonable is the data?
      - Names
        - People in the same family tend to share the same family name
        - Western naming convention of a single first name and single last name
        - Many companies do not have realistic names
        - Actual cities around the world
        - Affiliations can have either actual (e.g., Al Qaeda) or fictitious (e.g., Augusta Gang) names
      - Travels
        - Does not specify the transportation modes for travels
        - Tracks people at the city level (data does not tell us whether a person was seen at a restaurant)
        - No more than one travel event in one day
      - Phone Calls
        - Simplified communications among people
        - Access to both landline and mobile phone numbers
        - Mobile number is “owned” by only one person
        - Landline phone number may be used by a number of people
        - Two phone calls originating from the same phone number will not overlap in time (no guarantee that a phone number could not be a receiver and a caller at the same time)
      - Credit Card Transactions
        - Types of data that people are likely to find in their own monthly statements: date, amount, company, and industry
        - Transactions occur in the same city as a person’s current location
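
Two of the "reasonable world" rules listed in the last slide lend themselves to a mechanical check on a generated dataset. The sketch below is one possible checker, not part of SDG. It assumes the travel limit applies per person, represents rows as plain tuples, and uses hypothetical function names and call times. Following the slide, it only compares calls that originate from the same number and does not look at the receiving side.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def repeated_travel_days(travels):
    """travels: iterable of (person_id, travel_date).

    Return the (person_id, travel_date) pairs that occur more than once,
    i.e. violations of "no more than one travel event in one day".
    """
    counts = Counter(travels)
    return sorted(key for key, n in counts.items() if n > 1)


def overlapping_outgoing_calls(calls):
    """calls: iterable of (from_number, start, duration_seconds).

    Return every call that starts before an earlier call from the same
    originating number has ended.
    """
    by_number = {}
    for call in calls:
        by_number.setdefault(call[0], []).append(call)

    violations = []
    for group in by_number.values():
        group.sort(key=lambda c: c[1])          # order by start time
        latest_end = None
        for call in group:
            start = call[1]
            if latest_end is not None and start < latest_end:
                violations.append(call)
            end = start + timedelta(seconds=call[2])
            latest_end = end if latest_end is None else max(latest_end, end)
    return violations


# Hypothetical rows; the numbers come from the sample bio, the times do not.
travels = [(1, date(2009, 12, 15)), (1, date(2009, 12, 15)), (2, date(2009, 12, 15))]
calls = [
    ("517-513-5528", datetime(2009, 12, 15, 9, 0), 600),
    ("517-513-5528", datetime(2009, 12, 15, 9, 5), 60),   # starts during the first call
    ("517-567-1171", datetime(2009, 12, 15, 9, 5), 60),   # different originating number
]

if __name__ == "__main__":
    print(repeated_travel_days(travels))
    print(overlapping_outgoing_calls(calls))
```

Tracking the latest end time seen so far, rather than only the previous call's end, lets the check catch a short call that falls inside a much longer earlier one.
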