350 likes | 358 Views
Virginia’s Longitudinal Data System. A Federated Approach to Longitudinal Data. April 4 th , 2011. Agenda. The Challenge Virginia’s Approach Best Practice and SME Findings Design Considerations Proposed Solution Summary. The Challenge.
E N D
Virginia’s Longitudinal Data System A Federated Approach to Longitudinal Data April 4th, 2011
Agenda • The Challenge • Virginia’s Approach • Best Practice and SME Findings • Design Considerations • Proposed Solution • Summary
The Challenge • To develop a Statewide Longitudinal Data System (SLDS) that, without violating privacy policies or law, provides users with a capability to query, link, download and create reports from record level or aggregate data between one or more agencies • Because of existing Commonwealth law, the SLDS could not be based on an underlying data warehouse • De-identified data may be merged when a viable reason exist. However, The use of persistent, de-identified, linked (merged) data was determined to be highly inefficient and raised political issues which could have endangered the project.
Virginia’s Approach • Virginia undertook a comprehensive investigation of best practices and subject matter experts to determine the feasibility of a federated data model. • Between October and December 2010, the Center for Innovative Technology (CIT), Virginia Information Technologies Agency (VITA) and the Department of Education (DOE) interviewed six best practice organizations and ten subject matter experts. • Those findings led to a SLDS Technical Architecture which fulfilled the objective of the grant while adhering to the Commonwealth’s privacy constraints.
Important Design Considerations • User friendly • Maximize use of existing technologies/solutions • Minimize sustainment costs • Record level data queries were not time sensitive • Strong central security model
The Solution • A federated data model and technical architecture comprised of a web based user interface (UI), a query/linking engine, a multi-level security module, a rich business intelligence (BI) capability, a Lexicon and integrated workflow. SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Conceptual Portal SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Portal Components SLDS Portal • Shaker • Distributed Query Engine (DQE) • For use by Agency employees and named users • Reports • Public Facing Aggregated Data • Named Users - Query Building Tool (QBT) • Lexicon • Workflow • Account request • Data request Workflow Reporting Data Security Shaker Lexicon Security
Portal Features(Public Facing) SLDS Portal • Aggregated Data Reports • Lexicon • Links to Agency reports • Help Files • FAQs • Request for Named User Account Workflow Reporting Data Security Shaker Lexicon Security
Portal Features(Named Users) SLDS Portal • Help / Training • Reports • Non-suppressed aggregated data • Query Building Tool (QBT) • Lexicon • Workflow • Account and Data request • Data retrieval • File Attachment for uploading NDAs, etc. • Ability to check status, modify or cancel account and/or data request • Password reset Workflow Reporting Data Security Shaker Lexicon Security
SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Security Overview SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Security SLDS Portal SLDS Portal Authentication Workflow Workflow Reporting Reporting • Viewing Data Data • Viewing Security Security Shaker Shaker Lxicon Lexicon • Suppressed Data • Non-Suppressed Data Authorization • Database • Table • Column • Role Based • Permission Security Security • Viewing • Editing
SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Workflow SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Reporting: Record Level Linked Data SLDS Portal Source Data DOE SCHEV VEC Workflow Reporting 1. Instantiates the information contained in the Lexicon. 2. Contains dummy data. Shell Database1,2 Lexicon Data Report link will display report with dummy data. Report will have a button that will allow submission of report to workflow. Distributed query engine generate queries to each of the source data systems and join the result sets . Engine will interact with Lexicon. Options for report display include a Logi Analysis Grid (depending on number of records returned.) or a link to download a file. Access may be provided through Ad Hoc report portal. Ad Hoc Metadata Security Shaker3,4 Shaker Lexicon Query Results5,6 Security Results Approval Report Creation1,2 (Ad Hoc interface) Report Creation1,2 (Ad Hoc interface)
Source Data Reporting: Aggregate Linked Data DOE SCHEV VEC SLDS Portal ETL process will periodically pull source data and load aggregate data tables. The tool used for the ETL process may be SSIS or LogiETL. . Data access through Stored Procedures which will handle data suppression. Record Level Linked Data Aggregate Linked Data3 Workflow Reporting ETL1,2 Data Direct DB Connection There will be prebuilt reports for linked data from the different sources (e.g., DOE to SCHEV, SCHEV to VEC). The prebuilt reports may provide the user with some capabilities to perform analysis on the data (e.g., crosstabbing, grouping, filtering, etc.) Security Prebuilt Reports1,2 Shaker Lexicon Public Reports HTTP SLDS Portal Portal1 Security SLDS Portal HTTP Prebuilt Reports will be displayed within iFrames in Portal. User
SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Lexicon Defined For Our Purposes: The Lexicon is an inventory of every available data field in every available data source, the structure of their storage, the possible values and meanings of the information stored, all possible transformations of each set of field values to another set of field values, methods of data source access, and matching algorithms and how they are to be used in conjunction with possible field value transformations. Transformations & Matching Algorithms
Lexicon Maintenance • To maintain accuracy and manage extensibility, the linking module will process all data sources periodically at a predetermined time/interval looking for: • Changes in data ranges ( a new code was added for race/ethnicity ) • New fields (more data, more data, more data!) • Anything else that would disrupt the probabilistic matching or provide more ways to slice and dice the data • Anomalies found by the linking module will prompt an alert for a system administrator to modify the matching algorithm or add query choices • For new sources, or those with known common fields/links, this would be the method of entry
Shaker SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Lexicon – Shaker Process Common IDs [deterministic] or Common Elements with appropriate Transforms, Matching Algorithms and Thresholds [probabilistic] User Interface/ Portal/ LogiXML Lexicon Shell Database Query Building Process (Pre-Authorization) Sample Data ? Workflow Manager DS 1 Linking Control A linking engine process will update the Lexicon periodically to allow query building on known available matched data fields. No data is used in this process. Queries are built on the relationships between data fields in the Lexicon. DS 2 Data Access Control Sub-Query Optimization Hashed ID Matrix Authorized Query DS 3 Query Results
Joining Sub-Queries on Hashed-IDs SLDS Portal Add’l Data Sources Workflow Reporting Possible Connection using Web Service – creates Web Services Data Source (Oracle) - enables application and data integration by turning external web service into an SQL data source, making external Web services appear as regular SQL tables. This table function represents the output of calling external web services and can be used in an SQL query. Possible Connection using Homogeneous link between Oracle DBs – establish synonyms for global names of remote objects in the distributed system so that the Shaker can access them with the same syntax as local objects Data Possible Connection using Heterogeneous link using available Transparent Gateway or Generic ODBC/OLE Sub-query processing priority will be determined for each query to minimize unnecessary data transfer (e.g. not downloading unmatched records unless specifically requested) to optimize join performance – see Query Sub-Process Optimization Matched Hash ID Values Security • The SLDS server will match records from different agencies using the Hash ID • After records are matched, the SLDS server will delete the Hash ID values and replace them with randomly generated unique IDs. Shaker Lexicon Security
Sub-Query Process Optimization SLDS Portal Agency Creates Hash-IDs DS 1 DS 2 DS 3 2nd DS to query is DS with next least count using specified criteria (if Inner Join) Query 2nd DS using today’s key AND hashed-ID list from 1st DS 1st DS to query is DS with least count using specified criteria Query 1st DS using today’s key Returns set with hashed IDs Workflow Reporting Data Security Get COUNTS from each DS Web Service for each set of limiting criteria Shaker Parse Sub-Queries Lexicon Join Sub-Queries on Hashed ID Run 1st Sub-Query Create Hash-Key Run 2nd Sub-Query Derive JOIN Criteria from Lexicon - Common IDs [Deterministic] or Common Elements with appropriate Transforms, Matching Algorithm and Thresholds [probabilistic] Security Query Results Lexicon Query
Data SLDS Portal Workflow Reporting Data Security Shaker Lexicon Security
Data Architecture SLDS Portal DS 1 DS 2 DS 3 Contains DBs for Shaker, Ad Hoc metadata, logging, auditing, etc. Database for Shaker process and that temporarily stores linked record level data. The temporary tables will be dropped after a set period of time. For canned reports, Stored Procedures will be used for data querying and suppression. DS 1 Workflow Reporting Data ETL1 Security VITA (CESC) Metadata and Security1 Workflow Lexicon Shell DB Shaker/ Deidentified Record Level Data2 Aggregate Linked Data Shaker Lexicon SPs3 Security Workflow Lexicon UI / Admin Record Level Query / Reports Aggregate Linked Reports SLDS Portal
Security SLDS Portal • Authentication • COV AUTH • Authorization • Role Based • Anonymous User • Named User • System Administrator • Agency Employee • Researcher • Permissions • Workflow • Reports (Suppressed and Non-Suppressed) • Query Building Tool • Lexicon • Data elements • User Account Management • Data security enforced by/at …. • Portal • Lexicon • Viewing • Editing • Reports • Suppressed Data • Non-Suppressed Data • Workflow • Data • Database • Table • Column Workflow Reporting Data Security Shaker Lexicon Security