VIFI: Virtual Information Fabric for Data-Driven Discovery

VIFI: Virtual Information Fabric for Data-Driven Discovery from Distributed Fragmented Repositories
PI: Dr. Ashit Talukder, Bank of America Endowed Chair in IT
Email: atalukde@uncc.edu
Web: http://cs.uncc.edu/directory/talukder-ashit

VIFI Concept
• Novel VIFI cyberinfrastructure that facilitates data-driven discovery from distributed, fragmented datasets
  • without requiring movement of massive amounts of data
  • without exposing sensitive raw datasets to end users
• Overarching goals:
  • Open source middleware tools
  • Evaluate and demonstrate on multiple domains: Earth science, astronomy, health informatics, resilient human-building ecosystems
  • Useful in domains involving massive or heterogeneous data streams, with novel edge analytics and fog computing

Traditional Data Fabric: Limitations
• Complex and time-consuming processes, standards, APIs, and MOUs, which may include format conversion, DB import, selective field encryption, data redaction or de-identification, etc.
• Given appropriate authorizations and consideration for data privacy, bulk datasets are transported across bandwidth-limited connections.
• Only after the bulk data has been staged and ingested can analytics separate valuable information from irrelevant data. The volume of irrelevant data often eclipses that of the usable information.
1.45pm to 3.45pm - Room 232

VIFI Proof of Concept: Early Stage Demonstrations
• Demonstrate initial core components in the VIFI proof-of-concept use case:
  • User interface and visualization of distributed data and VIFI features
  • Portable analytics container (PAC): prepare self-contained analytics scripts and algorithms
  • Docker swarm: deploy, monitor, and execute portable analytics (PACs) on remote repositories
  • Orchestration of distributed infrastructure
  • Distributed computation and analytics without moving distributed repositories
  • User visualization of analytics and data-driven insights
• Demonstrate on a pilot Earth science use case: climate and weather precipitation model prediction from distributed Earth science repositories
• Demonstrate on a pilot astronomy use case: detecting specific statistical patterns in distributed astronomy data
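As a rough illustration of the "Docker swarm: deploy ... portable analytics" component, the sketch below builds (but does not run) a `docker service create` command that would pin one PAC job to the node holding the data. The service name, image name, node label, paths, and script name are all hypothetical placeholders, not taken from the VIFI implementation; only the Docker CLI flags themselves are real.

```python
import shlex


def swarm_service_command(name, image, site_label, data_dir, script):
    """Build (not run) a `docker service create` command for one PAC job."""
    return [
        "docker", "service", "create",
        "--name", name,
        # Run the task once instead of restarting it when it exits.
        "--restart-condition", "none",
        # Schedule only on swarm nodes labelled with the target site.
        "--constraint", f"node.labels.site=={site_label}",
        # Expose the local repository read-only; the data stays on the node.
        "--mount", f"type=bind,source={data_dir},target=/data,readonly",
        image,
        "python", script,
    ]


# All values below are hypothetical and for illustration only.
cmd = swarm_service_command(
    name="model-pac",
    image="example/pac-runtime:latest",
    site_label="model",
    data_dir="/srv/model-data",
    script="/pac/model_pac.py",
)
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting issues.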

VIFI POC Use Case: Hourly Precipitation Datasets over the Great Plains
• When it rains somewhere in the Great Plains, is there a probability density function that can forecast how strong, how long, and how much it rains?
• The example uses one day of rainfall data from NASA’s observational (GPM) and model datasets at three different spatial resolutions.
Figure: region map from [Bukovsky 2011]; region 10: Northern Plains

VIFI Motivation: Traditional Data Fabric Architecture
Components: Model Server (Model Data), Observation Server (Observations Data), User Node (or Server)
1. Download Model data
2. Download Obs data
3. Observational data re-gridded to the same resolution as the model data (if necessary)
4. JPDF is computed for the observational data
5. JPDF is computed for the model data
6. Observed and simulated JPDFs are used to compute an Evaluation Metric
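Steps 4 to 6 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: "JPDF" is read here as a joint probability distribution estimated with a normalized 2-D histogram, the two variables are taken to be rain intensity and duration, the evaluation metric is assumed to be a histogram overlap score, and all sample values and bin edges are made up.

```python
from collections import Counter


def joint_pdf(pairs, x_edges, y_edges):
    """Normalized 2-D histogram (empirical joint PDF) of (x, y) pairs."""
    def bin_index(value, edges):
        # Return the bin i with edges[i] <= value < edges[i + 1], else None.
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None

    counts = Counter()
    for x, y in pairs:
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def overlap_score(p, q):
    """Sum of cell-wise minima: 1.0 for identical JPDFs, 0.0 for disjoint ones."""
    return sum(min(p.get(c, 0.0), q.get(c, 0.0)) for c in set(p) | set(q))


# Hypothetical (intensity in mm/h, duration in h) rain events.
obs_events = [(0.5, 1), (1.5, 2), (1.5, 2), (3.0, 4)]
model_events = [(0.5, 1), (1.5, 2), (3.0, 4), (3.5, 4)]
intensity_edges = [0, 1, 2, 4]
duration_edges = [0, 2, 3, 6]

obs_jpdf = joint_pdf(obs_events, intensity_edges, duration_edges)        # step 4
model_jpdf = joint_pdf(model_events, intensity_edges, duration_edges)    # step 5
score = overlap_score(obs_jpdf, model_jpdf)                              # step 6
print(score)  # 0.75
```

In the traditional architecture all three calls run on the User Node, which is why both full datasets have to be downloaded first.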

VIFI Motivation: Traditional Data Fabric Architecture
• Disadvantages:
  • Massive datasets take a long time to transfer to the User Node
  • Massive datasets impose high storage requirements on the User Node
  • All computations are executed on the same server
  • All data are transferred to the User Node, including data that might not be relevant for the analysis in question
  • The scientist must manually install the algorithms (including all dependencies) on the User Node

VIFI Motivation: ViFi Enabled Data Fabric Architecture
Components: Model Server (Model Data), Observation Server (Observations Data), User Node (or Server), Docker Hub
1. Send Model PAC Script
2. Send Obs PAC Script
3. Request Model PAC Docker Image
4. Request Obs PAC Docker Image
5. Execute Model PAC
6. Execute Obs PAC
7. Send Model Results
8. Send Obs Results
9. Request Evaluation PAC Docker Image
10. Execute Evaluation PAC
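The flow above can be mimicked with a toy simulation in which each node keeps its raw data and only PAC results travel. This is a sketch, not the VIFI implementation: the `Node` class, the PAC functions, and the sample values are hypothetical, PACs are modelled as plain functions rather than Docker images, and the Evaluation PAC is assumed to run on the User Node because both result sets are sent there.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: list = field(default_factory=list)  # raw data held locally
    log: list = field(default_factory=list)

    def run_pac(self, pac):
        """Execute a PAC (here: a plain function) next to the local data."""
        self.log.append(f"{self.name}: executed {pac.__name__}")
        return pac(self.data)


def mean_pac(values):
    # Hypothetical stand-in for a Model or Obs PAC: reduce raw data to a summary.
    return sum(values) / len(values)


def evaluation_pac(results):
    # Hypothetical evaluation: absolute difference of the two summaries.
    model_result, obs_result = results
    return abs(model_result - obs_result)


model_server = Node("model-server", data=[2.0, 4.0, 6.0])
obs_server = Node("observation-server", data=[1.0, 3.0, 5.0])
user_node = Node("user-node")

# Steps 1-8: PACs run where the data lives; only their results come back.
model_result = model_server.run_pac(mean_pac)
obs_result = obs_server.run_pac(mean_pac)
transferred = [model_result, obs_result]

# Steps 9-10: the evaluation PAC runs on the transferred results.
user_node.data = transferred
score = user_node.run_pac(evaluation_pac)
print(score)  # 1.0
```

Two numbers cross the network here instead of six raw values; with real repositories the gap between result size and dataset size is what makes the approach attractive.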

VIFI Motivation: ViFi Enabled Data Fabric Architecture
• Advantages:
  • All phases of the scientific analysis lifecycle (compute and data transfer) are executed by a single agent (NIFI), without any manual intervention or a priori knowledge on the scientist's part.
  • Science algorithms are encapsulated in reusable PACs, which can be seamlessly deployed and run on any ViFi-enabled node.
  • Computations are distributed across multiple servers that have direct access to the data (NO NEED TO MOVE DATA).
  • Only a subset of the data (i.e., the results of the Model and Observation PACs) is transferred over the network, drastically reducing data transfer times.
  • The overall infrastructure scales to any new data source simply by installing the ViFi software.

VIFI User Interface
Screenshot panels: PAC script upload; PAC script writing; visualization types; results

NIFI at User Site

NIFI at Server(s) Site(s)
Panels: NIFI at Model Server; NIFI at Observation Server

PoC Current Status
• Open source (extensible and portable across infrastructures)
• Initial deployment on AWS (for speed of demonstration; portable and easy to deploy on locally managed infrastructure if needed)
  • AWS virtual machines
  • AWS S3 bucket to keep results
• First datacenter hosts Model data + NIFI + Docker Swarm
• Second datacenter hosts Observation data + NIFI + Docker Swarm
• User node with NIFI + Docker Swarm
• Docker image of Apache OCW at Docker Hub
• Base functionality for user interface and visualization

PoC Future Work
• Expand pilot, commence HLA design
• Integration between UI and NIFI
• Common NIFI workflow design for most datacenters (i.e., not only for JPL):
  • Identification of common attributes of users as well as datacenters
  • Data search and virtualization separate from PAC scripts
• Data governance, data management, search and query
• Workflow scheduling and optimization (e.g., DAWN, IReS)
• Security integration (authentication, authorization, audit, provenance)
• Encryption integration (encrypt relevant data and run computations on encrypted data)
• Demonstrate, evaluate, and benchmark on multiple application domains
