Big Data: Size, Complexity and Analytics

Nicoleta Serban, PhD
Associate Professor
H. Milton Stewart School of Industrial & Systems Engineering
Georgia Institute of Technology

What is Big?
Size or Quantity
• Gigabyte (10^9 bytes) vs. Terabyte (10^12 bytes) vs. Petabyte (10^15 bytes) vs. Exabyte (10^18 bytes)
Complexity or Heterogeneity
• Dependencies: temporal, spatial or network
• Randomness: sampling scheme
• High dimensionality: multiple features
• Depth: multiple hierarchies
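The size units named on this slide differ by factors of 1,000 under the decimal (SI) convention, which is assumed here; binary units (gibibyte, tebibyte, and so on) use powers of 1,024 instead. A minimal sketch of the conversion, with hypothetical helper names `to_bytes` and `ratio`:

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
# Assumption: the slide uses decimal units, not binary (GiB, TiB, ...).
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}


def to_bytes(value: float, unit: str) -> int:
    """Convert a quantity expressed in `unit` to a number of bytes."""
    return int(value * UNITS[unit.lower()])


def ratio(larger: str, smaller: str) -> int:
    """Return how many of the smaller unit fit in one of the larger unit."""
    return UNITS[larger.lower()] // UNITS[smaller.lower()]


if __name__ == "__main__":
    print(to_bytes(2, "terabyte"))
    print(ratio("exabyte", "gigabyte"))
```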

Why Size Matters?
Infrastructure for managing information:
• Storage – relational database vs. distributed systems vs. cloud computing
• Retrieval – random vs. sequential access
• Representation – level of knowledge vs. derivation of features
• Safeguards – protection of privacy and confidentiality

Why Complexity Matters?
Translation of information to data to knowledge:
• Infrastructure – supercomputers vs. distributed computers
• Computation – single-threaded vs. parallelizable computational methods
• Analytics – exponentially growing number of hypotheses
• Inference – the dangers of ‘blind’ data mining vs. mathematical rigor
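One way to read the point about an exponentially growing number of hypotheses: if each of p candidate features can be either in or out of a model, there are 2^p - 1 non-empty feature subsets to consider. The sketch below counts them and shows how a Bonferroni correction shrinks the per-test significance level as the count grows. Bonferroni is a standard multiple-testing adjustment brought in here purely for illustration; it is not named in the slides, and the function names are hypothetical.

```python
from math import comb


def count_subset_hypotheses(p: int) -> int:
    """Number of non-empty feature subsets among p candidate features."""
    return 2**p - 1


def count_pairwise_hypotheses(p: int) -> int:
    """Number of distinct feature pairs among p features (grows only quadratically)."""
    return comb(p, 2)


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at most alpha."""
    return alpha / m


if __name__ == "__main__":
    for p in (10, 20, 40):
        m = count_subset_hypotheses(p)
        print(p, count_pairwise_hypotheses(p), m, bonferroni_threshold(0.05, m))
```

With even a few dozen features, the per-test threshold becomes so small that unguided searching over all subsets is unlikely to yield defensible findings, which is one reading of the slide's contrast between 'blind' data mining and mathematical rigor.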

Data Science Framework
• Data
  • Representation
  • Sampling
• Information
  • Infrastructure
  • Management
• Decisions
  • System engineering
• Knowledge
  • Computation
  • Tools
• Data architectures
• Data integration, sharing and federation
• Data privacy rules
• Data wrangling
• Deriving hypotheses
• Validating hypotheses
• Eliciting causal relations
• Designing, planning, and optimizing
• Testing, ranking, scoring
• System dynamics
• Data mining
• Machine learning
• Statistical inference
• Network analysis
• Simulations
• Visualization

A Proof of Concept: Medicaid Project
• Information:
  • Identifiable patient-level claims data
  • 5 years + 14 states =
    • 266,839,307,070 observations
    • 2 terabytes of information
• Data:
  • Represented as patient care trajectories: utilization, cost and patient characteristics
  • Sampled by disease
Challenge #1: HIPAA and CMS data safeguards compliance
- data environment: access, sharing, linking, storage
Challenge #2: Database backbone
- projected research needs
- projected computational needs
Challenge #3: Data processing
- unavailability of tools to process-mine claims
- additional data and information needs
- expert opinion & collaborations

Medicaid Project: Health Analytics
• Data:
  • Condition: Pediatric Asthma
  • Baseline Metrics
  • Care Pathway
  • Access & Outcomes
• Knowledge:
  • Systematic disparities in access, outcomes and cost
  • Network of providers
  • Profiles of patient-level care pathways
Methods: process mining, spatial statistical models, functional data analysis, unsupervised classification, sequence clustering, Markov decision processes, optimization
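To make the idea of a patient-level care pathway concrete, a pathway can be treated as an ordered sequence of care events, and first-order transition probabilities between events can be estimated by counting. This is only a minimal sketch of one building block that sequence-based methods could use; it is not the project's actual method. The event labels and the three toy pathways are made up for illustration, and `transition_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def transition_probabilities(pathways):
    """Estimate first-order transition probabilities from event sequences.

    `pathways` is an iterable of lists; each list holds one patient's
    care events in time order.
    """
    counts = defaultdict(Counter)
    for events in pathways:
        # Pair each event with the one that follows it.
        for current, following in zip(events, events[1:]):
            counts[current][following] += 1

    probs = {}
    for current, following_counts in counts.items():
        total = sum(following_counts.values())
        probs[current] = {
            following: n / total for following, n in following_counts.items()
        }
    return probs


# Toy, made-up pathways (not real claims data).
pathways = [
    ["office_visit", "prescription", "office_visit"],
    ["emergency_visit", "hospitalization", "office_visit"],
    ["office_visit", "emergency_visit", "office_visit"],
]

probs = transition_probabilities(pathways)
```

Each row of the resulting table sums to 1, so it can be read as an estimated transition matrix over the observed events.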

Medicaid Project: Health Analytics
• Knowledge:
  • Systematic disparities in access, outcomes and cost
  • Network of providers
  • Profiles of patient-level care pathways
• Decision Making:
  • Policy interventions
  • Network interventions
Methods: Markov decision processes, causal inference, optimization modeling, simulations

Medicaid Project: Resources
• Legal Process & CMS Approval (~2 years)
• Costly IT infrastructure implementation
• Extensive IT support
• Constrained computing infrastructure
• Large team of students
• Funding & Deliverables
• Visibility

Medicaid Project: Opportunities
• Developing a proof of concept for building larger infrastructures for protected information
• Becoming the center for deployment of tools for mining claims data
• Advancing rigor in health analytics
• Educating students and visiting researchers
• Informing policy making by improving the understanding and management of the healthcare system

Acknowledgements
Co-Principal Investigator: Dr. Swann
Supporting Institutes and Organizations
• National Science Foundation (CAREER Award)
• Institute for People and Technology
• Children’s Healthcare of Atlanta
Research Team
IT Staff: Matthew Sanders and Paul Diederich
Postdoctoral fellow: Dr. Monica Gentili
Undergraduate students: Yuchen Zheng, Alex Terry, Pravara Harati, Qiming Zhang, Sean Monahan
Graduate students: Kevin Johnson (MS), Erin Garcia, Ben Johnson, Zihao Li, Ross Hilton

Contact Us
Nicoleta Serban nserban@isye.gatech.edu
Julie Swann jswann@isye.gatech.edu
