Big Data. Lloyd Brodsky Healthtechnet 10/21/2016. Goals for this talk. Introduce concept of learning health/continuous improvement Explain what big data is (and isn’t) Where data-centric approaches are used for healthcare Tell you about specific tools and education you can go out and use.
Big Data Lloyd Brodsky Healthtechnet 10/21/2016
Goals for this talk • Introduce concept of learning health/continuous improvement • Explain what big data is (and isn’t) • Where data-centric approaches are used for healthcare • Tell you about specific tools and education you can go out and use
Do more of what works • And less of what doesn’t • In order for that to happen you need • Data that produces metrics for what actually happened • Your prediction of what you thought was going to happen • A means of figuring out why the discrepancies happened • Which could be a real problem, bad luck, or a bad model • A plan to resolve the discrepancy in the right direction • Which means involving the people delivering the service
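The comparison of prediction against outcome described on this slide can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the metric names, the numbers, and the 10% tolerance are all hypothetical, and it assumes every predicted value is nonzero.

```python
def discrepancies(predicted, actual, tolerance=0.10):
    """Return metrics whose actual value misses the prediction by more than
    `tolerance`, expressed as a relative gap (assumes nonzero predictions)."""
    flagged = {}
    for name, expected in predicted.items():
        observed = actual[name]
        relative_gap = (observed - expected) / expected
        if abs(relative_gap) > tolerance:
            flagged[name] = round(relative_gap, 3)
    return flagged

# Hypothetical plan and results for one reporting period.
predicted = {"readmissions": 120, "eye_exams": 800, "er_visits": 400}
actual = {"readmissions": 150, "eye_exams": 790, "er_visits": 410}

print(discrepancies(predicted, actual))  # {'readmissions': 0.25}
```

The code only surfaces the gap. Deciding whether it reflects a real problem, bad luck, or a bad model still takes the people delivering the service.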
Feedback flaw in most quality reporting • Many programs have mandatory quality reporting • Most quality measures are inputs, not outputs • % of diabetics having an eye exam, not number of diabetics that went blind • Often results don’t get back to clinicians or patients • Most reporting doesn’t track follow-up • How sick were the people that showed up for screening? • How many of these tests were positive? • How many positive tests got appropriate follow-up? • What is the average cost of detecting a treatable case? • How much did the incidence of diabetes-related blindness drop? • Providers worry (legitimately) about being judged on matters outside their control
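The follow-up questions on this slide reduce to a handful of ratios once the counts are tracked. The sketch below shows the arithmetic only; the function name and every number in it are hypothetical.

```python
def screening_funnel(screened, positive, followed_up, treatable_cases, program_cost):
    """Output-oriented measures for a screening program."""
    return {
        "positivity_rate": positive / screened,          # how many tests were positive
        "follow_up_rate": followed_up / positive,        # positives with appropriate follow-up
        "cost_per_treatable_case": program_cost / treatable_cases,
    }

# Hypothetical counts for a diabetic eye-exam program.
report = screening_funnel(
    screened=2_000, positive=100, followed_up=80,
    treatable_cases=50, program_cost=150_000,
)
for measure, value in report.items():
    print(measure, value)
```

With these made-up inputs, 5% of tests are positive, 80% of positives get follow-up, and each treatable case costs $3,000 to detect.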
So the big management questions are • What is your organization trying to accomplish? • What metrics measure those accomplishments? • What tools and data do you need to calculate and share those metrics? • What’s your plan for dealing with the difference between plan and reality?
What is big data (as opposed to just data)? • Big data is a poorly defined marketing term covering four things • Big Computer • NoSQL database management systems • Big Datasets • Big Models • At the same time • Cloud services dropped the cost of infrastructure • Heavy competition plus published pricing • Open source dropped the cost of analytics and infrastructure software
Big computer • It’s cheaper and more scalable to tie together large numbers of commodity computers than to make bigger single computers • That’s what Hadoop and Spark are about • This is important if you have a data set that’s really really big • Won’t fit onto single disk drives • Too big to be processed in a reasonable period of time • Metaphor: To find a needle in a haystack, divide it into 100 smaller haystacks and assign 100 people to check one each at the same time • For mass personalization, speed is more important than accuracy
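The haystack metaphor can be sketched on a single machine. This is an illustration of the divide-and-search idea only, not how Hadoop or Spark are implemented: the threads stand in for separate computers, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(haystack, parts):
    """Divide the haystack into about `parts` chunks, keeping each chunk's offset."""
    size = max(1, -(-len(haystack) // parts))  # ceiling division
    return [(start, haystack[start:start + size])
            for start in range(0, len(haystack), size)]

def search_chunk(chunk, needle):
    """One worker checks one small haystack and reports positions in the original."""
    offset, items = chunk
    return [offset + i for i, item in enumerate(items) if item == needle]

def parallel_find(haystack, needle, workers=100):
    chunks = split(haystack, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(lambda c: search_chunk(c, needle), chunks))
    # Combine the per-worker answers into one result.
    return sorted(i for part in partial_results for i in part)

haystack = ["straw"] * 10_000
haystack[7_321] = "needle"
print(parallel_find(haystack, "needle"))  # [7321]
```

The shape is what matters: split the data, run the same check on every piece at once, then merge the small answers.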
How cheap? • Buying Oracle on Exadata for a 100 terabyte data warehouse costs about $6K per terabyte plus 20% annual maintenance • Add labor for a system administrator, tech refresh every five years & real estate • Buying Amazon Redshift (cloud data warehouse service) with a three year commitment is about $1K per terabyte per year • Administration, tech refresh, and real estate are included • You can rent computing by the hour • 64 cores/256 GB RAM $3.83/hour on demand (Amazon) • 31% off for one year commitment no upfront • 61% off for three year commitment all upfront • Could be as low as 43 cents per hour on auction • Cloud storage is $30/terabyte per month, less for archive
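The slide's figures can be turned into a rough side-by-side comparison. The prices are the ones quoted above; the three-year horizon and the reading of "20% annual maintenance" as 20% of the purchase price charged every year are assumptions made for this sketch, and the function names are hypothetical.

```python
TERABYTES = 100
YEARS = 3  # assumed horizon, chosen to match the Redshift commitment

def oracle_cost(terabytes, years, price_per_tb=6_000, maintenance_pct=20):
    """Purchase plus annual maintenance; labor, tech refresh, and real estate excluded."""
    purchase = price_per_tb * terabytes
    maintenance = purchase * maintenance_pct // 100 * years
    return purchase + maintenance

def redshift_cost(terabytes, years, price_per_tb_year=1_000):
    """Flat yearly price per terabyte; administration is included in the service."""
    return price_per_tb_year * terabytes * years

def discounted_hourly(on_demand, pct_off):
    """Hourly rate after a commitment discount, rounded to cents."""
    return round(on_demand * (100 - pct_off) / 100, 2)

print(oracle_cost(TERABYTES, YEARS))    # 960000
print(redshift_cost(TERABYTES, YEARS))  # 300000
print(discounted_hourly(3.83, 31), discounted_hourly(3.83, 61))
```

Under these assumptions the on-premises option costs about $960K over three years against about $300K for the cloud service, before adding the labor and facilities the slide says are extra.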
NoSQL (nonrelational) databases • These are the workhorses of transaction processing systems • And they’ve been around for decades • The new ones • Don’t require you to predefine what’s in the data • And that’s really important for unstructured data • And to be able to defer work, get data stored in-house, and wait until it becomes important
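What "don't require you to predefine what's in the data" means in practice can be shown with plain JSON documents. This is a toy stand-in for a document store, not any particular NoSQL product: the in-memory list, the helper names, and the records are all hypothetical.

```python
import json

store = []  # stand-in for a document collection; no schema is declared anywhere

def save(document):
    """Store the record as-is, whatever fields it happens to have."""
    store.append(json.dumps(document))

# Three records with different shapes, saved without any upfront design work.
save({"patient_id": 1, "note": "Follow-up in 6 weeks"})
save({"patient_id": 2, "device": {"type": "glucose meter", "readings": [5.4, 6.1]}})
save({"patient_id": 3, "note": "No complaints", "housing": "unstable"})

def find_with_field(field):
    """Decide later which fields matter and query for them then."""
    return [doc for doc in map(json.loads, store) if field in doc]

print([d["patient_id"] for d in find_with_field("note")])  # [1, 3]
```

The work of deciding what the data means is deferred, not avoided. It moves from load time to query time.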
Big data sets • There are an increasing number of businesses with very large data sets arising from the ordinary course of business. Reasons include: • Side effect of production processes (such as log files in ecommerce) • Emergent from user interaction (such as social media) • Very large scale transaction processor (such as insurance companies) • Totally new data source (genomics) • Hadoop itself grew out of designs Google published for managing its own data (the Google File System and MapReduce) • The scanned text of the Library of Congress is 10-20 terabytes, so a petabyte (1,000 terabytes) is VERY big • The Library of Congress would be about a tenth of the scanned size if converted to text • Texas Medicaid is about 40 terabytes
Very few entities have that much data • For health care providers, clinical images are the storage hogs • Not billing, not EMRs • More data is not necessarily better data • Watching a movie at 4K isn’t 16x more entertaining than HD streaming • Polls wouldn’t be more accurate if they called everybody • If the data is too inaccurate or biased or irrelevant it won’t help • But there is a good chance you have multiple related sources of data • Oracle and IBM came up with reasons for “big data” to be a useful concept in more places • You’ll hear about data lakes containing all kinds of data • And you’ll hear about the Four V’s
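The polling point can be made concrete with the standard margin-of-error formula for a proportion. The sketch assumes a simple random sample, a 95% confidence level, and the worst case of a 50/50 split; the sample sizes are illustrative.

```python
from math import sqrt

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * sqrt(proportion * (1 - proportion) / sample_size)

for n in (1_000, 16_000):
    print(n, f"{margin_of_error(n):.1%}")
```

Sixteen times as many calls shrinks the margin only by a factor of four, from roughly 3.1 to 0.8 percentage points. The formula also covers sampling error only, so a biased sample stays biased however large it gets.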
Where does health data come from? • Billing/Administrative data (Payers often have the most holistic view) • Medical records • Institution-specific system of record for clinicians caring for their patients • Medical devices (although mostly not stored centrally) • Clinical scale providers • Research systems • Pharmaceutical companies and research institutions • Government surveys • Such as CDC PRAMS (pregnancy risk), AHRQ HCUP, and the Census American Community Survey • Patients • Often aggregate their own data (which they have a right to) • Best source of information on social issues
What health data can you get? • It’s not a coincidence that most health services research papers are based on Medicare billing data, come from the VA or Kaiser, or come from a research hospital at the center of a care network • They had data (and perhaps organizational slack) • You should have access to your own data (although you may have integration problems) • But absent participation in a sharing program you may have trouble getting data • Both HIEs and some companies facilitate pooling data
Big models • Business intelligence and data visualization • Enable interactive exploratory data analysis on a collection of data sets • Still the first and most important thing to do • Text analysis • Such as looking at doctors’ notes to suggest diagnosis codes • Machine learning • Train a classifier on a sample and then score the entire population • Used for health risk scoring and fraud detection • Social network analysis (Analyze the links between entities) • Used for fraud detection and workflow analysis
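"Train a classifier on a sample and then score the entire population" can be sketched with a deliberately simple model. A nearest-centroid classifier stands in here for whatever model a real risk-scoring tool would use; the features (age, prior admissions), labels, and patient records are hypothetical, and real features would normally be rescaled first.

```python
from statistics import mean

def train_centroids(sample):
    """Learn one average feature vector per label from a labeled sample."""
    by_label = {}
    for features, label in sample:
        by_label.setdefault(label, []).append(features)
    return {label: tuple(mean(col) for col in zip(*rows))
            for label, rows in by_label.items()}

def score(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))

# Hypothetical labeled sample: (age, prior admissions) -> risk label.
labeled_sample = [
    ((45, 0), "low"), ((52, 1), "low"), ((38, 0), "low"),
    ((71, 3), "high"), ((80, 4), "high"), ((68, 2), "high"),
]
centroids = train_centroids(labeled_sample)

# Score every member of the (hypothetical) population with the trained model.
population = {"p1": (40, 0), "p2": (75, 3), "p3": (60, 2), "p4": (50, 1)}
scores = {pid: score(centroids, features) for pid, features in population.items()}
print(scores)
```

Labeling is the expensive part, so it is done once on a small sample. Scoring is cheap and can then run over everyone.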
What’s good for health care now? • Traditional business intelligence • Now with data visualization • Pop health triage • Run a model grouping claims into • Episodes of care • Create risk scores for individual patients • Identify care a patient should have received, but didn’t • Fraud detection • Cluster analysis with the goal of identifying practice/cost outliers • Social network analysis looking for suspicious referral patterns • Text analysis of clinician notes is more popular with providers • Computer-assisted coding
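Flagging cost outliers can be sketched with a much simpler technique than the cluster analysis named on this slide. The example below uses a modified z-score built on the median and the median absolute deviation, which is a swapped-in method chosen for brevity. The provider names, the costs, and the conventional 3.5 cutoff are assumptions.

```python
from statistics import median

def cost_outliers(cost_by_provider, threshold=3.5):
    """Return providers whose cost sits far from the median, using a
    modified z-score (0.6745 * |x - median| / MAD)."""
    costs = list(cost_by_provider.values())
    center = median(costs)
    mad = median(abs(c - center) for c in costs)
    if mad == 0:
        return []  # no spread to measure against
    return sorted(
        provider for provider, cost in cost_by_provider.items()
        if 0.6745 * abs(cost - center) / mad > threshold
    )

# Hypothetical average cost per claim for six providers.
average_cost = {"A": 100, "B": 110, "C": 95, "D": 105, "E": 400, "F": 102}
print(cost_outliers(average_cost))  # ['E']
```

An outlier is a prompt for review, not proof of fraud. A provider with unusually sick patients would be flagged the same way.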
What’d be good for health care later? • Democratizing analysis • One thing for headquarters to calculate quality based on missed opportunities of care; another to push that information out to patients and front-line clinicians • Time series for chronic illness • Most pop health tools take a year of data to predict a year. Chronic illness runs for decades • Patient-supplied data • Social work and clinical questionnaire information usually isn’t in EMRs • Medical device integration • Many have computers with audit trails but don’t export the data into EMRs • Merging administrative and clinical data, especially interorganizationally • Would rather base diagnoses on lab readings than diagnostic codes • Census integration (the missing denominator problem)
Advice on how to get started • Do little data before you do big data • Make it right before you make it big • Explore with free tools and public data • Do exploratory analysis before you do statistics • Visualizing is easy to do and usually insightful • And easier to sell to your boss • Estimate on a sample before you do the population • Do a Kaggle competition
What software am I recommending and why? • Microsoft PowerBI is a business intelligence and data visualization tool • The desktop is a free download; functionality also in current Excel • It cooperates with most data sources • Competes with Tableau ($2K, but there’s a less-functional free version) • R is an open source statistical programming language • Commercially supported versions from Microsoft, Oracle, and Tibco • Over 7,000 add-on packages available at no charge • Scales easily if you buy the corresponding database • Competes with SAS ($9K for desktop) • RStudio is an integrated development environment for R • Competes (sort of) with Visual Studio and Eclipse
Software • Microsoft PowerBI • https://powerbi.microsoft.com/en-us/downloads/ • Microsoft R Open • https://mran.revolutionanalytics.com/download/ • RStudio • https://www.rstudio.com/products/rstudio/download/
Education • Nice summary of R resources at https://www.rstudio.com/online-learning/ • edX, Coursera, and Datacamp are freemium online education providers • edX and Coursera are university MOOCs; classes are free; certification costs • Datacamp is software with training wheels; intro classes are free, otherwise $29/month • Kaggle organizes data science competitions • Which makes for a good learning environment – people share code and help each other out • There’s a lot of voluntary online support for open source
Educational resources • Learning Health series • https://www.nap.edu/catalog/13301/the-learning-health-system-series • Microsoft PowerBI • https://powerbi.microsoft.com/en-us/guided-learning/ • https://www.edx.org/course/analyzing-visualizing-data-power-bi-microsoft-dat207x-3 • R • https://www.datacamp.com/ • https://www.edx.org/microsoft-professional-program-certficate-data-science • https://www.coursera.org/specializations/jhu-data-science • https://www.edx.org/course/analytics-edge-mitx-15-071x-2
Cloud resources • Kaggle https://www.kaggle.com/ • Titanic competition https://www.kaggle.com/c/titanic • https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic/discussion • Stack Overflow http://stackoverflow.com/ • Data • https://www.healthdata.gov/ • Major cloud providers offer trial big data services • That’s Amazon Web Services, Microsoft Azure, and Google Cloud • Microsoft Bizspark has a particularly good deal for tech startups
Summary • Need to establish relevant metrics based on data you can get • Broke down “big data” into four parts and came to the conclusion that the democratization of analysis and the drop in cost were more important than size • Identified education, tools, and data that you can take and use