
Big Data

Big Data. Lloyd Brodsky, Healthtechnet, 10/21/2016. Goals for this talk: introduce the concept of learning health/continuous improvement; explain what big data is (and isn’t); show where data-centric approaches are used for healthcare; tell you about specific tools and education you can go out and use.


Presentation Transcript


  1. Big Data Lloyd Brodsky Healthtechnet 10/21/2016

  2. Goals for this talk • Introduce concept of learning health/continuous improvement • Explain what big data is (and isn’t) • Where data-centric approaches are used for healthcare • Tell you about specific tools and education you can go out and use

  3. Institute of Medicine Learning Health System

  4. Do more of what works • And less of what doesn’t • In order for that to happen you need • Data that produces metrics for what actually happened • Your prediction of what you thought was going to happen • A means of figuring out why the discrepancies happened • Which could be a real problem, bad luck, or a bad model • A plan to resolve the discrepancy in the right direction • Which means involving the people delivering the service
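
As a minimal sketch of the plan-versus-actual comparison on slide 4, the snippet below flags metrics whose actual value strays from the prediction by more than a chosen tolerance. The metric names, the numbers, and the 10% tolerance are all illustrative assumptions, not figures from the talk.

```python
# Illustrative numbers only: none of these figures come from the talk.
predicted = {"eye_exams": 420, "readmissions": 35, "no_shows": 60}
actual = {"eye_exams": 380, "readmissions": 52, "no_shows": 58}


def discrepancies(predicted, actual, tolerance=0.10):
    """Return metrics whose actual value differs from the prediction
    by more than `tolerance`, expressed as a fraction of the prediction."""
    flagged = {}
    for name, expected in predicted.items():
        relative_gap = (actual[name] - expected) / expected
        if abs(relative_gap) > tolerance:
            flagged[name] = round(relative_gap, 3)
    return flagged


flagged = discrepancies(predicted, actual)
print(flagged)
```

The code only finds the discrepancies. Deciding whether each one is a real problem, bad luck, or a bad model still takes a conversation with the people delivering the service.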

  5. ONC Learning Health System IT model

  6. Feedback flaw in most quality reporting • Many programs have mandatory quality reporting • Most quality measures are inputs, not outputs • % of diabetics having an eye exam, not number of diabetics that went blind • Often results don’t get back to clinicians or patients • Most reporting doesn’t track follow-up • How sick were the people that showed up for screening? • How many of these tests were positive? • How many positive tests got appropriate follow-up? • What is the average cost of detecting a treatable case? • How much did the incidence of diabetes-related blindness drop? • Providers worry (legitimately) about being judged on matters outside their control
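
The follow-up questions on slide 6 reduce to a few ratios. Here is a rough sketch, assuming a screening program can count people screened, positive tests, positives with follow-up, and treatable cases found; the function name, field names, and numbers are hypothetical.

```python
def screening_funnel(screened, positive, followed_up, treatable_cases, total_cost):
    """Turn raw screening counts into outcome-oriented ratios.
    Assumes every count used as a denominator is greater than zero."""
    return {
        "positive_rate": positive / screened,
        "follow_up_rate": followed_up / positive,
        "cost_per_treatable_case": total_cost / treatable_cases,
    }


# Hypothetical program: 1,000 people screened at a total cost of $50,000.
metrics = screening_funnel(
    screened=1000,
    positive=80,
    followed_up=60,
    treatable_cases=40,
    total_cost=50_000,
)
for name, value in metrics.items():
    print(f"{name}: {value:,.2f}")
```

None of these ratios answers the last question on the slide (how much blindness dropped), which needs outcome data tracked over time.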

  7. So the big management questions are • What is your organization trying to accomplish? • What metrics measure those accomplishments? • What tools and data do you need to calculate and share those metrics? • What’s your plan for dealing with the difference between plan and reality?

  8. What is big data (as opposed to just data)? • Big data is a poorly defined marketing term that bundles four things • Big Computer • NoSQL database management systems • Big Datasets • Big Models • At the same time • Cloud services dropped the cost of infrastructure • Heavy competition plus published pricing • Open source dropped the cost of analytics and infrastructure software

  9. Big computer • It’s cheaper and more scalable to tie together large numbers of commodity computers than to make bigger single computers • That’s what Hadoop and Spark are about • This is important if you have a data set that’s really really big • Won’t fit onto single disk drives • Too big to be processed in a reasonable period of time • Metaphor: To find a needle in a haystack, divide it into 100 smaller haystacks and assign 100 people to check one each at the same time • For mass personalization, speed is more important than accuracy
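
The haystack metaphor can be sketched on a single machine. This is not Hadoop or Spark; it only shows the split-then-search-in-parallel pattern using Python's standard thread pool, and the helper names (`split`, `search_chunk`, `parallel_contains`) are made up for illustration. Threads in CPython will not speed up CPU-bound work, so treat it as a picture of the idea rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Divide items into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def search_chunk(chunk, target):
    """One worker checks one small haystack."""
    return target in chunk


def parallel_contains(items, target, workers=100):
    """Split the haystack and let every worker search its own piece."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_chunk, chunks, [target] * len(chunks))
        return any(results)


haystack = ["straw"] * 9_999 + ["needle"]
found = parallel_contains(haystack, "needle")
missing = parallel_contains(haystack, "pin")
print(found, missing)
```

In a real cluster the chunks live on different machines, so the work moves to the data instead of the data moving to one big computer.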

  10. How cheap? • Buying Oracle on Exadata for a 100 terabyte data warehouse costs about $6K per terabyte plus 20% annual maintenance • Add labor for a system administrator, tech refresh every five years & real estate • Buying Amazon Redshift (cloud data warehouse service) with a three year commitment is about $1K per terabyte per year • Administration, tech refresh, and real estate are included • You can rent computing by the hour • 64 cores/256 GB RAM $3.83/hour on demand (Amazon) • 31% off for one year commitment no upfront • 61% off for three year commitment all upfront • Could be as low as 43 cents per hour on auction • Cloud storage is $30/terabyte per month, less for archive
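
A back-of-the-envelope version of the slide 10 comparison, using only the prices quoted on the slide. The five-year horizon, charging maintenance on the purchase price in each of those years, and holding the Redshift rate flat for all five years are assumptions made for this sketch. Labor, tech refresh, and real estate for the on-premises option are left out, so its real cost would be higher.

```python
TERABYTES = 100
YEARS = 5  # assumed horizon, chosen to match the slide's tech-refresh cycle

# On-premises: $6K per terabyte up front, plus 20% of that per year.
exadata_purchase = 6_000 * TERABYTES
exadata_maintenance = exadata_purchase * 20 // 100 * YEARS
exadata_total = exadata_purchase + exadata_maintenance

# Cloud warehouse: about $1K per terabyte per year, everything included.
redshift_total = 1_000 * TERABYTES * YEARS

# Hourly compute: the slide's on-demand price and commitment discounts.
on_demand = 3.83
one_year_rate = round(on_demand * (1 - 0.31), 2)
three_year_rate = round(on_demand * (1 - 0.61), 2)

# Storage at $30 per terabyte per month.
storage_per_month = 30 * TERABYTES

print(f"Exadata over {YEARS} years:  ${exadata_total:,}")
print(f"Redshift over {YEARS} years: ${redshift_total:,}")
print(f"Hourly compute: ${on_demand} on demand, ${one_year_rate} (1 yr), ${three_year_rate} (3 yr)")
print(f"Storage for {TERABYTES} TB: ${storage_per_month:,} per month")
```

Under these assumptions the on-premises warehouse comes to $1.2 million against $0.5 million for the cloud service, before counting the extra on-premises costs the slide lists.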

  11. NoSQL (nonrelational) databases • These are the workhorses of transaction processing systems • And they’ve been around for decades • The new ones • Don’t require you to predefine what’s in the data • And that’s really important for unstructured data • And to be able to defer work, get data stored in-house, and wait until it becomes important
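
To make "don't require you to predefine what's in the data" concrete, here is a toy stand-in for a document store: records with different shapes are saved as-is, and the decision about which fields matter is deferred to query time. The in-memory list, the `save` and `find_with_field` helpers, and the sample records are all hypothetical; a real NoSQL system adds persistence, indexing, and distribution.

```python
import json

store = []  # stand-in for a document store; a real system would persist these


def save(document):
    """Store a record as-is, without checking it against a schema."""
    store.append(json.dumps(document))


# Three records with three different shapes, all accepted.
save({"patient": "A", "glucose_mg_dl": 142})
save({"patient": "B", "note": "reports blurred vision"})
save({"patient": "C", "device": {"type": "pump", "alarms": 2}})


def find_with_field(field):
    """Decide later which fields matter and pull only records that have them."""
    docs = (json.loads(raw) for raw in store)
    return [doc for doc in docs if field in doc]


notes = find_with_field("note")
print(notes)
```

The trade-off is that the cleanup work is postponed, not avoided: whoever eventually queries the data has to cope with every shape that was stored.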

  12. Big data sets • There are an increasing number of businesses with very large data sets arising from the ordinary course of business. Reasons include: • Side effect of production processes (such as log files in ecommerce) • Emergent from user interaction (such as social media) • Very large scale transaction processor (such as insurance companies) • Totally new data source (genomics) • Hadoop itself came from Google needing to manage its data • Scanning the text in the Library of Congress is 10-20 terabytes, so a petabyte (1,000 terabytes) is VERY big • Library of Congress would be a tenth of that if converted to text • Texas Medicaid is about 40 terabytes

  13. Very few entities have that much data • For health care providers, clinical images are the storage hogs • Not billing, not EMRs • More data is not necessarily better data • Watching a movie at 4K isn’t 16x more entertaining than HD streaming • Polls wouldn’t be more accurate if they called everybody • If the data is too inaccurate or biased or irrelevant it won’t help • But there is a good chance you have multiple related sources of data • Oracle and IBM came up with reasons for “big data” to be a useful concept in more places • You’ll hear about data lakes containing all kinds of data • And you’ll hear about the Four V’s

  14. Kinds of Data

  15. Where does health data come from? • Billing/Administrative data (Payers often have the most holistic view) • Medical records • Institution-specific system of record for clinicians caring for their patient • Medical devices (although mostly not stored centrally) • Clinical scale providers • Research systems • Pharmaceutical companies and research institutions • Government surveys • Such as CDC Pregnancy risk, AHRQ HCUP and Census American Community • Patients • Often aggregate their own data (which they have a right to) • Best source of information on social issues

  16. What health data can you get? • It’s not a coincidence that most health services research papers are based on Medicare billing data, come from the VA or Kaiser, or come from a research hospital at the center of a care network • They had data (and perhaps organizational slack) • You should have access to your own data (although you may have integration problems) • But absent participation in a sharing program you may have trouble getting data • Both HIEs and some companies facilitate pooling data

  17. Big models • Business intelligence and data visualization • Enable interactive exploratory data analysis on a collection of data sets • Still the first and most important thing to do • Text analysis • Such as looking at doctors’ notes to suggest diagnosis codes • Machine learning • Train a classifier on a sample and then score the entire population • Used for health risk scoring and fraud detection • Social network analysis (Analyze the links between entities) • Used for fraud detection and workflow analysis
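
"Train a classifier on a sample and then score the entire population" can be sketched with a deliberately simple method. The talk does not name an algorithm; the nearest-centroid classifier below is a stand-in chosen because it fits in a few lines of plain Python. The features (age, prior admissions), labels, and all records are hypothetical.

```python
def train_centroids(sample):
    """sample: list of (features, label) pairs.
    Returns the mean feature vector for each label."""
    sums, counts = {}, {}
    for features, label in sample:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def score(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=distance)


# Small labeled sample: (age, prior_admissions) -> risk label. Hypothetical data.
labeled_sample = [
    ((72, 4), "high"),
    ((80, 5), "high"),
    ((30, 0), "low"),
    ((41, 1), "low"),
]

# The unlabeled population to be scored. Features are left unscaled here,
# so age dominates the distance; a real model would normalize them first.
population = [(78, 3), (29, 0), (66, 4), (45, 1), (52, 2)]

centroids = train_centroids(labeled_sample)
risk = [score(centroids, person) for person in population]
print(list(zip(population, risk)))
```

The point of the pattern is cost: labels are expensive to get, so you pay for them on a sample and let the model extend them to everyone else.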

  18. What’s good for health care now? • Traditional business intelligence • Now with data visualization • Pop health triage • Run a model grouping claims into • Episodes of care • Create risk scores for individual patients • Identify care a patient should have received, but didn’t • Fraud detection • Cluster analysis with the goal of identifying practice/cost outliers • Social network analysis looking for suspicious referral patterns • Text analysis of clinician notes is more popular with providers • Computer-assisted coding
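
"Identify care a patient should have received, but didn't" is at heart a set difference over claims. The sketch below assumes a flat list of claim records with hypothetical `patient`, `diagnosis`, and `service` fields; real measures rely on standard code sets and measurement periods, which are omitted here.

```python
# Hypothetical claim records; field names and values are illustrative.
claims = [
    {"patient": "P1", "diagnosis": "diabetes", "service": "office_visit"},
    {"patient": "P1", "diagnosis": "diabetes", "service": "eye_exam"},
    {"patient": "P2", "diagnosis": "diabetes", "service": "office_visit"},
    {"patient": "P3", "diagnosis": "asthma", "service": "office_visit"},
    {"patient": "P4", "diagnosis": "diabetes", "service": "lab_a1c"},
]


def care_gaps(claims, condition, expected_service):
    """Patients with the condition who have no claim for the expected service."""
    with_condition = {c["patient"] for c in claims if c["diagnosis"] == condition}
    received = {c["patient"] for c in claims if c["service"] == expected_service}
    return sorted(with_condition - received)


gaps = care_gaps(claims, "diabetes", "eye_exam")
print(gaps)
```

A missing claim is not proof of missing care, since the exam may have been billed elsewhere, which is one reason payers with the most holistic view are well placed to run this kind of check.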

  19. What’d be good for health care later? • Democratizing analysis • It’s one thing for headquarters to calculate quality based on missed opportunities of care; it’s another to push that information out to patients and front-line clinicians • Time series for chronic illness • Most pop health tools take a year of data to predict a year. Chronic illness runs for decades • Patient-supplied data • Social work and clinical questionnaire information usually isn’t in EMRs • Medical device integration • Many have computers with audit trails but don’t export the data into EMRs • Merging administrative and clinical data, especially interorganizationally • Would rather base diagnoses on lab readings than diagnostic codes • Census integration (the missing denominator problem)

  20. Advice on how to get started • Do little data before you do big data • Make it right before you make it big • Explore with free tools and public data • Do exploratory analysis before you do statistics • Visualizing is easy to do and usually insightful • And easier to sell to your boss • Estimate on a sample before you do the population • Do a Kaggle competition
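
"Estimate on a sample before you do the population" in a few lines: draw a random sample, compute the estimate and its standard error, and only then decide whether the full run is worth it. The population here is synthetic (randomly generated stand-in claim costs), and the seed and sizes are arbitrary choices for the sketch.

```python
import random
import statistics

rng = random.Random(2016)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for a population of claim costs; not real data.
population = [rng.expovariate(1 / 1000) for _ in range(100_000)]

# Estimate the average cost from a 1% simple random sample.
sample = rng.sample(population, 1_000)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Computed here only to check the estimate; in practice this is the
# expensive step you are trying to postpone.
population_mean = statistics.mean(population)

print(f"sample estimate: {sample_mean:.0f} +/- {2 * standard_error:.0f}")
print(f"population mean: {population_mean:.0f}")
```

If the sample estimate is already precise enough for the decision at hand, the population run may not be needed at all.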

  21. What software am I recommending and why? • Microsoft PowerBI is a business intelligence and data visualization tool • The desktop is a free download; functionality also in current Excel • It cooperates with most data sources • Competes with Tableau ($2K, but there’s a less-functional free version) • R is an open source statistical programming language • Commercially supported versions from Microsoft, Oracle, and Tibco • Over 7,000 add-on packages available at no charge • Scales easily if you buy the corresponding database • Competes with SAS ($9K for desktop) • RStudio is an integrated development environment for R • Competes (sort of) with Visual Studio and Eclipse

  22. Software • Microsoft PowerBI • https://powerbi.microsoft.com/en-us/downloads/ • Microsoft R Open • https://mran.revolutionanalytics.com/download/ • RStudio • https://www.rstudio.com/products/rstudio/download/

  23. Education • Nice summary of R resources at https://www.rstudio.com/online-learning/ • edX, Coursera, and Datacamp are freemium online education providers • edX and Coursera are university MOOCs; classes are free; certification costs • Datacamp is software with training wheels; intro classes are free, otherwise $29/month • Kaggle organizes data science competitions • Which makes for a good learning environment – people share code and help each other out • There’s a lot of voluntary online support for open source

  24. Educational resources • Learning Health series • https://www.nap.edu/catalog/13301/the-learning-health-system-series • Microsoft PowerBI • https://powerbi.microsoft.com/en-us/guided-learning/ • https://www.edx.org/course/analyzing-visualizing-data-power-bi-microsoft-dat207x-3 • R • https://www.datacamp.com/ • https://www.edx.org/microsoft-professional-program-certficate-data-science • https://www.coursera.org/specializations/jhu-data-science • https://www.edx.org/course/analytics-edge-mitx-15-071x-2

  25. Cloud resources • Kaggle https://www.kaggle.com/ • Titanic competition https://www.kaggle.com/c/titanic • https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic/discussion • Stack Overflow http://stackoverflow.com/ • Data • https://www.healthdata.gov/ • Major cloud providers offer trial big data services • That’s Amazon Web Services, Microsoft Azure, and Google Cloud • Microsoft Bizspark has a particularly good deal for tech startups

  26. Summary • Need to establish relevant metrics based on data you can get • Broke down “big data” into four parts and came to the conclusion that the democratization of analysis and the drop in cost were more important than size • Identified education, tools, and data that you can take and use
