Introduction to Big Data & Data Science. Module 1. Disclosure.
Disclosure The following information has been developed with assistance and input from Kenneth J. Wilkins, PhD – Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institutes of Health (NIH). This information does not constitute an endorsement by NIDDK or NIH.
Module Objectives • Define big data & data science • List and describe the “4 Vs” • Provide examples of data and data sources
Data Science • Multidisciplinary approach to analysis and evaluation of information • The practice of using and applying data • The term “data science” may have been first coined by Bill Cleveland of AT&T/Bell Labs, now a proponent of the Open Science Tessera platform for big data analysis • Members of the growing segment of the workforce who identify as data scientists tend to have multi-sector experience in areas such as: • Mathematics/statistics • Computer science • Data architecture/algorithms • Informatics • Domain sciences (e.g. biology)
What is “Big Data”? • Large OR complex data sets that cannot be easily analyzed using traditional methods • This is our working definition to provide an introduction to the associated concepts • “Traditional methods” means different things to different people; it can refer to hardware issues or to the increased use of analytics for more sophisticated visualization, prediction, or estimation than broadly available tools (e.g. Microsoft Excel) provide • “Big Data” itself is a buzzword, a term whose strict definition is fluid and subject to debate even among those who work in the field • Users may disagree whether it’s the size (e.g. all of the data can’t fit into one computer’s memory) or the complexity (e.g. the data have many variables and observations) that qualifies it as “big data”
What is “Big Data”? Cont. • Often analyzed for patterns or trends • Used in “everyday life” to make predictions about: • Shopping/purchases • Weather • Video rental • Traffic patterns and driving assistance • Warehouses/stocking/supply chain • Disease development and associated factors • Disease treatment, outcomes, and demographics of those who have good outcomes, and by which treatment
Other Definitions of “Big Data” • What is Big Data? – NIH BD2K site • What is Big Data? – DataScience@Berkeley blog • Market-driven answers from industry: Forbes, IBM, SAS, etc. • Crowdsourced definition on Wikipedia: https://en.wikipedia.org/wiki/Big_data • Big Data: Biomedicine – BD2K documentary film collaboration with UCLA’s Keck School to convey the magnitude of this issue.
Big Data Research Big Data Research is often used to estimate correlations or associations, and is thus different from – but may inform the design of – trials that seek to identify cause-effect relationships. How do correlation and association differ from cause and effect? • Association may arise from a common cause or a common consequence (effect) • For example, both causation and association help explain why individuals with diabetes experience progressive loss of eyesight and nerve sensation – poor blood sugar control leads to damage in blood vessels that are essential to eye/nerve health • One diabetic complication does NOT cause another, but the fact that they co-occur frequently leads to a stronger association • A common consequence explaining why certain associations arise is illustrated by what happens when you select a convenient source for health data, e.g. a hospital
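The common-cause idea on this slide can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not an analysis of real patient data: the variable names, the standard normal distributions, and the equal-strength effects are all hypothetical choices made for illustration. A simulated blood sugar control score drives both eye damage and nerve damage, which never influence each other, yet the two complications are correlated until the common cause is removed.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_common_cause(n=5000, seed=0):
    """Return (raw, adjusted) correlations between two simulated complications.

    All quantities are hypothetical: 'glucose' stands in for poor blood
    sugar control, and each complication is glucose plus independent noise.
    """
    rng = random.Random(seed)
    glucose = [rng.gauss(0, 1) for _ in range(n)]
    eye = [g + rng.gauss(0, 1) for g in glucose]
    nerve = [g + rng.gauss(0, 1) for g in glucose]

    # Raw association: the complications co-occur because they share a cause.
    raw = pearson(eye, nerve)

    # Remove the common cause; what remains is independent noise.
    eye_resid = [e - g for e, g in zip(eye, glucose)]
    nerve_resid = [v - g for v, g in zip(nerve, glucose)]
    adjusted = pearson(eye_resid, nerve_resid)
    return raw, adjusted


raw, adjusted = simulate_common_cause()
print(f"raw correlation: {raw:.2f}, after removing common cause: {adjusted:.2f}")
```

Under these assumptions the raw correlation should sit near 0.5 and the adjusted correlation near 0, even though neither complication causes the other.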
Benefits of Big Data For individuals or consumers: • More effective and efficient treatments • For example – compare and contrast the effectiveness of two drugs (one formulary and one generic) • Using data from electronic health records to identify long-term effectiveness • Using data from FDA’s Sentinel system to identify side effects • Better coordinated care and decision support at the point of care • Ideally, your health information follows you across providers, allowing an accessible and unbiased presentation of evidence needed for shared decision-making
Benefits of Big Data For individuals or consumers (cont.): • Increased potential for open discussion of individualized care options • Better understanding of disease and treatment options – taking the patient’s personal preferences, lifestyle, etc. into consideration to determine the “effectiveness” of different treatment options • Transparency – the ACA’s Open Payments List of physicians’ non-practice funding adds transparency as to why physicians may guide patients to certain therapies • Promotion of patient-centered treatments and therapies • Opportunities to participate in initiatives • e.g. clinical trials, patient registries
Benefits of Big Data For researchers: • A move towards new and complementary forms of research to yield more comprehensive information and target novel areas for knowledge gain • Synthesizing multiple analyses to advance therapies’ effective reach • Improving a sense of what research avenues are NOT worth pursuing
Benefits of Big Data For providers/health systems and pharmaceutical companies: Providers/Health Systems • Increase in partnering with patients and use of evidence-based medicine and personalized treatments to impact patient-centered outcomes • Finding new methods to increase value and quality while decreasing cost • Informing better treatments and more individualized care – what works, and for whom? Pharmaceutical companies • Obtaining better, more effective, and cost-efficient therapies • Increasing capacity to screen candidate therapies and design trials • Decreased sponsor investment in FDA-required post-approval surveillance if able to leverage existing safety data
Data & Data Sources What is “Big Data” in healthcare? • Clinical Data • Doctors’ notes • Prescriptions • Lab images & notes • Insurance data • Electronic Health Record (EHR) data • Patient-reported data • Sensors • Social media posts • Wearables information
The 4 Vs of Big Data VOLUME VELOCITY VARIETY VERACITY Big Data is often described using “4 Vs” – volume, velocity, variety, and veracity. Some people have suggested there are more than 4 Vs (e.g. value). IBM has a great schematic of the 4 Vs: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Elements of Big Data “4 Cs” have also been suggested to go along with the 4 Vs: • Confidence – in the accuracy of data and data sources • Context – understanding who is asking the question and why • Choice – choosing the right data and platform for queries • Cognitive – understanding needs, judgements, and observations of data analysis
Velocity The speed of the data. • How quickly are the data recorded? Transmitted? Are the data “real time”? • When were the data last updated? Six months ago? Six years ago? • For example – medical tools that are informed by near real-time analytics: • Wearables that provide information such as heart rate, distance walked, and approximate calorie burn • Bedside monitors that provide basic physiological information about patients in a hospital • Vital sign monitoring to identify complications such as internal bleeding or infections
Volume The size of the data set. • How much data do we have? • The amount of data that is present • Independent observations, completely measured variables • Richness of variation in values needed to yield associations • Is the data set big enough to answer our questions with precision or to estimate associations? • Is it a sufficiently large amount of data to make an informed decision? • Remember, “big” can imply the complex structure or size of the data
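The link between volume and precision can be made concrete with the standard margin-of-error formula for a proportion. This is a minimal sketch: the function name, the 50% proportion, and the sample sizes are hypothetical values chosen for illustration, and the formula assumes independent observations, as the slide notes.

```python
import math


def margin_of_error(proportion, n, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes n independent observations; z=1.96 is the usual normal
    quantile for a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / n)


# Hypothetical sample sizes: each hundredfold increase in n
# shrinks the margin of error tenfold.
for n in (100, 10_000, 1_000_000):
    print(n, margin_of_error(0.5, n))
```

More observations narrow the random error only. A very large data set can still give a precise answer to the wrong question if the data are biased or the observations are not independent.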
Variety The different types of data being used. • What types of data do we have? • e.g. numerical, text, images • Where do the data come from? • e.g. insurance records, self-reports, EHRs • Are all people represented or only a subset (e.g. only those with a certain health insurance)? • The greater the variety of data, the more likely it is that associations will be seen • Associations may also be more likely due to chance with a greater variety of data Greater data variety may illuminate trends that may not be evident with only one data source.
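The point that chance associations become more likely as variety grows can be sketched with pure noise. Everything here is a hypothetical illustration: the function names, the 50 observations, and the 0.28 cutoff (roughly the conventional 5% significance threshold for a correlation with 50 observations) are assumptions, and no variable has any real relationship to the outcome.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_chance_hits(n_variables, n_obs=50, threshold=0.28, seed=1):
    """Count unrelated noise variables that look associated with the outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_variables):
        # Each candidate variable is independent noise by construction.
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(noise, outcome)) > threshold:
            hits += 1
    return hits


for k in (5, 50, 500):
    print(k, "variables screened:", count_chance_hits(k), "apparent associations")
```

Because about 1 in 20 unrelated variables clears the cutoff by chance, screening hundreds of variables is expected to produce dozens of apparent associations that mean nothing.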
Veracity The measure of how well the data reflect reality. • How accurate are the data? • Are the data trustworthy? • Are the data recorded accurately? • Are the data objective? • If the data are subjective, have the measures been evaluated for validity?
Big Data in Everyday Life Big Data is everywhere – consider the real-world examples and illustrations of the 4 Vs on the following slides. While reviewing, contemplate: • Is this an example of big data? Why or why not? • What kinds of data are being collected? • Did this example of big data lead to an accurate prediction? Why or why not?
Big Data in Everyday Life Example A: In 2008, Google researchers demonstrated that they had detected the peak of the flu season based on people’s Google searches. When people were sick with the flu, they often searched for flu-related information online. This provided almost an instant signal of overall flu prevalence, producing accurate estimates of flu prevalence two weeks earlier than the CDC’s data from healthcare providers. However, in subsequent years, Google’s ability to accurately identify the peak of the flu season based on Google search information varied. For example, in 2013, Google missed the peak of the flu season by 140%. Source: http://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/ This is an example of velocity in big data.
Big Data in Everyday Life Example B: In 2011, the BBC miniseries House of Cards was up for purchase, with actor Kevin Spacey and director David Fincher involved with the project. Netflix’s data on the preferences of its then over 26 million customers indicated that people who watched the original BBC series were also likely to watch movies starring Kevin Spacey and movies directed by David Fincher. Netflix’s communications director explained, “We know what people watch on Netflix, and we’re able with a high degree of confidence to understand how big a likely audience is for a given show based on people’s viewing habits.” Netflix was confident it had an audience for a remake of the series, and committed $100 million to the project. Source: http://www.salon.com/2013/02/01/how_netflix_is_turning_viewers_into_puppets/ This is an example of volume in big data.
Big Data in Everyday Life Example C: Microsoft’s Traffic Prediction Project is compiling data from sources such as historical traffic numbers, road cameras, traffic maps, and drivers’ social networks. The goal of the project is to be able to use established patterns to predict future traffic jams 15-60 minutes before they occur. Microsoft tested the model in London, Chicago, Los Angeles, and New York, reporting traffic jam prediction accuracy of 80% using traffic flow data alone. Researchers believe they can achieve 90% accuracy with the use of additional data sources. Source: http://venturebeat.com/2015/04/03/how-microsofts-using-big-data-to-predict-traffic-jams-up-to-an-hour-in-advance/ This is an example of variety in big data.
Big Data in Everyday Life Example D: With the rise of wellness programs, companies have initiated various exercise challenges to encourage employee fitness. Some companies, to ensure truthful reporting, have provided employees with activity trackers (such as Fitbits) that track heart rate, calories burned, steps taken, and sleep quality. However, several companies have reported instances of employees outsmarting their Fitbits. The most common workaround? Attaching the tracker to their dog. Researchers at Stridekick found that “Dogs typically do between 13,000 and 30,000 steps in a day.” Source: http://www.wsj.com/articles/want-to-cheat-your-fitbit-try-using-a-puppy-or-a-power-drill-1465487106 This is an example of veracity in big data.
Four Dimensions of Big Data • This schematic from FDA provides one way of envisioning illustrative examples of the 4Vs as four dimensions of big data.
Four Dimensions of Big Data: All known? There may be unmeasured, random variation in big data collection: • Multiscale, multilevel sources of variability • Patient/family – genetic and environmental risk, health and treatment history • Clinic/provider – different practice patterns and diagnostic approaches • Lab/insurer – distinct protocols or equipment, difference in coverage • Population/region – differing access to care, different risk factors in environment • Study/cohort/collection system – choice of data driven by distinct reasons • For reliable conclusions, account for all heterogeneity • Methods may indirectly assume that data comes from common processes • Use distinct methods for recording and presenting data
Four Dimensions of Big Data There can also be non-random or systematic variation in big data collection: • Forms of bias, due to convenience of “found” data • Opportunistic data collection is a frequent feature of big data – in other words, it is “found” data • Two forms predominate in observational big data • Selection bias – “found” samples of observations are selected rather than randomly drawn, and the completely measured observations are not a random sample of all observations that are part of the research process • Confounding bias – associations between health outcomes and other variables may be due to other factors, e.g. sicker patients tend to experience more side effects
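Selection bias from a convenient data source such as a hospital can be sketched with a short simulation. This is an illustration under stated assumptions, not a model of any real population: the two conditions, their 20% prevalence, and the 5% admission rate for people with neither condition are hypothetical. The two conditions are independent in the full population, yet they appear strongly associated among hospitalized patients only.

```python
import random


def risk_difference(people):
    """P(B | A) minus P(B | not A) among (has_a, has_b) pairs."""
    with_a = [b for a, b in people if a]
    without_a = [b for a, b in people if not a]
    return sum(with_a) / len(with_a) - sum(without_a) / len(without_a)


def simulate_hospital_selection(n=20000, seed=0):
    """Return risk differences in the whole population and in the hospital."""
    rng = random.Random(seed)
    # Two hypothetical conditions, independent by construction.
    population = [(rng.random() < 0.2, rng.random() < 0.2) for _ in range(n)]
    # Anyone with either condition is admitted; a few others are admitted
    # for unrelated reasons. Admission is a common consequence of A and B.
    hospital = [(a, b) for a, b in population
                if a or b or rng.random() < 0.05]
    return risk_difference(population), risk_difference(hospital)


pop_rd, hosp_rd = simulate_hospital_selection()
print(f"population risk difference: {pop_rd:.2f}")
print(f"hospital-only risk difference: {hosp_rd:.2f}")
```

The population risk difference should be close to zero, while the hospital-only figure is strongly negative. The association is created entirely by how the observations were selected.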
Four Dimensions of Big Data: Any Unknown? • It is important to have domain experts (e.g. scientists, clinicians, patients) inform assumptions of data collection • Does data capture changes with health status, or are there missing values? • The proportion of the population with available variables • Certain assumptions are unverifiable and do not have data to check them (e.g. selection of observations) • Explore how different the assumptions would have to be to change conclusions: • Sensitivity analysis – to “stress test” findings numerically • “Tipping point” analysis – to quantify how distinct assumptions must be in order to change conclusions
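Sensitivity and tipping-point analysis can be sketched with simple arithmetic on missing outcomes. The counts below are invented for illustration, and the function names are hypothetical: 100 treated patients, 45 with a recorded good outcome, 30 with the outcome missing, compared with a 55% good-outcome rate in a fully observed comparison group. The unverifiable assumption is the good-outcome rate among the 30 patients with missing data.

```python
def treated_rate(observed_good, n_missing, n_total, assumed_rate_missing):
    """Overall good-outcome rate if missing patients succeed at the assumed rate."""
    return (observed_good + n_missing * assumed_rate_missing) / n_total


def tipping_point(observed_good, n_missing, n_total, control_rate, steps=100):
    """Smallest assumed rate among missing patients at which the treated
    group matches or beats the comparison group, or None if it never does."""
    for i in range(steps + 1):
        assumed = i / steps
        if treated_rate(observed_good, n_missing, n_total, assumed) >= control_rate:
            return assumed
    return None


# Sensitivity analysis: stress test the finding over a range of assumptions.
for assumed in (0.0, 0.25, 0.5, 0.75, 1.0):
    rate = treated_rate(45, 30, 100, assumed)
    print(f"assumed rate among missing = {assumed:.2f} -> treated rate = {rate:.3f}")

# Tipping-point analysis: where does the conclusion change?
print("tipping point:", tipping_point(45, 30, 100, 0.55))
```

If domain experts consider a good-outcome rate below the tipping point implausible for the patients with missing data, the conclusion is robust to that assumption. If such a rate is plausible, the finding should be reported with that caveat.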
Self-Led Activity: Review & Reflect Review the key concepts covered during this module and reflect on your role as a patient advocate. Should any of your own work be shared with other participants to help everyone learn from experience? Do such concepts change how you view planned “big data” and health research efforts?
Upon Completing this Module… You should be able to: • Understand the origins of big data terminology • Understand the multiple, varying definitions for “Big Data” • List several benefits of using big data • Describe how big data can be used by – and be useful for – different populations • Define the fundamental elements of big data (“4 Vs”)