Big Earth Sciences Data – From Descriptive to Prescriptive Analytics. Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/
Big Earth Sciences Data – From Descriptive to Prescriptive Analytics Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup April 11, 2014 DRAFT for April 17 and May 6
Overview • Must Read Articles: • Big Data – From Descriptive to Prescriptive: • Follow the progression from: What Happened (descriptive analytics), Why Did It Happen (correlation analytics), What Will Happen Next (predictive analytics), and What Should I Do About It (prescriptive analytics). We agree and will follow this framework. • The Sexiest Job of the 21st Century is Tedious, and that Needs to Change: • Data preparation (easy) and coding (minimal). We agree and use NodeXL & Spotfire. • Practical illustration of Map-Reduce (Hadoop-style), on real data • Excellent. I have been looking for something like this. • New Book: Developing Analytic Talent - Becoming a Data Scientist: • Eight Chapters (buy book) and Six Addendums (free) • Excellent. Each Chapter has a Summary.
Overview (continued) • My Story: From Data Science Central to Data Science Results: • Data Science Central is: • Online Resource for Big Data Practitioners: Robust Editorial Platform, Social Interaction, Forum-Based Technical Support, Latest in Technology Tools and Trends, and Industry Job Opportunities. Very comprehensive. • Data Science Central Data Science Results are: • What Happened (descriptive analytics): Registered meteorites that has impacted on Earth visualized. I did this. • My Story: Big Earth Sciences Data – From Descriptive to Prescriptive Analytics: • April 17th ESIP Earth Sciences Analytics Meeting and May 6th Federal Big Data Working Group Meetup • Find environmental/climate change data sets for Why Did It Happen (correlation analytics), What Will Happen Next (predictive analytics), and What Should I Do About It (prescriptive analytics). We are working on this.
Data Science Central: 9 “must read” articles My Selection and Other Internal Links: See next slides. http://www.datasciencecentral.com/profiles/blogs/9-must-read-articles
My Selection (Vincent Granville): List • Big Data – From Descriptive to Prescriptive • Can big data be racist? • NodeXL Graph Gallery: Graph Details • Best Metrics For Digital Marketing: Rock Your Own And Rent Strategies • Big Data: from mining to meaning • Beautiful versus useful visualizations (in French, but interesting) • Learning and Teaching Machine Learning: A Personal Journey • Big data techniques and technologies • The Sexiest Job of the 21st Century is Tedious, and that Needs to Change (*) My Note: See next slides. • From the trenches: 360-degree data science
Big Data – From Descriptive to Prescriptive Visually communicating the value of Big Data is challenging because of the need to convey different concepts simultaneously. These charts plot analytical complexity against some sort of business value measurement in a positive correlation that looks entertainingly similar to human evolution charts we’ve all seen, with man becoming more upright and intelligent with time. Regardless of graphic representation, they all follow the progression from (1) What Happened (descriptive analytics), (2) Why Did It Happen (correlation analytics), (3) What Will Happen Next (predictive analytics), and (4) What Should I Do About It (prescriptive analytics). This chart is unique in that it goes all the way back to the beginning when data is first created and gathered in raw form. So much of the resources needed to develop prescriptive analytics takes place in the very early stages of the process. Source: SAP
My Selection (Vincent Granville): Footnote and Comment • * I (Vincent Granville) disagree with this Harvard Business Review author. Senior data scientists work on high level data from various sources, use automated processes for EDA (exploratory analysis) and spend little to no time in tedious, routine, mundane tasks (less than 5% of my time, in my case). I also use robust techniques that work well on relatively dirty data, and ... I create and design the data myself in many cases. • My (Brand Niemann) experience as a senior data scientist is similar in that I find the data preparation to be a very interesting and worthwhile activity that informs my data science results and story and I actually delight in "creating and designing the data myself in many cases." It is the art in the data science work along with the resulting visualizations. I also avoid coding if at all possible by using Spotfire.
My Selection (Vincent Granville): Other Internal Links • 17 short tutorials all data scientists should read (and practice) • 10 types of data scientists My Note: Actually 9. See next slide. • 66 job interview questions for data scientists • Data Science Certification • Update about our Data Science Apprenticeship • Our Wiley Book on Data Science • Data Science Top Articles • Our Data Science Weekly Newsletter • Practical illustration of Map-Reduce (Hadoop-style), on real data • What makes up data science? • Data science webinars • Data science competition
Six Categories of Data Scientists • Those strong in statistics: • They sometimes develop new statistical theories for big data that even traditional statisticians are not aware of. • Those strong in mathematics: • NSA (national security agency) or defense/military people working on big data. • Those strong in data engineering: • Hadoop, database/memory/file systems optimization and architecture. • Those strong in machine learning • Those strong in business • Those strong in production code development, software engineering • Those strong in visualization • Those strong in GIS, spatial data, data modeled by graphs, graph databases • Those strong in a few of the above. (Vincent Granville) My Note: I suggest you read his interesting commentary on this at http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists
Update About our Data Science Apprenticeship - March 10, 2014 • At the request of many prospective participants, here's an update about our DSA (Data Science Apprenticeship): • Stage 1 (Available now): DIY (do-it-yourself) for self-learners: material is available for free throughout DSC, including data sets and projects to work on. No registration required. Get started by checking our most recent announcements. My Note: See next slide • Stage 2 (April 2014): Participants will purchase our Wiley book as well as our data science cheat sheet to get jump-started. • Stage 3: Projects will be evaluated for a fee, and a certification delivered. • Also, I have added a few large data sets, new projects and more material. • If you have already earned a data science certificate or diploma, but were not asked to develop and use your own API in batch mode, and harvest/work on a data set with at least 50 million observations in a distributed environment, then it's time to learn the real stuff that will land you a real job! My Note: I added this bolding. http://www.datasciencecentral.com/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship
Update About Our Data Science Apprenticeship - March 29, 2014 • Here are six important updates: • Our book will be on the market by April 7. Check the updated table of contents (PDF document) and download the additional material not published in the book. My Note: See next slides • We have added new tutorials, projects and data sets: check the starred items. My Note: See previous slide for URL • Successful candidates will automatically become certified data scientists. My Note: See next slides • There is now one project (involving creating and working on simulated data) that you can work on to complete our program: click here for details. More projects will be considered later, but right now, we only have one reviewer (Dr. Granville) to grade submitted contributions. The good thing is that the apprenticeship is still free for now - even better, you can earn $1,000 by completing this project. My Note: See next slides • We will soon add a test that applicants will have to complete, as part of the apprenticeship. Many of the questions have answers in our book. Different questions will be sent to each candidate, via e-mail. • Also, we are making progress on writing our data science cheat sheet. A preliminary version can be found here, but it will be much more comprehensive and useful when completed, within the next 30 days. My Note: Now it is 17 short tutorials all data scientists should read (and practice) http://www.datasciencecentral.com/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship-march-29-2014
Developing Analytic Talent • Acknowledgments and Introduction • Chapter 1: What is Data Science? • Chapter 2: Big Data is Different • Chapter 3: Becoming a Data Scientist • Chapter 4: Data Science Craftsmanship - Part I • Chapter 5: Data Science Craftsmanship - Part II • Chapter 6: Data Science Applications - Case Studies • Chapter 7: Launching Your New Data Science Career • Chapter 8: Data Science Resources • Addendum (released free) • 1. Nine Categories of Data Scientists • 2. Practical Illustration of Map-Reduce (Hadoop-Style), on Real Data • 3. Answers to Job Interview Questions • 4. Additional Topics • 5. Improving Visuals • 6. Essential Features for any Database, SQL or NoSQL http://semanticommunity.info/Data_Science/Data_Science_Central#Developing_Analytic_Talent
Data Science Certification • This group is for Certified Data Scientists only. There are three ways to become a Certified Data Scientist: • Join this group; there is no cost, but only Data Science Central members are allowed. Your profile will be reviewed in 2-3 days. Based on your experience (two years of practice minimum, in an analytic, data-intensive occupation, with success stories) you may be accepted, regardless of your actual job title (data scientist, statistician, analytics manager, operations research analyst etc.). • Or you successfully complete our Data Science Apprenticeship (DSA). Join this group and mention the DSA on your profile, and you will automatically be approved. • Or (coming soon) you are certified by or graduated from a program managed by one of our partner universities and organizations. • Once approved, you can add our certification to your profile (LinkedIn, resume, etc.) and be found by companies and organizations looking for serious data scientists and related professionals. See also our data science handbook (aka new book: Developing Analytic Talent) to learn core data science principles, featuring salary surveys, job interview questions, reference books, skills to acquire, sample resumes, difference between data scientist and other analytic professions, big data, case studies, becoming a freelance data scientist, Map-Reduce, Hadoop, and data science tricks, recipes, rules of thumb and tutorials. My Note: “difference between data scientist and other analytic professions” http://www.datasciencecentral.com/group/data-science-certification
Write a data science research paper and win fame and award • In connection with our proposed methodology to create a black-box, automated, easy-to-interpret, sample-based, robust technique called jackknife regression, to be used in small and big data environments by non-statisticians, we offer an award and massive promotion to the successful candidate who: • 1. Provides the exact formulas for the solution of the 2x2, 3x3 and 4x4 linear systems of equations described in section 3.2 in my recent article (this is straightforward) • 2. Performs more tests on simulated data (say 10 data sets, each with 10,000 observations) to compare my methodology (with one and two M's computed on the first 100 observations) with full classical regression. The test must include data with strong correlation structure, and data with up to n=20 independent variables. Comparison should be about (i) accuracy and (ii) sensitivity to small changes in the data set (measured e.g. via confidence intervals for regression coefficients, both for classical regression and my methodology) • This project must be completed by August 31, 2014. You will be authorized to publish a paper featuring your research results (with your name as main or only author), and your results will also be published on Data Science Central, and seen by tens of thousands of practitioners. Your article must meet professional quality standards similar to those required by leading peer-reviewed statistical journals. Payment will be sent after completion of the project. Depending on the success of this initiative, and the quality of participants, we might offer more than one award. • Read details here (see section 5). My Note: See URLs below http://www.datasciencecentral.com/profiles/blogs/jackknife-logistic-and-linear-regression http://www.datasciencecentral.com/profiles/blogs/write-a-data-science-research-paper-and-win-fame-and-money
Dr. Vincent Granville: A Visionary Data Scientist After 20 years of experience across many industries, big and small companies (and lots of training), I'm strong both in stats, machine learning, business, mathematics and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separated (the silo mentality). Indeed, that's the very reason why data science was created. http://www.datasciencecentral.com/profile/VincentGranville
Big Data – From Descriptive to Prescriptive Examples • What Happened (descriptive analytics) • Data Science Central: Registered meteorites that has impacted on Earth visualized • Why Did It Happen (correlation analytics), • In process • What Will Happen Next (predictive analytics), and • In process • What Should I Do About It (prescriptive analytics) • In process My Note: See Forecasting Meteorite Hits, pages 248-252.
How was the data collected? http://semanticommunity.info/Data_Science/Data_Science_Central#Registered_meteorites_that_has_impacted_on_Earth_visualized
Where is the data stored? http://semanticommunity.info/@api/deki/files/27220/meteors.xlsx
What were the results? Web Player
What is the data story? • Vincent Granville is interested in seeing this info visually summarized in 5 dimensions, as follows: • 2 dimensions for the location: Mouse over to see Latitude and Longitude • 1 dimension for the size (represented by radius): Mouse over to see mass in 5 bins • 1 dimension for the type (represented by color): Mouse over to see type of meteorite • Click on a point to see Details-on-Demand. Then Unmark Marked Rows • 1 dimension for time: turning this static image into a video, where each second represents (say) one year: Use the Filter on the right to select Year. Then Reset All Filters • QED http://semanticommunity.info/Data_Science/Data_Science_Central#From_Data_Science_Central_to_Data_Science_Results
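The five-dimension encoding described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Spotfire configuration behind the Web Player: the `Meteorite` record, the field names, the sample rows, the equal-count binning rule, and the radius formula are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Meteorite:
    # Hypothetical record layout; the real spreadsheet columns may differ.
    name: str
    lat: float
    lon: float
    mass_g: float
    kind: str
    year: int

def mass_bin_edges(masses, n_bins=5):
    """Cut points that split the sorted masses into n_bins equal-count bins (assumed rule)."""
    ordered = sorted(masses)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

def mass_bin(mass, edges):
    """Index of the bin (0 = lightest) that a mass falls into."""
    return sum(mass >= edge for edge in edges)

def encode(records, year, n_bins=5):
    """Map each record of the selected year to x/y position, marker radius, and color index."""
    edges = mass_bin_edges([r.mass_g for r in records], n_bins)
    color_of = {kind: i for i, kind in enumerate(sorted({r.kind for r in records}))}
    return [
        {
            "name": r.name,
            "x": r.lon,                                  # dimensions 1 and 2: location
            "y": r.lat,
            "radius": 2 + 2 * mass_bin(r.mass_g, edges),  # dimension 3: size in 5 bins
            "color": color_of[r.kind],                   # dimension 4: type
        }
        for r in records
        if r.year == year                                # dimension 5: time filter
    ]

# Made-up sample rows for illustration only.
SAMPLE = [
    Meteorite("m1", 10.0, 20.0, 10.0, "iron", 2000),
    Meteorite("m2", -5.0, 30.0, 100.0, "stone", 2000),
    Meteorite("m3", 40.0, -100.0, 1000.0, "stone", 2001),
    Meteorite("m4", 55.0, 60.0, 10000.0, "iron", 2001),
    Meteorite("m5", -30.0, 140.0, 100000.0, "stone", 2002),
]

if __name__ == "__main__":
    for point in encode(SAMPLE, 2001):
        print(point)
```

Calling `encode` once per year and rendering the frames in sequence would give the "one second per year" animation the slide mentions.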
Developing Analytic Talent • Acknowledgments and Introduction: • Book publishing is like data scientist turning unstructured into structured data • Data Science Central is the leading data science community and a modern, lean start-up focused on value • How this book is structured: What data science & big data is, Career training resources, & Technical Material • Chapter 1: What is Data Science? • Chapter 2: Big Data is Different • Chapter 3: Becoming a Data Scientist • Chapter 4: Data Science Craftsmanship - Part I • Chapter 5: Data Science Craftsmanship - Part II • Chapter 6: Data Science Applications - Case Studies • Chapter 7: Launching Your New Data Science Career • Chapter 8: Data Science Resources • Addendum (released free): • 1. Nine Categories of Data Scientists • 2. Practical Illustration of Map-Reduce (Hadoop-Style), on Real Data • 3. Answers to Job Interview Questions • 4. Additional Topics • 5. Improving Visuals • 6. Essential Features for any Database, SQL or NoSQL http://semanticommunity.info/Data_Science/Data_Science_Central#Developing_Analytic_Talent
Chapter 1: What is Data Science? • Real Versus Fake Data Science: 2 • Repackaging old material like statistics and R programming with the new label “data science.” • See Chapter 2 for what MapReduce can’t do. • The Data Scientist: 3 • ETL (extract/transform/load) is for data engineers and DAD (discover/access/distill) is for data scientists • Data Science Applications in 13 Real-World Scenarios: 13 • Chapters 4 and 5 discuss solutions to such problems. • Data Science History, Pioneers, and Modern Trends: 4 • Data scientist is broader than data miner, and encompasses data integration, data gathering, data visualization (including dashboards), and data architecture. Data scientist also measures ROI on data science activities. • I have a few examples of “light analytics” doing better than sophisticated architectures in Chapter 6. • The big data ecosystem is discussed in Chapter 2. • Summary: • What data science is not, including how traditional degrees will have to adapt as business and government evolve.
Chapter 2: Big Data is Different • Two Big Data Issues: 2 • The “curse” and the rapid data flow. • Examples of Big Data Techniques: 3 • Excel with 100 Million Rows: Use the PowerPivot add-in from Microsoft to work with large datasets. • What MapReduce Can’t Do: 3 • Problems requiring massive computations. • Communication Issues: 1 • It’s definitely a people/organization issue. • Data Science: The End of Statistics?: 3 • See how modern statistics can help make data science better. • The Big Data Ecosystem: 1 • It consists of products and services (hardware, cloud providers, data integration and database vendors, dashboards, visualization tools, and data science and analytic tools). My Note: Why I like TIBCO Spotfire. • Summary: • Why standard statistical techniques fail when blindly applied to big data. • In general solutions include sampling and/or compression in cases where it makes sense. • Data science is more than data analysis, computer science, or statistics.
Chapter 3: Becoming a Data Scientist • Key Features of Data Scientists: 2 • Horizontal knowledge is important. D.J. Patil, previously a chief data scientist at LinkedIn, is now Data Scientist in Residence at Greylock Partners that advises In-Q-Tel (CIA) on investments. • Types of Data Scientists: 4 • Fake, Self-Made, Amateur, and Extreme (developing powerful, robust predictive solutions without any statistical models) • Data Science Demographics: 1 • Data science websites attract highly educated, wealthy males, predominantly of Asian origin, living mostly in the U.S. • Training for Data Science: 3 • University Programs (8), Corporate and Association Training Programs (7), and Free Training Programs (Coursera.com and Data Science Central) • Data Scientist Career Paths: 2 • The Independent Consultant and The Entrepreneur (see 13 Startup Ideas for Data Scientists) • Summary: See the above!
Chapter 4: Data Science Craftsmanship - Part I • New Types of Metrics: 2 • Choosing Proper Analytic Tools: 4 • Visualization: 2 • Statistical Modeling Without Models: 3 • Three Classes of Metrics: Centrality, Volatility, and Bumpiness: 4 • Statistical Clustering for Big Data: 1 • Correlation and R-Squared for Big Data: 2 • Computational Complexity: 2 • Structured Coefficient: 1 • Identifying the Number of Clusters: 2 • Internet Topology Mapping: 1 • Securing Communications: Data Encoding: 1 • Summary: This is the most technical chapter in the book, based on articles first published at Data Science Central. It covers many different techniques, recipes, and topics so you can reproduce them when needed.
Chapter 5: Data Science Craftsmanship – Part II • Data Dictionary: 2 • One of the most valuable tools when performing exploratory data analyses. My Note: I agree! • Hidden Decision Trees: 3 • Model-Free Confidence Intervals: 4 • The first Analyticbridge Theorem, which provides a simple, model-free, nonparametric way to compute confidence intervals without statistical theory or knowledge. • Random Numbers: 1 • Four Ways to Solve a Problem: 4 • Causation Versus Correlation: 1 • In all contexts, using predictors that are directly causal typically helps reduce the variance in the model and yields more robust solutions. • How Do You Detect Causes?: 1 • Life Cycle of Data Science Projects: 1 • Predictive Modeling Projects: 1 • Predictive Modeling Mistakes: 1 • Logistic Related Regressions: 4 • Experimental Design: 3 • Analytics as a Service and APIs: 3 • Miscellaneous Topics: 4 • New Synthetic Variance for Hadoop and Big Data: 8 • Summary: The topics discussed in this chapter are typically classified as data analyses rather than statistical or computer analyses. Most of the material has not been published before. Traditional statisticians typically don’t learn or use these techniques, but data scientists do.
Chapter 6: Data Science Applications - Case Studies • Stock Market: 7 • Encryption: 3 • Fraud Detection: 11 • Digital Analytics: 9 • Miscellaneous: 6 • Forecasting Meteorite Hits: • Define the scope of the analysis: This is a small project to be completed in 10 hours of work or less, billed at $100/hour. Provide the risk of meteorite hit per year per meteorite size. • Summary: 36 Case Studies, Real-Life Applications, and Success Stories
Chapter 7: Launching Your New Data Science Career • Job Interview Questions: http://bit.ly/1cGlFA5 • Testing Your Own Visual and Analytical Thinking • From Statistician to Data Scientist: http://bit.ly/197Jsfa (160 comments on LinkedIn) • Taxonomy of a Data Scientist: • Top Data Scientists on LinkedIn: Kirk Borne-Analytics (0.00 with Vincent Granville=0.38), Big Data (0.15 with Milind Bhandarkar=0.54), Data Mining (0.45 with Dean Abbott=0.46), Machine Learning (0.39 with Monica Rogati=0.43), and Purity (0.70 with Dean Abbott=0.93) • 400 Data Scientist Job Titles: http://bit.ly/11WhOcu (from 10,000 data scientists in LinkedIn network) • Salary Surveys: http://bit.ly/1dmCouo • Summary: See the above!
Chapter 8: Data Science Resources • Professional Resources: • Data Sets: http://bit.ly/W2HTJU • Books: 100+ (some free) • Conferences and Organizations: Vendors (e.g. SAS), Professional Societies (e.g. INFORMS), and Conference Organizers (O’Reilly Strata) • Websites: http://bit.ly/lghDR7K (add your own) • Definitions: http://bit.ly/l8UcD7c • Career Building Resources • Companies Employing Data Scientists: 21 leading and 6,000+ at http://bit.ly/19vRlNV • Sample Data Science Job Ads: http://bit.ly/1hVAmr7 • Sample Resumes: http://bit.ly/1j4PNuP • Summary: See the above!
Addendum: 2. Practical Illustration of Map-Reduce (Hadoop-Style), on Real Data
Start: Extract/summarize data from, say, large log files
Map: Create a hierarchical database
Reduce: High-level summaries corresponding to rules
Finish: Find result (e.g. credit card fraud)
• Goal: Build a system to score Internet clicks (50M) (“click data”):
  • Extract relevant fields (e.g. 6 of 60)
  • Build a summary table: the Map step (text file like in Hadoop)
    • Split the big data into smaller data sets (say 20), called subsets, and perform this operation separately on each subset
  • Build a summary table: the Reduce step
    • Simple merging will not work at this scale, so sort each subset by a key field and merge the sorted subsets to produce a big summary table S, which is much more manageable and compact, although still far too large to fit in Excel.
  • Create a rule set by building less granular summary tables, on top of S, and testing.
• Improvements:
  • New technology to “split / sort subsets / merge and aggregate” faster and better
• Conclusions:
  • The granular table S (and the way it is built) is similar to the Hadoop architecture.
My Note: I do not find this to be very interesting data science, but it is the way to make money!
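The split / sort / merge-and-aggregate steps on this slide can be sketched in plain Python. This is a minimal in-memory illustration, not the author's production code: the field names (`ip`, `day`), the choice of a click count as the summary, and the helper names are assumptions made for the example. At real scale each subset would be a file on disk rather than a list.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical key fields; the slide only says "e.g. 6 of 60" relevant fields.
KEY_FIELDS = ("ip", "day")

def extract_key(row):
    """Keep only the fields used as the summary key."""
    return tuple(row[field] for field in KEY_FIELDS)

def split(rows, n_subsets):
    """Split the full data set into n_subsets smaller subsets."""
    return [rows[i::n_subsets] for i in range(n_subsets)]

def map_step(subset):
    """Summarize one subset: count clicks per key, returned sorted by key."""
    counts = {}
    for row in subset:
        key = extract_key(row)
        counts[key] = counts.get(key, 0) + 1
    return sorted(counts.items())

def reduce_step(sorted_summaries):
    """Merge the sorted per-subset summaries and add up counts for equal keys."""
    merged = heapq.merge(*sorted_summaries, key=itemgetter(0))
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]

def summarize(rows, n_subsets=20):
    """Build the summary table: Map on each subset, then one Reduce."""
    return reduce_step(map_step(subset) for subset in split(rows, n_subsets))

if __name__ == "__main__":
    # Made-up click rows for illustration only.
    clicks = [
        {"ip": "10.0.0.1", "day": 1, "agent": "x"},
        {"ip": "10.0.0.2", "day": 1, "agent": "y"},
        {"ip": "10.0.0.1", "day": 1, "agent": "x"},
    ]
    print(summarize(clicks, n_subsets=2))
```

Because every per-subset summary is already sorted by key, `heapq.merge` only ever needs to look at the head of each subset, which is why the sorted merge scales where a naive merge does not.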
Big Data – From Descriptive to Prescriptive Example: Forecasting Meteorite Hits, pages 248-252. • What Happened (descriptive analytics) • Data Science Central: Registered meteorites that has impacted on Earth visualized (original data set) • Why Did It Happen (correlation analytics), • See next slides • What Will Happen Next (predictive analytics), and • See next slides • What Should I Do About It (prescriptive analytics) • See next slides
Forecasting Meteorite Hits
• Statistical analysis in 8 steps:
  • Define the scope of analysis: 10 hours of work or less, billed at $100/hour, to provide the risk of a meteorite hit per year per meteorite size.
  • Identify data and caveats (URL did not work)*: http://osm2.cartodb.com/tables/2320/public#/map
  • Data cleaning: The data seem comprehensive but are messy. Discard data prior to 1900.
  • Exploratory analysis: Strong patterns emerge despite the messy data; for example, smaller meteorites are now detected because of the growing surface of inhabited land and better instruments.
  • The actual analysis, in an Excel spreadsheet with data and formulas (Vincent Granville): http://bit.ly/1gaiIMm
  • Model selection: Two decades show relatively good pattern stability and recency: 2000-2010 and 1990-2000.
  • Prepare forecasts: Yearly_Occurrences(weight) = 1 / (A + B * log(weight)). The “every 40 year” claim for the 2013 Russian bang is plausible.
  • Follow-up: A more detailed analysis would involve predictions broken down by meteor type (iron and water), angle, and velocity. The impact of population growth could also be assessed in this risk analysis.
* Source Web Site: Download entire data set (see Excel): it's a 7MB spreadsheet consisting of 34,513 meteorites, last updated in 2012.
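The forecast formula Yearly_Occurrences(weight) = 1 / (A + B * log(weight)) becomes linear once both sides are inverted: 1 / Yearly_Occurrences = A + B * log(weight). The sketch below fits A and B by ordinary least squares on that reciprocal form. It is an illustration only: the slide does not say how the spreadsheet estimates A and B, the natural logarithm is an assumption, and the numbers in the example are synthetic rather than taken from the meteorite data.

```python
import math

def fit_reciprocal_log(weights, yearly_occurrences):
    """Least-squares fit of 1/y = A + B*ln(w); returns (A, B).

    Assumes natural log and strictly positive inputs. Fitting the
    reciprocal is a convenience and not the same as minimizing error in y.
    """
    xs = [math.log(w) for w in weights]
    zs = [1.0 / y for y in yearly_occurrences]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    b = sxz / sxx
    a = mean_z - b * mean_x
    return a, b

def yearly_occurrences(weight, a, b):
    """Expected number of hits per year for meteorites of the given weight."""
    return 1.0 / (a + b * math.log(weight))

def return_period_years(weight, a, b):
    """Average number of years between hits of the given weight (1 / yearly rate)."""
    return a + b * math.log(weight)

if __name__ == "__main__":
    # Synthetic observations generated from made-up A and B, for illustration only.
    true_a, true_b = 0.5, 0.2
    weights = [1.0, 10.0, 100.0, 1000.0]
    observed = [yearly_occurrences(w, true_a, true_b) for w in weights]
    a, b = fit_reciprocal_log(weights, observed)
    print(a, b, return_period_years(1e6, a, b))
```

A claim such as "one hit of this size every 40 years" can then be checked by comparing it with `return_period_years` at the relevant weight, using coefficients fitted to the cleaned post-1900 data.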
Where is the data stored? And What are the results? • http://bit.ly/1gaiIMm
Why Did It Happen (correlation analytics) and What Will Happen Next (predictive analytics)? Web Player
What Should I Do About It (prescriptive analytics) LSST = Large Synoptic Survey Telescope: http://www.lsst.org/
What Should I Do About It (prescriptive analytics) • Professor Kirk Borne - My current research is focused on outlier detection, which I prefer to call Surprise Discovery – finding the unknown unknowns and the unexpected patterns in the data. These discoveries may reveal data quality problems (i.e., problems with the experiment or data processing pipeline), but they may also reveal totally new astrophysical phenomena: new types of galaxies or stars or whatever. That discovery potential is huge within the huge data collections that are being generated from the large astronomical sky surveys that are taking place now and will take place in the coming decades. I haven’t yet found that one special class of objects or new type of astrophysical process that will win me a Nobel Prize, but you never know what platinum-plated needles may be hiding in those data haystacks. • See more at: http://www.eeriedigest.com/wordpress/2013/01/taem-interview-with-dr-kirk-borne-of-george-mason-university/ Dr. Kirk Borne of George Mason University