Musings on Data Science and Students Experiencing Data Analytics
New England SENCER Center for Innovation
Prof. Randy Paffenroth
Data Science Program, Department of Mathematical Sciences
Worcester Polytechnic Institute
rcpaffenroth@wpi.edu
2014
My Research
[Image: "Internet Connectivity Access layer" by User:Ludovic.ferre - Internet_Connectivity_Overview2_Access.svg. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg]
This is a panel, so I want to be provocative!
• Provocative (adjective): 1. tending or serving to provoke; inciting, stimulating, irritating, or vexing.
• So, I will be a little sad if I don’t end up irritating anyone.
The first war: Terminology
• Analyzing data has a long history!
• Many terms have been used to describe such endeavors:
  • Statistics
  • Artificial Intelligence
  • Machine learning
  • Data analytics
• Since I happen to work in a “Data Science” program, perhaps I may be allowed the indulgence of using that terminology…
The Good
Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics and to the development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies

The Bad
Given a large mass of data, we can by judicious selection construct perfectly plausible unassailable theories, all of which, some of which, or none of which may be right.
- Paul Arnold Srere
The Hopeful
The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.
- Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers

My personal goal: Getting students to be able to think critically about data.
What is Big Data?
There are many examples of "data", but what makes some of it “big”? The classic definition revolves around the three Vs: volume, velocity, and variety.
• Volume: There is just a lot of it being generated all the time. Things get interesting, and “big”, when you can’t fit it all on one computer anymore. Why? Because processing it then takes new ideas, such as MapReduce and Hadoop, that all revolve around handling data as it grows from terabytes, to petabytes, to exabytes.
• Velocity: Data is being generated very quickly. Can you even store it all? If not, then what do you get rid of and what do you keep?
• Variety: Different kinds of data take different shapes. What does it mean to store them so that you can play with or compare them?
[Image: http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg]
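The MapReduce idea named under Volume can be sketched on a single machine. This is only a toy illustration in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the two text "shards" are made up for the example. The point is that the map step looks at one chunk at a time, so each chunk could live on a different computer.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of text chunks."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical shards; on a real cluster each would sit on a different machine.
chunks = ["big data is big", "data science is not only big data"]
counts = word_count(chunks)
print(counts)
```

Because `map_phase` never needs to see more than one chunk, the expensive part of the job can be spread across as many machines as there are chunks; only the much smaller grouped counts have to be brought together.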
Is Big Data the same as Data Science?
• Are Big Data and Data Science the same thing?
• I wouldn't say so...
• Data Science can be done on small data sets.
• And not everything done using Big Data would necessarily be called Data Science.
• But there certainly is a substantial overlap!
[Figure: Venn diagram of Big Data and Data Science]
Can you even be certain?
• For real-world problems, I claim that you will never be certain of any inferences from data.
• I mean, what happens to your carefully thought-out marketing plan for some rocking slacks when the Martians land?
• What is unacceptable is when the data you actually have does not support the conclusion you report.
[Public domain image]
It can be easy to fool yourself! Human beings are really good at pattern detection... Perhaps a bit too good! http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
It can be easy to fool yourself! http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
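The same trap exists in numbers, not just in pictures of Mars. The sketch below is an illustrative simulation, not something from the talk: it generates a purely random target and many purely random "features", then reports the strongest correlation it can find. The sample size, feature counts, and seed are arbitrary assumptions, and `best_spurious_correlation` is a name invented for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a random target and random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is pure noise, unrelated to the target by construction.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


one = best_spurious_correlation(n_samples=20, n_features=1)
many = best_spurious_correlation(n_samples=20, n_features=1000)
print(f"best |r| with 1 feature:     {one:.2f}")
print(f"best |r| with 1000 features: {many:.2f}")
```

Nothing here is related to anything else, yet searching through enough candidates turns up a correlation that would look convincing on a slide. That is "judicious selection" at work: the data does not support the conclusion, even though one hand-picked plot seems to.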
Skills for Data Science http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Which is most important? http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
WPI Data Science Program: A Collaboration
• Computer Science Department
• Mathematical Sciences Department
• Business School
M.S. in Data Science Program
• Graduate Qualifying Project or MS Thesis (3 to 9 credits)
• Concentration and Electives (9 to 15 credits)
• Mathematical Analytics (3 credits)
• Business Intelligence & Case Studies (3 credits)
• Data Access & Management (3 credits)
• Data Analytics & Mining (3 credits)
• Integrative Data Science (3 credits)
Data Science Core
• Integrative Data Science:
  • DS 501 Introduction to Data Science (new course)
• Mathematical Analytics (select one):
  • MA 543/DS 502 Statistical Methods for Data Science (new course)
  • MA 542 Regression Analysis
  • MA 554 Applied Multivariate Analysis
• Data Access and Management (select one):
  • CS 542 Database Management Systems
  • MIS 571 Database Applications Development
  • CS 561 Advanced Topics in Database Systems
  • CS 585/DS 503 Big Data Management (new course)
• Data Analytics and Mining (select one):
  • CS 548 Knowledge Discovery and Data Mining
  • CS 539 Machine Learning
  • CS 586/DS 504 Big Data Analytics (new course)
• Business Intelligence and Case Studies (select one):
  • MIS 584 Business Intelligence
  • MKT 568 Data Mining Business Applications
• Data Science Certificate Program (18 credits): 15-credit Data Science core plus a 3-credit elective
2014 Data Science Cohort
EDUCATIONAL FOUNDATION
• Quantitative/computational backgrounds
• Computational skills: programming with data structures and algorithms
• Quantitative skills: calculus, linear algebra, and statistics
EMPLOYMENT HISTORIES
• Senior Research Analyst
• Senior Business Analyst
• Patient Financial Services
• Data Base Analyst-architect
• Decision Scientist
• Ministry of Finance
• Lahey Health
• Technical Program Management
• U.S. Department of State
NATIONALITY
• Cambodia, India, China, Pakistan, Taiwan, Iran, U.S.A., Brazil, Nepal, Afghanistan, Indonesia
• 10% Fulbright Scholars
GENDER
• 66.7% male, 33.3% female
2014 Data Science Cohort: Fall 2014
• Total applicants: 126
• Total acceptances: 33
• Fulbright Scholars: 3
• Brazil Science Mobility Student: 1
• Countries represented: 9
• Domestic students: 5
• International students: 28
• Many hold more than one earned bachelor’s degree.
• US universities include Columbia, UNH, and WPI.
• Dean Oates gave two awards of $5K to outstanding students. These awards help attract top students.
Skills Acquired by Our Students
• Fundamental/Technical:
  • SQL / Data Modeling / Cleaning
  • Data Integration / Warehousing
  • Statistical Learning / Machine Learning
  • Distributed Computing
  • Big Data Management
  • Classification / Regression / Decision Trees
  • Business Intelligence
  • Distributed Mining Algorithms
• Tools:
  • Oracle / MySQL / DB2 / SQL Server
  • R / SAS / SciKit
  • Weka / RapidMiner / MATLAB
  • IBM Cognos / SPSS Modeler
  • Hadoop / Mahout / Cassandra
  • Python / Java / Cloud Computing
  • Storm / Spark / InfoSphere Streams
  • Spotfire / Tableau
• Professional Skills:
  • Business Use Cases / Entrepreneurship
  • Interdisciplinary Teams / Leadership
  • Story Telling / Visualization
  • Presentations / Reports
Data Science Tools for Students: Free!
• Software:
  • Python: http://www.python.org/
  • IPython: http://ipython.org/
  • Numpy: http://www.numpy.org/
  • Pandas: http://pandas.pydata.org/
  • Matplotlib: http://matplotlib.org/
  • Mayavi: http://mayavi.sourceforge.net/
  • Scikit-learn: http://scikit-learn.org/stable/
• Data:
  • UCI Machine learning repository: http://archive.ics.uci.edu/ml/
  • Kaggle: https://www.kaggle.com/
  • U.S. Government: https://www.data.gov/