
Welcome to IST 380!


Presentation Transcript


  1. Welcome to IST 380! Data Science Programming. "We don't have strong enough words to describe this class." - US News and Course Report. "When the course was over, I knew it was a good thing." - New York Times Review of Courses. "We give this course two thumbs!" - Ebert and Roeper. (Image caption: an advocate of concrete computing, and HMC's mascot.)

  2. Welcome to IST 380 ! Data Science Programming an advocate of concrete computing – and HMC's mascot

  3. About myself. Who: Zach Dodds. Where: Harvey Mudd College. What: Research includes robotics and computer vision. When: Mondays 7-10pm here in ACB 119. Contact Information: dodds@cs.hmc.edu, 909-607-0867. Office Hours: Friday mornings, 9-11 am, or set up a time... (HMC Beckman B111)

  4. TMI? fan of low-tech games fan of low-level AI

  5. IST 380 ~ the big picture What is it? Why me?

  6. IST 380 ~ the big picture What is it? Data Science Venn Diagram Hmmm… where am I on this diagram?

  7. Data?! • Neighbor's name • A place they consider home • Are they working at a company now? • How many U.S. states have they visited? • Their favorite unhealthy food… ? • Do they have any "Data Science" background? (statistics, machine learning, CS) Where?

  8. state reminders…

  9. Data! • Neighbor's name: Zachary Dodds • A place they consider home: Pittsburgh, PA • Are they working at a company now? Where? Harvey Mudd • How many U.S. states have they visited? 44 • Their favorite unhealthy food…? M&Ms • Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…

  10. Data! • Neighbor's name: Zachary Dodds • A place they consider home: Pittsburgh, PA • Are they working at a company now? Where? Harvey Mudd • How many U.S. states have they visited? 44 • Their favorite unhealthy food…? M&Ms • Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me… This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field… Be sure to set up your login + profile for the submission site…

  11. Data Science concerns Is "Data Science" important or just trendy?

  12. Data Science concerns Hmmm…

  13. the companies are expanding as fast as the data!

  14. There's certainly a lot of it! Data, data everywhere… Chart: data produced each year (logarithmic scale). 2002: 5 EB; 2006: 161 EB; 2009: 800 EB; 2011: 1.8 ZB; 2015: 8.0 ZB. Reference points: human brain's capacity, 14 PB; 100 years of HD video + audio, 60 PB; also marked: 120 PB. Units: 1 Petabyte = 1000 TB; 1 TB = 1000 GB. References: (2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf; (2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm; (2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf; (2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf; (2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm; (life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound), almost certainly a gross overestimate, as sleep can be compressed significantly!; (brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
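
The unit relationships and yearly figures on this slide can be checked with a little arithmetic. The following is only a sketch in R: it assumes decimal (SI) prefixes throughout, as the slide's "1 Petabyte = 1000 TB" suggests, and the variable names are made up for illustration.

```r
# Decimal (SI) prefixes are assumed, following "1 PB = 1000 TB, 1 TB = 1000 GB".
GB <- 1e9
TB <- 1000 * GB
PB <- 1000 * TB
EB <- 1000 * PB
ZB <- 1000 * EB

# Yearly data-production figures quoted on the slide, converted to bytes
produced <- c("2002" = 5 * EB, "2006" = 161 * EB, "2009" = 800 * EB,
              "2011" = 1.8 * ZB, "2015" = 8.0 * ZB)

# Growth factor from the first to the last year on the chart
growth <- produced[["2015"]] / produced[["2002"]]
print(growth)   # 1600

# The chart's axis is logarithmic, so it helps to compare orders of magnitude
print(round(log10(produced), 2))
```

On a log scale the jump from 5 EB to 8 ZB is a little over three orders of magnitude, which is why the chart needs a logarithmic axis to show all five years at once.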

  15. I'd call it data, not information. Hierarchy (top to bottom): wisdom, knowledge, information, data

  16. Big Data? I agree with this…

  17. Make data easier to use ~ by using it! It may be true that Data Science isn't a science – but that doesn't mean it's not useful!

  18. IST 380 ~ the big picture What? Why? Data Science Programming Data Rules All of our insights – large and small, permanent and ephemeral, natural and artificial – come about through the integration of lots of data. Data Science simply recognizes that the rules and skills behind those insights are widely applicable…

  19. A few examples… Make3d Andrew Ng ~ Computers and Thought award, 2009 How is this being done? and how do we succeed? … Data Science is at the heart of computer science

  20. A few examples… Learning to Powerslide Stanford's Autonomous Vehicles project (Thrun et al.) … Data Science is at the heart of computer science

  21. A few examples… Learning ground from obstacles "my summer was finding that red line" … Data Science is at the heart of computer science

  22. A few examples… classification segmentation Learning ground from obstacles

  23. Insights beyond science

  24. Marketing

  25. Visualization Motivation

  26. Recommender Systems predicting movie ratings

  27. Netflix Prize (I don't know this guy) Bob Bell, winner of the "Netflix prize" 1.22 .75 ?? ?? Napoleon Dynamite = Batman Begins = Finding Nemo = Lord of the Rings = Some films are difficult to predict…

  28. Netflix Prize (I don't know this guy) Bob Bell, winner of the "Netflix prize" 1.22 .75 .67 .42 Napoleon Dynamite = Batman Begins = Finding Nemo = Lord of the Rings = Some films are difficult to predict… and others are easier!

  29. Why IST 380 ? Specific skills: R statistical environment (and the S programming language) Experience with several statistical analyses (descriptive statistics) Experience with predictive statistics (modeling) and machine learning algorithms

  30. Why IST 380 ? Specific skills: R statistical environment (and the S programming language) Experience with several statistical analyses (descriptive statistics) Experience with predictive statistics (modeling) and machine learning algorithms Broad background: Final project ~ open-ended with datasets of your choice You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.

  31. About IST 380 …

  32. Details. Web Page: http://www.cs.hmc.edu/~dodds/IST380 (assignments, online text, necessary files, and lecture slides are linked). First week's assignment: Getting started with R. Textbook: An Introduction to Data Science, freely available online at jsresearch.net/groups/teachdatascience/ (and many online resources…). Programming: R, www.r-project.org/. Grab both of these now…

  33. Homepage Go to the course page http://www.cs.hmc.edu/~dodds/IST380/ Grab R and the text from these two links…

  34. Homework Assignments: ~2-5 problems/week, ~100 points, extra credit often. Due Tuesday of the following week by 11:59 pm. Assignment 1 due Tuesday, February 5 (1 week + 1 day…).

  35. Homework Assignments: ~2-5 problems/week, ~100 points, extra credit often. Due Tuesday of the following week by 11:59 pm. Assignment 1 due Tuesday, February 5. On your own or in groups of 2. Working on programs: Divide the work at the keyboard evenly! Submitting programs: at the submission website. Today's Lab: install software, ensure accounts are working, try out R - the first HW is officially due on 2/5

  36. Outline (approximate!). Weeks 1-5, "Data Science": using R, descriptive statistics, predictive statistics, probability distributions, statistical modeling. Weeks 6-10, "Machine Learning": support vector machines (SVMs), nearest neighbors (NN), random forests, k-means algorithm. Weeks 11-15: Final Project. No breaks?!

  37. Grading. Grades are based on points percentage: ~800 points for assignments, ~400 points for the final project. Cutoffs: if score >= 0.95: grade = "A"; if score >= 0.90: grade = "A-"; if score >= 0.86: grade = "B+"; see the course syllabus for the full list... Final project: • the last ~4 weeks will work towards a larger, final project • there will be a short design phase and a short final presentation • choose your own problem to study (I'll have some suggestions, too.) • I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
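
The grade cutoffs on this slide read as a chain of threshold checks. A minimal R sketch of that idea follows; the function name `letter_grade` and the earned-points value are hypothetical, and because the slide defers the remaining cutoffs to the syllabus, scores below 0.86 return NA here rather than a guessed grade.

```r
# Map a fractional score (0 to 1) to a letter grade using only the three
# cutoffs shown on the slide. Lower cutoffs are in the syllabus, not here,
# so anything below 0.86 is returned as NA rather than guessed.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_
  }
}

# ~800 assignment points + ~400 project points = ~1200 total (from the slide)
total_points <- 800 + 400
earned <- 1100                      # hypothetical student total
score <- earned / total_points
print(letter_grade(score))
```

The checks are chained with `else if` so that the first matching cutoff wins; written as three independent `if` statements, a score of 0.96 would be overwritten by the later, lower thresholds.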

  38. Academic Honesty • This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies… • Your work must be your own. This must be true for the whole team, if you're working in a pair. • Consulting with others (beyond team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy. • A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.

  39. Thoughts?

  40. Getting to know… R

  41. Getting to know… R R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence http://lang-index.sourceforge.net/#categ

  42. Getting to know… R Free… and very well supported online…

  43. Getting to know… R R is responsive, up-to-date, and flexible: Data Science vs. Statistics

  44. Getting to know… R Try it! 1) Find the IST 380 course webpage www.cs.hmc.edu/~dodds/IST380/ 2) Download and install R 3) Run R and try some basic commands at the prompt: 6 * 7 rnorm(10) x <- 380
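
Typed at the R prompt one per line, the three commands from this slide look roughly like the sketch below. The final line is an addition for illustration, showing that an assignment prints nothing until the name is evaluated.

```r
6 * 7        # arithmetic; prints [1] 42
rnorm(10)    # ten random draws from a standard normal distribution
x <- 380     # assignment with <- ; prints nothing
x            # typing a name prints its value: [1] 380
```

Because `rnorm()` draws random numbers, its ten values will differ on every run.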

  45. Getting started! 1) Open Matloff's Why R? notes 2) Skip ahead to page 7, the "5 minute example session" 3) Try out the commands in section 2.2 to get started… 4) When you finish, save your session and submit it! This is problem 1 this week

  46. Saving your session 1) Create a folder named hw1, perhaps on your desktop 2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1 3) Name that file pr1.txt 4) From your operating system, open up that file in order to confirm it contains your whole session! This is problem 1 this week

  47. Submitting your work 1) Zip up hw1 into hw1.zip 2) From the course webpage, click on the submission site link. 3) Choose a submission site login name & let me know! 4) Once your account is made, login, change your password to something you know, and submit hw1.zip 5) You can submit again – all copies are saved… troubles? email me! This webserver can be spacey -- I should know! You've completed Problem 1!

  48. Reflection Assignment? Creating a vector? Printing? Average and standard deviation? Comments? Comments?
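
Each reflection prompt maps onto a line or two of R. This sketch uses made-up values and hypothetical variable names; it is not taken from the example session in the assigned notes.

```r
# Comments start with a # and are ignored by R

# Assignment: store a value under a name with <-
course <- 380

# Creating a vector with c() ("combine"); these values are made up
states_visited <- c(44, 12, 30, 7, 25)

# Printing: either call print() or just type the name at the prompt
print(states_visited)

# Average and standard deviation
avg <- mean(states_visited)
spread <- sd(states_visited)
print(avg)
print(spread)
```

Note that `sd()` computes the sample standard deviation, dividing by n - 1 rather than n.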

  49. R types You can use mode() to view the type of a variable.
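
A few calls to `mode()` show what it reports for common kinds of values. This is a short sketch; the example values are arbitrary.

```r
mode(380)          # "numeric"
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
mode(c(1, 2, 4))   # "numeric": a vector reports the mode of its elements
mode(mean)         # "function"

x <- 380
mode(x)            # works the same way on a variable: "numeric"
```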
