
Introduction to Data Science – INFO 480 – Drexel University’s iSchool

Sean P. Goggins, PhD. April 2, 2013. Week One. What is “Data Science”?


Presentation Transcript


  1. Introduction to Data Science – INFO 480 – Drexel University’s iSchool • Sean P. Goggins, PhD • April 2, 2013 • Week One

  2. What is “Data Science”? • Data scientists are storytellers. • Variables? Time, number of participants, number of new participants, number of sustained participants • [Figure: number of participants over time]
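The variables named on this slide can be derived from a plain activity log. The following is a minimal sketch, assuming a hypothetical list of (period, participant) records. It treats "sustained" as "also active in the immediately preceding period", which is an assumption for illustration and not a definition given in the slides.

```python
from collections import defaultdict

def participation_summary(events):
    """events: iterable of (period, participant) pairs; periods must be sortable."""
    active = defaultdict(set)
    for period, person in events:
        active[period].add(person)

    summary = []
    seen = set()       # everyone observed in any earlier period
    previous = set()   # participants active in the immediately preceding period
    for period in sorted(active):
        current = active[period]
        summary.append({
            "period": period,
            "participants": len(current),
            "new": len(current - seen),            # never seen before
            "sustained": len(current & previous),  # assumed definition
        })
        seen |= current
        previous = current
    return summary

# Hypothetical activity log (illustrative data only)
events = [(1, "ann"), (1, "bob"), (2, "bob"), (2, "cy"),
          (3, "cy"), (3, "ann"), (3, "dee")]
for row in participation_summary(events):
    print(row)
```

Plotting any of these counts against the period gives the kind of time series the slide describes.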

  3. Story Telling

  4. Story Telling

  5. Story Telling

  6. What is Data Science? • Storytelling • Database Theory – How you organize your data has a big influence on what you can do with it. • Agile Manifesto – Key thing is iterative development; it’s a technology value system. • Spiral Dynamics – What we view as fact and what we desire emerges from the data presented to us. Credit: http://www.datascientists.net/what-is-data-science

  7. Database Theory • Relational Algebra & Set Theory • Thinking in relations helps you connect disparate data: • What is the connecting field? • What is the cardinality? • Set theory helps you think about summarizing data: • What time period? Weeks? Months? • By person? By group? By geography?
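The questions on this slide can be made concrete with two small tables. This is a minimal sketch in plain Python; the tables `people` and `posts`, their field names, and the helper functions are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical tables (illustrative data only)
people = [
    {"person_id": 1, "group": "A"},
    {"person_id": 2, "group": "A"},
    {"person_id": 3, "group": "B"},
]
posts = [
    {"post_id": 10, "person_id": 1, "week": 1},
    {"post_id": 11, "person_id": 1, "week": 2},
    {"post_id": 12, "person_id": 2, "week": 2},
    {"post_id": 13, "person_id": 3, "week": 2},
]

def cardinality(rows, key):
    """Return 'one' if every key value appears exactly once in rows, else 'many'."""
    counts = Counter(row[key] for row in rows)
    return "one" if all(n == 1 for n in counts.values()) else "many"

def join(left, right, key):
    """Inner join of two lists of dicts on a shared connecting field."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    return [{**l, **r} for r in right for l in index[r[key]]]

def count_by(rows, *fields):
    """Count rows per combination of the given fields (a simple group-by)."""
    return Counter(tuple(row[f] for f in fields) for row in rows)

# Connecting field: person_id. Cardinality: one person to many posts.
print(cardinality(people, "person_id"), "to", cardinality(posts, "person_id"))

# Summarize posts by group and by week.
joined = join(people, posts, "person_id")
print(count_by(joined, "group", "week"))
```

Changing the fields passed to `count_by` answers the other summary questions (by person, by time period) without changing the join.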

  8. Agile Manifesto • Individuals and interactions over processes and tools • Working software over comprehensive documentation • Customer collaboration over contract negotiation • Responding to change over following a plan http://www.agilemanifesto.org

  9. Spiral Dynamics New research unveiled at this year’s AERA conference documents a disturbing trend among the nation’s secondary schools: Between 2001 and 2012, high school graduation rates regularly spiked in late May and early June, ballooning from near zero to a staggering average of 78 percent.

  10. What you’ll need for this course • Interest in learning data analysis tools • R • Python • Curiosity • A laptop to bring to class (see me if this is a problem) • Persistence • A Github Account • Willingness to do weekly homework and participate in online iteration of data products you and your course mates develop • A Dropbox account will be helpful

  11. An Outline of The Class • Two Books • Janert, PK. (2011) Data Analysis with Open Source Tools. New York: O’Reilly. • Vaidhyanathan, S. (2005) The Anarchist in the Library: How the Clash Between Freedom and Control is Hacking the Real World and Crashing the System. New York: Basic Books. • Ten Weeks

  12. Schedule

  13. Activity One • Academic Integrity Form at the Back of The Syllabus • 15 Minutes to Work on Tool Configuration • R Download • Python Download & Config • http://seangoggins.net/info480 • Github Account and Client • http://mac.github.com • http://windows.github.com

  14. Basic R Instructions

  15. Installation • Download and install R • Run the “R-Libraries.R” script from the root of the github project directory

  16. Basics

  17. Google Trends – 2005 – Present • Produce a trend graph for Google search phrases. • Identify four search phrases. Describe what makes these search phrases a coherent comparison in two sentences. What do they have in common? How do they provide a useful contrast? • BEFORE you run the search, write down what you expect the trends to look like. Spiky, trending upward, trending downward? • Examine the resulting changes. • What did you find? What are some theories that might explain the similarities or differences you observed?

  18. Network Analysis

  19. Telling Stories: A Visualization of Purely Qualitative Data • You CAN do that without quantitative data • You can do it with qualitative data • And A LOT of quantitative data REQUIRES qualitative analysis

  20. Idealized Distributed Team

  21. Actual Models Found

  22. The Different ICT Roles

  23. Organizational Evolution

  24. Network Analysis – Github Activities

  25. Motivation Underpants Gnomes With much discourtesy from the US TV Program “South Park”

  26. Motivation Underpants Gnomes

  27. Addressing The Underpants Gnome Postulate

  28. Group Informatics Described: Methodological Approach • Identify key information brokers • Weight connections based on time distance, grouped by topic and informed by analysis of the time distance between posts
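One way to picture time-distance weighting is sketched below. The slides do not give a formula, so everything here is an assumption for illustration: posts are hypothetical (topic, author, time in hours) records, each post is linked to the previous post in the same topic, and the weight 1 / (1 + gap) is just one simple decay that makes quick replies count more than slow ones.

```python
from collections import defaultdict

def time_weighted_edges(posts):
    """posts: iterable of (topic, author, timestamp_in_hours).

    Links each post to the previous post in the same topic and adds
    1 / (1 + gap_in_hours) to the directed edge (author -> previous author).
    The decay function is an illustrative assumption.
    """
    by_topic = defaultdict(list)
    for topic, author, t in posts:
        by_topic[topic].append((t, author))

    weights = defaultdict(float)
    for thread in by_topic.values():
        thread.sort()  # chronological order within the topic
        for (t_prev, a_prev), (t_cur, a_cur) in zip(thread, thread[1:]):
            if a_cur != a_prev:  # ignore authors replying to themselves
                weights[(a_cur, a_prev)] += 1.0 / (1.0 + (t_cur - t_prev))
    return dict(weights)

# Hypothetical posts (illustrative data only)
posts = [
    ("topic-1", "a", 0.0),
    ("topic-1", "b", 1.0),
    ("topic-1", "a", 4.0),
    ("topic-2", "c", 0.0),
    ("topic-2", "a", 9.0),
]
for edge, weight in sorted(time_weighted_edges(posts).items()):
    print(edge, round(weight, 3))
```

Participants with many heavily weighted connections across topics are candidates for the "information broker" role the slide mentions.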

  29. Data: Github

  30. Github Network Activity One

  31. Actual R – Code • Work through setup • Scripts are ready to run • Talk Through Them and Walk around to help

  32. Further Analysis Tools • Eight Mylyn Releases (Temporal Analysis) • R Packages Used • TNET • iGraph • Statnet

  33. Weighted Network: TNET

  34. The Dense Graph (Work) • Developers create a dense graph: not a complete graph, but dense.

  35. A Sparser Graph (Talk) • Commenters create a sparse graph.
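"Dense" and "sparse" can be quantified as graph density: the number of edges present divided by the number of edges possible, which for an undirected simple graph with n nodes and m edges is 2m / (n(n - 1)). A complete graph has density 1. The sketch below uses hypothetical edge lists, not the course data.

```python
def density(nodes, edges):
    """Density of an undirected simple graph: edges present / edges possible."""
    n = len(set(nodes))
    if n < 2:
        return 0.0
    # Drop self-loops and treat (a, b) and (b, a) as the same edge.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    return 2 * len(unique) / (n * (n - 1))

# Hypothetical networks over the same four people (illustrative data only)
nodes = ["a", "b", "c", "d"]
work = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
talk = [("a", "b"), ("c", "d")]

print("work density:", round(density(nodes, work), 3))
print("talk density:", round(density(nodes, talk), 3))
```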

  36. Release One (2.0) Analysis • [Figure: Release 1 networks; panels: Code (Work), Discussion (Talk); tool: iGraph]

  37. Statnet for Discussion • [Figure: Release 1 Talk network, drawn with Statnet] • Red = bug commenter • Blue = bug opener

  38. Release One Work & Talk

  39. Release 1 (2.0) iGraph & Statnet • [Figure: Release 1 Talk network; panels: Clusters, In Degree & Out Degree; tools: Statnet, iGraph] • Red = bug commenter • Blue = bug opener
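In degree and out degree count the edges arriving at and leaving each node in a directed network. A minimal sketch, assuming a hypothetical edge list in which each edge points from a commenter to the opener of the bug being commented on (the direction is an assumption, not taken from the slides):

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return in_degree, out_degree

# Hypothetical (commenter, opener) edges (illustrative data only)
edges = [("c1", "o1"), ("c2", "o1"), ("c1", "o2")]
in_degree, out_degree = degrees(edges)
print("in degree:", dict(in_degree))
print("out degree:", dict(out_degree))
```

A `Counter` returns 0 for nodes that never appear on one side, so a pure commenter has in degree 0 and a pure opener has out degree 0.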

  40. Release One (2.0): Filtered • [Figure: Release 1 networks; panels: Code (Work), Discussion (Talk); label: Google Summer Coder] • 304, 373, 399 & 143 form the strongest connections in both networks • Red = bug commenter • Blue = bug opener

  41. Release One (2.0): Filtered • [Figure: Release 1 networks; panels: Code (Work), Discussion (Talk); label: Google Summer Coder] • 457, 391 & 159: comment & open • 304, 373, 399 & 143 form the strongest connections in both networks • Red = bug commenter • Blue = bug opener

  42. Compare Over Time First & Last Release

  43. Release 1 (2.0) Compared to Release 8 (3.3) • [Figure: Talk networks, drawn with Statnet & ordinary plotting] • Release 1: 304, 399, 143, 159, 173, 373 • Release 8: 399, 118, 304, 159, 391, 416

  44. Release 1 (2.0) Compared to Release 8 (3.3) • [Figure: Work networks for Release 1 and Release 8, drawn with iGraph] • 143 & 304 disengaged or missing entirely • 304, 373, 399 & 143 • Two disconnected graphs in Release 8

  45. Release Eight Work & Talk

  46. Release 8 (3.3): Filtered • [Figure: Release 8 networks; panels: Discussion (Talk), Code (Work)] • Nobody is “just blue” • Red = bug commenter • Blue = bug opener

  47. Release 8 (3.3): Filtered • [Figure: Release 8 networks; panels: Discussion (Talk), Code (Work)] • Notice 416 in Talk & second coder graph • Red = bug commenter • Blue = bug opener

  48. Release 8 (3.3) iGraph & Statnet • 399, 118 & 159 are significant, but play with different clusters of other people. • [Figure: Release 8 Talk network; panels: Clusters, In Degree & Out Degree; label: Blue Cluster; tools: Statnet, iGraph] • Red = bug commenter • Blue = bug opener
