Lifecycle Seminar Series Welcome to the Community! Live Tweet to #DSSS2
The Lifecycle Series • #1: July 10 The Scientist, The Team and The Purpose • #2: July 31 Organizing and Feeling Out Your Data Dates and Topics not Finalized, but roughly: • #3: Data / Analytics Preparation • #4: Modeling, Classification, and Decision-Making • #5: The Data Science Team • #6: Telling The Story: Visualizing Results
We Want Contributors! • Looking for people willing to lead one of the Topics in given seminars • Looking for people who have an interesting anecdote or challenge to offer • Want to try integrating with main speaker or kick off networking session • Particularly interested in experiences/anecdotes for Session II (July 31) : Organizing and Feeling Out Your Data
Data Lifecycle = Where we are But!...
Data Science Lifecycle • Tonight, Focus is on Feeling Out Data • Primarily early-stage skill, but a part of all stages • Something everyone can do, increasingly so with modern tools Organizing and Feeling Out your Data
Tonight’s Agenda • The Data Scientist Seminar Series • Followup from Seminar 1 • Participation opportunities • Jason Sroka: “Organizing and Feeling Out your Data” • Wrap-up & Announcements • Networking Session – Buy Jason Tequila!
MarketMeSuite – Our Venue Sponsor MarketMeSuite’s Inbox For Social is how small businesses convert leads and market on social media
Approach & Goals • Walk through steps of organizing and feeling out data • Focus on Data Scientist Survey • Use Survey data and anecdotes to touch on Data Science topics • Not going deep, but trying to give a real feel • Tool Discussion • Tableau and Google Refine
Data Setup • We are all getting our data from somewhere • Personal data • Private data • Public data • Need tool(s) to look at it with • Will see Tableau here, many others available • Focus is on feeling out the data, not managing it • Will only mention some data management challenges • Not dealing with Big Data tonight (when we go international…) • These are topics that will be more central to future Meetup Seminars
What I did • Quick scan of source • Excel File • Nulls in Beige • True flags in Green • 84 Data Rows • Import the data • Tableau reads straight from Excel [Diagram: Source(s) → Import → Analytics Tool]
What a Quick Scan Shows • Organization of Raw Data • Nulls in Beige • True flags in Green
Start with the Basics • The first question • How much data is there? • 85 records imported • Move to things you know/understand • Simple categories (gender, age, ...) • Check assumptions (e.g. more males than females)
Gender • Simple category • Binary • Meaningful to everyone • Data not quite so simple • What is a Null, compared to a Blank?
Message #1: Data is Messy! • Data Scientists have gender issues! • We have a Null and 3 blanks • Back to the source… • Null is a bad record (header?) • Blanks were a user option • Clean it up • Don’t re-discover and re-implement • Someone needs to track these! • Null filtered in Tableau • Count now at 84 • Blank relabeled to “N/A” in Excel • Tools Discussion and Seminar 3 will go into Data Cleansing in more detail [Figures: Before Cleaning; After Cleaning]
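The two cleaning steps on this slide (filter the Null record, relabel Blanks as "N/A") can be sketched in a few lines. This is only an illustration: the talk did the work in Tableau and Excel, and the rows, column names, and `clean_gender` helper below are hypothetical stand-ins for the survey export.

```python
# Hypothetical rows; the real survey columns and values are not shown in the slides.
rows = [
    {"respondent": "1", "gender": "Male"},
    {"respondent": "2", "gender": "Female"},
    {"respondent": "3", "gender": ""},      # blank: a valid "no answer" option
    {"respondent": None, "gender": None},   # null: a bad record (e.g. a stray header row)
]

def clean_gender(rows):
    """Drop records whose gender is null and relabel blank answers as "N/A"."""
    cleaned = []
    for row in rows:
        if row["gender"] is None:
            continue  # bad record: filter it out
        label = row["gender"].strip() or "N/A"
        cleaned.append({**row, "gender": label})
    return cleaned

cleaned = clean_gender(rows)
print(len(rows), "->", len(cleaned))
print([r["gender"] for r in cleaned])
```

Keeping the rule in one function is one way to honor "Don't re-discover and re-implement": the fix is written down once and can be rerun whenever the source is re-imported.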
Handedness • Didn’t we just fix the NULL thing? • Yes – this is a new Null • Excel had a cut-and-paste error! • Formula wasn’t used in column – values were hard-coded • Fixed formula, copied throughout [Figures: Before Cleaning; After Cleaning]
Data Scientist Ethic • Don’t ignore the warts! • Most warts are meaningless • Of those that aren’t, most are easy to figure out • Of those that aren’t, most are at least easy to fix once you figure it out • Of those that aren’t, most times you can get someone else to help you fix it • Of those that aren’t, you’ll usually improve your implementation skills when you resolve it • Sometimes this line of work sucks • The warts that aren’t meaningless help you understand the data • In this case, a problem with the data process • In other cases, interesting quirks and potential insights!
Age • Survey question: Birth Year • Seeing old and new issues • Blanks • Number ranges • Survey did not constrain to YYYY • Fixed these three entries
Age • Survey question: Birth Year • Seeing old and new issues • Nulls • Turn out to be blanks – valid option in Survey • Number ranges • Survey did not constrain to YYYY • Fixed these three entries
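A free-text Birth Year field that is not constrained to YYYY can be normalized with a small parser. The slides do not show the three entries that were fixed, so the sample inputs, the two-digit pivot, and the `parse_birth_year` name below are all assumptions made for illustration.

```python
import re

def parse_birth_year(raw, pivot=30):
    """Return a four-digit year, or None for blanks and unparseable entries.

    Two-digit entries are expanded with an assumed pivot:
    values above `pivot` become 19xx, the rest become 20xx.
    """
    text = (raw or "").strip()
    if not text:
        return None  # blank was a valid option in the survey
    four = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    if four:
        return int(four.group(1))
    two = re.fullmatch(r"'?(\d{2})", text)
    if two:
        yy = int(two.group(1))
        return 1900 + yy if yy > pivot else 2000 + yy
    return None

# Hypothetical entries of the kind an unconstrained field might collect.
samples = ["1978", "78", "'85", "03/14/1978", "", "unknown"]
parsed = [parse_birth_year(s) for s in samples]
print(parsed)
```

Entries the parser cannot read come back as None rather than a guess, so they stay visible as warts to investigate.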
Age, as Age • Birth Year isn’t our interest, Age is • Transform your data to suit your needs • Be as direct between the data and the context as you can [Figure panels: Age; Birth Year; Decade]
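The Birth Year to Age transformation is simple arithmetic, sketched below. The survey year is not stated in the slides, so `survey_year` is a placeholder, and reading "Decade" as decade of birth is an assumption.

```python
def derive_age_fields(birth_year, survey_year):
    """Express a birth year in the terms we actually care about."""
    age = survey_year - birth_year           # approximate: ignores month and day
    decade = f"{birth_year // 10 * 10}s"     # decade of birth, e.g. "1970s"
    return {"birth_year": birth_year, "age": age, "decade": decade}

# Placeholder values for illustration only.
fields = derive_age_fields(1978, survey_year=2012)
print(fields)
```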
The Art of Data Science • Message #2: Connect the Data to the Context • Transform the data to suit your needs • Easy investigation/understanding • Analytics goals • Operational goals • This is where Telling the Story feeds back • Effective plots help the data tell their story to you • Try things out!
Favorite Color • Here, I’ve assigned colors near the named color • Sorting by most prevalent to least • Blank isn’t adding anything • Removing
Favorite Color • Now, let’s add Gender • Okay – I see differences! • Something to form an impression from • Something to come back to • Blue is now the Official Data Scientist color!
Check Assumptions • Assumption 1: More Males than Females • Assumption 2: 10-15% Lefties • Underestimate! • Assumption 3: Different color preferences by Gender
Checking Assumptions… • Familiarizes You with the Data • Identifies data issues • Tests your assumptions • Gives you Confidence in the Data… • Confidence in the initial source • Confidence in Extraction, Transformation, Load • …and Your Assumptions • Confidence in your Intuition where it was right • Updates to your Intuition where it was off
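Checking an assumption usually comes down to a count or a proportion. The sketch below tests "more males than females" and "10-15% lefties" against made-up answer lists; the values are hypothetical and are not the survey's results.

```python
from collections import Counter

MISSING = {None, "", "N/A"}

def share(values, target):
    """Fraction of non-missing values equal to target."""
    known = [v for v in values if v not in MISSING]
    return sum(v == target for v in known) / len(known) if known else 0.0

# Hypothetical answers, chosen so the second assumption fails.
gender = ["Male"] * 6 + ["Female"] * 4 + ["N/A"]
handedness = ["Right"] * 8 + ["Left"] * 2 + [""]

gender_counts = Counter(v for v in gender if v not in MISSING)
more_males = gender_counts["Male"] > gender_counts["Female"]

lefty_share = share(handedness, "Left")
lefties_in_range = 0.10 <= lefty_share <= 0.15

print("More males than females:", more_males)
print(f"Lefty share: {lefty_share:.0%}, within 10-15%: {lefties_in_range}")
```

When a check fails, as the lefty one does here, that is the prompt to update the intuition or to go back and look for a data problem.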
Building a Data Model • Data comes in different types • Categorical • Gender, Handedness, Favorite Color, any true/false • Scalar • Age, height, weight • Label/identifier • … • These data types often map to the purpose to which the data will be applied • Categories are dimensions along which we might divide the records • Measurements (Scalars) are facts about specific instances of what we’re modeling • A good data model allows for rapid analytics • Modular construction of sets of dimensions and measurements • Automated investigation of cross-relationships
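One way to picture the dimension/measurement split: once each field is declared as one or the other, the same two small functions can cover every combination. The field names, records, and helper names below are hypothetical; this is a sketch of the idea, not the model used in the talk.

```python
from collections import Counter, defaultdict
from itertools import combinations
from statistics import mean

DIMENSIONS = ["gender", "handedness", "favorite_color"]  # categorical fields
MEASURES = ["age"]                                       # scalar fields

# Hypothetical records for illustration.
records = [
    {"gender": "Female", "handedness": "Right", "favorite_color": "Blue", "age": 30},
    {"gender": "Female", "handedness": "Left", "favorite_color": "Green", "age": 40},
    {"gender": "Male", "handedness": "Right", "favorite_color": "Blue", "age": 25},
    {"gender": "Male", "handedness": "Right", "favorite_color": "Red", "age": 35},
]

def average_by(records, dimension, measure):
    """Divide the records along one dimension and average one measure."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(rec[measure])
    return {key: mean(vals) for key, vals in groups.items()}

def cross_counts(records, dimensions):
    """Count records for every pair of dimensions (a simple cross-tab per pair)."""
    return {
        (a, b): Counter((rec[a], rec[b]) for rec in records)
        for a, b in combinations(dimensions, 2)
    }

print(average_by(records, "gender", "age"))
for pair, counts in cross_counts(records, DIMENSIONS).items():
    print(pair, dict(counts))
```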
Survey Duration • Another processed ‘field’ • End Time – Start Time • Plotting it all: sparse info • A lot of short times • A few long times • Outliers are hiding the data! • After filtering out extremely high values, a different picture emerges… [Figure caption: Same Data, Different Lenses]
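The processed duration field and the outlier filter can be sketched as follows. The timestamp format, the sample times, and the 1,000-second cutoff are assumptions for illustration; the slides do not give the actual threshold used.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout

def duration_seconds(start, end):
    """End Time minus Start Time, in seconds."""
    delta = datetime.strptime(end, FORMAT) - datetime.strptime(start, FORMAT)
    return delta.total_seconds()

def drop_extremes(durations, cutoff):
    """Keep only durations at or below the cutoff so the bulk of the data is visible."""
    return [d for d in durations if d <= cutoff]

# Hypothetical start/end pairs; the last respondent came back a day later.
times = [
    ("2012-07-01 10:00:00", "2012-07-01 10:01:30"),
    ("2012-07-01 11:00:00", "2012-07-01 11:02:30"),
    ("2012-07-01 12:00:00", "2012-07-02 12:00:00"),
]
durations = [duration_seconds(s, e) for s, e in times]
print(durations)
print(drop_extremes(durations, cutoff=1000))
```

Filtering changes the lens, not the data: the long durations are still worth a separate look.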
Playing with Plots 1: Beware Bad Binners! • How you choose bins and plot a histogram can impact your interpretation [Figures: Same Data, Different Axes. Panel 1: very flat, one entry per bin. Panel 2: still flat, but the voids in the X-axis have meaning]
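The effect of bin size is easy to reproduce. In the sketch below the duration values are made up, and the `histogram` and `dense_histogram` helpers are illustrative names; the second one keeps empty bins so that gaps on the x-axis remain visible.

```python
from collections import Counter

def histogram(values, bin_size):
    """Count values per bin; keys are the lower edge of each bin."""
    return Counter(int(v // bin_size) * bin_size for v in values)

def dense_histogram(values, bin_size):
    """Same counts, but empty bins are listed too, so voids show up."""
    counts = histogram(values, bin_size)
    if not counts:
        return []
    top = max(counts)
    return [(edge, counts.get(edge, 0)) for edge in range(0, top + bin_size, bin_size)]

# Hypothetical survey durations in seconds.
values = [62, 75, 88, 91, 104, 118, 240]

for size in (1, 60, 1000):
    print(size, dict(histogram(values, size)))
print(dense_histogram(values, 60))
```

With 1-second bins every entry sits alone and the plot looks flat; with 1,000-second bins everything collapses into a single bar; 60-second bins show a cluster plus one straggler.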
The Practice of Data Science • I just tricked you into looking at a bunch of data! • That is Data Science in action • It is a skill like many others • We all have some ability • We get better with practice • It’s pattern recognition [Figures: histograms of the same data at Bin Size (seconds) of 1, 3, 5, 10, 15, 20, 30, 45, 60, and 1,000]
The Science of Data • Distributions have meaning • Flat: random, fixed • Normal distributions: repeated processes • Exponential: cumulative processes • Over time, we interpret data in terms of known distributions • Survey Duration: Gaussian + Exponential [Figures: two distribution plots, source Wikipedia.org]
Survey Duration • Another processed ‘field’ • End Time – Start Time • Plotting it all: sparse info • A lot of short times • A few long times • Outliers are hiding the data! • After filtering out extremely high values, a different picture emerges • Normal Distribution plus sparse tail • People who start, complete, end • People who start, stop, return, <repeat>, end [Figure caption: Same Data, Different Lenses]
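The "normal distribution plus sparse tail" reading can be explored with a small simulation: most respondents finish in one sitting, and a few add a long break. Every parameter below (mean, spread, share of interrupted respondents, average break length) is an invented placeholder, not an estimate from the survey.

```python
import random

def simulate_durations(n, interrupted_share=0.1, seed=0):
    """Simulate survey durations in seconds: Gaussian core plus an exponential tail."""
    rng = random.Random(seed)
    durations = []
    for _ in range(n):
        seconds = rng.gauss(120, 30)              # start, complete, end
        if rng.random() < interrupted_share:
            seconds += rng.expovariate(1 / 3600)  # start, stop, return: break averaging one hour
        durations.append(max(1.0, seconds))       # clamp so no duration is below one second
    return durations

sample = simulate_durations(500)
print(len(sample), round(min(sample)), round(max(sample)))
```

Plotting such a sample with and without a cutoff gives a feel for how a handful of long values can hide the shape of the rest.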
Tools • I used Tableau here • A lot can be done directly in Excel • Google Refine looks impressive http://www.youtube.com/watch?v=B70J_H_zAWM&feature=player_embedded • Highlights cleansing issues, supports resolution [Diagram: Source(s) → Import → Analytics Tool]
Data Science Lifecycle • Tonight, Focus is on Feeling Out Data • Primarily early-stage skill, but a part of all stages • Something everyone can do, increasingly so with modern tools Organizing and Feeling Out your Data
Closing Thoughts • Message #1: Data is Messy • Don’t ignore the warts • Message #2: Connect the Data to the Context • Translate data so it is expressed in your terms • Message #3: Check Your Assumptions • Explore the data for insights • Message #4: Develop Your Intuition • Look at a lot of data in a lot of ways
Who Rocks? • A HUGE thanks to Peggy Sue for executing the survey and organizing the results! • Super thanks to Tammy for live tweeting and sponsoring us at CIC!
The Lifecycle Series Quick Note: • #6: Telling The Story: Visualizing Results • Speaker: • Hjalmar Gislason • CEO of DataMarket.com • Conference Speaker • Currently writing a book for O’Reilly called Effective Data Visualization