Feature Engineering Studio • September 23, 2013
Welcome to Mucking Around Day
Sort into pairs • Partner with the person next to you • One group of 3 is allowed
Sort into pairs • Do we have a group of 3? • One of the 3 will work with me
Sort into pairs • Go over your reports together • A maximum of 5 minutes apiece
Who here found something really cool while mucking around? • Show us, tell us
Who here found a histogram with a normal distribution? • Show us, tell us
Who here found a histogram with a hypermode? • Show us, tell us
Who here found a histogram with a flat distribution? • Show us, tell us
Who here found a histogram with a skewed distribution? • Show us, tell us
Who here found a histogram with a bimodal distribution? • Show us, tell us
Who here found a histogram with something else interesting? • Show us, tell us
Who here found something surprising with their min, max, average, stdev?
Categorical variables • Who here found something curious, weird, or interesting in the distribution of their categorical variables?
Who here hasn’t spoken yet? (and analyzed data) • Tell us something interesting you found in your data
Who here played with pivot tables? • What did you learn?
My turn to play with pivot tables • Who wants to volunteer their data? • (I might request a 2nd or 3rd data set, depending on how the 1st one goes)
Who here played with vlookup? • What did you learn?
My turn to play with vlookup • Using the same volunteered data set(s)
Other cool things you can create with a few simple formulas (plus demos!)
Comparing earlier behaviors to later behaviors through caching
Assignment 3 • Feature Engineering 1: “Bring Me a Rock” • Get your data set • Open it in Excel • Create as many features as you feel inspired to create • Features should be created with the goal of predicting your ground truth variable • At least 12 separate features that are not just variations on a theme (e.g., “time for last 3 actions” and “time for last 4 actions” are variations on a theme, but “time for last 3 actions” and “total time between help requests and next action” are two separate features) • For each feature, write a 1-3 sentence “just so story” for why it might work • Test how good each feature is
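As an illustration of the kind of feature named above, here is a minimal Python sketch of “time for last 3 actions”. The assignment itself is done in Excel; this is only a sketch for readers who prefer code. The log layout (student ID plus seconds taken, in chronological order), the function name, and the choice to count the current action as part of the window are all assumptions, not part of the assignment.

```python
from collections import defaultdict, deque


def time_for_last_n_actions(rows, n=3):
    """Total time of each student's most recent n actions.

    rows: iterable of (student_id, seconds_taken), in chronological order
          (hypothetical layout; adapt to your own data set).
    Returns one value per row. The current action counts toward the window
    (an assumption); rows with fewer than n actions so far get None.
    """
    recent = defaultdict(lambda: deque(maxlen=n))  # per-student sliding window
    feature = []
    for student, seconds in rows:
        window = recent[student]
        window.append(seconds)  # oldest value drops out once the window is full
        feature.append(sum(window) if len(window) == n else None)
    return feature


# Made-up example log with two interleaved students.
log = [
    ("s1", 4.0), ("s1", 10.0), ("s2", 3.0),
    ("s1", 2.0), ("s1", 6.0), ("s2", 5.0), ("s2", 1.0),
]
feature = time_for_last_n_actions(log)
print(feature)
```

Keeping a separate window per student matters: without it, the feature would mix one student's actions with another's whenever their rows are interleaved in the log.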
Testing Feature Goodness • For this assignment, there are a bunch of ways to test feature goodness • Single-feature prediction models in a data mining or stats package, giving correlation or kappa (special session this Wednesday) • Compute correlation in Excel (want to see?) • You can do this with binary variables too, although it’s not really optimal • Compute t-test in Excel (want to see?) • Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
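The three checks listed above can also be sketched in a few lines of Python, for anyone who wants to cross-check their Excel or RapidMiner numbers. This is a hedged sketch, not the course's method: the data are made up, the function names are hypothetical, and the t statistic shown is Welch's unequal-variance version, which is an assumption since the slide does not say which t-test to use.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(xs, ys):
    """Pearson correlation. With a 0/1 ground truth this is the
    point-biserial correlation (workable, though not really optimal)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cohens_kappa(pred, truth):
    """Cohen's kappa: agreement between predictions and labels,
    corrected for the agreement expected by chance."""
    n = len(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    labels = set(pred) | set(truth)
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


def welch_t(a, b):
    """Welch's t statistic for two groups (sample variances, unequal n allowed)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Made-up data: one numeric feature and a binary ground truth variable.
feature = [1, 2, 3, 4, 5, 6]
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]  # e.g. a single-feature model's predictions

r = pearson_r(feature, truth)
kappa = cohens_kappa(pred, truth)
t = welch_t([f for f, y in zip(feature, truth) if y == 1],
            [f for f, y in zip(feature, truth) if y == 0])
print(round(r, 3), round(kappa, 3), round(t, 3))
```

Correlation and the t statistic work directly on the raw feature, while kappa needs categorical predictions, which is why it is easier to get from a single-feature model than from a formula.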
Were you right? • Which of your “just so stories” seem to be correct? • Did any of your features correlate in the opposite direction from what you expected?
Assignment 3 • Write a brief report for me • Email me an Excel sheet with your features • You don’t need to prepare a presentation • But be ready to discuss your features in class
Next Classes • 9/25 Special Session • Using RapidMiner to Produce Prediction Models • Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool) • Statistical significance tests using linear regression don’t count… • 9/30 Advanced Feature Distillation in Excel • Assignment 3 due • Online Equation Solver Tutorials should be in your INBOX
Upcoming Classes • 10/2 Special session on prediction models • Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is • 10/7 Advanced Feature Distillation in Google Refine • 10/9 Special session? TBD.