Data analysis Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia mvee@virginia.edu Date: July 6, 2010
Outline • Increasing interest in data • Data mining • New course: From Data to Knowledge • One example data set analysis • Summary
“The data deluge” “Data, data everywhere” • Economist Special Issue Feb 27-Mar. 5, 2010 • Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes). • From businesses to governments, data collection and analysis is rapidly becoming the next big thing. • The industry of information management is growing at almost 10% a year, roughly twice as fast as the software business.
“The data deluge” • “A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.” • Hal Varian, Google’s chief economist notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”
Business intelligence • Nestle sells > 100,000 products in 200 countries using 550,000 suppliers • Problem: not using its huge buying power effectively • Used SAP software and analyzed its data • For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year • Annual savings: $1 billion Economist special issue
Medical use • Dr. McGregor from University of Ontario • Goal: spot fatal infections in premature babies • Monitors subtle changes in 7 streams of real-time data, such as heart rate, blood pressure, etc. • ECG alone takes 1000 readings/second • Infections are detected before obvious symptoms emerge • Naked eye cannot see it, but the computer can! • Who programs these? CS graduates. • Another term: Evidence Based Medicine Economist special issue
Government usage • An add-on to a 1986 law required firms to disclose the harmful chemicals they release. • When the public started tracking these numbers, by 2000, American businesses had reduced their emissions by 40% Economist special issue
Best-sellers • “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres • “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis • “The Long Tail” by Chris Anderson • “Outliers” and other books by Malcolm Gladwell • “Microtrends” by Mark Penn (elections) • “Freakonomics” by S. Dubner and S. Levitt
Moneyball example • 2002 season: The richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A’s had a payroll of less than a third of that, about $40 million. Yet the A’s had reached the playoffs three years in a row and took the Yankees close to elimination. How did they do it? • Billy Beane, general manager of the Oakland A’s • Respected statistics • Hired Paul DePodesta, Harvard MBA, who applied Bill James’ formulas and selected players based on their statistics. • Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks) • Jeremy Brown: the only player in the history of the SEC with 300 hits and 200 walks, but he was overweight • Scouts vs. statisticians! • The tendency of everyone to generalize wildly from his own experience. Most people think their own experience is typical!
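The runs-created formula on this slide can be written out as a short function. This is a minimal sketch: the function name `runs_created` and the sample season line are hypothetical, chosen only to illustrate the arithmetic, and are not taken from the book.

```python
def runs_created(hits: int, walks: int, total_bases: int, at_bats: int) -> float:
    """Basic runs-created estimate: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities <= 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, invented purely for illustration.
example = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
print(f"Runs created: {example:.1f}")
```

Note how walks count as much as hits in the first factor, which is why a player with many walks can score well on this measure even if scouts overlook him.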
Malcolm Gladwell's “Outliers” hockey players story • Why Canadian hockey players born early in the year have a big advantage; cutoff date was Jan. 1 • ESPN conducted a little study: All the NHL players from this season who were born from 1980 to 1990. • Sure enough: Many more were born early in the year than late. http://sports.espn.go.com/espn/page2/story?page=merron/081208
Examples from “The Long Tail” • Rhapsody, an online music store, which in Dec. 2005 had 1.5M tracks, reported that the number of downloads/month for even the 100,000th track was in the 1000s, whereas a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks. • Rhapsody reports that 40% of its total sales came from the Long Tail products, i.e., those not available in retail stores. • Anderson gives several such examples, calling these businesses Long-Tail aggregators • Google as the long-tail aggregator of advertising • eBay of goods • Amazon of books • Apple of music • Netflix of movies
Experts vs. intuition • Ian Ayres’ book • “The future belongs to people like Wolfers who are comfortable with both intuition and numbers” • Wolfers analyzed 44,000 college basketball games (> 16 years) • Also see Jonah Lehrer’s “How We Decide”, another bestseller Ian Ayres’ book, page 220
What Wolfers did • Plot density function of number of games that beat the Las Vegas spread • Perfect normal bell curve! • Just look at games with point spreads less than or equal to 12 • Perfect normal bell curve • Look at games with point spread > 12 • 47% chance that the favored team beat the spread (53% failed to cover the spread) • more than 20% of games fell in this category of games with >12 spreads • Is it point shaving? • Look at the score five minutes before the end of the game – right on track to beat the spread 50% of the time! • Indeed a stronger case for point shaving Ian Ayres’ book, page 216
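To see why a 47% cover rate is suspicious rather than noise, one can run a rough significance check. This is a sketch under stated assumptions: the slide only says "more than 20%" of the 44,000 games had spreads above 12, so the sample size of 8,800 is an assumed lower bound, and games are treated as independent coin flips. It is not Wolfers' actual method.

```python
import math


def z_score_for_proportion(observed: float, expected: float, n: int) -> float:
    """Number of standard errors by which an observed proportion departs
    from the expected proportion, for n independent trials."""
    standard_error = math.sqrt(expected * (1.0 - expected) / n)
    return (observed - expected) / standard_error


# Assumption: exactly 20% of the 44,000 games (the slide says "more than 20%").
n_games = 44_000 // 5
z = z_score_for_proportion(observed=0.47, expected=0.50, n=n_games)
print(f"n = {n_games}, z = {z:.2f}")
```

Under these assumptions the shortfall is well beyond two standard errors, so by the 2SD rule it is very unlikely to be chance alone.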
2SD Rule: To understand variability • There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean • Statistical significance: a simple intuitive concept. There is less than a 5% chance that a normally distributed random variable will be more than two standard deviations away from its mean. • Stanford Law School students knew that professors were required to give a 3.2 mean. They wanted to know if the professor was a “spreader” or a “clumper”! Ian Ayres’ book, page 213
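The 95% figure in the 2SD rule can be checked directly from the normal distribution. A minimal sketch using only the standard library; the helper name `prob_within` is hypothetical.

```python
import math


def prob_within(k: float) -> float:
    """P(|Z| <= k) for a standard normal Z, i.e. the chance of landing
    within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


within_2sd = prob_within(2.0)
outside_2sd = 1.0 - within_2sd
print(f"within 2 SD: {within_2sd:.4f}, outside 2 SD: {outside_2sd:.4f}")
```

The exact value is slightly above 95%, which is why the rule is a convenient rounding rather than a precise statement.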
“Margin of error” • News article says “Laverne is leading Shirley 51% to 49% with a margin of error of 2%” and so the race is a “statistical dead heat.” • This is wrong! Why? • Margin of error = 2SD • So standard deviation is 1% • Laverne’s 51% is therefore one SD above the 50% tie point • This means there is an 84% chance that Laverne leads in the polls Ian Ayres’ book, page 224
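The 84% figure follows from the normal distribution once the margin of error is converted to a standard deviation. A minimal sketch, assuming the poll estimate is normally distributed around Laverne's true support; the variable and function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


margin_of_error = 2.0          # percentage points, from the news article
sd = margin_of_error / 2.0     # margin of error = 2 SD
laverne_share = 51.0           # percent
tie_point = 50.0               # percent

# Chance that Laverne's true share is above 50%.
z = (laverne_share - tie_point) / sd
p_laverne_leads = normal_cdf(z)
print(f"SD = {sd:.1f} points, P(Laverne leads) = {p_laverne_leads:.2f}")
```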
Exercise • See if you can use the 2SD rule and just your intuition to derive a number for the standard deviation for adult male height • Estimate two things: mean and standard deviation Ian Ayres’ book, page 214
An answer • Average adult male height is 5’ 9” • To estimate SD, 95% of adult males should fall between what two heights? • Say 5’ 3” and 6’3”? • Then SD = 3” – Just a guess • Can be fairly confident that SD is not 1” or 5”!
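The guess above is just the 2SD rule run backwards: if 95% of values fall between two bounds, the bounds are about four standard deviations apart. A minimal sketch of that arithmetic; the helper names are hypothetical.

```python
def to_inches(feet: int, inches: int) -> int:
    """Convert a feet-and-inches height to inches."""
    return 12 * feet + inches


def sd_from_95_range(low: float, high: float) -> float:
    """Estimate SD from a range believed to hold 95% of values
    (mean - 2 SD to mean + 2 SD, so the width is 4 SD)."""
    return (high - low) / 4.0


low = to_inches(5, 3)    # 5' 3"
high = to_inches(6, 3)   # 6' 3"
mean_estimate = (low + high) / 2.0
sd_estimate = sd_from_95_range(low, high)
print(f"mean = {mean_estimate} in, SD = {sd_estimate} in")
```

The midpoint of the guessed range, 69 inches, matches the 5' 9" mean on the slide.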
Technology trends enabling all this data analysis • Cloud computing • Amazon, Google, Yahoo, Microsoft • Open source software • R programming language • NY Times article, Jan. 7, 2009 • Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers Economist special issue
Technology or techniques? • Moore’s Law • Processing power doubles every two years • Supercrunching does need CPUs, but computing power has been available • More important: Kryder’s Law • Storage capacity of hard drives has been doubling every two years • Chief technology officer (Mark Kryder) of hard drive manufacturer Seagate Ian Ayres’ book, page 151
Examples of data storage • Yahoo! • “Captures 12 TB of data every day” • Half the books in the Library of Congress • Costs • A TB hard drive cost $400 in 2007. Now (2010) it costs $65! • Usage • Allows every Hertz and UPS employee to use handheld machines and capture every transaction’s data Ian Ayres’ book, page 152
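The doubling law and the price figures quoted here can be compared with a little arithmetic. This is a sketch under assumptions: it treats the $400 and $65 prices as comparable drives exactly three years apart (2007 to 2010), and the function names are hypothetical.

```python
import math


def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """How much a quantity grows in `years` if it doubles every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


def halving_time_years(price_start: float, price_end: float, years: float) -> float:
    """Implied time for the price to halve, assuming a steady exponential decline."""
    halvings = math.log2(price_start / price_end)
    return years / halvings


capacity_growth_10y = growth_factor(10.0)             # doubling every two years
price_halving = halving_time_years(400.0, 65.0, 3.0)  # $400 (2007) to $65 (2010)
print(f"10-year growth: {capacity_growth_10y:.0f}x, price halves every {price_halving:.2f} years")
```

Under these assumptions the quoted prices fell faster than one halving every two years.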
Three techniques • Regressions • error term ~ N(0, σ²) • Randomization • Run experiments by treating different samples in different ways • Neural networks • Functional form is not assumed to be linear or anything specific Ian Ayres’ book
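The first technique, regression with a normally distributed error term, can be sketched as a one-variable least-squares fit. This is an illustration only: the data points are invented, the function name `fit_line` is hypothetical, and the course itself uses R rather than Python.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x + error.
    Returns (slope, intercept, sigma), where sigma estimates the standard
    deviation of the error term from the residuals."""
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # n - 2 degrees of freedom because two parameters were estimated.
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, sigma


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, sigma = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x, sigma = {sigma:.2f}")
```

Combined with the 2SD rule, the fitted sigma gives a rough band: about 95% of observations should fall within two sigma of the fitted line if the normal-error assumption holds.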
Opportunities for CS programmers • Implementing • statistical analysis techniques • machine learning techniques • neural networks • Visualization tools • Wattenberg’s idea to show a map of the market instead of graphs showing index movements • Privacy issues
Data mining • Techniques • statistical analysis • machine learning • neural networks • Examples • Walmart • NBA (basketball) analysis http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Course material • From Data to Knowledge • First offering: Fall 2010 • Focus on data sets • Less on statistical techniques • Learn R programming through class-provided R programs • http://www.ece.virginia.edu/mv/edu/D2K/index.htm
Summary • Importance of data analysis • in every walk of life! • Application area • complexity in coding mathematical techniques, visualization, privacy • Importance of computer engineering advances, e.g., storage • Teaching languages/protocols with examples? Untested