Introduction Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia mvee@virginia.edu
Outline • Increasing interest in data • Course: From Data to Knowledge • Summary
“The data deluge” “Data, data everywhere” • Economist Special Issue Feb 27-Mar. 5, 2010 • Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes): 2010 numbers • From businesses to governments, data collection and analysis is rapidly becoming the next big thing. • 2012: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all
“The data deluge” • “A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.” • Hal Varian, Google’s chief economist notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”
Business intelligence • Nestlé sells more than 100,000 products in 200 countries using 550,000 suppliers • Problem: it was not using its huge buying power effectively • Used SAP software and analyzed its data • For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year • Annual savings from such operational improvements: $1 billion Economist special issue
Medical use • Dr. Carolyn McGregor from University of Ontario • Goal: spot fatal infections in premature babies • Monitors subtle changes in 7 streams of real-time data, such as heart rate, blood pressure, etc. • ECG alone takes 1000 readings/second • Infections are detected before obvious symptoms emerge • Naked eye cannot see it, but the computer can! • Who programs these? Stats experts. • Another term: Evidence Based Medicine Economist special issue
Government usage • An add-on to a 1986 law required firms to disclose the harmful chemicals they release. • When the public started tracking these numbers, by 2000, American businesses had reduced their emissions of the chemicals covered under the law by 40% Economist special issue
Best-sellers • “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres • “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis • “The Long Tail” by Chris Anderson • “Outliers” and other books by Malcolm Gladwell • “Microtrends” by Mark Penn (elections) • “Freakonomics” by S. Dubner and S. Levitt
Moneyball example • 2002 season: the richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A’s had a payroll of less than a third of that, about $40 million. Yet the A’s had reached the playoffs three years in a row and took the Yankees close to elimination. How did they do it? • Billy Beane, general manager of the Oakland A’s • Respected statistics • Hired Paul DePodesta, Harvard MBA, who applied Bill James’ formulas and selected players based on their statistics. • Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks) • Jeremy Brown – the only player in the history of the SEC with 300 hits and 200 walks, but he was overweight • Scouts vs. statisticians! • The tendency of everyone to generalize wildly from their own experience: most people think their own experience is typical!
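The runs-created formula on this slide can be written as a small function. This is a minimal sketch in Python (the course itself uses R); the function name and the season line passed to it are hypothetical values chosen only to illustrate the arithmetic, not real player statistics.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic runs-created formula: (H + BB) * TB / (AB + BB)."""
    plate_chances = at_bats + walks
    if plate_chances <= 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / plate_chances


# Hypothetical season line, invented for illustration only.
example = runs_created(hits=150, walks=60, total_bases=240, at_bats=500)
print(example)  # (150 + 60) * 240 / (500 + 60) = 90.0
```

Note how walks count in the numerator exactly like hits, which is why a patient hitter can score well on this measure even if scouts are unimpressed.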
Malcolm Gladwell's “Outliers” hockey players story • Why Canadian hockey players born early in the year have a big advantage: the age cutoff date was Jan. 1 • ESPN conducted a little study of all the 2008 season NHL players who were born from 1980 to 1990. [Later disputed for 2011 players] • Sure enough: many more were born early in the year than late. http://sports.espn.go.com/espn/page2/story?page=merron/081208
Examples from “The Long Tail” • Rhapsody, an online music store, which in Dec. 2005 had 1.5M tracks, reported that the number of downloads/month for even the 100,000th track was in the thousands, whereas a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks. • Rhapsody reports that 40% of its total sales came from Long Tail products, i.e., those not available in retail stores. • Anderson gives several such examples, calling these businesses Long-Tail aggregators • Google as the long-tail aggregator of advertising • eBay of goods • Amazon of books • Apple of music • Netflix of movies
Experts vs. intuition • Ian Ayres’ book • “The future belongs to people like Wolfers who are comfortable with both intuition and numbers” • Wolfers analyzed 44,000 college basketball games (over 16 years) • Also see Jonah Lehrer’s “How We Decide” – another bestseller Ian Ayres’ book, page 220
What Wolfers did • Plot density function of number of games that beat the Las Vegas spread • Perfect normal bell curve! • Just look at games with point spreads less than or equal to 12 • Perfect normal bell curve • Look at games with point spread > 12 • 47% chance that the favored team beat the spread (53% failed to cover the spread) • more than 20% of games fell in this category of games with >12 spreads • Is it point shaving? • Look at the score five minutes before the end of the game – right on track to beat the spread 50% of the time! • Indeed a stronger case for point shaving Ian Ayres’ book, page 216
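To see why a 47% cover rate is striking, the 2SD idea can be applied to a proportion. The sketch below assumes that each game is an independent coin flip under the "no point shaving" hypothesis, and it assumes a sample of 8,800 games, which is simply 20% of the 44,000 games mentioned in the deck (a lower bound, since the slide says "more than 20%"). The function name is hypothetical.

```python
import math


def proportion_z_score(observed_share, n_games, expected_share=0.5):
    """How many standard deviations the observed share lies from the expected share."""
    sd = math.sqrt(expected_share * (1.0 - expected_share) / n_games)
    return (observed_share - expected_share) / sd


# ASSUMPTION: 8,800 games = 20% of 44,000; the exact count is not given on the slide.
z = proportion_z_score(observed_share=0.47, n_games=8800)
print(f"z = {z:.2f}")  # far beyond -2, so chance alone is an unlikely explanation
```

Under these assumptions the gap is well over five standard deviations, far outside the two-standard-deviation band that would be consistent with chance.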
2SD Rule: To understand variability • There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean • Statistical significance – a simple, intuitive concept – there is a less than 5% chance that a normally distributed variable will be more than two standard deviations away from its mean. • Stanford Law School students knew that professors were required to give a 3.2 mean. They wanted to know whether a professor was a “spreader” (large standard deviation in grades) or a “clumper” (small standard deviation)! Ian Ayres’ book, page 221
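The 95% figure in the 2SD rule can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `prob_within` is an illustrative choice.

```python
import math


def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2.0))


print(f"within 1 SD: {prob_within(1):.4f}")
print(f"within 2 SD: {prob_within(2):.4f}")  # about 0.9545, i.e. roughly 95%
```

The exact value for two standard deviations is slightly above 95%, which is why the rule is a convenient approximation rather than an identity.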
“Margin of error” • A news article says “Laverne is leading Shirley 51% to 49% with a margin of error of 2%” and so the race is a “statistical dead heat.” • Ayres declares this “balderdash!” Why? • Margin of error = 2SD • So the standard deviation is 1% • Laverne’s 51% is one standard deviation above 50%, so there is an 84% chance that Laverne leads (i.e., has more than 50% of the vote) Ian Ayres’ book, page 224
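The 84% figure follows from the normal distribution: with a standard deviation of 1%, the 50% mark sits one standard deviation below the 51% estimate. A minimal sketch of that calculation, assuming the true vote share is normally distributed around the poll result (the function name and its parameters are illustrative):

```python
from statistics import NormalDist


def prob_leading(poll_share, margin_of_error, threshold=50.0):
    """Chance the true share exceeds the threshold, taking margin of error = 2 SD."""
    sd = margin_of_error / 2.0
    return 1.0 - NormalDist(mu=poll_share, sigma=sd).cdf(threshold)


p = prob_leading(poll_share=51.0, margin_of_error=2.0)
print(f"{p:.2f}")  # 0.84
```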
Exercise • See if you can use the 2SD rule and just your intuition to derive a number for the standard deviation for adult male height • Estimate two things: mean and standard deviation Ian Ayres’ book, page 214
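One way to approach the exercise is to guess the range that contains about 95% of adult men and work backwards with the 2SD rule: the mean is the midpoint, and the standard deviation is a quarter of the range. The sketch below uses a hypothetical guess of 64 to 76 inches, which is an illustrative intuition and not a figure from the book or from measured data.

```python
def estimate_mean_sd(low, high):
    """Invert the 2SD rule: [low, high] is taken to cover mean +/- 2 SD."""
    mean = (low + high) / 2.0
    sd = (high - low) / 4.0
    return mean, sd


# ASSUMPTION: a gut-feel guess that about 95% of adult men are 64 to 76 inches tall.
mean, sd = estimate_mean_sd(64, 76)
print(mean, sd)  # 70.0 3.0
```

Trying a few different guessed ranges shows how sensitive the implied standard deviation is to one's intuition about the extremes.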
Technology trends enabling all this data analysis • Cloud computing • Amazon, Google, Yahoo, Microsoft • Open source software • R programming language • NY Times article, Jan. 7, 2009 • Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers Economist special issue
Technology or techniques? • Moore’s Law • Processing power doubles every two years • Supercrunching does need CPUs, but computing power has been available • More important: Kryder’s Law • Storage capacity of hard drives has been doubling every two years • Named after Mark Kryder, chief technology officer of the hard drive manufacturer Seagate Ian Ayres’ book, page 151
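Doubling every two years compounds quickly. As a rough sketch, taking the two-year doubling period stated on the slide at face value, the growth factor after a given number of years is 2 raised to (years / 2). The function name is an illustrative choice.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Multiplicative growth after `years` if capacity doubles every doubling period."""
    return 2.0 ** (years / doubling_period_years)


for years in (2, 10, 20):
    print(years, growth_factor(years))  # 2 -> 2.0, 10 -> 32.0, 20 -> 1024.0
```

A decade of doubling gives a 32-fold increase, and two decades give roughly a thousand-fold increase, which is the scale of change that makes storing and crunching large data sets routine.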
Three techniques • Regressions • error term ~ N(0, σ²) • Randomization • Run experiments by treating different samples in different ways • Neural networks • Functional form is not assumed to be linear or anything specific Ian Ayres’ book
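The regression idea can be illustrated with a simple least-squares line fit on simulated data whose error term is drawn from a normal distribution with mean 0. This is a minimal sketch in Python rather than the R used in the course; the true intercept, slope, and noise level are assumptions invented for the simulation, and `fit_line` is a hypothetical helper, not a library function.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# ASSUMPTION: simulated data, y = 1 + 2x + error, with error ~ N(0, 0.5^2).
rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

With 100 points the fitted slope should land close to the true value of 2, and rerunning with a larger noise level shows how the error term blurs the estimate.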
Course material • From Data to Knowledge • Focus on data sets • Less on details of statistical techniques • Learn R programming through class-provided R programs and assignments • http://www.ece.virginia.edu/mv/edu/D2K/index.htm
Summary • Importance of data analysis • in every walk of life! • How to extract the “story” hidden in the data set?