
Introduction

Introduction . Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia mvee@virginia.edu. Outline. Increasing interest in data Course: From Data to Knowledge Summary. “The data deluge” “Data, data everywhere”.



Presentation Transcript


  1. Introduction Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia mvee@virginia.edu

  2. Outline • Increasing interest in data • Course: From Data to Knowledge • Summary

  3. “The data deluge” “Data, data everywhere” • Economist Special Issue Feb 27-Mar. 5, 2010 • Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes): 2010 numbers • From businesses to governments, data collection and analysis is rapidly becoming the next big thing. • 2012: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all

  4. “The data deluge” • “A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.” • Hal Varian, Google’s chief economist notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”

  5. Business intelligence • Nestle sells more than 100,000 products in 200 countries using 550,000 suppliers • Problem: it was not using its huge buying power effectively • Used SAP software to analyze its data • For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year • Annual savings from such operational improvements: $1 billion • Source: Economist special issue

  6. Medical use • Dr. Carolyn McGregor from University of Ontario • Goal: spot fatal infections in premature babies • Monitors subtle changes in 7 streams of real-time data, such as heart rate, blood pressure, etc. • ECG alone takes 1000 readings/second • Infections are detected before obvious symptoms emerge • Naked eye cannot see it, but the computer can! • Who programs these? Stats experts. • Another term: Evidence Based Medicine • Source: Economist special issue

  7. Government usage • An add-on to a 1986 law required firms to disclose the harmful chemicals they release. • Once the public started tracking these numbers, American businesses reduced their emissions of the chemicals covered under the law by 40% by 2000. • Source: Economist special issue

  8. Best-sellers • “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres • “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis • “The Long Tail” by Chris Anderson • “Outliers” and other books by Malcolm Gladwell • “Microtrends” by Mark Penn (elections) • “Freakonomics” by S. Dubner and S. Levitt

  9. Moneyball example • 2002 season: the richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A’s had a payroll of less than a third of that, about $40 million. Yet the A’s had reached the playoffs three years in a row and took the Yankees close to elimination. How did they do it? • Billy Beane, general manager of the Oakland A’s • Respected statistics • Hired Paul DePodesta, Harvard MBA, who applied Bill James’ formulas and selected players based on their statistics. • Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks) • Jeremy Brown: the only player in the history of the SEC with 300 hits and 200 walks, but he was overweight • Scouts vs. statisticians! • The tendency of everyone to generalize wildly from his own experience. Most people think their own experience is typical!
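
The runs-created formula on slide 9 can be sketched in a few lines of Python, reading it as (Hits + Walks) × Total Bases / (At Bats + Walks). The function name `runs_created` and the season line used below are hypothetical, invented only to show the arithmetic; they are not statistics for any real player.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic runs-created estimate: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities <= 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, made up purely for illustration.
example = runs_created(hits=150, walks=60, total_bases=240, at_bats=500)
print(round(example, 1))
```

Note that walks appear in both the numerator and the denominator, so a hitter who draws many walks is credited for reaching base even without a hit.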

  10. Malcolm Gladwell's “Outliers”: the hockey players story • Why Canadian hockey players born early in the year have a big advantage; the cutoff date was Jan. 1 • ESPN conducted a small study of all 2008-season NHL players born from 1980 to 1990 [later disputed for 2011 players] • Sure enough, many more were born early in the year than late. http://sports.espn.go.com/espn/page2/story?page=merron/081208

  11. Examples from “The Long Tail” • Rhapsody, an online music store, which in Dec. 2005 had 1.5M tracks, reported that the number of downloads/month for even the 100,000th track was in the 1000s, when a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks. • Rhapsody reports that 40% of its total sales came from the Long Tail products, i.e., those not available in retail stores. • Anderson gives several such examples, calling these businesses Long-Tail aggregators • Google as the long-tail aggregator of advertising • eBay of goods • Amazon of books • Apple of music • Netflix of movies

  12. Experts vs. intuition • Ian Ayres’ book • “The future belongs to people like Wolfers who are comfortable with both intuition and numbers” • Wolfers analyzed 44,000 college basketball games (over 16 years) • Also see Jonah Lehrer’s “How We Decide”, another bestseller • Source: Ian Ayres’ book, page 220

  13. What Wolfers did • Plot density function of number of games that beat the Las Vegas spread • Perfect normal bell curve! • Just look at games with point spreads less than or equal to 12 • Perfect normal bell curve • Look at games with point spread > 12 • 47% chance that the favored team beat the spread (53% failed to cover the spread) • More than 20% of games fell in this category of games with >12 spreads • Is it point shaving? • Look at the score five minutes before the end of the game: right on track to beat the spread 50% of the time! • Indeed a stronger case for point shaving • Source: Ian Ayres’ book, page 216
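
To see why a 47% cover rate is striking, here is a minimal sketch of how far it lies from a fair 50% in standard-deviation units. The slide only says "more than 20%" of the 44,000 games had spreads above 12, so the game count of 8,800 used here is an assumption, as is the function name `z_score_for_proportion`.

```python
import math


def z_score_for_proportion(observed, n, null_p=0.5):
    """Number of standard errors between an observed proportion and null_p."""
    standard_error = math.sqrt(null_p * (1 - null_p) / n)
    return (observed - null_p) / standard_error


# Assumption: exactly 20% of the 44,000 games (the slide says "more than 20%").
n_games = 44_000 // 5
z = z_score_for_proportion(0.47, n_games)
print(n_games, round(z, 2))
```

Under this assumption the observed rate sits more than five standard errors below 50%, far outside the two-standard-deviation band that chance alone would usually produce.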

  14. 2SD Rule: To understand variability • There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean • Statistical significance is a simple, intuitive concept: there is less than a 5% chance that a normally distributed random variable will be more than two standard deviations away from its mean. • Stanford Law School students knew that professors were required to give a 3.2 mean. They wanted to know if the professor was a “spreader” or a “clumper”! • Source: Ian Ayres’ book, page 221
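
The 2SD rule on slide 14 can be checked numerically with the standard library. This is a small sketch; the helper name `prob_within` is an assumption, and it relies on the identity P(|Z| ≤ k) = erf(k / √2) for a standard normal Z.

```python
import math


def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


print(round(prob_within(2), 4))     # 0.9545
print(round(prob_within(1.96), 4))  # 0.95
```

So "two standard deviations" is a convenient rounding: the exact 95% band is about 1.96 standard deviations wide on each side.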

  15. “Margin of error” • News article says “Laverne is leading Shirley 51% to 49% with a margin of error of 2%” and so the race is a “statistical dead heat.” • Ayres declares this “balderdash!” Why? • Margin of error = 2SD • So the standard deviation is 1% • Laverne’s 51% is one standard deviation above 50%, so there is an 84% chance that Laverne leads (i.e., has more than 50% of the vote) • Source: Ian Ayres’ book, page 224

  16. P(X ≤ 1) = P(X ≥ −1) ≈ 0.84, where X ~ N(0, 1)
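
The 84% figure from slides 15 and 16 can be reproduced as follows. This is a sketch under the slides' own assumptions (normally distributed poll error, margin of error equal to two standard deviations); the helper name `normal_cdf` and the variable names are illustrative.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


margin_of_error = 2.0            # percentage points, as quoted in the poll
poll_sd = margin_of_error / 2    # margin of error = 2 SD, so SD = 1 point
poll_share = 51.0                # Laverne's share in the poll

# Probability that Laverne's true share exceeds 50%.
lead_prob = 1 - normal_cdf(50.0, mean=poll_share, sd=poll_sd)
print(round(lead_prob, 2))  # 0.84
```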

  17. Exercise • See if you can use the 2SD rule and just your intuition to derive a number for the standard deviation for adult male height • Estimate two things: mean and standard deviation • Source: Ian Ayres’ book, page 214
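
One way to approach the exercise on slide 17 is to guess a range that holds about 95% of the population and work backwards with the 2SD rule. The sketch below assumes a symmetric, roughly normal distribution; the function name and the 64 to 76 inch range are hypothetical guesses for illustration, not measured data.

```python
def mean_and_sd_from_range(low, high):
    """Turn a guessed 95% range into a mean and SD using the 2SD rule."""
    if high <= low:
        raise ValueError("high must be greater than low")
    mean = (low + high) / 2   # midpoint of the range
    sd = (high - low) / 4     # the range spans 4 SDs (2 on each side)
    return mean, sd


# Hypothetical intuition: about 95% of adult men are 64 to 76 inches tall.
height_mean, height_sd = mean_and_sd_from_range(64, 76)
print(height_mean, height_sd)  # 70.0 3.0
```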

  18. Technology trends enabling all this data analysis • Cloud computing • Amazon, Google, Yahoo, Microsoft • Open source software • R programming language • NY Times article, Jan. 7, 2009 • Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers • Source: Economist special issue

  19. Technology or techniques? • Moore’s Law • Processing power doubles every two years • Supercrunching does need CPUs, but computing power has been available • More important: Kryder’s Law • Storage capacity of hard drives has been doubling every two years • Named after Mark Kryder, chief technology officer of hard drive manufacturer Seagate • Source: Ian Ayres’ book, page 151
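
Doubling every two years compounds quickly. A minimal sketch of the arithmetic behind both laws on slide 19, assuming a constant doubling period (the function name `growth_factor` is illustrative):

```python
def growth_factor(years, doubling_period_years=2.0):
    """Multiplicative growth after `years` at a fixed doubling period."""
    return 2 ** (years / doubling_period_years)


print(growth_factor(10))  # 32.0
print(growth_factor(20))  # 1024.0
```

Under this assumption, a decade brings a 32-fold increase and two decades roughly a thousand-fold increase.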

  20. Three techniques • Regressions • error term ~ N(0, σ²) • Randomization • Run experiments by treating different samples in different ways • Neural networks • Functional form is not assumed to be linear or anything specific • Source: Ian Ayres’ book
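
As an illustration of the first technique, here is a minimal sketch of a one-variable least-squares regression on synthetic data with a normally distributed error term. The true intercept, slope, and noise level are assumptions chosen for the demo, and `fit_line` is an illustrative helper, not code from the course.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data: y = 1 + 2x + noise, with noise drawn from N(0, 0.5^2).
rng = random.Random(0)  # fixed seed so the run is repeatable
true_intercept, true_slope, sigma = 1.0, 2.0, 0.5
xs = [i / 10 for i in range(200)]
ys = [true_intercept + true_slope * x + rng.gauss(0, sigma) for x in xs]

intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

The fitted values should land close to the assumed intercept and slope; the gap between them reflects the noise in the error term.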

  21. Course material • From Data to Knowledge • Focus on data sets • Less on details of statistical techniques • Learn R programming through class-provided R programs and assignments • http://www.ece.virginia.edu/mv/edu/D2K/index.htm

  22. Summary • Importance of data analysis • in every walk of life! • How to extract the “story” hidden in the data set?
