Explore how big data impacts national accounts in the US, from using private sources to concerns about data quality and timeliness. Learn about innovative data sources like administrative data and web scraping.
Big Data in the National Accounts: Experience in the United States
Brent Moulton
Advisory Expert Group on National Accounts
Washington, DC
9 September 2014
What are big data?
• Wikipedia: “Any collection of data sets so large and complex that it becomes difficult to process using… traditional data processing applications.”
• IBM: “Every day we create 2.5 quintillion bytes of data… This data comes from everywhere… This is big data.”
• Forbes: “12 big data definitions: what’s yours?”
  • #11: “The belief that the more data you have, the more insights and answers will arise automatically from the pool”
  • #12: “A new attitude… that combining data from multiple sources could lead to better decisions.”
Big data and official statistics
• Statistical agencies as producers of big data
  • Consistency in format and presentation
  • Catalogued in common, machine-readable format
  • Accessible in bulk
  • Desirable to make government data available on a single platform
• Big data as source data for national accounts
  • Administrative data, especially micro-data
  • Data from private sources
  • Web scraping
Concerns about using big data
• Do the concepts match those needed for national accounts?
• How representative are the data?
  • Selection biases
  • Is it possible to fill the gaps in coverage?
• Do the data provide consistent time series and classifications?
• How timely are the data?
• How cost effective?
Defined-benefit pension funds
• For the SNA’s new treatment of defined-benefit pensions, BEA found it useful to work with administrative micro-data filed by pension funds
• “Form 5500” data from Pension Benefit Guaranty Corporation
• ~ 45,000 records per year covering 98% of private pension funds
• BEA had to edit data to remove data errors and anomalies
Private source data for early estimates
• For the “advance” GDP estimate (released about 30 days after the end of the quarter), official monthly/quarterly indicators are not always available
• Examples of private source data used by BEA:
  • Ward’s/J.D. Power/Polk (auto sales/price/registrations)
  • American Petroleum Institute (oil drilling)
  • Air Transport Association of America (airlines)
  • Variety magazine (motion picture admissions)
  • Smith Travel Research (hotels and motels)
  • Investment Company Institute (mutual fund sales)
Health care satellite account
• Schultze Commission (At What Price?, 2002) recommended that health care price indexes be based on the cost of treating a specific diagnosis
• BEA is preparing a health care satellite account (http://www.bea.gov/national/health_care_satellite_account.htm)
• One approach uses insurance claims data for several million insured individuals
  • Claims grouped into disease episodes
  • Allows comparison of change in cost for treating particular diseases
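The episode-based comparison described on this slide can be illustrated with a minimal sketch. It assumes a simplified definition of an episode (all claims for one patient, one disease, one year) and uses invented records; the field layout and the function names `cost_per_episode` and `cost_change` are hypothetical and are not taken from BEA's methodology.

```python
from collections import defaultdict

# Hypothetical claim records: (patient_id, disease, year, cost).
# All values are invented for illustration; they are not BEA data.
CLAIMS = [
    ("p1", "diabetes", 2010, 400.0),
    ("p1", "diabetes", 2010, 600.0),
    ("p2", "diabetes", 2010, 1000.0),
    ("p1", "diabetes", 2011, 1100.0),
    ("p2", "diabetes", 2011, 1100.0),
    ("p3", "asthma", 2010, 500.0),
    ("p3", "asthma", 2011, 450.0),
]


def cost_per_episode(claims):
    """Return the average episode cost keyed by (disease, year).

    Assumption: an episode is every claim for the same patient,
    disease, and year. Real episode groupers are more elaborate.
    """
    # Step 1: sum individual claims into episodes.
    episode_cost = defaultdict(float)
    for patient, disease, year, cost in claims:
        episode_cost[(patient, disease, year)] += cost

    # Step 2: average episode costs within each disease and year.
    totals = defaultdict(lambda: [0.0, 0])
    for (_, disease, year), cost in episode_cost.items():
        totals[(disease, year)][0] += cost
        totals[(disease, year)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def cost_change(avg_cost, disease, base_year, year):
    """Ratio of average episode cost in `year` to that in `base_year`."""
    return avg_cost[(disease, year)] / avg_cost[(disease, base_year)]
```

Calling `cost_change(cost_per_episode(CLAIMS), "diabetes", 2010, 2011)` gives the relative change in the cost of treating one disease, which is the kind of comparison a disease-based price index builds on.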
Local area tracking system
• Used by BEA’s regional accounts staff for independent data on regional economies
• Used to vet official statistics before publishing
• Types of data
  • Employment data: largest employers, principal industries, recent layoffs
  • Natural events affecting the economy
  • Local real estate and financial trends
• Automated using web scraping methods
  • Identifying key word searches
  • Archiving relevant articles
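The key word search and archiving steps might look roughly like the sketch below. It works on articles that have already been downloaded, so no scraping library is involved; the `Article` fields, the key word list, and the sample articles are all hypothetical and do not describe BEA's actual system.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    text: str
    region: str


# Hypothetical key words an analyst might track.
KEYWORDS = ("layoff", "plant closing", "flood", "foreclosure")

# Invented sample articles, standing in for scraped pages.
SAMPLE_ARTICLES = [
    Article("Auto plant announces layoffs",
            "The largest employer in the county will cut 300 jobs.", "Ohio"),
    Article("County fair opens", "Record attendance expected.", "Ohio"),
    Article("River flooding closes downtown",
            "Businesses shut for a week.", "Iowa"),
]


def matched_keywords(article, keywords=KEYWORDS):
    """Return the key words that start a word in the title or text."""
    haystack = f"{article.title} {article.text}".lower()
    return [kw for kw in keywords
            if re.search(r"\b" + re.escape(kw), haystack)]


def archive_relevant(articles, keywords=KEYWORDS):
    """Keep only articles with at least one hit, grouped by region."""
    archive = {}
    for article in articles:
        hits = matched_keywords(article, keywords)
        if hits:
            archive.setdefault(article.region, []).append((article.title, hits))
    return archive
```

Matching on word prefixes lets "layoff" also catch "layoffs" and "flood" catch "flooding", at the cost of occasional false positives that an analyst would screen out when vetting the archive.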
BEA research on depreciation
• Identifying depreciation in the presence of obsolescence is a long-standing issue
• BEA research on motor vehicle depreciation proposes to address this problem using data on “build dates,” which can differ from model years
• Data scraping: VIN-level data from decodethis.com combined with auction data from NADA and data from other auto websites
• Goal is improved estimates of depreciation
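To show why a build date matters, the sketch below measures vehicle age from the build date and fits a constant geometric depreciation rate to auction prices. The geometric form, the log-linear least-squares fit, and the function names are assumptions made for illustration; the slide does not say which model the BEA research uses.

```python
import math
from datetime import date


def age_in_years(build_date, sale_date):
    """Vehicle age at sale, measured from the build date rather than
    from the model year."""
    return (sale_date - build_date).days / 365.25


def geometric_rate(ages, prices):
    """Estimate an annual depreciation rate delta, assuming
    price = p0 * (1 - delta) ** age.

    Fits log(price) on age by ordinary least squares and converts
    the slope back into a rate.
    """
    n = len(ages)
    logs = [math.log(p) for p in prices]
    mean_age = sum(ages) / n
    mean_log = sum(logs) / n
    sxy = sum((a - mean_age) * (l - mean_log) for a, l in zip(ages, logs))
    sxx = sum((a - mean_age) ** 2 for a in ages)
    slope = sxy / sxx
    return 1.0 - math.exp(slope)
```

For example, a car built on 1 July 2010 and sold on 1 July 2012 has `age_in_years(date(2010, 7, 1), date(2012, 7, 1))` of about 2.0, whichever model year it carries.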
Conclusions
• Big data will become increasingly important
• Priority to improving data quality, filling gaps, and keeping up with changing economy
• Big data especially useful for research projects
• Big data may allow for more timely or higher frequency estimates
• Attention must continue to be paid to traditional data quality issues