Advisory Expert Group
Big Data
Statistics Canada
Outline
• Big data and the National Accounts
• Establishing the right infrastructure
• Lessons learned: case studies from Statistics Canada
  • Traditional big data
  • Scanner data
  • Electricity consumption
  • Credit card and Interac
  • Remote sensing
Statistics Canada • Statistique Canada
Big data and the National Accounts
• From a business perspective: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization…" (Gartner 2012, via Wikipedia)
• From an NSO perspective: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing and that could reduce respondent burden, increase quality, develop new statistical products or enhance the detail of existing statistical products…" (????)
Big data and the National Accounts
• Mick Couper of the University of Michigan’s Survey Research Center cites the following limitations that NSOs will face when confronting big data:
  • lack of covariates in the datasets;
  • self-selection and self-reporting biases;
  • lack of stability;
  • privacy issues;
  • access issues;
  • opportunity for mischief;
  • size issues; and
  • selective reporting of results (file drawer problem).
• You could add to that:
  • Sustainability: data sources disappear, systems change, perceptions change.
• Couper, Mick P., Is the Sky Falling: New Technology, Changing Media, and the Future of Surveys. (Presentation, European Survey Research Association, 5th Conference, Ljubljana, Slovenia, July 2013)
Big data and the National Accounts
• There needs to be up-front acknowledgement that we are trying to fit a square peg in a round hole.
• The needs of business (big data to increase business intelligence) and of national accountants (big data to produce comprehensive macroeconomic statistics) are quite different.
Putting in place the appropriate infrastructure
• To determine how best to leverage big data, an NSO needs to put in place the proper infrastructure to:
  • Obtain the data
  • Process the data
  • Evaluate the data
  • Integrate the data
Putting in place the appropriate infrastructure: Obtaining the data
• Use of legislation: e.g., Section 13 of Canada’s Statistics Act states that “A person having the custody or charge of any documents or records that are maintained in any department or in any municipal office, corporation, business or organization, from which information sought in respect of the objects of this Act can be obtained or that would aid in the completion or correction of that information, shall grant access thereto for those purposes to a person authorized by the Chief Statistician to obtain that information or aid in the completion or correction of that information.” 1970-71-72, c. 15, s. 12.
• Memoranda of understanding (MOUs), which outline:
  • Roles and responsibilities
  • Delivery mechanism
  • Uses of data
  • Termination of the agreement
• Purchasing big data
  • Many firms sell big data that can be used for business intelligence; it could also be purchased for statistical purposes. Under what conditions and terms should NSOs purchase big data?
Putting in place the appropriate infrastructure: Processing the data
• File transfer system: NSOs need a secure, high-capacity file transfer system to transfer data from the data provider to the NSO.
• Storage and processing capacity: In most NSOs (especially NA divisions) the processing capacity for big data does not exist.
• Software: Statistics Canada is leveraging the SAS distributed computing solution called “SAS Grid” to shorten the time needed to process and analyze its larger data holdings. Also, the Data Analysis Resource Center at Statistics Canada maintains a research computer with analytical software installed, offering a wide range of add-ons that provide advanced analytical and visualization tools particular to big data analytics.
• Information management policies: Access, privacy, confidentiality, retention
Putting in place the appropriate infrastructure: Evaluating the data
• Big data community of practice
  • There needs to be a structure in place that allows analysts and programs to gain knowledge and share experiences with respect to big data, to engage with colleagues internally or externally when needed, and to report findings to senior managers when appropriate.
• Big data needs to be evaluated with respect to its:
  • Quality
  • Coverage
  • Timeliness
  • Detail
  • Regularity
• In order to leverage big data we need to develop a research and development orientation.
Examples of big data research at Statistics Canada: International merchandise trade statistics
• Collection/access agreement: Access to detailed customs data is governed by two memoranda of understanding: one with the Canada Revenue Agency and one with the U.S. Census Bureau
• Cost: Nil
• Dimensions: 1.5 terabytes, 60 attributes
• Uses: Balance of Payments, International Merchandise Trade Statistics
• Timeliness: 35 days following the reference period
• Frequency: Daily, if required
• Potential uses: Creating an importer and exporter characteristics file that can be used to analyze the entry and exit of Canadian traders within the Canadian economy, and in studies of globalization, global production, goods for processing and foreign affiliate statistics.
Examples of big data research at Statistics Canada: Taxation statistics
• Collection/access agreement: Access to detailed taxation statistics is governed by a memorandum of understanding with the Canada Revenue Agency.
• Cost: Approximately $1.6 million
• Dimensions: 6 terabytes and growing
• Uses: Benchmark estimates of wages and salaries; output; property incomes, taxes, etc.
• Timeliness: Earliest use is 45 days following the reference period
• Frequency: Mainly annual, some monthly (goods and services taxation statistics)
• Potential uses: Creation of a National Accounts longitudinal file, a business-level micro-data file that can be used to undertake studies such as GDP by city, GDP by firm size and productivity by firm size.
Examples of big data research at Statistics Canada: Government finance statistics
• Collection/access agreement: No formal agreement in place; institutional understanding between Statistics Canada and the government jurisdictions.
• Cost: Nil
• Dimensions: 40 million financial transactions, 200 GB
• Uses: Government Finance Statistics; government sector of the National Accounts
• Timeliness: Earliest is 15 days following the reference period.
• Frequency: Monthly, quarterly, annual
• Potential uses: Local government remains a ‘survey of municipalities’; access to electronic files will increase our ability to provide CMA-level data as well as more detailed revenue and expenditure data. Potential data uses for the health, education and justice programs.
Examples of big data research at Statistics Canada: Electronic household transactions (credit and debit)
• Collection/access agreement: Memorandum of understanding outlining the roles and responsibilities of both Statistics Canada and the data provider.
• Cost: Nil
• Dimensions: “Aggregated” big data: number of transactions and value of transactions, aggregated by merchant group, by place of transaction (domestic, international) and by class of transactor (personal or commercial).
• Uses: Indicator for household final consumption expenditure and international travel abroad
• Timeliness: Earliest is 15 days following the reference period.
• Frequency: Monthly
• Potential uses: International travel services, monthly household final consumption expenditure.
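The "aggregated" layout described on this slide (transaction counts and values by merchant group, place of transaction and class of transactor) can be illustrated with a minimal Python sketch. The record layout, field names and sample values below are invented for illustration; the slides do not describe the provider's actual file format or how the aggregation is carried out.

```python
from collections import defaultdict

# Hypothetical transaction-level records; field names and values are
# illustrative assumptions, not the data provider's real layout.
transactions = [
    {"merchant_group": "Grocery", "place": "domestic", "transactor": "personal", "value": 50.0},
    {"merchant_group": "Grocery", "place": "domestic", "transactor": "personal", "value": 25.0},
    {"merchant_group": "Travel", "place": "international", "transactor": "personal", "value": 300.0},
    {"merchant_group": "Grocery", "place": "domestic", "transactor": "commercial", "value": 120.0},
]


def aggregate(records):
    """Return count and total value per (merchant group, place, transactor)."""
    totals = defaultdict(lambda: {"count": 0, "value": 0.0})
    for record in records:
        key = (record["merchant_group"], record["place"], record["transactor"])
        totals[key]["count"] += 1
        totals[key]["value"] += record["value"]
    return dict(totals)


if __name__ == "__main__":
    for key, cell in sorted(aggregate(transactions).items()):
        print(key, cell)
```

Only the aggregated cells, not the individual transactions, would need to be delivered under this kind of arrangement.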
Examples of big data research at Statistics Canada: Scanner data: vendor specific
• Collection/access agreement: MOU in negotiation
• Cost: Current costs are nil, though the long-term approach being proposed would involve a quid pro quo agreement in which CPD would provide the company its data back with value added (i.e., an implicit cost would be borne by the division).
• Dimensions: Sales, quantities and item descriptions of all goods sold for a given store over a given period
• Uses: Consumer prices and household expenditure weights to feed the CPI
• Timeliness: TBD, though potentially as little as a one-day lag (e.g., weekly data for a given week could be delivered on the first day of the following week).
• Frequency: Initial data have been provided on a weekly aggregated basis. Future work will look at daily and/or transaction-level data.
• Dataset size: For one week of sales data (aggregated over the week) for one store,
  • roughly 4,000 KB
  • roughly 30,000 rows (i.e., unique items sold)
  • implies roughly 200 MB for one year of weekly aggregated data for one store.
• Potential uses moving forward: Direct input into the calculation of the CPI (potential replacement for collected prices), studies on consumer behaviour, CPI weights, household final consumption expenditures, retail sales.
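The dataset-size estimate above is simple arithmetic: roughly 4,000 KB per store per week over 52 weeks. A short Python sketch makes the calculation explicit. It assumes decimal units (1 MB = 1,000 KB) and 52 weeks per year; the function names are illustrative and are not taken from the slides.

```python
# Figures from the slide: one store, one week of weekly aggregated data.
KB_PER_STORE_WEEK = 4_000
ROWS_PER_STORE_WEEK = 30_000


def annual_size_mb(kb_per_week, stores=1, weeks=52, kb_per_mb=1_000):
    """Estimated yearly volume in MB, assuming constant weekly file size."""
    return kb_per_week * weeks * stores / kb_per_mb


def annual_rows(rows_per_week, stores=1, weeks=52):
    """Estimated yearly row count, assuming constant weekly row count."""
    return rows_per_week * weeks * stores


if __name__ == "__main__":
    print(annual_size_mb(KB_PER_STORE_WEEK))  # 208.0, i.e. roughly 200 MB
    print(annual_rows(ROWS_PER_STORE_WEEK))   # 1560000
```

The `stores` parameter shows how the estimate would scale across outlets; any store count other than one is a hypothetical input, since the slides only give figures for a single store.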
Examples of big data research at Statistics Canada: Smart meter: household electricity consumption
• Collection/access agreement: Two memoranda of understanding with two regional electricity distributors
• Cost: Nil
• Dimensions: Roughly 200 GB of raw hourly electricity consumption data have been obtained, providing detailed information on approximately 120,000 customers between 2008 and 2013
• Uses: Household electricity consumption
• Timeliness: Earliest is 15 days following the reference period.
• Frequency: Hourly
• Potential uses: Household final consumption expenditure, the utilities component of monthly Gross Domestic Product.
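Using hourly meter readings for a monthly indicator means rolling them up to monthly totals. The sketch below shows one way this could be done in plain Python. The tuple layout (customer id, ISO timestamp, kWh) and the sample readings are assumptions made for illustration; the slides do not describe the distributors' file format or Statistics Canada's processing.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly readings: (customer id, ISO timestamp, kWh consumed).
readings = [
    ("C001", "2013-01-01T00:00", 1.5),
    ("C001", "2013-01-01T01:00", 0.5),
    ("C002", "2013-01-15T18:00", 2.0),
    ("C001", "2013-02-01T00:00", 1.0),
]


def monthly_totals(hourly_readings):
    """Sum kWh over all customers for each calendar month (YYYY-MM)."""
    totals = defaultdict(float)
    for _customer_id, timestamp, kwh in hourly_readings:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        totals[month] += kwh
    return dict(totals)


if __name__ == "__main__":
    for month, kwh in sorted(monthly_totals(readings).items()):
        print(month, kwh)
```

At the volumes quoted on the slide (roughly 200 GB), the same grouping would more plausibly run in a database or a distributed tool than in an in-memory loop; the sketch only shows the logic.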
Examples of big data research at Statistics Canada: Smart meter: household electricity consumption
[Figure: Total residential consumption]
Examples of big data research at Statistics Canada: Satellite imaging: Land account
• Collection/access agreement: Public data
• Cost: Nil
• Dimensions: 20 GB. Although not apparent here, the “dimension” of this type of big data (which is not really big data, strictly speaking) may well explode in the coming years. LIDAR datasets (high-resolution laser ranging), as well as satellite data of higher resolution in space and time, will require terabytes of storage and “terahertz” of processing capacity.
• Uses: Land accounts: Land cover / land use change 2000 and 2010 - 2013
• Timeliness: 3-year lag
• Frequency: Annual
• Potential uses moving forward: Landscape and freshwater ecosystem accounts
Examples of big data research at Statistics Canada: Remote sensing: land use
[Figure: Remote sensing: land use]
Examples of big data research at Statistics Canada: Water measurement instruments: Water account
• Collection/access agreement: Informal agreement with Water Survey of Canada
• Cost: Nil
• Dimensions: Original WSC data is 5 GB; derived water yield data is 90 GB
• Uses: Water accounts: Water yield
• Timeliness: From real-time to a lag of several years
• Frequency: Daily
• Potential uses moving forward: Freshwater ecosystem accounts
Some lessons learned so far
• Quid pro quo: important when trying to obtain ‘big data’. Firms are more willing to part with their ‘big data’ if you show them how they will receive a ‘business intelligence’ benefit on their side.
• Cost: ‘big data’ is not always the cheapest option. It is sometimes easier to have the firm complete the survey than to create an infrastructure to receive and process its data. For example, the data received from local electricity providers are equivalent to the completion of two questions on our current survey.
• Classification systems: ‘big data’ does not follow any standard classification system. For example, electronic retail transactions are classified according to merchant groups rather than industries.
• Big data aggregates: asking firms to aggregate their ‘big data’ is an option.
• Data formats: we need to work with new data formats that are often unfamiliar to us.
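The classification lesson above, that provider data arrive by merchant group rather than by industry, implies some form of concordance before the data can feed industry-based accounts. The following Python sketch illustrates the idea. The merchant groups, the mapping and the values are entirely hypothetical; the slides do not say how, or whether, Statistics Canada builds such a mapping.

```python
# Hypothetical concordance from provider merchant groups to industry labels.
CONCORDANCE = {
    "Grocery stores": "Retail trade",
    "Restaurants": "Accommodation and food services",
    "Hotels": "Accommodation and food services",
}


def reclassify(values_by_merchant_group, concordance):
    """Re-sum values by industry and list merchant groups with no mapping."""
    by_industry = {}
    unmapped = []
    for group, value in values_by_merchant_group.items():
        industry = concordance.get(group)
        if industry is None:
            # Keep unmapped groups visible instead of silently dropping them.
            unmapped.append(group)
            continue
        by_industry[industry] = by_industry.get(industry, 0.0) + value
    return by_industry, unmapped


if __name__ == "__main__":
    sample = {
        "Grocery stores": 100.0,
        "Restaurants": 40.0,
        "Hotels": 60.0,
        "Online marketplaces": 10.0,
    }
    print(reclassify(sample, CONCORDANCE))
```

Returning the unmapped groups separately is a deliberate choice in this sketch: a merchant group that fits no industry is itself a finding about coverage.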
Discussion points for the AEG
• In order to exploit the potential of big data, NSOs need to make significant investments. How can we leverage the work taking place across various NSOs to minimize the investment and maximize the return?
• How do we promote the development of new data products using big data over using big data to reconstruct existing data products? Do we adjust our frameworks to accommodate big data, or do we adjust big data to accommodate our frameworks?