This project showcases the management and mining of smart meter data at scale, with a focus on extracting valuable knowledge and implementing user-facing applications. Learn about the challenges and solutions for handling large volumes of smart meter data using Hadoop.
Managing and mining smart meter data – at scale CSE Project Showcase 9 July 2013 Twitter: @cse_bristol #SmartMeterData
Introduction • Contents • Introduction to the project, the data, and its applications • Managing SM data at scale • Getting valuable knowledge out of SM data • Demo: Smart Meter Analytics, Scaled by Hadoop (SMASH) • Where next? • Discussion
Introduction • Project Background • “Generating Value from Smart Electricity Meter Data” • 18 Month TSB-supported collaboration • CSE, University of Bristol, SSE and Western Power Distribution • Three themes: • Managing the data at scale • Extracting useful knowledge • Integrating the above in a user-facing application
Introduction The data A half-hourly timeseries for each smart meter / register Content: date, time, consumption in the half hour. For a single register: 17,520 records per year (48 half-hours × 365 days). This is what 18 months looks like:
Introduction • The data • EDRP: • 18 months • 16,250 smart metered households • 16,250 smart electricity meters • 9,364 smart gas meters • 670m half-hourly records (E: 420m, G: 250m) • 40GB of raw csv file data • Post rollout, per year, domestic only: • 25m smart metered households • 25m smart electricity meters • 20m smart gas meters • 800 billion half-hourly records (E: 450Bn, G: 350Bn) • 50TB of raw csv file data • EDRP ~ 0.1% of a year’s domestic data
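These volumes are consistent with a quick back-of-the-envelope check. A minimal sketch, assuming roughly 64 bytes per raw CSV row (an illustrative figure, not one measured in the project):

```python
# Rough sanity check of the post-rollout data volumes quoted above.
HALF_HOURS_PER_YEAR = 48 * 365              # 17,520 readings per register per year

elec_records = 25e6 * HALF_HOURS_PER_YEAR   # ~438 bn (the deck rounds to 450 bn)
gas_records = 20e6 * HALF_HOURS_PER_YEAR    # ~350 bn
total_records = elec_records + gas_records  # ~790 bn, i.e. roughly 800 bn

BYTES_PER_ROW = 64                          # assumed average size of a raw CSV row
raw_tb = total_records * BYTES_PER_ROW / 1e12
print(f"{total_records/1e9:.0f} bn records, ~{raw_tb:.0f} TB of raw CSV")
# -> roughly 790 bn records and ~50 TB, matching the figures above
```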
Introduction • What might we use it for? • Improve existing processes • Settlement • Billing, reconciliation, audit • Demand profiling • Customer profiling & segmentation • New processes not possible without HH data at scale • Localised prediction • Distribution network planning and modelling • Automated DSM – prediction and verification • System state detection • Individualised consumer energy services
Introduction • What are the essential processes? • Ingestion – getting the data into the system • Storage – keeping it there securely • Analysis and reporting • Ad-hoc queries • Transaction reports • Descriptives and summaries (e.g. OLAP) • Mining and modelling • Visualisation
Data management & processing More fundamentally Moving data between storage, memory and CPU Transforming it in the CPU into desired forms There are physical constraints on the speed of this. (These are relevant at the scale of smart meter datasets).
Data management & processing Single machine RDBMS Using SQL Server to sum half-hourly consumption: 4 bn records: ~ 1 hour 40 bn records: ~ 10 hours 1 year's worth: ~ 200 hours
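A rough illustration of why timings of this order are hard to avoid on one machine: even just streaming the data off disk takes this long. A minimal sketch, where the 50 TB dataset size and the ~100 MB/s per-machine disk throughput are assumptions taken from the hardware comparison later in this deck, not project measurements:

```python
# Back-of-the-envelope estimate: how long simply reading the dataset takes,
# assuming the job is limited by sequential disk throughput.

def scan_time_hours(dataset_gb, disk_mb_per_s, nodes=1):
    """Time to stream the whole dataset off disk, spread evenly over `nodes`."""
    seconds = (dataset_gb * 1024) / (disk_mb_per_s * nodes)
    return seconds / 3600

# Assumed figures (illustrative only): 50 TB of raw CSV for a full year,
# ~100 MB/s sustained read per machine.
full_year_gb = 50 * 1024

for nodes in (1, 10, 100):
    print(f"{nodes:>3} node(s): ~{scan_time_hours(full_year_gb, 100, nodes):,.0f} hours")
# 1 node -> ~146 hours (about a week), 10 nodes -> ~15 hours, 100 nodes -> ~1.5 hours
```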
Data management & processing Single machine RDBMS • Problem: the throughput of a single machine has not kept up with the growth in the size of datasets. • Solution: harness multiple individual machines ('horizontal scaling'). • Problem: this is difficult and expensive using traditional relational database applications.
Data management & processing Solution Move away from traditional databases and use a purpose-designed ('big data') framework to get horizontal scaling: • 1 machine: ~£10k, 2.5GHz, 1 GB/s, 100MB/s, ~ a week • 10 node cluster: ~£50k, 25GHz, 10 GB/s, 1 GB/s, ~ a day • 100 node cluster: ~£300k, 250GHz, 100 GB/s, 10 GB/s, ~ an hour
Data management & processing Hadoop Designed to solve the problem of exponentially growing data volumes (originally, Google's searchable copy of the web). Harnesses a large number of commodity machines with low-cost networking and storage. The software takes a job (a query, calculation, or other computation) and 'maps' it out across the cluster. In parallel, each node locally processes a subset of the problem, before the results are 'reduced' back to a single dataset (hence 'Map/Reduce').
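To make the 'map' and 'reduce' steps concrete, here is a minimal Hadoop Streaming sketch that sums consumption per meter; the CSV column layout (meter id, date, time, kWh) is an assumed schema for illustration, not the project's actual data format:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: total consumption per meter.
# Run the same script as mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar -files sum_by_meter.py \
#     -input /data/readings -output /data/totals \
#     -mapper "python sum_by_meter.py mapper" \
#     -reducer "python sum_by_meter.py reducer"
# Input rows are assumed to be CSV: meter_id,date,time,kwh (illustrative schema).

import sys

def mapper():
    # Emit "meter_id <tab> kwh" for every half-hourly reading.
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 4:
            continue  # skip malformed rows
        meter_id, kwh = fields[0], fields[3]
        print(f"{meter_id}\t{kwh}")

def reducer():
    # Hadoop sorts mapper output by key, so readings for one meter arrive together.
    current_meter, total = None, 0.0
    for line in sys.stdin:
        meter_id, kwh = line.rstrip("\n").split("\t")
        if meter_id != current_meter:
            if current_meter is not None:
                print(f"{current_meter}\t{total}")
            current_meter, total = meter_id, 0.0
        total += float(kwh)
    if current_meter is not None:
        print(f"{current_meter}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "mapper" else reducer()
```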
Data management & processing Experiments: SQL Server Single high-performance machine: bottlenecked by the speed of the hard drive [chart: results for datasets up to ~400GB]
Data management & processing Experiments: Hadoop 11 node physical cluster (~£50k hardware cost) [chart: results for datasets up to ~2,500GB]
Data management & processing Experiments compared Not straightforward to get SQL Server to run over ~ 10Bn records. [chart: SQL Server vs Hadoop, datasets up to ~2,500GB]
Data management & processing Experiments: growing the cluster Fixed dataset size of 500m records
Data management & processing • Hadoop • Pros • Open source software – free and customisable • Adjustable data redundancy (data is replicated over the cluster) • Incrementally scalable – on both performance and cost measures: just add machines, system adapts automatically. • Responsive and cooperative developer community • Cons • Not the last word in user-friendliness (but this is changing) • Sledgehammer to crack a nut below a certain scale • Less mature (but rapidly developing) software ecosystem • Algorithms must fit the framework • Conclusion: low cost option for smart meter data processing
Data mining and visualisation • Finding value in the data • Improve existing processes • Settlement • Billing, reconciliation, audit • Demand profiling • Customer profiling & segmentation • New processes not possible without HH data at scale • Localised prediction • Distribution network planning and modelling • Automated DSM – prediction and verification • System state detection • Individualised consumer energy services
Data mining and visualisation Finding value in the data Collaborative approach with industry partners to identify business needs Focus on: (1) Datamining for subgroup discovery – classifying end users (2) Cluster analysis on demand data – finding profiles (3) Innovative visualisation of consumption data and datamining results
Data mining and visualisation • Subgroup discovery • “Pattern features”: 14 variables describing each household • Income, geography, access to gas, size of house, value of house etc. • “Target features”: describe the behaviour of interest • Profile error: how different is usage from the assigned profile? • Outputs: • groups of households with significantly different profile errors
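A minimal sketch of the idea, assuming a table with one row per household containing the pattern features and a precomputed profile-error column; the column names, the hypothetical input file, and the simple size-times-deviation score below are illustrative assumptions, not the project's actual algorithm:

```python
import pandas as pd

# Assumed input: one row per household with pattern features and a
# precomputed target feature (percentage annual profile error).
households = pd.read_csv("households.csv")  # hypothetical file

overall_mean = households["profile_error_pct"].mean()

# For each candidate pattern feature, compare each subgroup's mean profile
# error with the overall mean - a crude stand-in for a subgroup discovery
# quality measure (subgroup size x deviation from the population mean).
pattern_features = ["income_band", "region", "has_gas", "house_size_band"]

candidates = []
for feature in pattern_features:
    stats = households.groupby(feature)["profile_error_pct"].agg(["mean", "size"])
    for value, row in stats.iterrows():
        score = row["size"] * abs(row["mean"] - overall_mean)
        candidates.append((feature, value, row["mean"], int(row["size"]), score))

report = pd.DataFrame(
    candidates, columns=["feature", "value", "mean_error", "n", "score"]
).sort_values("score", ascending=False)

print(report.head(10))  # subgroups whose profile error deviates most
```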
Data mining and visualisation Subgroup discovery Looking at % annual profile error against sociodemographics
Data mining and visualisation Clustering Can we use demand data to create better profiles? Define target features: the properties of the demand waveform that are of interest. Two examples follow, one using imposed and one using emergent properties, each with 3 clusters.
Data mining and visualisation Clustering E.g. 1: the average weekday represented as 5 pairs of numbers [chart: consumption (not to scale) against time of day (half hours from midnight)]
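A minimal sketch of how this first (imposed) representation might be built, assuming one meter's readings arrive as a days × 48 array of half-hourly values; the five equal time-of-day segments are an illustrative choice, not the project's exact boundaries:

```python
import numpy as np

def average_weekday_pairs(readings, weekday_mask, n_segments=5):
    """Reduce a (days, 48) array of half-hourly kWh to n_segments
    (time-of-day, mean consumption) pairs for the average weekday."""
    avg_day = readings[weekday_mask].mean(axis=0)          # 48 half-hourly means
    segments = np.array_split(np.arange(48), n_segments)   # 5 time-of-day bands
    return [(int(seg[0]), float(avg_day[seg].mean())) for seg in segments]

# Placeholder data: 365 days of readings, Mon-Fri crudely flagged as weekdays.
rng = np.random.default_rng(0)
readings = rng.random((365, 48))
weekday_mask = (np.arange(365) % 7) < 5
print(average_weekday_pairs(readings, weekday_mask))
```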
Data mining and visualisation Clustering E.g. 2: Frequency spectrum of the demand timeseries
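A minimal sketch of the second (emergent) representation and the clustering step, assuming a matrix of half-hourly demand series with one row per meter; scikit-learn's KMeans and the FFT-magnitude features are illustrative choices, not necessarily the exact method used in the project:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed input: demand[i, :] is one meter's half-hourly series
# (48 readings/day over a year -> 17,520 columns). Placeholder data here.
rng = np.random.default_rng(0)
demand = rng.random((200, 17520))

# E.g. 2 features: low-frequency magnitudes of the demand spectrum,
# which capture daily/weekly periodicity (an emergent property).
spectrum = np.abs(np.fft.rfft(demand, axis=1))
features = spectrum[:, 1:51]                      # drop the DC term, keep 50 bins
features /= features.sum(axis=1, keepdims=True)   # normalise away total usage

# Three clusters, as in the examples above; the imposed 5-pair features
# from E.g. 1 could be fed through the same step instead.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

for k in range(3):
    print(f"cluster {k}: {np.sum(labels == k)} meters")
```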
Data mining and visualisation Cluster analysis Project competition results (the University won)
Data mining and visualisation Conclusions from datamining Subgroup discovery results suggest the approach is useful as long as you have metadata on the households Cluster analysis work suggests it is possible to improve on the standard profile classes using SM data Further work needs to be carried out on more representative datasets There are many other potential applications!
The SMASH application Web application • Installation of Hadoop on UoB and CSE clusters: an 11-node physical cluster at the University (£50k) and an 8-node virtual cluster at CSE (£15k) • Integration of a range of Hadoop-friendly data management components • Development of a proof-of-concept web application for user interaction, job management, visualisation etc. • Deployment on both clusters
The SMASH application Web application Currently running on the CSE virtual Hadoop cluster
Generating Value from SM Data • Where next? • We have a proof-of-concept system developed with TSB R&D funding support. • We have mastered the underlying technologies and established that this approach has the potential to be a low-cost solution to a number of industry data challenges. • On a technical level the next steps are to • Further develop the web application • Refine the datamining algorithms (with more data) • Implement selected DM algorithms directly on the cluster • On a policy/programme level we want to ensure this knowledge is incorporated into SM rollout infrastructure decision making.
Questions and discussion @cse_bristol #SmartMeterData
Contacts: Simon Roberts simon.roberts@cse.org.uk Joshua Thumim joshua.thumim@cse.org.uk Web: www.cse.org.uk Sign up to our monthly e-news through our website Follow us on Twitter @cse_bristol