Big data now playing ..... at the sandbox. John.Dunne@cso.ie 17th October 2014 IAOS, Vietnam. Overview. Context How CSO got interested in big data The sandbox Learning from other industries Learning from the past The sandbox – looking to the future
Big data now playing ..... at the sandbox John.Dunne@cso.ie 17th October 2014 IAOS, Vietnam
Overview • Context • How CSO got interested in big data • The sandbox • Learning from other industries • Learning from the past • The sandbox – looking to the future • Concluding comments Keywords – big data, modernisation, sandbox
Big data – working definition Data that is difficult to collect, store or process within the conventional systems of statistical organizations. Their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.
Do more with less Mindset - Opportunities exist with secondary data sources
Legal environment • Data Protection • Freedom of Information • Official Statistics Key: 3 legislative pillars
Modernisation and big data • 2011: Conference of European Statisticians endorses modernisation strategy • 2012: Big data on modernisation agenda • 2013: ESSC Scheveningen memorandum on big data and official statistics • 2013: International big data team gets going • 2014: Big data on UNSC agenda • 2014: The sandbox goes live at MSIS Dublin
2013 CSO Project – To determine household composition using smart metering data Origin of data: Consumer Behaviour Trials in 2009 and 2010 • Over 5000 households in pilot • 3 months of baseline data (one reading every 30 minutes) • Pre-trial survey using CATI http://www.unece.org/stats/documents/2013.09.coll.html
Project with pilot data brought challenges • Pilot: 7 million data points per month • ICHEC helped out • Go live: 2160 million data points per month • "Joe, we need a bigger computer" https://www.ichec.ie/
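The two volumes on this slide follow from the half-hourly reading frequency. The sketch below reproduces the arithmetic under two assumptions that are not stated on the slide: a 30-day month, and a single reading stream per household meter. The function names are illustrative only. Note that the go-live figure of 2160 million points per month implies about 1.5 million meters; that number is derived here, not quoted from the source.

```python
# Back-of-envelope check of the smart meter data volumes.
# Assumptions (not from the slides): a 30-day month, one meter per household.

READINGS_PER_DAY = 24 * 2   # one reading every 30 minutes
DAYS_PER_MONTH = 30         # assumed month length


def monthly_data_points(meters: int) -> int:
    """Data points produced per month by the given number of meters."""
    return meters * READINGS_PER_DAY * DAYS_PER_MONTH


def meters_for_volume(points_per_month: int) -> float:
    """Number of meters implied by a monthly data point volume."""
    return points_per_month / (READINGS_PER_DAY * DAYS_PER_MONTH)


pilot_points = monthly_data_points(5000)             # pilot: just over 5000 households
implied_meters = meters_for_volume(2_160_000_000)    # go-live volume from the slide

print(f"Pilot: {pilot_points / 1e6:.1f} million points per month")
print(f"Go live implies about {implied_meters / 1e6:.1f} million meters")
```

The pilot result of roughly 7 million points matches the slide, and the go-live volume is about 300 times larger, which is the scale jump that motivated the move to ICHEC hardware.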
The sandbox The hardware on which the sandbox system is based is a High Performance Computing cluster called Stoney. The cluster has been hosted at the National University of Ireland, Galway since April 2009 and is composed of 60 compute nodes, each of which has two 2.8GHz Intel (Nehalem EP) Xeon X5560 quad-core processors, 48GB of RAM and a 1TB local disk. Each node is connected to two networks: an InfiniBand network for accessing the shared Lustre filesystem and for high performance communications, and a Gigabit Ethernet network for management tasks. In addition, a 20TB shared filesystem is available to all nodes. ICHEC will dedicate 20 compute nodes to enable a Hadoop cluster with 160 cores, almost 1TB of RAM and 20TB of HDFS distributed storage.
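The Hadoop totals quoted above can be checked against the per-node specification. This is a minimal sketch: the per-node figures come from the slide, while the function name and the `replication` parameter are illustrative assumptions. The slide does not say whether the 20TB of HDFS storage is raw or usable; Hadoop's default block replication factor is 3, so usable capacity may be lower than the raw total.

```python
# Sanity check of the sandbox Hadoop cluster totals from the per-node spec.
# Per-node figures are taken from the slide; everything else is illustrative.

SOCKETS_PER_NODE = 2     # two Xeon X5560 processors per node
CORES_PER_SOCKET = 4     # quad-core
RAM_GB_PER_NODE = 48
DISK_TB_PER_NODE = 1


def cluster_totals(nodes: int, replication: int = 1) -> dict:
    """Aggregate cores, RAM and disk for `nodes` compute nodes.

    `replication` is a hypothetical knob: 1 gives raw disk capacity,
    3 would reflect Hadoop's default HDFS replication factor.
    """
    return {
        "cores": nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET,
        "ram_gb": nodes * RAM_GB_PER_NODE,
        "disk_tb": nodes * DISK_TB_PER_NODE / replication,
    }


hadoop = cluster_totals(20)
print(f"Cores: {hadoop['cores']}")
print(f"RAM: {hadoop['ram_gb']} GB")
print(f"Raw disk: {hadoop['disk_tb']:.0f} TB")
```

With 20 nodes this gives 160 cores, 960GB of RAM (the "almost 1TB" on the slide) and 20TB of raw local disk, so the quoted figures are consistent with the hardware description.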
The sandbox provides an environment to • test feasibility of remote access and processing • test whether existing standards/models/methods can be applied to big data • evaluate the usefulness of big data software tools • learn by doing with respect to potential uses, advantages and disadvantages of big data • facilitate further collaboration in the international community
The toys (data sources) • Twitter data • mobile phone data • satellite imagery / aerial photography • price data / job vacancy data via scraping • scanner data / price data sourced via large vendors • data from road traffic sensors • smart meter data on electricity/gas consumption
Some of the players To play, contact Steven.Vale@unece.org
Learning from other industries – technical partners can have a role to play Exchange of data for billing purposes: Irish Mobile Network Operators (MNOs) • Data Clearing Houses • ROW (rest of world) Mobile Network Operators
Learning from the past – think about the bigger picture Nordbotten, Thygesen and the statistical archive concept
Learning from the past – do not underestimate privacy concerns • http://www.census.gov/history/pdf/kraus-natdatacenter.pdf • "The National Data Center and Personal Privacy" by Arthur R. Miller http://blog.modernmechanix.com/the-national-data-center-and-personal-privacy/
The sandbox – looking to the future • Centres for Research and Development? • Centres of Excellence? • Partner organisations for collecting, processing or storing data of a less sensitive or non-sensitive nature??? • Significant partner organisations enabling the collection, processing or storage of data of a sensitive nature?????
Concluding remarks • Think about the bigger picture / broader system • Keep an open mind to the possibility of new partners • Be open and transparent • Don’t underestimate privacy concerns • Continue to collaborate and share