Build the UK’s COINS in the Data Science Library Cloud Brand Niemann US EPA June 9, 2010 http://semanticommunity.net Disclaimer: These slides do not reflect the views of the U.S. Environmental Protection Agency and do not constitute endorsement by the EPA of the standards or products mentioned.
Overview • The Challenge • The Data.gov.uk Program • The Expert and His Advice • The Cloud Tools • The Inspiration • The Data Sources • Other Sources of Data • The Process • The Results • Comments • Acknowledgements • References
The Challenge • Tim Berners-Lee "Bag of Chips" talk: • http://www.youtube.com/watch?v=ga1aSJXCFe0 • To get five stars: 1-Expose your data, 2-Provide in machine readable format (Excel), 3-Provide as CSV, 4-Provide at permanent URL, and 5-Provide metadata. • Nigel_Shadbolt: Lots of eyeballs pouring over COINS: • http://bit.ly/b8XQGB - opendata in the wild - more functionality all the time. • http://twitter.com/Nigel_Shadbolt/status/15419573652 • bniemannsr: @jahendler Hope data.gov evolves from quantity (500,000 datasets) to quality (data science applications): • http://twitter.com/bniemannsr/status/15334914269 • Note: Now data.gov says only 272,677. • jahendler: @bniemannsr sure, but check out the Sem Web and Apps sections - lots of stuff there that prototypes what we could do #websci: • http://twitter.com/jahendler/status/15335026437 • bniemannsr: @jahendler Did, but neat prototypes don't improve data quality-data science does: • http://radar.oreilly.com/2010/06/wha...a-science.html. • http://twitter.com/bniemannsr/status/15549816659 • eGovernment Interest Group Teleconference, 04 Jun 2010: • http://www.w3.org/2010/06/04-egov-minutes.html Excerpts: Cory Casanave: Can't see the Web of Data: • Cory to write up requirements/wishlist for generic Web of Data browser. See Supporting the Linked Data Consumer. http://gaininitiative.wik.is/United_Kingdom#The_Challenge
The Data.gov.uk Program • Advised by Sir Tim Berners-Lee and Professor Nigel Shadbolt and others, government is opening up data for reuse. This site seeks to give a way into the wealth of government data and is under constant development. We want to work with you to make it better. • We’re very aware that there are more people like you outside of government who have the skills and abilities to make wonderful things out of public data. These are our first steps in building a collaborative relationship with you. http://gaininitiative.wik.is/United_Kingdom#The_Data.gov_Program
The Data.gov.uk Program http://data.gov.uk/
The Data.gov.uk Program http://data.gov.uk/blog/finance-data-coins-goes-live
The Expert and His Advice • Edward Tufte Presidential appointment announced by White House, March 5, 2010. • Tufte Comment on iPhone interface design: Better to have users looking over material adjacent in space within our eyespan rather than stacked in time. This is especially the case for statistical data, where the fundamental analytical task is to make comparisons. Also see page 159 in the above book reference. http://gaininitiative.wik.is/United_Kingdom#The_Expert_and_His_Advice
The Cloud Tools http://cloud.mindtouch.com/
The Cloud Tools http://gaininitiative.wik.is/United_Kingdom
The Cloud Tools http://spotfire.tibco.com/
The Cloud Tools http://ondemand.spotfire.com/public/Help/index.htm
The Inspiration H1N1 Spread Courtesy of TIBCO Spotfire. See Web Player.
The Inspiration http://www.wheredoesmymoneygo.org/dashboard/
The Inspiration • What is data science? Analysis: The future belongs to the companies and people that turn data into products. Mike Loukides. • http://radar.oreilly.com/2010/06/wha...a-science.html. • My Response: Please see my Data Science Library in the Cloud: http://ondemand.spotfire.com/public/...VL-4372/public and my suggestion that the 2010 Health 2.0 Developer Challenge should build a community health data science library. See http://federaldata.wik.is/ June 3rd: http://twitter.com/bniemannsr/status/15482514867 and http://www.hhs.gov/open/discussion/chdi.html. http://gaininitiative.wik.is/United_Kingdom#The_Inspiration
The Data Sources Scroll down to Full Description (see next slide) http://data.gov.uk/dataset/coins
The Data Sources http://hm-treasury.gov.uk/coins
The Data Sources • Tried Zipped 2009/10 Adjustment table, 31MiB (405MiB uncompressed): Got a 405 MB text file that, when imported into Spotfire, gave three columns with no headers and 317,346 rows (with the last row saying "(316,119 row(s) affected)")! • See next slide. • Read Comments: Saw where others had had trouble using these datasets. • Is this CSV? • I unzipped the (non-torrent) version of the 09/10 adjustment table and it wasn't CSV but rather @-sign delimited (think tab-delim with an @ instead of a tab). Also the data wasn't clean for import to something like Excel as it had some lines of non-table data at the end - just the sort of thing to upset already hard-pushed spreadsheet importers on non-high end rigs. • Posted on: Fri, 04/06/2010 - 14:18 - Anonymous http://gaininitiative.wik.is/United_Kingdom#The_Data_Sources
The Data Sources COINS: Adjustment_table_extract_2009_10 in Spotfire-PC
The Data Sources • Should have first read: The structure of the data is similar to that in a .csv file with a string of characters being formed to represent each row, using the following delimiters: • Line: carriage return (so lines are presented separately); and • Fields: @ . • http://gaininitiative.wik.is/United_Kingdom/Understanding_the_COINS_data#The_Data_Files_and_Downloading • And read: COINS contains millions of rows of data; as a consequence the files are large and the data held within the files complex. Using these download files will require some degree of technical competence and expertise in handling and manipulating large volumes of data. As such it is likely that this data will be most easily used by organisations that have such expertise, rather than individuals. More directly useful and accessible datasets that draw on the contents of the COINS database will be made available by August 2010. • http://gaininitiative.wik.is/United_Kingdom/Understanding_the_COINS_data#Who_might_find_the_data_useful
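A minimal sketch of reading such a file in Python, assuming the layout described above: one record per line, fields separated by @, and trailing non-table lines such as "(316,119 row(s) affected)". The function name, the footer pattern, and the sample rows are illustrative assumptions, not part of the COINS documentation.

```python
import csv
import io
import re

# Assumed footer pattern, e.g. "(316,119 row(s) affected)"; adjust if the real file differs.
FOOTER = re.compile(r"^\(\s*[\d,]+\s+row\(s\)\s+affected\)\s*$")


def read_at_delimited(lines):
    """Yield one list of fields per data row from an @-delimited text stream.

    Blank lines and any "(N row(s) affected)" line are skipped.
    """
    def data_lines():
        for line in lines:
            stripped = line.strip()
            if not stripped or FOOTER.match(stripped):
                continue
            yield line

    # csv.reader does the splitting; QUOTE_NONE because the quoting rules are unknown here.
    yield from csv.reader(data_lines(), delimiter="@", quoting=csv.QUOTE_NONE)


# Hypothetical sample rows, not real COINS data.
sample = io.StringIO(
    "A001@Dept X@1200.50\n"
    "A002@Dept Y@-300.00\n"
    "\n"
    "(2 row(s) affected)\n"
)
rows = list(read_at_delimited(sample))
print(rows)
```

For a real extract, an open file object (opened with newline="") could be passed in place of the StringIO sample. Because the generator streams line by line, the 405 MiB file would not need to fit in memory. The file's encoding is not stated in the source and would have to be checked.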
Other Sources of Data http://coins.guardian.co.uk/coins-explorer/search For Output all as CSV could get only 5,000 of 72,644 rows. Sent question: Why? http://gaininitiative.wik.is/United_Kingdom#Other_Sources_of_Data
Other Sources of Data Huge Expenditure for Financial Stability for Northern Rock Refinancing! COINS: Data Explorer in Spotfire-PC
Other Sources of Data Each has a link to a detailed table (see next slide). Could only get 100 rows per page. Sent question: How to get all 3,897,330 rows? http://coins.wheredoesmymoneygo.org/?items_per_page=100&page=1
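The row count and page size above imply a large number of page requests. A small sketch of that arithmetic, reusing the items_per_page and page query parameters visible in the URL. The helper name is made up for illustration, and it is an assumption that pages are numbered from 1 and that the site would serve every page.

```python
import math
from urllib.parse import urlencode

BASE_URL = "http://coins.wheredoesmymoneygo.org/"
TOTAL_ROWS = 3_897_330   # row count quoted on the slide
ROWS_PER_PAGE = 100      # largest page size obtained on the slide


def page_urls(total_rows=TOTAL_ROWS, per_page=ROWS_PER_PAGE, base=BASE_URL):
    """Yield one URL per results page (assumes pages are numbered from 1)."""
    pages = math.ceil(total_rows / per_page)
    for page in range(1, pages + 1):
        yield base + "?" + urlencode({"items_per_page": per_page, "page": page})


n_pages = math.ceil(TOTAL_ROWS / ROWS_PER_PAGE)
print(n_pages)            # 38974 pages at 100 rows each
print(next(page_urls()))  # URL of the first page
```

At roughly 39,000 requests, paging through the site is a poor substitute for a bulk download, which is why the question to the site owners matters.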
Other Sources of Data http://coins.wheredoesmymoneygo.org/coins/fact_table_extract_2009_10.1361871
Other Sources of Data COINS: Where Does the Money Go? in Spotfire-PC
The Process • The Basic Steps: • Inventory Data Sources and Plan Application • Prepare and Import Data and Metadata • Implement Layout and Analytics • Add Bookmarks and Create Data Stories • Publish and Test in Web Player • Get Feedback and Improve • First create visualizations, faceted search (filters), and analytics for each individual data source and then look for relationships between the data sources. http://gaininitiative.wik.is/United_Kingdom#The_Process
The Results • Recall The Challenges in slide 3: • TBL – Get 5 stars. • NS – Get more eyeballs on COINS. • JH - Data.gov/semantic prototypes what we could do with Web Science. • BN - Evolve from quantity of datasets to quality data science applications. • CC - Can't see the Web of Data – Support the Linked Data Consumer. http://gaininitiative.wik.is/United_Kingdom#The_Results
The Results • Tried to accomplish all five challenges. • Waiting to hear back on requests for full data sets. • Want to emulate the Dashboard for Where Does My Money Go? • Want to work with other data sources in Data.gov.uk: • E.g. Climate Change. http://gaininitiative.wik.is/United_Kingdom#The_Results
Comments • The initial objective was to see how fast one could create this basic application. I am waiting to hear back on requests for full data sets. I want to emulate the Dashboard for Where Does My Money Go? I want to work with other data sources in Data.gov.uk: E.g. Climate Change. • Please use the Add Comment feature at the bottom of this wiki page to provide feedback and suggest additional analyses you would like to see. To use the Add Comment feature you first need to register by providing your email address. Your privacy will be respected and your email address will not be available to others or used for any other purpose. You can also download the Spotfire File from this Wiki and a 30-day free evaluation copy from http://spotfire.tibco.com/ and reuse these analyses, or add your own data to this file or to new Spotfire files that you create. Have fun and give us your feedback! http://gaininitiative.wik.is/United_Kingdom#Comments
Acknowledgements • The author acknowledges gratefully Dean Allemang, Cory Casanave, Sean Connors, Mills Davis, Li Ding, David Eng, Lee Feigenbaum, Aaron Fulkerson, Jim Hendler, Ralph Hodgson, Kevin Kirby, Kevin Jackson, Bob Marcus, John McMahon, Richard Murphy, Brand Niemann, Jr., Barry Nussbaum, Matthew Phoenix, Tony Shaw, Jeff Stein, George Strawn, George Thomas, Pete Tseronis, and Edward Tufte. http://gaininitiative.wik.is/United_Kingdom#Acknowledgements
References • Brand L. Niemann, Put Your Desktop in the Cloud to Support the Open Government Directive and Data.gov/semantic, April 19, 2010, Semantic Universe. • Brand L. Niemann, Build Your Own Data.gov (Spotfire) and EPA Microsite (Spotfire) with Semantics and Statistics in the Cloud, May 15, 2010. Slides. • Brand L. Niemann, Build Your Community Health Information "Design for America" Using Mindtouch and Spotfire Analytics, May 17, 2010. Slides. • Brand Niemann, Build Your Own Data.gov/semantic with Spotfire in the Cloud: The White House Visitor Database, May 22, 2010. Slides. See Data.gov takes the 'Mumsy' test, FCW, May 26, 2010. • Edward R. Tufte, Beautiful Evidence (2006), Graphics Press LLC. http://gaininitiative.wik.is/United_Kingdom#References