Semantic Data Science for the US Census Bureau. Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://datacommunitydc.org/blog/2013/08/cloud-soa-semantics-and-data-science-conference/
Semantic Data Science for the US Census Bureau Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://datacommunitydc.org/blog/2013/08/cloud-soa-semantics-and-data-science-conference/ https://silverspotfire.tibco.com/us/library#/users/bniemann/Public http://semanticommunity.info/Census_Semantic_Knowledge_Base November 14, 2013
Google Search Result: Census Bureau • Home Page • First source for current population data and the latest Economic Indicators • State and County QuickFacts • USA QuickFacts • American FactFinder • Your source for population, housing ... • 2010 Census • Redistricting Data - What is the Census? • Population Estimates • The Census Bureau's Population Estimates Program • Easy Stats • Easy Stats gives you quick and easy access • Data Access Tools • The Census Bureau data tools provide on-line access
Data Access Tools • Interactive Internet Data Tools: • Data Visualization Gallery - A weekly exploration of Census data used to promote visualization and make data accessible to a broader audience. • DataFerrett is a tool and data librarian that searches and retrieves data across federal, state, and local surveys, executes customized variable recoding, and creates complex tabulations and business graphics. Sources covered include the Current Population Survey, Survey of Income and Program Participation, American Community Survey, American Housing Survey, Small Area Income Poverty Estimates, Population Estimates, Economic Census Areawide Statistics, National Center for Health Statistics data, Centers for Disease Control data, and more. • DataFerrett’s newest tool, the Community Economic Development HotReport, provides community and business leaders speedy access to information on counties and the Employment & Training Administration’s Workforce Innovation in Regional Economic Development (WIRED) areas across the U.S.
Data Visualization Gallery http://www.census.gov/dataviz/
Census Data Visualization Gallery As Data For the Digital Government Strategy My Note: The entire platform can be searched. The entire knowledge base page can be searched. My Note: Structured and unstructured information is all turned into a knowledge base of data for relational and graph database processing. http://semanticommunity.info/Census_Data_Visualization
Census Data Visualization Gallery: Spotfire My Note: This is federation of diverse data sources to find, facet filter, visualize, and discover new facts. Spotfire Web Player
The Data Web: Data Ferrett http://dataferrett.census.gov/
Data Ferrett Description • DataFerrett is a data analysis and extraction tool to customize federal, state, and local data to suit your requirements. Using DataFerrett, you can develop an unlimited array of customized spreadsheets that are as versatile and complex as your usage demands then turn those spreadsheets into graphs and maps without any additional software. • My Comment: This is what I use Spotfire for on Open Government Data for the Digital Government Strategy.
Community Economic Development HotReport Description • This site, the Community Economic Development HotReport, provides access for users seeking economic indicators for individual counties. • For areas that experience economic disruptions due to natural disasters, plant closings, base closings, and other economic changes, such as abrupt increases in employment, this HotReport shows pertinent economic indicators in unified on-line reports from many data sources.
Community Economic Development HotReport Web Site Click on graph to view table. Community Economic Development HotReport
White House Big Data Event: Data to Knowledge to Action “Just wanted to say how helpful it is that you take notes and share so broadly at these types of events. Thanks for your ongoing contributions to all the communities of which you are a part.” Making the Most of Big Data
Semantic Data Science Team Attends White House Big Data Event • Our work is an example of the bold new collaboration theme: “Harnessing the Potential of Data Scientists and Big Data for Scientific Discovery” that shows “Data Innovation Across Sectors” and includes the following Breakout session topics: • Education and Workforce Development (George Mason University and Johns Hopkins University - see below) • My Note: Census is one of 9 agencies involved in this NITRD effort. • Research and Development (NIH and YarcData) • Innovation (DC Data Science Community and Semantic Community)
NITRD Supplement to the FY14 President’s Budget • We have worked to support the NITRD Current and Planned Coordination Activities as follows: • Working with two of the six agencies (NSF and NIH) and trying to work with the other four (DoD, DARPA, DOE, and USGS); • Following the work in the NSF-NIH Solicitation, Core Techniques and Technologies for Advancing Big Data Science & Engineering, for datasets and results that can be reused; • Helping ensure a trained workforce to capitalize on big data resources by working with GMU Data Science as part of our team and preparing a graduate course on data science using the applications and data sets mentioned above and below; • Providing examples of applications that use multiagency big datasets and the core technology needed to turn heterogeneous data into more homogeneous, interoperable data; • Providing big data infrastructure development for domain science with Spotfire and the YarcData Graph Appliance; and • Attending the second National Big Data R&D Initiative event. • My Note: We would like to work with Census on any or all of these! Current and Planned Coordination Activities
Demos • Spotfire 6: • Web Link • Semantic Medline with YarcData Graph Appliance Pilot: • Wiki • YarcData Videos • Schizo-7 minutes • Cancer-21 minutes
Contact Information • Brand Niemann, Semantic Community • bniemann@cox.net • 703-268-9314 • http://semanticommunity.info • N. Fredrik Salvesen, SBK LLC Alliance Partner YarcData • fredrik@salvesen.me • 443 994-5193 • http://yarcdata.com/
Some Next Steps • So after about 10 years of development and the recent work of our Semantic Data Science Team, we think we have the best US Federal Government semantic knowledge base (NIH Semantic Medline) running on one of the best graph computers (YarcData) for the OSTP/NITRD Federal Big Data Senior Steering WG. • Our goal is to produce the “Killer Semantic Web Application for the US Federal Government” and we still have a ways to go. • Now we need to help other agencies do the same by applying semantic data science to their data and metadata to develop their semantic knowledge base for piloting on the best graph computers. • The following is a pilot example to begin to develop a semantic knowledge base for US Census showing the steps for preparing legacy US Census data sources and for collecting new US Census data sources so they are stored directly in a semantic knowledge base. • A historical note: This is like when I led the E-forms For E-government Pilot for OMB and the Federal CIO Council – I selected the US Census Economic Census E-forms solution by Rick Fenestra to be the best practice for getting about 15 E-forms solutions being used by the US Federal Government to adopt a common e-Grant XML Schema so all 15 could become semantically interoperable and agencies would not have to “rip and replace” solutions. This approach could make agency semantic knowledge bases interoperable so they can be federated and we would have a “killer semantic web application” on top of “individual killer semantic web applications”!
Data Access Tools • Quick Facts • American FactFinder • Easy Stats • My Congressional District • Population Finder • American Community Survey • 2010 Census • Economic Census • Interactive Maps • Data Visualizations • Training & Workshops • Data Tools • Catalogs • Publications http://www.census.gov/main/www/access.html
Census Semantic Knowledge Base • US Census data is available in the following ways: • Data Access Tools: Making It Easier to Use the Data Than Just Direct File Access Below (Start Here) • Research Data Centers: Access to Confidential Data (Defer This Until Later Stage) • Software to Download: More Tools to Use (This is More About Data Than Software) • Direct File Access: Public (Include This) and Private (Not Applicable Here) • Access Tools at Other Sites: Is There a Better Place to Build This Semantic Knowledge Base? (That University of Minnesota Web Site Looks Pretty Good!) My Note: This defines how to start and the scope of the semantic knowledge base.
Semantic Knowledge Base • Initially we need at least a taxonomy and a vocabulary. • Eventually, we would like an ontology and thesaurus. • We need to build a data and metadata ecosystem with relational and graph data sets. • The pilot will build a knowledge base in MindTouch, spreadsheets in Excel, a dashboard in Spotfire, and a business process for data collection in Be Informed. • The pilot will be scaled up to create an RDF triple store for the YarcData Graph Appliance. • In essence, I am going to build a “SemanticData.gov” type application for the US Census Data.
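My Note: A minimal sketch of how a small taxonomy and vocabulary might be written out as RDF triples in N-Triples form, ready for loading into a triple store. The base URI, the choice of SKOS predicates, and the helper names are assumptions for illustration; the deck does not specify them. The tool names and the DataFerrett description are taken from the slides.

```python
# Hypothetical base URI; the deck does not specify one.
BASE = "http://example.org/census/"
# SKOS core namespace, used here as an assumed choice of predicates.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Taxonomy: child term -> parent term.
taxonomy = {
    "DataFerrett": "Data Access Tools",
    "QuickFacts": "Data Access Tools",
    "American FactFinder": "Data Access Tools",
}

# Vocabulary: term -> definition.
vocabulary = {
    "DataFerrett": "A data analysis and extraction tool.",
}


def uri(name):
    """Wrap a term name as an N-Triples URI reference."""
    return "<" + BASE + name.replace(" ", "_") + ">"


def literal(text):
    """Wrap text as an N-Triples string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_ntriples(taxonomy, vocabulary):
    """Return one N-Triples line per taxonomy link and per definition."""
    lines = []
    for child, parent in sorted(taxonomy.items()):
        lines.append(f"{uri(child)} <{SKOS}broader> {uri(parent)} .")
    for term, definition in sorted(vocabulary.items()):
        lines.append(f"{uri(term)} <{SKOS}definition> {literal(definition)} .")
    return lines


ntriples = to_ntriples(taxonomy, vocabulary)
print("\n".join(ntriples))
```

Starting with only "broader" links and definitions matches the "taxonomy and vocabulary first, ontology and thesaurus later" order on this slide; richer relations could be added as further predicates.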
Data Access Tools • Data Visualization Gallery: Recall Slide 6 Knowledge Base and Slide 7 Spotfire • 2010 Census Interactive Population Map • The American FactFinder • QuickFacts • Easy Stats • County Business & Demographics Map • Economic Database Search and Trend Charts • Glossary: See Slide 26 Excel and Slide 29 Spotfire Knowledge Bases • Censtats • Online Mapping Tools • US Gazetteer • Business Dynamics Statistics • DataFerrett: Recall Slides 8-9 • Community Economic Development HotReport: Recall Slides 10-11 • QWI Online • OnTheMap • Industry Focus • Census 2000 EEO Data Tool My Note: This is another taxonomy!
Data Access Tools: Knowledge Base Spreadsheet My Note: This is a taxonomy in Semantic Web Linked Open Data Format. http://semanticommunity.info/@api/deki/files/27077/USCensusSemanticKnowledgeBase.xlsx
Direct File Access: Public My Note: This is a taxonomy of how Census organizes its data files that needs to be a searchable index in a spreadsheet. http://www2.census.gov/census_2000/datasets/
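My Note: A rough sketch of turning a directory listing into a searchable, spreadsheet-ready index. The relative paths below are hypothetical placeholders, not real Census file names, and the column names are assumptions; only the base URL comes from this slide.

```python
import csv
import io

BASE_URL = "http://www2.census.gov/census_2000/datasets/"

# Hypothetical relative paths standing in for a real directory listing.
paths = [
    "example_product/example_state/file_a.zip",
    "example_product/example_state/file_b.zip",
    "another_product/readme.txt",
]


def build_index(paths):
    """Split each path into columns a spreadsheet user can sort and filter on."""
    rows = []
    for p in paths:
        parts = p.split("/")
        rows.append({
            "top_level": parts[0],
            "depth": len(parts),
            "file_name": parts[-1],
            "url": BASE_URL + p,
        })
    return rows


def search(rows, keyword):
    """Case-insensitive keyword search over the full URL."""
    k = keyword.lower()
    return [r for r in rows if k in r["url"].lower()]


def to_csv(rows):
    """Serialize the index as CSV text that Excel or Spotfire can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["top_level", "depth", "file_name", "url"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


index = build_index(paths)
hits = search(index, "readme")
print(to_csv(index))
```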
Direct File Access Public: Knowledge Base Spreadsheet My Note: This is in both relational and graph (subject, predicate, and object) database formats. http://semanticommunity.info/@api/deki/files/27077/USCensusSemanticKnowledgeBase.xlsx
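My Note: To illustrate how one spreadsheet can serve as both a relational table and a graph, here is a small sketch that converts rows into (subject, predicate, object) triples and back again. The row identifiers, column names, and function names are invented for the example; the actual spreadsheet layout may differ.

```python
# Relational form: one dict per spreadsheet row. Ids and columns are hypothetical.
rows = [
    {"id": "tool1", "name": "QuickFacts", "category": "Data Access Tools"},
    {"id": "tool2", "name": "Easy Stats", "category": "Data Access Tools"},
]


def rows_to_triples(rows, key="id"):
    """Graph form: the row key is the subject, each other column a predicate."""
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


def triples_to_rows(triples, key="id"):
    """Rebuild the relational rows by grouping triples on their subject."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {key: s})[p] = o
    return list(by_subject.values())


triples = rows_to_triples(rows)
for t in triples:
    print(t)
```

Each cell becomes one triple, so the two forms carry the same information and the round trip recovers the original rows.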
Census Taxonomy and Vocabulary: MindTouch Matrix My Note: The entire page & platform can be searched. http://semanticommunity.info/Census_Semantic_Knowledge_Base#Story
Census Semantic Knowledge Base: Excel Glossary My Note: All of these spreadsheets can be searched. My Note: The Semantic Community approach is consistent with the EU ISA Recommended URI Design and Management Principles. http://semanticommunity.info/@api/deki/files/27084/CensusSemanticKnowledgeBase.xlsx
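My Note: A hedged sketch of minting a URI for each glossary term. It assumes the http://{domain}/{type}/{concept}/{reference} pattern that, as I understand it, the EU ISA guidance on persistent URIs suggests; the domain, the "id" type, the "glossary" concept, and the function names are all placeholders rather than anything the spreadsheet defines.

```python
import re

DOMAIN = "example.org"  # hypothetical domain, not a real Census or Semantic Community host


def slugify(term):
    """Lowercase the term and collapse runs of non-alphanumerics into single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", term.lower())
    return slug.strip("-")


def mint_uri(term, type_="id", concept="glossary"):
    """Build a URI following the assumed {domain}/{type}/{concept}/{reference} pattern."""
    return f"http://{DOMAIN}/{type_}/{concept}/{slugify(term)}"


print(mint_uri("American FactFinder"))
```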
Census Semantic Knowledge Base: Spotfire Glossary https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?CensusSemanticKnowledgeBase-Spotfire.dxp
Census Semantic Knowledge Base: Spotfire Taxonomy https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?CensusSemanticKnowledgeBase-Spotfire.dxp
Conclusions and Recommendations • A taxonomy (Interactive Internet Data Tools) and vocabulary (Glossary) from Census were used to pilot a semantic knowledge base. • Agile development of the semantic knowledge base was possible when the data dictionary and data are readily available in a spreadsheet or at the download site so one can focus on doing the data science and analytics. • The Census "Building Deep Links into American FactFinder" can be Semantic Web Linked Open Data. • See 2012 Statistical Abstract as a Semantic Knowledge Base in the Next Slide. • The Semantic Community Platform can produce a Census data science ecosystem and products in an interoperability interface with semantic interoperability. • Next is piloting Be Informed for Census survey data collection and then YarcData on the triple stores that are created.
Statistical Abstract 2012: Spotfire Knowledge Base http://semanticommunity.info/FedStats.net#Spotfire_Dashboard