Brand Niemann Senior Enterprise Architect EPA Enterprise Architecture Team

Enterprise Data Architecture and Implementation: Federated, Faceted, Semantic Search of Both EPA Metadata and Data with Governance Brand Niemann Senior Enterprise Architect EPA Enterprise Architecture Team March 26, 2008, Updated April 4, 2008

Brief History
• March 3 - 11, 2008, Enterprise Data Architecture Discussions and Activities.
  • Kevin Kirby, David Prompovitch, Michael Alford, and Brand Niemann.
• March 12, 2008, Enterprise Data Architecture Program, Kevin Kirby, Overview Presentation for CIO Biweekly.
  • Strategy for Program Growth (see next slide).
• March 13, 2008, Enterprise Data Architecture Briefing, Kevin Kirby, Enterprise Architecture Working Group Session.
  • Essentially repeat of March 12th with suggestions (see slide 4).
• March 13, 2008, Data Architecture Subcommittee Meeting, Brand Niemann, Informal Presentation.
  • Vision & Implementation (see slides 5-8).
  • Web 2.0 (see slides 9-10).
• March 16-20, 2008, The DAMA International Symposium & Wilshire Meta-Data Conference, Kevin Kirby Attending.
  • At least nine presentations on Web 2.0, Wikis, etc. for Metadata and Data Management, etc.
• March 24, 2008, EPA Data Architecture: Overview of Metadata Strategy – Summary of Issues for Data Advisory Council, Kevin Kirby, Enterprise Architecture Team Call.
  • Metadata Framework for Discovery & Evaluation and Conceptual Federated Search Architecture.

Vision and Implementation Our initial objective is to see whether this Web 2.0 Wiki can be useful in bringing about collaboration across the Metadata Management Functions Matrix, Teams-Tasks Matrix, and Data Architecture Documents. A longer-range goal is to see whether this Web 2.0 Wiki could be used as an Enterprise Metadata Management and Application Development Tool (e.g., for data and metadata mashups). Footnotes: See slide 4. Note: Web 2.0 does DRM 2.0 and Web 3.0 does DRM 2.0/3.0!

Footnotes
• (1) FEA DRM 2.0 and Report to Congress (2005).
• (2) February 6, 2007, and February 5, 2008.
• (3) Combines Description and Context from DRM 2.0. See (2).
• (4) The data and metadata are combined together (see Brand Niemann).
• (5) Information Architecture (topics and subtopics) and Data Architecture (data tables and data elements) are integrated. See Web 2.0 Wiki Pilot: Information Classifications.
• (6) This specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources.
• (7) EPA Data Architecture Enterprise Metadata.
• (8) Video on data reuse in mashups that will revolutionize EPA data architecture, data management, and data reuse applications!
• * Note: This also works with relational databases.

Vision and Implementation http://epametadata.wik.is/ (password required to see)

Vision and Implementation The EPA Data Architecture Metadata Community of Interest (CoI) is working to integrate the following metadata sources for information sharing and integration across the enterprise and the world. (1) Web 2.0 Wiki pages are XML-based and have RSS Feeds!
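
Because the wiki pages are XML-based and expose RSS feeds, another application can read them with an ordinary XML parser. The following is a minimal sketch of that idea, assuming a standard RSS 2.0 layout; the feed text, titles, and example.org links are hypothetical placeholders, not content from the EPA wiki.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a wiki page feed;
# the titles and links are illustrative, not real EPA wiki content.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example wiki changes</title>
    <item>
      <title>Metadata Management Functions Matrix</title>
      <link>http://example.org/wiki/functions-matrix</link>
    </item>
    <item>
      <title>Teams-Tasks Matrix</title>
      <link>http://example.org/wiki/teams-tasks</link>
    </item>
  </channel>
</rss>"""


def read_feed_items(feed_xml: str) -> list[dict[str, str]]:
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


feed_items = read_feed_items(SAMPLE_FEED)
for entry in feed_items:
    print(entry["title"], "->", entry["link"])
```

In practice the feed text would be fetched from the wiki rather than embedded, and a password-protected wiki would also require authentication, which this sketch leaves out.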

Web 2.0 Source: Mills Davis, Four Stages of the Web at http://project10x.com/about.php

Web 2.0
• Some basic functionalities:
  • Author like Word
  • Edit/comment on every page
  • Some level of security for every page
  • Tagging
  • Versioning
  • Watchlist
  • RSS/XML between applications
  • Search
  • etc.

Overview of Metadata Strategy
• Direction from July 2007 Meeting:
  • “Enable to share” means enabling EPA to share data within programs, across programs, with partners, and with the public.
• Purpose and General Approach: Phase 1 (through April 14, 2008):
  • Objects include: DBMS Data Sets, Unstructured Data (e-mail, docs), and Multimedia, etc.
• Proposed Metadata Framework for Data “Objects”:
  • Coverage is Incomplete. Slide 10.
• Federated Registries with a Common Front End Search Tool:
  • Conceptual Architecture Using Faceted Search. Slide 11.
• Governance Artifacts to Implement this Framework:
  • A National Data Policy Modeled after NGD.

Metadata Framework for Discovery & Evaluation Categories of metadata help the user assess the value of the data set. Levels of metadata exist within an RDBMS set, especially for evaluating quality and security issues. Standard taxonomies aid discovery. These might be specific to broad categories like “Admin./Financial”. EPA Data Classification is a start.

Conceptual Federated Search Architecture The major gap is for RDBMS Data Sets not managed by Informatica.
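
The idea of federated registries behind a common front-end search tool can be sketched in a few lines. This is only an illustration under stated assumptions: the `Registry` class, the `federated_search` function, and the registry names and records are all hypothetical, and real registries would be queried over the network rather than in memory.

```python
from dataclasses import dataclass


@dataclass
class Record:
    registry: str   # which registry returned the record
    title: str
    description: str


class Registry:
    """A stand-in for one metadata registry (node) in the federation."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # list of (title, description) tuples

    def search(self, term):
        term = term.lower()
        return [
            Record(self.name, title, desc)
            for title, desc in self._records
            if term in title.lower() or term in desc.lower()
        ]


def federated_search(registries, term):
    """Send the same query to every registry and merge the hits."""
    hits = []
    for registry in registries:
        hits.extend(registry.search(term))
    # Sort by title so results from different registries interleave.
    return sorted(hits, key=lambda r: r.title)


# Hypothetical registries and records, for illustration only.
registries = [
    Registry("registry_a", [("Air quality monitors", "Station metadata"),
                            ("Water permits", "Permit data set")]),
    Registry("registry_b", [("Air emissions inventory", "Facility emissions")]),
]

results = federated_search(registries, "air")
for r in results:
    print(f"[{r.registry}] {r.title}")
```

A data set that no registry manages never appears in the merged results, which is one way to picture the gap noted on this slide.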

Demonstrations
• Federated
• Faceted
• Semantic Search
• Data
• Metadata
• Governance
• DRM 2.0 Compliance
• Information Architecture and Data Architecture
• DRM 3.0/Web 3.0
• Discovery (Centrifuge) (TRI data pilot slides coming)

Federated See Multiple Nodes on the Same or Different Web Servers.

Faceted See Hierarchy of Topics, Subtopics, etc. That Can be Searched.

Semantic Search See Query Within Context and With Various Semantic Operators.

Data Screen-scrape This Table and Copy It to Excel and the Structure is Preserved.
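
The same structure-preserving copy can be done programmatically. The sketch below, which assumes the table is plain HTML with `<tr>`, `<th>`, and `<td>` tags, collects the rows and writes them as CSV; the `TableParser` class name and the sample markup are hypothetical placeholders.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of each <tr> in an HTML document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None      # cells of the row being read
        self._cell = None     # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical table markup; the values are placeholders, not EPA data.
SAMPLE_HTML = """
<table>
  <tr><th>Facility</th><th>Chemical</th></tr>
  <tr><td>Plant A</td><td>Chemical X</td></tr>
  <tr><td>Plant B</td><td>Chemical Y</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)

# Write the rows as CSV, which Excel opens with columns intact.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```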

Metadata This is the Highest-Quality, Peer-Reviewed Metadata the Agency Has Produced.

Taxonomy This Taxonomy Was Produced by Subject Matter Experts and Peer Reviewed.

Governance The Words Governance and Provenance Have Both Been Used.

DRM 2.0 Compliance The Three Requirements for Information Sharing Have Been Satisfied!

Information Architecture and Data Architecture
• Level 1 Top-level Topics
• Level 2 Next-level Subtopics
• Level 3 Data Tables
• Level 4 Data Elements
• See: Getting to Web Semantics for Government Spreadsheets Pilot (RDF/SPARQL)
  • http://semanticommunity.wik.is/People/Brand_Niemann/2008_Semantic_Technology_Conference
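
One way to picture the four levels, and how they support faceted search, is a nested mapping from topics to subtopics to data tables to data elements. The sketch below is an assumption-laden illustration: the `CATALOG` contents and the `walk` and `faceted_search` helpers are hypothetical, and the pilot itself uses RDF/SPARQL rather than Python dictionaries.

```python
# Hypothetical four-level classification: topic -> subtopic -> table -> elements.
# All names below are placeholders, not the actual EPA classification.
CATALOG = {
    "Air": {
        "Emissions": {
            "facility_emissions": ["facility_id", "chemical", "pounds_released"],
        },
    },
    "Water": {
        "Permits": {
            "discharge_permits": ["permit_id", "facility_id", "expiration_date"],
        },
    },
}


def walk(catalog):
    """Yield one (topic, subtopic, table, element) tuple per data element."""
    for topic, subtopics in catalog.items():
        for subtopic, tables in subtopics.items():
            for table, elements in tables.items():
                for element in elements:
                    yield topic, subtopic, table, element


def faceted_search(catalog, topic=None, subtopic=None, element=None):
    """Keep the entries that match every facet the caller supplied."""
    return [
        row for row in walk(catalog)
        if (topic is None or row[0] == topic)
        and (subtopic is None or row[1] == subtopic)
        and (element is None or row[3] == element)
    ]


# Which tables, under any topic, carry a facility_id element?
matches = faceted_search(CATALOG, element="facility_id")
for topic, subtopic, table, _ in matches:
    print(f"{topic} > {subtopic} > {table}")
```

Because the information architecture (levels 1 and 2) and the data architecture (levels 3 and 4) live in one structure, a single query can cross both.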

DRM 3.0/Web 3.0 Source: Mills Davis, Four Stages of the Web at http://project10x.com/about.php

Discovery (Centrifuge) Centrifuge Systems is a leading provider of next generation business intelligence software that helps organizations discover insights, patterns and relationships hidden in their data. The unique Centrifuge approach allows users to ask open ended questions of their data by interacting with visual representations of the data directly. Traditional business intelligence solutions require users to define what they want to see in advance and present the results in static dashboards. With Centrifuge, users determine what is of interest “on the fly”, then manipulate the displays directly in a highly interactive fashion. The experience is refreshingly easy-to-use and the resulting insights can be extraordinary. Centrifuge is used in some of the most demanding applications in the world, including law enforcement, counter-terrorism and homeland defense, to help analysts move from data to discovery.

Centrifuge Server Centrifuge provides an interactive visualization layer on data such as the Toxics Release Inventory-Made Easy for the Web (TRI-ME WEB). Data can be viewed through a desktop client or a web browser. Here is sample data through a web browser. Centrifuge Server, as a next generation Information visualization system, meets the following requirements:
• Ground-Breaking Interactive Visualization in a Browser
• A 100% browser-based thin client
• Collaborative Analysis
• Modern SOA Architecture
• Geospatial Integration with Google™ Earth
• Pluggable, Componentized and Extensible
• Easy to Use

Table View This view represents a sample table from the TRI-ME WEB dataset downloaded from the EPA website.

Relationship Graph This relational view represents a bundled graph of MD 2006 data showing Companies linked to Chemicals. This graph shows that two of the primary collections, PBT and TRI, have multiple companies between them. The chemicals have been bundled (grouped) by their chemical classifications.

Relationship Graph Spinoff A subset for specific chemicals of interest can be created (spun off). In this case, PBT chemicals are shown bundled and connected to the companies associated with them.
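
The bundling and spinoff ideas can be approximated with plain data structures. This is a rough sketch, not how Centrifuge implements them: the `build_graph` and `spin_off` helpers and the company, chemical, and classification values are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical (company, chemical, classification) links; placeholder values only.
LINKS = [
    ("Company A", "Chemical X", "PBT"),
    ("Company A", "Chemical Y", "TRI"),
    ("Company B", "Chemical X", "PBT"),
    ("Company C", "Chemical Z", "TRI"),
]


def build_graph(links):
    """Map each classification bundle to its chemicals, and each chemical to its companies."""
    bundles = defaultdict(set)
    companies = defaultdict(set)
    for company, chemical, classification in links:
        bundles[classification].add(chemical)
        companies[chemical].add(company)
    return bundles, companies


def spin_off(links, classification):
    """Return only the links whose chemical belongs to the given classification."""
    return [link for link in links if link[2] == classification]


bundles, companies = build_graph(LINKS)

# Spin off the PBT subset and rebuild the graph from it alone.
pbt_links = spin_off(LINKS, "PBT")
pbt_bundles, pbt_companies = build_graph(pbt_links)
print(sorted(pbt_companies["Chemical X"]))
```

The spinoff is just a filtered copy of the links, so any view built from the full set can be rebuilt from the subset in the same way.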

Table Spinoff The spinoff concept applies to all views, for example, the table view shown on this page.

Relationship Graph of Table Spinoff This relational view represents a graph of the previous table spinoff.

Quantitative (Charts) View This quantitative view represents a simple distribution of the number of times a chemical is referenced across all companies and facilities in MD.
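
The distribution behind such a chart is a simple count of references per chemical. A minimal sketch, assuming the data arrives as (facility, chemical) rows; the `REPORTS` values are hypothetical placeholders, not TRI data.

```python
from collections import Counter

# Hypothetical (facility, chemical) report rows; placeholder values only.
REPORTS = [
    ("Facility 1", "Chemical X"),
    ("Facility 2", "Chemical X"),
    ("Facility 2", "Chemical Y"),
    ("Facility 3", "Chemical X"),
]

# Count how many times each chemical is referenced across all facilities.
reference_counts = Counter(chemical for _, chemical in REPORTS)

# Print a text bar per chemical, most referenced first.
for chemical, count in reference_counts.most_common():
    print(f"{chemical:<12}{'#' * count} ({count})")
```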

Timeline (Temporal) View This temporal view of sample data shows how time-based data can be viewed. For example, this could represent toxic release events if that data were available and time-stamped.

Geospatial View This geospatial view represents the spatial distribution of facilities across Maryland.

Detailed Geospatial View This geospatial view represents the locations of toxic chemicals in Baltimore, Maryland.

DRM 3.0/Web 3.0 http://richard.cyganiak.de/2007/10/lod/
