Opening EPA Databases Now Closed to Search Engine Crawlers
Brand Niemann, Senior Enterprise Architect, Office of Environmental Information, U.S. EPA
Federal CIO Council Semantic Interoperability Community of Practice (SICoP), Co-chair
EPA Web Work Group Conference, March 13-15, 2007
March 15, 2007, 3-3:45 p.m.
Google: Federal Sitemaps Initiative
Introduction • Since I had the advantage of listening to most of the conference, I thought I would provide some perspective as follows: • Maxamine: Should support the Sitemaps Protocol. • Washington Post Interactive: EPA can't afford what their 700 journalists produce, but we can afford a Wiki. • Library of Congress: Cited Wikinomics* and supports the Sitemaps Protocol. • Portals Project: New databases for the public should support the Sitemaps Protocol. * See http://www.wikinomics.com/, Wikinomics: How Mass Collaboration Changes Everything, 2006.
Introduction • Perspectives (continued): • Infrastructure Priorities: • Search Enhancements (folders and databases) (YES) • Content Management System (YES) • Information Architecture (YES) • Media Inventory (graphics) (YES) • Content Priorities: • Drive Visitors to EPA.Gov (the Sitemaps Protocol could deliver gains like the 40% increase LoC reported at this conference!) • Content Reorganization (YES - repurpose) • Develop Standards, etc. (RSS, Web 2.0, and Web 3.0/Semantic Web) • Support Crisis Communication (YES) • Note: SICoP already does all of these things with Wikis and Semantic Wikis with EPA and non-EPA Content!
Introduction
Link to Preview of DVD by EPA Region 4
Google (for example):
• Brand Niemann: Slide 5
• NIH Wiki Fair: Slide 6
• Best Practices Committee: Slide 7
  • DRM 3.0 and Web 3.0
  • Audio and Video
• EPA Data Architecture for DRM 2.0 (and now 3.0): Slide 32
• Federal Sitemaps: Slide 16
Introduction http://www.himotion.us/2/2006/139.html
Introduction • Scott Butner, Introduction to the Semantic Web: Basic Concepts, Tools, and Why It Matters: • My Comments: • Excellent Overview of Concepts and Tools. • EPA Needs More Semantic Web and Technology Pilots. • Google: SICoP (Semantic Interoperability Community of Practice – Federal Chief Information Officer Council). • We Have Been Discussing the Use of Wikis and Semantic Wikis! • See http://Knoodl.com and http://www.visualknowledge.com • These are open and free to all.
Introduction • Cy Kidd, OW Web Modernization Project Update: Creating An Integrated Site: • My Suggestions: • Use Most Trusted Reference Knowledge Source to Create An Ontology (Taxonomy): • Sustainable Water Resources Roundtable (White House Council on Environmental Quality). • Really Helps Prioritize, Organize, and Integrate Content. • Use Enterprise Architecture/Segment Architecture/Solutions Architecture for EPA Water Resources: • EPA Office of Water (Mark Hamilton) Leads Segment Architecture Work. • EPA Enterprise Architecture Team (Brand Niemann) Leads Pilot to Apply Water Ontology to Integrate (Mashup) EPA and Non-EPA Water Content, Make Databases Visible (Sitemaps Protocol), and Reduce Data Element Redundancy in Major Water Databases.
Abstract • Government information is estimated to be about 80% unstructured, and about 90% of the structured information is estimated to be invisible to search engine crawlers and users. In addition, because (1) the UK government recently announced that hundreds of its websites are being consolidated or shut down to make access to information easier for people, and (2) the recent SICoP Special Conference on Building DRM 3.0 and Web 3.0 supported the Federal CIO Council Strategic Plan for FY 2007-2009 Goal 2 (information securely, rapidly, and reliably delivered to our stakeholders) by providing implementation strategies, best practices, and success stories, it seems appropriate to pilot a process that deals with all of these issues at the same time. • The purpose of this presentation is to structure unstructured EPA information, make EPA databases visible to search engine crawlers and users, consolidate EPA information to make it easier to use, and provide semantic metadata and linking in support of DRM 3.0 and Web 3.0 applications. The new EPA Strategic Plan, Report on the Environment, Enterprise Architecture, and Performance Results are used to illustrate the "long tail" of search (being successful with obscure queries).
Overview • 1. Common Barriers to Web Search Engine Crawling • 2. Sitemaps Protocol • 3. Federal Sitemaps Initiative • 4. EPA Experience • 5. Microformats • 6. Gleaning RDF from XML • 7. DRM 3.0/Web 3.0 • 8. EPA Pilots • 9. Federal Sitemaps Initiative at FOSE 2007 • 10. Questions & Answers
1. Common Barriers to Web Search Engine Crawling
What can make a site effectively invisible to search engine users:
• Content "hidden" behind search forms
• Non-HTML links
• Outdated robots.txt crawling restrictions (see the sketch below)
• Server errors (crawler times out when fetching content)
• Orphaned URLs
• Rich media: audio, video
• Premium content
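To make the robots.txt barrier above concrete, here is a minimal Python sketch (not part of the original slides) that checks whether a crawler is allowed to fetch a given page. The URLs and user-agent string are placeholders, not real EPA addresses.

# Minimal sketch: check whether a crawler may fetch a given URL under the
# site's robots.txt rules. All URLs below are illustrative placeholders.
from urllib import robotparser

def is_crawlable(robots_url: str, page_url: str, agent: str = "Googlebot") -> bool:
    """Return True if robots.txt at robots_url allows `agent` to fetch page_url."""
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetches and parses the live robots.txt file
    return parser.can_fetch(agent, page_url)

if __name__ == "__main__":
    print(is_crawlable("http://www.example.gov/robots.txt",
                       "http://www.example.gov/databases/report.cfm?id=123"))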
1. Common Barriers to Web Search Engine Crawling Total: 27 Sample list of EPA sites with uncrawlable elements: http://spreadsheets.google.com/pub?key=pUb62ZKHnzgqEoGF4LFf3Gw
2. Sitemaps Protocol • The Sitemap protocol is an open, XML-based standard for managing search engine crawling. The protocol provides website owners a means of communicating to search engines the location, priority, change frequency, and last modification date of all pages on a website or web-accessible database, which can ensure complete and efficient crawling of the site's contents. • The Sitemap protocol was introduced by Google in June 2005 under a Creative Commons License and was adopted in November 2006 as an industry standard by Google, Microsoft and Yahoo. • See SearchEngineWatch - Search Engines Unite On Unified Sitemaps System, November 16, 2006.
2. Sitemaps Protocol
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
http://www.sitemaps.org/
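A sitemap like the example above is normally generated programmatically. The following is a minimal Python sketch, assuming the record URLs and last-modified dates come from a database query; the function name, file name, and sample records are illustrative, not an official tool.

# Minimal sketch: generate a sitemap.xml for database-backed pages that a
# crawler cannot discover by following links. Sample records are hypothetical.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(records, out_path="sitemap.xml"):
    """records: iterable of (url, lastmod) tuples pulled from a database."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in records:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod      # format: YYYY-MM-DD
        ET.SubElement(entry, "changefreq").text = "monthly"
        ET.SubElement(entry, "priority").text = "0.5"
    ET.ElementTree(urlset).write(out_path, encoding="UTF-8", xml_declaration=True)

if __name__ == "__main__":
    sample = [("http://www.example.gov/db/record.cfm?id=1", "2007-03-01"),
              ("http://www.example.gov/db/record.cfm?id=2", "2007-02-15")]
    build_sitemap(sample)

Note that the protocol caps each sitemap file at 50,000 URLs, so large databases would split output across several files listed in a sitemap index.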
3. Federal Sitemaps Initiative • Federal Sitemaps is an initiative to help federal agencies make their websites more accessible to search engine users through sitemapping. • Representative federal agencies that have implemented the Sitemap protocol to open a previously uncrawlable database or other website element to search engine crawling: • Library of Congress • National Center for Education Statistics (NCES), Department of Education • Government Accountability Office (GAO) • Office of Science and Technology Information (OSTI), Department of Energy • PlainLanguage.gov, sponsored by Web Content Management Working Group of the Interagency Committee on Government Information (ICGI) & owned by PLAIN, Plain Language Action and Information Network, a federal, interagency working group.
3. Federal Sitemaps Initiative • The XML Community of Practice and the Semantic Interoperability Community of Practice (SICoP) encourage adoption and implementation of the Sitemap protocol by federal agencies because it: • Supports the E-Government Act of 2002 (Pub. L. No. 107-347) • Supports the Federal Enterprise Architecture's Data Reference Model 2.0 requirements. • Supports the SICoP DRM 2.0 Implementation - Knowledge Reference Model requirements for the use of increasing metadata to provide increasingly powerful search results. • Supports the new CIOC Strategic Plan FY 2007-2009. See pages 10-11 re Goal 2: Information securely, rapidly, and reliably delivered to our stakeholders. Provide updates to the FEA Data Reference Model (DRM) and establish DRM implementation strategies, best practices, and success stories.
3. Federal Sitemaps Initiative • Recent presentations: • February 15, 2007, Web Content Managers Forum. • February 6, 2007, SICoP Special Conference: Building DRM 3.0 and Web 3.0 for Managing Context Across Multiple Documents and Organizations. • January 29, 2007, EPA OEI (OTOP and OIAA). • January 21-25, 2007, The National Academies Transportation Research Board 86th Annual Meeting. • January 17, 2007, XML CoP Meeting. http://colab.cim3.net/cgi-bin/wiki.pl?FederalSitemaps
3. Federal Sitemaps Initiative • OSTI success story: • Department of Energy agency that “makes R&D findings available and useful, so that science and technological creativity can advance”. • Web manager submitted sitemaps for Energy Citations and Information Bridge services, opening 2.3M bibliographic records and full-text documents to crawling. • Sitemap standard assures web search engines have “a complete picture” of information in OSTI services. http://www.osti.gov/
4. EPA Experience • Sitemaps augments, but does not replace, regular crawling. • Sitemaps is focused on exposing the contents of databases, which estimates suggest may be as much as 90% of Web content. • The current Sitemaps protocol is a "lowest-common-denominator" approach (recall slide 7). • In EPA's new template, we're including the Dublin Core fields that make us consistent with the eGov Act of 2002 and the OMB guidance pursuant to it (see next slide). • I will meet with the Searchmasters and discuss how we might alter our existing "jump pages" to conform to the Sitemap protocol, or alter our jump-page creation process to also create Sitemaps (a sketch of this idea follows below). Source: John Shirey, Notes on Federal Sitemaps Discussion, January 10-11, 2007.
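One way the jump-page idea in the last bullet could work is sketched below: harvest the links already listed on a jump page and feed them to a Sitemap generator such as the build_sitemap sketch shown earlier. The class name and the assumption that jump-page links are absolute URLs are mine, not EPA's actual process.

# Sketch: mine an existing "jump page" (a plain HTML page of links to
# database records) so the same links can also populate a Sitemap.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect absolute href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def jump_page_to_urls(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

The resulting URL list, paired with last-modified dates, could then be passed to a generator like build_sitemap above.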
4. EPA Experience
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<!-- EPA Template version 3.2.1, 28 June 2006 -->
<head>
  <title>Page Title | Area Name | US EPA</title>
  <meta name="DC.title" content="" />
  <meta name="DC.description" content="" />
  <meta name="keywords" content="" />
  <meta name="DC.Subject" content="" />
  <meta name="DC.type" content="" />
  <!-- For date metadata, use the format YYYY-MM-DD -->
  <meta name="DC.date.modified" content="" />
  <meta name="DC.date.created" content="" />
  <meta name="DC.date.reviewed" content="" />
  <meta name="DC.language" content="en" />
  <meta name="DC.creator" content="" />
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <link rel="meta" href="http://www.epa.gov/labels.rdf" type="application/rdf+xml" title="ICRA labels" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Source: John Shirey, New EPA Basic Template, January 8, 2007.
4. EPA Experience • “Sitemaps as a method for discovering database content is something that I heartily endorse. It makes sense, and it's good to have a data standard for doing it. Google, et. al. are to be commended for that. Too bad it's such a minimalist protocol! As we work to expose database contents to our internal search engine, we will keep in mind the need to express that content in a Sitemap protocol as well. EIMS is our first target database, hopefully tackling it this spring.” Source: John Shirey, Notes on Federal Sitemaps Discussion, January 10, 2007.
5. Microformats • Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. • Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns (e.g. XHTML, blogging). • See http://microformats.org
5. Microformats • Overview of microformats: • People and Organizations • hCard (see the sketch below) • Calendars and Events • hCalendar • Opinions, Ratings and Reviews • VoteLinks, hReview • Social Networks • XFN • Licenses • rel-license • Tags, Keywords, Categories • rel-tag • Lists and Outlines • XOXO
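As a concrete illustration of the first category above (People and Organizations: hCard), here is a minimal Python sketch that wraps contact details in the standard hCard class names (vcard, fn, org, email). The contact shown is a placeholder, not real EPA data.

# Minimal sketch: emit an hCard microformat snippet for a contact so that
# microformat-aware tools can extract the person and organization.
def hcard(name: str, org: str, email: str) -> str:
    return (
        '<div class="vcard">\n'
        f'  <span class="fn">{name}</span>,\n'
        f'  <span class="org">{org}</span>,\n'
        f'  <a class="email" href="mailto:{email}">{email}</a>\n'
        '</div>'
    )

if __name__ == "__main__":
    print(hcard("Jane Doe", "Example Agency", "jane.doe@example.gov"))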
6. Gleaning RDF from XML In this example the focus is on automating the construction of indexes. The idea is to crawl GRDDL source documents and extract embedded RDFa to feed an RDF store. SPARQL queries are then solved against this store and rendered as web pages to automatically generate up-to-date indexes. http://www-sop.inria.fr/acacia/personnel/Fabien.Gandon/tmp/grddl/rdfaprimer/PrimerRDFaSection.html
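A rough sketch of the index-building step described above, written in Python with the rdflib package (an assumption; the primer does not prescribe a toolchain). It assumes the RDF gleaned from the source documents has already been saved to a placeholder file, harvested.rdf, and that the documents carry Dublin Core titles.

# Rough sketch: load RDF harvested from GRDDL/RDFa source documents and run a
# SPARQL query to render a simple, up-to-date index. File name is a placeholder.
from rdflib import Graph

INDEX_QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?doc ?title
WHERE { ?doc dc:title ?title }
ORDER BY ?title
"""

def build_index(rdf_path="harvested.rdf"):
    graph = Graph()
    graph.parse(rdf_path)  # RDF/XML format is inferred from the file extension
    rows = graph.query(INDEX_QUERY)
    # Render each result as an HTML list item for the generated index page.
    return "\n".join(f'<li><a href="{doc}">{title}</a></li>' for doc, title in rows)

if __name__ == "__main__":
    print(build_index())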
7. DRM 3.0 and Web 3.0 • CIOC Strategic Plan FY 2007-2009. Pages 10-11 Re Goal 2: Information securely, rapidly, and reliably delivered to our stakeholders: • Provide updates to the FEA Data Reference Model (DRM) and establish DRM implementation strategies, best practices, and success stories. The purpose of these activities is to contribute to the usability of the DRM by maintaining an effective process for modifying the DRM and sharing strategies for success. • SICoP Special Conference 1, February 6, 2007: • http://colab.cim3.net/cgi-bin/wiki.pl?SICoPSpecialConference_2007_02_06 • SICoP Special Conference 2, April 25, 2007: • http://colab.cim3.net/cgi-bin/wiki.pl?SICoPSpecialConference2_2007_04_25
SICoP Source: Pages 21-22, Federal Chief Information Officer Council Strategic Plan: FY 2007-2009, 28 pp. http://www.cio.gov/documents/CIOCouncilStrategicPlan2007-2009.pdf
7. DRM 3.0 and Web 3.0 • Building DRM 3.0 and Web 3.0 for Managing Context Across Multiple Documents and Organizations: • Therefore it is possible to unify the Data Description and Data Contents by creating an intelligent Directory Interchange Format type structure which will be used to build a knowledge base. This would be the model in DRM 3.0. • This is Web 3.0 Technology because it reasons about content and adds to it. • Source: Lucian Russell, DRM 2.0 Author – see his slides 6-7.
[Figure 3-1, DRM Standardization Areas: a diagram for Data Reference Model 3.0, Web 3.0 & SOAs relating a Data & Information & Knowledge Repository, Data Resource Awareness, Agent, Language, and Logic along a static-to-dynamic axis.]
7. DRM 3.0 and Web 3.0 • Highlights of SICoP Special Conference, February 6, 2007: • Tools for Semantic Data Modeling (WordNet). • Tools for Building WordNets of Documents (Semantic Wikis). • Work with knowledge in three forms: documents, models, and software behaviors. • Tools to Extract Semantic Relationships from Unstructured Text and Build Ontologies (Language Computer Corporation). • Tools to Reason Over Knowledgebases (CYCORP). • Conference Captured as a “best practice” by the CIOC Best Practices Committee.
8. EPA Pilots • Strategies for EPA Databases Now Closed to Search Engine Crawlers: • Get vendors to support automatic generation (Oracle, IBM, etc.) • Convert to HTML • Repurpose to XML • Repurpose to knowledgebases: • EPA Strategic Plan, Report on the Environment, Enterprise Architecture, and Performance Results. • Semantic Relationships between strategic goals, detailed business and information management requirements, and measurable performance improvements.
8. EPA Pilots • Metadata: • Full text of unstructured, semi-structured, and structured information (EPA example: READ). • Harmonization: • Different ways in which the same words are used (EPA example: EDR). • Enhanced Search: • Across all content nodes and showing context (e.g. words around the term or concepts) (EPA example: All the rest) (a sketch of this idea follows below). • Mashups: • A website or application that combines content from more than one source into an integrated experience (repurposing) (EPA example: EIMS).
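A minimal Python sketch of the "words around the term" idea in the Enhanced Search bullet (not an EPA system; the sample text and window size are made up for illustration):

# Minimal sketch: keyword-in-context search, returning a window of words
# around each occurrence of a term. Documents here are plain strings; a real
# deployment would pull text from the content nodes mentioned above.
import re

def contexts(term, text, window=5):
    """Yield up to `window` words on each side of every match of `term`."""
    words = re.findall(r"\S+", text)
    for i, word in enumerate(words):
        if term.lower() in word.lower():
            start = max(0, i - window)
            yield " ".join(words[start:i + window + 1])

if __name__ == "__main__":
    sample = ("The Report on the Environment summarizes trends in air, water, "
              "and land indicators used in the EPA Strategic Plan.")
    for snippet in contexts("water", sample):
        print("...", snippet, "...")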
8. EPA Pilots http://web-services.gov
9. Federal Sitemaps Initiative at FOSE 2007 • Demonstration area of the Government Business Solutions Pavilion, Booth 1237, March 20-22, 2007, 10 a.m. – 4 p.m. each day. • The Sitemap protocol is an open, XML-based standard for managing search engine crawling. The protocol provides website owners a means of communicating to search engines the location, priority, change frequency, and last modification date of all pages on a website or web-accessible database, which can ensure complete and efficient crawling of the site's contents. Federal Sitemaps is an initiative to help federal agencies make their websites more accessible to search engine users through sitemapping. The exhibit will feature a brief video describing the agreed-upon standards and resources, and agency staff involved in implementing these standards will be available to discuss their individual experiences and answer your questions.
10. Questions & Answers • John Lewis (JL) Needham • Strategic Partner Development Manager, Google, Inc. • jlneedham@google.com • Mills Davis • Project10x and SICoP Co-Chair • mdavis@project10x.com • Brand Niemann • Senior Enterprise Architect and SICoP Co-Chair • niemann.brand@epa.gov