U.S. Geological Survey National Biological Information Infrastructure Technical Overview: NBII Metadata Clearinghouse May 2008 Mike Frame
Topics for discussion • Metadata CH Background • New Metadata CH Design & Demo • Underlying Architecture
Services Overview (diagram) • Portal: www.NBII.gov and My.NBII.gov, with an integrated view, content management, and collaboration services • Integrated / federated search over database and web services, geospatial services, and model services • Distributed services named in the diagram: ITIS, Thesaurus, DIGR, geoparsing, geo-referencing, mapping, and catalog discovery operations • Catalogs for resources, geospatial services, datasets (Clearinghouse), web services, models, and database and service catalogs • Standards named in the diagram: Dublin Core (plus), OGC / ISO, FGDC / ISO, UDDI / WSDL ?? • Layers: Describe and Discover; Consume • Distributed Resources: distributed applications, databases, websites, tools, and models
NBII Metadata Resources • http://metadata.nbii.gov
Metadata Resources: • FGDC Metadata Program • NBII Clearinghouse • Resources for using the Standard • Tool reviews • Training Opportunities
Some basic metadata facts… about the FGDC Standard • 7 sections make up the FGDC Standard: • Identification Information • Data Quality Information • Spatial Data Organization Information • Spatial Reference Information • Entity and Attribute Information • Distribution Information • Metadata Reference Information
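As a rough illustration of the seven-section structure, the sketch below checks which sections are present in one metadata record. It assumes the record is encoded as XML using the CSDGM short element names (`idinfo`, `dataqual`, `spdoinfo`, `spref`, `eainfo`, `distinfo`, `metainfo`) directly under the root element, and it uses the section titles as worded in the standard. The sample record is hypothetical and heavily abbreviated.

```python
import xml.etree.ElementTree as ET

# CSDGM short element names for the seven sections, in standard order.
FGDC_SECTIONS = {
    "idinfo": "Identification Information",
    "dataqual": "Data Quality Information",
    "spdoinfo": "Spatial Data Organization Information",
    "spref": "Spatial Reference Information",
    "eainfo": "Entity and Attribute Information",
    "distinfo": "Distribution Information",
    "metainfo": "Metadata Reference Information",
}


def missing_sections(xml_text):
    """Return the titles of the sections absent from one metadata record."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    return [title for tag, title in FGDC_SECTIONS.items() if tag not in present]


# Hypothetical, abbreviated record used only for illustration.
SAMPLE_RECORD = """
<metadata>
  <idinfo><citation><citeinfo><title>Example dataset</title></citeinfo></citation></idinfo>
  <metainfo><metstdn>FGDC Content Standard for Digital Geospatial Metadata</metstdn></metainfo>
</metadata>
"""

for title in missing_sections(SAMPLE_RECORD):
    print("Missing section:", title)
```

A check like this only confirms that a section element exists. It says nothing about whether the mandatory elements inside each section are filled in.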
Rationale for Metadata CH Redesign • User Feedback • Metadata creation • Metadata management • Metadata integration with data • Open architecture framework • Speed and Reliability • Data quality • Data visualization • License Costs
NBII Metadata CH provides: • A single portal to information contained in disparate data management systems • Free text, fielded, spatial, and temporal search capabilities • A way for individuals and database managers to distribute their data while maintaining complete control and ownership • Leverage for existing investment in information systems and research • NBII is part of the Mercury Consortium @ ORNL
NBII CH: New Functionalities • Rich Client Interface • Combined search results (status page) • Filtering search results (Facet) • Dynamic sorting of search results • Bookmark brief and full metadata pages • Based on open source technologies: • Lucene • Solr
NBII CH: New Functionalities (cont.) • SOA based design • Web services • RSS services for search results • Portlet support • Search Sharing support • Thesaurus Support • Seamless data ordering/data extraction with various data partners • Seamless data visualization integration with external visualization tools • Improved User Statistics Collection
The NBII Clearinghouse • The Clearinghouse is operated for NBII by the Oak Ridge National Laboratory • Over 38,000 records • 41 partners contributing metadata records • Ability to search in a variety of ways • Redesigned in 2008
NBII CH Demo • NBII Clearinghouse interface: http://mercdev3.ornl.gov/nbii3/
Metadata CH RSS • World Data Center: http://wdc.nbii.gov
Metadata CH Architecture • The CH function of the NBII Metadata Program is operated by ORNL • NBII is one organization in the Mercury Consortium • Relationship established in 2001 • Formerly based on “Blue Angel Technologies” • Currently based on Lucene/Solr Open Source Technologies
Distributed Data Discovery and Access System (diagram: Virtual Internet Database) • Index fields: P.I. Name, Product Number, Product Title, Site, Subject Area, Thematic Area, Keywords, etc. • 1. Principal investigators create detailed metadata and data files using local applications or ORNL-OME • 2. NBII Mercury collects metadata and key data from contributing agencies’ servers distributed around the country and builds a centralized index • 3. Remote users query the index via a Web-based browser • 4. Metadata summaries are returned to the remote users, including links back to detailed information and data at the PIs’ server or data repository • 5. Remote users select links to data of interest • 6. Highly detailed data and documentation are downloaded directly from the contributing agency • Example summary returned to a user: P.I. Summary – John Smith: Product A (Container 1; 10/12/2003, Container 2; 01/20/2002, Container 3; 07/05/2001), Product B (Container 1; 03/05/1999, ….)
A Virtual Aggregate Database (diagram) • Existing databases: metadata exists in remote legacy databases using any platform, OS, or RDBMS • Databases can be of different structures and content • No re-programming of existing systems required • Business as usual for contributing databases • Custom export programs (Z39.50 or WS): export programs are easily written and automated • Metadata are extracted into XML files, yielding standardized data objects • Encrypted XML: these files can be remotely harvested via the Internet • Harvested metadata are combined at the central site, transformed (if needed), and indexed • Index: frequent, automated harvesting and complete re-building of the index keeps the aggregate database up to date • Users work with a single, simple, web-like interface to access all data simultaneously
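The custom export step can be pictured with a small sketch. It assumes a contributor keeps its metadata in a relational table and writes one XML document per row. The table name `datasets`, its columns, and the `record` element layout are all hypothetical. An in-memory SQLite database stands in for the existing system, and the encryption and harvesting steps shown in the diagram are left out.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_records(conn):
    """Turn each row of the (hypothetical) datasets table into an XML string."""
    rows = conn.execute(
        "SELECT id, title, investigator FROM datasets ORDER BY id"
    ).fetchall()
    documents = []
    for dataset_id, title, investigator in rows:
        record = ET.Element("record", attrib={"id": str(dataset_id)})
        ET.SubElement(record, "title").text = title
        ET.SubElement(record, "investigator").text = investigator
        documents.append(ET.tostring(record, encoding="unicode"))
    return documents


# Stand-in for an existing contributor database; the data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (id INTEGER PRIMARY KEY, title TEXT, investigator TEXT)"
)
conn.executemany(
    "INSERT INTO datasets (title, investigator) VALUES (?, ?)",
    [("Stream temperature survey", "J. Smith"), ("Bird counts", "A. Jones")],
)

xml_documents = export_records(conn)
for document in xml_documents:
    print(document)
```

Because the export only reads from the source table, the contributing database keeps running unchanged, which is the "business as usual" point made on the slide.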
NBII CH Design Diagram • Components, in the order the diagram labels suggest: • External Metadata (FGDC-BIO), reached via http, ftp, and web crawl • NBII CH Harvester • DB updater tool (custom Java) with MySQL database Mercury3_harvests_nbii • Transformed Files • Solr Schema for defining the fields • Solr Indexer tool (custom Java) to index metadata records • Extended Lucene Index • SOLR Search Server • XML Beans to extract the contents • Solr Searcher (custom Java Spring) • Outputs: Portlets, Web Service, RSS, UI
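The transform-and-index stage can be sketched as a conversion from a harvested FGDC record into the XML update message that Solr accepts (`<add><doc><field name="…">`). This is not the custom Java indexer itself. The index field names (`title`, `originator`, `abstract`) and the record identifier are assumptions, and a real deployment would take its fields from the Solr schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from index field names to CSDGM element paths.
FIELD_PATHS = {
    "title": "idinfo/citation/citeinfo/title",
    "originator": "idinfo/citation/citeinfo/origin",
    "abstract": "idinfo/descript/abstract",
}


def to_solr_add(record_id, fgdc_xml):
    """Build a Solr <add> update message from one FGDC XML record."""
    root = ET.fromstring(fgdc_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = record_id
    for field_name, path in FIELD_PATHS.items():
        # A path may match several elements (e.g. multiple originators).
        for node in root.findall(path):
            if node.text and node.text.strip():
                ET.SubElement(doc, "field", name=field_name).text = node.text.strip()
    return ET.tostring(add, encoding="unicode")


# Hypothetical, abbreviated harvested record.
SAMPLE = (
    "<metadata><idinfo>"
    "<citation><citeinfo><origin>J. Smith</origin>"
    "<title>Bird counts</title></citeinfo></citation>"
    "<descript><abstract>Annual counts.</abstract></descript>"
    "</idinfo></metadata>"
)

print(to_solr_add("rec-1", SAMPLE))
```

The resulting message would then be posted to the search server's update handler. That step is omitted here because it depends on the deployment.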
Future Development • Phase II (May 2008 to September 2008): • Harvester engine to use open source tools (Remove COTS) (Phase I & II) • Portal integration through JSR-168 Portlet standard • Search portlets, portlets for recent datasets, most-searched words, etc. • Web service implementation (Phase I & II): • Thesaurus support (semantic web integration support) • Gazetteer web service implementation • OGC Catalog Service (include Web Mapping/Coverage/Feature Servers in search) • Universal Description, Discovery, and Integration (UDDI) Directory Services • Dynamic RSS support, including Geo-RSS support • ISO 19115 support • OpenSearch support • Documentation and Help (Phase I & II) • User Statistics Application modifications • Phase III (October 2008 to January 2009): • Save, Retrieve and Email user queries • Possible integration with OPeNDAP • Web Service Harvesting (OAI) • Internationalization • ????
Search technology using Lucene/SOLR • Lucene • Overview • Who uses Lucene • Solr • Overview • Who uses Solr
Lucene Overview • High-performance, full-featured text search engine library written entirely in Java • Mature Apache Open Source Java Project • Strong index speed and integrity, and search speed • Uses file-based full-text inverted indexing • Extremely fast, with built-in caching • Can easily handle millions of documents • Very active mailing list for support
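To make "inverted indexing" concrete, here is a toy in-memory sketch of the idea: each term maps to the set of documents that contain it, and a multi-word query intersects those sets. It is not how Lucene is implemented (Lucene stores its index in files and adds analysis, scoring, and caching). The documents and the simple tokenizer are assumptions for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    # .get avoids creating empty entries for terms that were never indexed.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Made-up metadata titles.
documents = {
    1: "Bird counts in Tennessee",
    2: "Stream temperature in Tennessee",
    3: "Bird banding records",
}
index = build_index(documents)
print(search(index, "bird tennessee"))
```

Looking up a term is a single dictionary access regardless of how many documents exist, which is the main reason inverted indexes make full-text search fast.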
Who uses Lucene • Wikipedia • MediaWiki • European Bioinformatics Institute • Liferay • Bigsearch.ca • Monster • Academic Archive On-line • Complete list: • http://en.wikipedia.org/wiki/Lucene • http://wiki.apache.org/lucene-java/PoweredBy
SOLR Overview • Open source enterprise search server based on the Lucene Java search library • Apache project, sub-project of Lucene • Advanced Full-Text Search Capabilities • Optimized for High Volume Web Traffic • Standards Based Open Interfaces - XML and HTTP • Solr uses Lucene search library and extends it
SOLR Overview (cont.) • A Real Data Schema, with Numeric Types, Date fields, Dynamic Fields • Dynamic Faceted Browsing and Filtering • Advanced, Configurable Text Analysis • Highly Configurable and User Extensible Caching • External Configuration via XML • Scalability - Efficient Replication to other Solr Search Servers • Administration Interface is available
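Since Solr is queried over plain HTTP, faceting, filtering, and sorting all come down to request parameters. The sketch below builds a query URL using the standard Solr parameters `q`, `rows`, `facet`, `facet.field`, `fq`, and `sort`. The server address and the field names `partner` and `title` are assumptions, not the Clearinghouse's actual configuration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, text, facet_fields=(), filters=(), sort=None, rows=10):
    """Build a Solr /select URL with optional facets, filter queries, and sort."""
    params = [("q", text), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    # Each filter query narrows the result set without affecting scoring.
    params.extend(("fq", filter_query) for filter_query in filters)
    if sort:
        params.append(("sort", sort))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical server and field names.
url = build_select_url(
    "http://localhost:8983/solr",
    "bird counts",
    facet_fields=["partner"],
    filters=['partner:"USGS"'],
    sort="title asc",
)
print(url)
```

In a faceted interface, clicking a facet value typically adds one more `fq` entry to the request, and changing the sort order changes only the `sort` parameter.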
Who uses SOLR • CNET Reviews • shopper.com • AOL Music • Netflix • search.com • The Digital Commonwealth • Mindquarry • Complete list: http://wiki.apache.org/solr/PublicServers
Mercury Instances Demo • NBII Clearinghouse interface: http://mercdev3.ornl.gov/nbii3/ • ORNLDAAC interface: http://daac.ornl.gov/ • LBA Mercury interface: http://mercdev3.ornl.gov/lba3/ • DADDI Mercury interface: http://mercdev3.ornl.gov/daddi3/ • GFIS RSS Portal interface: http://www.gfis.net/gfis/home.faces
Questions, Comments • Mike Frame, 865 576-3605, mike_frame@usgs.gov • Thanks to: • Giri Palanisamy, Systems Architect and Team Leader, Mercury Consortium, palanisamyg@ornl.gov • Vivian Hutchison, NBII Metadata Program Manager, vhutchison@usgs.gov