Building a Geospatial Data Dictionary: Enhanced Data Description. NEARC, Fall 2011. Brian Hebert, Solutions Architect, ScribeKey, LLC. www.scribekey.com
Workshop Outline • Review goals and requirements for producing enhanced data description materials • Look at approaches to data description • US Census data as sample • Review ScribeKey shareware tools • Discussion and Q&A
Goals • Make data as easy to understand and use as possible; reduce the learning curve. • Learning about data takes a lot of time and effort, and a given dataset is often only one part of a larger data use and mission. • Make full use of the tools we have. • Apply these ideas to your own use cases. • Whether you are a user, provider, broker, or creator of data, help people use it in the best way.
Lessons Learned • Global FGDC Metadata and data description materials for large volume commercial geospatial data sets, containing 1000s of data layers and tables. • Assess, describe, and standardize a large collection of geospatial datasets and metadata. • Borrow from data warehousing, business intelligence, and library science approaches. Example collections: 200+ countries, 72 layers, 100s of attributes, 100s of domains, quarterly updates; 50+ states, 400 layers, 1000s of attributes, 100s of domains, annual updates.
Background: Industrial Strength Metadata Generation • Sample data is reviewed and profiled. Any existing metadata is imported into the repository. • From the profile, existing user documentation, technical support staff, and website, a metadata repository is populated and metadata document templates are developed. • FGDC/ISO Metadata is generated, as XML/HTML reports, from the metadata repository. Diagram labels: Metadata Repository, Metadata Templates, Metadata Export App; outputs: PDF, DOC, FGDC XML, HTML.
Sample Data: US Census • US Census Data for Saratoga County, NY • Good example • Lots of detail • Has CSDGM metadata • Has its own vocabulary Saratoga County, NY Personal GeoDb
How Do People Learn About Data? Users learn how to use data through a variety of sources: website, metadata, documentation, email, tech support, and the data itself.
Challenges • Documentation: Large volume, time consuming • FGDC Metadata: Sets of separate XML documents, redundancy, cumbersome, different format than data being described, etc. • Website: Lots of great info, somewhat unstructured • Tech Support: Availability, cost • Data Itself: Familiarity takes time • How can we consolidate all of this information in a single place in an easy-to-use format?
Solution: 2 Data Dictionary Formats 1) HTML Pages 2) GIS Metalayers Integrated Data/Metadata Flexible Familiar Simplification Lightweight Flexible Familiar Static or Dynamic
Essentials: It’s All Metadata Meaning Structure Contents Q: What does it mean to be familiar with data? A: Users know where to find something and how to make detailed maps and reports.
Creating FGDC CSDGM Metadata
Identification_Information:
  Citation:
    Citation_Information:
      Originator: John Hancock
      Publication_Date: 2008
      Title: Boston Streets
  Description:
    Abstract: The Boston Streets dataset provides a complete set of single line street segments for the town of Boston, Massachusetts.
    Purpose: The purpose of the Boston Streets dataset is to provide a basic street base map for general purpose use by the town and its people.
  Time_Period_of_Content:
    Time_Period_Information:
      Single_Date/Time:
        Calendar_Date: 2008
    Currentness_Reference: Publication Date
  Status:
    Progress: Complete
    Maintenance_and_Update_Frequency: Quarterly
  Spatial_Domain:
    Bounding_Coordinates:
      West_Bounding_Coordinate: -70.00
      East_Bounding_Coordinate: -69.00
      North_Bounding_Coordinate: 45.00
      South_Bounding_Coordinate: 44.00
  Keywords:
    Theme:
      Theme_Keyword_Thesaurus: None
      Theme_Keyword: Streets
  Access_Constraints: This dataset may be freely accessed by the public.
  Use_Constraints: This dataset may be freely used by the public.
Metadata_Reference_Information:
  Metadata_Date: 20080219
  Metadata_Contact:
    Contact_Information:
      Contact_Person_Primary:
        Contact_Person: Sam Adams
      Contact_Address:
        Address_Type: Mailing
        Address: 100 Beacon Street
        City: Boston
        State_or_Province: MA
        Postal_Code: 02108
      Contact_Voice_Telephone: 508-429-1234
  Metadata_Standard_Name: FGDC Content Standards for Digital Geospatial Metadata
  Metadata_Standard_Version: FGDC-STD-001-1998
Checklist: • CSDGM Core • Only 26 Values • Attribute Definitions • Domain Values and Definitions • Use USGS MP Tool
Geospatial Metadata Issues • There is no real support for non-geometric entities, e.g., tables. For example, the record count element is buried inside a geospatial element, so there is no place to put a record count for a simple table. • There is incomplete representation for domains. Domains can’t be shared. Domains have no name of their own, but exist only as info added to an attribute. Each domain entry can only carry 2 values, so it can’t support 3 related values, e.g., MA, Massachusetts, 25. • Attribute information is optional. Unlike the most basic RDBMS metadata available in any system, there are no elements for attribute data type and length. • There are no elements at the entity level for specifying relationships, through joins, etc. • Metadata at the individual feature record level is not supported. • Describing data layers resulting from combinations of N source datasets is not supported.
Geospatial Metadata Issues (cont.) • Because they are managed using two different physical implementations, geospatial data and metadata get out of sync. • Metadata is available as separate, independent documents. It cannot easily be queried as a set. For example, getting a simple list of features/tables requires a custom XML application. • The FGDC CSDGM XML based standard is complex and difficult to understand for end users and for vendors building tools. Based on XML using variable length records and nesting, it is basically the schema for an object oriented database, not a relational or object relational database. • The new ISO standards are even more confusing and difficult to understand. ISO layer metadata and entity, attribute, and domain metadata are also now separated into two different standards. Current recommendation by FGDC is to continue using CSDGM. • http://www.fgdc.gov/metadata/geospatial-metadata-standards
CSDGM Physical Implementation Guidelines • The FGDC/CSDGM standard clearly states that the standard describes content, and not physical implementation. From the CSDGM Workbook: “The standard specifies information content, but not how to organize this information in a computer system or in a data transfer, or how to transmit, communicate, or present the information to a user. There are several reasons for this approach: There are many means by which metadata could be organized in a computer. These include incorporating data as part of a geographic information system, in a separate data base, and as a text file. Organizations can choose the approach which suits their data management strategy, budget, and other institutional and technical factors.” In spite of these statements, geospatial metadata implementation has not been approached using industrial strength RDBMS data access technology, but rather relies on sets of separate XML files, using an entirely different data access and management paradigm than that used by the data it is describing.
Centralizing Meaning, Structure, and Content: The RDBMS Based Metadata Repository. Diagram: data and metadata sources (Roads, Parcels, and Buildings layers, each with FGDC XML metadata) pass through data description tools into the METADATA REPOSITORY. Data Profiling captures structure and contents from the RDBMS; Metadata Import captures meaning and geospatial information from the XML.
How Does Data Profiling Help? An essential tool for enhanced metadata: shows the end user actual sample values, data types, lengths, formats, percent complete, etc. This valuable contents information is typically not found in geospatial metadata.
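To make the idea of a data profile concrete, here is a minimal Python sketch that computes data type, maximum length, percent complete, and the most common sample values per column. The `profile_columns` function and the sample rows are hypothetical illustrations, and this is not how SkProfile.exe is implemented.

```python
from collections import Counter


def profile_columns(rows):
    """Profile a list of dict records: type, max length, completeness, samples."""
    columns = sorted({name for row in rows for name in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v not in (None, "")]
        types = {type(v).__name__ for v in present}
        if not types:
            data_type = "unknown"
        elif len(types) == 1:
            data_type = next(iter(types))
        else:
            data_type = "mixed"
        profile[col] = {
            "data_type": data_type,
            "max_length": max((len(str(v)) for v in present), default=0),
            "percent_complete": round(100.0 * len(present) / len(values), 1),
            "sample_values": [v for v, _ in Counter(present).most_common(3)],
        }
    return profile


# Illustrative rows only; column names and values are made up for this sketch.
sample_rows = [
    {"FULLNAME": "Main St", "MTFCC": "S1400", "LANES": 2},
    {"FULLNAME": "I- 87", "MTFCC": "S1100", "LANES": 4},
    {"FULLNAME": None, "MTFCC": "S1400", "LANES": 2},
    {"FULLNAME": "Broadway", "MTFCC": "S1400", "LANES": None},
]

column_profile = profile_columns(sample_rows)
for column, stats in column_profile.items():
    print(column, stats)
```

A profile like this gives a baseline description of contents even when no attribute metadata exists yet.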
CSDGM Core into the RDB. Diagram: multiple XML metadata documents are imported into the RDB. When metadata is imported into an RDB, the full power of SQL becomes available for flexible query and management of large volumes of data description information.
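As a rough sketch of what such an import could look like, the Python below parses a small CSDGM XML fragment with the standard library and loads entity and attribute descriptions into SQLite. The table names CsdgmEnt and CsdgmAtt come from the repository described in this deck, but their column layouts, the `import_csdgm` function, and the attribute definitions in the sample XML are assumptions made for illustration. This is not the SkMtd2Db.exe implementation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Small CSDGM fragment using the standard short element names.
# The attribute labels and definitions are illustrative only.
CSDGM_XML = """<metadata>
  <idinfo>
    <citation><citeinfo>
      <origin>John Hancock</origin>
      <pubdate>2008</pubdate>
      <title>Boston Streets</title>
    </citeinfo></citation>
  </idinfo>
  <eainfo><detailed>
    <enttyp>
      <enttypl>Streets</enttypl>
      <enttypd>Single line street segments</enttypd>
    </enttyp>
    <attr><attrlabl>NAME</attrlabl><attrdef>Street name</attrdef></attr>
    <attr><attrlabl>CLASS</attrlabl><attrdef>Street classification</attrdef></attr>
  </detailed></eainfo>
</metadata>"""


def import_csdgm(conn, xml_text):
    """Load entity and attribute descriptions from one CSDGM document."""
    root = ET.fromstring(xml_text)
    title = root.findtext("idinfo/citation/citeinfo/title")
    for detailed in root.iterfind("eainfo/detailed"):
        entity = detailed.findtext("enttyp/enttypl")
        conn.execute(
            "INSERT INTO CsdgmEnt (title, entity, definition) VALUES (?, ?, ?)",
            (title, entity, detailed.findtext("enttyp/enttypd")),
        )
        for attr in detailed.iterfind("attr"):
            conn.execute(
                "INSERT INTO CsdgmAtt (entity, attribute, definition) VALUES (?, ?, ?)",
                (entity, attr.findtext("attrlabl"), attr.findtext("attrdef")),
            )


conn = sqlite3.connect(":memory:")
# Assumed column layouts for this sketch.
conn.execute("CREATE TABLE CsdgmEnt (title TEXT, entity TEXT, definition TEXT)")
conn.execute("CREATE TABLE CsdgmAtt (entity TEXT, attribute TEXT, definition TEXT)")
import_csdgm(conn, CSDGM_XML)

# Once loaded, a plain SQL query lists every documented attribute.
attribute_rows = conn.execute(
    "SELECT entity, attribute, definition FROM CsdgmAtt ORDER BY attribute"
).fetchall()
print(attribute_rows)
```

Calling `import_csdgm` once per XML file would turn a folder of separate documents into one queryable set.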
Tools Demonstration • Data Profiling • Windows Based • Batch Command Line • .NET • .mdb Files • Logging • Metadata Import
Inside the Repository: Tables and Views • PROFILE: • DiTABLES • DiCOLUMNS • DiDOMAINS • DiDomainValues • METADATA INGEST: • CsdgmEnt • CsdgmAtt • CsdgmDomVal • VIEWS: • EntRpt • AttRpt • DomRpt Elements from Profile and Metadata Ingestion can be combined through SQL views. Data structure, contents, and meaning are housed in a table-centric RDBMS repository that is easy to access, query, and share. If you didn’t have CSDGM attribute metadata before, the data profile really helps with providing a baseline.
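One way profile and ingest tables might be combined is sketched below in Python with SQLite. The table and view names (DiCOLUMNS, CsdgmAtt, AttRpt) are taken from the list above, but every column name and every row is an assumption made for this example; the real repository schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layouts; the actual repository columns are not shown in the deck.
CREATE TABLE DiCOLUMNS (
    table_name TEXT, column_name TEXT, data_type TEXT,
    max_length INTEGER, percent_complete REAL
);
CREATE TABLE CsdgmAtt (entity TEXT, attribute TEXT, definition TEXT);

-- Illustrative rows only.
INSERT INTO DiCOLUMNS VALUES
    ('EDGES', 'MTFCC', 'TEXT', 5, 100.0),
    ('EDGES', 'FULLNAME', 'TEXT', 100, 62.5);
INSERT INTO CsdgmAtt VALUES
    ('EDGES', 'MTFCC', 'MAF/TIGER feature class code');

-- LEFT JOIN keeps profiled columns even when no metadata definition exists,
-- so the profile acts as the baseline.
CREATE VIEW AttRpt AS
SELECT c.table_name, c.column_name, c.data_type, c.max_length,
       c.percent_complete,
       COALESCE(a.definition, '(no definition in metadata)') AS definition
FROM DiCOLUMNS AS c
LEFT JOIN CsdgmAtt AS a
       ON a.entity = c.table_name AND a.attribute = c.column_name;
""")

attribute_report = conn.execute(
    "SELECT column_name, data_type, definition FROM AttRpt ORDER BY column_name"
).fetchall()
for row in attribute_report:
    print(row)
```

The same view also works as a gap report: filtering on the placeholder definition lists the attributes that still need a written description.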
Helping with the Data Provider/End User Communication Gap. User Language: “Layer, Table, Attribute, Map, Symbol, Centroid, Join, Report”. Provider Language: “Impute, FROMHN, EDGES, ADDRFN, Internal Point, MTFCC, S1100”. Data providers and users have different languages and understandings of data. Use of keywords, aliases, and definitions in the data dictionary helps bridge this gap and provides a translation.
Schemas and Semantics Layers Attributes Symbols, Towns … UML, XSD GML ISO 19XXX ? The Tower of Babel Data Modelers ISO/OGC Schemas GIS Users What does this mean? Ontologies Abracadabra
Next Steps: Clarification and Completion • We’ve integrated profile and metadata info • Now need to refine this information • Make sure everything is clear • Make sure everything is complete • Library Science to the rescue
Library Science Artifacts • Indexing and Abstracting • The Dictionary Hierarchy • Types and Taxonomies • The Thesaurus • The Glossary With the Metadata Repository loaded, a number of useful data description artifacts can be developed.
Indexing and Abstracting: The Overview Page • The most essential information • Clear concise writing • Links to details • Automated tools are no substitute for subject matter expertise • Limits of FGDC or ISO schemas as template • Data driven
The Data Dictionary Hierarchy: Categories, Entities, Attributes, Domains • Data typically falls into higher level categories • Entities include layers and tables and relationships among them • Attribute data types, lengths, domain contents provide the heart of data detail for query, reporting, and mapping • A streamlined and flexible view of metadata
Feature Types and Taxonomies • Users need to be able to search through metadata and data easily, using feature names they are familiar with. • Domain profiles and metadata are starting points for developing a feature description typology. • Isolated domain information doesn’t always present the entire picture. This HTML page allows users to look up a feature name and find the corresponding layer and attribute SQL query that can be used to filter for it.
The Thesaurus: What’s in a name? US Census MTFCC SDTS Entities
Choosing the Best Names • If you’re developing a new set of names for data categories, entities, attributes, and domain values, use words that your data user audience is familiar with. • Don’t invent new words when existing ones will do. Reuse taxonomies. • “Consistency is the last refuge of the unimaginative” Oscar Wilde • Natural language is often inconsistent, but can still be very clear for end users.
Choosing the Best Names (cont.) The Google Test (search result counts): lon/lat: 201,000,000; lat/lon: 7,870,000
Tool Demonstration: Sql2Html
Glossaries http://textalyser.net/ • Which words and terms need to be described? • Text analysis tools are freely available for helping with this task. • This list was generated from entity definitions. • Can also be used as input to list of keywords for FGDC metadata.
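The kind of word-frequency analysis such tools perform can be approximated in a few lines of Python. This is only a sketch: the `candidate_terms` function, the stop word list, and the sample definitions are all hypothetical, and a real glossary still needs review by a subject matter expert.

```python
import re
from collections import Counter

# Small, hand-picked stop word list for this sketch.
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "of",
              "such", "the", "to", "along"}


def candidate_terms(definitions, min_count=2):
    """Return (word, count) pairs for words that recur across definitions."""
    words = []
    for text in definitions:
        for word in re.findall(r"[a-z][a-z0-9/-]*", text.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                words.append(word)
    # most_common() sorts by descending count.
    return [(w, n) for w, n in Counter(words).most_common() if n >= min_count]


# Illustrative entity definitions, not quoted from any real metadata.
entity_definitions = [
    "Edges are linear features such as roads and hydrography.",
    "Faces are polygons bounded by edges.",
    "Address ranges describe house numbers along edges of roads.",
]

glossary_candidates = candidate_terms(entity_definitions)
print(glossary_candidates)
```

Words that recur across many definitions are good candidates for glossary entries and for FGDC theme keywords.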
Metalayers: Metadata as GIS Data. Tables from the Metadata Repository can be easily accessed in ArcMap, and joined with polygon layers to provide access to fully integrated data/metadata.
Metalayers: Metadata as GIS Data (cont.) Metadata Repository layer/table information, as populated from data profiling and FGDC metadata ingestion, for US Census data, Saratoga County area, against full backdrop of New York towns.
Table-Centric Metadata in ArcGIS • Metadata tables can be added to your ArcMap .mxd files. • If you have multiple sets of heterogeneous data, you can link metadata tables with polygons depicting data coverage areas. • Metadata can now be used like any other geospatial data, as the basis for color shading, symbology, reports, etc. • Metadata can be used to first find data, through a lighter-weight wrapper, then drill through to the actual underlying data.
Are Data Aggregation Results Metadata? • Data aggregation provides a key component of decision support information systems, AKA Business Intelligence (BI). • Provides a smaller, faster, high level summary and simplification of large volumes of data. • Helps decision makers focus in on what’s important. • Created using standard RDBMS SQL aggregation constructs such as SUM, COUNT, and GROUP BY, along with OLAP technology. Diagram: BASE DATA is summarized into AGGREGATE data.
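A small Python and SQLite sketch of these aggregation constructs follows. The `tracts` table, its columns, the town and tract labels, and all population figures are invented for illustration and are not real Census values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical base data: one row per census tract. Numbers are made up.
CREATE TABLE tracts (county TEXT, town TEXT, tract TEXT, population INTEGER);
INSERT INTO tracts VALUES
    ('Saratoga', 'Town A', 'T1', 1200),
    ('Saratoga', 'Town A', 'T2', 800),
    ('Saratoga', 'Town B', 'T3', 500);
""")

# Town level summary: fewer rows, faster to scan than the base data.
town_summary = conn.execute("""
    SELECT town, COUNT(*) AS tract_count, SUM(population) AS total_population
    FROM tracts
    GROUP BY town
    ORDER BY town
""").fetchall()

# Rolling up one more level gives a single county row.
county_summary = conn.execute("""
    SELECT county, COUNT(*) AS tract_count, SUM(population) AS total_population
    FROM tracts
    GROUP BY county
""").fetchall()

print(town_summary)
print(county_summary)
```

Changing only the GROUP BY column moves the same summary between geography levels, which is the basis for rollup and drilldown.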
Metalayers: Aggregation
Metalayer Drilldown and Rollup. Applying a pivot-table-like view, with drilldown and rollup, across hierarchical geography units. Increasingly detailed views: COUNTY, TOWN, CENSUS TRACT.
Meta-Layer Geometry Creation and Management. Lon/Lat Bounding Boxes, e.g., Spatial_Domain: Bounding_Coordinates: West_Bounding_Coordinate: -167.946360; East_Bounding_Coordinate: 179.001991; North_Bounding_Coordinate: 71.298141; South_Bounding_Coordinate: 17.678360. Three basic approaches to generating layer coverage polygons: 1) bounding boxes, 2) convex/concave hulls and tessellations, and 3) existing administrative or other polygons. The choice is based on presentation and data management requirements.
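For the bounding box approach, a coverage polygon can be built directly from the four CSDGM bounding coordinates. The Python sketch below emits well-known text (WKT); the `bbox_to_wkt` function is a hypothetical helper, and it assumes a simple box that does not cross the antimeridian.

```python
def bbox_to_wkt(west, east, north, south):
    """Build a closed WKT polygon ring from lon/lat bounding coordinates.

    Assumes west <= east, i.e., the box does not cross the antimeridian.
    """
    if south > north:
        raise ValueError("south bounding coordinate exceeds north")
    # Counterclockwise ring, repeating the first corner to close it.
    ring = [
        (west, south),
        (east, south),
        (east, north),
        (west, north),
        (west, south),
    ]
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON(({coords}))"


# Bounding coordinates from the Boston Streets sample record.
boston_wkt = bbox_to_wkt(west=-70.0, east=-69.0, north=45.0, south=44.0)
print(boston_wkt)
```

Bounding boxes are the cheapest option because the coordinates are already in the metadata, though they can overstate the true coverage area.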
Convex Hull of Census Edges Layer. Convex hulls are useful for describing arbitrary Metalayer coverage areas when no existing political or administrative boundary polygons are available.
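For readers without a GIS tool at hand, a convex hull can be computed from a layer's vertices with Andrew's monotone chain algorithm. The Python below is a generic sketch of that algorithm, not the method used to produce the slide; the sample points are made up and coordinates are treated as planar.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counterclockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Illustrative vertices: a square with one interior point.
sample_vertices = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(sample_vertices)
print(hull)
```

Repeating the first vertex at the end of the returned list would give a closed ring suitable for a coverage polygon.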
Summary and Take-Aways: 5 Phases • Developed standardized geospatial metadata • Profiled data • Integrated profile results and metadata in an RDBMS repository • Refined information, using library science approach and artifacts • Exported metadata from repository in 2 convenient formats, HTML and geospatial data layers.
Take Away: Lightweight HTML Data Dictionary. Full descriptions of data categories, entities, attributes, and domain values. Information integrated from documentation, data profiles, metadata, and data provider website. Available as standalone HTML or on a web site.
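The general idea of turning repository queries into HTML pages can be sketched with the Python standard library. The `query_to_html` function, the `EntRpt` column layout, and the sample rows are assumptions for illustration; this does not reproduce how SkSql2Html.exe works.

```python
import sqlite3
from html import escape


def query_to_html(conn, sql, caption):
    """Render the result of a SQL query as a simple HTML table."""
    cursor = conn.execute(sql)
    headers = [column[0] for column in cursor.description]
    lines = ["<table>", f"  <caption>{escape(caption)}</caption>"]
    lines.append(
        "  <tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>"
    )
    for row in cursor:
        # Escape every value so text like '&' cannot break the markup.
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append("  <tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
# Assumed layout and made-up rows for this sketch.
conn.execute("CREATE TABLE EntRpt (entity TEXT, definition TEXT)")
conn.executemany(
    "INSERT INTO EntRpt VALUES (?, ?)",
    [("EDGES", "Roads & other linear features"), ("FACES", "Polygon areas")],
)

entity_page = query_to_html(
    conn, "SELECT entity, definition FROM EntRpt ORDER BY entity", "Entities"
)
print(entity_page)
```

Because the page is generated from the repository, rerunning the export after a data update keeps the dictionary in step with the metadata.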
Take Away: Metalayers. Use data profiles and metadata to create GIS layers that allow a variety of map presentations, reports, etc. to summarize and highlight datasets by metadata values.
Take Away: Data Description Checklist • Is there a Data User Guide? A glossary and index? • Are primary data categories and entities fully described? • Are all acronyms, abbreviations, and provider vocabulary terms explained? • Are short, cryptic database field names and values explained? • Are data types, lengths, keys, nulls allowed, formats, and lists clear enough to help the user form SQL queries? • Is FGDC/ISO Metadata available? • Are sample values and data profiles available? • Are data presentations, maps, symbols, and reports prepared for a quick start? • Is all this info in one place? Complete metadata describes Meaning, Structure, and Contents. Maximize understanding of details by the end user to help create queries/reports/maps.
Take Away: Use a Geospatial Metadata Repository. Diagram labels around the METADATA REPOSITORY: Data Dictionary, Data Layers, Enhanced User Views, Metadata, Pivot Tables, Areas, Entities, Derivative Datasets, Documents, Metalayers, Assessments, Attributes, Domains, New Schemas. The Metadata Repository, implemented as an RDBMS, is populated with automated tools and then used to generate metadata outputs, data dictionary content, schemas, maps, etc.
The Future: Structured vs. Unstructured Query. Structured data queries require that a user know the exact entity.attribute=value construct to find data. Unstructured data queries can use underlying metadata tables, like the FeatureFilter, to locate the correct entity.attribute=value construct to find data. Metadata is also generally much smaller in volume than the data it is describing and can be queried very quickly.
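A lookup of this kind might work roughly as follows. In this Python sketch the FeatureFilter table name comes from the slide, but its columns, its rows, and the `find_filters` function are assumptions; the generated SQL is built from values held in the trusted repository table, not from user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout: a familiar feature name mapped to entity.attribute=value.
CREATE TABLE FeatureFilter (
    feature_name TEXT, entity TEXT, attribute TEXT, value TEXT
);
INSERT INTO FeatureFilter VALUES
    ('Primary Road', 'EDGES', 'MTFCC', 'S1100'),
    ('Secondary Road', 'EDGES', 'MTFCC', 'S1200');
""")


def find_filters(conn, search_text):
    """Map a free-text feature name to structured query strings."""
    rows = conn.execute(
        "SELECT feature_name, entity, attribute, value "
        "FROM FeatureFilter WHERE feature_name LIKE ? ORDER BY feature_name",
        (f"%{search_text}%",),  # the user's text is passed as a parameter
    ).fetchall()
    # Entity, attribute, and value come from the repository table itself.
    return [
        (name, f"SELECT * FROM {entity} WHERE {attribute} = '{value}'")
        for name, entity, attribute, value in rows
    ]


primary_matches = find_filters(conn, "primary")
road_matches = find_filters(conn, "road")
print(primary_matches)
print(road_matches)
```

The user types a familiar word and gets back the exact filter, so the provider vocabulary never has to be memorized.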
ScribeKey Shareware Tools • Data Profiler: SkProfile.exe • Metadata Importer: SkMtd2Db.exe • SQL To HTML Generator: SkSql2Html.exe • MS Access Metadata Repository • Look at ReadMe.txt files • Work with Personal Geodatabases • Requires .NET runtime
Thank you. Q&A. Brian Hebert, Solutions Architect, ScribeKey, LLC. www.scribekey.com