Learn about the process of building a high-performance gazetteer database, including handling ambiguities, classification, and storage systems for geographic features. Explore database production, conversion scripts, tables used, and exporting techniques.
On building a high-performance gazetteer database
Amittai Axelrod, MetaCarta Inc
Thanks to Keith Baker, Kenneth Baker, Michael Bukatin, András Kornai
Plan of the talk
• Database background
• Relating geographic names and features
• Handling ambiguities and inconsistencies in geographic names
• Classification and storage system for geographic features
Databases
• No DB (faking it with flat files) -- clumsy
• Record-oriented -- still runs the world
• Relational -- making headway
• Object-oriented -- still very academic
• For MetaCarta GazDB, relational approach made most sense (sketched below):
  • Overlapping records (McKinley/Denali)
  • Need for frequent updates of subparts of records
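The McKinley/Denali case is the motivating example: two names, one mountain. A minimal sketch of how the relational split handles it, with hypothetical table and column names (the talk does not give the actual GazDB schema):

```sql
-- Hypothetical sketch: names live in their own table, so several
-- name rows can point at one feature row.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    lat        DOUBLE PRECISION,
    lon        DOUBLE PRECISION
);

CREATE TABLE name (
    name_id    INTEGER PRIMARY KEY,
    feature_id INTEGER REFERENCES feature(feature_id),
    name       TEXT
);

-- Overlapping records: McKinley and Denali are one feature.
INSERT INTO feature VALUES (1, 63.0692, -151.0070);
INSERT INTO name VALUES (1, 1, 'Mount McKinley');
INSERT INTO name VALUES (2, 1, 'Denali');
```

Either name can then be added, corrected, or retired without touching the feature row, which is what makes frequent updates of subparts of records cheap.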
Conversion scripts
• Enforce uniform structure on the data
• Normalize across sources (e.g. lat/lon to decimal degrees, spelling, …) -- see the sketch below
• Configuration required once per source
• Load data into GazDB
• Combination of perl/SQL
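As one concrete instance of the normalization step, a source that delivers coordinates in degrees/minutes/seconds can be folded into decimal degrees during the load. A hedged sketch, assuming a hypothetical staging table `staging_source` with sign columns (+1/-1 for N/S and E/W); the real scripts are perl/SQL and configured per source:

```sql
-- Hypothetical normalization pass: DMS columns from a staging table
-- become signed decimal degrees in the GazDB feature table.
INSERT INTO feature (feature_id, lat, lon)
SELECT src_id,
       lat_sign * (lat_deg + lat_min / 60.0 + lat_sec / 3600.0),
       lon_sign * (lon_deg + lon_min / 60.0 + lon_sec / 3600.0)
FROM staging_source;
```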
Other tables used in GazDB
• Population (see the sketch below)
• Elevation
• Language
• Feature type
• Source/versioning info
• Temporal extent
• Hierarchical information
• Confidence
• Comments
• Change logs (full auditing)
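Keeping each attribute in its own table, keyed on the feature id, is what makes updates of subparts of records cheap: a new population estimate touches one small table and leaves the rest of the record alone. A hypothetical sketch of two such tables (illustrative names and columns, not MetaCarta's actual schema):

```sql
-- Population rows carry their own source, temporal extent, and
-- confidence, so competing or superseded values can coexist and
-- changes can be fully audited.
CREATE TABLE population (
    feature_id INTEGER,     -- the feature this row describes
    population BIGINT,
    source_id  INTEGER,     -- which source asserted the value
    valid_from DATE,        -- temporal extent
    valid_to   DATE,
    confidence REAL
);

CREATE TABLE elevation (
    feature_id  INTEGER,
    elevation_m INTEGER,    -- metres above sea level
    source_id   INTEGER
);
```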
Geographic names
• Internationalization
  • Full Unicode (UTF-8) support
  • Maintain detailed language information (SIL)
• Name resolution (see the sketch below)
  • Canonical form (16-bit)
  • Display form (8-bit)
  • Search form (6-bit)
• Authoritativeness
• Explicitness
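Expanding the earlier name sketch along these lines, each name row might carry the three parallel forms plus its language code and the two flags; column names here are hypothetical:

```sql
-- Hypothetical name table: one name in three forms, from full
-- detail down to a folded key used for matching.
CREATE TABLE name (
    name_id          INTEGER PRIMARY KEY,
    feature_id       INTEGER,   -- the feature this name denotes
    canonical        TEXT,      -- 16-bit canonical form (full Unicode)
    display          TEXT,      -- 8-bit display form
    search           TEXT,      -- 6-bit search form, folded for lookup
    lang             CHAR(3),   -- SIL language code
    is_authoritative BOOLEAN,
    is_explicit      BOOLEAN
);

-- e.g. canonical 'Москва', display 'Moskva', search 'moskva'
```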
Geographic features
• Spatial representations (see the feature-table sketch below)
  • Point, line, area, …
• Functional classes
  • Building, field, campus, city, …
• Administrative types
  • Nation, province, county, international org, …
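The feature side, again hypothetically, would carry the spatial representation together with the two classification axes:

```sql
-- Hypothetical feature table: geometry plus functional and
-- administrative classification.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    geom_type  TEXT,              -- point, line, area, ...
    lat        DOUBLE PRECISION,  -- representative point
    lon        DOUBLE PRECISION,
    func_class TEXT,              -- building, field, campus, city, ...
    admin_type TEXT               -- nation, province, county, ...
);
```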
Export scripts
• Read GazDB
• Select which fields to include in custom output (see the sketch below)
• Create .gbdm (MetaCarta format) binaries
• Combination of perl/SQL
• Not yet general across binary output formats
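The field-selection step might reduce to a join that pulls exactly the configured columns, with a perl driver streaming the rows into the .gbdm writer; the query below is a hypothetical illustration, not the actual export code:

```sql
-- Hypothetical selection for one custom gazetteer: English search
-- forms of cities, with coordinates.
SELECT n.search, f.lat, f.lon
FROM name n
JOIN feature f ON f.feature_id = n.feature_id
WHERE n.lang = 'eng'
  AND f.func_class = 'city';
```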
Conclusions
• Accepts multiple sources (configuration needed only once per source)
• Fast loading of large datasets (1M entries per hour on a Linux desktop)
• Simple update procedure
• Outputs large custom binary gazetteers for different purposes at very high speed (1M entries per minute)