On building a high performance gazetteer database
Amittai Axelrod, MetaCarta Inc.
Thanks to • Keith Baker • Kenneth Baker • Michael Bukatin • András Kornai
Plan of the talk • Database background • Relating geographic names and features • Handling ambiguities and inconsistencies in geographic names • Classification and storage system for geographic features
Databases • No DB (faking it with flat files) -- clumsy • Record-oriented -- still runs the world • Relational -- making headway • Object-oriented -- still very academic • For the MetaCarta GazDB, the relational approach made the most sense: • Overlapping records (McKinley/Denali; see the sketch below) • Need for frequent updates of subparts of records
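To make those last two bullets concrete, here is a minimal relational sketch (hypothetical table and column names, not the actual GazDB schema): names live in their own table and reference a feature row, so two names such as McKinley and Denali can overlap on one feature, and a subpart such as a coordinate can be updated without rewriting the whole record.

```sql
-- Hypothetical sketch of the one-feature, many-names idea.
CREATE TABLE feature (
    feature_id  SERIAL PRIMARY KEY,
    lat         DOUBLE PRECISION NOT NULL,  -- decimal degrees
    lon         DOUBLE PRECISION NOT NULL
);

CREATE TABLE name (
    name_id     SERIAL PRIMARY KEY,
    feature_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    name_text   TEXT NOT NULL
);

-- 'Mount McKinley' and 'Denali' overlap on the same feature row:
INSERT INTO feature (lat, lon) VALUES (63.0695, -151.0074);
INSERT INTO name (feature_id, name_text)
VALUES (1, 'Mount McKinley'), (1, 'Denali');  -- id 1 assumes a fresh table
```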
Conversion scripts • Enforce a uniform structure on the data • Normalize across sources (e.g. lat/lon to decimal degrees, spelling, …) • Configuration required only once per source • Load the data into the GazDB • Combination of Perl and SQL
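One concrete normalization step might look like the following (a sketch; the real scripts are Perl/SQL and configured per source): converting a source's degrees/minutes/seconds fields into the decimal degrees the GazDB stores. The staging_raw table and its columns are hypothetical stand-ins for one source's layout.

```sql
-- Sketch: normalize DMS coordinates from a per-source staging table
-- into decimal degrees before loading them into the feature table.
INSERT INTO feature (lat, lon)
SELECT (lat_deg + lat_min / 60.0 + lat_sec / 3600.0)
           * CASE lat_hem WHEN 'S' THEN -1 ELSE 1 END,
       (lon_deg + lon_min / 60.0 + lon_sec / 3600.0)
           * CASE lon_hem WHEN 'W' THEN -1 ELSE 1 END
FROM staging_raw;
```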
Other tables used in GazDB • Population • Elevation • Language • Feature type • Source/versioning info • Temporal extent • Hierarchical information • Confidence • Comments • Change logs (full auditing)
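Keeping each of these as its own relation keyed on the feature id is what makes the frequent partial updates mentioned earlier cheap. A hypothetical example for population, carrying the source/versioning, temporal-extent, and confidence fields from the list above:

```sql
-- Hypothetical auxiliary table: population figures per feature, with
-- source attribution, temporal extent, and a confidence score.
CREATE TABLE population (
    feature_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    population  BIGINT NOT NULL,
    source      TEXT NOT NULL,  -- which input gazetteer supplied the figure
    valid_from  DATE,           -- temporal extent of the figure
    valid_to    DATE,
    confidence  REAL            -- per-record confidence
);
```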
Geographic names • Internationalization • Full Unicode (UTF-8) support • Maintain detailed language information (SIL codes) • Name resolution • Canonical form (16-bit) • Display form (8-bit) • Search form (6-bit) • Authoritativeness • Explicitness
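The three name forms can sit side by side in one row, as in this sketch (hypothetical columns; the search form stands for a case- and accent-folded key that fits in a 6-bit character set):

```sql
-- Hypothetical name-forms table: canonical Unicode, an 8-bit display
-- form, a folded search key, and the language of the name.
CREATE TABLE name_form (
    name_id     INTEGER PRIMARY KEY REFERENCES name(name_id),
    canonical   TEXT NOT NULL,  -- full Unicode form, e.g. 'Zürich'
    display     TEXT NOT NULL,  -- 8-bit display form, e.g. 'Zurich'
    search_key  TEXT NOT NULL,  -- folded 6-bit search form, e.g. 'zurich'
    lang        CHAR(3)         -- SIL/ISO 639-3 code, e.g. 'deu'
);
```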
Geographic features • Spatial representations • Point, line, area, … • Functional classes • Building, field, campus, city, … • Administrative types • Nation, province, county, international org, …
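Since the spatial representation, functional class, and administrative type are independent axes, a natural modeling (again a hypothetical sketch, extending the feature table from earlier) is one column, or lookup table, per axis:

```sql
-- Hypothetical classification columns: the three axes are orthogonal,
-- so a city (functional class) can be a point or an area (spatial kind)
-- and need not have any administrative type at all.
ALTER TABLE feature
    ADD COLUMN spatial_kind TEXT CHECK (spatial_kind IN ('point', 'line', 'area')),
    ADD COLUMN func_class   TEXT,  -- e.g. 'building', 'field', 'campus', 'city'
    ADD COLUMN admin_type   TEXT;  -- e.g. 'nation', 'province', 'county'
```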
Export scripts • Read the GazDB • Select which fields to include in the custom output • Create .gbdm (MetaCarta format) binaries • Combination of Perl and SQL • Not yet general across binary output formats
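The field-selection step might reduce to a join like the one below (a sketch over the hypothetical tables used above; the .gbdm serialization itself is the MetaCarta-specific part, with a Perl driver streaming the result rows into the binary writer):

```sql
-- Sketch: select the fields a particular gazetteer build needs,
-- one row per (feature, name) pair, ready for binary serialization.
SELECT f.feature_id, n.name_text, f.lat, f.lon, f.func_class
FROM feature f
JOIN name n ON n.feature_id = f.feature_id
ORDER BY f.feature_id;
```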
Conclusions • Accepts multiple sources (configure only once per source) • Fast loading of large datasets (1M entries per hour on a Linux desktop) • Simple update procedure • Outputs large custom binary gazetteers for different purposes at extreme speed (1M entries per minute)
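Load rates like these suggest bulk loading rather than row-at-a-time inserts; assuming a PostgreSQL backend (the talk does not name the engine) and the hypothetical staging table from earlier, that step could be a single COPY:

```sql
-- Sketch: bulk-load one normalized source file in a single pass;
-- the file path is illustrative only.
COPY staging_raw FROM '/data/source1_normalized.csv' WITH (FORMAT csv, HEADER true);
```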