Digitization of toponymic data at the Institute of the Estonian Language

Peeter Päll, 9th Baltic Division meeting, Jūrmala, October 2005

Overview
• Existing collections in Estonia
• Two types of collections (systematic databases; archived materials)
• Digitization options
• ETOP, the digital archive
• Data structure, encoding problems
• Named features identification
• Perspectives

History of names collections
• Various initiatives in the 19th century (1888 M. J. Eisen, 1895 J. Jung, 1901 F. Kuhlbars)
• Systematic collection since the 1920s by the Mother Tongue Society
• Collections by scholarship holders, field expeditions by researchers and students (1930s–1990s)

Current collections
• Place Names Archive at the Institute of the Estonian Language (EKI): over 500,000 cards (also integrates the collections of the Mother Tongue Society)
• Endel Varep's archive at EKI (ca. 95,000 A5 pages)
• M. J. Eisen's collections at the National Museum (dispersed; copy in Helsinki in 3 volumes)
• Collections at various museums and institutes
• Main Estonian-Swedish collections at Uppsala

Place Names Archive at EKI
• Collections since the 1920s, arranged by parish; general alphabetic index
• Data on cards:
  • name in standard spelling and pronunciation
  • locative case forms (external or internal)
  • references to other connected names
  • short description of the named feature (maps usually not included)
  • background information, examples of name usage
  • village, parish, informant, collector

Place name card (1930’s)

Alphabetic index card

Place name card (1960’s)

Place name “maps”

Problems
• Parish collections are uneven in the number and quality of cards
• Map references are usually not given
• Using the collection for research is very time-consuming

Digital resources
• Place Names Database at EKI (KNAB): 90,000 records (244,000 names)
• Toponymic Database for historic Võrumaa by the Institute of Võru: ca. 10,000 records
• National Place Names Register: test version only
• Databases of private publishers (e.g. the Regio road map index contains ca. 11,000 names)

KNAB (www.eki.ee/knab/knab.htm)

Võrumaa (www.ekk.ee/avka/)

Two types of collections
• 1. Archives
  • unsystematic
  • uneven
  • excessive data
• 2. Systematic databases
  • processed data, based on various sources
  • standard quality criteria
  • structured data, even coverage

Digitization of names archives
• Aims:
  • to protect against destruction of data
  • to give better access to data
  • to enable further processing
• Models:
  • Finland (partial type-in of card information)
  • Sweden (scanning + digital headwords)
  • Norway (type-in and scanning)

ETOP, the digital archive of EKI
• Data format: SGML text
• Character format: Unicode-enabled
• Uniformity with KNAB, the systematic place names database
• Recognition of international standards on digital gazetteers (Alexandria Digital Library, ADL Gazetteer Protocol)
• Availability through the Internet (both for entering and accessing the data)

Data fields
• <nimi>name [label] {comment}</nimi>
• <var>name variants [] {}; …</var> (includes pronunciation in standardized transcription + the original notation)
• <koh>use of locative forms</koh>
• <vrd>references to other names</vrd>
• <sel>description of the feature</sel>
• <lkood>feature type</lkood> (new information)
• <txt>background, examples, text</txt>
• <all>name source (informant)</all>
• <khk>parish</khk>
• <lähik>village, collection point</lähik>
• <aut>collecting person, owner of collection, typist</aut>
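
As an illustration of how records tagged with the fields above might be read, here is a minimal Python sketch. It is not the ETOP software: the record layout (one tagged field per line), the helper names parse_record, split_name and split_variants, and the sample record itself are all assumptions invented for this example. Only the tag names and the "name [label] {comment}" and ";"-separated variant conventions come from the slide.

```python
import re

# Tag names as listed on the slide.
FIELD_TAGS = {"nimi", "var", "koh", "vrd", "sel", "lkood",
              "txt", "all", "khk", "lähik", "aut"}

# Matches <tag>value</tag>; the closing tag must repeat the opening one.
FIELD_RE = re.compile(r"<(?P<tag>[^\s<>/]+)>(?P<value>.*?)</(?P=tag)>", re.DOTALL)

# Matches "name [label] {comment}", where label and comment are optional.
NAME_RE = re.compile(
    r"^(?P<name>[^\[\{]*)(?:\[(?P<label>[^\]]*)\])?\s*(?:\{(?P<comment>[^\}]*)\})?\s*$"
)


def parse_record(text):
    """Collect known fields into a dict of lists (a field may repeat)."""
    record = {}
    for match in FIELD_RE.finditer(text):
        tag = match.group("tag")
        if tag in FIELD_TAGS:
            record.setdefault(tag, []).append(match.group("value").strip())
    return record


def split_name(value):
    """Split 'name [label] {comment}' into its three parts."""
    match = NAME_RE.match(value)
    if match is None:
        return {"name": value.strip(), "label": None, "comment": None}
    return {
        "name": match.group("name").strip(),
        "label": match.group("label"),
        "comment": match.group("comment"),
    }


def split_variants(value):
    """Split the ';'-separated variant list of a <var> field."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Invented sample record (hypothetical names, not taken from the archive).
sample = """<nimi>Näidisjõgi [jõgi] {invented example}</nimi>
<var>Näidisjogi; Näidise jõgi</var>
<khk>Näidise</khk>
<lähik>Näidisküla</lähik>"""

record = parse_record(sample)
print(split_name(record["nimi"][0]))
print(split_variants(record["var"][0]))
```

Keeping every field value as an unmodified string until it is explicitly split reflects the principle of preserving the original content; any interpretation happens in separate helper functions.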

Principles of digitization
• inclusion of all the content of the original
• maximum preservation of original structure
• all additions and corrections are clearly marked
• no interpretations, but comments are allowed

Phases of digitization
• All new cards are processed first; these are no longer copied for the alphabetic index file
• Other materials (not in card format) obtained from other sources, where they can be integrated
• New contributions may be submitted digitally via the web
• Entry of the main collections starts with sample parishes
• After the main collection is entered, it is compared with the alphabetic index (which may subsequently be discarded)

Perspectives
• Estimated duration of digitization: 10–20 years
• Possible additions to data: geocoding
• Possible further processing: linking records of identical features
• Research continues…
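
The talk does not say how records of identical features would be linked. Purely as a sketch of one possible first pass, the Python below groups records whose name and parish match after Unicode normalization and case folding. The function names link_key and link_candidates, the flat record dictionaries, and the sample data are hypothetical; real linking would also need variants, feature types, and manual review.

```python
import unicodedata
from collections import defaultdict


def link_key(record):
    """Build a comparison key from the name (nimi) and parish (khk) fields."""
    def norm(text):
        # NFC keeps letters such as õ and ä in one consistent encoded form.
        return unicodedata.normalize("NFC", text).casefold().strip()
    return (norm(record["nimi"]), norm(record["khk"]))


def link_candidates(records):
    """Return lists of record indexes that share the same key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[link_key(record)].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Invented sample data (hypothetical names).
records = [
    {"nimi": "Näidisjõgi", "khk": "Näidise"},
    {"nimi": "näidisjõgi ", "khk": "Näidise"},
    {"nimi": "Teine", "khk": "Näidise"},
]
candidates = link_candidates(records)
print(candidates)
```

Such groups are only candidates: two cards with the same name in the same parish may still describe different features.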
