230 likes | 358 Views
The cadastral register of the town Wismar (1677 – 1838) Structure Mining in historical documents. Manja Nelius Meike Klettke Department of Computer Science Database Research Group University of Rostock. Overview. Available documents Cadastral register Additional information
E N D
The cadastral register of the town Wismar (1677 – 1838)Structure Mining in historical documents Manja Nelius Meike Klettke Department of Computer Science Database Research Group University of Rostock
Overview • Available documents • Cadastral register • Additional information • Index of persons • List of abbreviations • Algorithms • Overview of the approach • Analysis of (implicit) structure, layout and of the regular parts • Association of markup in the historical texts • Rule-based semantic analysis • Structuring of information • Results • Conclusion & Literature Manja Nelius, Meike Klettke University of Rostock
Motivation • In the Department of History at the University of Rostock (Prof. Papay) • Historical geographical information systems are developed, underlying databases • Cadastral registers contain lots of information for these geographical information systems (in texts) • Characteristics of historical texts: • No unique orthography • Varying spelling (temporal and regional), also the spelling of proper names varies • Influences from different languages • Usage of Latin words (for instances saint feasts (Heiligenfeiertagen) for dates) • Varying use of capitalisation • Inserting data in the databases is till now a manual process • With a diploma thesis: trying to automate this process Manja Nelius, Meike Klettke University of Rostock
Available documents • Cadastral register of the town Wismar (1677 – 1838) • Contains information about the history of the town • Reflects changes of the building development • Shows ownership structure • Similar Cadastral register are available for the towns Rostock and Stralsund • Different categories and different structures in the registers • Then-mayor (burgomaster) fixed the structure that‘s why it is different in each town • 2002 all Cadastral registers had been (manually) digitalised - aim: preserving of sources • That means no changes on the original texts • but addition of an index for persons and a list of usual abbreviations Manja Nelius, Meike Klettke University of Rostock
Categories Street Alt Wismarsche Strasse vom thor her. Norderseite 1 Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196 2 siehe Grundbuch Nr. 15. 3 Haus 4 jus protimiseos an dem dahinten belegenen Garten. Medard. 1673. 5 ut No. praeced. siehe Grundbuch Nr. 15. Peter Koppe. adjud. et aedif. Tr. Regum 1657. Georg Gammelkern. empt. Medard. 1673. […] 6 400 Rthlr. Joh. Christoff Müller. Clem. 1677. delirt Mattaei 1688. 600 m. l. die Stadt Cämmerey t. Joh. mit 5 procent zu vertzinsen. […] Contig. Oldeböterstr. Ost Grundbuch Nr. 310. Neighbour objects
Information about a person Different alternative spellings of a family rname References onto another spelling of a family name Family name with additional information about a person (first name) and the cadastral number (where the person is mentioned) List of persons Gagzow, David 1207 Gahde siehe Gade Gahrtz, Gehrtz -, Agneta Cristiane Hedwig (Witwe, geborene Haase) 22, 29, 263 Manja Nelius, Meike Klettke University of Rostock
Abbreviations of phrases Alternative abbreviations Explanations List of abbreviations infr.: infra (unten) Inn., Innoc., Innocent. Puer. : (dies) Innocentum Puerorum (28. Dezember) intab., intabul.: intabulatus (intabuliert, in das Grundbuch eingetragen) inter., interim. : interimistisch Manja Nelius, Meike Klettke University of Rostock
Overview • Combining analysis of layout, structure and texts • Stepwise enrichment of the original documents with markup (markup contains the meaning of the text fractions) generation of XML documents (bottom up) • Storage in databases Manja Nelius, Meike Klettke University of Rostock
DOC RTF RTF HTML TXT TXT XML XML XML List of abbreviations List of persons Cadastral register Transformation of the input format Mapping rules Analysis of layout and structure Rules for replacement Normalisation Exploitation of regular expressions Grammars Analysis of full texts Dictionaries Semantic Analysis Semantic rules Structuring of information Rules for structuring Storage in database tables
Analysis of layout and structure • Usage of layout characteristics • Bold and italic fonts (HTML-Markup) • Usage of X-Fetch Wrappers • Analysis of structural characteristics • Numbers at the beginning of rows determine the categories • cadastral item is divided into the different categories • Implementation uses the parser generator ANTLR adding of XML-Markup into the cadastral items Manja Nelius, Meike Klettke University of Rostock
Exploitation of regular expressions • Some parts of the input documents are regular (list of abbreviations, list of persons, category 1 of cadastral items) • Example: Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196 • A Grammar describes these expressions adding of further XML-Markup into the cadastral items Manja Nelius, Meike Klettke University of Rostock
DOC RTF RTF HTML TXT TXT XML XML XML List of abbreviations List of persons Cadastral register Transformation of the input format Mapping rules Analysis of layout and structure Rules for replacement Normalisation Exploitation of regular expressions Grammars Analysis of full texts Dictionaries Semantic Analysis Semantic rules Structuring of information Rules for structuring Storage in database tables Manja Nelius, Meike Klettke University of Rostock
Analysis of the historical full texts /1 • Usage of 23 different dictionaries • Some of them generated from the list of persons (first names, family names) • Some created by hand (streets, saint feasts, professions, stop words) • Same method that was used in the GETESS project, developed from the DFKI Saarbrücken, AG Prof. Uszkoreit • Determining the similarity between terms in the cadastral register and in the dictionaries • with phonetic encoding and • phonetic similarity search Manja Nelius, Meike Klettke University of Rostock
Analysis of the historical full texts /2 • Phonetic encoding norms all spellings that sound similar • Example: Friedrich VRYEDRYCH • Substitution rules for that had been defined (because that were not available for historical texts) Cadastral item dictionaries Term 2 Term 1 phonetic encoding phonetic encoding Term 2‘ Term 1‘ Word distance with Levenshtein distance Manja Nelius, Meike Klettke University of Rostock
DOC RTF RTF HTML TXT TXT XML XML XML List of abbreviations List of persons Cadastral register Transformation of the input format Mapping rules Analysis of layout and structure Rules for replacement Normalisation Exploitation of regular expressions Grammars Analysis of full texts Dictionaries Semantic Analysis Semantic rules Structuring of information Rules for structuring Storage in database tables Manja Nelius, Meike Klettke University of Rostock
semantic analysis with rules • Association of terms to dictionaries: • unique • ambiguous (two or more meanings) semantic • no association found rules • Different semantic rules (with priorities) are applied • In the rules information about the context are used (predecessor –successor) • Example: • (a rule in informal description) • token: number, has no associated meaning • predecessor token: date meaning year is associated to the token <Feiertag value="Jacobi" pos="6“/> <Jahr value="1693" pos="7"/> Manja Nelius, Meike Klettke University of Rostock
DOC RTF RTF HTML TXT TXT XML XML XML List of abbreviations List of persons Cadastral register Transformation of the input format Mapping rules Analysis of layout and structure Rules for replacement Normalisation Exploitation of regular expressions Grammars Analysis of full texts Dictionaries Semantic Analysis Semantic rules Structuring of information Rules for structuring Storage in database tables Manja Nelius, Meike Klettke University of Rostock
Structuring of information • Up to now: meanings are associated to terms of the cadastral register • Now: bottom up structuring of information • Example: <Eigentuemer> <Person> <Vorname value=„Peter" pos="1"/> <Nachname value=„Koppe" pos="2"/> </Person> <Erwerbsart> …</Erwerbsart> … </Eigentuemer> • Process base on rules Manja Nelius, Meike Klettke University of Rostock
DOC RTF RTF HTML TXT TXT XML XML XML List of abbreviations List of persons Cadastral register Transformation of the input format Mapping rules Analysis of layout and structure Rules for replacement Normalisation Exploitation of regular expressions Grammars Analysis of full texts Dictionaries Semantic Analysis Semantic rules Structuring of information Rules for structuring Storage in database tables Manja Nelius, Meike Klettke University of Rostock
Result of this process ... <Eigentuemer> <Person> <Vorname value="Georg" pos="1"/> <Nachname value="Gammelkern" pos="2"/> </Person> <Erwerbsart value="emptio" pos="3"/> <unbekannt value="Medard" pos="4"/> <Zahl value="1673" pos="5"/> </Eigentuemer> <Eigentuemer> <Person> <Vorname value="Johan" pos="1"/> <Nachname value="Faber" pos="2"/> </Person> <Erwerbsart value="emptio" pos="3"/> <Zeitangabe> <Wochentag value="Veneris" pos="4"/> <Feiertag value="Jacobi" pos="6"> <Zeitbezug value="ante" pos="5"/> </Feiertag> <Jahr value="1693" pos="7"/> </Zeitangabe> </Eigentuemer> ... <Eintrag> <Grundbuchnummer> <GbNr value="16"/> <SchNr value="16"/> <FolNr value="15v"/> <SbNr value="196"/> ... </Grundbuchnummer> ... <GegQualitaet> <Bauwerk value="Haus" pos="1"/> </GegQualitaet> ... <Eigentuemer> <Person> <Vorname value="Peter" pos="1"/> <Nachname value="Koppe" pos="2"/> </Person> <Erwerbsart value="adjudicatio" pos="3"/> <Stoppwort value="et" pos="4"/> <Erwerbsart value="aedificatio" pos="5"/> <unbekannt value="Tr" pos="6"/> <unbekannt value="Regum" pos="7"/> <Zahl value="1657" pos="8"/> </Eigentuemer> Manja Nelius, Meike Klettke University of Rostock
Instead of a benchmark 80 70 60 50 after full text 40 analysis 30 20 after 10 application of 0 semantic rules tokens tokens token with without with two unique markup or more markup meanings • Semantic rules assign numbers a meaning (most number are four-digit numbers: year or cadastral numbers) • Solves ambiguous meanings (first name vs. date, family name vs. profession) Manja Nelius, Meike Klettke University of Rostock
Conclusion • Approach bases on • extensible rules and • dictionaries flexible tool • Extraction of information is realised in a stepwise process • Check mechanism supports error detection • An evaluation of the results is difficult because we cannot compare the results of the process with correct results • All methods that are represented here had been developed in the diploma thesis from Manja Nelius • Future work: • Semantic analysis and information structuring in one step: • Matching between XML documents with incomplete markup and a schema Manja Nelius, Meike Klettke University of Rostock
Literatur • Ernst Münch: Das Wismarer Grundbuch ( 1677/80 - 1838 ), Verlag Schmidt-Römhild, Rostock, 2002 • Hans-Jürgen Martin, Geschichtlicher Abriß der Rechtschreibung, http://www.schriftdeutsch.de/orth-his.htm, 2004 • Justin Zobel and Philip W. Dart: Phonetic string matching: lessons from information retrieval, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval table of contents, 1996 • Justin Zobel, Philip W. Dart, Finding Approximate Matches in Large Lexicons, Software --- Practice and Experience, Volume 25, Number 3, 1995 • Norbert Fuhr: Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung, http://www.is.informatik.uni-duisburg.de/projects/rsnsr/, 2005 • M. Abolhassani and N. Fuhr and N. Gövert, Information Extraction and Automatic Markup for XML documents, in Intelligent Search on XML Data. Applications, Languages, Models, Implementations, and Benchmarks, ed. Henk M. Blanken and Torsten Grabs and Hans-Jörg Schek and Ralf Schenkel and Gerhard Weikum, Lecture Notes in Computer Science, Vol 2818, 2003 • Arnaud Sahuguet, Fabien Azavant: Building Light-Weight Wrappers for Legacy Web Data-Sources using W4F, 25th Conference on Very Large Database Systems, Edingurgh, UK, 1999 • Terence Parr: ANTLR -- ANother Tool for Language Recognition, http://www.antlr.org, 2004 • X-Fetch Suite, Republica Corporation, www.x-fetch.com • Kai-Uwe Sattler, Stefan Conrad, Gunter Saake, Datenintegration und Mediatoren, In Web & Datenbanken, d.punkt Verlag, Heidelberg, 2003 Manja Nelius, Meike Klettke University of Rostock