Discover practical insights and guidelines for interpreting relationship data, including the use of knowledge tables, segmentation, storage, attributes, statistics, and rules. Learn how to apply country-specific rules, manage databases effectively, and handle diverse types of relationship data with accuracy.
Interpretation and fault-tolerant identification of relationship data
Holger Wandt
Colloquium Taal en Spraak
KU Nijmegen
Wednesday 3 March 2004

Overview
• The use of knowledge tables
• Relationship data: segmentation, storage
• Attributes
• Statistics
• Rules
• A closer look
• How do we use the knowledge and the rules in interpretation?
• The Rolodex-demo

Fysiotherapeutisch Centrum
Arie en Jolanda Kruizenga
Intake Unit 1

Siemens ElectroCom GmbH & Co.
Postdienstautomatisierung und Technologieentwicklung

DE POST
c/o mevrouw A. Vanderwalle-Van Damme
Industrieel Ingenieur Logistiek

Eerste Roelofarendveense Papierfabriek Anno 1931 NV
h.o.d.n. “Papier Hier”

NATIONALE SOCIALE VERZEKINGSKAS VOOR MIDDENSTAND EN BEROEPEN SUKKURSALE BRUGGE V.Z.W. / A.S.B.L.
Let’s summarize…
• Surnames
• Given names
• Forms of address
• Titles
• Prefixes/infixes and prepositions/articles
• Additions
• Professions
• Geographical items
• Legal forms
• Company words
• Divisions
• Company names
• Ordinals

Relationship data
• LCR manages and maintains 3 knowledge databases for each country:
  • 1stbase
  • Fambase
  • DicMan
• LCR manages and maintains country specific synonym tables

Storage of relationship data
• Segmentation (define groups of data)
• Attributes of groups
• Attributes of particular items
• Link between items (abbreviation, plural, etc.)

STATISTICS
                                             BE       DE       NL
Surnames                                 337410  1006097   277312
Given names                               20618    22425    25569
Forms of address (FoA)                      269      131      136
Titles                                      284     1739      279
Prefix/Infix & articles/prepositions        654      664      498
Additions                                   324      192      143
Professions                                 968     2792      355
Geogr. items                              12416    32248    18611
Legal forms                                 236     1835      138
Company words                             20467     8121     5920
Divisions                                   172      160       90
Company names                              1967     1504      684
Ordinals                                    421      293       71

General and country specific rules
• Capitalization
• Punctuation
• Word break
• Abbreviation

Capitalization
• Belgium:
  • Flemish: Karin Van der Ploeg
  • Walloon: Henri de La Censerie
• Germany:
  • E.v. Buskirk KG
  • Verband der Chemischen Industrie e.V.
• Netherlands:
  • Puffelen r.a., Victor van
  • Puffelen RA, de heer Van

Punctuation
• Mr Theodor St.John
• mr. Olaf Oudendijk
• Martin Klaus Lehmann
• Martin, Klaus & Lehmann
• HA.DI.WE. Inh: Hans-Dieter Weber
• Don Quichotte N.V./S.A.
• Don Quichotte NV/SA

Epitaph
Here lies my beloved wife Christine
In heaven she is not in hell I know
It’s written for everyone to be seen

Word break
J.P.L. den He-
yer
Groepsex-
cursies
General and country specific rules:
• In NL: ma-chi-nes
• In GB: ma-chines
NEVER: mac-hines

Abbreviation
General rule for BE, DE and NL: no word may be abbreviated further than its first Vowel-Consonant (VC) group or its first Consonant-Vowel-Consonant (CVC) group.
• Abbreviation – abbrev. – abbr.
• Consonant – conson. – cons.
There are country specific abbreviations: Ges.m. beschränkt. Haft. / Handelsmij. / Stnrs. / R.P. and RR.PP.
But beware of the Hotel Association Française

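One possible reading of the VC/CVC rule, sketched in Python: the shortest admissible abbreviation runs up to and including the first consonant cluster that follows the first vowel cluster, which reproduces "abbr." and "cons." from the examples. The function names, the restriction to plain ASCII letters and the treatment of "y" as a consonant are assumptions made for this illustration only.

```python
import re

# Optional leading consonants, first vowel run, first following consonant run.
# Assumption: plain ASCII letters, with "y" counted as a consonant.
_MIN_ABBREV = re.compile(r"[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]+")


def shortest_abbreviation(word: str) -> str:
    """Return the shortest prefix allowed under this reading of the rule."""
    match = _MIN_ABBREV.match(word.lower())
    # Words without a vowel-consonant group are returned unshortened.
    return match.group(0) if match else word.lower()


def is_valid_abbreviation(abbrev: str, word: str) -> bool:
    """Check that abbrev is a prefix of word and not shorter than the minimum."""
    stem = abbrev.rstrip(".").lower()
    minimum = shortest_abbreviation(word)
    return word.lower().startswith(stem) and len(stem) >= len(minimum)
```

Under these assumptions, `is_valid_abbreviation("abbrev.", "Abbreviation")` holds, while "abb." and "co." are rejected as too short. Country specific abbreviations such as "Handelsmij." do not follow the prefix pattern and would still need their own table entries.
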
A closer look: Family names
• Prefixes
• Names consisting of several parts
• Names with a foreign language attribute
• Diacritic symbols

Prefixes
• In NL, the prefix must be separated from the family name for sorting purposes
• In the Human Inference databases:
  • 22,000 family names with prefix in BE
  • 15,000 family names with prefix in DE
  • 30,000 family names with prefix in NL
• Validation of names: Le Galloudec, but not Galloudec

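The separation of prefix and family name for sorting can be sketched as follows. The prefix list is a tiny hypothetical sample, not the contents of the actual knowledge tables, and the helper names are invented for this example.

```python
# Hypothetical sample of prefixes; a real table would be country specific
# and far larger.
PREFIXES = ("van der", "van den", "van de", "van", "den", "de", "ter", "te")


def split_prefix(family_name: str) -> tuple[str, str]:
    """Split 'van der Ploeg' into ('van der', 'Ploeg'); the longest prefix wins."""
    lowered = family_name.lower()
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if lowered.startswith(prefix + " "):
            return family_name[: len(prefix)], family_name[len(prefix) + 1 :]
    return "", family_name


def sort_key(family_name: str) -> str:
    """Build a key that sorts on the main name first and the prefix second."""
    prefix, main = split_prefix(family_name)
    return f"{main}, {prefix}".lower() if prefix else main.lower()


names = ["van Puffelen", "de Boer", "Jansen", "van der Ploeg"]
ordered = sorted(names, key=sort_key)
```

With this key, "de Boer" sorts under B and "van Puffelen" under P. Names that are only valid together with their prefix, such as Le Galloudec, would need an attribute in the table that blocks the split; the sketch does not model that.
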
Names consisting of several parts
• Double-barrelled names with and without hyphen:
  Adelheid de Boer-van Buiten
  Dirk Segaert vanden Bussche
• Double-barrelled name with infix:
  Arie Gansneb genaamd Tengnagel tot den Bonckenhave
• Double-barrelled name without infix:
  Martina Galloux Wittevrouw

Names with a foreign language attribute
• Three categories:
  Arabic: el Bahlaoui Husseini al Fharid
  Chinese/Vietnamese: Cuong Buo Chan
  Spanish/Portuguese: Fonseca Aranda de Pereira Rodriguez

Diacritic symbols
• All diacritics have to be recorded in the database.
• Preferences in Capital Conversion
• Validation of names
• Examples:
  • Büch
  • Hällström
  • Özgüleç
  • Güçlütürk

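Because the stored spelling keeps every diacritic, a fault-tolerant lookup can compare diacritic-free keys and still return the recorded form. The following is a minimal sketch of that idea using Python's standard `unicodedata` module; the index layout and function names are assumptions, not a description of the actual databases.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks after NFD decomposition and casefold the result."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Stored spellings keep their diacritics; the folded form is only a search key.
STORED_NAMES = ["Büch", "Hällström", "Özgüleç", "Güçlütürk"]
INDEX = {fold_diacritics(name): name for name in STORED_NAMES}


def lookup(raw: str):
    """Return the stored spelling for input typed with or without diacritics."""
    return INDEX.get(fold_diacritics(raw))
```

An input such as "OZGULEC" or "Hallstrom" then resolves to the recorded "Özgüleç" or "Hällström". Transliterations such as "Buech" for "Büch" are not covered by this folding and would need separate handling.
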
Interpretation of relationship data
• Different kinds of relationship data
• Different attributes
• General and country specific rules (capitalization, abbreviation, etc.)
• Signification differs due to context
• Due to the ambiguity of relationship data, correct interpretation is no picnic

Different kinds of relationship data with different attributes
• Betonmortelfabriek BEMOTI Tilburg bv
• Tilburgse Betonmortelfabriek BEMOTI bv
• RegTP, Regulierungsbehörde für Telekommunikation und Post
• CQCS International Consulting
• Servicebureau Jansen/ Jansen Elektroservice
• De Boer Landbouwmachines/ De Boer Machinebouw

Signification can differ as a consequence of context and of the rules for abbreviation, capitalization and punctuation
• Art Gallery Wandt & Wandt
• Wandt Fachhandel für Kunstart.
• Art. Wandt Kunsthandel
• van Walbeek, M.B.A.
• Van Walbeek, MBA

Significations: How can they be determined?
• Does the item exist in the particular knowledge universe?
• Can the significations be resolved or deduced (acronyms and compounds)?
• If the item does not exist in the knowledge universe, what is the most probable signification, considering the context?

Can the item be deduced or resolved?
• NeVoBo Nederlandse Volleybalbond
• KLM Koninklijke Nederlandse Luchtvaartmaatschappij
• AAAA
• Maschinenfabrik Mertens
• Carburateurbinnenverlichtingsfabriek Mertens

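A compound that is not stored as a whole may still be resolved when it splits completely into known parts. The sketch below shows one simple way to do this with a dictionary-based split; the mini-lexicon and the single linking element "s" are assumptions chosen to fit the two examples, not the actual resolution logic.

```python
from functools import lru_cache

# Hypothetical mini-lexicon; the real knowledge tables are far larger.
LEXICON = {"maschinen", "fabrik", "carburateur", "binnen", "verlichting", "fabriek"}
LINKING = ("", "s")  # optional linking element after a part (assumption)


def split_compound(word: str):
    """Return the known parts of a compound, or None if it cannot be resolved."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start: int):
        if start == len(word):
            return ()
        for end in range(len(word), start, -1):  # try long parts first
            part = word[start:end]
            if part not in LEXICON:
                continue
            for link in LINKING:
                if word.startswith(link, end):
                    rest = solve(end + len(link))
                    if rest is not None:
                        return (part,) + rest
        return None

    return solve(0)
```

With this lexicon, "Maschinenfabrik" resolves into two parts and "Carburateurbinnenverlichtingsfabriek" into four, with the linking "s" dropped. An item such as "AAAA" yields `None` and has to be interpreted from context instead.
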
The item is not found in the knowledge universe
• Harry Edward Johnson
• Harry Edward Ireallygotaweirdsurname
• IBM Computing
• HAL Computing
• Hermans Groente & Fruit, A’dam
• Johnson Sarvice & Cnosult, Chelsee

Context
• Metzgerei Theo Frankfurt: given name/surname?
• Metzgerei Theo Frankfurt: given name/geographical item?
• Karin Jansen – Bloemen: given name/surname/company word?
• Karin Jansen – Bloemen: given name/surname – surname (maiden name)?

Patterns
• Restaurant Die Vier Jahreszeiten
• Café Het Nerveuze Schaap
• Jasmijn Bloemen en Planten
• Helena Catering & Imbiß
• Consultingservice QCS Amsterdam
• Aardappelhandel ABC Paterswolde

Patterns?
• chr. bond v. ambtenaren
• chr. bond van zomers
• KARL OTTO GRAF LAMBSDORFF
• EVA MARIA BARON POTOCKI
• Hi-Fi Johanson & Gruber GmbH
• Em-Lo Emmerich und Lohmeier GmbH

Multiple occurrences
An item must be stored in all its significations
• Beh.: Behandlung, Behälter, Behörde, Behinderte
• Ond.: Onderzoek, Onderhoud, Onderneming, Onderwijs, Onderling

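Storing an item in all its significations amounts to a one-to-many table. A minimal sketch, using the two abbreviations above as data; the function names and the in-memory dictionary are assumptions for illustration.

```python
from collections import defaultdict

# Abbreviation -> every stored signification (data taken from the examples above).
SIGNIFICATIONS = defaultdict(list)


def store(abbreviation: str, *full_forms: str) -> None:
    """Register all full forms an abbreviation can stand for."""
    SIGNIFICATIONS[abbreviation.lower()].extend(full_forms)


store("Beh.", "Behandlung", "Behälter", "Behörde", "Behinderte")
store("Ond.", "Onderzoek", "Onderhoud", "Onderneming", "Onderwijs", "Onderling")


def candidates(token: str) -> list:
    """Return every signification of a token; an empty list if it is unknown."""
    return list(SIGNIFICATIONS.get(token.lower(), []))
```

The lookup deliberately returns all candidates rather than picking one; choosing among them is left to the context rules.
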
Interpretation step by step
• Read the appellation
• Divide the appellation into relevant sections and ascribe all possible significations to the sections
• Apply context and grouping rules and choose the most probable combination of significations
• Score the found items, the small context, the large context and the corrections for special cases.

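The steps above can be sketched in a few lines of Python. Everything numeric here is hypothetical: the item scores and pattern bonuses are invented weights, and the single pattern bonus stands in for the separate small-context, large-context and special-case scores, which the sketch does not model.

```python
from itertools import product

# Hypothetical knowledge table: token -> {signification: item score}.
KNOWLEDGE = {
    "metzgerei": {"company word": 1.0},
    "theo": {"given name": 0.9, "surname": 0.2},
    "frankfurt": {"geographical item": 0.8, "surname": 0.3},
}

# Hypothetical context bonuses for sequences of significations.
PATTERN_BONUS = {
    ("company word", "given name", "surname"): 0.6,
    ("company word", "given name", "geographical item"): 0.2,
}


def interpret(appellation: str):
    """Return (score, labels) for the best-scoring combination of significations."""
    tokens = appellation.lower().split()  # read and divide into sections
    # Ascribe all known significations; unknown tokens get a zero-score label.
    options = [KNOWLEDGE.get(t, {"unknown": 0.0}) for t in tokens]
    best = None
    for labels in product(*options):  # every combination of significations
        score = sum(opts[label] for opts, label in zip(options, labels))
        score += PATTERN_BONUS.get(labels, 0.0)  # context rule
        if best is None or score > best[0]:
            best = (score, labels)
    return best
```

With these invented weights, "Metzgerei Theo Frankfurt" comes out as company word, given name, geographical item. Raising the bonus for the surname pattern would flip the result, which is exactly the ambiguity the Context examples illustrate.
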
[Diagram labels: Knowledge Universe, Appearance, Context, <WORD>, Interpretation, Signification]

For more information: h.wandt@humaninference.com