Expressing Lexical Complexity in SKOS(XL)
Thomas Bandholtz, innoQ Deutschland GmbH, D-40880 Ratingen
www.innoq.com, thomas.bandholtz@innoQ.com
5th ECOTERM MEETING at FAO, Rome, Italy, 05-06 October 2009

Content
Expressing Lexical Complexity in SKOS(XL)
• Motivation
• Thesaurus models with regard to lexical complexity
• UMTHES extensions of SKOSXL
• Examples using RDF Turtle syntax
Ecoterm 2009: Lexical Complexity SKOS(XL)

Motivation
• What is “lexical complexity”?
• Why should we care?
• The case: UMTHES in SKOS
• Umweltbundesamt (DE) & innoQ develop iQvoc

What is “lexical complexity”?
Each Concept may be represented by multiple terms
• preferred / non-preferred term, multilingualism, etc.
Each term may have many lexical representations
• inflection
• abbreviation
• “legal” variants in orthography
• historical versions of “legal” orthography (in German: 1880 - 2006)
• common misspellings
• regional variants in the same language
Each term may be a compound term
• a compound term may contain term delimiters (spaces or hyphens)
• the components may appear dispersed within a sentence
• the components may designate different concepts by themselves

(a side note about orthography)
“Before compulsory education was established, it was something to be able to write.”
tb: just like Cervantes, Dante, Goethe, Shakespeare, Whitman, etc.
“Since then, you have to be a proper speller.”
(Peter Bichsel, Der Leser. Das Erzählen. Frankfurter Poetik-Vorlesungen. 1982)

Why should we care?
Traditional (nice-to-have):
• Alphabetic lists of subject indices show some lexical variants.
Contemporary (prerequisite):
• automatic (machine-made) detection of Concepts covered by a natural language document (“Named Entity Recognition”)
• must capture a covered Concept as concisely as possible
• considering all possible lexical appearances, including term composition
Language dependent:
• English is comparatively simple in this regard.
• German is awful!
• (add your language here)

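As a rough illustration of why lexical variants matter for automatic detection, the sketch below maps every known surface form to a concept ID and scans a text for them. The variant table, the function name `detect_concepts`, and the ID "4711" are made up for illustration; real entity recognition would need tokenization, inflection handling, and disambiguation well beyond this.

```python
import re

# Hypothetical variant table: surface form -> concept ID (illustrative data only).
VARIANTS = {
    "waste water": "4711",
    "wastewater": "4711",
    "waste-water": "4711",
}

def detect_concepts(text, variants=VARIANTS):
    """Return the set of concept IDs whose known surface forms occur in text."""
    found = set()
    lowered = text.lower()
    for form, concept_id in variants.items():
        # Word boundaries keep a form from matching inside a longer word.
        if re.search(r"\b" + re.escape(form) + r"\b", lowered):
            found.add(concept_id)
    return found
```

A form that is missing from the table is simply never found, which is the practical argument for recording variants with the vocabulary itself.
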
The case: UMTHES in SKOS
The German Environmental Thesaurus UMTHES
~ 12,000 preferred + 25,000 non-preferred terms + 11,000 'multiple-composition' (spelling) forms
• needs to be serialized in SKOS for migration into the iQvoc vocabulary management tool
• includes sophisticated knowledge about lexical complexity
• we don't want to lose this when moving to SKOS(XL)

UBA(de) & innoQ develop …
iQvoc - Open Source Vocabulary Management Tool
• Totally Web-based, supports distributed editorial teams
• Safe and comfortable, schema-driven editing features
• Simple but powerful workflow implementation
Conformance
• W3C “Cool URI” design and deployment
• W3C SKOS Recommendation
Availability
• GNU General Public License (GPL)
• iQvoc version 1 demo (GEMET) at: http://apps.innoq.com/iqvoc/about.html
• iQvoc 2 availability planned for Q1 2010

Thesaurus models with regard to lexical complexity
• Traditional - ISO 2788:1986
• ISO Model revised (Draft 2008-11-18)
• SKOS W3C Recommendation 2009-08-18

Traditional - ISO 2788:1986
“Guidelines for the establishment and development of monolingual thesauri”
• indexing language: “A controlled set of terms selected from natural language and used to represent, in summary form, the subjects of documents.”
• thesaurus: “The vocabulary of a controlled indexing language, formally organized …”
• preferred term: “A term used consistently when indexing to represent a given concept … sometimes known as descriptor.”
• non-preferred term: “The synonym or quasi-synonym of a preferred term. A non-preferred term is not assigned to documents but is provided as an entry point … sometimes known as a non-descriptor”

ISO 2788:1986 Model (1)
[Diagram of the ISO 2788:1986 model. Annotations: “see next slide”; “term equivalence”; hierarchical and associative relations between preferred terms are not in focus here.]

ISO 2788:1986 Model (2)
• compound term: “An indexing term which can be factored morphologically into separate components, each of which could be expressed, or re-expressed, as a noun that is capable of serving independently as an indexing term.
• a) The focus or head, i.e. the noun component which identifies the general class of concepts to which the term as a whole refers. Examples: ‘printed indexes’, ‘hospitals for children’.
• b) The difference or modifier, i.e. one or more further components which serve to narrow the extension of the focus and so specify one of its subclasses. Examples: ‘printed indexes’, ‘hospitals for children’.
• The focus and its difference(s) may be written as separate words, as in ‘dining rooms’ and ‘soup spoons’, or they may be concatenated into single words, as in ‘bedrooms’ and ‘teaspoons’”.

ISO Model revised (Draft 2008-11-18)
Leonard Will, 2009-02-13, in the public SKOS mailing list:
“I write as Chair of the ‘Data Modeling, Exchange Formats and Protocols’ subgroup of the ISO working group SC9WG8/Project 25964, currently revising the ISO standard for thesauri for information retrieval, but as these standards are still in draft form anything I say here is my own interpretation of the way we are going, and is not authoritative”.
…
“The ISO model is firmly based on relationships between concepts, not terms. Terms are used as labels for concepts, as in SKOS”.
http://lists.w3.org/Archives/Public/public-esw-thes/2009Feb/0033.html
(see diagram on next slide)

ISO Model revised (Draft 2008-11-18)
[Diagram of the revised ISO model]

W3C SKOS Recommendation
Simple Knowledge Organization System
• “SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web …”
• Started in 2004: http://www.w3.org/2004/02/skos/
• 2009-08-18: W3C Recommendation status
• SKOS Reference: http://www.w3.org/TR/2009/REC-skos-reference-20090818/
• SKOS Primer: http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
• SKOS Use Cases and Requirements: http://www.w3.org/TR/2009/NOTE-skos-ucr-20090818/

SKOS Model
[Diagram of the SKOS model. Annotations:]
• “anything” can have these labels (~ terms) and notes
• about Concepts, not terms
• ~ ISO node label
• includes relations known from ISO “preferred term”: hierarchical, associative, but not equivalence

ISO 2788:1986 mapped to SKOS
[Diagram: mapping of the ISO 2788:1986 model to SKOS]

What is added by SKOSXL?
• skosxl:Label is a Class, not a literal
• skosxl:Label has (exactly one) literalForm
• skosxl:Label can have a labelRelation to another Label
What you don't see in the diagram:
• skos:prefLabel etc. are extended by a “property chain” (seen from an rdfs:Resource): the value of an assigned skos:prefLabel is equivalent to the value of the skosxl:literalForm of an assigned skosxl:Label.

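The property chain can be read as a simple inference rule: whenever a resource points to a skosxl:Label, the plain SKOS label with that Label's literal form follows. The sketch below applies that rule to triples held as Python tuples. The function name `infer_plain_labels` and the sample data are illustrative assumptions, and a real system would leave this to an RDF reasoner.

```python
# SKOSXL labelling properties and the plain SKOS properties they imply.
XL_TO_SKOS = {
    "skosxl:prefLabel": "skos:prefLabel",
    "skosxl:altLabel": "skos:altLabel",
    "skosxl:hiddenLabel": "skos:hiddenLabel",
}

def infer_plain_labels(triples):
    """Derive plain SKOS label triples from (subject, predicate, object) tuples."""
    # Look up each Label resource's literal form first.
    literal_of = {s: o for s, p, o in triples if p == "skosxl:literalForm"}
    inferred = set()
    for s, p, o in triples:
        if p in XL_TO_SKOS and o in literal_of:
            inferred.add((s, XL_TO_SKOS[p], literal_of[o]))
    return inferred

# Illustrative data only.
triples = [
    (":4711", "skosxl:prefLabel", ":waste"),
    (":4711", "skosxl:altLabel", ":garbage"),
    (":waste", "skosxl:literalForm", "waste"),
    (":garbage", "skosxl:literalForm", "garbage"),
]
plain = infer_plain_labels(triples)
```
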
Extensions of SKOSXL by UMTHES
Properties of skosxl:Label complementing skosxl:literalForm
• baseForm: inflectional “root” of the term (add suffixes to this)
• inflectionalCode: encoding of a regular inflectional pattern
• lexicalVariant: any lexical variant that may appear in a written document
    • inflectional: derived by inflection
    • acronym: any kind of abbreviation
    • cultural: any (sub)cultural variation
    • misspelled: common spelling errors
Sub-properties of skosxl:labelRelation
• homograph: homograph part of a qualified name
• hasQualifier: qualifier part of a qualified name
• lexicalExtension: may point to historical orthography, or verb form, etc.
• compoundFrom: composition (value is an rdf:List)

Examples using SKOS(XL) (mostly stripped down to a topic)
Switching to Turtle Syntax
Terse RDF Triple Language
• W3C Team Submission 14 January 2008
• http://www.w3.org/TeamSubmission/turtle/ by TBL
• Used in the W3C SKOS Recommendation as well as in the OWL 2 Draft
Everything can be expressed in XML as well.
• Turtle syntax is easier for humans to read.
• see for yourself …

UMTHES in SKOS(XL) examples
Namespace prefixes used in the following:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix ext: <http://www.uba.de/2009/08/UmThesScheme#>.
# no prefix means: defined in the local namespace

waste & garbage
# SKOS only
:4711 rdf:type skos:Concept;
    skos:prefLabel "waste";
    skos:altLabel "garbage".
# exactly the same in SKOSXL
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :waste;
    skosxl:altLabel :garbage.
:waste rdf:type skosxl:Label;
    skosxl:literalForm "waste".
:garbage rdf:type skosxl:Label;
    skosxl:literalForm "garbage".
• NOTE: Local instance identifiers (:4711, :waste, :garbage, etc.) in these examples follow a local naming convention that is meant for human readers only.
• “4711” used to be the brand name of a Cologne-based perfume manufacturer (“Eau de Cologne”). In the 1980s and 1990s it became a generic placeholder ID in informatics. So :4711 stands for “any kind of unique, but by itself meaningless ID”.
• The only functional requirements for IDs in this place are:
    • being unique within the assigned namespace;
    • being part of a working http URI.

waste & garbage (same listing as on the previous slide)
# this looks like saying the same stuff in a more complicated way
# but wait ...

“waste water” composition
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :wasteWater.
:wasteWater rdf:type skosxl:Label;
    skosxl:literalForm "waste water";
    ext:lexicalVariant "wastewater";
    ext:compoundFrom (:waste :water).
# already defined in the previous slide, could skip it here:
:waste rdf:type skosxl:Label;
    skosxl:literalForm "waste".  # only the noun, "wasted water" is NOT "waste water"!
:water rdf:type skosxl:Label;
    skosxl:literalForm "water";
    ext:inflectional "waters".

Multiple Composition in German
# @en: technique of facilities for the recycling of waste water
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :abwasserAufbereitungsAnlagenTechnik.
:abwasserAufbereitungsAnlagenTechnik rdf:type skosxl:Label;
    skosxl:literalForm "Abwasseraufbereitungsanlagentechnik";
    ext:compoundFrom (:abwasser :aufbereitung :anlage :technik);
    ext:compoundFrom (:abwasserAufbereitung :anlage :technik);
    ext:compoundFrom (:abwasserAufbereitungsAnlage :technik);
    ext:compoundFrom (:abwasser :aufbereitungsAnlage :technik);
    ext:compoundFrom (:abwasserAufbereitung :anlagenTechnik);
    ext:compoundFrom (:abwasser :aufbereitung :anlagenTechnik);
    ext:compoundFrom (:abwasser :aufbereitungsAnlagenTechnik).
# maybe I missed some composition variant? Not joking!

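Whether a composition variant is missing can at least be checked combinatorially. The sketch below enumerates every way to split an ordered list of components into two or more contiguous groups; for four components that gives 2^3 - 1 = 7 groupings, which matches the seven compoundFrom lists on the slide. The function name `contiguous_groupings` is made up here, and the code says nothing about whether each group is a meaningful term in its own right, which remains a lexical decision.

```python
from itertools import combinations

def contiguous_groupings(parts):
    """Yield every split of parts into two or more contiguous groups."""
    n = len(parts)
    # Each non-empty subset of the n-1 gaps between parts is one grouping.
    for k in range(1, n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tuple(parts[a:b]) for a, b in zip(bounds, bounds[1:])]

parts = ["abwasser", "aufbereitung", "anlage", "technik"]
groupings = list(contiguous_groupings(parts))
```
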
Lexical extension example in German
# in English: "cleaning"
:reinigung rdf:type skosxl:Label;
    skosxl:literalForm "Reinigung"@de;
    ext:lexicalExtension :reinigen .
# extended by the verb form, English "to clean"
# Caution: see "wasted water"
:reinigen rdf:type skosxl:Label;
    skosxl:literalForm "reinigen"@de;
    ext:baseForm "reinig";
    ext:inflectionalCode "007";
    ext:inflectional "reinige";
    ext:inflectional "reinigen";
    ext:inflectional "reinigte";
    ext:inflectional "gereinigt";
    ext:inflectional "gereinigte";
    ext:inflectional "gereinigter";
    ext:inflectional "gereinigtes";
    ext:inflectional "reinigend";
    ext:inflectional "reinigende";
    ext:inflectional "reinigender";
    ext:inflectional "reinigendes";
    # to be continued …

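One way to read baseForm and inflectionalCode together is that the code selects a table of affixes to apply to the base form. The sketch below assumes exactly that. The slides do not define what code "007" contains, so the table here is reverse-engineered from the forms listed above and is purely hypothetical, as is the function name `expand_inflections`.

```python
# Hypothetical pattern table; the real meaning of code "007" is not given in the slides.
# Each entry is a (prefix, suffix) pair applied to the base form.
PATTERNS = {
    "007": [
        ("", "e"), ("", "en"), ("", "te"),
        ("ge", "t"), ("ge", "te"), ("ge", "ter"), ("ge", "tes"),
        ("", "end"), ("", "ende"), ("", "ender"), ("", "endes"),
    ],
}

def expand_inflections(base_form, code, patterns=PATTERNS):
    """Generate inflected forms by applying the coded affix pairs to base_form."""
    return [prefix + base_form + suffix for prefix, suffix in patterns[code]]

forms = expand_inflections("reinig", "007")
```

Under this reading the explicit ext:inflectional values could be generated rather than stored, at the cost of maintaining the pattern tables.
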
Homograph & qualifier
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :bass--fish .   # [ˈbas]
:4712 rdf:type skos:Concept;
    skosxl:prefLabel :bass--music .  # [ˈbās]
:bass rdf:type skosxl:Label;
    skosxl:literalForm "bass".
:fish rdf:type skosxl:Label;
    skosxl:literalForm "fish".
:bass--fish rdf:type skosxl:Label;
    skosxl:literalForm "bass (fish)";
    ext:homograph :bass;
    ext:hasQualifier :fish.
# add Labels :music and :bass--music using the same pattern

Multilingualism (symmetric)
# symmetric (in SKOS, can be expressed in SKOSXL likewise)
:4711 rdf:type skos:Concept;
    skos:prefLabel "organisation"@en;
    skos:prefLabel "organization"@en-US;
    # add your language here ... (GEMET has more than 20)
    skos:prefLabel "Organisation"@de.
SKOS integrity condition S14:
• “A resource has no more than one value of skos:prefLabel per language tag.”
NOTE: this does not mean it must have prefLabel values in multiple languages

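Condition S14 is easy to check mechanically. The sketch below groups prefLabel values by resource and language tag and reports any group with more than one value. The function name `s14_violations` and the tuple layout are assumptions made for this example. Language tags are compared as whole tags, so "en" and "en-US" count as different, which is why the slide's example is valid.

```python
from collections import defaultdict

def s14_violations(pref_labels):
    """Find resources with more than one skos:prefLabel for the same language tag.

    pref_labels: iterable of (resource, literal, language_tag) tuples.
    Returns {(resource, tag): sorted literals} for each violating pair.
    """
    by_key = defaultdict(set)
    for resource, literal, lang in pref_labels:
        # Language tags are case-insensitive, so normalize before grouping.
        by_key[(resource, lang.lower())].add(literal)
    return {key: sorted(vals) for key, vals in by_key.items() if len(vals) > 1}

slide_labels = [
    (":4711", "organisation", "en"),
    (":4711", "organization", "en-US"),
    (":4711", "Organisation", "de"),
]
ok = s14_violations(slide_labels)
bad = s14_violations(slide_labels + [(":4711", "organization", "en")])
```
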
Multilingualism (language-centric)
# UMTHES is German-centric with altLabel values also in English
:4711 rdf:type skos:Concept;
    skos:prefLabel "Organisation"@de;
    skos:altLabel "organisation"@en;
    skos:altLabel "organization"@en-US.
# or use skosxl: in the above to refer to:
:Organisation rdf:type skosxl:Label;
    skosxl:literalForm "Organisation"@de;
    ext:inflectional "Organisationen";
    ext:inflectional "Organisations-".
:organisation rdf:type skosxl:Label;
    skosxl:literalForm "organisation"@en;
    ext:inflectional "organisations".
:organization rdf:type skosxl:Label;
    skosxl:literalForm "organization"@en-US;
    ext:inflectional "organizations".

Multilingualism (asymmetric)
# full asymmetric pattern (currently not used by UMTHES)
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :Organisation;
    ext:hasTranslation :4712.
:4712 rdf:type skos:Concept;
    skosxl:prefLabel :organisation;
    ext:hasTranslation :4711.
# :Organisation & :organisation already known from previous slide

About Federation
• UMTHES has been one of the 8 sources of GEMET
• UMTHES extends GEMET with more detailed German Concepts and their lexical complexity.
@prefix gemet: <http://www.eionet.europa.eu/gemet/concept/>.
# GEMET URIs do resolve in SKOS since 2009-09 !!!
:14452 rdf:type skos:Concept;
    skosxl:prefLabel :klimaAenderung;
    skosxl:altLabel :klimaWandel;
    skosxl:altLabel :climateChange;
    # referencing GEMET "climatic change" from here
    skos:closeMatch gemet:1471.
:klimaAenderung rdf:type skosxl:Label;
    ext:compoundFrom (:klima :aenderung);
    # ... etc, as exemplified before

preferred, non-preferred term again
# you may define such classes in SKOS (OWL) at any time
# but they will never be exactly equivalent to ISO 2788 (why?)
:isPrefLabelOf owl:inverseOf skosxl:prefLabel.
:isAltLabelOf owl:inverseOf skosxl:altLabel.
:PreferredTerm owl:equivalentClass
    [ rdf:type owl:Restriction ;
      owl:onProperty :isPrefLabelOf ;
      owl:someValuesFrom skos:Concept ].
:NonPreferredTerm owl:equivalentClass
    [ owl:intersectionOf (
        [ owl:complementOf :PreferredTerm ]
        [ rdf:type owl:Restriction ;
          owl:onProperty :isAltLabelOf ;
          owl:someValuesFrom skos:Concept ] ) ].

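The two class definitions can be approximated with plain set operations, as in the sketch below. It takes a closed-world reading (a label counts as "not preferred" simply because no prefLabel triple mentions it), which OWL itself does not license, and it assumes every subject is a skos:Concept. The function name `classify_labels` and the sample triples are illustrative only.

```python
def classify_labels(triples):
    """Split Label resources into preferred and non-preferred sets.

    triples: iterable of (concept, predicate, label) tuples; all subjects
    are assumed to be skos:Concept instances.
    """
    pref = {o for _, p, o in triples if p == "skosxl:prefLabel"}
    alt = {o for _, p, o in triples if p == "skosxl:altLabel"}
    # NonPreferredTerm = altLabel of some Concept AND not a PreferredTerm.
    return {"PreferredTerm": pref, "NonPreferredTerm": alt - pref}

# Illustrative data: :waste is preferred for one concept, alternative for another.
triples = [
    (":4711", "skosxl:prefLabel", ":waste"),
    (":4711", "skosxl:altLabel", ":garbage"),
    (":4712", "skosxl:altLabel", ":waste"),
]
classes = classify_labels(triples)
```

In this data :waste ends up only in PreferredTerm even though it also serves as an alternative label elsewhere, which shows how these classes depend on the whole graph rather than on a single concept.
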
Finally …
# you may express anything in RDF / Turtle …
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
:ecoTerm2009 rdf:type :meeting;
    :hasOnAgenda :theseSlides.
:theseSlides rdf:type :presentation;
    skos:prefLabel "Expressing Lexical Complexity in SKOS(XL)";
    :hasPresenter :tb.
:tb rdf:type foaf:Person;
    foaf:mbox <mailto:thomas.bandholtz@innoq.com>;
    foaf:isPrimaryTopicOf <http://www.bandholtz.eu/foaf.rdf>;
    foaf:workplaceHomepage <http://www.innoq.com>;
    foaf:currentProject <http://apps.innoq.com/iqvoc/about.html>;
    # add your assertions here ...
    :says "Good bye!".