This article discusses the problems encountered in mapping DUELME-LMF to standardized Data Categories (DCs) in ISOCAT, including overlap with other projects, ill-defined DCs, and language sections. It also highlights the need for support from CLARIN for existing tag sets.
ISOCAT: Problems encountered in DUELME-LMF
Jan Odijk
Nijmegen, 21 Sep 2010
Overview
Standardized DCs?
Multiple relevant DCs in ISOCAT
Overlap with other projects
Container Data Categories
Almost Identical DCs
Language Sections
Existing Tagsets
Standardized DCs?
Almost none of the current ISOCAT DCs are part of an official standard.
There are often multiple candidate DCs in ISOCAT for a DUELME-LMF DC. Which one should we map it to?
If it is mapped to one that later does not become a standard, the mapping has to be redone.
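One way to keep the cost of a later remapping visible is to record, for each local DC, all candidates that were considered and whether the chosen one has become standardized. The following is a minimal sketch under stated assumptions: the class and function names, the `standardized` flag, and the local names `noun` and `hypothetical_element` are illustrative only and are not part of DUELME-LMF or ISOCAT. The two URLs are noun DCs cited elsewhere in these slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """An ISOCAT DC considered as a mapping target (illustrative structure)."""
    url: str
    standardized: bool = False  # assumption: set by hand once a DC enters a standard


@dataclass
class Mapping:
    """A local DC with the candidates considered and the one finally chosen."""
    local_dc: str
    candidates: List[Candidate]
    chosen: Optional[Candidate] = None


def needs_remap(mappings: List[Mapping]) -> List[str]:
    """Local DCs mapped to a DC that is not (yet) standardized."""
    return [m.local_dc for m in mappings
            if m.chosen is not None and not m.chosen.standardized]


def unmapped(mappings: List[Mapping]) -> List[str]:
    """Local DCs for which no ISOCAT DC has been chosen at all."""
    return [m.local_dc for m in mappings if m.chosen is None]


# Illustrative data only: local names and flags are hypothetical.
DC_1333 = Candidate("http://www.isocat.org/datcat/DC-1333")
DC_3347 = Candidate("http://www.isocat.org/datcat/DC-3347")

MAPPINGS = [
    Mapping("noun", [DC_1333, DC_3347], chosen=DC_1333),
    Mapping("hypothetical_element", []),
]

print(needs_remap(MAPPINGS))
print(unmapped(MAPPINGS))
```

Keeping the rejected candidates next to the chosen one means that, if the chosen DC does not become a standard, the alternatives do not have to be searched for again.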
Multiple ISOCAT DCs
There are often multiple candidate DCs in ISOCAT for a DUELME-LMF DC.
This is caused, inter alia, by each project entering its own subset (in some cases multiple DCs are appropriate, in many cases none is).
How to deal with this?
Overlap with other projects
DUELME-LMF uses a tag set that overlaps with the D-COI tagset.
TTNWW and Adelheid also use (a set overlapping with) the D-COI tagset.
Mutual consultation is required, and is what we strive for.
However, it is difficult to realize because of the different lead times of the projects: DUELME-LMF has finished, Adelheid has yet to start, and TTNWW has so far worked only on a partially different subset.
Other projects may also use these tags, but how do we know?
Container data categories
Container data categories are not possible (yet?) in ISOCAT.
Many DUELME-LMF XML elements have no entry in ISOCAT (yet).
These have to be added later.
Almost identical DCs
Many DCs in ISOCAT are either:
ill-defined (is it the same DC as the one I need?), or
sufficiently or well defined, but slightly differently from what I need.
How to deal with this?
Language Sections?
Some DCs in ISOCAT are highly language-specific.
http://www.isocat.org/datcat/DC-2704 (noun) is highly Polish-specific:
"Noun [subst] contains lexemes inflecting for number and case, with a lexically determined grammatical gender, which do not have the category of person, e.g., woda 'water', profesor 'professor', pięciokrotność 'fivefoldness'; this class also contains defective plurale tantum and singulare tantum lexemes, but not depreciative lexemes. Grammatical categories of noun [subst]: number (http://www.isocat.org/datcat/DC-2709), case (http://www.isocat.org/datcat/DC-2720), gender (http://www.isocat.org/datcat/DC-2728)."
Yet this definition is in the English language section.
Language Sections?
Such DCs should fall under a more language-independent DC, with specializations for the relevant language in the language section (?)
E.g. http://www.isocat.org/datcat/DC-3347 (Noun)
Reasons:
Projects enter their own DCs as separate DCs in ISOCAT.
Language Sections?
Reasons (cont.):
Most language-independent DCs have lousy definitions.
http://www.isocat.org/datcat/DC-1333 (noun): “Part of speech used to express the name of a person, place, action or thing”
Why is it a lousy definition?
The definition of a morpho-syntactic DC is in terms of semantics only, while the definition of POS (http://www.isocat.org/datcat/DC-396) states: “A category assigned to a word based on its grammatical and semantic properties.” / “Die Klasse von Wörtern einer Sprache auf Grund der Zuordnung nach gemeinsamen grammatischen Merkmalen.” (German: “The class of words of a language, based on assignment according to shared grammatical features.”)
It is taken from a credible source (ISO 12620), but don’t rely on authority!
It does not correspond to any concept of noun used elsewhere:
If "name" = proper name, then John and London are OK, but words that are usually considered nouns are not.
Many real nouns express properties: man, city, work, book.
"here" expresses a place, but it is surely no noun.
The example given is not convincing: Spiderman (a person?)
Existing Tag sets
There are many existing tag sets, e.g. the CGN tagset, D-COI tagset, STTS tagset, IPI PAN tagset, etc.
Usually language-specific
Usually de facto standards for the language
Used by multiple resources
Used / assumed by multiple existing tools
Often claimed to be EAGLES-compatible (but almost never actually proven)
Existing Tag sets
There are many existing tag sets (cont.)
With very precise definitions for their member DCs
Much more specific than individual language-independent tags
With clear delimitation from other tags in the tagset
With clear assignment guidelines
Covering the whole space of tags, nicely divided up
So it is essential that all tags of a tagset are in ISOCAT and that each tag is identifiable as a member of the tagset.
They should be supported by CLARIN (or CLARIN will be a failure).
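The two requirements above (every tag of a tagset is registered, and each entry is identifiable as a member of its tagset) can be checked mechanically if entries are keyed by tagset and tag together. A minimal sketch, assuming a registry modelled as a set of (tagset, tag) pairs: the names `TagEntry` and `missing_tags` and the tags `TAG_A`, `TAG_B`, `TAG_C` are hypothetical placeholders, not real CGN or D-COI tags and not an ISOCAT API.

```python
from typing import Iterable, List, NamedTuple, Set


class TagEntry(NamedTuple):
    """A registry entry that carries its tagset, so membership is explicit."""
    tagset: str
    tag: str


def missing_tags(tagset: str, tags: Iterable[str],
                 registry: Set[TagEntry]) -> List[str]:
    """Tags of `tagset` that have no entry registered under that tagset."""
    return sorted(t for t in tags if TagEntry(tagset, t) not in registry)


# Hypothetical registry contents, for illustration only.
REGISTRY = {
    TagEntry("D-COI", "TAG_A"),
    TagEntry("D-COI", "TAG_B"),
    TagEntry("CGN", "TAG_C"),
}

# TAG_C is registered, but only as a member of another tagset,
# so it still counts as missing for D-COI.
print(missing_tags("D-COI", ["TAG_A", "TAG_B", "TAG_C"], REGISTRY))
```

Keying on the pair rather than on the tag alone is the design choice that matters here: a tag with the same name in another tagset may have a different definition and different delimitation from its neighbours.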
Thanks for your attention! Listen to my solutions later! http://www.clarin.nl/ CLARIN-NL