This workshop discusses the elements to be included in ISOcat, including names, identifiers, and the use of capitals and special characters. It also explores hierarchical structures and the possibility of adding new profiles. Other topics include dealing with larger amounts of data and linking closed and simple data categories.
CLARIN-NL ISOcat workshop 2013, part 2 (02-10-2013) • Ineke Schuurman • Menzo Windhouwer
Issues call 4 • Issues brought up by participants in call 4 • Which elements are to be included in ISOcat? • What can be expressed in ISOcat (and other cats)? • Names, identifiers, and the use of capitals, special characters, etc. • Easy linking of closed and simple DCs • Hierarchical structure between DCs • Can a new profile be added?
Other issues • Issues brought up in previous calls • Type of DC • When to create a new DC/adopt an existing one • When to create several DCSs • Name of DC, several DCs with same name • How to deal with larger amounts of data
What to include? • ALL concepts dealing with linguistics/metadata • Van Dale EN-NE: include (overgankelijk werkwoord) 1) omvatten 2) (mede) opnemen • 'overgankelijk werkwoord' / 'transitive verb' is to be included, and so are 'overg.ww' and 'trns.v.' • One and the same DC! (but separate parts)
What to include? Have a look at ‘transitive verb’ • Several entries in ISOcat • DC-1405 A verb which takes a direct object; that is, a verb that expresses an action which directly affects another person or thing. • DC-3532 A transitive verb is a verb that takes a direct object, and describes a relation between two participants [Crystal 1997: 397; Payne 1997: 171] • And several more, so... which one to select?
Names - identifiers • Identifier • No spaces (properNoun) • Camel case (properNoun) • Start with a lowercase letter (properNoun), not with a digit or punctuation character • Such characters may appear elsewhere in the identifier
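The identifier rules on this slide can be sketched as a small check. This is a minimal sketch, not ISOcat's own validation: the function name `is_valid_identifier` is hypothetical, and it assumes that the first character must be an ASCII lowercase letter. Camel case is a convention and is not enforced here.

```python
import re

# Assumed reading of the slide's rules:
# - the first character is a lowercase ASCII letter
#   (so not a capital, digit, or punctuation character);
# - any later character is allowed as long as it is not whitespace.
_IDENTIFIER = re.compile(r"[a-z]\S*")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the identifier satisfies the assumed rules above."""
    return _IDENTIFIER.fullmatch(identifier) is not None
```

Under these assumptions, `properNoun` and `x-qatalClause` pass, while `proper noun`, `ProperNoun` and `1stPerson` do not.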
Names - identifiers • Name • Multi-word units allowed (proper noun) • Several names allowed (in the same name section), one per entry • Use the most common name first, alternative names in further entries (same languages) • Use common spelling • Abbreviations etc. go in ‘Data Element Name’
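The relation between the name example (proper noun) and the identifier example (properNoun) can be sketched as a conversion. The helper `name_to_identifier` is hypothetical and assumes that names are split on whitespace; nothing in the slides says that ISOcat derives identifiers this way.

```python
def name_to_identifier(name: str) -> str:
    """Turn a multi-word name into a camel-case identifier (illustrative only)."""
    words = name.split()
    if not words:
        raise ValueError("name must contain at least one word")
    # The first word starts with a lowercase letter; the rest of it is kept as-is.
    first = words[0][0].lower() + words[0][1:]
    # Each following word starts with a capital so the word boundary stays visible.
    rest = [word[0].upper() + word[1:] for word in words[1:]]
    return first + "".join(rest)
```

For instance, `name_to_identifier("transitive verb")` gives `transitiveVerb`, and a single-word input such as `X-qatalClause` only has its first character lowercased.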
Same name • Not really a problem, not even when both DCs come with the same profile • PositivePolarity • In general, positive polarity refers to an assertion that contains no marker of negation [Crystal 1980: 299]. (DC-3405) • The property of a word or concept to express positive sentiment (myDC-xx) • Whether you can reuse DC-3405 depends on your use of the concept!
Same name • Do not avoid reuse of a name when it is the name commonly used! • Another type of duplicate name is where one concept entails the other: • meewerkend voorwerp (indirect object) • meewerkend en belanghebbend voorwerp (indirect and benefactive object) • event (also called 'eventuality', and including 'state') • event (sister of 'state')
Identical identifiers • Identical identifiers will be accepted by the system! • There are at least 4 identifiers ‘noun’ • Rule: start with a lowercase letter • In that respect, X-qatalClause should become x-qatalClause, even when the latter exists as well • The difference is mainly to be made in the • Name • Definition • Identifier ≠ PID (persistent identifier); PIDs are unique (e.g. http://www.isocat.org/datcat/DC-1345)
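The difference between identifiers and PIDs can be sketched as follows. This is an illustration under stated assumptions: the PID prefix is copied from the example URL on the slide, the keys DC-1405 and DC-3532 are the two 'transitive verb' entries quoted earlier in the deck, and the identifier `transitiveVerb` attached to both is a placeholder, not a value checked against ISOcat.

```python
from collections import defaultdict

# Prefix taken from the slide's example PID.
PID_PREFIX = "http://www.isocat.org/datcat/"


def pid(key: str) -> str:
    """Build a PID from a DC key, assuming all PIDs follow the slide's pattern."""
    return PID_PREFIX + key


# (DC key, identifier) pairs; the identifier is a hypothetical placeholder.
ENTRIES = [
    ("DC-1405", "transitiveVerb"),
    ("DC-3532", "transitiveVerb"),
]


def group_by_identifier(entries):
    """Map each identifier to the PIDs of all DCs that carry it."""
    groups = defaultdict(list)
    for key, identifier in entries:
        groups[identifier].append(pid(key))
    return dict(groups)
```

Grouping `ENTRIES` yields one identifier with two distinct PIDs, which is why a DC should be referenced by its PID and told apart by its name and definition rather than by its identifier.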
Adoption • When (not) to adopt an existing DC • It should ‘match’ the way you use a specific notion in your annotation scheme, application, … • It should come with the same profile and type • That being said • Reuse a CLARIN NL/VL DC whenever possible (contact Ineke when such a definition is incorrect)
What defines a good DC? • Correct definition • NOT (unless all concepts defined in ISOcat) • Actor (DC-4146) • a participant in an action or process • Question: is an addressee to be considered an actor? (used in DC-4158, no proper definition yet)
What defines a good DC? • Reusable and correct definition • NOT • conversation (DC-2661) • Communication event with more than two participants • mother tongue (DC-2955) • […] a speaker’s mother tongue • neuter (myDC-XX) • In CGN the gender … / In Dutch …
What defines a good DC? • Meaningful definition • NOT • annotation format (DC-2562) • Specifies the annotation format that is used … • source language (DC-2494) • Indicates if a language is a source language • mother tongue (DC-2955) • […] a speaker’s mother tongue
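The pattern behind these examples, a definition that merely repeats the DC's own name, can be flagged with a rough heuristic. This is a minimal sketch: `repeats_name` is a hypothetical helper, a substring match is only a first filter, and a flagged definition still needs human judgment. The definitions are quoted from the slides, including their elisions.

```python
def repeats_name(name: str, definition: str) -> bool:
    """Return True if the definition contains the DC's own name (case-insensitive)."""
    return name.casefold() in definition.casefold()


# (name, definition) pairs as quoted on the slides.
EXAMPLES = [
    ("source language", "Indicates if a language is a source language"),
    ("annotation format", "Specifies the annotation format that is used …"),
    ("conversation", "Communication event with more than two participants"),
]

# Names whose definition repeats the name and therefore deserves a second look.
FLAGGED = [name for name, definition in EXAMPLES if repeats_name(name, definition)]
```

The check says nothing about whether a definition is correct: the 'conversation' definition is not flagged, although the slides criticise it on other grounds.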
Not-so-good examples • Mother tongue (DC-2955) • Specifies whether the language is a speaker’s mother tongue • Mother’s language (DC-4516) • […] NOT necessarily the mother tongue […] • There is no definition of the concept ‘mother tongue’ (Relation with /home language/, /primary language/, /heritage language/? And what about ‘father tongue’?) • And why ‘speaker’?
Rule • Make your definition • as general as possible • as specific as necessary
Linking closed - simple • Not always that simple when there are many entries within a profile • The selected profile determines the number of choices • You can order them by name, PID, owner, … • When you don’t find the ‘simple’ DC you need, look under other profiles (esp. ‘undecided’)!
Standards • Within ISOcat there are currently few or no standards • Therefore CLARIN NL and VL will set up their own set of ‘standardized DCs’; Ineke will be in charge, selecting the new flag “recommended by CLARIN NL/VL” (issue: often no correct profile is selected, so the ‘undecided’ one is still shown)
Standards • Another issue with respect to standards 'included' in ISOcat: the Athens Core DCs (recommended by metadata/CMDI) • We are to adapt them in order to avoid tautologies and/or correct smaller ‘errors’ • Target language: indicates if the language is the target language • Conversation: […] three or more participants • The same may be necessary for TEI Headers etc. • Contact Ineke in case you are not sure whether you can reuse such DCs
DC/DCS and profile • Profiles are not added automatically; a DCS may contain elements with various profiles (although you may decide to create several DCSs) (do select proper names!) • In case the profile you need is not yet available, contact Menzo and Ineke
Part B: do’s & don’ts • Do’s: • Create a DCS for your scheme (name: project, annotation scheme, …); it is to contain all your DCs (also adopted ones) • Provide a clear definition (short, to the point) for your scheme, application, … • Take care not to leave concepts used in your definition undefined or vague • Use appropriate vocabulary (per profile) • Check ‘adopted’ DCs regularly until standardization/recommendation (history button!)
Do’s (continued) When creating a DC, fill out • Justification: used in XYZ, part of tagset N • Language section • Always English language section (advice: do not create sections in other languages (NL!) before Ineke has seen your input) • Strong recommendation: sections for object language(s), for working language manual • Sections in the various languages should match (+/- be translations of each other)
Do’s (continued) When creating a DC, fill out • Example section • Note that *negative* examples may be very helpful! (for nouns (CGN): jongens, mannen; not: gelovigen (which is a form of ADJ))
Example sections Suppose you want to illustrate a German phenomenon: • Ex.sec. in EN language section • German ex with transl in English • Ex.sec. in NL language section • German ex with transl in Dutch • Ex.sec. in EN linguistic section • EN example • Ex.sec. in NL linguistic section • NL example with translation in English
Don’ts • Confuse Language and Linguistic section • Latter contains language specific values for closed domains • Be (too) language specific in definition • Mention scheme in definition • Use several definitions in one DC • Circular definitions • Rely on authority • Rely on standardized status • Definition should fit YOUR scheme, etc