Language Identification and IT Peter Constable and Gary Simons SIL International peter_constable@sil.org gary_simons@sil.org www.sil.org
Language identification • The use of identifying codes for tagging information objects to indicate the language in which the information is expressed, e.g. <body xml:lang="en"> 17th International Unicode Conference San Jose, CA September 2000
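As an illustration of how such a tag might be consumed, the following is a minimal Python sketch, not part of the original talk. It assumes the standard-library `xml.etree.ElementTree` parser; the sample document and the helper name `effective_languages` are hypothetical. It reports the language in effect for each element, letting children inherit `xml:lang` from their ancestors.

```python
import xml.etree.ElementTree as ET

# The "xml:" prefix is bound to this namespace by the XML specification;
# ElementTree exposes xml:lang under this expanded attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(xml_text):
    """Yield (element tag, language) pairs; xml:lang is inherited from ancestors."""
    root = ET.fromstring(xml_text)

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        yield elem.tag, lang
        for child in elem:
            yield from walk(child, lang)

    yield from walk(root, None)


# Hypothetical sample document, for illustration only.
sample = '<doc xml:lang="en"><p>Hello</p><p xml:lang="fr">Bonjour</p></doc>'
pairs = list(effective_languages(sample))
print(pairs)
```

The second paragraph overrides the document-level tag, while the first inherits it.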
Language identification • Not considering automated language detection • Considering only language identifiers, not identifiers for paralinguistic notions such as writing system or locale
About the Ethnologue • SIL Ethnologue • catalogue of all modern languages in the world • lists over 6,800 living languages • result of decades of research • system of three-letter codes • http://www.sil.org/ethnologue
About the Ethnologue • Existing user base for Ethnologue codes: • SIL • UNESCO • Linguistic Data Consortium (850+ agencies) • The Linguist List (12,500 individual linguists) • The Endangered Language Fund • others
Linguistic diversity • Number of languages by region: Europe: 237; Asia: 2,202; Africa: 2,062; Americas: 1,020; Pacific: 1,312
Motivation for this paper • Languages covered by standards • ISO 639-x covers approx. 400 languages • existing needs go much further: over 6,800 languages • immediate need among linguists and other researchers for use in XML
Five issues • Change • Categorization • Inadequate definition • Scale • Documentation
The need for language identifiers • Language-specific processing • spell-checking • sorting • morphological parsing • speech recognition/synthesis • language-specific typographic behaviour • etc.
The need for language identifiers • Language-specific processing • choosing appropriate resources Los eventos deportivos pra la juventud Los eventos deportivos pra la juventud ህ ጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ። ህ ጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ።
The need for language identifiers • Two distinct issues: • identify the language • apply the specific processing for that language
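The separation between identifying the language and applying language-specific processing can be sketched in Python as follows. This is only an illustration under stated assumptions: the word lists are toy data invented here, and `WORD_LISTS` and `misspelled` are hypothetical names, not part of any real spell-checking library. The language tag arrives as a parameter (the identification step) and merely selects which resource is applied (the processing step).

```python
# Toy word lists keyed by language tag (hypothetical data);
# a real system would load full dictionaries instead.
WORD_LISTS = {
    "es": {"los", "eventos", "deportivos", "para", "la", "juventud"},
    "en": {"the", "sporting", "events", "for", "youth"},
}


def misspelled(text, lang):
    """Return the words of `text` missing from the word list selected by `lang`."""
    try:
        words = WORD_LISTS[lang]
    except KeyError:
        raise LookupError(f"no spelling resource for language tag {lang!r}") from None
    return [w for w in text.lower().split() if w not in words]


# The tag, not the text itself, decides which resource is consulted.
flagged = misspelled("Los eventos deportivos pra la juventud", "es")
print(flagged)  # ['pra']
```

With the wrong tag (for example "en"), every word of the same sentence would be flagged, which is why the tag has to be reliable.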
The need for language identifiers • Language detection • identify language by inspection of data itself • available only for a few languages • not practical for searching large corpora (e.g. the Internet) • doesn’t work on short text segments, e.g. She said, “chat”.
The need for language identifiers • Language-specific processing • in general, must tag information objects to indicate language • identifiers are needed to distinguish every language
Issue #1: change • Languages are constantly changing • Implications: • systems of language tags cannot be static • the speech variety (or varieties) denoted by a tag is time-bound: “English” c. 1700 A.D. ≠ “English” c. 2000 A.D.
Issue #2: categorization • Typical question: Are Serbian and Croatian the same language, or different languages? • Operational definitions of language • many different ways to formulate a definition • different definitions create different categorizations • different categorizations serve different purposes
Issue #3: inadequate definition • Existing systems do not consistently employ a single operational definition • ISO 639-2: codes for “languages” and for groups of languages: nav = Navajo; ath = Athapascan languages • ISO 639-2: some “languages” are groups of languages: que = “Quechua” (47 distinct languages)
Issue #3: inadequate definition • Consistent use of a single definition in a given namespace is beneficial • Objection: “Requiring a single definition imposes too much constraint on users” • users may legitimately have different requirements • but a lack of control results in confusion, especially when thousands of identifiers are added
Issue #4: Scale • The number of languages exceeds the coverage of existing systems by an order of magnitude (400 vs. 6,800) • Existing systems do not scale well
Issue #4: Scale • ISO 639-x • slow process unable to cope with large volume of requests • minimum attestation requirement (50 documents) not appropriate for lesser-known languages • mnemonic codes (impossible for thousands of languages) • confusion due to inconsistent definition
Issue #4: Scale • RFC 1766 • process unable to cope with large volume of requests • confusion due to inconsistent definition • unclear how to create tags
Issue #5: documentation • Existing systems: can’t tell what codes denote • ISO 639-x: language, or group of languages? ara, “Arabic”: Standard only? all variants? • ISO 639-x: which of several alternate possibilities? bin, “Bini”: = dialect of Yoruba (Nigeria; 20,000,000); = dialect of Anyin (Côte d'Ivoire; 810,000); = alternate name for Edo (Nigeria; 1,000,000); = alternate name for Pini (Australia; dying)
Issue #5: documentation • ISO 639-x: 2- vs. 3-letter codes • st, “Sesotho”: = nso, “Sotho, Northern”? = sot, “Sotho, Southern”? = both? • to, “Tonga”: = tog, “Tonga (Nyasa)”? = ton, “Tonga (Tonga Islands)”?
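The practical consequence of this ambiguity can be sketched in Python. The table below is an assumption for illustration: it records only the candidate three-letter codes raised as questions on this slide, plus one unambiguous pair (en/eng) added here for contrast. It is not an authoritative ISO mapping, and `CANDIDATES`, `AmbiguousCodeError` and `to_three_letter` are hypothetical names.

```python
class AmbiguousCodeError(ValueError):
    """Raised when the documentation does not say which language a code denotes."""


# Candidate three-letter codes per two-letter code. The "st" and "to" rows
# repeat the questions on the slide; "en" is added here only for contrast.
CANDIDATES = {
    "st": ["nso", "sot"],
    "to": ["tog", "ton"],
    "en": ["eng"],
}


def to_three_letter(code):
    """Map a two-letter code to a three-letter code, refusing to guess."""
    candidates = CANDIDATES[code]  # KeyError for unknown codes
    if len(candidates) != 1:
        raise AmbiguousCodeError(
            f"{code!r} could denote any of {candidates}; documentation needed"
        )
    return candidates[0]


print(to_three_letter("en"))
try:
    to_three_letter("st")
except AmbiguousCodeError as err:
    print("ambiguous:", err)
```

Without documentation of what each code denotes, software can only fail or guess at this point.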
Solving these problems • Requirements of an adequate system: • able to scale • able to deal with change, track history of change • use a single operational definition for a given namespace • apply definition consistently within a namespace • complete, maintained, online documentation
What the Ethnologue offers • Scale: already there • enumeration of languages • set of three-letter codes • Change: careful management • no re-use of codes • have begun recording revision history
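The two change-management rules named here (no re-use of codes, recorded revision history) can be sketched as a small Python data structure. This is a hypothetical illustration, not a description of how the Ethnologue is actually maintained; the class `CodeRegistry` and its methods are invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRegistry:
    """Hypothetical registry: codes are never re-used and every change is logged."""

    active: dict = field(default_factory=dict)   # code -> language name
    retired: set = field(default_factory=set)    # codes withdrawn from use
    history: list = field(default_factory=list)  # (action, code, name) records

    def assign(self, code, name):
        # A code that was ever assigned, even if since retired, stays reserved.
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append(("assign", code, name))

    def retire(self, code):
        name = self.active.pop(code)  # KeyError if the code is not active
        self.retired.add(code)
        self.history.append(("retire", code, name))


registry = CodeRegistry()
registry.assign("hop", "Hopi")
registry.retire("hop")
try:
    registry.assign("hop", "Some other language")
except ValueError as err:
    print("rejected:", err)
print(registry.history)
```

Because retired codes stay reserved, data tagged in the past never silently changes meaning.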
What the Ethnologue offers • Definition: single definition, applied quite consistently • definition: primary criterion of mutual non-intelligibility as a basis for identifying candidates for separate literacy and literature • all categories are of the same type; no language families, groups, writing systems
What the Ethnologue offers • Documentation • extensive information maintained for every language • new site will provide various reports • alternate names, location, population, etc. • related ISO codes, relationship • return Ethnologue data given an ISO code • evaluating possibilities for returning results as XML
Integration with RFC 1766, XML • Ethnologue codes immediately available using the private-use prefix "x-", e.g. Hopi: <body xml:lang="x-hop"> or <body xml:lang="x-sil-hop"> • private-use tags not ultimately satisfactory
Integration with RFC 1766, XML • Register thousands of new tags with IANA • process would not be able to cope • problems devising that many tags • create considerable confusion in the single namespace
Integration with RFC 1766, XML • Register "i-sil-" to specify a namespace maintained by a particular agency • <body xml:lang="i-sil-hop"> • deals with scale • creates a namespace with a particular definition that is consistently applied • avoids confusion of having a single namespace for all needs • allows alternate namespaces
Integration with RFC 1766, XML • Possible refinement: define primary tag "n-", e.g. <body xml:lang="n-sil-hop"> • first sub-tag identifies a registered namespace of identifiers • each namespace provides its own operational definition(s) • "i-" usage more consistent (languages only) • "i-" specifies a privileged namespace (doesn’t require "n-")
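The tag shapes discussed on these slides ("x-hop", "x-sil-hop", "i-sil-hop", "n-sil-hop") could be interpreted along the following lines. This is a minimal Python sketch under stated assumptions: "x" and "i" are the private-use and IANA-registered primary tags of RFC 1766, whereas "n" is only the refinement proposed in this talk. The names `ParsedTag`, `PRIMARY_KINDS` and `parse_tag` are hypothetical, and the function does not validate tags against the full RFC 1766 grammar.

```python
from typing import NamedTuple, Optional

# "x" and "i" come from RFC 1766; "n" is the proposal made on the slide.
PRIMARY_KINDS = {"x": "private-use", "i": "iana-registered", "n": "namespace"}


class ParsedTag(NamedTuple):
    kind: str
    namespace: Optional[str]  # e.g. "sil"; None when no namespace sub-tag is given
    code: str                 # the language code inside that namespace


def parse_tag(tag: str) -> ParsedTag:
    """Split a language tag into kind, namespace and code (sketch only)."""
    parts = tag.lower().split("-")
    kind = PRIMARY_KINDS.get(parts[0])
    if kind is None:
        # Ordinary tag such as "en": the primary sub-tag is the language code.
        return ParsedTag("standard", None, parts[0])
    if len(parts) == 3:
        return ParsedTag(kind, parts[1], parts[2])
    if len(parts) == 2 and kind != "namespace":
        return ParsedTag(kind, None, parts[1])
    raise ValueError(f"cannot interpret tag {tag!r}")


for example in ("x-hop", "x-sil-hop", "i-sil-hop", "n-sil-hop"):
    print(example, parse_tag(example))
```

Under the proposed scheme the first sub-tag after "n-" always names the namespace, so "n-hop" is rejected rather than guessed at.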
Conclusions • Language identifiers are required for language-specific processing • Immediate need for thousands of new language identifiers, in particular for use in XML • Five problem areas need to be considered in any system • SIL Ethnologue codes address all five problems • Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits