Language and the Internet: Assessing Linguistic Bias. Measuring the Information Society, WSIS, Tunis, November 15, 2005. John C. Paolillo, Indiana University.
Overview
• Sources of Linguistic Bias
• Linguistic Bias: examples
  • Text Communication
  • Internet Host Names
  • Web Programming
• Global Linguistic Diversity
• Who bears the costs?
• Conclusions
Sources of Linguistic Bias (Friedman and Nissenbaum 1997)
• Pre-existing
  • originate from outside the technical system
  • National, trans-national and institutional policies
  • Technology companies
• Technical
  • are built into the technical system itself
  • Developers’ language backgrounds, national origins
  • Legacy standards, “backward” compatibility
• Emergent
  • arise in specific contexts of use of a technical system
  • Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
  • Rapid technologization
Text Communication
• Requires an encoding and its support
  • Assign code numbers to script characters
  • ASCII (American English)
  • ISO-8859-1 (Western European languages)
  • Unicode (most languages, but support is uneven)
• Support means many things
  • Fonts, rendering, sorting, spell-checking, etc.
• Computer-Mediated Communication
  • Web pages, email, chat, etc.
  • Language use is not uniform in these modes
  • Multilinguals tend to favor different languages for specific purposes
• Represents both technical and emergent biases
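The idea of an encoding as an assignment of code numbers to characters can be sketched in a few lines of Python. This is a minimal illustration, not material from the talk: the sample words and the helper name `supported_by` are assumptions chosen for the example, and UTF-8 stands in for Unicode as one of its encodings.

```python
# Which of the three encodings named on the slide can represent each sample?
# The sample words are illustrative choices, not taken from the talk.
SAMPLES = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

ENCODINGS = ["ascii", "iso-8859-1", "utf-8"]  # utf-8 is one Unicode encoding


def supported_by(text: str, encoding: str) -> bool:
    """Return True if every character of text has a code in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for language, text in SAMPLES.items():
        # Unicode code numbers assigned to each character of the sample.
        code_points = [f"U+{ord(ch):04X}" for ch in text]
        coverage = {enc: supported_by(text, enc) for enc in ENCODINGS}
        print(language, code_points, coverage)
```

Being encodable is only the first layer of support. The sketch says nothing about fonts, rendering, sorting or spell-checking, which the slide lists as separate requirements.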
Unicode Status: Examples

Language        Unicode  Browser      Script    Pop.
Chinese         yes      good         Chinese   1,240M
English         yes      good         Roman     400M
French          yes      good         Roman     81M
German          yes      good         Roman     82M
Spanish         yes      good         Roman     358M
Finnish         yes      good         Roman     5M
Russian         yes      good (late)  Cyrillic  132M
Arabic          yes      good (late)  Arabic    247M
Hindi           yes      poor         Indic     213M
Sinhala         yes      none         Indic     15M
S. Azerbaijani  no       none         Arabic    26M

Legend: Good support, Poor support, No support
Internet Host Names
• The Domain Name System
  • Uses a 30-year-old 7-bit ASCII standard
  • Now supports Punycode (an ASCII-compatible encoding of Unicode names)
  • Imposes a maximum name length
• Run by ICANN under US Dept of Commerce contract
  • More concerned with trademark protection
  • Host/domain naming is widely abused (e.g. the .tv domain)
  • Names provided by the DNS are not that useful
• An example of emergent bias
  • Technical origin
  • Economic and political forces amplify and sustain it
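How a non-ASCII host name is squeezed into the DNS's ASCII-only format can be sketched with Python's built-in `idna` codec. This is a rough illustration under stated assumptions: the codec implements the older IDNA 2003 rules, the host names are made up, and the helper names `to_ascii_hostname` and `fits_dns` are hypothetical. The 63-octet limit is the DNS limit on a single dot-separated label.

```python
# Python's built-in "idna" codec applies Punycode label by label.
# It follows the older IDNA 2003 rules; host names here are made-up examples.
MAX_LABEL_OCTETS = 63  # DNS limit on a single dot-separated label


def to_ascii_hostname(name: str) -> str:
    """Convert a Unicode host name to the ASCII form sent to the DNS."""
    return name.encode("idna").decode("ascii")


def fits_dns(name: str) -> bool:
    """Return True if the name converts and every ASCII label is short enough."""
    try:
        ascii_name = to_ascii_hostname(name)
    except UnicodeError:  # raised for unencodable, empty or over-long labels
        return False
    labels = ascii_name.split(".")
    return all(0 < len(label) <= MAX_LABEL_OCTETS for label in labels)


if __name__ == "__main__":
    for host in ["example.com", "bücher.example"]:
        print(host, "->", to_ascii_hostname(host), fits_dns(host))
```

The converted form is longer than the original label, so names written in non-ASCII scripts use up the length budget faster than ASCII names do.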
Web Programming and Unicode
• Markup & web scripting languages
  • Unicode is standard
  • Browser support, fonts, etc. lag behind
• Databases and development environments tend to lack proper Unicode support
  • End-user oriented, not programmer oriented
• All of the most important technologies are Open-Source software (FLOSS)
  • User extensible/modifiable
  • Language localization of these is possible but rare
Linguistic Bias in Web Programming
• English is the source language for most programming & markup languages
  • Keywords
  • Operator-argument order
  • Programming constructs, etc.
• Programming as a linguistic act
  • Complex concepts are rendered into text
  • Different languages have different ways of doing this
• Emergent language biases
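The point about keywords can be illustrated in Python itself, which is used here only as a convenient example language and is not discussed in the talk. Python 3 accepts identifiers in many scripts, but its reserved words remain fixed English words. The Russian identifiers below are illustrative assumptions ("число" means number, "квадрат" means square).

```python
import keyword

# Python's reserved words are a fixed list of English words.
reserved = keyword.kwlist


# Identifiers, by contrast, may be written in many scripts.
# Note that "def" and "return" still have to be typed in English.
def квадрат(число):
    return число * число


def keywords_are_ascii() -> bool:
    """Return True if every reserved word uses only ASCII characters."""
    return all(word.isascii() for word in reserved)


if __name__ == "__main__":
    print(reserved[:8])
    print(keywords_are_ascii(), квадрат(4))
```

A programmer can therefore name things in their own language, but the constructs that give the program its structure stay in English.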
Linguistic Properties of Programming
• LISP
  • Predicates precede their arguments
  • Like Arabic, Celtic, Hebrew, etc.
    (defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1)))))
• PostScript
  • Predicates follow their arguments
  • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.
    /factorial { dup 1 gt { dup 1 sub factorial mul } if } def
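The contrast between operator-first (LISP-like) and operator-last (PostScript-like) order can be sketched with two small evaluators. This is a simplified illustration, not an implementation of either language: it handles only space-separated integer tokens and three binary operators, and the function names are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_prefix(tokens):
    """Evaluate operator-first notation, e.g. ['*', '3', '-', '5', '1']."""
    def parse(pos):
        token = tokens[pos]
        if token in OPS:
            # The operator comes first, then its two arguments.
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return OPS[token](left, right), pos
        return int(token), pos + 1

    value, end = parse(0)
    if end != len(tokens):
        raise ValueError("unused tokens after expression")
    return value


def eval_postfix(tokens):
    """Evaluate operator-last notation, e.g. ['3', '5', '1', '-', '*']."""
    stack = []
    for token in tokens:
        if token in OPS:
            # The arguments are already on the stack when the operator arrives.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(int(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


if __name__ == "__main__":
    print(eval_prefix("* 3 - 5 1".split()))   # same computation,
    print(eval_postfix("3 5 1 - *".split()))  # opposite word order
```

Both calls compute 3 × (5 − 1). Only the position of the operator relative to its arguments differs, which is the property the slide compares to word order in natural languages.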
The Linguistic Digital Divide
• Language issues go beyond content
• WSIS repeatedly re-affirms principles of
  • Transparency
  • Self-determination
  • Open access to participation for all parties
• These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a native-like way
• The linguistic divide has broader consequences
• Costs are borne in
  • Education: great for non-English-speaking people
  • Technical development: small, in comparison (there is a trade-off)
Language Diversity: Who bears the costs?
(source data: www.ethnologue.com)
• A typical language group has around 10-50 thousand people
• 80% of language groups have fewer than 100 thousand members
(source data: www.ethnologue.com)
• 90% of the world’s population belongs to a language group with at least 1 million people (416 groups)
• Many languages with hundreds of millions of speakers lack adequate support
Conclusions
• Linguistic Bias is manifest in many ways
  • Technical biases are sometimes overt
  • Emergent biases can be subtle
• All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
• Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power
Language Diversity On The Internet
Linguistic Diversity Based on Entropy
$$\text{Diversity} = -2 \sum_i p_i \ln p_i$$
where $p_i$ is the proportion of individuals in language category $i$.
Diversity is the long-run per-individual average variance in language category (similar to log-likelihood).
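The entropy-based measure can be computed directly from group sizes. The sketch below assumes that p_i is the share of individuals in language category i and that empty categories contribute nothing to the sum. The function name `diversity` and the example counts are hypothetical, not data from the talk.

```python
import math


def diversity(counts):
    """Entropy-based diversity: -2 * sum(p_i * ln p_i) over language groups."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Empty groups are skipped: the term p * ln p is taken as 0 when p = 0.
    proportions = [c / total for c in counts if c > 0]
    return -2 * sum(p * math.log(p) for p in proportions)


if __name__ == "__main__":
    # Hypothetical speaker counts for four language groups.
    print(diversity([25, 25, 25, 25]))  # evenly spread population
    print(diversity([97, 1, 1, 1]))     # one dominant language
```

A population concentrated in a single language scores zero, and the score rises as speakers are spread more evenly across more languages.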