
Language and the Internet: Assessing Linguistic Bias

Language and the Internet: Assessing Linguistic Bias. Measuring the Information Society, WSIS, Tunis, November 15, 2005. John C. Paolillo, Indiana University. Overview: Sources of Linguistic Bias; Linguistic Bias: examples; Text Communication; Internet Host Names; Web Programming.


Presentation Transcript


  1. Language and the Internet: Assessing Linguistic Bias • Measuring the Information Society, WSIS, Tunis, November 15, 2005 • John C. Paolillo, Indiana University

  2. Overview • Sources of Linguistic Bias • Linguistic Bias: examples • Text Communication • Internet Host Names • Web Programming • Global Linguistic Diversity • Who bears the costs? • Conclusions

  3. Sources of Linguistic Bias (Friedman and Nissenbaum 1997) • Pre-existing biases originate from outside the technical system: national, trans-national and institutional policies; technology companies • Technical biases are built into the technical system itself: developers' language backgrounds and national origins; legacy standards and "backward" compatibility • Emergent biases arise in specific contexts of use of a technical system: economics of the technology industry (marketing, monopoly power, unstable markets, etc.); rapid technologization

  4. Text Communication • Requires an encoding and its support • An encoding assigns code numbers to script characters: ASCII (American English), ISO-8859-1 (Western European languages), Unicode (most languages, but support is uneven) • Support means many things: fonts, rendering, sorting, spell-checking, etc. • Computer-Mediated Communication: web pages, email, chat, etc. • Language use is not uniform in these modes • Multilinguals tend to favor different languages for specific purposes • Represents both technical and emergent biases
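As a rough illustration of the encoding point on this slide, the Python sketch below checks whether a few sample words can be represented at all under ASCII, ISO-8859-1 and UTF-8 (one encoding form of Unicode). The sample words and the helper names `encodable` and `support_table` are assumptions made for this example; they do not come from the presentation.

```python
# Sample words are illustrative choices, not data from the presentation.
SAMPLES = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

ENCODINGS = ["ascii", "iso-8859-1", "utf-8"]


def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text has a code in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def support_table(samples=SAMPLES, encodings=ENCODINGS):
    """Map each language name to {encoding: can the sample be encoded?}."""
    return {
        lang: {enc: encodable(word, enc) for enc in encodings}
        for lang, word in samples.items()
    }


if __name__ == "__main__":
    # Only ASCII language names and booleans are printed.
    for lang, row in support_table().items():
        print(lang, row)
```

Being encodable is only the first of the support layers listed on the slide: fonts, rendering, sorting and spell-checking are separate questions that this check does not address.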

  5. Unicode Status: Examples

     | Language | Unicode | Browser support | Script | Pop. |
     |---|---|---|---|---|
     | Chinese | yes | good | Chinese | 1,240M |
     | English | yes | good | Roman | 400M |
     | French | yes | good | Roman | 81M |
     | German | yes | good | Roman | 82M |
     | Spanish | yes | good | Roman | 358M |
     | Finnish | yes | good | Roman | 5M |
     | Russian | yes | good (late) | Cyrillic | 132M |
     | Arabic | yes | good (late) | Arabic | 247M |
     | Hindi | yes | poor | Indic | 213M |
     | Sinhala | yes | none | Indic | 15M |
     | S. Azerbaijani | no | none | Arabic | 26M |

     Legend: Good support, Poor support, No support

  6. Internet Host Names • The Domain Name System • Uses a 30-year-old 7-bit ASCII standard • Now supports Punycode (an ASCII-compatible encoding of Unicode) • Imposes a maximum name length • Run by ICANN under US Dept of Commerce contract • More concerned with trademark protection • Host/domain naming is widely abused (e.g. the .tv domain) • Names provided by the DNS are not that useful • An example of emergent bias • Technical origin • Economic and political forces amplify and sustain it
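To make the Punycode point concrete, here is a minimal Python sketch that converts a Unicode host name to the ASCII form the DNS actually carries. The host name `bücher.example` and the helper names are hypothetical, chosen only for illustration; the conversion relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so treat it as a sketch rather than a complete internationalized-domain-name implementation.

```python
# Hypothetical host name chosen for illustration only.
HOST = "bücher.example"

MAX_LABEL_OCTETS = 63  # DNS limit on the length of a single label


def to_ascii_host(host: str) -> str:
    """Convert each label of a Unicode host name to its ASCII form.

    Uses Python's built-in "idna" codec (IDNA 2003), which applies
    Punycode and adds the "xn--" prefix to non-ASCII labels.
    """
    return host.encode("idna").decode("ascii")


def label_lengths(ascii_host: str) -> dict:
    """Return the length in octets of every label in an ASCII host name."""
    return {label: len(label) for label in ascii_host.split(".")}


if __name__ == "__main__":
    ascii_host = to_ascii_host(HOST)
    print(ascii_host)
    print(label_lengths(ascii_host))
```

The encoded label is longer than the name the user sees (6 characters become 13 octets here), so non-ASCII names can run into the length limit sooner than ASCII names of the same visible length.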

  7. Web Programming and Unicode • Markup & web scripting languages • Unicode is standard • Browser support, fonts, etc. lag behind • Databases and development environments tend to lack proper Unicode support • End-user oriented, not programmer oriented • All of the most important technologies are Open-Source software (FLOSS) • User extensible/modifiable • Language localization of these is possible but rare

  8. Linguistic Bias in Web Programming • English is the source language for most programming & markup languages • Keywords • Operator-argument order • Programming constructs, etc. • Programming as a linguistic act • Complex concepts are rendered into text • Different languages have different ways of doing this • Emergent language biases

  9. Linguistic Properties of Programming • LISP • Predicates precede their arguments • Like Arabic, Celtic, Hebrew, etc. • Example: (defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1))))) • PostScript • Predicates follow their arguments • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc. • Example: /factorial { dup 1 gt { dup 1 sub factorial mul } if } def
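The contrast between the two word orders can be sketched in Python with two tiny evaluators: one for nested operator-first expressions (in the manner of LISP) and one for a flat operator-last token list evaluated with a stack (in the manner of PostScript). The evaluator names, the operator table and the sample expression are assumptions made for this example and cover only three binary arithmetic operators.

```python
import operator

# Operator names are illustrative; they loosely follow PostScript's words.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def eval_prefix(expr):
    """Evaluate a nested (operator, arg1, arg2) tuple, operator first."""
    if not isinstance(expr, tuple):
        return expr  # a plain number evaluates to itself
    name, left, right = expr
    return OPS[name](eval_prefix(left), eval_prefix(right))


def eval_postfix(tokens):
    """Evaluate a flat token list with a stack, operator last."""
    stack = []
    for token in tokens:
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(token)  # operands wait on the stack
    return stack.pop()


# The same computation, 5 * (5 - 1), written in both word orders.
PREFIX = ("mul", 5, ("sub", 5, 1))
POSTFIX = [5, 5, 1, "sub", "mul"]

if __name__ == "__main__":
    print(eval_prefix(PREFIX), eval_postfix(POSTFIX))
```

Both evaluators compute the same value; only the position of the operator relative to its arguments differs, which is the property the slide compares to verb-initial and verb-final natural languages.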

  10. The Linguistic Digital Divide • Language issues go beyond content • WSIS repeatedly re-affirms principles of • Transparency • Self-determination • Open access to participation for all parties • These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like • The linguistic divide has broader consequences • Costs are borne in • Education: great costs for non-English-speaking people • Technical development: small costs, in comparison (there is a trade-off)

  11. Language Diversity Who bears the costs?

  12. (source data: www.ethnologue.com) • A typical language group has around 10-50 thousand people • 80% of language groups have fewer than 100 thousand members

  13. (source data: www.ethnologue.com) • 90% of the world's population belongs to a language group with at least 1 million people (416 groups) • Many languages with hundreds of millions of speakers lack adequate support

  14. (source data: www.ethnologue.com)

  15. Conclusions • Linguistic Bias is manifest in many ways • Technical biases are sometimes overt • Emergent biases can be subtle • All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS • Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power

  16. Language Diversity On The Internet

  17. Global Reach

  18. Linguistic Diversity • Based on entropy: $\mathrm{Diversity} = -2 \sum_i p_i \ln p_i$, where $p_i$ is the proportion of individuals in language category $i$ • Diversity is the long-run per-individual average variance in language category (similar to log-likelihood)
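The entropy-based measure on this slide can be computed directly from counts per language. The sketch below is a minimal Python version, assuming that each p_i is a language's share of the total count; the function name `diversity` and the sample counts are hypothetical and are not data from the presentation.

```python
import math


def diversity(counts):
    """Compute -2 * sum(p_i * ln p_i) over language shares p_i.

    counts: mapping from language name to a non-negative count
    (speakers, pages, hosts, ...). Zero counts contribute nothing.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    shares = (n / total for n in counts.values() if n > 0)
    return -2.0 * sum(p * math.log(p) for p in shares)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(diversity({"A": 900, "B": 50, "C": 50}))
    print(diversity({"A": 1, "B": 1, "C": 1}))
```

Under this formula a population with a single language scores 0, and k equally sized language groups score 2 ln k, so the value grows with both the number of languages and the evenness of their shares.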

  19. O’Neill, Lavoie and Bennett, 2003

  20. www.isc.org/ds

  21. www.isc.org/ds

  22. www.isc.org/ds, ITU

  23. ITU

  24. www.isc.org/ds

  25. www.isc.org/ds

  26. www.isc.org/ds, UNPD

  27. ITU, UNPD

  28. ITU

  29. www.isc.org/ds, ITU
