Asian Languages on the Web. S. T. Nandasara, Lecturer, USCS, University of Colombo, Sri Lanka. Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan. Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan.
Asian Languages on the Web • Introduction of Asian Languages • Survey Objectives and Methodology • Asian Language Presence on the Web • Multilingualism in the Asian Web • Script and Encoding Issues • Asian Language Resource Network (ALRN) Project
Survey Objectives • Give an overview of Asian languages on the web • Describe the state of multilingualism in Asian country domains • Multilingualism is defined at various levels, from a personal or document level to a societal level • Multiple language presence in each country domain • Give an overview of cross-border languages • Shed light on script and encoding issues of Asian languages • To what extent is UCS/Unicode employed for Asian languages? • What scripts are actually used to represent a specific language? • To what extent are locally developed encodings used? • Define a future agenda that can guide us in realizing the vision of an observation-collection instrument for Asian languages.
Survey Methodology • Used a web crawler (UbiCrawler) • It traces links within pages and recursively crawls the newly discovered pages • The collection of downloaded web pages was passed to the language identification engine • The language properties of the pages were identified
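The link-tracing step described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not the crawler actually used in the survey: the function names (`crawl`, `toy_fetch_links`, `in_lk`), the `max_pages` limit and the in-memory `TOY_WEB` link graph are all assumptions made for the example, and no real network access takes place.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, fetch_links, in_scope, max_pages=1000):
    """Visit the seeds, then every newly discovered in-scope link, breadth first."""
    seen = set()
    queue = deque()
    for url in seeds:
        if in_scope(url) and url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # Trace the links found in this page and schedule the new ones.
        for link in fetch_links(url):
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web": page URL -> links found on that page.
TOY_WEB = {
    "http://a.example.lk/": ["http://a.example.lk/news", "http://b.example.com/"],
    "http://a.example.lk/news": ["http://c.example.lk/"],
    "http://c.example.lk/": ["http://a.example.lk/"],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def in_lk(url):
    """Keep the crawl inside a single country domain (here .lk)."""
    host = urlparse(url).hostname or ""
    return host.endswith(".lk")


pages = crawl(["http://a.example.lk/"], toy_fetch_links, in_lk)
print(pages)
```

In a real crawl, `fetch_links` would download and parse each page, and the list of visited pages would be handed to the language identification engine.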
Web Pages Collected • Focused on web pages in 42 country domains in Asia • The crawl began from a seed file containing 13,286 URLs • The list of ccTLDs contains ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye • The Asia crawl started on 5 July 2006 at 11:00 and ended on 19 July 2006 at 19:03 • Downloaded 107,141,679 web pages in total, 652,710,237,381 bytes in size
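The figures on this slide can be turned into a small scope check and some back-of-the-envelope arithmetic. The sketch below is illustrative only: the helper name `in_survey_scope` is an assumption, and the throughput figure assumes the crawler ran continuously between the stated start and end times, which the slide does not say.

```python
from datetime import datetime
from urllib.parse import urlparse

# The 42 ccTLDs listed on the slide.
CCTLDS = frozenset(
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk "
    "mm mn mv my np om ph pk ps qa sa sg sy th tj tm tp tr uz vn ye".split()
)


def in_survey_scope(url):
    """True if the URL's host ends in one of the surveyed country domains."""
    host = (urlparse(url).hostname or "").lower()
    return host.rsplit(".", 1)[-1] in CCTLDS


TOTAL_PAGES = 107_141_679
TOTAL_BYTES = 652_710_237_381
start = datetime(2006, 7, 5, 11, 0)
end = datetime(2006, 7, 19, 19, 3)

duration = end - start
avg_page_bytes = TOTAL_BYTES / TOTAL_PAGES
# Assumes uninterrupted crawling over the whole interval.
pages_per_second = TOTAL_PAGES / duration.total_seconds()

print(len(CCTLDS), duration, round(avg_page_bytes), round(pages_per_second, 1))
```

Under these assumptions the crawl lasted a little over 14 days, with an average page size of roughly 6 KB and a rate on the order of 86 pages per second.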
Language Identification Process • The language identification engine LIM (Language Identification Module) was used • LIM consists of two components • The first is the training component • Training data consists of translations of the Universal Declaration of Human Rights (UDHR) provided by the Office of the United Nations High Commissioner for Human Rights • The second is the identification component • LIM can simultaneously detect the triplet of language, script and encoding scheme
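One way to picture a two-component identifier that returns a (language, script, encoding) triplet is a byte n-gram profile per triplet. The slides do not describe LIM's internal algorithm, so everything below is an assumption made for illustration: the trigram overlap score, the function names, and the short toy sentences that stand in for the UDHR training texts.

```python
from collections import Counter


def byte_ngrams(data, n=3):
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def train(samples):
    """Training component: one n-gram profile per (language, script, encoding)."""
    return {
        (lang, script, enc): byte_ngrams(text.encode(enc))
        for (lang, script, enc), text in samples.items()
    }


def identify(page, profiles):
    """Identification component: return the triplet whose profile overlaps most."""
    grams = byte_ngrams(page)
    total = max(1, sum(grams.values()))

    def score(profile):
        # A Counter returns 0 for n-grams it has never seen.
        return sum(min(count, profile[g]) for g, count in grams.items()) / total

    return max(profiles, key=lambda triplet: score(profiles[triplet]))


# Toy training sentences (assumed stand-ins, not the actual UDHR texts).
TURKISH = "Bütün insanlar özgür, onur ve haklar bakımından eşit doğarlar."
HINDI = "सभी मनुष्य स्वतंत्र और समान हैं"

profiles = train({
    ("Turkish", "Latin", "utf-8"): TURKISH,
    ("Turkish", "Latin", "iso-8859-9"): TURKISH,
    ("Hindi", "Devanagari", "utf-8"): HINDI,
})

legacy_page = "insanlar eşit doğarlar".encode("iso-8859-9")
print(identify(legacy_page, profiles))
```

Working on raw bytes rather than decoded characters is what lets the same text in two encodings be told apart: the same Turkish sentence produces different byte sequences in UTF-8 and in ISO-8859-9, so each encoding gets its own profile.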
Discovered 55 Asian languages • Chinese, Japanese and Korean are excluded from the analysis • Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani
Multilingualism by Country Domain • The most recent version of Ethnologue lists close to seven thousand languages around the world. • More than 2,600 of them are spoken in the Asian region. • Large-scale linguistic diversity is observable in Asia. Among the 2,600, only around 51 languages are recognized by Asian governments as official or national language(s). • Indonesia has the richest diversity of languages in the region. • It is interesting to note that there is a significantly larger number of pages in Javanese than in either Indonesian or Malay. • The major languages found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and the Philippines can be categorized under a single root language, Malay, spoken in different dialects. • Javanese has a dominant web presence in Indonesia. • The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet.
Cross-Border Languages • Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web • Two categories of languages are defined • The first category is “local languages”, which are the officially recognized language(s) of the state and the languages its people speak at home • The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used as languages of communication among the peoples of different nations
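The two categories above can be applied to per-domain page counts as in the following sketch. It is illustrative only: the precedence rule (a language that is local to the state counts as local even if it also appears in the cross-border list), the "other" bucket, the function names and the page counts are all assumptions, and the numbers are made up rather than taken from the survey.

```python
# Cross-border languages named on the slide.
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}


def classify(language, local_languages):
    """Assumed rule: local status takes precedence over cross-border status."""
    if language in local_languages:
        return "local"
    if language in CROSS_BORDER:
        return "cross-border"
    return "other"


def presence_shares(page_counts, local_languages):
    """Share of a domain's pages in each category of language."""
    shares = {"local": 0.0, "cross-border": 0.0, "other": 0.0}
    total = sum(page_counts.values())
    if total == 0:
        return shares
    for language, pages in page_counts.items():
        shares[classify(language, local_languages)] += pages / total
    return shares


# Hypothetical page counts for one country domain (NOT survey results).
toy_counts = {"Sinhala": 20, "Tamil": 10, "English": 70}
shares = presence_shares(toy_counts, local_languages={"Sinhala", "Tamil"})
print(shares)
```

The precedence rule matters for a language such as Arabic, which the slide lists as cross-border but which is also a local language in several West Asian domains.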
Cross-Border Language Presence West Asia
Same Script Shared by Various Languages • Devanagari script is used by: • More than 480 million speakers: Hindi • More than 10 million speakers: Marathi, Nepali • More than 1 million speakers: Awadhi, Bhojpuri, Braj Bhasha, Chhattisgarhi, Konkani, Kachchi, Marwari, Maithili, Magahi • Less than 1 million speakers: Kului, Kumaoni, Khadiya, Khortha, Kurku, Kurukh, Kurmali, Palpa, Panchpargania, Santali, Nagpuri, Kankan, Limbu, Sherpa, Garhwali, Mundari, Newari, Begheli, Bhatneri, Bathi, Bateri, Bhili, Gondi, Jaipuri, Harauti, Ho, Kachchhi, Kanauji, Khorthi • Scholars’ language: Sanskrit
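Because so many languages share Devanagari, detecting the script of a page does not by itself identify the language. The sketch below illustrates this with Python's standard `unicodedata` module. Taking the first word of each character's Unicode name as its script label is a simplifying assumption, and the function name and sample words (the language names written in Devanagari) are chosen for the example.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script label among letters and combining marks."""
    labels = Counter()
    for ch in text:
        # Only letters (L*) and marks (M*) carry script information here.
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                # Assumption: the first word of the Unicode name names the script,
                # e.g. "DEVANAGARI LETTER HA" -> "DEVANAGARI".
                labels[name.split()[0]] += 1
    return labels.most_common(1)[0][0] if labels else None


samples = {
    "Hindi": "हिन्दी",
    "Marathi": "मराठी",
    "Nepali": "नेपाली",
}

for language, word in samples.items():
    print(language, dominant_script(word))
```

All three samples come back with the same script label, which is why a language identifier needs more than the script to separate Hindi, Marathi and Nepali.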
UDHR Document by Major Script Grouping • Representation of the UDHR document by major script grouping • [1] Cumulated speaker population based on Ethnologue, “Languages of the World”, 15th ed. (2005)
Asian Language Resources - Agenda • ALRN Mission • To create a network of qualified Asian partners to specify and support the development of high-priority Language Resources (LRs) for Asian languages in a systematic, standards-driven, collaborative and learning context. • The project will focus on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language research, industry and communication players, and establishing a protocol and standards for developing an LR network for the languages spoken in the region.
ALRN Action Plan • The project will focus on South, South East, Central & West Asian languages • Act as an umbrella for Asian Language Resources (ALR) • Accommodate secure and sustainable UTF-based encoding • Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL) • Corpus collection from the web using LO’s crawler/language identifier • Language resources originating from Japan, with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.) • Multilingual terminology dictionary • Information standards for language corpus building • Liaison with international organizations such as UNESCO, UDHR, etc. • Information resource sharing web site (www.language-resource.net) • Asian Academy of Languages …?
Thank you • Danke schön • Merci • Gracias • Obrigado • Grazie • Danke • Spasibo • Ευχαριστώ
Language Presence in Asian Countries (The number of languages may never be determined exactly)
Language Diversity (Half of the world’s languages are spoken in only eight countries)
Asian Language Resources Network - Agenda • Will cover: • 4 Asian regions (West, Central, South & South East Asia) • 42 countries • 9 language families • 62 languages • 18 major scripts