GSK: Development and Distribution of Resources
Licensing and Distribution of Resources and Applications
Hitoshi ISAHARA
GSK: Gengo Shigen Kyokai (Language Resource Association)
National Institute of Information and Communications Technology (NICT)
Organizing Creation & Utilization of Language Corpora
Creating language corpora incurs costs.
Utilizing them requires a system for distributing corpora.
Some such activities started in the early 1990s.
1992: LDC in the U.S.A.
1995: ELRA in Europe
Regional Conference on Localized ICT Development and Dissemination across Asia, Jan. 15, Vientiane, Laos
Japanese Activities
GSK: Gengo Shigen Kyokai (Language Resource Association)
Launched in 1999.
Reorganized as an NPO in 2003.
Three-year project accepted in 2005.
Text corpora are its main concern at present.
NII-SRC distributes speech corpora.
GSK and NII-SRC
Language Resource Association (GSK)
A nonprofit organization collecting and distributing text and speech corpora.
http://www.gsk.or.jp/
NII-Speech Resources Consortium (NII-SRC)
Collects and distributes most major speech corpora.
http://research.nii.ac.jp/src/eng/
These two organizations aim to play central roles in collecting and distributing speech and language corpora in Japan.
JEITA (Japan Electronics and Information Technology Industries Association) GSK NII-SRC Knowledge Information Processing Technologies Committee NII: National Institute of Informatics NICT: National Institute of Information and Communications Technology Language Resource Sub-committee TCL Natural Language Processing Portal Site SHACHI: Language Resource Metadata DB
Purpose of GSK
Collection, distribution, investigation, research, and standardization of electronic data and software tools necessary for the promotion of science, technology, education and industry concerning natural language.
GSK Organization
President
Two vice presidents
11 board members
25 steering committee members
All are voluntary workers.
No-fee Distribution
[Diagram labels: Corpus Provider, Distribution permission, User, GSK, Payment, Agreement]
As a rule, the cost of handling corpora falls on the user, though the corpus itself is free of charge.
Agency
[Diagram labels: Agency Request, GSK, Provider, User, Form, Commission, Payment, Agreement]
Corpus providers entrust GSK with handling the requests received from users. GSK mediates between users and providers.
Advertising
[Diagram labels: Provider, User, Ad request, GSK, Publicity, Ad rate, Payment, Agreement]
Corpus providers entrust GSK with advertising useful information about their data or corpora.
Some Examples of GSK Corpora
JEITA Multimodal Corpus
Japanese Web N-gram Version 1
CICC Multilingual Dictionary
IPAL Lexicon of Basic Japanese
JEITA Multimodal Corpus
A corpus of person-to-person, task-oriented dialogues.
Includes 80 minutes of video covering 9 conversations on the topics “faces” and “travel”.
Speech data are transcribed and annotated with morphemes, dialogue structure and prosody.
Contained on 1 DVD-R (800 MB).
Japanese Web N-gram Version 1
N-grams extracted by Google from crawled, publicly available Japanese web pages.
Pages that require special permission to browse or are marked noarchive/noindex are not included.
1- to 7-grams with a frequency greater than 20 were extracted from approximately 20 billion sentences.
Contained on 6 DVD-Rs (26 GB after gzip compression).
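The extraction rule on the Japanese Web N-gram slide (keep 1- to 7-grams that occur more than 20 times) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the corpus: the function name `count_ngrams`, its parameters, and the assumption that each sentence has already been split into tokens are all hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_n=7, min_count=20):
    """Count n-grams of length 1..max_n and keep the frequent ones.

    `sentences` is an iterable of token lists. Tokenization is assumed
    to have been done beforehand (hypothetical preprocessing step).
    Only n-grams whose frequency is strictly greater than `min_count`
    are returned, mirroring the "frequency greater than 20" rule.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c > min_count}


# Toy usage with a low threshold so the effect is visible.
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"]]
frequent = count_ngrams(toy, max_n=2, min_count=1)
```

With the toy data, `frequent` keeps the unigrams `("a",)`, `("b",)`, `("c",)` and the bigram `("a", "b")`, and drops the bigrams that occur only once. A corpus of billions of sentences would not fit in a single in-memory counter, so a real pipeline would need a distributed or on-disk counting scheme; the sketch only shows the thresholding logic.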
CICC Multilingual Dictionary
A collection of Malay, Indonesian, Chinese, and Thai dictionaries containing 50,000 basic words with POS tags; some contain English translations.
A technical term dictionary for each language is also available.
Contained on 1 CD-ROM for each language.
CICC: Center of the International Cooperation for Computerization
IPAL Lexicon of Basic Japanese
Contains 861 verbs, 136 adjectives, 1,081 nouns, and a glossary.
English translations are also provided for the nouns contained in the glossary.
Contained on 1 CD-ROM.
Summary
1. There are several distributors of language resources in Japan.
2. GSK is the only language resource consortium qualified as an NPO in Japan.
3. GSK plans to collaborate with the Language Grid Project.