Indian Language Initiatives at LDC
Denise DiPersio
dipersio@ldc.upenn.edu

Overview
• Introduction to LDC
• Tamil Projects/Resources
• Indian Language Projects/Resources
Tamil Internet Conference 2011 Philadelphia, PA 17 June 2011

LDC: Origin and Model
• Linguistic Data Consortium established in 1992
• Via open, competitive government solicitation, won by U. Penn
• Initial 5-year funding followed by self-sufficiency through membership fees, data licenses
• Power of the collective
• Language resource distributor/archive
• Centralized distribution, archiving, licensing
• Resources from donations, funded projects, community initiatives, LDC initiatives
• Membership
• Members support the consortium through fees, data, services
• Ongoing rights to data published in membership years
• Reduced fees on older corpora, extra copies

LDC: Roles
• Data collection
• Language resource (LR) production, including quality control
• LR distribution and archiving
• Intellectual property rights management and license management
• Human subjects protocol management
• Annotation, lexicon building
• Creation of tools, specifications, best practices
• Knowledge transfer: documentation, metadata, consulting, training
• Corpus creation research and academic publication
• Resource coordination in large multisite programs
• Serving multiple research communities
• Funding panelists, workshop participants, oversight committee members

LDC: Data Collection
• News text
• Web text (newsgroups, blogs, chatrooms, Twitter)
• Biomedical texts and abstracts
• Printed, handwritten and hybrid documents
• Broadcast programming (news, conversation)
• Conversational telephone speech
• Lectures, meetings, interviews
• Read and prompted speech
• Role play
• Video (broadcast, web)
• Animal vocalizations

LDC: Annotation
• Data scouting, selection, triage
• Audio-audio alignment: bandwidth, signal quality, language, dialect, program, speaker
• Quick and careful transcription, aligned at turn, sentence, word level
• Phonetic, dialect, sociolinguistic feature, supralexical
• Tokenization, tagging of morphology, part-of-speech, gloss
• Syntactic, semantic, discourse functions, disfluency, sense disambiguation
• Identification/classification of entities, relations, events and coreference
• Translation, alignment of translated text
• Identification/classification of entities/events in video
• Document zoning

LDC: Distribution
• Since 1992, LDC has distributed
• Nearly 75,000 copies of 1300 titles to more than 3000 organizations in over 65 countries
• Approximately 8000 scholars and research groups receive LDC’s monthly newsletter
• Non-exclusive distribution of donated data
• LDC research communities span human language technologies, computer science, social sciences
• Uniform licensing within and across research communities
• Stable infrastructure
• LRs permanently accessible, ongoing access to data
• Standardized, simple terms of use and distribution methods

LDC: Data Scholarships
• Formalizes LDC’s long-standing practice of no-cost ($0) distribution of data to students without the means to otherwise license it
• Competitive process
• Student submits application that contains:
• Data set requested, proposed need and use of data
• Description of research agenda
• Demonstration of high probability of success for work
• Letter of support from department chair/advisor including statement of financial need
• Two cycles completed; next will be Fall 2011
• 16 recipients
• Argentina, China, India, Indonesia, Mexico, UK, USA
• ~USD 40,000 in data awarded

Tamil Projects: REFLEX/LCTL 1/3
• REFLEX-LCTL (Less Commonly Taught Languages)
• Goal: to create human language technologies for the target languages, especially machine translation, information extraction
• Language selection criteria
• Large population of native speakers
• Relatively few language resources (electronic text, intentional difficulty variation in LR creation)
• Linguistic and geographic diversity
• Include some related languages
• Make use of existing collaborations
• Thirteen languages: Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Panjabi, Tamil, Tagalog, Thai, Tigrinya, Urdu, Uzbek, Yoruba
• Bengali, Panjabi, Urdu – related languages

Tamil Projects: REFLEX/LCTL 2/3
• LDC created language packs for each language consisting of
• a monolingual news text corpus (500k words)
• a parallel text corpus (250k words)
• a lexicon (10k entries)
• a grammatical sketch
• an encoding converter
• a sentence segmenter
• a tokenizer
• a name transliterator
• a part of speech tagger and tagged text
• a named entity tagger and tagged text
• a morphological analyzer and tagged text

Tamil Projects: REFLEX/LCTL 3/3
• Resources identified through individual scouting, “Harvest Festivals”, native speakers
• Tamil Language Pack
• Text sources included websites (for monolingual and parallel text)
• Collaboration with Harold Schiffman, Vasu Renganathan
• Tamil lexicon – An English Dictionary of the Tamil Verb
• Consulted on encoding conversion
• Project sponsor has not yet released pack for publication; potential use in ongoing technology evaluations
• Will be published in LDC catalog when cleared for distribution

Tamil Projects: Language Resource Wiki
• Language Resource (LR) Wiki designed to be
• Publicly accessible, world-readable
• Portal of found resources “harvested” in REFLEX-LCTL project
• Editable by authenticated others outside LDC
• Pages for seven languages, including Tamil
• http://lrwiki.ldc.upenn.edu/mediawiki/index.php/Tamil/Tamil
• Bengali, Berber, Panjabi, Pashto, Tagalog, Tamil, Urdu
• Breton, Ewe pages in progress
• Language summary, linguistic resources, encoding and fonts, data sources, portals, tools and other natural language processing resources

Tamil Projects: CALLFRIEND
• CALLFRIEND project supported the development of language identification technology
• LDC recruited native speakers in the target languages to make telephone calls to other native speakers
• Calls were unscripted and lasted between 5 and 30 minutes
• Target languages: American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, dialectal Mandarin Chinese, Spanish (Caribbean, non-Caribbean), Tamil, Vietnamese
• CALLFRIEND Tamil LDC96S59
• 60 telephone conversations
• Demographic data: sex, age, education
• Call information: channel quality, number of speakers
• Calls originated inside the continental United States and Canada

Tamil Resources
• An English Dictionary of the Tamil Verb, Second Edition LDC2009L01
• Harold Schiffman, Vasu Renganathan (U Penn, Department of South Asia Studies)
• Translations for 6597 English verbs and definitions for 9716 Tamil verbs
• Associated sound files for pronunciation; example sentences
• Windows search and browse application
• Complimentary copy in conference packet

Indian Language Projects/Resources: Hindi
• Hindi Surprise Language Exercise (2003)
• Goal: to assemble found resources under timed conditions
• LDC collected newswire, web data, some parallel text
• Not all resources can be released due to intellectual property, license restrictions
• Further work needed for public release
• Hindi WordNet LDC2008L02
• Joint distribution with IIT Bombay
• First WordNet for an Indian language
• CALLFRIEND Hindi LDC96S52

Indian Language Resources: POS Tagsets
• Indian Language Part of Speech Tagsets (IL-POST)
• Developed by Microsoft Research India; Anna University, Chennai; Delhi University; IIT Bombay; Jawaharlal Nehru University, Delhi; Tamil University, Tamilnadu
• Goal: to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across languages
• LDC currently distributes three IL-POST sets at no cost: Bengali, Hindi, Sanskrit
• IL-POST Bengali LDC2010T16 – 103k words from web text, EMILLE corpus (parallel newswire)
• IL-POST Hindi LDC2010T24 – 98k words from web text
• IL-POST Sanskrit LDC2011T04 – 57k words from Panchatantra stories
• More languages planned, Tamil among them

LDC: Need to Know
• LDC website, http://www.ldc.upenn.edu/
• The LDC Corpus Catalog, http://www.ldc.upenn.edu/Catalog/
• Submitting Corpora and Other Resources to LDC, http://www.ldc.upenn.edu/Providing/
• LDC Online, https://online.ldc.upenn.edu/login.html
• Member Resources, http://www.ldc.upenn.edu/Membership/
• Questions?
• Thank you!