Indian Language Initiatives at LDC

Presentation Transcript


  1. Indian Language Initiatives at LDC • Denise DiPersio, dipersio@ldc.upenn.edu

  2. Overview • Introduction to LDC • Tamil Projects/Resources • Indian Language Projects/Resources Tamil Internet Conference 2011 Philadelphia, PA 17 June 2011

  3. LDC: Origin and Model • Linguistic Data Consortium established in 1992 • Via open, competitive government solicitation, won by U. Penn • Initial 5-year funding followed by self-sufficiency through membership fees, data licenses • Power of the collective • Language resource distributor/archive • Centralized distribution, archiving, licensing • Resources from donations, funded projects, community initiatives, LDC initiatives • Membership • Members support the consortium through fees, data, services • Ongoing rights to data published in membership years • Reduced fees on older corpora, extra copies

  4. LDC: Roles • Data collection • Language resource (LR) production, including quality control • LR distribution and archiving • Intellectual property rights management and license management • Human subjects protocol management • Annotation, lexicon building • Creation of tools, specifications, best practices • Knowledge transfer: documentation, metadata, consulting, training • Corpus creation research and academic publication • Resource coordination in large multisite programs • Serving multiple research communities • Funding panelists, workshop participants, oversight committee members

  5. LDC: Data Collection • News text • Web text (newsgroups, blogs, chatrooms, twitter) • Biomedical texts and abstracts • Printed, handwritten and hybrid documents • Broadcast programming (news, conversation) • Conversational telephone speech • Lectures, meetings, interviews • Read and prompted speech • Role play • Video (broadcast, web) • Animal vocalizations

  6. LDC: Annotation • Data scouting, selection, triage • Audio-audio alignment: bandwidth, signal quality, language, dialect, program, speaker • Quick and careful transcription, aligned at turn, sentence, word level • Phonetic, dialect, sociolinguistic feature, supralexical • Tokenization, tagging of morphology, part-of-speech, gloss • Syntactic, semantic, discourse functions, disfluency, sense disambiguation • Identification/classification of entities, relations, events and coreference • Translation, alignment of translated text • Identification/classification of entities/events in video • Document zoning

  7. LDC: Distribution • Since 1992, LDC has distributed • Nearly 75,000 copies of 1300 titles to more than 3000 organizations in over 65 countries • Approximately 8000 scholars and research groups receive LDC’s monthly newsletter • Non-exclusive distribution of donated data • LDC research communities span human language technologies, computer science, social sciences • Uniform licensing within and across research communities • Stable infrastructure • LRs permanently accessible, ongoing access to data • Standardized, simple terms of use and distribution methods

  8. LDC: Data Scholarships • Formalizes LDC’s long practice of $0 distribution of data to students without the means to otherwise license it • Competitive process • Student submits application that contains: • Data set requested, proposed need and use of data • Description of research agenda • Demonstration of high probability of success for work • Letter of support from department chair/advisor including statement of financial need • Two cycles completed; next will be Fall 2011 • 16 recipients • Argentina, China, India, Indonesia, Mexico, UK, USA • ~USD40,000 in data awarded

  9. Tamil Projects: REFLEX/LCTL 1/3 • REFLEX-LCTL (Less Commonly Taught Languages) • Goal: to create human language technologies for the target languages, especially machine translation, information extraction • Language selection criteria • Large population of native speakers • Relatively few language resources (electronic text, intentional difficulty variation in LR creation) • Linguistic and geographic diversity • Include some related languages • Make use of existing collaborations • Thirteen languages: Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Panjabi, Tamil, Tagalog, Thai, Tigrinya, Urdu, Uzbek, Yoruba • Bengali, Panjabi, Urdu – related languages

  10. Tamil Projects: REFLEX/LCTL 2/3 • LDC created language packs for each language consisting of • a monolingual news text corpus (500k words) • a parallel text corpus (250k words) • a lexicon (10k entries) • a grammatical sketch • an encoding converter • a sentence segmenter • a tokenizer • a name transliterator • a part of speech tagger and tagged text • a named entity tagger and tagged text • a morphological analyzer and tagged text

  11. Tamil Projects: REFLEX/LCTL 3/3 • Resources identified through individual scouting, “Harvest Festivals”, native speakers • Tamil Language Pack • Text sources included websites (for monolingual and parallel text) • Collaboration with Harold Schiffman, Vasu Renganathan • Tamil lexicon – An English Dictionary of the Tamil Verb • Consulted on encoding conversion • Project sponsor has not yet released pack for publication; potential use in ongoing technology evaluations • Will be published in LDC catalog when cleared for distribution

  12. Tamil Projects: Language Resource Wiki • Language Resource (LR) Wiki designed to be • Publicly accessible, world-readable • Portal of found resources “harvested” in REFLEX-LCTL project • Editable by authenticated others outside LDC • Pages for seven languages, including Tamil • http://lrwiki.ldc.upenn.edu/mediawiki/index.php/Tamil/Tamil • Bengali, Berber, Panjabi, Pashto, Tagalog, Tamil, Urdu • Breton, Ewe pages in progress • Language summary, linguistic resources, encoding and fonts, data sources, portals, tools and other natural language processing resources

  13. Tamil Projects: Language Resource Wiki

  14. Tamil Projects: CALLFRIEND • CALLFRIEND project supported the development of language identification technology • LDC recruited native speakers in the target languages to make telephone calls to other native speakers • Calls were unscripted and lasted between 5 and 30 minutes • Target languages: American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, dialectal Mandarin Chinese, Spanish (Caribbean, non-Caribbean), Tamil, Vietnamese • CALLFRIEND Tamil LDC96S59 • 60 telephone conversations • Demographic data: sex, age, education • Call information: channel quality, number of speakers • Calls originated inside the continental United States and Canada

  15. Tamil Resources • An English Dictionary of the Tamil Verb Second Edition LDC2009L01 • Harold Schiffman, Vasu Renganathan (U Penn, Department of South Asia Studies) • Translations for 6597 English verbs and definitions for 9716 Tamil verbs • Associated sound files for pronunciation; example sentences • Windows search and browse application • Complimentary copy in conference packet

  16. Indian Language Projects/Resources: Hindi • Hindi Surprise Language Exercise (2003) • Goal: to assemble found resources under timed conditions • LDC collected newswire, web data, some parallel text • Not all resources can be released due to intellectual property, license restrictions • Further work needed for public release • Hindi WordNet LDC2008L02 • Joint distribution with IIT Bombay • First WordNet for an Indian language • CALLFRIEND Hindi LDC96S52

  17. Indian Language Resources: POS Tagsets • Indian Language Part of Speech Tagsets (IL-POST) • Developed by Microsoft Research India; Anna University, Chennai; Delhi University; IIT Bombay; Jawaharlal Nehru University, Delhi; Tamil University, Tamil Nadu • Goal: to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across languages • LDC currently distributes three IL-POST sets at no cost: Bengali, Hindi, Sanskrit • IL-POST Bengali LDC2010T16 – 103k words from web text, EMILLE corpus (parallel newswire) • IL-POST Hindi LDC2010T24 – 98k words from web text • IL-POST Sanskrit LDC2011T04 – 57k words from Panchatantra stories • More languages planned, Tamil among them

  18. LDC: Need to Know • LDC website, http://www.ldc.upenn.edu/ • The LDC Corpus Catalog, http://www.ldc.upenn.edu/Catalog/ • Submitting Corpora and Other Resources to LDC, http://www.ldc.upenn.edu/Providing/ • LDC Online, https://online.ldc.upenn.edu/login.html • Member Resources, http://www.ldc.upenn.edu/Membership/ • Questions? • Thank you!