
15 Years of Language Resource Creation and Sharing: A Progress Report on LDC Activities

Christopher Cieri, Mark Liberman, {ccieri,myl}@ldc.upenn.edu. University of Pennsylvania, Linguistic Data Consortium, 3600 Market Street, Suite 810, Philadelphia, PA 19104, USA.


Presentation Transcript


  1. 15 Years of Language Resource Creation and Sharing: A Progress Report on LDC Activities Christopher Cieri, Mark Liberman {ccieri,myl}@ldc.upenn.edu University of Pennsylvania Linguistic Data Consortium 3600 Market Street, Suite 810 Philadelphia PA. 19104, USA

  2. Language Resource Landscape
     • Change in Language Resource landscape continues; some new trends emerging since last report
     • Continuing growth in need for language resources
       • number of languages, sophistication of annotation, variety of user communities
     • Continuing advances in computing enable ever greater resource creation by individual researchers
     • However, demand for data centers has never been greater
       • as measured by: memberships, resource donations, projects
     • Some technologies approaching human performance
       • quality becomes more important even at the cost of volume
     • Understanding natural limits of human performance becomes very important
     • DARPA TIDES & EARS, 2004/5, groups working in MT & STT did not use all data provided
     • DARPA GALE emphasizes source variation, richness, quality of annotation, coordination of resource types
     • REFLEX LCTL (Less Commonly Taught Languages), NIST LRE (Language Recognition Evaluation) focus on diversity of languages and resource types not volume in any specific language or type
     • Move toward digital linguistic resources by new research communities increases resource sharing
       • new communities need simple, adaptive access to existing data and flexible standards.
       • communities extending data sharing require mapping among alternate representations
     • Growth of computing around the world
       • Increases the diversity of languages represented on the Internet
       • Raises the demand for technologies in these languages that in turn requires language resource kits.

  3. LDC
     • Linguistic Data Consortium established 1992
       • centralized location to distribute and archive language data
       • normalize and manage intellectual property rights and distribution practice
     • Organized as a consortium, group of organizations, hosted by U. Penn.
     • Management staff in Philadelphia: 45 FT & <= 65 PT employees
     • Funding
       • DARPA seed funding covered operations + corpus creation
       • early support from NSF, NIST
       • required to be self-sufficient within 5 years (operation costs <= fees)
       • annual membership fees, data licenses
       • grant funding for specific resource creation, not maintenance
     • Data comes from donations, funded projects at LDC or elsewhere, community initiatives and LDC initiatives.
     • Expansion
       • 1995: collection, transcription activities
       • 1998: annotation
       • 1999: tools and standards
       • 2002: coordinating multi-site efforts, sharing experience through publications, training
     • LDC’s mission as currently defined is to support language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.
     • Activities
       • resource distribution, intellectual property rights management, resource production
       • data collection, annotation, lexicon building
       • tool creation, infrastructure building
       • creation of best practices, consulting and training
       • corpus creation research, resource coordination

  4. Benefits
     • Broad distribution of data with
       • uniform licensing within and across research communities which
       • relieves funding agencies of distribution costs and
       • provides vast amounts of data to members
     • Sustains stable infrastructure so that
       • Research communities know where to find data with
       • greatly standardized terms of use, distribution methods,
       • Members’ access to data is ongoing
       • Any patches are available via the same methods
     • Tools and specifications are distributed without fee.
     • The cost to create any one of the corpora in the LDC catalog is at least as much as the membership fee; in many cases it is one, two or even three orders of magnitude greater.

  5. LDC
     • data collection
       • news text
       • web text: blogs, zines, newsgroups
       • broadcast news and talk
       • telephone conversation
       • meetings
       • interviews
       • read and prompted speech
       • printed, handwritten and hybrid documents
     • annotation
       • quick and careful transcription
       • time-alignment and segmentation at the turn, sentence and word level
       • tagging of morphology, part-of-speech, gloss
       • syntactic annotation
       • semantic annotation
       • discourse function and disfluency
       • categorization according to topic relevance
       • identification and classification of entities, relations, events and their co-reference
       • summarization of various lengths from 200 words down to titles
       • translation, multiple translation, translation quality control
       • alignment of translated text at the document, sentence and word levels
     • lexicon building
       • pronunciation, morphological, translation

  6. Publications/Membership
     • New Membership types:
       • Online: access to subset of data included in LDC Online
       • Standard: LDC Online plus may request licenses <= 16 corpora, discounted licenses of data from previous years, discounted extra copies of licensed data
       • Subscription: as Standard Members, but automatically receive 2 copies of all corpora on media as they are released
     • Subscription memberships, added in 2005, now account for 23% of all members.
     • Cost increases
       • Due to rising costs of facilities, materials and labor
       • Licensing fees increased in 2007
       • Membership fees increased as of January 2008
     • Increases modest
       • 10% for subscription members
       • 20% for standard members
       • compared to average 3% annual increase in time value of money * 15 years
       • scaled according to the demand of member type
     • “Frequent flyer” and early bird discounts
       • 5% for any returning member
       • 5% for any organization joining in first 2 months of membership year
     • Overall effect: subscription members who maintain their membership from year to year and renew early in the year will actually see a 1% decrease in costs.
     • LDC currently adds 2-3 corpora to Catalog/month.
       • Membership and licensing fees support this activity completely
     • LDC has distributed 53,580 copies of nearly 800 corpora and otherwise shared data with 2540 organizations in 67 countries.
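The fee arithmetic on this slide can be checked with a short sketch. The slide does not say how the increase and the two 5% discounts combine, so the code assumes they compound multiplicatively; the function name `net_fee_factor` and the variable names are illustrative only, not anything LDC publishes. It also compares the 3% annual rate over 15 years under compound and simple growth, since the slide does not state which is meant.

```python
# Sketch of the membership fee arithmetic, under the ASSUMPTION that the
# increase and both 5% discounts compound multiplicatively (not stated on the slide).

def net_fee_factor(increase, discounts):
    """Return the new fee as a multiple of the old fee."""
    factor = 1.0 + increase
    for discount in discounts:
        factor *= 1.0 - discount
    return factor

# Returning member (5%) who also joins in the first 2 months (5%).
subscription_factor = net_fee_factor(0.10, [0.05, 0.05])
standard_factor = net_fee_factor(0.20, [0.05, 0.05])

# 3% per year over 15 years, compound and simple (the slide's "3% * 15 years").
compound_15yr = 1.03 ** 15
simple_15yr = 1.0 + 0.03 * 15

print(f"subscription net change: {(subscription_factor - 1) * 100:+.2f}%")
print(f"standard net change:     {(standard_factor - 1) * 100:+.2f}%")
print(f"15 years at 3% compound: {(compound_15yr - 1) * 100:+.1f}%")
print(f"15 years at 3% simple:   {(simple_15yr - 1) * 100:+.1f}%")
```

Under this assumption a returning, early-renewing subscription member pays about 0.7% less than before, which is consistent with the roughly 1% decrease quoted on the slide, while a standard member with both discounts pays about 8% more.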

  7. Publications
     • Since last report, LDC added
       • 68 titles to Catalog + dozens of corpora for evaluation programs
     • A sampling of those corpora includes:
       • email from the Enron scandal annotated for topic
       • Gigaword (billion word) News Text corpora in Arabic, Chinese, English, French, Spanish
       • broadcast news in Arabic, Korean
       • many contributions from Center for Spoken Language Understanding (CSLU)
         • Foreign Accented English, Apple Words and Phrases, Yes/No, Spelled and Spoken Words, Stories, Multilanguage Telephone Speech, Portland and National Cellular Telephone Speech, Names Release, Speaker Recognition, Spoltech Brazilian Portuguese and Voices
       • parallel text including Arabic Blogs (DARPA GALE)
       • Hungarian-English parallel text (Varga, Németh, Halácsy, Kornai)
       • STC-TIMIT: TIMIT data processed through a telephone network, contributed by Morales
       • Urdu speech from the Army Research Labs
       • Speech in Korean and Spanish contributed by West Point
       • Treebanks in Arabic, Chinese, Czech, English, Korean with translations of Arabic, Chinese
       • Penn Discourse Treebank (Joshi et al.)
       • Propbank in Korean
       • OntoNotes Release 2.0
       • Conversational Telephone Speech in Levantine, Iraqi and Gulf Arabic
       • Parallel Text in Arabic and Chinese (including 2 from ISI)
       • Broadcast News Parallel Text (LDC, MITRE)
       • Video key frames and transcripts created by the TRECVID program
       • Broadband Prompted Speech in English and Turkish (Middle East Technical University)
       • Telephone Band Speech in Russian
       • Evaluation data from the NIST 2003 and 2004 Rich Transcription campaigns
       • TimeBank corpus contributed by Pustejovsky et al.
       • SpatialML annotation of ACE 2005 Multilingual (Mani, Hitzeman, Richer, Harris)

  8. Sample Projects
     • DARPA GALE (Global Autonomous Language Exploitation)
       • supports multilingual transcription, translation into English and distillation of text into structured information
       • text (news, newsgroup, blog), transcribed speech (broadcast news and conversation) translated and aligned at sentence and sub-sentence level, annotations for syntactic structure & propositional content, distillation into structured information.
       • English, Mandarin and Arabic
     • MADCAT
       • supports systems that perform OCR (,LR) and MT of handwritten, printed and hybrid text
       • varying scribe, text type, writing instrument, time, speed of writing, paper quality
       • first language Arabic
     • Mixer Phases 1-5
       • support robust speaker recognition technologies
       • multigenre: conversational telephone speech, transcript reading, face-to-face interviews
       • multilingual: Arabic, English, Mandarin, Russian, Spanish
       • multichannel: lavalier on the subject and interviewer, Etymotic Link-It micro-array, podium, PZM, studio, hanging conference room, camcorder, 4 studio mics at varying distances from subject, microphone array, head mounted mic used only for brief telephone calls
     • LVDID (Language Variation and Dialect Identification)
       • >100 conversations in each of a dozen linguistic varieties
       • ongoing collection in another 20 varieties with all calls audited for sound quality and language
     • REFLEX-LCTL (Less Commonly Taught Languages) [Simpson et al.]
       • supports multiple technologies for LCTLs, especially extraction and translation
       • monolingual & parallel news text, bilingual lexicons, encoding converters, word & sentence segmenters, POS tagsets and taggers, morphological analyzers and tagged text, named-entity tagger and tagged text, personal name transliterator and grammatical sketch
       • Amazigh (Berber), Bengali, Hungarian, Pashto, Punjabi, Kurdish, Tagalog, Tamil, Thai, Tigrigna, Urdu, Uzbek and Yoruba

  9. Projects
     • ACE - English, Chinese and Arabic corpora annotated for entities, the relations among them and the events in which they participate and their co-reference.
     • HAVIC - web video collected, classified and annotated
     • TREC Video - broadcast video, key frames, transcripts
     • Mixer Greybeard - multiple telephone conversations from subjects in previous studies
     • ITRE - scientific text in the biomedical domain treebanked and tagged for entities
     • OLAC - ongoing development of the Open Language Archives Community
     • QLDB – methods, tools for querying complex linguistic data bases including treebanks

  10. Sample Collaborations
     • ELRA
       • joint programs: NetDC
       • collaboration: ENABLER, NEMLAR, FlareNet
       • joint data releases
       • subcontracts: LDC->ELRA, MedLTC Arabic BN collection/transcription to ELRA/MedLTC; ELRA considering subcontracting Spanish collection transcription to LDC
     • ANC
     • Appen – TRANSTAC
     • BUTE – REFLEX LCTL
     • CASL
     • CMU – REFLEX LCTL Elicitation Corpus
     • DGA
     • ELSNET
     • IRCAM
     • Melbourne University
     • OLAC
     • TalkBank

  11. Future Plans
     • maintain a leadership role in language resource creation and distribution
     • continue to support distribution operations and to provide increasing support for local initiatives via memberships and data licenses
     • extend outreach to new communities
       • including commercial ventures that require specialized corpora
     • make better use of technologies that are based upon LDC data
     • generally increase activities devoted to research
     • simplify production through efficiency and outsourcing
     • expand provision of tools, specifications and training to members
