15 Years of Language Resource Creation and Sharing: A Progress Report on LDC Activities
Christopher Cieri, Mark Liberman
{ccieri,myl}@ldc.upenn.edu
University of Pennsylvania Linguistic Data Consortium
3600 Market Street, Suite 810, Philadelphia PA 19104, USA
Language Resource Landscape
• Change in Language Resource landscape continues; some new trends emerging since last report
• Continuing growth in need for language resources
  • number of languages, sophistication of annotation, variety of user communities
• Continuing advances in computing enable ever greater resource creation by individual researchers
• However, demand for data centers has never been greater
  • as measured by: memberships, resource donations, projects
• Some technologies approaching human performance
  • quality becomes more important even at the cost of volume
• Understanding natural limits of human performance becomes very important
• DARPA TIDES & EARS, 2004/5: groups working in MT & STT did not use all data provided
• DARPA GALE emphasizes source variation, richness, quality of annotation, coordination of resource types
• REFLEX LCTL (Less Commonly Taught Languages) and NIST LRE (Language Recognition Evaluation) focus on diversity of languages and resource types, not volume in any specific language or type
• Move toward digital linguistic resources by new research communities increases resource sharing
  • new communities need simple, adaptive access to existing data and flexible standards
  • communities extending data sharing require mapping among alternate representations
• Growth of computing around the world
  • Increases the diversity of languages represented on the Internet
  • Raises the demand for technologies in these languages, which in turn requires language resource kits
LDC
• Linguistic Data Consortium established 1992
  • centralized location to distribute and archive language data
  • normalize and manage intellectual property rights and distribution practice
• Organized as a consortium, a group of organizations, hosted by U. Penn.
• Management staff in Philadelphia: 45 FT & <= 65 PT employees
• Funding
  • DARPA seed funding covered operations + corpus creation
  • early support from NSF, NIST
  • required to be self-sufficient within 5 years (operation costs <= fees)
  • annual membership fees, data licenses
  • grant funding for specific resource creation, not maintenance
• Data comes from donations, funded projects at LDC or elsewhere, community initiatives and LDC initiatives.
• Expansion
  • 1995: collection, transcription activities
  • 1998: annotation
  • 1999: tools and standards
  • 2002: coordinating multi-site efforts, sharing experience through publications, training
• LDC’s mission as currently defined is to support language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.
• Activities
  • resource distribution, intellectual property rights management, resource production
  • data collection, annotation, lexicon building
  • tool creation, infrastructure building
  • creation of best practices, consulting and training
  • corpus creation research, resource coordination
Benefits
• Broad distribution of data with
  • uniform licensing within and across research communities which
    • relieves funding agencies of distribution costs and
    • provides vast amounts of data to members
• Sustains stable infrastructure so that
  • Research communities know where to find data with
    • greatly standardized terms of use, distribution methods
  • Members’ access to data is ongoing
  • Any patches are available via the same methods
  • Tools and specifications are distributed without fee.
• The cost to create any one of the corpora in the LDC catalog is at least as much as the membership fee; in many cases it is one, two or even three orders of magnitude greater.
LDC
• data collection
  • news text
  • web text: blogs, zines, newsgroups
  • broadcast news and talk
  • telephone conversation
  • meetings
  • interviews
  • read and prompted speech
  • printed, handwritten and hybrid documents
• annotation
  • quick and careful transcription
  • time-alignment and segmentation at the turn, sentence and word level
  • tagging of morphology, part-of-speech, gloss
  • syntactic annotation
  • semantic annotation
  • discourse function and disfluency
  • categorization according to topic relevance
  • identification and classification of entities, relations, events and their co-reference
  • summarization of various lengths from 200 words down to titles
  • translation, multiple translation, translation quality control
  • alignment of translated text at the document, sentence and word levels
• lexicon building
  • pronunciation, morphological, translation
Publications/Membership
• New Membership types:
  • Online: access to subset of data included in LDC Online
  • Standard: LDC Online, plus may request licenses for <= 16 corpora, discounted licenses of data from previous years, discounted extra copies of licensed data
  • Subscription: as Standard Members, but automatically receive 2 copies of all corpora on media as they are released
• Subscription memberships, added in 2005, now account for 23% of all members.
• Cost increases
  • Due to rising costs of facilities, materials and labor
  • Licensing fees increased in 2007
  • Membership fees increased as of January 2008
• Increases modest
  • 10% for subscription members
  • 20% for standard members
  • compared to average 3% annual increase in time value of money * 15 years
  • scaled according to the demand of member type
• “Frequent flyer” and early bird discounts
  • 5% for any returning member
  • 5% for any organization joining in first 2 months of membership year
• Overall effect: subscription members who maintain their membership from year to year and renew early in the year will actually see a 1% decrease in costs.
• LDC currently adds 2-3 corpora to Catalog/month.
  • Membership and licensing fees support this activity completely
• LDC has distributed 53,580 copies of nearly 800 corpora and otherwise shared data with 2540 organizations in 67 countries.
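
The fee figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the slide does not say how the two 5% discounts combine, so the code assumes they are added together and applied to the increased fee, and shows a multiplicative variant for comparison. The function name `net_change` and the variable names are hypothetical, not LDC terminology.

```python
# Illustrative fee arithmetic for the percentages quoted on the slide.
# Assumption (not stated in the source): the two 5% discounts are summed
# and applied to the increased fee. A multiplicative variant is included
# only for comparison.

def net_change(increase, discounts, additive=True):
    """Return the net fractional fee change after an increase and discounts."""
    if additive:
        factor = (1 + increase) * (1 - sum(discounts))
    else:
        factor = 1 + increase
        for d in discounts:
            factor *= 1 - d
    return factor - 1

DISCOUNTS = [0.05, 0.05]  # returning member, early renewal

subscription = net_change(0.10, DISCOUNTS)
standard = net_change(0.20, DISCOUNTS)
subscription_multiplicative = net_change(0.10, DISCOUNTS, additive=False)

# 3% per year over 15 years, simple vs. compounded
simple_growth = 0.03 * 15
compound_growth = 1.03 ** 15 - 1

print(f"subscription, additive discounts:       {subscription:+.2%}")
print(f"subscription, multiplicative discounts: {subscription_multiplicative:+.2%}")
print(f"standard, additive discounts:           {standard:+.2%}")
print(f"3% x 15 years, simple:                  {simple_growth:+.2%}")
print(f"3% over 15 years, compounded:           {compound_growth:+.2%}")
```

Under the additive assumption, a 10% increase combined with both discounts gives a net change of about -1%, which matches the 1% decrease quoted for subscription members. The multiplicative variant gives a slightly smaller decrease, so the additive reading appears to be the one the slide intends.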
Publications
• Since last report, LDC added
  • 68 titles to Catalog + dozens of corpora for evaluation programs
• A sampling of those corpora includes:
  • email from the Enron scandal annotated for topic
  • Gigaword (billion word) News Text corpora in Arabic, Chinese, English, French, Spanish
  • broadcast news in Arabic, Korean
  • many contributions from Center for Spoken Language Understanding (CSLU)
    • Foreign Accented English, Apple Words and Phrases, Yes/No, Spelled and Spoken Words, Stories, Multilanguage Telephone Speech, Portland and National Cellular Telephone Speech, Names Release, Speaker Recognition, Spoltech Brazilian Portuguese and Voices
  • parallel text including Arabic Blogs (DARPA GALE)
  • Hungarian-English parallel text (Varga, Németh, Halácsy, Kornai)
  • STC-TIMIT: TIMIT data processed through telephone network (contributed by Morales)
  • Urdu speech from the Army Research Labs
  • Speech in Korean and Spanish contributed by West Point
  • Treebanks in Arabic, Chinese, Czech, English, Korean with translations of Arabic, Chinese
  • Penn Discourse Treebank (Joshi et al.)
  • Propbank in Korean
  • OntoNotes Release 2.0
  • Conversational Telephone Speech in Levantine, Iraqi and Gulf Arabic
  • Parallel Text in Arabic and Chinese (including 2 from ISI)
  • Broadcast News Parallel Text (LDC, MITRE)
  • Video key frames and transcripts created by the TRECVID program
  • Broadband Prompted Speech in English and Turkish (Middle East Technical University)
  • Telephone Band Speech in Russian
  • Evaluation data from the NIST 2003 and 2004 Rich Transcription campaigns
  • TimeBank corpus (contributed by Pustejovsky et al.)
  • SpatialML annotation of ACE 2005 Multilingual (Mani, Hitzeman, Richer, Harris)
Sample Projects
• DARPA GALE (Global Autonomous Language Exploitation)
  • supports multilingual transcription, translation into English and distillation of text into structured information
  • text (news, newsgroup, blog) and transcribed speech (broadcast news and conversation), translated and aligned at sentence and sub-sentence level; annotations for syntactic structure & propositional content; distillation into structured information
  • English, Mandarin and Arabic
• MADCAT
  • supports systems that perform OCR (,LR) and MT of handwritten, printed and hybrid text
  • varying scribe, text type, writing instrument, time, speed of writing, paper quality
  • first language Arabic
• Mixer Phases 1-5
  • support robust speaker recognition technologies
  • multigenre: conversational telephone speech, transcript reading, face-to-face interviews
  • multilingual: Arabic, English, Mandarin, Russian, Spanish
  • multichannel: lavalier on the subject and interviewer, Etymotic Link-It micro-array, podium, PZM, studio, hanging conference room, camcorder, 4 studio mics at varying distances from subject, microphone array, head mounted mic used only for brief telephone calls
• LVDID (Language Variation and Dialect Identification)
  • >100 conversations in each of a dozen linguistic varieties
  • ongoing collection in another 20 varieties with all calls audited for sound quality and language
• REFLEX-LCTL (Less Commonly Taught Languages) [Simpson et al.]
  • supports multiple technologies for LCTLs, especially extraction and translation
  • monolingual & parallel news text, bilingual lexicons, encoding converters, word & sentence segmenters, POS tagsets and taggers, morphological analyzers and tagged text, named-entity tagger and tagged text, personal name transliterator and grammatical sketch
  • Amazigh (Berber), Bengali, Hungarian, Pashto, Punjabi, Kurdish, Tagalog, Tamil, Thai, Tigrigna, Urdu, Uzbek and Yoruba
Projects
• ACE - English, Chinese and Arabic corpora annotated for entities, the relations among them, the events in which they participate and their co-reference
• HAVIC - web video collected, classified and annotated
• TREC Video - broadcast video, key frames, transcripts
• Mixer Greybeard - multiple telephone conversations from subjects in previous studies
• ITRE - scientific text in the biomedical domain treebanked and tagged for entities
• OLAC - ongoing development of the Open Language Archives Community
• QLDB - methods, tools for querying complex linguistic databases including treebanks
Sample Collaborations
• ELRA
  • joint programs: NetDC
  • collaboration: ENABLER, NEMLAR, FlareNet
  • joint data releases
  • subcontracts: LDC -> ELRA, MedLTC for Arabic BN collection/transcription; ELRA considering subcontracting Spanish collection transcription to LDC
• ANC
• Appen – TRANSTAC
• BUTE – REFLEX LCTL
• CASL
• CMU – REFLEX LCTL Elicitation Corpus
• DGA
• ELSNET
• IRCAM
• Melbourne University
• OLAC
• TalkBank
Future Plans
• maintain a leadership role in language resource creation and distribution
• continue to support distribution operations and to provide increasing support for local initiatives via memberships and data licenses
• extend outreach to new communities
  • including commercial ventures that require specialized corpora
• make better use of technologies that are based upon LDC data
• generally increase activities devoted to research
• simplify production through efficiency and outsourcing
• expand provision of tools, specifications and training to members