
A Road Map for Interoperable Language Resource Metadata

A Road Map for Interoperable Language Resource Metadata. Christopher Cieri 1 , Khalid Choukri 2 , Nicoletta Calzolari 3 , D. Terence Langendoen 4 , Johannes Leveling 5 , Martha Palmer 6 , Nancy Ide 7 , James Pustejovsky 8 1. Linguistic Data Consortium (LDC), (ccieri @ ldc.upenn.edu,)



Presentation Transcript


  1. A Road Map for Interoperable Language Resource Metadata. Christopher Cieri (1), Khalid Choukri (2), Nicoletta Calzolari (3), D. Terence Langendoen (4), Johannes Leveling (5), Martha Palmer (6), Nancy Ide (7), James Pustejovsky (8). (1) Linguistic Data Consortium (LDC) (ccieri@ldc.upenn.edu); (2) European Language Resources Association (ELRA) (choukri@elda.org); (3) Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche (glottolo@ilc.cnr.it); (4) Department of Linguistics, University of Arizona (langendt@email.arizona.edu); (5) Centre for Next Generation Localisation (CNGL), Dublin City University (johannes.leveling@computing.dcu.ie); (6) Center for Computational Language and Education Research, Department of Computer Science, University of Colorado, Boulder (Martha.Palmer@Colorado.edu); (7) Department of Computer Science, Vassar College, USA (ide@cs.vassar.edu); (8) Department of Computer Science, Brandeis University (jamesp@cs.brandeis.edu)

  2. Background • LRs remain expensive to create, thus rare relative to demand • accidental re-creation of an LR is a nearly unforgivable waste of scarce resources • Despite • existence of a few large data centers focused on HLT • ELRA, LDC • prior harmonization project • Networking Data Centers • union catalog initiative • Open Language Archives Community (OLAC) • HLT researchers must still • master multiple metadata sets • to search multiple locations • in order to find needed resources • or else risk failing to note the existence of critical LRs • recreate them • do without them

  3. Recent Pre-History • OLAC (Open Language Archives Community) • LDC, ELRA early adopters • LREC Universal Catalog, LDC participating • FLaReNet (Fostering Language Resources Network) • "A major condition for the take-off of the field of Language Resources and Language Technologies is the creation of a shared policy for the next years." • FLaReNet Meeting, Vienna • SILT (Sustainable Interoperability for Language Technology) • "turn existing, fragmented technology and resources developed to support language processing technology into accessible, stable, and interoperable resources that can be readily reused across several fields" • SILT-FLaReNet Meeting, "Towards An Operationalized Definition of Interoperability for Language Technology", Brandeis University, Waltham, Massachusetts, 1-2 November 2009

  4. Current Landscape • Major Data Centers maintain own separate catalogs • different metadata languages (categories, terminologies) • export subsets of their metadata categories to OLAC • OLAC provides • specifications for OAI (Open Archives Initiative) compliant metadata • routines for harvesting, interchanging, searching • ELRA UC • focusing on resources intended for HLT R&D • includes a greater percentage of ELRA metadata fields • exploits data mining to discover resources not produced or distributed by ELRA • LREC Map • uses LREC submission process to increase the contribution of LR metadata • NICT Shachi catalog • union catalog of resources • records are scraped • uses data mining technologies to discover LR features missing from home catalog entries
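The OAI-compliant harvesting mentioned on this slide can be illustrated with a minimal sketch. It assumes the standard OAI-PMH `ListRecords` request and the generic `oai_dc` (Dublin Core) metadata format; the base URL `https://example.org/oai`, the sample response, and the function names are hypothetical and do not describe any data center's actual endpoint or the OLAC harvester itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (base_url is hypothetical)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hand-written sample response standing in for what a repository might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:LR-0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example News Text Corpus</dc:title>
          <dc:language>eng</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, titles, and languages from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [e.text for e in record.iter(f"{{{DC_NS}}}title")]
        languages = [e.text for e in record.iter(f"{{{DC_NS}}}language")]
        records.append(
            {"identifier": identifier, "titles": titles, "languages": languages}
        )
    return records


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also fetch the URL over HTTP and follow the `resumptionToken` that OAI-PMH uses for paging; both are omitted here to keep the sketch self-contained.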

  5. Current Landscape • LDC LR Wiki • identifies LRs (for less commonly taught languages) • organized by language and LR type • area experts edit individual sections • some resources: plain/parallel text & lexicons identified & even harvested automatically • free text description => normalization • LDC LR Papers Catalog • research papers • introduce, describe, discuss, extend or rely upon another LR • currently focusing on papers dealing with LDC data • full bibliographic information on the paper • link to the unique identifier of the LR referenced

  6. Current Landscape

  7. Short Term Recommendations • harmonize LR catalogs of largest international data centers • non-reductionist approach to harmonization • do not identify a minimal subset that applies to all LR types • focus on LRs targeted toward HLT R&D • identify the superset of metadata types contained in them • distinguish • those that can be normalized internally across data centers • from those that encode irreconcilable differences • agree to normalize, harmonize practice wherever possible • governance body specifically for this project • project partners, sponsors, individual and small group LR providers and LR users
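The superset step on this slide can be sketched as a set union over each center's metadata fields. The catalog names (`center_a`, etc.) and field names below are invented for illustration and are not taken from any real catalog. The code only reports which fields every catalog already shares and which appear in some catalogs only; deciding whether a partially shared field can be normalized or encodes an irreconcilable difference remains a human judgment.

```python
# Hypothetical metadata field inventories for three data centers.
catalogs = {
    "center_a": {"title", "language", "modality", "license"},
    "center_b": {"title", "language", "size", "license"},
    "center_c": {"title", "language", "modality", "annotation_type"},
}


def harmonization_report(catalogs):
    """Return the field superset, the fields shared by all, and the rest.

    `partial` maps each field that is missing from at least one catalog
    to the sorted list of catalogs that do provide it.
    """
    superset = set().union(*catalogs.values())
    holders = {
        field: sorted(name for name, fields in catalogs.items() if field in fields)
        for field in sorted(superset)
    }
    shared = [f for f, names in holders.items() if len(names) == len(catalogs)]
    partial = {f: names for f, names in holders.items() if len(names) < len(catalogs)}
    return superset, shared, partial


if __name__ == "__main__":
    superset, shared, partial = harmonization_report(catalogs)
    print("superset:", sorted(superset))
    print("shared by all:", shared)
    for field, names in partial.items():
        print(f"{field}: only in {', '.join(names)}")
```

Starting from the union rather than the intersection matches the non-reductionist approach: no field is dropped merely because one center lacks it.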

  8. Outcomes • harmonized catalogs • definition of metadata categories • database structure • search engine customized to HLT LR search • controlled vocabulary fields • relevance-based search of entire catalog records • specification of best metadata practices • centralized metadata repository with a harvesting protocol • searcher assistance based upon • relations among metadata categories (dictionary ≅ lexicon) • prior search behavior • those who searched for “Gigaword” also searched for “news text corpora” • metadata creator assistance based on • searcher behavior • “93% of searchers include a language name in their search” but “87% of all providers include ISO 639-3 language codes” • behavior of other metadata providers • “the metadata you have provided so far also characterize 32 other resources”
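The two kinds of searcher assistance named on this slide can be sketched as follows: query expansion through related metadata categories, and "also searched for" suggestions from co-occurrence counts over prior search sessions. The relation table, the session log, and all function names are hypothetical; this is one simple way to realize the idea, not the design of any planned search engine.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical relation table: categories treated as near-equivalent.
RELATED_CATEGORIES = {
    "dictionary": {"lexicon"},
    "lexicon": {"dictionary"},
}


def expand_query(term):
    """Return the lowercased term plus any related category terms."""
    key = term.lower()
    return {key} | RELATED_CATEGORIES.get(key, set())


def build_co_search(sessions):
    """Count how often two distinct queries occur in the same session."""
    co_search = defaultdict(Counter)
    for session in sessions:
        for a, b in permutations(sorted(set(session)), 2):
            co_search[a][b] += 1
    return co_search


def also_searched(co_search, query, k=3):
    """Return up to k queries that most often co-occur with `query`."""
    return [q for q, _ in co_search.get(query, Counter()).most_common(k)]


# Invented session log for illustration only.
SESSIONS = [
    ["Gigaword", "news text corpora"],
    ["Gigaword", "news text corpora", "parallel text"],
    ["Gigaword", "treebank"],
]

if __name__ == "__main__":
    print(expand_query("Dictionary"))
    print(also_searched(build_co_search(SESSIONS), "Gigaword", k=1))
```

Raw co-occurrence counts favor popular queries; a production system would likely normalize by query frequency, which is left out here for brevity.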

  9. Middle Term Recommendations • expand UC scope to include raw data & research papers • some work already begun • not coordinated across data centers and LR creators • LCTL LR wiki at LDC • Rosetta Project • harvest of papers describing LRs • at LDC using human effort • within Rexa project using data mining technologies • integrate effective workflows: social networking, web sourcing, data mining • enhance UC with links to raw resources including • web sites rich in monolingual and parallel text • lexicons built for interactive use • new harmonization challenges • adjust governance and broaden the scope of its normalization activities • implement sustainable business models

  10. Requirements • Representatives of • relevant data and metadata centers: ELRA/ELDA, LDC, NICT, and OLAC • interested professional organizations: ACL, LSA, LINGUIST List, SIL, ISCA • journals willing to implement a version of the LREC Map: LRE, LILT • organizers of conferences who agree to implement the LREC Map: LREC, AFLR • related cataloging projects: Rexa Project • leading industrial partners • related LR development projects and centers: LanguageGrid, CDAC • Resources • support for partners (ELRA, LDC, NIST), some of which is already in place • database schema (existing) • search engine • technology for data mining • taxonomy/controlled vocabulary of applications, data types, etc. • Activities • outreach • sample output early in the project • institutional endorsement • evaluation of metadata (user feedback) • evaluation of performance of the Catalog in terms of LRs required

  11. Use Case • connect discussions of interoperability • metadata, tools, documentation, standards • design a use case which is inclusive, advanced • automatically training and HLT • harmonized metadata: necessary but not sufficient • corpus descriptions • machine readable to identify locations, syntax, semantics • data catalog registry for non-identical specifications • human readable to assure consistent methodology

  12. Progress to Date • planning refined at 2010 FLaReNet Forum • progress on funding advances in Europe • FLaReNet, T4ME, META-NET, ISOcat • work on the UC continues • work on • UC • LR Wiki • Amazigh, Bengali, Panjabi, Pashto, Tagalog, Tamil, Urdu • ~100 resources per language • LDC Papers Catalog • 2500 papers
