A Road Map for Interoperable Language Resource Metadata
Christopher Cieri¹, Khalid Choukri², Nicoletta Calzolari³, D. Terence Langendoen⁴, Johannes Leveling⁵, Martha Palmer⁶, Nancy Ide⁷, James Pustejovsky⁸
1. Linguistic Data Consortium (LDC) (ccieri @ ldc.upenn.edu)
2. European Language Resources Association (ELRA) (choukri @ elda.org)
3. Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche (glottolo @ ilc.cnr.it)
4. Department of Linguistics, University of Arizona (langendt @ email.arizona.edu)
5. Centre for Next Generation Localisation (CNGL), Dublin City University (johannes.leveling @ computing.dcu.ie)
6. Center for Computational Language and Education Research, Department of Computer Science, University of Colorado, Boulder (Martha.Palmer @ Colorado.edu)
7. Department of Computer Science, Vassar College, USA (ide @ cs.vassar.edu)
8. Department of Computer Science, Brandeis University (jamesp @ cs.brandeis.edu)
Background
• Language resources (LRs) remain expensive to create, thus rare relative to demand
  • accidental re-creation of an LR is a nearly unforgivable waste of scarce resources
• Despite
  • existence of a few large data centers focused on human language technology (HLT)
    • ELRA, LDC
  • prior harmonization project
    • Networking Data Centers
  • union catalog initiative
    • Open Language Archives Community (OLAC)
• HLT researchers must still
  • master multiple metadata sets
  • to search multiple locations
  • in order to find needed resources
• or else risk failing to note the existence of critical LRs, and then
  • recreate them
  • do without them
Recent Pre-History
• OLAC (Open Language Archives Community)
  • LDC, ELRA early adopters
• LREC Universal Catalog, LDC participating
• FLaReNet (Fostering Language Resources Network)
  • A major condition for the take-off of the field of Language Resources and Language Technologies is the creation of a shared policy for the next years.
  • FLaReNet Meeting, Vienna
• SILT (Sustainable Interoperability for Language Technology)
  • aims to turn existing, fragmented technology and resources developed to support language processing technology into accessible, stable, and interoperable resources that can be readily reused across several fields
  • SILT-FLaReNet Meeting, "Towards An Operationalized Definition of Interoperability for Language Technology", Brandeis University, Waltham, Massachusetts, 1-2 November 2009
Current Landscape
• Major data centers maintain their own separate catalogs
  • different metadata languages (categories, terminologies)
  • export subsets of their metadata categories to OLAC
• OLAC provides
  • specifications for OAI (Open Archives Initiative) compliant metadata
  • routines for harvesting, interchanging, searching
• ELRA UC (Universal Catalog)
  • focusing on resources intended for HLT R&D
  • includes a greater percentage of ELRA metadata fields
  • exploits data mining to discover resources not produced or distributed by ELRA
• LREC Map
  • uses the LREC submission process to increase the contribution of LR metadata
• NICT Shachi catalog
  • union catalog of resources
  • records are scraped
  • uses data mining technologies to discover LR features missing from home catalog entries
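To make the harvesting bullet concrete, here is a minimal sketch of reading an OAI-style record into a flat dictionary. It is an illustration under stated assumptions, not OLAC's own harvesting routine: the sample response is hand-written and simplified, the identifier `oai:example.org:LR-0001` and the field values are invented, and `parse_records` is a hypothetical helper name. Only the three XML namespaces are taken from the public OAI-PMH, OLAC, and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces published by OAI-PMH, OLAC, and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "olac": "http://www.language-archives.org/OLAC/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written, simplified response; identifier and values are invented.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:LR-0001</identifier></header>
      <metadata>
        <olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example News Text Corpus</dc:title>
          <dc:language>urd</dc:language>
          <dc:type>Text</dc:type>
        </olac:olac>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per <record>: its identifier plus its metadata fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = {}
        for el in rec.findall("oai:metadata/olac:olac/*", NS):
            # Drop the "{namespace}" prefix so the key is just "title", "language", ...
            name = el.tag.split("}", 1)[-1]
            fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": ident, "fields": fields})
    return records


records = parse_records(SAMPLE_RESPONSE)
```

Each field is kept as a list because a metadata element may repeat within a record; a real harvester would also need to handle paging and error responses, which this sketch leaves out.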
Current Landscape
• LDC LR Wiki
  • identifies LRs (for less commonly taught languages)
  • organized by language and LR type
  • area experts edit individual sections
  • some resources (plain/parallel text, lexicons) identified and even harvested automatically
  • free text description => normalization
• LDC LR Papers Catalog
  • research papers that
    • introduce, describe, discuss, extend or rely upon another LR
  • currently focusing on papers dealing with LDC data
  • full bibliographic information on the paper
  • link to the unique identifier of the LR referenced
Short Term Recommendations
• harmonize LR catalogs of the largest international data centers
• non-reductionist approach to harmonization
  • do not identify a minimal subset that applies to all LR types
  • focus on LRs targeted toward HLT R&D
  • identify the superset of metadata types contained in them
  • distinguish
    • those that can be normalized internally across data centers
    • from those that encode irreconcilable differences
  • agree to normalize, harmonize practice wherever possible
• governance body specifically for this project
  • project partners, sponsors, individual and small group LR providers and LR users
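The non-reductionist approach can be sketched in a few lines: build the superset of fields across catalogs, map those that can be normalized onto shared names, and keep the rest instead of discarding them. Everything named here is an assumption made for illustration: the field names, the two catalogs "A" and "B", the crosswalk, and the `harmonize` helper are hypothetical and do not reflect the actual ELRA or LDC schemas.

```python
# Hypothetical field inventories for two catalogs (not real schemas).
CATALOG_A_FIELDS = {"title", "language", "data_type", "license", "member_year"}
CATALOG_B_FIELDS = {"name", "language", "resource_type", "licence", "validation"}

# Assumed crosswalk: local field name -> shared (harmonized) field name.
# Fields absent from the crosswalk are treated as irreconcilable for now.
CROSSWALK = {
    "A": {"title": "title", "language": "language",
          "data_type": "resource_type", "license": "license"},
    "B": {"name": "title", "language": "language",
          "resource_type": "resource_type", "licence": "license"},
}


def harmonize(fields_by_catalog, crosswalk):
    """Return (superset, unreconciled).

    superset: every harmonized field plus every catalog-specific field,
              the latter prefixed with its catalog so nothing is lost.
    unreconciled: catalog -> set of local fields with no shared equivalent.
    """
    superset = set()
    unreconciled = {}
    for catalog, fields in fields_by_catalog.items():
        mapping = crosswalk.get(catalog, {})
        for field in fields:
            if field in mapping:
                superset.add(mapping[field])
            else:
                # Non-reductionist: keep the field instead of dropping it.
                superset.add(f"{catalog}:{field}")
                unreconciled.setdefault(catalog, set()).add(field)
    return superset, unreconciled


superset, unreconciled = harmonize(
    {"A": CATALOG_A_FIELDS, "B": CATALOG_B_FIELDS}, CROSSWALK
)
```

With these inputs the superset holds four shared fields plus `A:member_year` and `B:validation`, which is the distinction the slide draws between normalizable fields and those that encode irreconcilable differences.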
Outcomes
• harmonized catalogs
  • definition of metadata categories
  • database structure
• search engine customized to HLT LR search
  • controlled vocabulary fields
  • relevance-based search of entire catalog records
• specification of best metadata practices
• centralized metadata repository with a harvesting protocol
• searcher assistance based upon
  • relations among metadata categories (dictionary ≅ lexicon)
  • prior search behavior
    • those who searched for “Gigaword” also searched for “news text corpora”
• metadata creator assistance based on
  • searcher behavior
    • “93% of searchers include a language name in their search” but “87% of all providers include ISO 639-3 language codes”
  • behavior of other metadata providers
    • “the metadata you have provided so far also characterize 32 other resources”
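The two kinds of searcher assistance can be sketched as follows: query expansion over related metadata categories, and "also searched for" suggestions from prior sessions. This is a rough illustration under assumptions, not the planned search engine: the relation table, the session log, and the helper names `expand_query`, `co_search_counts`, and `also_searched` are all invented for the example.

```python
from collections import Counter
from itertools import permutations

# Assumed relation table between metadata categories (dictionary ≅ lexicon).
RELATED_TERMS = {"dictionary": {"lexicon"}, "lexicon": {"dictionary"}}


def expand_query(terms):
    """Lower-case the query terms and add any related category terms."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= RELATED_TERMS.get(term, set())
    return expanded


def co_search_counts(sessions):
    """Count how often two queries occur in the same search session."""
    counts = {}
    for session in sessions:
        for a, b in permutations(set(session), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_searched(query, counts, k=3):
    """Return up to k queries most often co-searched with `query`."""
    return [q for q, _ in counts.get(query, Counter()).most_common(k)]


# Invented session log, for illustration only.
SESSIONS = [
    ["Gigaword", "news text corpora"],
    ["Gigaword", "news text corpora", "treebank"],
    ["treebank", "lexicon"],
]

counts = co_search_counts(SESSIONS)
suggestions = also_searched("Gigaword", counts, k=1)
expanded = expand_query(["Dictionary", "Urdu"])
```

Plain co-occurrence counting is the simplest possible choice here; a production catalog would more likely weight suggestions by recency or normalize for query popularity.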
Middle Term Recommendations
• expand UC scope to include raw data & research papers
  • some work already begun, but not coordinated across data centers and LR creators
    • LCTL LR wiki at LDC
    • Rosetta Project
  • harvest of papers describing LRs
    • at LDC using human effort
    • within the Rexa project using data mining technologies
• integrate effective workflows: social networking, web sourcing, data mining
• enhance UC with links to raw resources including
  • web sites rich in monolingual and parallel text
  • lexicons built for interactive use
• new harmonization challenges
• adjust governance and broaden the scope of its normalization activities
• implement sustainable business models
Requirements
• Representatives of
  • relevant data and metadata centers: ELRA/ELDA, LDC, NICT, and OLAC
  • interested professional organizations: ACL, LSA, LinguistList, SIL, ISCA
  • journals willing to implement a version of the LREC Map: LRE, LILT
  • organizers of conferences who agree to implement the LREC Map: LREC, AFLR
  • related cataloging projects: Rexa Project
  • leading industrial partners
  • related LR development projects and centers: LanguageGrid, CDAC
• Resources
  • support for partners (ELRA, LDC, NIST), some of which is already in place
  • database schema (existing)
  • search engine
  • technology for data mining
  • taxonomy/controlled vocabulary of applications, data types, etc.
• Activities
  • outreach
  • sample output early in the project
  • institutional endorsement
  • evaluation of metadata (user feedback)
  • evaluation of performance of the Catalog in terms of LRs required
Use Case
• connect discussions of interoperability
  • metadata, tools, documentation, standards
• design a use case which is inclusive, advanced
  • automatically training and HLT
• harmonized metadata: necessary but not sufficient
• corpus descriptions
  • machine readable to identify locations, syntax, semantics
  • data catalog registry for non-identical specifications
  • human readable to assure consistent methodology
Progress to Date
• planning refined at 2010 FLaReNet Forum
• progress on funding advances in Europe
  • FLaReNet, T4ME, MetaNet, ISO-CAT
• work on the UC continues
• work on
  • UC
  • LR Wiki
    • Amazigh, Bengali, Panjabi, Pashto, Tagalog, Tamil, Urdu
    • ~100 resources per language
  • LDC Papers Catalog
    • 2500 papers