The current state of Metadata - as far as we understand it -
Peter Wittenburg
The Language Archive - Max Planck Institute
CLARIN Research Infrastructure
Nijmegen, The Netherlands
Old Concept
• of course "metadata" is an old concept
• library cards were introduced to cope with mass and anonymity
• not surprising that library people started thinking about this to describe all kinds of web-accessible resources
• Dublin Core (DC) and qualified DC were the results
• however, the research world is different - it is not just about search
• therefore solutions were developed in many domains
• 2 years ago CLARIN revised its 15-year-old set & framework
Big Ideas
• of course managing increasing amounts of data
• of course finding valuable data in the growing haystacks
• but also
  • machine usage of metadata
    • automatic profile matching
    • research statistics - virtual sub-collection building
    • etc.
  • multilinguality in a multilingual European society
  • interdisciplinary research
    • biodiversity people should find information in linguistic archives
    • etc.
  • linking with contextual information
  • document lifecycle management (provenance)
Big Change
• until now researchers informed each other
  • culture of personal exchange
• claim: this will only work partially in the future
• we will have distributed centers storing lots of data
  • national and discipline dimensions
• depositors upload their data into these centers
• we will have an anonymous landscape of data & tools
  • all offered as services
• what do we have to find things:
  • proper metadata descriptions
  • social tagging by virtual organizations
  • content to operate on by "smart" data mining
Big Question
• are we ready to meet these wishes and changes?
• probably not
• some major issues
  • quality
  • interoperability
  • registry and reference stability
  • functional
  • multilingual
  • scalability
  • IT principles
Quality Issue
• lack of quality in descriptions
  • not all elements filled in
  • (researchers are lazy, lack of tool support)
  • often not schema-based (e.g. XLS spreadsheets) and thus inconsistent
• lack of agreed and standardized vocabularies
  • ISO 639-3 - about 6000 language codes
  • what about subject classification schemes?
  • what about institution names?
  • thus many errors and inconsistencies
• ontologies are expensive to maintain
• misinterpretation/misuse of element semantics
• etc.
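
The first two quality problems (unfilled elements, values outside an agreed vocabulary) are the kind a tool can catch at deposit time. A minimal sketch, assuming a flat record of string fields: the required element names are hypothetical, and the language set is only a four-code excerpt of ISO 639-3, not the full list.

```python
# Hypothetical set of elements every description must fill in.
REQUIRED_ELEMENTS = ("title", "creator", "language", "institution")

# Small excerpt of real ISO 639-3 codes; the full standard has thousands.
KNOWN_LANGUAGE_CODES = {"nld", "deu", "eng", "fra"}


def check_record(record: dict[str, str]) -> list[str]:
    """Return human-readable problems found in one metadata record."""
    problems = []
    # Completeness: every required element must be present and non-blank.
    for name in REQUIRED_ELEMENTS:
        if not record.get(name, "").strip():
            problems.append(f"missing element: {name}")
    # Vocabulary: a filled-in language must be a known code, not free text.
    code = record.get("language", "").strip()
    if code and code not in KNOWN_LANGUAGE_CODES:
        problems.append(f"unknown language code: {code}")
    return problems


record = {"title": "Field notes", "creator": "A. Researcher",
          "language": "Dutch", "institution": ""}
for problem in check_record(record):
    print(problem)
```

A check like this only covers closed vocabularies; institution names and subject classifications would need registries of their own before they could be validated the same way.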
Interoperability Issue
• hampered by different approaches
  • (closed DB, no modularity, embedded ontologies)
• structural difficulties up to context dependency
• difficult semantic mapping
  • different description dimensions
  • bad element definitions
  • bad vocabulary definitions
• only little support for OAI-PMH
• reliance on DC semantics - but these are useless for research, etc.
• often "hardwired" mappings
• lack of a flexible framework to create/share/use relations
• little is standardized - what about lifetime then?
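
One alternative to "hardwired" pairwise mappings is to let every schema point its elements at shared, registered concepts and derive the mapping from that. A minimal sketch, assuming one concept per element: the `concept:` identifiers and the `Title`/`Lang` element names are hypothetical stand-ins for registry entries, not real registry content.

```python
# Each schema declares which registered concept an element refers to.
# The "concept:..." identifiers are hypothetical placeholders.
SCHEMA_A = {"Title": "concept:title", "Lang": "concept:language",
            "Recorder": "concept:recording-device"}
SCHEMA_B = {"dc:title": "concept:title", "dc:language": "concept:language"}


def derive_mapping(source: dict[str, str], target: dict[str, str]) -> dict[str, str]:
    """Map source elements to target elements that share the same concept."""
    by_concept = {concept: element for element, concept in target.items()}
    return {element: by_concept[concept]
            for element, concept in source.items()
            if concept in by_concept}


def translate(record: dict[str, str], mapping: dict[str, str]) -> dict[str, str]:
    """Rename the elements of a record; elements without a mapping are dropped."""
    return {mapping[name]: value for name, value in record.items() if name in mapping}


mapping = derive_mapping(SCHEMA_A, SCHEMA_B)
print(translate({"Title": "Field notes", "Lang": "nld", "Recorder": "DAT"}, mapping))
```

With n schemas this needs n concept declarations instead of up to n*(n-1) hand-written mappings, but it also shows the loss: `Recorder` has no counterpart in the second schema and silently disappears.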
Registry and Reference Stability Issue
• flexibility only when we separate things
  • define & register all concepts in open registries
    • (we are using ISO 12620 - ISOcat)
  • define & register all components/profiles
    • (we are using CLARIN registry)
  • register all mappings (nothing yet)
• but if we do this we need to refer
• are our references stable?
  • some are using Cool URIs - are they just URLs?
  • some are using explicit Handles - are they maintained?
  • who takes care?
  • (we are using EPIC - European PID Consortium)
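
The stability question comes down to indirection: citations hold an identifier, and only a maintained registry knows the current location. A toy sketch of that idea, assuming an in-memory table: `PidRegistry`, the `hdl:00000/demo-1` identifier and the example.org URLs are all hypothetical, and this is not how the Handle System or EPIC is implemented.

```python
class PidRegistry:
    """Toy resolver: references cite the PID, the registry holds the location."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, pid: str, url: str) -> None:
        # A PID is assigned once; re-registering would break existing citations.
        if pid in self._locations:
            raise ValueError(f"PID already registered: {pid}")
        self._locations[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        # Someone has to do this whenever the resource moves ("who takes care?").
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        return self._locations[pid]


registry = PidRegistry()
registry.register("hdl:00000/demo-1", "https://old.example.org/corpus/1")
registry.move("hdl:00000/demo-1", "https://new.example.org/archive/1")
print(registry.resolve("hdl:00000/demo-1"))
```

The citation `hdl:00000/demo-1` never changes; whether it keeps resolving depends entirely on someone calling `move` when the data is relocated, which is the maintenance question the slide raises.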
Functional Issue
• do we address new functional requirements?
• what about provenance information?
  • is it automatically generated?
• what about versions - are they visible?
• what about long-term preservation (LTP) information?
• what about formal access information?
• do we know what is needed for the web services scenario?
  • (profile matching, deployment information, etc.)
Multilingual Issue
• what does it really include?
  • localizing all software
  • multilingual definitions of all concepts
  • elements and vocabulary terms
  • (no translations of proper names, of course - or?)
• or do we simply rely on some lingua franca?
• answer probably discipline dependent
• how much is (or should be) the public involved?
• whatever we do, it is a lot of work
• CLARIN: ISOcat covers almost all major EU languages
Scalability Issue
• are our solutions scalable?
  • in EUROPEANA millions of metadata records
  • in CLARIN about 270,000
• how to structure the offer?
• how to present this to naive users?
• do we share the same granularity?
  • (metadata at collection and/or resource level)
• can we deal with aggregations in the same way?
• can we apply semantic web technology?
  • automatic mapping
  • automatic quality improvement
IT Principles
• we need to disseminate the message of some basic IT principles
  • define and register your semantics
  • specify and register your syntax
  • use a stable reference scheme
  • in some areas separate definitions and relations
• get things standardized or use standards such as
  • XML, some schema language
  • ISO 12620, etc.
  • URI, Handles
What can we do?
• listen to each other first
• increase awareness about metadata and basic principles
• see how we can create an interoperable landscape
  • harmonizing approaches
  • harmonizing along major issues
  • making things explicit and scalable
• look for proper interdisciplinary solutions
moving towards an ideal e-Science domain
Um nicht to end in Babylonish scenario nous avons still algo time om sistemas te improve.
Thanks for your attention.