This workshop presented by Norman Paskin discusses the principles of identification and description in managing digital content and licenses. Topics covered include internet identifiers, metadata, and key issues in managing digital entities. The workshop focuses on structured management, persistent identification, and interoperability.
Structured Management of Digital Content and Licenses Electronic Publishing, Digital Archiving and Licensing workshop Frankfurt October 20 2005 Norman Paskin, International DOI Foundation n.paskin@doi.org
Structured Management of Digital Content and Licenses Outline: • Define terms in the title • Two principles: identification and description. • Identification: resolution, persistence, interoperability • Internet identifiers; URI, URN, is DNS enough? • What do we need to identify? • Description: what is it we are identifying? • Metadata: taxonomies, ontologies, folksonomies • Summary of key issues
Structured Management of Digital Content and Licenses Management: • Know what it is you are managing – label it • Require a unique label for an entity involved in a DRM transaction • An identifier string, which can do something Digital Content and Licenses: • Entities in transactions: stuff, people, deals (= content, users, licences) • indecs: “people make stuff, people do deals about stuff; stuff is used by people” • Same system for all these entities, using internet standards Structured: • Objective: capable of being used in distributed systems • Someone else can come along at another time/place, and may need to link to another system, etc. • So must be persistent and interoperable (which means: description)
Two principles for persistent identification (1) Obvious: IDENTIFICATION – assign ID to resource • Once assigned, the number must identify the same resource • Beyond the lifetime of the resource, or the assigner (2) Less obvious: DESCRIPTION – assign resource to ID • The resource must be described • If the resource is not always securely and exclusively bound to the ID – then: • Describe the resource “content” [with precision] • Failure to do this will ultimately break interoperability • How far do we go in each? Depends on what is “good enough” • Technologists have focussed on (1) [and “bags of bits/data structures”] • The content/rights world has focussed on (2) [and on “intellectual content”]: ISBN etc. • Both viewpoints are valid • (2) is now becoming more relevant – because of more open/distributed systems
Identifiers do something • Identifier: A unique label for an entity involved in a transaction • Note the ambiguity of the word “identifier”: • Label (e.g. ISBN) • Specification (e.g. URN): a scheme for making a label actionable • Label + specification = implemented system (e.g. DOI, bar code): an “actionable identifier” • But pure versus actionable identifier is not a clear distinction – any pure identifier may become actionable in the future through new specifications being applied • Resolution: The process in which an identifier is the input (a request) to a network service to receive in return a specific output. • Both concepts are in principle neutral as to technology implementation • Abstract concepts, but implementations are typically at least “internet” TCP/IP (the more general the better, e.g. not just “Web”)
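The step from a pure label to an actionable identifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RESOLVERS` table, the `resolve` function and the example record are hypothetical and are not taken from the talk or from any real resolution service.

```python
# Hypothetical sketch: a label becomes "actionable" once a resolution
# service exists for its scheme. All names and records are illustrative.

# A bare label: unique, but on its own nothing can be done with it.
LABEL = "isbn:123456789"

# The "specification" side: one lookup service per identifier scheme.
RESOLVERS = {
    "isbn": {"123456789": "catalogue record for the identified book"},
}


def resolve(identifier: str) -> str:
    """Take an identifier as the request and return a specific output."""
    scheme, _, value = identifier.partition(":")
    service = RESOLVERS.get(scheme.lower())
    if service is None:
        # A pure label: no specification has (yet) been applied to it.
        raise LookupError(f"no resolution service for scheme {scheme!r}")
    if value not in service:
        raise LookupError(f"{identifier!r} is not registered")
    return service[value]
```

Adding an entry to `RESOLVERS` for a new scheme is the analogue of a pure identifier becoming actionable later, without the label itself changing.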
Technical and social infrastructure issues Persistence • "It is intended that the lifetime of a [persistent identifier] be permanent. That is, the [persistent identifier] will be globally unique forever, and may well be used as a reference to a resource well beyond the lifetime of the resource it identifies or of any naming authority involved in the assignment of its name." • [Persistent Identifier] = URN in IETF RFC 1737: Functional Requirements for Uniform Resource Names. (http://www.ietf.org/rfc/rfc1737.txt)
Interoperability • Persistence can be seen as just one aspect of this wider concept • “persistence is interoperability with the future” • We know what we mean, but others may not. • Identifiers assigned in one context may be encountered, and may be re-used, in another place or time [= persistence] - without consulting the assigner. You can’t assume that your assumptions made on assignment will be known to someone else. Interoperability = the possibility of use in services outside the direct control of the issuing assigner • This will be key for publishing, archiving and licensing – all assume distributed access
Persistent identifiers on the Internet: DNS • Domain Name System: DNS • Designed primarily as a level of indirection for IP addresses: 132.157.24.3 is a machine. Move server.acme.com to another machine, and you don't have to tell everyone: you just change your DNS records so that the name now points to 132.157.24.6 instead. • A number of assumptions that were valid at that time now pose problems: • All the data is public: difficult for use in applications like voice over IP. • The data can be implicitly trusted: but you need some way to trust that you are talking to who you think you are talking to. • The names can all be in ASCII: but Chinese etc. is important after all. • Administration will be done by sys admins sitting at consoles: no need for an administrative protocol. Ownership is then naturally at the level of whoever owns the servers and pays the sys admins. • Control of the naming authority will not be a problem: but control of ICANN and the root zone file is now the subject of a very active UN row (WSIS). • DNS was designed for servers: • When Tim Berners-Lee came out with a plan for linking documents it seemed natural to build on DNS: tack file paths on the end of the server names in order to identify the business ends of the links: URLs (now URIs). • But now the documents are identified starting with the names of the organizations that own the servers they sit on. A problem.
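The indirection described on this slide, and the way URLs inherit the server's name, can be sketched as follows. This is a toy model, not a DNS client: the `dns_records` table, the helper functions, the file path and the second host name are assumptions made for illustration; only the host name and the two IP addresses come from the slide.

```python
# Toy model of DNS as a level of indirection (not a real resolver).
dns_records = {"server.acme.com": "132.157.24.3"}


def lookup(name: str) -> str:
    """Return the IP address currently recorded for a host name."""
    return dns_records[name]


def move_server(name: str, new_ip: str) -> None:
    """Re-point an existing name at a new machine; the name is unchanged."""
    dns_records[name] = new_ip


def url_for(host: str, path: str) -> str:
    """Build a URL by tacking a file path onto a server name."""
    return f"http://{host}{path}"


# Hypothetical document path, for illustration only.
DOC_PATH = "/reports/2005.html"
old_url = url_for("server.acme.com", DOC_PATH)

# Moving the machine is invisible to users of the name...
move_server("server.acme.com", "132.157.24.6")

# ...but moving the document to another organisation's server changes
# its identifier, because the URL starts with the owner's server name.
new_url = url_for("server.other.example", DOC_PATH)
```

The name survives a change of machine, but the document's URL does not survive a change of owner, which is the name/place confusion the slide points to.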
Persistent identifiers on the Internet: Handle • DNS is not essential to the underlying TCP/IP network, but just to the current use of that network. One proposed solution to DNS problems: the Handle System (1995+) • Identify objects, not servers. • Objects can be anything identified: accounts, names, IDs, phone numbers, content… • Explicit improvements for identifying very large numbers of digital objects. • Not all the data is public: individual values within a handle can be private. • All transactions can be certified. • Any Unicode character can be used. • Separation between who owns and controls the handle versus who happens to run the servers (distributed administration, ownership at the handle level) • Gets rid of semantics in the identifier: makes it easy to move ownership across organizations without your objects having someone else's name. • Freely available to be used as the engine underneath other named identifiers. Does not need DNS, but can work with DNS. • Basis of the DOI system – advantages as above, proven for publishers. Used in Grid computing, US govt applications, DOI, etc., though most DOIs are used in translated http proxy form • “The governance of the DNS will not completely encompass future Internet addressing and navigation…The system…is not static but a technology capable of evolving into a better form. As such, the current system should not be treated as sacrosanct, but amenable to innovation”. Kenneth Neil Cukier (Technology Correspondent, The Economist) • However, most identifier methodologies still use the DNS basis: URI, URN
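Two of the points above (values within a handle can be private, and most DOIs are used in http proxy form) can be illustrated with a small data-structure sketch. This is not the Handle System's own data model or API: the `HandleValue` and `HandleRecord` classes, the value types, the example URL and e-mail address are hypothetical, and the `https://doi.org/` proxy address and the sample DOI are assumptions that do not appear in the slides.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import quote


@dataclass
class HandleValue:
    type: str            # illustrative value type, e.g. "URL"
    data: str
    public: bool = True  # individual values can be private


@dataclass
class HandleRecord:
    handle: str          # opaque "prefix/suffix" string, no server name
    values: List[HandleValue] = field(default_factory=list)

    def visible_values(self, authorised: bool = False) -> List[HandleValue]:
        """Return public values, plus private ones for authorised callers."""
        return [v for v in self.values if v.public or authorised]


def proxy_url(doi: str, proxy: str = "https://doi.org/") -> str:
    """Translate a DOI into http proxy form (proxy address is an assumption)."""
    return proxy + quote(doi, safe="/")


record = HandleRecord(
    handle="10.1000/182",
    values=[
        HandleValue("URL", "https://publisher.example/handbook"),
        HandleValue("EMAIL", "rights@publisher.example", public=False),
    ],
)
```

Because the handle string carries no server or owner name, the `URL` value can be updated when the object moves without the identifier itself changing.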
URI: observations • Web based (W3C led). Still much wider uptake than DOI etc. Takes DNS as basis. Problems: • URLs, as currently understood, are demonstrably not persistent: calling them URIs doesn’t fix that • Inherits DNS problems (last slide), especially the name/place confusion • Many important recent developments are not based on URIs in any way, e.g. VoIP (Skype), peer-to-peer • Some are URI based but with different registration requirements (MPEG-21) • The Web is not the end point of evolution: grid computing, mobile computing • The IETF RFC consensus process, and the separate existence of W3C, lead to ongoing debate and standards with a vague existence (cf. ISO standards: W3C web site on naming and addressing is “incomplete”) • That persistence = organisation is now becoming recognised, and technical solutions should follow • e.g. “commitment statement” in archiving is seen as important (ARK) • e.g. IDF has established rules for social network support of DOIs • Importance of social infrastructure • URN mechanism (>10 years old) was meant to be the solution: • But still not implemented – recent renewed interest may help
URN: observations • URN (Uniform Resource Name): using DNS to add names to locations • Part of the mid-90s IETF design concept: URL/URN/URC • Still inherits problems of DNS, but better than URL • But not widely used • A single point of re-direction to URLs using an http: proxy server • Any existing identifier can add the URN spec: • isbn:123456789 as a URN = urn:isbn:123456789. • Assumes a DNS-based Resolution Discovery Service (RDS) • No widely deployed RDS schemes currently exist: browsers cannot action URN strings without some additional programming “plug-in”. • Some have been built for individual communities • Example: Life Science Identifier (LSID) • fine, but also needs a social infrastructure • functionally gives nothing beyond the functionality achieved by coherent management of the corresponding URLs – • but they work for that community, by adding that coherent management. • URN code or plug-in promised for CENDI (US government users). Some movement to “re-define URN”. If that happens and is taken up, it could be significant.
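Adding the URN specification to an existing identifier, as in the isbn example above, amounts to prefixing it with `urn:` and a namespace. A minimal sketch, assuming the simple `urn:<namespace>:<string>` shape shown on the slide; the function names `to_urn` and `parse_urn` are hypothetical, and no resolution is attempted, since the slide notes that no widely deployed resolution discovery service exists.

```python
def to_urn(identifier: str) -> str:
    """Turn 'scheme:value' (e.g. 'isbn:123456789') into 'urn:scheme:value'."""
    scheme, sep, value = identifier.partition(":")
    if not (sep and scheme and value):
        raise ValueError(f"expected 'scheme:value', got {identifier!r}")
    # The namespace part is normalised to lower case.
    return f"urn:{scheme.lower()}:{value}"


def parse_urn(urn: str):
    """Split a URN into (namespace, namespace-specific string)."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return parts[1].lower(), parts[2]
```

The functions only rewrite the string. Making the result actionable would still require a resolver and, as the slide stresses, the social infrastructure to run it.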
Identifier systems • Each community tends to arrive at its own “good enough for us” solution • Less focus now on “what is a persistent identifier?” More on “how do we build a system…” • Whatever the mechanism, resolvable identifiers must provide: • Agreed numbering syntax • Resolution mechanism • Data model to define “what it is we are identifying” • Technical and social infrastructure to implement • (compare physical world bar codes, etc.) • Could be assembled ad hoc, or offered as a packaged system (e.g. DOI)
Identifying entities of all types • Resources (Stuff): most commonly content • Licences (Deals): some music industry applications are now looking at this • Parties (People), including institutions (see the earlier InterParty project): • e.g. an exploratory stakeholders' meeting took place in Washington DC on October 7 to examine the feasibility of an Institution Registry • Problem: libraries deliver contact names and numbers, IP address ranges, etc. to publishers • Publishers manage this in their access and subscription systems in order to be able to authenticate library users • This exchange of information is usually done individually between publishers and libraries: much duplication of effort, no possibility of synergy • An Institution Registry could at minimum provide a central space to hold this information once only.
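The Institution Registry idea (hold each library's contact details and IP ranges once, and let every publisher look them up) can be sketched as below. This is a hypothetical illustration, not a description of any registry that was actually built: the record layout, the identifier `inst:0001`, the contact address and the IP ranges (taken from address blocks reserved for documentation) are all assumptions.

```python
import ipaddress
from typing import Optional

# One central record per institution, held once for all publishers.
REGISTRY = {
    "inst:0001": {
        "name": "Example University Library",
        "contact": "eresources@library.example",
        "ip_ranges": ["192.0.2.0/24", "198.51.100.0/25"],
    },
}


def institution_for(ip: str) -> Optional[str]:
    """Return the identifier of the institution whose ranges contain ip."""
    address = ipaddress.ip_address(ip)
    for institution_id, record in REGISTRY.items():
        if any(address in ipaddress.ip_network(r) for r in record["ip_ranges"]):
            return institution_id
    return None
```

A publisher's access system would call `institution_for` with the address of an incoming request instead of maintaining its own copy of each library's ranges.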
Resolution and “What are we identifying?” • Resolution: The process in which an identifier is the input (a request) to a network service to receive in return a specific output • An identifier identifies an entity. • “What I point to” (resolve to and get) is not always “what is identified” • Can identify but not “get” directly things that are intangible (works), or fugitive (performances), or that change (“Today's NY Times”), or people and concepts… • Pointing and clicking can return different things in different contexts, or give multiple options • Entities can be physical, abstract, tangible, intangible, things, people, concepts, colours… • Resolution provides a mechanism to describe the resource “content” through a service which delivers a description
What are we identifying? “what I point to” (resolve to and get) is not always obvious Document on screen Abstract work? Manifestation of abstract work? Version? This HTML file? All/some of these?
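The gap between “what is identified” and “what I get” can be made concrete with a small sketch in which one identifier resolves to several outputs, one of which is a description. Everything here is hypothetical: the identifier `10.9999/example.1`, the record layout, the URLs and the function name `resolve_in_context` are assumptions for illustration only.

```python
# One identified entity (an abstract work) with several things to "get".
RECORD = {
    "identifier": "10.9999/example.1",
    "identifies": "abstract work",
    "outputs": {
        "description": {"title": "An Example Work", "type": "abstract work"},
        "html": "https://publisher.example/articles/1.html",
        "pdf": "https://publisher.example/articles/1.pdf",
    },
}


def resolve_in_context(record: dict, context: str = None):
    """Return one output for a given context, or all options if none is given."""
    outputs = record["outputs"]
    if context is None:
        # No context supplied: offer multiple options.
        return dict(outputs)
    return outputs[context]
```

The abstract work itself can never be returned. What comes back is a manifestation (an HTML or PDF file) or a description that states what the identifier refers to.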
Describing what we are managing What precisely are we identifying by this identifier? How are these things related to other things? Common approaches: • Taxonomies • Ontologies • Folksonomies
Taxonomy • (Greek) taxis, arrangement; + -nomie, method • Division into ordered groups or categories • Hierarchical, parent/child relationships • Defined area of interest • Gives a good way of being unambiguous within a controlled, defined area • Best example is the Linnaean taxonomy of life: the classification of organisms in an ordered system that indicates natural relationships • And that illustrates a key point…
Taxonomy • “It’s a Robin” • Id = Robin • ..and we all know what a Robin looks like… • “we know what we mean but others may not”
Chordata | Aves | Passeriformes | Turdidae | Erithacus | rubecula European Robin
Chordata | Aves | Passeriformes | Turdidae | Turdus | migratorius American Robin (different genus)
Chordata | Aves | Passeriformes | Eopsaltridae | Petroica | multicolor Scarlet Robin (Australasia) (different family)
? | ? | ? | ? | ? | ? Robin (red) (and Batman)
? | ? | ? | ? | ? | ? Robin Reliant (red)
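The Robin example can be expressed as data: the bare label matches several classified entities, and only the description tells them apart. The classifications below are those given on the preceding slides; the table layout and the `candidates` function are hypothetical.

```python
# Rank names for each position in a classification path.
RANKS = ("phylum", "class", "order", "family", "genus", "species")

# Classified entities that all answer to the label "Robin".
ROBINS = {
    "European Robin": ("Chordata", "Aves", "Passeriformes", "Turdidae", "Erithacus", "rubecula"),
    "American Robin": ("Chordata", "Aves", "Passeriformes", "Turdidae", "Turdus", "migratorius"),
    "Scarlet Robin": ("Chordata", "Aves", "Passeriformes", "Eopsaltridae", "Petroica", "multicolor"),
}


def candidates(label: str, **known):
    """Return entities matching the label and any known ranks (e.g. genus=...)."""
    matches = []
    for name, path in ROBINS.items():
        described = dict(zip(RANKS, path))
        if label.lower() in name.lower() and all(
            described.get(rank) == value for rank, value in known.items()
        ):
            matches.append(name)
    return matches
```

`candidates("Robin")` is ambiguous, while `candidates("Robin", genus="Turdus")` is not. The comic-book Robin and the car have no place in this taxonomy at all, which is the limit of a controlled, defined area.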
Ontologies • Differ from the taxonomic approach: • Not just “stamp collecting” but extensible • Do not follow a rigid parent/child hierarchical structure: terms may inherit meaning from more than one parent • A more complex relationship is maintained. • Can build on / are more complex than taxonomies • Show how taxonomies map to each other • May add inference engines etc. • The proposed third (missing) component of the semantic web: • XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean. • RDF enables expression of meaning (sets of triples, each triple being rather like the subject, verb and object of a sentence) • Ontologies “will enable machines to comprehend semantic documents and data”
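The two ideas on this slide, statements as subject/verb/object triples and terms that inherit from more than one parent, can be sketched with plain Python tuples. This is not RDF syntax and uses no RDF library; the term names, the `subClassOf` predicate label and the `ancestors` function are assumptions chosen for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = {
    ("EuropeanRobin", "subClassOf", "Bird"),
    ("EuropeanRobin", "subClassOf", "GardenVisitor"),  # a second parent
    ("Bird", "subClassOf", "Animal"),
    ("EuropeanRobin", "hasColour", "Red"),
}


def ancestors(term: str) -> set:
    """Follow subClassOf triples upwards through every parent."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in TRIPLES:
            if subject == current and predicate == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found
```

A strict taxonomy would allow only one parent per term. Here `EuropeanRobin` inherits from both `Bird` and `GardenVisitor`, and a simple traversal already acts as a very small inference step.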
Ontologies • Use underlying data model – a “context model” - to express an events-based structure • the accepted ontology approach [context based= events and states] • We often think of metadata as “about” things, people, etc • static views e.g. about “person A” ; “creation B” • Events link things (e.g. to describe rights activities) by relating things and people in the context which generated/used them • dynamic views e.g. “A created B” • Events description is the key to “rights metadata” • all such transactions are contextual (events) • describing the event in context, using formal dictionary terms, enables semantic interoperability • The common methodology with most uptake and promise is the <indecs> one • developed in more detail by CONTECS and by RightsCom • MPEG21 RDD the first result of the extended methodology
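The contrast between static views (“person A”, “creation B”) and dynamic views (“A created B”) can be sketched by recording events and deriving the static views from them. This is a loose illustration of an events-based description, not the indecs data model itself: the `Event` fields, the sample events and the `about` function are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    agent: str     # who
    verb: str      # did what (ideally a formal dictionary term)
    resource: str  # to what
    when: str      # context: time
    where: str     # context: place


EVENTS = [
    Event("person A", "created", "creation B", "2005", "Frankfurt"),
    Event("person C", "licensed", "creation B", "2005", "Frankfurt"),
]


def about(entity: str, events=EVENTS) -> list:
    """Derive a static 'about X' view from the events X took part in."""
    return [
        f"{e.agent} {e.verb} {e.resource}"
        for e in events
        if entity in (e.agent, e.resource)
    ]
```

Rights statements fall out of the same records: the licence event links a person, a creation and a context, which is what the slide means by events description being the key to rights metadata.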
1998-2005: Defining what is identified through metadata [Figure: Development of indecs 1998-2005. Labels in the diagram: indecs (2000); CONTECS (2001+); ISO MPEG21 RDD; IDF + ONIX indecsDD; EU project -> indecs Framework Ltd; IFPI/RIAA, MPA, IDF, DentsuMMG, Rightscom; OntologyX; Mi3p etc. Colour key: black = what, red = who. © 2005 Int DOI Foundation]
Folksonomies • Current hot web topic: individuals assign their own keywords to content • Examples: • www.flickr.com (photo-sharing); • http://del.icio.us/ (social bookmarking)
Folksonomies • Rough and ready alternative to traditional information organisation • Most people use tags first and foremost to organise their own information in a way that makes sense to them • Sharing this creates a side-effect of “vast democratically structured frameworks of organisation” • Not much good for managed structured searching/management: • e.g. “recipe” “cooking” “barbecue” • the Robin problem • But don’t write them off: • cf. Wikipedia (people said it would never work…) • imagine some automated organisation/rules/dictionary being added in certain communities • imagine links to Autonomy-type searching
Summary: key issues • What are we identifying? [content not just bits] • What are we resolving to from this identifier? • What, if any, explicit metadata are we making available? • How will the social infrastructure be provided? The mechanisms must allow: • Identification of entities of all forms • Use in a variety of contexts • Appropriate use of metadata at the appropriate level • Development of ontology tools to describe entity relationships The logic chain: Identification Persistent Interoperable Automation Precision Logic