
Metadata in NIR

Metadata in NIR. Fabio Vitali University of Bologna Maria Guercio University of Urbino. Introduction. Metadata support has always been present in NIR Recently (June/July 2004) deep (and hot) discussions have happened within the WG about identifying a full set of metadata information


Presentation Transcript


  1. Metadata in NIR • Fabio Vitali, University of Bologna • Maria Guercio, University of Urbino

  2. Introduction • Metadata support has always been present in NIR • Recently (June/July 2004) intense (and heated) discussions took place within the WG about identifying a full set of metadata information • This is the current state of that discussion.

  3. Some terminology • Automatic: any task that can be completely left to the machine to be performed • All kinds of data format conversion • E.g. XML->HTML or NIR XML -> NIR RDF. • Semi-automatic: any task that can, with a certain degree of precision, be performed by the machine, but that still requires a human for final verification and approval. • Identification of structures • E.g. partitioning of documents, identification and interpretation of citations • Manual: any task that needs to be decided upon and performed by a thinking human, even though the machine can provide the support to help him/her and ease the task itself.

  4. Some terminology (2) • Objective • An objective datum is something whose value is not open to reasonable discussion. • E.g. the title of article 15, the publication date • Subjective • A subjective datum is something that requires an active interpretation from a human, which may be wrong or on which different opinions exist • E.g. resolution of implicit citations, classification of provisions • Explicit • A datum that is actually written somewhere in the text • Implicit • A datum that needs to be deduced from external sources, or through the application of specific reasoning

  5. Some terminology (3) • Low competence • The kind of competence one may expect from a non-specialized employee, such as a secretary, armed with just common sense and some topical experience • E.g.: where does article 1 end and article 2 start • High competence • The kind of competence one may expect from highly specialized jurists who reach their results after careful and painstaking reasoning • E.g.: dates and times in norms. • Editorial intervention • by the publisher of a document • Authorial intervention • by the author of a document

  6. Design issues for NIR (1) • Data structure rather than application • Norme In Rete knows about applications, but is not dependent on any use of the data and is not specifically targeted towards any specific application (except presentation) • The same text should be marked in the same way by different editors (at least in the most fundamental structures)

  7. Design issues for NIR (2) • Rigorous distinction of roles • The author of a norm is the legislator, the provider of the actual XML document is the editor. • The legislator is GOD (his decisions cannot be discussed), but He only speaks through the text of the norms. • The editor can add a large quantity of information, but none of it has official status • The very act of adding tags is an editorial operation, subjective and open to discussion. • In fact, any addition coming from editors (structure identification, notes, comments, interpretation) happens outside of the document content (in markup structures or in special metadata sections)

  8. Design issues for NIR (3) • Complexity of the access to texts • Many editors, many publishing systems, many copies in different stages of evolution • There is no authoritative source of XML documents (only of printed documents). • A web site could fail to update a law to its latest version • Using URNs makes it possible to refer to the text of a law without identifying a single existing authoritative source.

  9. Design issues for NIR (4) • Support for description and prescription • Tagging of existing texts can only be descriptive (supporting any possible mess that the legislator may have put in) • Support for legal drafting can be provided, suggesting or enforcing legal drafting rules in the writing.

  10. Design issues for NIR (5) • Everything has a reliable name • Every legal structure needs to be referenced and accessible. • References need to be unambiguous, universal, definitive. • URNs for whole documents • id attributes for substructures and spans • XPointers for even smaller entities.
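
The naming levels above could combine in a single reference. What follows is a minimal sketch, assuming a reference written as a document URN followed by an optional `#` and a substructure id. The URN value, the id and the helper `split_reference` are hypothetical and chosen only for illustration; they are not taken from the NIR specification.

```python
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    document_urn: str            # names the whole document
    fragment_id: Optional[str]   # names a substructure inside it, if any


def split_reference(ref: str) -> Reference:
    """Split 'urn#id' into the document URN and the optional fragment id."""
    urn, sep, fragment = ref.partition("#")
    return Reference(urn, fragment if sep and fragment else None)


# Hypothetical references: the URN and the id are made up for illustration.
whole = split_reference("urn:nir:stato:legge:1996-01-01;1")
part = split_reference("urn:nir:stato:legge:1996-01-01;1#art1-com1")
```

Keeping the document name and the fragment id separate means the same id can be looked up in whichever copy of the document a resolver happens to find.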

  11. Design issues for NIR (6) • Clean separation between objective properties and interpretation • Objective properties can be marked by low-level editors, while interpretation requires experts and high-level editors. • Objective (manifest) properties include identification of boundaries (articles, clauses, etc.) and official facts about texts (publication dates, etc.) • Interpretation includes identification of troublesome dates (dies coactu, dies valens), identification of the normative content of the text's provisions, application of modifications.

  12. Design issues for NIR (7) • Specific support for multiple interpretations • "Disposizioni" (law provisions) can be identified and specified on the text. • Multiple different interpretations of the same text must be allowed • So they can be placed outside of the main document.

  13. Basic structures (1) • Containers • Documents, parts, subparts, articles, etc. • All numbered and titled • Text containers • Clauses (comma), list elements, etc. • Inline elements • Presentation oriented (bold, italics, etc.): discouraged, we rely on HTML elements and CSS styles • Legal oriented (references, modifications, specification of dates, organizations, roles, places, etc.): we rely on specific NIR elements.

  14. Basic structures (2) • Metadata • Publication information and other data supplied by editors (publication notes, document evolution, etc.) • Law provisions for the interpretation of the semantics of the content • Support for irregular texts (those that do not comply with standard legal drafting rules) is available through relaxed syntax in some cases (documentoNIR)

  15. The Schemas for NIR documents • 3 different DTDs • Strict rules (prescriptive) • Loose rules (descriptive) • Light rules (support for most common cases) • They are mutually compatible • The vocabulary is exactly the same • All light documents are also loose • All strict documents are also loose

  16. The need for metadata • Metadata are the only place to put information that was not explicitly written by the legislator. • All possible types of additional information beyond those provided in the text need to find a place here. • Uses: archival, analysis, annotations, automatic processing (consolidation), etc.

  17. Official classification of metadata • A starting point is provided by NISO (US National Information Standards Organization) in the guide "Understanding metadata" (2004): • descriptive metadata to describe a resource "for purposes such as discovery and identification" • structural metadata to indicate "how compound objects are put together" • administrative metadata to provide information "to help manage a resource", articulated (only) as rights management metadata and preservation metadata ("information needed to archive and preserve a resource")

  18. But… • The distinction between descriptive, structural and administrative metadata has no concrete basis in actual practice: • All the communities involved in the preservation of documents have developed and used structure-identification information as a subset of their descriptive systems. They never treat structural data as an independent component. • The ambiguity of administrative metadata is even more evident, specifically in digital systems, where the technological components are less and less relevant for long-term preservation and serve for the physical retrieval of a resource in a digital repository, but are considered part of the descriptive system in the case of web resources.

  19. Metadata in the NIR DTD text • Any kind of information that is provided by the editor rather than by the author. • In a way even tagging text is metadata • Deriving new versions out of an original and a few modification documents is also adding metadata. • But adding proper metadata means providing additional information to a version of a document that can be used to better search, contextualize and understand a document. • [Figure: one XML text document, several XML change documents, and a "meta" block]

  20. Proper metadata in the NIR DTD • Can be specified • In an external document (in RDF - still underspecified) • In an internal section at the beginning of the document (meta) in a NIR vocabulary • In many internal sections near the parts of the text they refer to, in a NIR vocabulary • Conversion back and forth is always possible and automatic. • Deals with description, structure, administration, as well as: • Interpretation of content • Relationships with other documents • Comments and notes

  21. Seven types of proper metadata • Reflective information • Things the document knows about itself • Positioning information • Things the document knows about the norms it expresses and the legal system it belongs to • Lifecycle information • Special moments in the history of the document and of its norms, and the list of other documents that justify them • Editorial notes • Things the editor wants to attach to specific parts of the document but cannot, since the DTD does not allow editorial intervention on content • Iter-connected texts • The history of the document before its approval • Proprietary extensions • Provisions (disposizioni)

  22. Reflection info (descrittori) • Refers to the document, not its content • Publication date. Re-publications. Errata. Official clarifications. • URN(s), aliases • Objective data, easy to find even with low competence • Storing freshness information? • A document does not usually know whether it is up-to-date. We may deal with stale documents, dead web sites, CD-ROMs • The best we can do is to provide them with a last-updated date • The normative system will confirm whether this is the latest relevant date, or whether more recent versions of the same document exist

  23. Positioning info (inquadramento) • Refers to the norms contained in the doc • Missing parts • Rank, function, nature and proposers of the law • Keywords and taxonomies they belong to • Objective data (mostly), but requiring high competence to write down.

  24. Lifecycle (altriatti) - 1 • Over time, documents undergo changes (in content, efficacy, power and so on) • These changes happen at specific points in time and depend on specific documents (modification documents). • Usually modification documents specify several changes on the same modified document, and may specify multiple modification dates. • Therefore it makes sense to create a secondary structure where all relevant moments and documents can be matched

  25. Lifecycle (altriatti) - 2 • [Figure: timeline of a document's lifecycle. Events t01 to t05; states original, modified, suspended, resumed, repealed; versions v01 and v02; dates 1/1/1996, 1/3/1997, 12/6/1998, 24/9/1999, 1/1/2001.]

  26. Lifecycle (altriatti) - 3 • The lifecycle section only provides information about the relation to the document that causes the modifications • This information is objective and can be provided with low competence • Information about each actual modification is optional and placed in the provision section. • That information is sometimes subjective and can be provided only with significant competence
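
The secondary structure described in the lifecycle slides could be modelled as a list of dated events, each tied to the document that causes it. This is a minimal sketch under stated assumptions: the `Event` fields, the `last_event_at` helper and the `doc-A`/`doc-B`/`doc-C` labels are hypothetical, and the pairing of ids, dates and states is assumed for illustration rather than taken from the NIR DTD.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    id: str        # identifier that other metadata can refer to
    when: date     # the moment the change takes effect
    kind: str      # e.g. "original", "modified", "suspended"
    source: str    # the document that causes the change (hypothetical label)


def last_event_at(events: List[Event], day: date) -> Optional[Event]:
    """Return the most recent event on or before `day`, or None if none applies."""
    past = [e for e in events if e.when <= day]
    return max(past, key=lambda e: e.when) if past else None


# Hypothetical lifecycle: the id/date/state pairing is assumed, not official.
events = [
    Event("t01", date(1996, 1, 1), "original", "doc-A"),
    Event("t02", date(1997, 3, 1), "modified", "doc-B"),
    Event("t03", date(1998, 6, 12), "suspended", "doc-C"),
]

status = last_event_at(events, date(1997, 12, 31))
```

Only the link between a moment and its causing document is recorded here; what each modification actually does would live in the provisions, which matches the split between objective and subjective information.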

  27. Other sections • Editorial notes (redazionale) • Footnotes, comments, and any text the editor feels like adding. It can point to specific places in the text through <ndr> elements • Iter-connected data (lavoripreparatori) • An official blurb detailing the iter for the approval of the act, with presentation dates, discussion dates, etc. Plain text. • Proprietary • An open-ended section where editors can add their own metadata with freedom.

  28. Provisions • Provisions describe the meaning of each meaningful fragment of the text according to a predefined (and hopefully complete) taxonomy (ontology???) • Divided into three main sections plus a residual category: • Justifications • Analytical provisions • Modifications • Other

  29. Justifications • Some norms (e.g., decrees) place a foreword before the actual text, providing a number of justifications: • Considered… • Consulted… • Based on a proposal by • Considering… • Etc.

  30. Analytical provisions • Describe properties and meaning of fragments of the actual text. • A full taxonomy exists, including concepts like definition, obligation, right, etc. • Carlo will be speaking about them

  31. Modifications • In a modifying law, each modification can be described in detail with a provision. • The provision describes in detail the kind of modification, the document it is applied to, where inside it, and when. • Possible modifications are: abrogation, substitution, insertion, renumbering, change of terms, prorogation, repetition, suspension, retro-activity, ultra-activity, etc. (a total of 24 different types). • Currently there is no way to express the normal case (dies coactu = dies valens = 15 days after publication for the whole act), but a way will be found soon.
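
The normal case mentioned above is plain date arithmetic. This is a minimal sketch, assuming a simple offset of 15 calendar days from the publication date; the function name `default_terms`, the dictionary keys and the sample date are hypothetical, and legal rules about how days are counted are not modelled.

```python
from datetime import date, timedelta
from typing import Dict

# Assumption: the default delay is a plain 15-calendar-day offset.
DEFAULT_DELAY = timedelta(days=15)


def default_terms(publication: date) -> Dict[str, date]:
    """Return the default dies coactu and dies valens for a whole act."""
    start = publication + DEFAULT_DELAY
    # In the normal case both dates coincide.
    return {"dies_coactu": start, "dies_valens": start}


# Hypothetical publication date, for illustration only.
terms = default_terms(date(2004, 7, 1))
```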

  32. Arguments for provisions • All provisions have some specific arguments, plus some shared arguments • E.g.: <motivazioni> <regole> <obbligo> <pos href="#art12com5"/> <destinatario>sindaco</destinatario> <controparte>ufficio tributi</controparte> <termine da="r01" a="r02"/> </obbligo> … </regole> • Important shared arguments are positions and terms

  33. Positions • All provisions point to a position inside the document where the text of the provision is placed. <articolo id="art1"> <num>1.</num> <comma id="art1-com1"> <num>1</num> <corpo>blah blah</corpo>… <obbligo> <pos href="#art1-com1"/> <destinatario>xxx</destinatario> <controparte>y1</controparte> </obbligo> • The pos element points to the id, or XPointer, or the text content, of the part of the document that contains the provision.
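
Resolving a position by id could look like the following. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `<doc>` wrapper, the simplified document and the helper `resolve_pos` are hypothetical, and the `href` is written so that it matches the id of the comma. XPointer and text-content targets are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, simplified document; real NIR documents have more structure.
DOC = """
<doc>
  <articolo id="art1">
    <num>1.</num>
    <comma id="art1-com1">
      <num>1</num>
      <corpo>blah blah</corpo>
    </comma>
  </articolo>
  <obbligo>
    <pos href="#art1-com1"/>
    <destinatario>xxx</destinatario>
    <controparte>y1</controparte>
  </obbligo>
</doc>
"""


def resolve_pos(root: ET.Element, pos: ET.Element) -> Optional[ET.Element]:
    """Return the element whose id matches the fragment in pos/@href."""
    target_id = pos.get("href", "").lstrip("#")
    for element in root.iter():
        if element.get("id") == target_id:
            return element
    return None


root = ET.fromstring(DOC)
pos = root.find("./obbligo/pos")
target = resolve_pos(root, pos)
```

A provision whose `href` matches no id resolves to `None`, which gives an editor tool a simple way to flag dangling positions.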

  34. Terms • Specify conditions, and specific efficacy (dies coactu) and validity (dies valens) intervals. • No formal language exists yet for specifying conditions • E.g.: “after the approval of the corresponding regulation” • Dates are specified by referring to the id of the relevant date as placed in the lifecycle section.
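
Looking up the dates of a term through the ids of lifecycle events could be sketched as follows. The `da` and `a` attributes of `<termine>` follow the example in the transcript; the `<meta>`, `<lifecycle>` and `<event>` elements, their attributes, the ISO date format and the helper `resolve_term` are assumptions made for illustration, not the NIR vocabulary.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Tuple

# Hypothetical fragment: element names other than <termine> are assumptions.
META = """
<meta>
  <lifecycle>
    <event id="r01" date="1996-01-01"/>
    <event id="r02" date="1997-03-01"/>
  </lifecycle>
  <termine da="r01" a="r02"/>
</meta>
"""


def resolve_term(root: ET.Element, termine: ET.Element) -> Tuple[date, date]:
    """Map the da/a event ids of a term to the dates stored in the lifecycle."""
    dates = {
        event.get("id"): date.fromisoformat(event.get("date"))
        for event in root.iter("event")
    }
    # A KeyError here means the term refers to an event that is not declared.
    return dates[termine.get("da")], dates[termine.get("a")]


root = ET.fromstring(META)
start, end = resolve_term(root, root.find("termine"))
```

Storing each date once in the lifecycle and referring to it by id keeps every term that shares a date consistent when that date is corrected.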

  35. Conclusions • Metadata are still under heavy evolution within the NIR WG. • In the last four months major work has begun to perform a systematic analysis of the desired metadata information for NIR documents. • I haven't even mentioned namespaces • Some details are still shaky (required elements, repeatable elements, conditions, default values), but the structure should be reasonably stable. • These are not in the published version: it is still way too early.
