What is XML? And Why Do I Care?
In the age of Google, why have fielded data? • More efficient for both data entry and for systems to search, retrieve and ingest • Parsed, discretely fielded data can be recombined mechanically for a variety of outputs and uses, including XML
A popular YouTube video to illustrate the power of XML: “The Machine is Using Us” http://youtube.com/watch?v=NLlGopyXT_g By Michael Wesch, an Assistant Professor of Cultural Anthropology at Kansas State University, this clip illustrates how he can supply the same data content to many Web 2.0 sites. The same principles can be applied to the model of supplying data to various software interfaces and tools in an automated fashion. Stop and watch it now; it will get you in the XML mood!
So…..? This changes the landscape of digital tools for users and support staff. It is no longer a matter of “one-size-fits-all” tools, but a new scenario of multiple tools to fit the users and the use. Supporting multiple tools is less of a burden because the data can be generated once and then automatically transformed by XML stylesheets for each tool, interface, or digital collection.
What is XML? • Extensible Markup Language (XML) is a universal language for sharing data between applications. Because the data is transmitted as text, XML is most appropriate where the volume of data is generally small and where controlling the structure of the data is important. • TRANSLATION: It shuffles data between applications, and users can grab it and send it to a new application too
What XML does • Tags information • Facilitates transfer of that information between applications and also out to the Web (Web 2.0) • Allows information to be structured according to schemas, which organize information and can represent standards (like MARC or VRA Core 4 or Dublin Core)
How does XML work? • It “tags” data: it identifies what that data is (what meaning it holds). MARC tags by using numeric designators: for instance, a “245” field is always a title, and a “700” field is a personal name (often a creator or contributor)
XML tags • XML tags with natural language, so it is easy to see what the information (the data value) is within the “chicken lips” < >
XML example (in VRA Core 4)
<!-- AGENT -->
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
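Because every value sits inside a named tag, a program can pick the record apart without guessing. Here is a minimal sketch in Python, using only the standard library's xml.etree.ElementTree, that reads the agent example. The function name read_agent and the dictionary keys are made up for illustration; they are not part of VRA Core.

```python
import xml.etree.ElementTree as ET

# The agent record from the slide, held as a string.
AGENT_XML = """
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
"""

def read_agent(xml_text):
    """Pull the tagged values out of one agent record (illustrative helper)."""
    root = ET.fromstring(xml_text)
    agent = root.find("index/agent")
    name = agent.find("name")
    return {
        "display": root.findtext("display"),
        "name": name.text,
        # Attributes carry the vocabulary and the authority record id.
        "name_vocab": name.get("vocab"),
        "name_refid": name.get("refid"),
        "earliest": agent.findtext("dates/earliestDate"),
        "latest": agent.findtext("dates/latestDate"),
        "culture": agent.findtext("culture"),
        "role": agent.findtext("role"),
    }

record = read_agent(AGENT_XML)
for field, value in record.items():
    print(f"{field}: {value}")
```

The tag names do the work here: the code asks for "culture" or "role" by name, much as a MARC system asks for a "245".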
Schema: Where the data standard and XML meet Once a data standard like VRA Core 4.0 is devised, with all the elements and qualifiers laid out, the standard can then be expressed in one XML document called the schema. The schema serves as a road map for writing a specific XSLT stylesheet that tells a database (or another type of application) how to export data into (Core 4) XML. A schema is a set of rules to which the XML document must conform to be “valid”.
VRA Core 4.0 XML schema (a small sample)
<!-- Agent -->
<xsd:complexType name="agentType">
  <xsd:annotation>
    <xsd:documentation>VRA Agent element. Subelements are used for different types of data (names, roles, dates, etc.). At least one subelement must be provided.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="1" maxOccurs="unbounded">
    <xsd:element name="attribution" type="basicString" minOccurs="0" />
    <xsd:element name="culture" type="basicString" minOccurs="0" />
    <xsd:element name="dates" type="agentDateType" minOccurs="0" />
    <xsd:element name="name" type="agentNameType" minOccurs="0" />
    <xsd:element name="role" type="basicString" minOccurs="0" />
  </xsd:sequence>
  <xsd:attributeGroup ref="vraAttributes" />
</xsd:complexType>
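To get a feel for what “conforming to the schema” means, here is a rough Python sketch that checks an agent element against two rules read off the sample: only the five listed subelements are allowed, and at least one must be present. This is a hand-written simplification, not real XSD validation (Python's standard library does not include an XSD validator, and a real schema checks much more, such as types and attributes). The function name check_agent and the test records are invented for the example.

```python
import xml.etree.ElementTree as ET

# Subelement names copied from the agentType sample.
ALLOWED_AGENT_CHILDREN = {"attribution", "culture", "dates", "name", "role"}

def check_agent(agent_xml):
    """Return a list of problems; an empty list means these simple checks passed."""
    agent = ET.fromstring(agent_xml)
    problems = []
    children = list(agent)
    if not children:
        problems.append("agent has no subelements; at least one is required")
    for child in children:
        if child.tag not in ALLOWED_AGENT_CHILDREN:
            problems.append(f"unexpected subelement: {child.tag}")
    return problems

# Hypothetical test records.
good = "<agent><name>Cropsey, Jasper Francis</name><culture>American</culture></agent>"
bad = "<agent><nickname>Jasper</nickname></agent>"
empty = "<agent/>"

print(check_agent(good))
print(check_agent(bad))
print(check_agent(empty))
```

In practice you would hand the published .xsd file to a validating XML editor or parser rather than writing checks like these yourself.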
XML example (compare this output to the previous slide, the schema outline for the agent data element)
<!-- AGENT -->
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
What is XSLT? • You can export XML data from FileMaker or Access (and many other programs) to use in an assortment of applications simply by applying the appropriate Extensible Stylesheet Language Transformations (XSLT) stylesheet. XSLT is itself XML-based. You can use a stylesheet to take an XML document and turn it into plain text, PDF documents, or web pages, or to import fielded data into other applications.
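The idea of “one source, many outputs” can be sketched without writing any XSLT. The Python below is a stand-in for what a pair of stylesheets would do, not XSLT itself: it takes one agent element and produces a plain-text line and an HTML snippet from the same tagged data. The function names and the HTML layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

AGENT_XML = """<agent>
  <name>Cropsey, Jasper Francis</name>
  <dates type="life"><earliestDate>1823</earliestDate><latestDate>1900</latestDate></dates>
  <culture>American</culture>
  <role>painter</role>
</agent>"""

def fields(agent):
    """Collect the tagged values that both outputs need."""
    return (
        agent.findtext("name"),
        agent.findtext("culture"),
        agent.findtext("role"),
        agent.findtext("dates/earliestDate"),
        agent.findtext("dates/latestDate"),
    )

def to_plain_text(agent):
    """First 'stylesheet': a one-line text label."""
    name, culture, role, born, died = fields(agent)
    return f"{name} ({culture} {role}, {born}-{died})"

def to_html(agent):
    """Second 'stylesheet': an HTML paragraph built from the same data."""
    name, culture, role, born, died = fields(agent)
    return (
        f"<p><strong>{escape(name)}</strong> "
        f"<em>{escape(culture)} {escape(role)}, {escape(born)}-{escape(died)}</em></p>"
    )

agent = ET.fromstring(AGENT_XML)
print(to_plain_text(agent))
print(to_html(agent))
```

The data is entered once; each output is just a different set of instructions applied to it, which is the job an .xsl file does for a real export.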
XSLT Sample: how the XML is actually exported from a database (in this case FMP)
<!-- Agent -->
<set>
  <display>
    <xsl:value-of select="fm:AgentDisplay" />
  </display>
  <index>
    <xsl:for-each select="fm:AgentSortName/fm:DATA">
      <xsl:variable name="i">
        <xsl:value-of select="position()" />
      </xsl:variable>
      <agent>
File Extensions for the 3 parts of XML So when you see these file extensions, you will know what you are looking at: • The XML document is .xml • The XML schema is .xsd • The XSLT stylesheet is .xsl
Ummm, yeah, OK Will you do coding/tagging for schemas? (No, you will use schemas provided/published for standards—MARC (MODS), VRA 4.0, CDWA lite, etc.) Will you do coding/tagging for XSLT? (Maybe, if you take a class and are interested. More likely you will get tech support or support from user groups) Will you be able to look at an XML document and basically understand it and edit it? (Yes, this is similar to learning HTML and HTML editors)
So how does this fit into my cataloging? VRA Core 4 and CCO were both formed with an eye to output and expression in XML. They can be used in “flat” systems, but there is a clear benefit to using relational databases, and XML is also good at capturing/transmitting relational structure.
Relational Databases • Relate information stored in multiple tables • Ideally, there is no redundancy of data entry—each value that might be reused in data entry is only entered once and stored in one table that is related for use everywhere else in the database (made available anywhere needed in the data entry workflow) • Numeric keys are normally used in this process
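A tiny sketch of the numeric-key idea, using plain Python dictionaries as stand-ins for database tables. The table layout, the key numbers, the work titles and the file names are all hypothetical; a real system would use FileMaker, Access or another database.

```python
# Agent authority table: each agent is entered once, under a numeric key.
agents = {
    1: {"name": "Cropsey, Jasper Francis", "culture": "American"},
}

# Work table: each work points at its agent by key instead of retyping the name.
works = {
    101: {"title": "Hypothetical work A", "agent_id": 1},
    102: {"title": "Hypothetical work B", "agent_id": 1},
}

# Image table: each image points at its work by key.
images = {
    1001: {"file": "a_full.jpg", "work_id": 101},
    1002: {"file": "a_detail.jpg", "work_id": 101},
    1003: {"file": "b_full.jpg", "work_id": 102},
}

def describe_image(image_id):
    """Follow the keys from image to work to agent and assemble one record."""
    image = images[image_id]
    work = works[image["work_id"]]
    agent = agents[work["agent_id"]]
    return {
        "file": image["file"],
        "title": work["title"],
        "agent": agent["name"],
        "culture": agent["culture"],
    }

before = describe_image(1002)

# A correction made once in the authority table...
agents[1]["culture"] = "American (Hudson River School)"

# ...shows up for every image of every work by that agent.
after = [describe_image(i)["culture"] for i in images]
print(before)
print(after)
```

Three image records and two work records share one agent entry, so there is one place to type the name and one place to fix it.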
Excel sample (“flat file” output) Notice that each row represents an image file and conflates the work and image records (repeats the information about the work for each image). Each repeating value (like Artist) must have a column reserved for possible use.
A pithy answer to “why relational?” (for cataloging) Message from Jan Eklund to VRA-L, Feb 20, 2008, subject: Re: CONTENTdm and metadata (search list archive for full message) Complexity: “complexity cannot be captured efficiently in a flat data model because basically you have to leave space in every record to accommodate the most complex object you will ever encounter. This adds up to a lot of wasted space, and wasted space means more money…” Consistency: “all the descriptive data about the work is entered once, and every image that shows this work inherits the same information”
[Screenshot of a database record, with these callouts] • Repeating values are supported for each element • “Indexed” value (in this case the sort name) • Numeric key • A note field is possible for every Core 4 element • “Display” value done to CCO recommended formatting • Note that the Agent Nationality is supplied automatically here by the Link (numeric key) to the Agent Authority
Authority record (screenshot, with the numeric key called out) All the information about the agent is supplied from this file on the basis of the numeric key.
The same information expressed in Core 4 XML (this is automatically output from the database)
<agentSet>
  <display>ACT Architecture (French architectural firm, ca. 1982-present); Gaetana Aulenti (Italian interior designer, born 1927); Victor Alexandre Frédéric Laloux (French architect, 1850-1937)</display>
  <notes>ACT Architecture (Renaud Bardon, Pierre Colboc and Jean-Paul Philippon)</notes>
  <agent>
    <name vocab="ULAN" refid="500023967" type="personal">Laloux, Victor Alexandre Frédéric</name>
    <dates type="life">
      <earliestDate>1850</earliestDate>
      <latestDate>1937</latestDate>
    </dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="LCNAF" refid="nr 95039966" type="corporate">ACT Architecture</name>
    <dates type="activity">
      <earliestDate>1982</earliestDate>
      <latestDate>2082</latestDate>
    </dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="ULAN" refid="500031019" type="personal">Aulenti, Gaetana</name>
    <dates type="life">
      <earliestDate>1927</earliestDate>
      <latestDate>9999</latestDate>
    </dates>
    <culture>Italian</culture>
  </agent>
</agentSet>
Reciprocity in Relationships Easy to show relationships between works in a relational database and via XML. In this case the XSLT stylesheet (in conjunction with programming within the database) can be written to supply the reciprocity (the other related work) based on the numeric key.
Stylesheets can do a lot! They literally do “transformations”—they can change the XML into other formats, they can recombine parsed information—and they can even take that more efficient and consistent relational data and “flatten” it, and output it in csv (Excel) for import into delivery systems or other uses that are not yet XML-compatible!
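Here is a hedged Python sketch of the “flattening” step, again standing in for what an XSLT stylesheet would do. It takes a shortened version of the agentSet example and writes one CSV row with a fixed number of agent columns. The column names (agent1_name and so on) and the limit of three agents are assumptions for the example, not part of any standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened from the agentSet example: two of the three agents.
AGENT_SET_XML = """<agentSet>
  <agent>
    <name vocab="ULAN" refid="500023967" type="personal">Laloux, Victor Alexandre Frédéric</name>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="ULAN" refid="500031019" type="personal">Aulenti, Gaetana</name>
    <culture>Italian</culture>
  </agent>
</agentSet>"""

def flatten_agents(xml_text, max_agents=3):
    """Turn repeating <agent> elements into numbered columns for one flat row."""
    root = ET.fromstring(xml_text)
    row = {}
    for i, agent in enumerate(root.findall("agent")[:max_agents], start=1):
        row[f"agent{i}_name"] = agent.findtext("name")
        row[f"agent{i}_culture"] = agent.findtext("culture")
    # A flat file needs every possible column, whether or not it is used.
    header = []
    for i in range(1, max_agents + 1):
        header += [f"agent{i}_name", f"agent{i}_culture"]
    return header, row

def to_csv(header, row):
    """Write the header and the single row as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()

header, row = flatten_agents(AGENT_SET_XML)
csv_text = to_csv(header, row)
print(csv_text)
```

Notice the empty agent3 columns at the end of the row: that is the reserved space a flat file always has to carry.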
Other Data Standards (field structures) and XML • MARC; MODS • CDWA • Dublin Core • VRA Core 4.0 • EAD • METS
MARC: Machine Readable Cataloging • Emerged from a Library of Congress-led initiative that began in the 1960s for bibliographic (reprographic) materials • Uses numeric tags to designate the fields (“245” means title, “700” fields are makers/creators, etc.) • This enabled computer protocols to share data worldwide • “The future of the MARC formats is a matter of some debate in the worldwide library science community. On the one hand, the formats are quite complex and are based on outdated technology. On the other, there is no alternative bibliographic format with an equivalent degree of granularity. The huge user base, billions of records in tens of thousands of individual libraries, also creates inertia” (Wikipedia entry)
MODS—Metadata Object Description Schema • A schema that allows the traditional numerically tagged MARC to be turned into XML • Can carry data from existing MARC plus allows creation of new XML-based records—a way to integrate and move forward? http://www.loc.gov/standards/mods/
CDWA: Categories for the Description of Works of Art • Developed by the Getty specifically to describe art, architecture and cultural artifacts • A very granular standard: the fields are very narrowly defined and there are many specific fields (as opposed to a few fields that use “qualifiers”). Example: Creation - Commissioner - Commissioner Role • See the CDWA Lite XML schema: http://www.getty.edu/research/conducting_research/standards/cdwa/cdwalite.html
Dublin (Ohio) Core • Developed by OCLC (headquartered in Dublin, OH) (serving 53,500 libraries in 96 countries) • Created to describe “born digital” items in particular • Simple “bins” of data that can be further “qualified” (the difference between Simple DC and Qualified DC) • A qualifier is an element refinement, for example Date.Created
The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 elements: • Title • Creator • Subject • Description • Publisher • Contributor • Date • Type • Format • Identifier • Source • Language • Relation • Coverage • Rights
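What a Simple Dublin Core record might look like in XML can be sketched in a few lines of Python. The namespace URI is the standard one for the 15 elements; the wrapper element named record, the function name build_dc_record and the sample title are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the 15 Simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

SIMPLE_DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

def build_dc_record(values):
    """Build a record from {element: [values]}; every element may repeat."""
    record = ET.Element("record")  # hypothetical wrapper element
    for element, items in values.items():
        if element not in SIMPLE_DC_ELEMENTS:
            raise ValueError(f"not a Simple Dublin Core element: {element}")
        for item in items:
            child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
            child.text = item
    return record

record = build_dc_record({
    "title": ["Hypothetical landscape painting"],
    "creator": ["Cropsey, Jasper Francis"],
    "subject": ["landscapes", "autumn"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The “bins” are broad on purpose: anything that is not one of the 15 names is rejected, and finer distinctions are left to Qualified DC.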
VRA Core 4.0 • Published in April 2007: http://www.vraweb.org/datastandards/VRA_Core4_Welcome.html • A data standard guiding data structure • Formed with an eye to expressing content in XML—with both index and display values • Formed like library records with a “bib” (work) record and an item (image) record • Formed as is Dublin Core with a 1:1 relationship—one record describes one object
EAD (Encoded Archival Description) Started in 1993 at Berkeley; now maintained by the Library of Congress with SAA (Society of American Archivists). Began using SGML, now uses XML. So, tagged and machine-readable, but not necessarily 1:1 records: a simple way to make groups/boxes of material retrievable.
Sample EAD Finding Aid • http://webtext.library.yale.edu/art/art.VRC1.htm • 152 boxes; 64 linear feet of mounted photographs of American painting now in storage • Simply used the outline of the original filing/drawers and tagged them—this translates now to boxes of material with barcodes
METS (Metadata Encoding and Transmission Standard) http://www.loc.gov/standards/mets/ Think of it as an XML “wrapper”: it can describe a group of objects or a collection of different objects, and it can “wrap” around a set of XML items in different formats, so it may be a way to integrate and present them.
METS Profiles UCSD Simple Object Profile • abstract: The UCSD Libraries uses the UCSD Simple Object profile for composing METS instances for digital objects consisting of a single digital content file and associated descriptive, administrative, and structural metadata. The single digital content file may be of any format type, e.g., audio, image, text, or video, and it may be represented in the METS instance with content equivalent file versions. For example, a digital image may be represented in the METS instance by a TIFF file, a JPEG file, and a GIF file, with each containing the same content image.
What do [book] librarians have that VR professionals don’t? Tools and networked utilities for COPY CATALOGING: • MARC (Machine Readable Cataloging) for field structure (data standard) • AACR2 (Anglo-American Cataloging Rules) for data formatting (data content) • XML and Z39.50 (and other protocols) for transmitting data • OCLC as a shared records repository (sustainable business model)
How do we get to shared VR image cataloging? • Have to develop the same general mechanisms as the library world • VRA Core 4.0 = MARC • CCO = AACR2 • XML will be one transmission vehicle/protocol • OAI (Open Archives Initiative) may become a harvesting and retrieval mechanism for record sharing
OAI (Open Archives Initiative)—XML Based http://www.openarchives.org/ Started by 2 computer scientists at Cornell to quickly share information via mechanical “harvesting”—databases are opened to allow harvesting and results are then put in a central repository for searching. It is a “low-barrier” interoperability framework using Dublin Core (in XML) as its minimum standard, but one can also use other standards (expressed in XML) on top of that. Google is using OAI to harvest data from the National Library of Australia. (See also U Michigan’s OAIster project).
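Harvesting under OAI is done with the OAI-PMH protocol, where a harvester sends ordinary web requests and gets XML back. The Python sketch below only builds such a request URL; it does not contact any server. The base URL is a made-up placeholder, and the helper name list_records_url is an assumption. The verb ListRecords and the parameters metadataPrefix, set and from are part of the protocol, and oai_dc is the prefix for the Dublin Core minimum.

```python
from urllib.parse import urlencode

# Hypothetical repository address; substitute a real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (nothing is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # limit the harvest to one collection
    if from_date:
        params["from"] = from_date      # only records changed since this date
    return f"{base_url}?{urlencode(params)}"

print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, set_spec="images", from_date="2008-01-01"))
```

A repository that also exposes a richer standard would be asked for it the same way, by passing a different metadataPrefix value that the repository advertises.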
See—XML matters! Susan Jane Williams Independent Cataloging and Consulting williams.susanjane@gmail.com