Interoperability

Interoperability Kevin Hegg and Andreas Knab 2008 ARLIS/NA-VRA Summer Educational Institute July 11, 2008

The Challenge • How do we connect disparate systems and applications so that users can discover, access and exchange digital content and cataloging data from a coherent interface using their preferred set of desktop tools?

iMovie 8

Photoshop Express

Broad Categories: Systems & Tools (1 of 2) • Digital Assess Management (DAM) with Discovery/Access/Presentation (DAP)Examples: Almagest, ARTstor, CONTENTdm, Luna Insight, MDID • Institutional RepositoriesExamples: Digitools, DSpace, Fedora, VITAL • Online image collections and digital libraries — freely accessible contentExamples: American Memory (Library of Congress), Bristol Biomedical Image Archive, Earth Science World Image Bank, Metropolitan Museum of Art, Museum of Modern Art (MOMA), NASA Image eXchange, New York Public Library Digital Gallery • Online image collections and digital libraries — subscriptionExamples: AccuNet/AP Multimedia Archive, Art Museum Image Gallery (Wilson), ARTstor, CAMIO (OCLC), images MD (Current Medicine, Inc.) • Content aggregators/gatewaysExamples: HarvestRoad Hives, IMLS-DCC, MERLOT, OAIster

Broad Categories: Systems & Tools (2 of 2) • Media-sharing communitiesExamples: Flickr, Picasa, Wikimedia Commons, YouTube • Course Management SystemsExamples: ANGEL, Blackboard, Moodle, Sakai, WebCT • Federated search enginesExamples: Central Search (Serials Solution), LibraryFind (Open Source), MetaLib, Muse, WebFeat • Internet search enginesExamples: Google Image Search, Live Search (Microsoft) • Stand-alone applications and browser-based toolsExamples: Amaznode, ARTstor Offline Image Viewer (OIV), Collex, Cross Media Annotation System, Digital Library eXtension Service (DLXS), Image Innovations Image Manager, iTunes University, PowerPoint (Microsoft), Pachyderm, Scholar's Box, VireoCat, VUE • Social networkingExamples: Facebook, MySpace

Data Exchange Mechanisms

Data Exchange Mechanisms • Protocols, standards, specifications, interfaces, guidelines, etc. used to facilitate the orderly discovery and exchange of data • Three methods for exchanging data: • Linking/redirecting: Digital content is stored on remote system. User is directed to remote system to access content (e.g. federated searches) • On request with optional caching: Digital content is downloaded from remote system as needed and presented to user (e.g. MDID remote collections) • Harvesting (bulk import): Entire collection of digital content is copied from remote system in advance and served to user locally (e.g. Allan Kohl’s AICT collection)

Z39.50 • “ISO 23950: Information Retrieval: Application Service Definition and Protocol Specification” • Maintained by Library of Congress • Defines procedures and formats that a client may use to search a remote database, to learn about the results of the search, and to manipulate and retrieve search results • Complicated and difficult to implement • Used commonly by libraries to facilitate federated searches

SRU/SRW • “Search and Retrieval via URL/Web Service” • Maintained by Library of Congress • Companion protocols used to formulate and execute Internet search queries and to retrieve the query results as a record set • Query results are formatted in MARCXML or Dublin Core • Relatively simple and easy to implement • ARTstor’s XML Gateway is built on SRU/SRW

OAI-PMH • “Open Archives Initiative (OAI) Protocol for Metadata Harvesting” • Developed and maintained by The Open Archives Initiative • Allows data providers (repositories) to expose metadata to client applications (harvesters) and facilitates the aggregation of metadata from more than one repository • Fairly easy to implement • LOC’s American Memory repository implements OAI-PMH

OAI-ORE (Beta) • “Open Archives Initiative (OAI) – Object Re-use and Exchange” • Developed and maintained by The Open Archives Initiative • A companion standard to OAI-PMH that will allow repositories to exchange digital objects and applications to consume digital objects residing in repositories • Fairly easy to implement • OAI released public beta June 2008

OKI OSID • “Open Knowledge Initiative (OKI) Open Service Interface Definition” • Developed and maintained by the Open Knowledge Initiative (housed at MIT) • Describes a set of programmatic interfaces used to achieve interoperability among different repositories built on a variety of evolving technologies • Implemented by a variety of systems and tools – including the Museum of Fine Arts, Boston; the National Library of Australia; ARTstor; Sakai; Fedora; Dspace; Pachyderm; and VUE

Proprietary APIs • An API is “the interface that a computer system or application provides in order to allow requests for service to be made of it by other computer programs, and/or to allow data to be exchanged between them”* • APIs are frequently used to facilitate data sharing and digital object reuse • Examples of systems and tools that use APIs to exchange data and/or digital objects: ARTstor, Blackboard, CONTENTdm, Facebook, Flickr, MDID, Photobucket, Photoshop Express, Picasa, PowerPoint, and YouTube * http://en.wikipedia.org/wiki/Application_programming_interface

Data Exchange Overview Data Exchange Overview

Metadata

Metadata • Metadata is “data about data” and is used to structure and describe any kind of content, including of course digital objects • Metadata helps us discover, evaluate, retrieve and use digital objects • Metadata standards are an important component in data exchange • Result sets must be structured and understandable • Metadata standards provide structure and meaning to content • Many data exchange mechanisms support multiple metadata standards and most support Dublin Core

The Challenge • It is very unlikely that two unrelated repositories will catalog data in exactly the same way using the same metadata standards • How does one system or tool process information from another system? How does the user search a remote collection in a targeted • The solution: A simple, generic metadata standard that supports a wide variety of cross-disciplinary resources

Dublin Core Overview • “The Dublin Core metadata element set is a standard for cross-domain information resource description. It provides a simple and standardised set of conventions for describing things online in ways that make them easier to find”* • The Simple DC consists of 15 optional and repeatable DC elements: Title Creator Subject Description Publisher Contributor Date Type Format Identifier Source Language Relation Coverage Rights • The Qualified DC extends or refines DC elements in order to narrow the meaning of the DC elements * http://en.wikipedia.org/wiki/Dublin_Core

Dublin Core Principles • The One-to-One Principle: “Create one metadata description for one and only one resource”* • For example: Do not describe a JPEG of the Mona Lisa as if it were the original painting. Do not confuse the creator of the JPEG with the painter of the original • The Dumb-down Principle: Translate qualified DC to simple DC so that a user can ignore qualifiers and treat a description as if it were unqualified • Appropriate values: Construct your metadata in such a way that it makes sense to a user outside of your context (e.g. to someone who is not a curator or art historian) * Marty Kurth. Basic DC Semantics. http://dublincore.org/resources/training/dc-2006/Tutorial1.pdf (p. 20)

Sample Dublin Core Record* * http://www.pictureaustralia.org/schemas/pa/pa-slvic-example.xml

Crosswalks • A crosswalk “maps the elements in one metadata scheme to the equivalent elements in another scheme”* • The challenge: Many if not most systems and repositories do not use Dublin Core to catalog data • Digital image repositories often use the VRA Core or proprietary schemas to catalog image records • MDID allows the curator to define a custom catalog structure for each collection • ARTstor aggregates data from many sources and thus is not able to use standardized metadata • The solution: Because many data exchange mechanisms support Dublin Core (e.g. SRU/SRW, OAI-PMH, OKI OSID, MDID API), use a crosswalk to map your data to Dublin Core * http://en.wikipedia.org/wiki/Crosswalk_(metadata)

Curators Map MDID Fields to Dublin Core

Dublin Core/VRA Crosswalk* *Claremont Colleges Digital Library (http://ccdl.libraries.claremont.edu/inside/CCDLmetadata.pdf)

Unicode, Encodings, Character Sets

Where do the funny characters come from?

History • ASCII represents characters with numbers between 32 and 127 (7 bits) • Computers use 8 bits, so there is room for an additional 128 characters

More History • Problem: everybody used the additional 128 characters for something different • Solution: ANSI standard for code pages • Examples • Israel: code page 862 • Greece: code page 737 • All code pages still use the same characters below 128, but different ones above

More History • Problem: Only one code page possible at any time • Problem: Some languages use more different characters than fit in 8 bit • Solution: Unicode

Unicode • Each letter or symbol is represented by a code point (a number) • No limits on numbers • Examples • A is U+0041 • ڬ is U+06AC • ♫ is U+266B

Encodings • There are many different ways to store these code points (numbers) in a file • UCS-2/UTF-16 • Stores every character in two bytes • Problem: byte order • UTF-8 • Stores every character in one to six bytes • Compatible with ASCII/ANSI for first 128 characters

So, where do the funny characters come from?

Encodings • If the expected encoding does not match the actual encoding, characters will be interpreted incorrectly or not at all • Make sure to save your content in an encoding that is understood by the program you want to read it with • UTF-8 generally is the best option

Example: Microsoft Excel

Unicode • Reference: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)by Joel Spolskyhttp://www.joelonsoftware.com/articles/Unicode.html

Cataloging Data File Formats

Many options • Databases • Microsoft Access • FileMaker Pro • Spreadsheets • Microsoft Excel • Text based • CSV (comma separated values) • TSV (tab separated values) • XML (structured text)

Options for exchanging data • Usually text based • CSV and TSV are functionally equivalent • Simple spreadsheet format • Multi-valued fields or hierarchies are difficult to represent • XML • Easily handles multi-valued fields or hierarchies • Structure must be understood

Image Formats

Criteria for picking an image format • Lossy vs. Lossless • Compressed vs. Uncompressed • Archival vs. Presentation • Offline vs. Online • Exchangeable vs. Proprietary • Common vs. Uncommon

Lossy vs. Lossless • Tradeoff between file size and image quality • Archival images should always be lossless • Image quality with lossy formats will get progressively worse with editing or processing • Most formats are lossless • JPEG is lossy

Compressed vs. Uncompressed • Tradeoff between file size and processing time • Processing time is usually no longer an issue • Compressed does not mean lossy • Examples: • TIFF can be compressed or uncompressed • JPEG is compressed and lossy • PNG is compressed and lossless

Archival vs. Presentation • Archival images should always be lossless and highest possible quality and resolution • Presentation images need to be small for quick network transfer • Presentation image quality and size can be lower depending on presentation equipment • As equipment gets better, better presentation images can be derived from archival images

Offline vs. Online • Online images need to be smaller and in a format that is supported by a web browser • Offline images can be larger and in formats that may require specific software to view or process • Example • Adobe Photoshop files are great for post-processing, but cannot be delivered in the browser

Exchangeable vs. Proprietary • Proprietary image formats may be harder to exchange with others • Examples • PSD files require Adobe Photoshop • MrSID files require proprietary plug-ins • TIFF or JPEG are universally understood

Common vs. Uncommon • Some file formats or format options are more popular and widely supported than others • Examples • Transparent GIFs work in any browser • Transparent PNGs only work in newer browsers • 8-bit RGB TIFFs work in almost every program • 16-bit RGB TIFFs or grayscale TIFFs do not

Common Tasks for Curators

Reformatting Data in Excel • Cells • A spreadsheet is made up of cells organized in columns and rows, respectively identified by letters (A, B, C, …) and numbers (1, 2, 3, …) • A cell is referred to by its column letter and row number, for example “A1” or “D25” • Formulas in Excel • Always start with an equal sign “=“ • Adjust their cell references when being moved or copied to other cells

Excel User Interface Current cell Formula in current cell Cell value Dragging little square copies cell formula into adjoining cells

Useful Excel Formulas • Combine cell values • CONCATENATE(cell, cell, …) • Or use “&” operator: cell & cell & … • Combines cell values • Extract parts of a cell value • LEFT(cell, length) • RIGHT(cell, length) • MID(cell, start, length)

Interoperability