CM[A]R’s “MarLIN” Metadata System – or, how do we discover what data we’ve got?? Tony Rees, Manager, Divisional Data Centre. 3 June 2005. CSIRO Marine Research.
Talk overview • Data in our overall activities • Data directories and metadata • CMR’s “MarLIN” system • Creating metadata content – roles and issues • Concluding remarks • Questions / discussion items. MarLIN = Marine Laboratories Information Network (1998); Marine & Atmospheric Research Laboratories Information Network (2005 onwards). CMR’s “MarLIN” Metadata System 1
The research process (heavily simplified!) • Problem Formulation • Operationalisation • Data Acquisition • Data Appraisal • Data Interpretation • Solution / New Knowledge • Outputs: scientific publications, reports, products; happy customers; new problems • Data Sources • Data Sinks
Data sources and sinks • Data sources • pre-existing data in CMR / CSIRO holdings • pre-existing data – external (third party) sources • new data collection / generation • Data sinks (feed back to data sources) • project / researcher archives (formal / informal) • CMR centralised data holdings (e.g. in Data Centre) • external repositories / customers’ holdings
Need for a data catalogue • Requirements • (1) Need to know what we already have... • across our Division • across whole agency • (2) Others may also need to know what we have (or subset of the same, that we are in a position to share) • (3) Also need to know what exists elsewhere (potential for acquisition of third party data, where it exists) • Solutions • (1 and 2) Catalogue of our own data assets, using “metadata” • (3) Other agencies’ catalogues, metadata gateways • plus other routes (literature searches, peer-to-peer networking, etc.)
What is this “metadata”? “Metadata is information about data or other information.” (USGS web site) “Metadata is data about data. In other words, it is a structured summary of information that describes the data. Metadata includes, but is not restricted to, characteristics such as the content, quality, currency, access and availability of the data.” (ANZLIC Metadata Guidelines v2, 2001) In practice ... Metadata is structured summary information about [......] E.g.: • a film guide holds metadata about films, e.g. title, director, genre, cast, running length, story synopsis, language, rating... • a scientific publications database holds metadata about scientific publications, e.g. author, title, location in journal, publication date, abstract, keywords... • a data directory (metadata system, in the present context) holds metadata about datasets of value to scientists and other potential users.
Why metadata? • Structured information collection supports powerful information retrieval • Efficient – easier to search / browse the metadata than to obtain and interrogate all the actual resources, in the first instance (i.e., metadata is a surrogate for the resource) • Metadata is human-readable and text searchable; the resource may not be (e.g. rock specimens, images, music, digital data files...) • Collection of metadata into metadata systems supports resource discovery (entry point/s for information) • Captures “corporate memory” – essential information required to understand or re-use the resource (information does not solely reside in people’s heads) • Assists in resource management – knowing what one has is a precursor to managing it well • Can be used for resource distribution – enquirer locates the metadata, then is provided with an access point to the data (e.g. with a web-based system, can then hyperlink to any web-accessible data source).
Who uses metadata? • Science agencies and jurisdictions use it to describe their data holdings – e.g. (in our sphere of interest): • Australian Antarctic Division • Australian Hydrographic Service • BRS • Bureau of Meteorology • Dept. of Environment and Heritage [EA] • Geoscience Australia... • Jurisdictional directories: ACT, NSW, NT, SA, TAS, VIC, WA • Overseas examples – some agency-based, some jurisdictional, some national/thematic (e.g. European marine data, international “global change” data, space/satellite data, etc.) • Frequently, metadata “push” is coming predominantly from the “spatial data” community, i.e. data with a geographic component – but similar principles can be applied to (virtually) any data.
Example: the Australian Spatial Data Directory (ASDD) • Single gateway (portal) to search 20+ metadata systems around Australia concurrently • CMR is represented (but no other CSIRO Divisions currently have a metadata system) • Gateway searching is fairly basic; individual system entry points (e.g. MarLIN) often have more functionality. Both have their place.
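The gateway idea on this slide (one query sent to many metadata systems at once, with the hits merged) can be sketched as below. This is a minimal illustration only: the node names, titles and the functions `search_node` and `gateway_search` are hypothetical stand-ins, and nothing here reflects how the ASDD itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for separate metadata systems.
NODES = {
    "node_a": ["Voyage CTD profiles", "Seabird survey counts"],
    "node_b": ["Coastal tide gauge records", "CTD calibration data"],
}

def search_node(name, term):
    """Return (node, title) pairs whose title contains the term (case-insensitive)."""
    return [(name, title) for title in NODES[name] if term.lower() in title.lower()]

def gateway_search(term):
    """Query every node concurrently and merge the hits into one sorted list."""
    with ThreadPoolExecutor() as pool:
        per_node = list(pool.map(lambda name: search_node(name, term), NODES))
    return sorted(hit for hits in per_node for hit in hits)

results = gateway_search("ctd")
```

A gateway of this kind can only offer the search features that every node supports, which is one reason an individual system's own entry point may be richer.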
How it works in practice... • WWW user (internal / external) → metadata gateways, search engines (e.g. ASDD, Google, etc.) → CMR metadata system → describes / points to ... • specimen collections • images (graphics, photos, video) • online data files / databases • offline data archives • documents (digital and non-digital) • (etc.)
An overseas metadata example... • The UK’s “National Biodiversity Network” (www.nbn.org.uk) holds datasets on species distribution surveys for UK birds, animals, invertebrates, plants, etc...
In other words, metadata supports... • Dataset discovery – via lists and / or structured searches • Dataset appraisal (via descriptive information) – including • what (dataset content) • where (dataset spatial footprint) – if applicable • when (dataset temporal footprint) • who by, why, etc. • Dataset access constraints – who can access, under what conditions • Dataset location and access point • Supplementary information e.g. documentation, images, references, etc.
Metadata standards • Metadata example just shown – own format (no externally recognised standard) • Standards assist interoperability – e.g. • USA – 2 standards currently (one small one, “DIF”; one large one, “FGDC”) • UK / Europe – historically, little standardization – but new ISO standard (2003) now exists (based on US “FGDC” model) • Australia uses “ANZLIC” standard (v.2) – currently pre-ISO, next version will be ISO compatible.
What’s in the ANZLIC standard? • Dataset title, ANZLIC identifier • Custodian organisation and contact • Abstract, search words, bounding box, and Geographic Extent Name or polygon • Start / end dates, progress and maintenance status • Access constraints, stored and available data formats • Data quality (lineage, positional & attribute accuracy, completeness, and logical consistency) • Metadata entry (or last update) date • “Additional metadata” (for anything else) Note: • Bounding box and start / end dates support spatial and temporal searching • “Search words” support structured searches and information retrieval • Remaining fields searchable as free text.
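As a rough sketch of why the bounding box and start / end dates matter, the example below tests whether a record's spatial and temporal footprints overlap a query window. The `Record` class, its field names and the sample records are assumptions made for illustration; they are not the ANZLIC element names or MarLIN's schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Record:
    # Hypothetical subset of a metadata record: title, bounding box, date range.
    title: str
    west: float
    south: float
    east: float
    north: float
    start: date
    end: date

def boxes_overlap(rec, west, south, east, north):
    """True if the record's bounding box intersects the query box.
    Simplification: boxes crossing the 180° meridian are not handled."""
    return (rec.west <= east and rec.east >= west
            and rec.south <= north and rec.north >= south)

def dates_overlap(rec, start, end):
    """True if the record's date range intersects the query range."""
    return rec.start <= end and rec.end >= start

records = [
    Record("Example survey A", 140.0, -45.0, 150.0, -40.0, date(1990, 1, 1), date(1990, 12, 31)),
    Record("Example survey B", 110.0, -35.0, 116.0, -30.0, date(2001, 3, 1), date(2001, 4, 30)),
]

hits = [r.title for r in records
        if boxes_overlap(r, 145.0, -44.0, 148.0, -42.0)
        and dates_overlap(r, date(1990, 6, 1), date(1991, 1, 1))]
```

A bounding box is a coarse footprint: a record can match a query region in which the dataset has no actual sampling.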
What’s missing from the ANZLIC standard (but would be useful)? • Originator organisation (for data obtained from elsewhere) • Contributors, Acknowledgements, References • “Global Project” affiliation (e.g. WOCE, JGOFS, etc.) • Better keywords (ANZLIC ones are very high level) • Better geographic footprint (e.g. by grid squares or similar) – especially for patchy / irregular sampling patterns • CMR Project affiliation • Voyage or survey name (and relevant details) • Species names (if relevant) • Data volume and attributes in the dataset, plus information about its local storage environment • Links to documentation, graphics, and the data itself (where available) (probably some other stuff too, but that is a good start).
CMR metadata standard = “extended ANZLIC”... • ANZLIC + useful extras = “CMR metadata standard” (1998 onwards) • informal set of elements of value to our operations • also, prototype for draft CSIRO metadata standard (2002).
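One way to picture “core standard plus useful extras” is a record type that extends a core type and can be reduced back to the core fields. The sketch below assumes this design for illustration only: the class names, field names and the `to_core` helper are hypothetical and are not taken from MarLIN or the ANZLIC guidelines.

```python
from dataclasses import dataclass, field, fields

@dataclass
class CoreRecord:
    # Hypothetical stand-in for a few core elements.
    title: str
    custodian: str
    abstract: str

@dataclass
class ExtendedRecord(CoreRecord):
    # Hypothetical "useful extras" layered on top of the core elements.
    project: str = ""
    voyage: str = ""
    species: list = field(default_factory=list)
    data_links: list = field(default_factory=list)

def to_core(rec):
    """Reduce an extended record to its core fields only."""
    return CoreRecord(**{f.name: getattr(rec, f.name) for f in fields(CoreRecord)})

rec = ExtendedRecord(
    "Example trawl survey", "Example custodian", "Illustrative abstract.",
    voyage="Example voyage 01", species=["Example species"],
)
core = to_core(rec)
```

Keeping the extras as additions means a record stays valid against the core element set, so it could still be shared with systems that only understand the core.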
The external context – Australian Government • Commonwealth Statement – via Office of Spatial Data Management (OSDM) extract from: AUSTRALIAN GOVERNMENT CUSTODIANSHIP GUIDELINES [for spatial data in this instance]: The Rights and Responsibilities of Spatial Data Custodians • “1. Various Australian Government agencies hold large amounts of spatial data, and will continue to collect more in the future. To achieve efficient and effective acquisition, management and use of spatial data, custodian agencies will be given policy guidelines setting out custodianship rights and responsibilities. • ... 32. A key part of any set of spatial data is the accompanying metadata [...] The custodian of the data is normally the best placed to supply this information. • ... 34. The custodian is expected to facilitate efficient and effective use of the government's data, so as to derive maximum benefit from the investment. Thus the metadata must always be readily available, not just for existing users, but for potential users. The custodian should maintain publicised points of contact for enquiries and be in a position to provide the appropriate metadata promptly.”
The external context – CSIRO • extract from: [draft] CSIRO Scientific Data Management Policy – as submitted to Executive, March 2002 [technically still a “draft awaiting approval”] under Scientific Data Management Roles, Responsibilities and Actions: • Corporate • “Senior management (CEO, Deputy CEOs, Business Unit Chief Executives) are to foster and encourage the development of a culture within CSIRO where:- • the value of scientific data and associated data management is recognised and rewarded; • scientific data assets are shared by staff within the Organisation, and where appropriate, with others outside the Organisation.” • Research projects • “The incorporation of scientific data management objectives into routine R&D planning and development procedures. This should include procedures that will ensure the recording and updating of metadata, ensure the current and future security of the data asset (backup and archiving), ensure the protection of CSIRO’s intellectual property, and resolve issues of data ownership and future access.”
The external context – CSIRO • extract from: CSIRO Scientific Data Management Policy – as submitted to Executive, March 2002 [technically still a “draft awaiting approval”] under Scientific Data Management Roles, Responsibilities and Actions: • Individual Officers • “Adopt a ‘one-CSIRO’ view of the data they collect, analyse, back-up and archive; • Adopt and utilise the CSIRO Metadata Standard and make the recording of metadata a routine part of their work practices; and • Ensure the security of CSIRO scientific data assets.” • (Comment:) • ... the above are mainly “sticks”; “carrots” will be discussed later in the presentation.
What is a “dataset”, in this context? According to the ISO metadata standard (ISO 19115, 2003): • “Dataset: an identifiable collection of data. (NOTE - A dataset may be a smaller grouping of data which, though limited by some constraint such as spatial extent or feature type, is located physically within a larger dataset [...] A hardcopy map or chart may be considered a dataset.)” “In practice” definition: a collection of data sharing common features such as data type, data collection activity or data assembly purpose, management / availability as a discrete unit, etc. • Size of data “chunks” to be described (aka dataset granularity) is a subjective choice – whatever best suits the data custodian, or is most valuable to prospective data users. • Basically it comes down to a “lumping” or “splitting” decision (or set of guidelines); however, splitting down to the atomic level is probably undesirable in this context, for practical reasons.
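The “lumping” versus “splitting” choice can be illustrated by grouping the same data files under different keys: the coarser the key, the fewer (and larger) the datasets that need a metadata record. The file list, its attributes and the `group_into_datasets` helper below are hypothetical, chosen only to show the effect of the choice.

```python
from collections import defaultdict

# Hypothetical data files, each tagged with a collection activity and a data type.
files = [
    {"name": "f1", "voyage": "V01", "type": "CTD"},
    {"name": "f2", "voyage": "V01", "type": "CTD"},
    {"name": "f3", "voyage": "V01", "type": "ADCP"},
    {"name": "f4", "voyage": "V02", "type": "CTD"},
]

def group_into_datasets(items, keys):
    """Group file names into datasets that share the same values for the given keys."""
    groups = defaultdict(list)
    for item in items:
        groups[tuple(item[k] for k in keys)].append(item["name"])
    return dict(groups)

lumped = group_into_datasets(files, ["voyage"])          # one dataset per voyage
split = group_into_datasets(files, ["voyage", "type"])   # one per voyage and data type
atomic = group_into_datasets(files, ["name"])            # one per file
```

Here the same four files need 2, 3 or 4 metadata records depending on the key; the atomic case shows how quickly the number of records to write and maintain grows.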
The CMR metadata story so far... • CMR has the “MarLIN” metadata system – implemented 1997-8 (plus ongoing enhancements) – in-house software, based on EA original (we can modify further as needed) • Records can be “internal”, = CSIRO only (for confidential or third party data), or “public” (open access) • Currently holds 2,100+ dataset descriptions – c.1,000 of these describe centrally-held datasets (Data Centre holdings) • Coverage of other Divisional holdings is patchy at present (a few groups have made an effort, many have not) • To address this in part, it is proposed to construct “skeleton” (template) records for all Divisional science projects (CMR plus CAR, i.e. future CMAR) – to act as a starting point for projects to describe their data holdings • Also, management / project “buy in” to the concept needs further developing.
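The proposed “skeleton” records could be pictured as pre-filled templates generated from a project list, which projects then complete. The sketch below is an assumption-laden illustration: the field names, the placeholder text, the default of “internal” visibility and the project names are all hypothetical, not MarLIN's actual design.

```python
def skeleton_record(project_name, custodian="(to be completed)"):
    """Build a template record for a project; descriptive fields are left blank."""
    return {
        "title": f"{project_name}: data holdings (skeleton record)",
        "project": project_name,
        "custodian": custodian,
        "abstract": "",
        "visibility": "internal",  # assumed default until the project reviews the record
        "status": "skeleton",
    }

# Hypothetical project list.
projects = ["Example project A", "Example project B"]
skeletons = [skeleton_record(p) for p in projects]

# Records still awaiting a description from the project team.
incomplete = [r["project"] for r in skeletons if not r["abstract"]]
```

A list such as `incomplete` gives a simple way to see which projects have not yet described their holdings.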
Searchability • Structured searches – including browse by subject categories, keywords, projects, custodian (site), voyages, species names, and more • Free text searches • Lists of titles – including recent additions / updates • Search by space and time criteria • Search for “own” records via Edit interface • Search via ASDD (Australia-wide metadata gateway) – public records only • Also – text searches via “Google” etc. will find relevant public records.
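The difference between a structured (keyword) search, a free text search and a public-only view can be sketched as follows. The records, field names and the `search` function are hypothetical and only illustrate the idea; they do not describe MarLIN's query interface.

```python
# Hypothetical records with controlled keywords, free text and a visibility flag.
records = [
    {"title": "Example fish survey", "abstract": "Trawl catches of demersal fish.",
     "keywords": {"fish", "trawl"}, "visibility": "public"},
    {"title": "Example mooring data", "abstract": "Current meter time series.",
     "keywords": {"currents"}, "visibility": "internal"},
    {"title": "Example plankton tows", "abstract": "Net tows; includes larval fish counts.",
     "keywords": {"plankton"}, "visibility": "public"},
]

def search(records, keyword=None, text=None, public_only=False):
    """Return titles matching an exact keyword and/or a free text substring."""
    hits = []
    for rec in records:
        if public_only and rec["visibility"] != "public":
            continue
        if keyword and keyword not in rec["keywords"]:
            continue
        if text and text.lower() not in (rec["title"] + " " + rec["abstract"]).lower():
            continue
        hits.append(rec["title"])
    return hits

by_keyword = search(records, keyword="fish")
by_text = search(records, text="fish")
external_view = search(records, text="current", public_only=True)
```

The keyword search returns only records deliberately tagged with the term, while the free text search also picks up incidental mentions; the public-only filter mirrors what an external gateway user would see.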
Example: search (browse) by taxonomic group
Free text search... (e.g. for specialist terms, person names, etc.)