Atle Alvheim, Norwegian Social Science Data Archive. CESSDA Expert Seminar 2009. A common future? The last 15 years have focused on building up a common data infrastructure for the social sciences, based on modern web technology.
• The web: the idea that the archives could create an integrated catalog, Grenoble 1994
• DDI: a richer and better data documentation format, R. Rockwell / ICPSR, IASSIST 1995
• Integrate 3-4 components:
  • Internet / web / common catalog
  • DDI
  • Access, explore, analyse, download data: the social science dream machine, NESSTAR (J. Ryssevik / S. Musgrave)
  • ILSES: Integrated Library and Survey Data Extraction Service
• Richer services: FASTER (data types), LIMBER (attack the language barrier)
• One single common entry point: MADIERA, Metadater
[Diagram: the CESSDA metadata harvester collects from Nesstar servers 1-3 and OAI-PMH servers 4-5; the harvested metadata feed search (Lucene) and browse (ELSST, topical list).]
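A harvester that talks to OAI-PMH servers can be sketched as follows. This is a minimal illustration, not the CESSDA harvester itself: the base URL and record identifiers are made up, and `oai_dc` is used only because every OAI-PMH server must support it (a DDI metadata prefix would depend on the server). The sketch builds a `ListRecords` request and reads record identifiers and the resumption token from a response.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL for one OAI-PMH server."""
    if resumption_token:
        # In OAI-PMH the resumptionToken is an exclusive argument:
        # it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in
                   root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical response from a server at archive.example (not a real archive).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:archive.example:study-1</identifier></header></record>
    <record><header><identifier>oai:archive.example:study-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A real harvester would loop, requesting the next page with the returned token until no token comes back, and would repeat this for each server in the diagram.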
Resources: CESSDA Template, controlled vocabularies, multilingual thesaurus, CESSDA classification
[Diagram labels: browsing tool, search tool, harvester, indexing tool, portal, publishing, servers 1-4, client, square files]
Bridging the gap
Looking for data... Cultivate knowledge...
The researcher, the archives, the portal
Greenland • Iceland • Faroe Islands • Norway • Sweden • Finland • Åland Islands • Estonia • Latvia • Lithuania • Belarus • Ukraine • Moldova • Poland • Germany • Denmark • England • Scotland • Wales • Northern Ireland • Ireland • Netherlands • Belgium • Luxembourg • France • Portugal • Spain • Andorra • Monaco • Switzerland • Italy • San Marino • Vatican State • Slovenia • Liechtenstein • Austria • Czech Republic • Slovakia • Hungary • Romania • Bulgaria • Serbia • Croatia • Bosnia & Herzegovina • Montenegro • Kosovo • Albania • Macedonia • Greece • Cyprus South • Cyprus North • Malta • Turkey • Russia? • Georgia?? • Armenia??? • Israel????
30 languages, 45 legal systems
We are supposed to support research and break down technical, linguistic, judicial and economic barriers
Several processes: timelines in a layered system
• Access and download
• Control access
• Share formats and routines
• Instrument development
1. Make a more powerful interface to data holdings
   - more sophisticated search / browse possibilities, more focused, even across languages
   - better possibilities to handle results
2. Handle more complex data structures: over time, across space, across languages, link micro – macro
These we may see as "analytic dimensions"
3. Persistent identity: connect knowledge products back to the data used, turn the traditional picture upside down
These are more "practical management"
4. Handle problems of double storage. Data dynamics, more than one value in a table cell
   - Versioning, updating, comments, links, references
   - Adding to the data item
   - Single Sign On: need to pass information and access more than one server, logging
Data have a life-cycle
The archive: a greenhouse or a graveyard?
• Much data generated by the public statistical system or other producers
• Contact with user community
• Metadata standard
• Tool for instrument development
• Tool for data collection
• Tool for documentation
• Question DB, translations

An overarching plan: integration of components
• The researcher formulates a problem and needs data to analyse the problem
• If data have to be collected, we need an instrument, a questionnaire
• When data are collected, with necessary metadata, they represent a SIP
• To make data ready for archiving they have to be documented (and processed), lifted from a SIP to an AIP
Conceptualisation -> Instrument -> Data production (SIP) -> Data documentation (AIP)

Data documentation
Should be based on standardised procedures / best practices and common tools for all CESSDA (+) archives. DDI 2/3 expressed as a Template/DDI-profile, which is
a) selection of elements, with status
b) element repositories
c) controlled vocabularies
d) multi-lingual thesaurus
e) gazetteer, geographic classification
f) CESSDA study classification
This requires software or a manual / clear guidelines. DDI becomes the glue that holds this whole system together.

Question DB
• A questions and concepts DB is a very useful tool to develop instruments
• A questions DB is potentially problematic for data documentation processes. Better to import directly via the questionnaire
• Will make it possible to find questions from concepts (needs an interface)
• Learn from others
• Encourage comparative research
• Look up translations
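The idea of a DDI profile as a "selection of elements, with status", and of lifting a SIP to an AIP once documentation is complete, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the element names, the two status values and the `Package` type are hypothetical and not taken from the CESSDA template.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical profile: element name -> status ("mandatory" or "optional").
PROFILE: Dict[str, str] = {
    "title": "mandatory",
    "abstract": "mandatory",
    "topic_classification": "mandatory",
    "geographic_coverage": "optional",
}


@dataclass
class Package:
    kind: str                  # "SIP" or "AIP"
    metadata: Dict[str, str]   # element name -> documented value


def missing_mandatory(metadata: Dict[str, str],
                      profile: Dict[str, str]) -> List[str]:
    """List mandatory profile elements that are absent or empty."""
    return sorted(
        name for name, status in profile.items()
        if status == "mandatory" and not metadata.get(name, "").strip()
    )


def lift_to_aip(sip: Package, profile: Dict[str, str] = PROFILE) -> Package:
    """Return an AIP copy of a SIP whose documentation satisfies the profile."""
    if sip.kind != "SIP":
        raise ValueError("only a SIP can be lifted to an AIP")
    missing = missing_mandatory(sip.metadata, profile)
    if missing:
        raise ValueError("documentation incomplete, missing: " + ", ".join(missing))
    return Package(kind="AIP", metadata=dict(sip.metadata))
```

Keeping the profile as data rather than code is one way to let all archives share the same rules while the tooling differs from archive to archive.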
Question DB
• A question database will be related to a basic storage
• When an AIP is inserted into an archive or storage it can trigger an update of a question database
• Do updates happen as a guarded / explicit process? What are the criteria?
• Or do updates happen as a harvesting process?
• To what degree are packages pre-defined or built for purposes?
[Diagram: AIP ingest into the data repositories (UKDA, DDA, FSD, our AIPs), each holding metadata and data packages.]
Archive   Storage           Metadata standard   Language
UKDA      Fedora, Other     DDI 2.0, DDI 3.0    English
DDA       Fedora, Nesstar   DDI 2.x             Danish and English
FSD       Nesstar, Other    DDI 3.1             Finnish and English
          Combinations      Combinations

Because of storage complexity, harvesting also becomes quite complex.
Data repositories are guarded by access policies.
• Policies are usually formulated at institution or repository level
• Policies are activated by the crossing of the line between metadata and data, which is at data package level
• Should policies be linked to packages instead of repositories? Should the policy be an obligatory part of metadata? Then we need to have policies formalised.
• Data repositories should be documented in national + common language
• Different documentation templates for national and international language
[Diagram: SSO / AAA in front of the data repositories (UKDA, DDA, FSD), each holding metadata and data packages, with a log database (LOG-DB).]
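One way to picture policies linked to packages rather than repositories is a package that may carry its own formalised policy in its metadata, with the repository policy as the fallback. The sketch below is purely illustrative: the policy names, the "purpose" vocabulary and the class names are assumptions, not an existing CESSDA format.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class AccessPolicy:
    name: str
    allowed_purposes: FrozenSet[str]   # e.g. {"research", "teaching"}


@dataclass
class DataPackage:
    package_id: str
    repository: str
    # Policy carried as part of the package metadata; None means "not set".
    policy: Optional[AccessPolicy] = None


# Hypothetical repository-level defaults.
REPOSITORY_POLICIES: Dict[str, AccessPolicy] = {
    "UKDA": AccessPolicy("ukda-default", frozenset({"research"})),
    "FSD": AccessPolicy("fsd-default", frozenset({"research", "teaching"})),
}


def effective_policy(package: DataPackage,
                     repository_policies: Dict[str, AccessPolicy]) -> AccessPolicy:
    """A package-level policy wins; otherwise fall back to the repository."""
    if package.policy is not None:
        return package.policy
    return repository_policies[package.repository]


def may_download(package: DataPackage, purpose: str,
                 repository_policies: Dict[str, AccessPolicy]) -> bool:
    """Checked when a user crosses the line from metadata to data."""
    return purpose in effective_policy(package, repository_policies).allowed_purposes
```

Browsing metadata never calls `may_download`; the check only applies at the point where the data package itself is requested.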
CV: LifeCycleEvent
• Study Proposal
• Study Design
• Instrument Design
• Funding
• Interviewer training
• Ethics Review
• Sampling
• Instrument pre-testing
• Pilot study
• Questionnaire translation
• Documentation translation
• DATA COLLECTION
• Data collection reports
• Post-collection processing
• Data production
• Initial data quality checks
• Metadata production
• Original release
• DEPOSIT
• Post-production processing
• Data quality checks
• Data editing
• Data integration
• Processing for Disclosure
• Metadata editing
• Preservation package production
• Dissemination package
• New version production
• New version release / publication

From producer to consumer, the data archival work
Locate, explore and download
Cover the whole data (or project?) life-cycle
CESSDA complications: We need services that cover many servers and many conditions for use. The CESSDA data archives will in due time be data providers, aggregators and single service providers alike. This is an illustration of what would presently be the NSD situation.
Functionalities we need, with a scale from producer to consumer
The user authentication problem: almost always at institutional level

The user authorisation problem: very often at resource level
Users, affiliated with national institutions, based on a common justification (research) and working within specific projects (have roles within projects?), want to access data resources in different institutions and countries.
[Diagram: a user reaches datasets 1-3 through the portal; the datasets sit on servers 1 to n at different archives (DDA, ZA, UKDA).]
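The split between authentication at institutional level and authorisation at resource level can be sketched as below. This is a simplified illustration under stated assumptions, not a description of any SSO/AAA product: the `Assertion` fields, the trusted-institution set and the per-dataset grants are all hypothetical. The user's home institution vouches for identity, while each server decides per dataset and logs the decision.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Assertion:
    """Identity statement issued by the user's home institution."""
    user: str
    institution: str
    project: str


# Hypothetical set of institutions whose assertions the servers accept.
TRUSTED_INSTITUTIONS: Set[str] = {"University A"}


@dataclass
class Server:
    name: str
    # Resource-level grants: dataset -> projects allowed to access it.
    grants: Dict[str, Set[str]]
    # Access log: (user, dataset, decision).
    log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def authorise(self, assertion: Assertion, dataset: str) -> bool:
        # Authentication is delegated: trust the issuing institution or not.
        trusted = assertion.institution in TRUSTED_INSTITUTIONS
        # Authorisation is local: decided per dataset on this server.
        allowed = trusted and assertion.project in self.grants.get(dataset, set())
        self.log.append((assertion.user, dataset, allowed))
        return allowed
```

With a single sign-on, the same assertion would be passed to every server the portal touches, and each server would still apply and log its own resource-level decision.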
[Architecture diagram, components numbered 1-12: conceptualisation tool and web browser; instrument; data production (SIP); portal with search and browse (ELSST x time, space, methodology); ELSST query service; harmonisation (and concepts) DB; question DB; intermediate storage; data documentation; data loader (may handle multiple and complex data packages); explore and compare functionality; registry; CESSDA toolkit; download; SSO/AAA; ingest (AIP); data repositories (UKDA, DDA, FSD); log database; policies (repository or package level).]
DDI 2/3 expressed as Template/DDI-profile, as a) selection of elements, with status b) controlled vocabularies c) multilingual thesaurus d) gazetteer e) CESSDA classification
Internal web services stack
• DDI-centric back-end: CESSDA-DB stores all low-level objects
• Web services exposed for public consumption: 3CDB/QDB applications call the relevant WS
• Ingester performs quality assurance, splits metadata and maintains referential integrity for storage in CESSDA
• Annotation shown twice in the diagram: could interact with WS for metadata preparation
[Diagram components: Nesstar Publisher (DDI 1/2.x); DDI 3.0 Converter; DDI 3.0+ Publication Tool; Custom Exporter; Legacy Database; Ingest WS; Metadata Ingester; CESSDA WS; 3CDB, 3CDB WS, 3CDB Applications; QDB, QDB WS, QDB Applications; Future Services, Future WS, Future Applications; banks: Concept, Universe, Classification, Geo, Question, Questionnaire, Instruction, Study, Variable, ...; local objects; non-DDI objects; back-end maintenance and reporting tools: Reporting Tools, Admin Tools, Security Tools.]
Ingestion/Registration Process

Repository: Many metadata repositories can exist around the network. These can be deployed at the provider level, or as shared metadata storage.

Submission: Submission packages are prepared by providers in compliance with the CESSDA DDI3+ specification. Publication tools are used to manage packages and control the ingestion process. Packages are broken down and stored in various banks (as needed). Object registration could be automated upon release of the metadata by the provider. Workflow can be implemented as necessary.

Metadata optimization / harmonization: Optimization of the metadata (merging duplicates, aligning on harmonized objects, etc.) can be done using various automated, semi-automated or manual methods during the various stages of submission (this can also be performed later on).

Interfaces: Note that metadata repositories also expose a set of general and specialized web services along with administrative / security interfaces.

Example: Submission of a Nesstar DDI will typically result in creation of objects in the following banks: study, classifications, variables, instance (files) and possibly concepts, universes, questions, instructions if such variable-level metadata have been compiled.

Example: A legacy system used for the production of questionnaires could create objects in the question, questionnaire, instruction, concepts, universes and classification banks. This may happen outside the context of a survey (question bank), and no variable would be associated with these objects.

[Diagram components: Nesstar Publisher (DDI 1/2.x), DDI 3.0 Converter, DDI 3.0+ Publication Tool, Custom Exporter, Legacy Database, Publication WS, Ingest WS, Metadata Ingester, Repository WS, Metadata Registry, Metadata Repositories (Banks): Concept, Universe, Classification, Geo, Question, Questionnaire, Instruction, Study, Variable, ...]
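The step in which a submission package is broken down and stored in banks can be sketched as follows. The package layout (a mapping from bank name to a list of objects with an `id`) is an assumption made for this illustration, not the CESSDA DDI3+ specification; the bank names follow the examples above. Objects that are already registered are overwritten rather than duplicated, which is the simplest possible stand-in for "merging duplicates".

```python
from typing import Dict, List

# Bank names taken from the examples above.
BANKS = ("study", "classification", "variable", "instance", "concept",
         "universe", "question", "questionnaire", "instruction")


def ingest(package: Dict[str, List[dict]],
           banks: Dict[str, Dict[str, dict]]) -> Dict[str, int]:
    """Split a submission package into banks.

    Returns the number of newly registered objects per bank.
    """
    # Validate first so a bad package leaves the banks untouched.
    unknown = sorted(name for name in package if name not in BANKS)
    if unknown:
        raise ValueError("unknown bank(s): " + ", ".join(unknown))

    new_counts: Dict[str, int] = {}
    for bank_name, objects in package.items():
        bank = banks.setdefault(bank_name, {})
        new = 0
        for obj in objects:
            if obj["id"] not in bank:
                new += 1
            bank[obj["id"]] = obj   # an existing object with this id is replaced
        new_counts[bank_name] = new
    return new_counts
```

A questionnaire-only submission from a legacy system would simply omit the `study` and `variable` keys, matching the second example above.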