Provision of access to data for secondary analysis. Louise Corti, Jo Wathan and Keith Cole Economic and Social Data Service E-society Programme March 07. Overview of chapter. Why access secondary quantitative data? brief overview of the potential of secondary data
Provision of access to data for secondary analysis Louise Corti, Jo Wathan and Keith Cole Economic and Social Data Service E-society Programme March 07
Overview of chapter • Why access secondary quantitative data? • brief overview of the potential of secondary data • Finding, accessing and obtaining secondary data • describes the ESDS distributed national on-line data service designed • Case studies – the UK Economic and Social Data Service • practical exemplars of how data can be re-used
Why access secondary quantitative data? Quantitative methods have an important, longstanding place in social research. They can identify: • typical characteristics and background description • the amount of variation within a population of interest • differences between groups • how possible explanatory factors can account for differences • predictions and forecasts Kinds of data: • Micro data resemble the sort of data obtained from a survey • Longitudinal data follow the same individuals (or other study unit) over time • Macro or aggregate data contain records for much larger units, e.g. countries or regions
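The difference between micro data and aggregate data can be sketched in a few lines of Python. This is only an illustration: the records, the variable names (`region`, `age`, `smoker`) and the helper `aggregate_by_region` are all hypothetical and are not taken from any survey held by a data archive.

```python
from collections import defaultdict

# Hypothetical micro data: one record per survey respondent.
micro = [
    {"region": "North", "age": 34, "smoker": True},
    {"region": "North", "age": 51, "smoker": False},
    {"region": "South", "age": 28, "smoker": False},
    {"region": "South", "age": 45, "smoker": True},
    {"region": "South", "age": 60, "smoker": True},
]


def aggregate_by_region(records):
    """Collapse person-level records into one aggregate record per region."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["region"]].append(rec)
    return {
        region: {
            "n": len(rows),
            "mean_age": sum(r["age"] for r in rows) / len(rows),
            "pct_smokers": 100 * sum(r["smoker"] for r in rows) / len(rows),
        }
        for region, rows in groups.items()
    }


# Aggregate (macro) data: one record per region instead of one per person.
macro = aggregate_by_region(micro)
```

Aggregation is one-way: the regional records can be derived from the micro data, but relationships between individual characteristics cannot be recovered from the aggregate records alone.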
Secondary analysis • reduces respondent burden • enables data linkage and the creation of new datasets • informs policy disputes about the interpretation of analyses • provides transparency within research • enables methodologists to learn from each other • allows students to engage with ‘real’ data, to obtain results which relate to the real world and to tackle real problems of data management (substantive social science and research methods teaching)
Data expensive • Collecting good quality, reliable, representative data is expensive and technically demanding • In 2001/2 the British General Household Survey (GHS) sample included all individuals in 8,989 households and cost £1.43 million • In 2001, the American Community Survey collected data from nearly 400,000 interviews in the year at an estimated cost of $131 million
Data historical - enabling trend analysis • In the UK the General Household Surveys (GHS) and Labour Force Surveys (LFS) date back to 1971 and 1973 • In the United States, the General Social Survey series dates back to 1972 and Current Population Survey data date back to 1964 (ICPSR) • Longitudinal studies • US Panel Study of Income Dynamics, started in 1968 • German Socioeconomic Panel in 1984 • British Household Panel Study in 1991
Finding, accessing and obtaining secondary data The development of secondary analysis has depended on the development and growth of social science data archives: • Inter-university Consortium for Political and Social Research (ICPSR) • the UK Data Archive (UKDA) • Zentralarchiv für Empirische Sozialforschung (ZA) • Norwegian Social Science Data Services (NSD) Now networked: • Council of European Social Science Data Archives (CESSDA) • International Federation of Data Organisations (IFDO)
Changing provision • early data archives predated e-social science, and the internet as we know it, by decades • the gradual development of online data archives and dissemination services has varied across the world • the more mature archives have reached the point at which most users will interact with the data service wholly through the internet • internet delivery has broadened the potential role of data services
Functions of the modern archive • acquire - nurture, cajole, plead, evaluate • prepare, document and enhance data – check and add context • store data safely for ever – back up, store and migrate • distribute data - download, explore online • provide support for their use - promote, write, teach • improve resource discovery and data access - R&D
Acquisition and checking • data archives typically select and evaluate potential data collections against criteria designed to ensure that they are appropriate for re-use • assessed for their: • research value, quality, degree of fit with the existing collection • data are checked and validated by the receiving archive by: • examining the data values or text – validation and consistency checking • ensuring that, where required, the data are anonymous • checking for intellectual property and commercial ownership rights in the data
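Validation and consistency checking of data values can be sketched as below. This is a minimal illustration under stated assumptions, not a description of any archive's actual procedure: the code book, the missing-value code `-9`, the under-16 routing rule and the function `validate` are all hypothetical.

```python
# Hypothetical code book: the values each variable is allowed to take.
CODEBOOK = {
    "sex": {1, 2},
    "age": range(0, 121),
    "smokes": {1, 2, -9},  # -9 = missing (an assumed convention)
}


def validate(records, codebook):
    """Return a list of (row index, variable, problem) tuples."""
    problems = []
    for i, rec in enumerate(records):
        # Validation: every variable is present and within its allowed values.
        for var, allowed in codebook.items():
            if var not in rec:
                problems.append((i, var, "variable missing"))
            elif rec[var] not in allowed:
                problems.append((i, var, f"out-of-range value {rec[var]!r}"))
        # Consistency: an assumed routing rule that under-16s are not asked
        # the smoking question, so a substantive answer is suspicious.
        if rec.get("age", 99) < 16 and rec.get("smokes") in {1, 2}:
            problems.append((i, "smokes", "answered but respondent under 16"))
    return problems


records = [
    {"sex": 1, "age": 34, "smokes": 1},
    {"sex": 3, "age": 51, "smokes": 2},
    {"sex": 2, "age": 12, "smokes": 1},
    {"sex": 2, "age": 130},
]
problems = validate(records, CODEBOOK)
```

Each reported problem names the row and variable concerned, so that queries can be raised with the depositor rather than values being silently altered.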
Documentation and metadata Documentation which enables users to understand the origins of the data and to correctly interpret outputs • user guides created - how the data were collected • questionnaires, code books, interviewer instructions, technical reports, original and subsequent publications and outputs • catalogue record, and full variable and value labels (standard used - DDI) • a few archives work closely with data creators in the early stages to ensure that good data management practices are adhered to
Online dissemination • first steps towards online data archiving and dissemination came with the development of archive websites • increasingly sophisticated data catalogues • nowadays, searchable online data catalogues enable users to search and browse collections • and view documentation freely online • online registration – account management, data download • access data via a web browser
New generation data services • online data exploration with tools • Survey Documentation and Analysis (SDA), Nesstar, Beyond 20/20, interactive (GIS) mapping tools • increasingly necessary to link to data sites, offsite support and related datasets as the complexity of the data infrastructure increases • data services may be distributed services • data need not be co-located • social science increasingly looking to the potential of grid technologies
Economic and Social Data Service (ESDS) • new generation distributed data service that provides a seamless integrated service • offers enhanced support for the secondary use of key economic and social data across the research, learning and teaching communities • value-added service goes far beyond the original role of traditional data archives as data storage and dissemination houses • brings together centres of expertise in data creation, dissemination, preservation and use
UK Data archiving history • Data Archive established in 1968 (as ‘Data Bank’) • funded by (then) SSRC to provide a service to UK HE sector • initial focus on academic surveys then government survey data • new distributed service established 1 January 2003 as the ESDS • core archiving service plus four value added specialist services
Types of data • ESDS acquires mixed data types and formats • social surveys • aggregate data • administrative data • textual data • images • audio visual data • UKDA hosts specialist Qualidata unit, Census unit, and History Data Service • since 2005 designated as ‘Place of Deposit’ by The National Archives (TNA) • New data types: • Online surveys, interviews and focus groups • social transaction data • Linked admin data • blogs and so on
Who produces the social science data held by ESDS? • government agencies • increasing tendency for government agencies to contract out survey work to private sector (NatCen) • academic sector • private sector • local government • Research Council funded • ESRC, MRC, NERC, AHRB, Wellcome, Leverhulme • increasing number of large digitisation projects • JISC, NOF • access to international data via links with other data archives worldwide • IGOs
Core Service • run by UKDA • acquiring, processing, preserving and disseminating data • data creation and deposit support • central registration service operating across the ESDS • central 'first stop' help desk service • front line user support • cataloguing and describing data • maintaining and developing web presence • publicity and training
Specialist data services • ESDS Government • ESDS International • ESDS Longitudinal • ESDS Qualidata Greater emphasis on: • value-added data and documentation • enhanced resource discovery • improved delivery services • support and training for the secondary use of data for research, learning and teaching • outreach and promotion
Facts and figures: UKDA • 4,000+ datasets in the collection • 350+ new datasets and editions added each year • 30,000+ registered users • 15,000+ datasets distributed worldwide p.a. • 100,000+ online sessions p.a. • 15,000,000+ web hits p.a.
Data In • Data acquisition • offers and proactive scoping of data • formal data evaluation via committee • Data ingest • checking, verifying • converting, formatting, processing • documenting and contextualising • Data preservation • long-term data management • Preservation Policy
Online exploration • Online data browsing, including • simple data analysis, visualisation, downloading and subsetting via Nesstar • ESDS Government Vital Statistics online • International macro data via Beyond 20/20 and visualisation interface • ESDS Qualidata Online – interview transcripts • Census data services
1: Using Government microdata to explore health • UK is fortunate in its wealth of available major cross-sectional surveys • government surveys are rich resources: • large micro data files with a large number of detailed variables • series of repeated cross sections which enable comparisons over time • nationally representative of the United Kingdom or constituent countries • sample survey data, which may involve a degree of complexity - structure (hierarchical) and sampling strategy • data holdings and documentation are extensive
1: Government data • General Household Survey/Continuous Household Survey (NI) • Labour Force Survey/NI LFS • Health Survey for England/Wales/Scotland • Family Expenditure Survey/NI FES • British/Scottish Crime Survey • Family Resources Survey • National Food Survey/Expenditure and Food Survey • ONS Omnibus Survey • Survey of English Housing • British Social Attitudes/Scottish Social Attitudes/Young People’s Social Attitudes/NI Life & Times • National Travel Survey • Time Use Survey • Vital Statistics for England and Wales
1: Investigating smoking • ESDS high web presence • Google search ESDS pages • ESDS catalogue – advanced searching on key words – study and variable level information • browse by subject • major studies lists • Government series pages • theme guides • publications database • software and analysis guides
1: Accessing Data • register with ESDS, using the online authentication system ATHENS (currently moving towards a new system Shibboleth which provides a greater degree of differentiation in user types) • ESDS Users must specify the purpose for which they will use each data set • registered users can choose to download the whole file (typically SPSS, Stata and tab delimited) or undertake further analyses, including graphing, within Nesstar • more stringent conditions apply to more sensitive data such as detailed microdata with detailed geography (Special Licence)
1: Online exploration • Nesstar system - allows unregistered users to view metadata and univariate distributions online • based on the DDI standard to describe data • permits users to specify subsets and download in a wide range of formats • ability to quickly browse data is useful where particular subsets of cases in the data are of interest • e.g. before using the GHS to analyse people who would like to give up smoking, a user needs to know whether the dataset contains a sufficiently large number of people who smoke but would like to give up
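The check described in the last bullet, counting the cases in a subset of interest before committing to an analysis, can be sketched as follows. The records, the variable names `smoker` and `wants_to_quit`, the threshold `MIN_CASES` and the helper `subset_size` are all hypothetical; they are not GHS variable names or Nesstar functions.

```python
# Hypothetical respondent records; variable names are illustrative only.
respondents = [
    {"smoker": True, "wants_to_quit": True},
    {"smoker": True, "wants_to_quit": False},
    {"smoker": False, "wants_to_quit": None},
    {"smoker": True, "wants_to_quit": True},
    {"smoker": False, "wants_to_quit": None},
]

MIN_CASES = 2  # assumed minimum subset size for the planned analysis


def subset_size(records, **conditions):
    """Count the records that satisfy every variable=value condition."""
    return sum(
        all(rec.get(var) == value for var, value in conditions.items())
        for rec in records
    )


n_quitters = subset_size(respondents, smoker=True, wants_to_quit=True)
enough_cases = n_quitters >= MIN_CASES
```

In practice the threshold depends on the intended analysis; the value used here is only a placeholder.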
1: What can a user do with the data? • multivariate analyses that look within households and analyses that look at change over time • look at relationships between multiple individual characteristics • depth of many questionnaires allows users to explore the validity of existing means of operationalising concepts, or to use new ones
2: Analysing longitudinal health data • true cohort analysis requires information about the same individuals over time • explore the chronological ordering of behaviours or characteristics • ESDS Longitudinal specializes in supporting five major UK-based longitudinal data sets: • British Household Panel Survey (BHPS) • 1970 British Cohort Study (BCS70) • National Child Development Study (NCDS) • Millennium Cohort Study (MCS) • English Longitudinal Study of Ageing (ELSA) • BHPS is a household hierarchical dataset - interviews all members of the households of panel members. Can explore household factors
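Exploring the chronological ordering of behaviours requires records for the same individuals to be linked across waves. The sketch below illustrates the idea on a hypothetical panel in "long" format; the identifiers `pid` and `wave`, the variable `smoker` and the function `transitions` are assumptions and do not correspond to BHPS variable names.

```python
from collections import defaultdict

# Hypothetical panel data in long format: one record per person per wave.
panel = [
    {"pid": 1, "wave": 1, "smoker": True},
    {"pid": 1, "wave": 2, "smoker": False},
    {"pid": 1, "wave": 3, "smoker": False},
    {"pid": 2, "wave": 1, "smoker": True},
    {"pid": 2, "wave": 2, "smoker": True},
    {"pid": 3, "wave": 2, "smoker": False},
    {"pid": 3, "wave": 1, "smoker": False},
    {"pid": 3, "wave": 3, "smoker": True},
]


def transitions(records, variable):
    """List (pid, wave_before, wave_after, old, new) for each change in value."""
    by_person = defaultdict(list)
    for rec in records:
        by_person[rec["pid"]].append(rec)
    changes = []
    for pid, rows in by_person.items():
        rows.sort(key=lambda r: r["wave"])  # put each person's waves in order
        for before, after in zip(rows, rows[1:]):
            if before[variable] != after[variable]:
                changes.append(
                    (pid, before["wave"], after["wave"],
                     before[variable], after[variable])
                )
    return changes


smoking_changes = transitions(panel, "smoker")
```

Repeated cross-sections could show that the overall smoking rate changed, but only linked records of this kind show which individuals changed and when.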
3: Providing a common user interface to international macro data to support comparative research • researchers now require access to the key international evidence bases in order to contribute to and comment on trans-national policy responses to global issues • ESDS International was established to address these needs through the provision of free web-based access to a portfolio of authoritative, high quality international databanks • high quality, regularly updated time series databanks - contain a huge range of macro-economic and social indicators aggregated to national or regional level worldwide
• supported datasets are produced by a number of key International Governmental Organisations (IGOs) such as the International Monetary Fund, the United Nations, the World Bank, the Organisation for Economic Cooperation and Development and the International Energy Agency • access via a common user interface to all the international aggregate datasets, which makes it easy for users to obtain access to data • Beyond 20/20 Web Data Server (WDS) to display, subset, visualize, chart and download data • example: Iraqi exports to the rest of the world 1980-2005 (Source: International Monetary Fund (IMF), Direction of Trade Statistics (DOTS) July 2006)
• CommonGIS used to build a web-based data exploration interface to geographically referenced international data • CommonGIS provides standard GIS functionality and can be used as a tool for visualisation and exploratory analysis based on geographically referenced statistical data • CommonGIS visualization shows the relationship between birth and death rates in European countries in 2005 (source: CIA World Factbook) • the cross classification map shows those countries, such as Moldova, which have high birth and death rates
4: Grid-enabling quantitative datasets to support more complex forms of analysis • Data Grids facilitate unimpeded and integrated use of distributed, heterogeneous, autonomous data resources • grid enabling a dataset creates new opportunities for its use: • enables users to integrate it with other datasets • makes it possible to analyse the dataset using techniques that require the kind of computational power that is only feasible using the Grid (e.g. more complex models, more data points) • standardisation of procedures and mechanisms used to access and update the dataset increases its shareability • automated analyses (i.e. analyses can be re-run automatically when databases are updated)
4: ConvertGrid – Key Objectives • a practical demonstration of how the Grid can be used to facilitate data integration and overcome a major barrier to research use of multiple datasets • demonstrates how to build a social science Data Grid by grid enabling a number of key geo-referenced socio-economic data sources • uses Grid technologies to extend the functionality of an existing web based data service (i.e. Convert) to exploit the existence of a Data Grid • demonstrates how Grid technologies can automate complex workflows and enhance the capacity to address substantive social science research questions; • builds a user interface to a Grid based service which is suitable for student/teaching use
4: ConvertGrid – The Research Context • many research questions require the combination of data from multiple geo-referenced datasets • e.g. linking postcoded data to census geography • conversion of data relating to different geographies to a common target geography: • is a complex, time-consuming task • requires a range of data handling/processing skills • the data conversion process will require users to perform the following generic tasks: • extract and download data in different formats from a number of databases using different interfaces • convert each dataset to the desired target geography using geographical conversion tables • combine the converted sets into a single dataset for analysis • these generic tasks can be automated!
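The convert and combine tasks listed above can be sketched as follows. This is a simplified illustration, not the ConvertGrid implementation: it assumes a conversion table of (source zone, target zone, weight) rows in which each weight is the share of the source zone falling in the target zone, and the zone codes, figures and function names are all hypothetical.

```python
from collections import defaultdict


def convert(values, lookup):
    """Re-apportion counts from source zones to target zones.

    values: {source_zone: count}
    lookup: iterable of (source_zone, target_zone, weight) rows, where
            weight is the assumed share of the source zone in the target.
    """
    out = defaultdict(float)
    for source, target, weight in lookup:
        if source in values:
            out[target] += values[source] * weight
    return dict(out)


def combine(**datasets):
    """Merge several {zone: value} datasets into one record per zone."""
    zones = set().union(*datasets.values())
    return {
        zone: {name: data.get(zone) for name, data in datasets.items()}
        for zone in sorted(zones)
    }


# Hypothetical data: applicants by electoral ward, population by census ward.
applicants_by_electoral_ward = {"E1": 40, "E2": 10}
population_by_census_ward = {"W1": 200, "W2": 150}
electoral_to_census = [("E1", "W1", 0.5), ("E1", "W2", 0.5), ("E2", "W2", 1.0)]

applicants = convert(applicants_by_electoral_ward, electoral_to_census)
combined = combine(population=population_by_census_ward, applicants=applicants)
```

Weighted apportionment of this kind suits counts. Averages such as house prices would need a weighted mean instead, which this sketch does not cover.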
4: ConvertGrid – A Worked Example • what factors explain spatial variations in participation rates in higher education? • study target geography – 1991 Census Ward • data required: • 1991 Census • total persons aged 16-17 & 18-19 (1991 Census Ward) • Neighbourhood Statistics • number of applicants aged under 20 entering university (1998 Electoral Ward) • Experian • average house price sales Quarter 2 2000 to Quarter 1 2001 (1999 Postcode Sectors)
4: ConvertGrid – Data Visualisation Interface • relationship between average house price sales (Experian) and percentage of 16-19 year olds entering university (Neighbourhood Statistics & Census aggregate statistics) • map annotations: high average house price sales but low participation rates; low average house price sales but high participation rates • ten minutes from start to finish
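Once the three sources share a common geography, the relationship shown on the map can be sketched as a participation rate per ward and a two-way classification. The ward figures below are invented, and splitting wards at the median is an assumption: the slides do not say how the high and low classes were defined.

```python
from statistics import median

# Hypothetical ward-level data after conversion to a common geography.
wards = {
    "W1": {"aged_16_19": 200, "applicants": 20, "avg_price": 95000},
    "W2": {"aged_16_19": 150, "applicants": 45, "avg_price": 180000},
    "W3": {"aged_16_19": 100, "applicants": 35, "avg_price": 70000},
    "W4": {"aged_16_19": 250, "applicants": 25, "avg_price": 210000},
}


def cross_classify(wards):
    """Return participation rates (%) and a price/participation class per ward."""
    rates = {w: 100 * d["applicants"] / d["aged_16_19"] for w, d in wards.items()}
    # Assumed cut points: the median of each measure across all wards.
    rate_cut = median(rates.values())
    price_cut = median(d["avg_price"] for d in wards.values())
    classes = {}
    for w, d in wards.items():
        price = "high price" if d["avg_price"] > price_cut else "low price"
        rate = "high participation" if rates[w] > rate_cut else "low participation"
        classes[w] = (price, rate)
    return rates, classes


rates, classes = cross_classify(wards)
```

The off-diagonal classes, high prices with low participation and low prices with high participation, correspond to the areas annotated on the map.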
5: Mixed Methods Data • there is an increasing interest in and recognition of the value of re-using qualitative data • in the past few years there has been a significant move to utilise mixed methods strategies in research • ESDS has seen the deposit of multiple methods datasets combining quantitative and qualitative datasets • processed and supported by dedicated unit - ESDS Qualidata
5: ESDS Qualidata • range of qualitative datasets, hosted by the UK Data Archive • data from Economic and Social Research Council (ESRC) individual and programme research grant awards (Data Policy) • data from ‘classic’ social science studies • other funders/sources • focus on digital collections, but also facilitate paper-based archiving
5: Types of qualitative data • diverse data types: in-depth interviews; semi-structured interviews; focus groups; oral histories; mixed methods data; open-ended survey questions; case notes/records of meetings; diaries/research diaries • multimedia: audio, video, photos and text (most common is interview transcriptions) • formats: digital, paper, analogue audio-visual • data structures - differ across different ‘document types’
5: Classic study datasets • Townsend – Poverty, old age and Katherine Buildings • Thompson – oral history and Edwardians • Goldthorpe et al - The Affluent Worker • Jackson and Marsden – Education and the Working class • National Social Policy and Social Change Archive
5: Schoolchildren’s attitudes towards risk-taking and health • a typical example of a mixed methods study might be undertaking a sample survey and conducting ethnographic fieldwork (e.g. observation and in-depth interviews) based on the survey sample or on other cases • Incidents and the Health-related Behaviour of Schoolchildren, 1997, M. Denscombe • studying ‘critical incidents’ in the life of young people which act as crucial flashpoints in the generation of attitudes towards health-related behaviour
5: Schoolchildren’s attitudes towards risk-taking and health • the project used a mixture of quantitative and qualitative methodology • survey of 1648 children • eleven transcripts of focus group interviews • eight transcripts of interviews - two students together • Denscombe’s in-depth interviews also cover a lot of detail about the role and pressure of exams at the age of 15/16, and future life ambitions
Secondary use? • qualitative aspect can offer a more detailed explanation of a quantitative analysis and possibly enable a more complex model to be built • sequencing of data collection methods or the selection of cases needs to be carefully considered in re-use • in larger data collections, the data types may have been collected by different teams with differing methodological agendas - researchers tend to prioritise one method because of familiarity with the data type and analytic methods • possibility that each method could show conflicting findings - re-users should be aware how they report findings and be reflexive about how the secondary data were selected, confronted and analysed
Collaboration - UK • Government agencies – work closely • Research Councils on formal data sharing policies • Research Centres and Programmes collecting data • Other funding agencies e.g. JISC on technical issues • authentication, digitisation, T&L resources • TNA on records management and preservation practice • E-science on grid enabled data issues, ontologies • Research Methods centres on data quality and secondary analysis
Conclusion • secondary analysis permits a range of valuable analyses to be undertaken quickly, effectively, transparently and with minimal respondent burden • digital formats have enabled users to easily consult full documentation, explore and analyse data online • and to make linkages between appropriate resources in the context of an increasingly complex data infrastructure • data access services themselves may be virtual centres, distributed across multiple sites • anticipate that grid developments will provide increased scope for harmonising access to different data types
Contact www.esds.ac.uk help@esds.ac.uk corti@essex.ac.uk 01206 872145