Researching e-Science Analysis of Census Holdings • www.ucl.ac.uk/reach/ • Dr Melissa Terras, School of Library, Archive and Information Studies, University College London • m.terras@ucl.ac.uk
e-Science and the Humanities • Little use has been made of the computational grid in humanities research • The aims of the ReACH project were • To establish the potential of applying grid technologies to analyse a complex and rich humanities dataset • Pre-digitised • Historical census data • Of interest to academic researchers and general public • To investigate how e-Science technologies may be appropriated in the arts and humanities • Academic, Technical, Legal, Managerial, aspects of analysing large scale pre-digitized datasets using e-Science technologies • Understand the characteristics and features of large scale humanities datasets which differentiate them from scientific datasets • How does this affect the application of e-Science for research in the arts and humanities?
Partners • UCL SLAIS • Digital humanities, informatics, archives and digital preservation • UCL Research Computing • World leading expertise in High Performance, Grid and e-Science computing • “Research Computing” • High Levels of SRIF funding • The National Archives • who select, preserve and provide access to, and advice on, historical records, • e.g. the censuses of England and Wales 1841-1901 (and also the Isle of Man, Channel Islands and Royal Navy censuses) • Ancestry.co.uk • who own a massive dataset of census holdings worldwide, and who have digitized the censuses of England and Wales under license from The National Archives
Historical Census Data • England and Wales Census Data • 1841-1901 • 7 different censuses taken at 10 year intervals • 20 GB, 200 million records • Complex data set • Fields vary between each census year • Errors • from those supplying the data • from those writing down those answers • from those transcribing those answers into the enumerator returns • from those entering the data into the digital version of the records
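As a rough sanity check on the figures above, the following sketch works out the number of censuses and the average storage per record. It assumes decimal gigabytes and treats "200 million records" as an exact count, neither of which the slide states. The per-census average is only indicative, since the population grew across the period.

```python
# Census years covered by the project: 1841 to 1901 at 10-year intervals.
census_years = list(range(1841, 1901 + 1, 10))

# Figures quoted on the slide. Decimal gigabytes are an assumption.
total_bytes = 20 * 10**9
total_records = 200 * 10**6

# Average storage per record and a naive average number of records per census.
bytes_per_record = total_bytes / total_records
records_per_census = total_records / len(census_years)

print("Census years:", census_years)
print("Number of censuses:", len(census_years))
print("Average bytes per record:", bytes_per_record)
print("Naive average records per census:", round(records_per_census))
```

At roughly 100 bytes per record, each entry holds only a handful of short transcribed fields, so any single field error is a significant part of the record.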
Overview of aims • Ascertain whether it would be technically possible • Ascertain whether access to the data would be feasible • Ascertain whether it would be useful to historians • Ascertain whether the results from the project would be worthy of the intellectual and financial investment • And what financial investment would be required to undertake the project
Data • How do humanities datasets differ from scientific datasets? • Does this preclude them from utilising e-Science technologies in research? • Understand issues pertaining to the historical census • Quality of data • Importance of data to historians and researchers • What can be done to process the data to improve and facilitate research • How feasible, or useful, will that processing be • Understanding legal and managerial aspects of licensing pre-digitized datasets for analysis using grid technologies • Security • Who owns the research outcomes?
Methodology – ReACH Workshop Series • Series of 3 AHRC-funded Workshops • at UCL from June – August 2006 • All Hands Workshop – June 2006 • Featuring input from Historians, Archivists, Digital Librarians, Computing Scientists, Physicists, and Humanities Computing Experts • What is the research question? • It may be technically feasible – but will outcomes be useful? • Technical Workshop – June 2006 • Computing scientists, physicists, archivists • Determining input, output, processing techniques, workflow, and costings of potential project • Managerial Workshop – July 2006 • Legal, security, and managerial aspects to using pre-digitized commercially sensitive data for research purposes
Historical issues – will it be useful? • If data quality/ computational complexity is not an issue: • Longitudinal dataset • Dictionaries of variants • Probability modelling of variants • Log analysis of how people are using census material • Checking and cleansing of census data • Generation of simple statistics • Calculating and identifying individuals who have been missed out in various censuses. • Reconstitution of missing data in the records through contextual information • Develop OCR techniques which can be used on copperplate • Techniques for social computing and family histories • Geographically normalised dataset • Mapping of geography to names • Assign grid references to historical data • Adding current geographical data to the census • Visualisation techniques
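To make "dictionaries of variants" and "generation of simple statistics" concrete, here is a minimal sketch. The variant lists, the record layout, and the function names are all hypothetical illustrations, not part of the ReACH project or of any real census schema.

```python
from collections import defaultdict

# Hypothetical variant dictionary: canonical forename -> spellings seen in transcriptions.
VARIANTS = {
    "elizabeth": ["elizabeth", "elisabeth", "eliz", "elizth"],
    "william": ["william", "wm", "willm"],
}


def build_lookup(variants):
    """Invert the dictionary so that each spelling maps to its canonical form."""
    lookup = {}
    for canonical, spellings in variants.items():
        for spelling in spellings:
            lookup[spelling.lower()] = canonical
    return lookup


def normalise(name, lookup):
    """Return the canonical form of a name, or the cleaned name if it is unknown."""
    key = name.strip().lower().rstrip(".")
    return lookup.get(key, key)


def forename_counts(records, lookup):
    """A simple statistic: how often each normalised forename occurs."""
    counts = defaultdict(int)
    for record in records:
        counts[normalise(record["forename"], lookup)] += 1
    return dict(counts)


lookup = build_lookup(VARIANTS)
sample = [{"forename": "Wm."}, {"forename": "William"}, {"forename": "Elisabeth"}]
print(forename_counts(sample, lookup))
```

A real dictionary of variants would need to be compiled and checked by historians, which is part of why the workshops asked whether the outcomes would be useful as well as feasible.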
Is it technically possible? • Implementing a project would be relatively straightforward • Mount it on UCL Research Computing facilities • SGI Altix Facility: 135 GFlops • Access to data relatively straightforward • Outputted to XML database • 20 GB of data warrants use of grid computing for searching and analysis • Computational Grid techniques (and CS algorithms) • No real understanding of tools to benchmark cross-dataset record matching • Of great interest to physicists, astronomers, astrophysicists, computing scientists… • Further research could investigate how automated record linking could be initiated, using probability modelling of variants
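The idea of automated record linking between two census years can be sketched as follows. This is not the project's method: a plain string-similarity heuristic from Python's standard library stands in for the probability modelling of variants mentioned above. The record fields, the age tolerance, and the acceptance threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity between two names, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_score(earlier, later, years_apart=10, age_tolerance=2):
    """Score a candidate link between records from two censuses.

    Returns the mean forename/surname similarity, or 0.0 when the recorded
    ages are inconsistent with the gap between the two censuses.
    """
    expected_age = earlier["age"] + years_apart
    if abs(later["age"] - expected_age) > age_tolerance:
        return 0.0
    forename = name_similarity(earlier["forename"], later["forename"])
    surname = name_similarity(earlier["surname"], later["surname"])
    return (forename + surname) / 2


def best_match(record, candidates, threshold=0.8):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    scored = [(link_score(record, c), c) for c in candidates]
    score, match = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return match if score >= threshold else None


# Hypothetical records from two consecutive censuses.
person_1851 = {"forename": "Elizabeth", "surname": "Terry", "age": 20}
candidates_1861 = [
    {"forename": "Elisabeth", "surname": "Terry", "age": 31},
    {"forename": "Edward", "surname": "Terry", "age": 30},
]
print(best_match(person_1851, candidates_1861))
```

Comparing every record in one census against every record in the next scales quadratically with the number of records, which is one reason a dataset of this size was seen as a candidate for grid computing.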
Is it feasible? Managerial Issues • Send in the lawyers… • Major legal issues in gaining access to commercially sensitive digitized data sets • Need for consortium agreements • Need to safeguard intellectual property rights • Need to ascertain who owns research outcomes • Datasets created in the process of analysing other datasets • Arts and Humanities need institutional backing in this area • Access to small subset of data in first instance to establish proof of concept • Need to set up secure systems and data management to ensure limited access to commercial datasets • Following lead of medical sciences
But is this possible with the information available? • Historical census material • Complex and flawed dataset • For historical reasons • The very fact it is complex provides interesting opportunities to investigate record matching techniques • Also, access to other datasets needed • “triangulation” • Births, Marriages and Deaths • Burials • Parish registers • In England and Wales, this data is not in the public domain (yet), and not available in digital form • In order to undertake this project successfully, a massive digitisation project would have to be undertaken first • Or wait a few years until others undertake the digitisation project.
Findings: e-Science and the Census • There has been much financial, industrial and academic investment in the creation of digital records from the English and Welsh historical census data • BUT there is not the quantity nor quality of information currently available to allow useful and usable results to be generated, checked, and assessed • will change as more data is digitised and becomes public • The potential for high performance processing of large scale census data is large • may result in useful techniques and datasets (for historian, genealogist and beyond) • Only when adequate historical data becomes available. • This should be revisited in the future
Findings – e-Science and the A + H • High performance computing and e-Science community were very welcoming to researchers in the Arts and Humanities • Often the problems facing e-Science research in the arts and humanities are not technical • Nature of humanities data means that novel computational techniques need to be developed to analyse and process it • fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers • as opposed to scientific datasets • large scale, homogeneous, numeric, and generated (or collected/sampled) automatically • Arts and Humanities projects need to engage with the legal issues in using and creating commercially sensitive datasets • Sensitive data sets and security: Arts and Humanities researchers should look towards Medical Sciences for their methodologies in data security and management • in particular utilising ISO 17799 to maintain data integrity and security
Conclusion • Aimed to deliver a full project proposal for future funding rounds • Had to decide not to take this forward • Undertaking this pilot project prevented long term funding being wasted on a project which would have failed • Highlighted issues, problems, solutions, and barriers to any humanities project that may wish to use the computational grid to do complex record analysis • Report available from www.ucl.ac.uk/reach/