http://www.dans.knaw.nl Dirk Roorda, coordinator infrastructure
Overview
• Part 1: The rising role of data
• Part 2: The free use of data
• Part 3: The care for data
• Part 4: The re-use of data
Part 1: The rising role of data
http://en.wikipedia.org/wiki/Exabyte
Internet size (May 2009): 500 EB, which equals
• 500,000 PB
• 500 million TB
• 500 million fat USB disks (1 TB each)
• 500 billion memory cards of 1 GB
• roughly 70 memory cards per person
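The conversions on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and a world population of roughly 6.8 billion in 2009; the population figure is an assumption and is not stated on the slide.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

# Figure quoted on the slide for May 2009.
internet_size = 500 * EB

# Assumption (not on the slide): roughly 6.8 billion people in 2009.
WORLD_POPULATION_2009 = 6_800_000_000

petabytes = internet_size // PB
terabytes = internet_size // TB   # also the number of 1 TB USB disks
cards_1gb = internet_size // GB   # number of 1 GB memory cards
cards_per_person = cards_1gb / WORLD_POPULATION_2009

print(f"{petabytes:,} PB")
print(f"{terabytes:,} TB")
print(f"{cards_1gb:,} memory cards of 1 GB")
print(f"about {cards_per_person:.0f} memory cards per person")
```

Under that population assumption the result comes out slightly above 70 cards per person, which matches the rounded figure on the slide.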
Data deluge
http://www.datadeluge.com/
http://en.wikipedia.org/wiki/File:Tree_of_life_SVG.svg
http://tolweb.org/tree/
Where does it come from?
• Instruments: satellites, sensors, DNA sequencing
• Records: administrations, censuses, surveys
• Digitisation: the analog legacy
• Hobby: pictures, movies, genealogy
• Integration: better interoperability of existing data
The driving force Information and Communication Technology Babbage Analytical Engine 1870
A datacenter: Genealogy
• 2.5 PB
• 5328 servers
• 1.12 MW
http://www.ancestry.com/
http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx
A closer look
• Linguistics: text corpora, automatic translation
• Philology: how to read a million books?
• History: historical census data
• Archaeology: archive law, commercial research
Linguistics and Philology
• A chronometric approach to Indian alchemical literature
• Assessing frequency changes in multistage diachronic corpora
• Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets
• A Corpus Study of the Rigveda
• Dictionary generation for less-frequent language pairs using WordNet
• An exercise in non-ideal authorship attribution: the mysterious Maria Ward
http://llc.oxfordjournals.org/
History http://www.volkstellingen.nl/nl/
Archaeology http://edna.itor.org/nl/intern/upload_directory/a00002/downloads/IMG0013.tif
Archaeology (2) http://edna.itor.org/nl/oai/oai_addi/oai_addi/OAI:EVALMA:a00002.xml/
Open Access
• Data is information
• Information is knowledge
• Knowledge is power
• Why share it?
Open Access
• Shared knowledge is double knowledge
• Without free sharing of knowledge, scientific progress will halt
• Tensions between sharing and not sharing remain, though
A good example http://www.ploscompbiol.org/home.action
Work to do
• organise your data
• let your data work together with those of others (colleagues, future scientists, the public)
• ask new questions of the data, because there is so much of it
• create new (virtual) data collections
Research Data Recycling
• existing data
• collecting by experiments, surveys
• primary research data
• verifying results by others
• preserving unique data from experiments
• compilation, aggregation, annotation
• databanks
• data mining, analysis, visualisation
• new data as research input
Challenge: Software
• Operating systems (DOS, Windows 95, ...)
• Programming languages (BASIC, Pascal)
• File formats (WordPerfect, dBase)
• Applications (address books, websites)
Old data may be locked up in old software.
Meeting the challenge
• To prevent the problem in the future: backward compatibility, open standards, open source applications, modular software engineering (keep data separated from interface and business logic)
• To remedy the problems of the past: emulation, migration
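As an illustration of migration, the sketch below converts records from a fixed-width text layout into CSV, an open and documented format. The column layout, the field names and the sample values are all invented for this example; a real migration would start from the documentation of the actual legacy format. Parsing and writing are kept in separate functions, in the spirit of keeping data apart from the logic that handles it.

```python
import csv
import io

# Hypothetical legacy layout: (field name, start column, end column).
LEGACY_LAYOUT = [("name", 0, 10), ("year", 10, 14), ("count", 14, 20)]


def parse_legacy(text):
    """Turn fixed-width lines into a list of plain dictionaries."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        records.append(
            {field: line[start:end].strip() for field, start, end in LEGACY_LAYOUT}
        )
    return records


def to_csv(records):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    fieldnames = [field for field, _, _ in LEGACY_LAYOUT]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample values, only to exercise the code.
legacy_text = "Amsterdam 1899   510\nUtrecht   1899   102\n"
migrated = to_csv(parse_legacy(legacy_text))
print(migrated)
```

Once the data is in an open format, any later tool can read it without the original application.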
Challenge: Human organisation
• Forgotten jargon
• Forgotten knowledge
• No metadata
• Websites with broken links
Jargon
• II.17. Posterior berry aneurysm with subarachnoid bleed.
• II.18. Subarachnoid bleed with extension into the ventricles.
• II.19. Ruptured berry aneurysm at the end of the internal carotid artery, with obstructive hydrocephalus. Morgagni found the rupture.
• II.22. Subarachnoid hemorrhage.
http://www.pathguy.com/morgagni.htm
Meeting the challenge
• Persistent identifiers
• Enough metadata
• Codification of knowledge and practices (e.g. Wikipedia)
• Data management early on
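What "enough metadata" with a persistent identifier could look like is sketched below. The field names are loosely modelled on Dublin Core element names, the list of required fields is an assumption made for this example, and the identifier and all record values are placeholders rather than real registered data.

```python
import json
from dataclasses import dataclass, asdict

# Assumption: the fields treated as the minimum for a usable record.
REQUIRED_FIELDS = ("identifier", "title", "creator", "date", "format")


@dataclass
class DatasetRecord:
    identifier: str  # persistent identifier (placeholder value below)
    title: str
    creator: str
    date: str
    format: str
    description: str = ""


def missing_fields(record):
    """Return the names of required fields that are empty."""
    values = asdict(record)
    return [name for name in REQUIRED_FIELDS if not values[name].strip()]


# Placeholder values, invented for illustration only.
record = DatasetRecord(
    identifier="example-pid:0001",
    title="Sample survey data",
    creator="A. Researcher",
    date="2009-05-01",
    format="text/csv",
)

print(json.dumps(asdict(record), indent=2))
print("missing:", missing_fields(record))
```

Running such a check when data is deposited, rather than years later, is one way to do data management early on.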
Data management
• Use common infrastructure rather than private means
• Use open formats rather than proprietary formats
• Use open source software rather than closed software
• Use standard ways of documenting data: taxonomies, ontologies, metadata schemes
Common Infrastructure
• Local file shares
• University repository
• DANS
• European infrastructures
DANS http://easy.dans.knaw.nl/dms
• Linguists make their technology accessible: resources, algorithms, techniques
• Humanities and social sciences: they are the target users
• Geleerdenbrieven ("scholars' letters") = Circulation of Knowledge
• Archiving = circulation of information