Sustaining Database Semantics. Keith W. Kintigh School of Human Evolution and Social Change Arizona State University kintigh@asu.edu In the Session Organized by Stuart Jeffrey Taking the Long View: Putting Sustainability at the Heart of Data Creation CAA Granada 7 April 2010.
Background • Today, digital databases (spreadsheets) are often the only loci of irreplaceable records of systematically collected archaeological observations • In the US, databases are often not curated at all and are rapidly being lost • Digital repositories, e.g., ADS & tDAR, can provide preservation and access
What Semantic Metadata are Necessary to Adequately Sustain/Document a Database? • Sufficient information for an archaeologist not familiar with the specifics of a project to make sensible analytical use of the data • Necessary for comparative and synthetic research • Necessary to reevaluate conclusions based on systematic evidence • Our ethical (legal) obligation is to preserve our data and make the data usable
Adequate Preservation is Rarely Achieved in Museum Contexts • Too frequently only the media are curated, not the data, so there is no long-term preservation of the data • Semantic metadata are often on paper • e.g., existing coding manual, coding keys • But adequate semantic documentation is more comprehensive than analysts would typically think to write down
Documenting Databases • Internally encoded: Structure, Table Names, Column Names & Data Types • Usually not internally encoded: • Each Column • Nature of the column values (not just string, etc.) • Arbitrary (lot number, provenience label) • Measurement (units of measure and methods) • Coded or abbreviated value (nominal variables) • Coded Nominal Values within Columns • Label & description of every value and how it is distinguished from others (101=rabbit)
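The column-level documentation listed above can be pictured as a small record per column. The following is a minimal sketch in Python, not tDAR's actual schema: the names `ColumnKind`, `ColumnDoc`, and `problems` are hypothetical, and the only value taken from the slide is the example code 101 = rabbit.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ColumnKind(Enum):
    """Nature of a column's values (hypothetical names for the slide's three cases)."""
    ARBITRARY = "arbitrary"      # e.g., lot number, provenience label
    MEASUREMENT = "measurement"  # needs units of measure and method
    CODED = "coded"              # nominal variable with coded or abbreviated values


@dataclass
class ColumnDoc:
    """Semantic documentation that the database itself does not encode."""
    name: str
    kind: ColumnKind
    description: str
    units: Optional[str] = None
    method: Optional[str] = None
    # code -> label and description, e.g. "101" -> "rabbit"
    coding_key: Dict[str, str] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """List the documentation gaps for this column."""
        issues = []
        if self.kind is ColumnKind.MEASUREMENT and not self.units:
            issues.append(f"{self.name}: measurement column lacks units")
        if self.kind is ColumnKind.CODED and not self.coding_key:
            issues.append(f"{self.name}: coded column lacks a coding key")
        return issues


# Illustrative columns only; they are not taken from a real dataset.
taxon = ColumnDoc("taxon", ColumnKind.CODED, "Faunal taxon",
                  coding_key={"101": "rabbit"})
length = ColumnDoc("length", ColumnKind.MEASUREMENT, "Maximum length")

for column in (taxon, length):
    for issue in column.problems():
        print(issue)
```

A check like `problems` only catches the gaps that can be detected mechanically; whether a description is adequate for an outside analyst remains a judgment call.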
More Subtle Points • Are all values in a coding key used? • Fish vs. species of fish; birds, reptiles, etc. • Can lead to the conclusion that a species (of bird, for example) is absent when in fact specimens were not recorded to this level (i.e., missing data) • Academic traditions influence what is needed in more subtle ways • What constitutes an adequate description varies • What works for an Americanist might not work for a European Medievalist • Probably no absolute adequacy • We can do better and we must move forward
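The first point can be checked mechanically. As a rough sketch in Python (the function names and every code except 101 = rabbit are invented for illustration), one can list which codes in a key never occur in the data and which observed values the key does not document:

```python
def unused_codes(coding_key, observed_values):
    """Codes defined in the key that never occur in the data."""
    observed = set(observed_values)
    return sorted(code for code in coding_key if code not in observed)


def undocumented_codes(coding_key, observed_values):
    """Values that occur in the data but are missing from the key."""
    return sorted(set(observed_values) - set(coding_key))


# Hypothetical coding key: one general "fish" code plus species-level codes.
coding_key = {101: "rabbit", 200: "fish (unspecified)", 201: "salmon", 202: "trout"}
observed = [101, 200, 200, 101, 999]

print(unused_codes(coding_key, observed))        # [201, 202]
print(undocumented_codes(coding_key, observed))  # [999]
```

Here the species-level fish codes are unused while the general fish code is used, which suggests that fish were not identified to species. The check cannot confirm that reading on its own; it only flags where a reader should ask whether an absence in the data is a real absence or missing data.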
Digital Antiquity Digital Antiquity is a newly established multi-institutional organization based in the US devoted to enhancing preservation and access to the digital records of archaeological investigations: • to permit scholars to more effectively create and communicate knowledge of the long-term human past; • to enhance the management of archaeological resources; and • to provide for the long-term preservation of irreplaceable records of archaeological investigations. Business model targets technical, financial and sociological sustainability in 4-5 years
Digital Antiquity’s Software • Aspiring to be an on-line, open source, trusted digital repository for archaeological data and documents • Provides preservation and free, on-line discovery and access for archaeological data and documents • Web-based ingest interface: the contributor uploads data and is prompted for detailed metadata • Advanced tools for data integration across inconsistently recorded databases
Database Ingest • Elicit Project & Information Resource metadata • Location, Time, Keywords, Credit, etc.
Database Documentation • For each column in the database • Indicate data type (measurement or coded integer) • Indicate the material class and nature of variable • For each measurement, elicit units (e.g., m, kg) • For each coded value (string or number) • Provide a digital “Coding Sheet” specific to that analyst and dataset that associates codes with labels and descriptions • Associate each coded value label with an ontology node with a standard definition • The original values do not change
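One way to picture a coding sheet is as a lookup table kept beside the data rather than applied to it. The sketch below is illustrative Python, not tDAR's implementation: `CodingEntry`, `annotate`, the descriptions, and the ontology node names are all assumptions; only 101 = rabbit comes from the slides.

```python
from typing import NamedTuple


class CodingEntry(NamedTuple):
    label: str          # analyst's own label for the code
    description: str    # how the value is distinguished from others
    ontology_node: str  # node in a community ontology (hypothetical names)


# Hypothetical coding sheet for one analyst's "taxon" column.
coding_sheet = {
    "101": CodingEntry("rabbit", "placeholder description", "Rabbit"),
    "200": CodingEntry("fish", "placeholder description", "Fish"),
}


def annotate(rows, column, sheet):
    """Return copies of rows with label and ontology node added.

    The original coded value is kept as-is; unknown codes map to None.
    """
    annotated_rows = []
    for row in rows:
        entry = sheet.get(str(row[column]))
        annotated = dict(row)  # copy, so the source rows are never modified
        annotated[column + "_label"] = entry.label if entry else None
        annotated[column + "_node"] = entry.ontology_node if entry else None
        annotated_rows.append(annotated)
    return annotated_rows


rows = [{"lot": 7, "taxon": 101}, {"lot": 8, "taxon": 555}]
annotated = annotate(rows, "taxon", coding_sheet)
print(annotated[0])
```

Keeping the mapping separate from the data means the link between a code and an ontology node can be revised later without touching the archived values.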
Ontologies • An ontology is a map of the semantic relationships among a set of concepts • In tDAR, ontologies are ordinarily hierarchical (tree-like) and represent an arbitrary number of levels of class-subclass relationships • For a given variable, a user community develops an ontology to enable integration; it is not centrally controlled
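A tree-like ontology of this kind can be represented very simply as a child-to-parent map. The sketch below assumes that representation and uses an invented toy taxonomy; it says nothing about how tDAR stores ontologies internally.

```python
# Hypothetical faunal ontology: each node points to its parent class.
PARENT = {
    "Animal": None,
    "Bird": "Animal",
    "Fish": "Animal",
    "Salmon": "Fish",
    "Trout": "Fish",
}


def ancestors(node, parent=PARENT):
    """Classes above `node`, nearest first, up to the root."""
    chain = []
    current = parent[node]
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain


def is_a(node, cls, parent=PARENT):
    """True if `node` is `cls` or one of its subclasses at any depth."""
    return node == cls or cls in ancestors(node, parent)


print(ancestors("Salmon"))     # ['Fish', 'Animal']
print(is_a("Trout", "Fish"))   # True
print(is_a("Bird", "Fish"))    # False
```

Because the map places no limit on depth, a community can add finer subclasses later without invalidating values already mapped to a coarser node.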
Integration: Standard Approach • Standardization at or before the time of data ingest (least common denominator) • This will fundamentally not work in archaeology • For legacy data sets, the least common denominator is very low • Different regional traditions in terminology, materials (lithics, ceramics), and their analyses • Enforced standardization is a non-starter for the profession in the US
tDAR Data Integration • Because the digital encoding of the semantics is known to the repository • We have the ability to combine datasets • Created by different investigators • Using incommensurate coding schemes • into a dataset in which the observations are analytically comparable
tDAR Process • Query to Identify Relevant Databases • User selects databases to move into the user workspace • Select Columns to Integrate • Specify Filtering & Aggregation of Ontology Values • Perform Aggregation • Obtain integrated database with commensurate observations • Download Result & Analyze It • In Place (beta, needs documentation) http://tdar.org
Initial Datasets • Durrington Walls • Knowth
Output • Output Database • 3 columns, area, FUSD FUSP • observations from both datasets (with any filtering eliminating cases) • provenience and stratum values are the same as in the original databases • Taxon values are values in the ontology with aggregation performed • Database is downloaded and analysed by user.
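The integration steps and output described above can be sketched end to end on toy data. This is a minimal illustration in Python under stated assumptions: the ontology, the two coding sheets, the site names, and the functions `roll_up` and `integrate` are all invented here, and the sketch does not use the Durrington Walls or Knowth data or tDAR's own code.

```python
# Hypothetical ontology as a child -> parent map.
PARENT = {
    "Animal": None, "Mammal": "Animal", "Fish": "Animal", "Bird": "Animal",
    "Cattle": "Mammal", "Pig": "Mammal", "Salmon": "Fish",
}


def roll_up(node, targets, parent):
    """Nearest node at or above `node` that is in `targets`, else None."""
    while node is not None:
        if node in targets:
            return node
        node = parent[node]
    return None


def integrate(datasets, targets, parent):
    """Combine datasets, aggregating each taxon to a target ontology node.

    Rows whose taxon falls under no target node are filtered out.
    Provenience values are copied unchanged from the source rows.
    """
    combined = []
    for name, rows, sheet in datasets:
        for row in rows:
            node = roll_up(sheet[row["taxon"]], targets, parent)
            if node is None:
                continue  # filtering eliminates this case
            combined.append({"dataset": name,
                             "provenience": row["provenience"],
                             "taxon": node})
    return combined


# Two datasets with incommensurate coding schemes (text codes vs. integers).
site_a_rows = [{"provenience": "A-1", "taxon": "BOS"},
               {"provenience": "A-2", "taxon": "SAL"}]
site_a_sheet = {"BOS": "Cattle", "SAL": "Salmon"}
site_b_rows = [{"provenience": "B-7", "taxon": 12},
               {"provenience": "B-8", "taxon": 30},
               {"provenience": "B-9", "taxon": 40}]
site_b_sheet = {12: "Pig", 30: "Fish", 40: "Bird"}

result = integrate(
    [("site_a", site_a_rows, site_a_sheet), ("site_b", site_b_rows, site_b_sheet)],
    targets={"Mammal", "Fish"},
    parent=PARENT,
)
for row in result:
    print(row)
```

Aggregating upward to a shared node is what makes the observations commensurate: the site_b analyst recorded only "Fish", so site_a's "Salmon" has to be rolled up to that level before the two can be compared.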
To Come in tDAR Integration • User-dictated integration is in place • Query-oriented, ad hoc data integration • Based on a query, tDAR identifies databases that satisfy the data requirements of the query: i.e., that are relevant and record the needed variables • Interact, as necessary, with the user • Perform integration on-the-fly, i.e., using ontologies, align key portions of the metadata for the selected columns • Output is an integrated dataset with maximum resolution and minimal changes
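The database-identification step could, in the simplest case, amount to matching a query's required variables against each database's documented columns. The Python sketch below is purely speculative, since the slide describes planned functionality: the catalog structure, the field names, and `relevant_databases` are assumptions made for illustration.

```python
def relevant_databases(catalog, required_variables, keyword=None):
    """Names of databases that record every required variable.

    If `keyword` is given, the database must also carry that keyword.
    """
    needed = set(required_variables)
    hits = []
    for db in catalog:
        if not needed <= set(db["variables"]):
            continue  # does not record all of the needed variables
        if keyword is not None and keyword not in db["keywords"]:
            continue  # not relevant to the query
        hits.append(db["name"])
    return hits


# Hypothetical catalog entries built from ingest metadata.
catalog = [
    {"name": "db_1", "variables": ["taxon", "element", "provenience"],
     "keywords": ["fauna", "Neolithic"]},
    {"name": "db_2", "variables": ["ware", "provenience"],
     "keywords": ["ceramics"]},
    {"name": "db_3", "variables": ["taxon", "provenience"],
     "keywords": ["fauna"]},
]

print(relevant_databases(catalog, ["taxon", "provenience"]))
print(relevant_databases(catalog, ["taxon", "provenience"], keyword="Neolithic"))
```

A real implementation would presumably match variables through their ontology mappings rather than by column name, which is the step where interaction with the user is likely to be needed.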
Acknowledgments • Andrew W. Mellon Foundation • National Science Foundation • Collaborators at ASU: K. Selcuk Candan, Tiffany Clark, Hasan Davulcu, John Howard, Shelby Manney, Ben Nelson, Margaret Nelson, Yan Qi, Katherine Spielmann • Digital Antiquity Board of Directors: Keith Kintigh, ASU; Tim Kohler, Washington State University; Fred Limp, University of Arkansas; Harry Papp, L. Roy Papp & Associates; Julian Richards, University of York; Dean Snow, The Pennsylvania State University; Sander van der Leeuw, Arizona State University (ASU) [chair]; Carol Ackerson, Girl Scouts Arizona Cactus-Pine Council; Jeffrey Altschul, SRI Foundation; Kim Bullerdick, Owner, BI, L.L.C.; Jaime Casap, Google, Inc.; John Howard, University College, Dublin