Databrary David Millman, NYU • Rick Gilmore, PSU • Dylan Simon, NYU Coalition for Networked Information • CNI Fall 13 December 10, 2013 databrary.org
Key Aims of Databrary project • Build a repository for sharing video • Provide tools for scoring video • Provide data management tools • Create policies that enable sharing • Transform the culture of developmental science!
Current Funding • NIH • National Institute of Child Health and Human Development • NSF • Development & Learning Sciences Program • Research and Evaluation on Education in Science and Engineering (REESE)
Use cases: Education, teaching • I need video clips for teaching • I want to illustrate an idea • Show the range of behaviors and exceptions • Show an excerpt in a talk
Use cases: Pre-research • I want to browse the work in my field • I want to know whether a study is worth doing • I need preliminary data for grant proposal • I need ideas and inspiration • I want to replicate, expand on, or review previous work
Use cases: Research • I want to repurpose videos for new uses • Replicate existing work by recoding videos • I want to grow my sample size • I want to include participants from other contexts and populations • I want to conduct integrative analyses
Opportunities / Challenges: Raw data re-use • The data is video of people participating in experiments. • Can be immediately re-used in different domains without mapping or data dictionaries
Opportunities / Challenges: Video contains identifiable data • Faces, voices, possibly names & locations • De-identified data linked to video becomes identifiable • Enabling sharing while protecting privacy
Opportunities / Challenges: Structural consistency • No two labs organize material in the same way • What data structure works for both contributors and “consumers”?
Opportunities / Challenges: How “open” is it? • Identifiable data • Inter-institutional permission clearance • Permissions structure / delegation • New IRB, sponsored programs standards?
Opportunities / Challenges: Using significant university infrastructure • IT • Library • IRB • OSP • Counsel
Data-sharing model: How it works today
Data-sharing model: Enter Databrary
Data-sharing model: Sharing with Databrary
Data-sharing model: New Investigator wants access to Databrary
Data-sharing model: Browsing, non-research
Data-sharing model: Conduct Research
Innovations / Insights • Seek permission to share from people depicted in recordings • Extends informed consent • Restrict access to • Recordings “permissioned” for sharing • Authorized researchers with ethics training • Researchers who agree to maintain privacy
Databrary Release Template • Sharing ≠ research participation • Data privacy • Who has access? • How long? • No compensation • Minor assent • Levels of sharing
Levels of sharing • Private: No sharing • Shared: Sharing only with authorized researchers • Excerptable: Sharing + excerpts may be created and shown by authorized researchers to the public • Open: Sharing with the public
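The four levels above form an ordered scale, which can be sketched as an enumeration with simple access checks. This is a minimal illustration, not Databrary's implementation: the names `ReleaseLevel`, `can_view`, and `can_show_excerpt_publicly` are hypothetical, and the rule that anyone may show excerpts of Open recordings is an assumption.

```python
from enum import IntEnum


class ReleaseLevel(IntEnum):
    """Sharing levels from the slide, ordered from most to least restrictive."""
    PRIVATE = 0      # no sharing
    SHARED = 1       # authorized researchers only
    EXCERPTABLE = 2  # shared, plus excerpts may be shown to the public
    OPEN = 3         # shared with the public


def can_view(level: ReleaseLevel, is_authorized: bool) -> bool:
    """Return True if a viewer may access a full recording at this level."""
    if level == ReleaseLevel.OPEN:
        return True
    if level in (ReleaseLevel.SHARED, ReleaseLevel.EXCERPTABLE):
        return is_authorized
    return False  # PRIVATE: not shared through the repository


def can_show_excerpt_publicly(level: ReleaseLevel, is_authorized: bool) -> bool:
    """Return True if this viewer may show an excerpt to the public."""
    if level == ReleaseLevel.OPEN:
        return True  # assumption: open recordings are already public
    return is_authorized and level == ReleaseLevel.EXCERPTABLE
```

For example, `can_view(ReleaseLevel.SHARED, is_authorized=False)` is `False`, while an authorized researcher may show an excerpt of an Excerptable recording in a talk.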
Recording sharing permission • All depicted individuals • Explicit yes/no boxes • Adults and minors
Getting permissions right • Electronically recorded permissions • Linked to session- and participant-level metadata • Avoid data entry errors • Honor participants’ desired release level • Spreadsheet template • Web-based permission system
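One way to honor each participant's desired release level from a permissions spreadsheet is sketched below. The column names (`session_id`, `participant_id`, `release`) and the sample rows are hypothetical, and the rule that a session takes the most restrictive level among everyone depicted in it is an assumption, not something the slides state. Unknown level names are rejected so that data entry errors surface early.

```python
import csv
import io

# Release levels ordered from most to least restrictive (names from the slides).
LEVELS = ["private", "shared", "excerptable", "open"]


def effective_levels(csv_text: str) -> dict[str, str]:
    """Map each session ID to the most restrictive level among the people in it."""
    result: dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        level = row["release"].strip().lower()
        if level not in LEVELS:
            # Reject typos instead of silently guessing a level.
            raise ValueError(f"unknown release level: {row['release']!r}")
        session = row["session_id"]
        current = result.get(session)
        if current is None or LEVELS.index(level) < LEVELS.index(current):
            result[session] = level
    return result


# Hypothetical spreadsheet contents: one row per depicted individual.
SAMPLE = """session_id,participant_id,release
s01,p01,excerptable
s01,p02,shared
s02,p03,open
"""

print(effective_levels(SAMPLE))  # {'s01': 'shared', 's02': 'open'}
```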
A better way... • Why is the Databrary model better? • Clear and unambiguous • Consent to participate ≠ permission to share data • Easier for participants • More realistic conceptualization of risk • Standardization across contributors via templates
Building a user community • Users must become Authorized Investigators • Designing registration process • Investigator Agreement • Covers data contributions, non-research, research use/re-use • 1.0 will be a web form • Institutional sign-off by Authorizing Official
Policy documents • Databrary Release Template • Investigator Agreement • Definitions of terms • Data Sharing Manifesto • Bill of Rights • Best Practices in Data Security • http://github.com/databrary/policies/
A data model for Databrary • Started by organizing around study • Different meanings for study: paper, analysis, etc. • Tremendous range in size of studies • Meaning can change over time • Raw data themselves are fixed, constant • Begin by collecting raw, session data into datasets • Layer analyses, research products on datasets
Organizational unit: Session • Data collected at the same time, often single visit • Defined by: • Date of test • Participant release level • Contains raw data files (videos, etc) • Associated with participant(s), other metadata
What’s in a session? • Like a folder • A set of files • Collected at a specific time • Often a single visit or participant • Datafiles, coding spreadsheets layered on later
Each file within a session • Name/description • Home visit, interview, eye-tracking video, motion-tracking, EEG, ... • File format • .pdf, .doc, .csv, .mp4, .opf, .mat, ... • For video or other time series data • Start point in time and length • Identifiable (video) or de-identified?
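The session and file descriptions above map naturally onto two small record types. The sketch below is illustrative only: the class and field names (`Session`, `SessionFile`, `start_seconds`, and so on) are assumptions, and the example values are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class SessionFile:
    name: str                              # e.g. "home visit", "eye-tracking video"
    file_format: str                       # e.g. ".mp4", ".csv", ".opf"
    identifiable: bool                     # True for video, False if de-identified
    start_seconds: Optional[float] = None  # time-series data only
    length_seconds: Optional[float] = None


@dataclass
class Session:
    test_date: date                        # date of test
    release_level: str                     # "private", "shared", "excerptable", "open"
    participant_ids: list[str] = field(default_factory=list)
    files: list[SessionFile] = field(default_factory=list)

    def has_identifiable_data(self) -> bool:
        """Return True if any file in the session is identifiable."""
        return any(f.identifiable for f in self.files)


# Hypothetical session: one video plus a coding spreadsheet layered on later.
visit = Session(date(2013, 6, 1), "shared", ["p01"])
visit.files.append(SessionFile("free play", ".mp4", True, 0.0, 600.0))
visit.files.append(SessionFile("coding spreadsheet", ".csv", False))
```

Keeping the release level on the session rather than on each file mirrors the slide's statement that a session is defined by its test date and participant release level.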
What’s in a dataset? • Top-level, binding information (optional) • Title and short description • Data owners and other users with access • Excerpts • Procedures, stimuli, blank forms, IRB approvals, and other files • Funding information • Set of sessions and metadata
How is a dataset organized? • Many ways to organize a dataset • User-defined groups (labels, tags, annotations) • By participants, conditions, visits, tasks, etc. • Associated with metadata “measures” • Session assigned to arbitrarily many groups • Groups specific to a single dataset
Main grouping: Participants • Each group represents a participant • Includes any number of user-defined “measures” • Participant ID • Birthdate, gender, race/ethnicity • Geographic location, language, school grade, motor experience, disability, IQ, ... • Any other text, dates, numbers, ...
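The grouping scheme can be sketched as a small in-memory structure in which a session belongs to any number of groups and each group carries its own measures. Everything named here (`Dataset`, `add_group`, `assign`, the `kind` label, and the sample values) is hypothetical and only illustrates the idea.

```python
class Dataset:
    """Sessions organized by user-defined groups; a session may be in many groups."""

    def __init__(self):
        # group name -> {"kind": str, "measures": dict, "sessions": set of IDs}
        self.groups = {}

    def add_group(self, name, kind, **measures):
        """Create a group (participant, condition, ...) with optional measures."""
        self.groups[name] = {"kind": kind, "measures": measures, "sessions": set()}

    def assign(self, session_id, *group_names):
        """Place one session into any number of existing groups."""
        for name in group_names:
            self.groups[name]["sessions"].add(session_id)

    def groups_of(self, session_id):
        """Return the sorted names of all groups containing the session."""
        return sorted(n for n, g in self.groups.items() if session_id in g["sessions"])

    def sessions_where(self, kind, measure, value):
        """Return sorted session IDs in groups of `kind` whose measure equals value."""
        found = set()
        for group in self.groups.values():
            if group["kind"] == kind and group["measures"].get(measure) == value:
                found |= group["sessions"]
        return sorted(found)


ds = Dataset()
ds.add_group("p01", "participant", gender="female", language="English")
ds.add_group("walking", "condition")
ds.assign("s01", "p01", "walking")
ds.assign("s02", "p01")
print(ds.groups_of("s01"))  # ['p01', 'walking']
```

Because the groups live inside one `Dataset` object, they stay specific to that dataset, as the slides describe.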
Representing datasets as files • People organize their own datasets in different ways • By using groupings for this organization, can dynamically export/import in many forms
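A rough sketch of dynamic export: the same sessions are laid out as folders by whichever grouping the user picks. The function name `export_layout`, the dictionary keys, and the sample sessions are all hypothetical.

```python
from pathlib import PurePosixPath


def export_layout(sessions: list[dict], group_by: str) -> dict[str, list[str]]:
    """Return {folder: [file paths]} with one folder per value of `group_by`."""
    layout: dict[str, list[str]] = {}
    for s in sessions:
        folder = str(s[group_by])
        for filename in s["files"]:
            path = PurePosixPath(folder) / s["id"] / filename
            layout.setdefault(folder, []).append(str(path))
    return layout


# Hypothetical sessions, each tagged with a participant and a condition.
SESSIONS = [
    {"id": "s01", "participant": "p01", "condition": "crawling", "files": ["video.mp4"]},
    {"id": "s02", "participant": "p01", "condition": "walking", "files": ["video.mp4"]},
    {"id": "s03", "participant": "p02", "condition": "walking", "files": ["video.mp4"]},
]

print(export_layout(SESSIONS, "participant"))
print(export_layout(SESSIONS, "condition"))
```

The stored sessions never change; only the grouping key passed to the export does, which is what lets each lab see its own preferred layout.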
From datasets to studies • Datasets provide organization for labs • Session storage for researchers, labs, and collaborators • Like a lab server, only better • Studies present research data to others • Pull from datasets, organize sessions • Full control over how research is represented • Add additional analyses, coding manuals, spreadsheets, scripts, figures, research products, ...
Data ingest: contributor role • Identify data to contribute • Determine organizational structure • Verify participant sharing permissions • Provide additional top-level metadata and files • description/abstract • resulting publications, funding sources • images/figures, procedure documents, stimuli • Set and maintain access restrictions
Data ingest • Organization, upload, and import • Enumerate sessions, groupings (participants, etc.), files (in CSV) • Collect original videos, best quality available • Transcode to standard video formats • MPEG-4, H.264, AAC, ffmpeg • Gradual transition from hand-curation to self-curation
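The enumerate-then-transcode steps might look like the sketch below, which reads a CSV manifest and builds (but does not run) one ffmpeg command per video. The manifest columns, file paths, and function names are hypothetical. The encoder choices `libx264` and `aac` are assumptions about how H.264 and AAC output would be requested, and `libx264` requires an ffmpeg build that includes it.

```python
import csv
import io
from pathlib import PurePosixPath


def transcode_command(source: str, out_dir: str) -> list[str]:
    """Build an ffmpeg call producing H.264 video and AAC audio in an MP4 file."""
    target = PurePosixPath(out_dir) / (PurePosixPath(source).stem + ".mp4")
    return ["ffmpeg", "-i", source, "-c:v", "libx264", "-c:a", "aac", str(target)]


def plan_ingest(manifest_csv: str, out_dir: str) -> list[list[str]]:
    """Return one ffmpeg command per video row of a CSV manifest."""
    commands = []
    for row in csv.DictReader(io.StringIO(manifest_csv)):
        if row["kind"] == "video":
            commands.append(transcode_command(row["file"], out_dir))
    return commands


# Hypothetical manifest enumerating sessions, participants, and files.
MANIFEST = """session_id,participant_id,kind,file
s01,p01,video,raw/s01_cam1.avi
s01,p01,spreadsheet,raw/s01_codes.csv
"""

for command in plan_ingest(MANIFEST, "out"):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run` once the plan has been reviewed, which keeps a hand-curation step between enumeration and transcoding.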
Looking to Databrary 1.0 • Features • Study views and data re-use • Search • Policy-driven form for user registration • Self-curation features • Automatic upload and transcoding • Timeline • Private beta early 2014, public release mid 2014
Building a Community • Creating a community of researchers who share and self-curate • More interesting data • More contributors • More users