
Databrary

Databrary. David Millman, NYU • Rick Gilmore, PSU • Dylan Simon, NYU. Coalition for Networked Information • CNI Fall 13 • December 10, 2013 • databrary.org. Key aims of the Databrary project: build a repository for sharing video; provide tools for scoring video; provide data management tools.


Presentation Transcript


  1. Databrary David Millman, NYU • Rick Gilmore, PSU • Dylan Simon, NYU Coalition for Networked Information • CNI Fall 13 December 10, 2013 databrary.org

  2. Key Aims of Databrary project • Build a repository for sharing video • Provide tools for scoring video • Provide data management tools • Create policies that enable sharing • Transform the culture of developmental science!

  3. Key Aims of Databrary project • Build a repository for sharing video • Provide tools for scoring video • Provide data management tools • Create policies that enable sharing • Transform the culture of developmental science!

  4. Current Funding • NIH • National Institute of Child Health and Human Development • NSF • Development & Learning Sciences Program • Research and Evaluation on Education in Science and Engineering (REESE)

  5. What Users Can Do with Databrary

  6. Use cases: Education, teaching • I need video clips for teaching • I want to illustrate an idea • Show the range of behaviors and exceptions • Show an excerpt in a talk

  7. Use cases: Pre-research • I want to browse the work in my field • I want to know whether a study is worth doing • I need preliminary data for grant proposal • I need ideas and inspiration • I want to replicate, expand on, or review previous work

  8. Use cases: Research • I want to repurpose videos for new uses • Replicate existing work by recoding videos • I want to grow my sample size • I want to include participants from other contexts and populations • I want to conduct integrative analyses

  9. Opportunities / Challenges: Raw data re-use • The data is video of people participating in experiments • Can be immediately re-used in different domains without mapping or data dictionaries

  10. Opportunities / Challenges: Video contains identifiable data • Faces, voices, possibly names & locations • De-identified data linked to video becomes identifiable • Enabling sharing while protecting privacy

  11. Opportunities / Challenges: Structural consistency • No two labs organize material in the same way • What data structure works for both contributors and “consumers”?

  12. Opportunities / Challenges: How “open” is it? • Identifiable data • Inter-institutional permission clearance • Permissions structure / delegation • New IRB, sponsored programs standards?

  13. Opportunities / Challenges: Using significant university infrastructure • IT • Library • IRB • OSP • Counsel

  14. Enabling sharing of identifiable data

  15. Data-sharing model: How it works today

  16. Data-sharing model: Enter Databrary

  17. Data-sharing model: Sharing with Databrary

  18. Data-sharing model: New investigator wants access to Databrary

  19. Data-sharing model: Browsing, non-research

  20. Data-sharing model: Conduct research

  21. Innovations / Insights • Seek permission to share from people depicted in recordings • Extends informed consent • Restrict access to • Recordings “permissioned” for sharing • Authorized researchers with ethics training • Researchers who agree to maintain privacy

  22. Databrary Release Template • Sharing ≠ research participation • Data privacy • Who has access? • How long? • No compensation • Minor assent • Levels of sharing

  23. Levels of sharing • Private: No sharing • Shared: Sharing only with authorized researchers • Excerptable: Sharing + excerpts may be created and shown by authorized researchers to the public • Open: Sharing with the public
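
The four levels above could be modeled as an ordered enumeration. The following is a minimal sketch, not Databrary's actual implementation: the names `Release`, `can_view`, and `can_show_excerpt_publicly` are hypothetical, and the access rule is an assumption read off the slide wording. Access by the contributing lab to its own private data is deliberately left out.

```python
from enum import IntEnum


class Release(IntEnum):
    """Sharing levels from the slide, ordered from most to least restrictive."""
    PRIVATE = 0      # no sharing
    SHARED = 1       # authorized researchers only
    EXCERPTABLE = 2  # shared, and excerpts may be shown to the public
    OPEN = 3         # shared with the public


def can_view(release: Release, authorized: bool) -> bool:
    """Whether a viewer may see the full recording (assumed rule)."""
    if release is Release.OPEN:
        return True
    if release is Release.PRIVATE:
        return False
    # SHARED and EXCERPTABLE both require an authorized researcher.
    return authorized


def can_show_excerpt_publicly(release: Release) -> bool:
    """Whether an authorized researcher may show an excerpt to the public."""
    return release >= Release.EXCERPTABLE
```

Ordering the levels lets a check such as "at least excerptable" be written as a single comparison.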

  24. Recording sharing permission • All depicted individuals • Explicit yes/no boxes • Adults and minors

  25. Getting permissions right • Electronically recorded permissions • Linked to session- and participant-level metadata • Avoid data entry errors • Honor participants’ desired release level • Spreadsheet template • Web-based permission system

  26. A better way... • Why is the Databrary model better? • Clear and unambiguous • Consent to participate ≠ permission to share data • Easier for participants • More realistic conceptualization of risk • Standardization across contributors via templates

  27. Building a user community • Users must become Authorized Investigators • Designing registration process • Investigator Agreement • Covers data contributions, non-research, research use/re-use • 1.0 will be a web form • Institutional sign-off by Authorizing Official

  28. Data-sharing model: Conduct research

  29. Who promises what

  30. Who promises what

  31. Policy documents • Databrary Release Template • Investigator Agreement • Definitions of terms • Data Sharing Manifesto • Bill of Rights • Best Practices in Data Security • http://github.com/databrary/policies/

  32. A data model for diverse data sets

  33. A data model for Databrary • Started by organizing around study • Different meanings for study: paper, analysis, etc. • Tremendous range in size of studies • Meaning can change over time • Raw data themselves are fixed, constant • Begin by collecting raw, session data into datasets • Layer analyses, research products on datasets

  34. Organizational unit: Session • Data collected at the same time, often single visit • Defined by: • Date of test • Participant release level • Contains raw data files (videos, etc) • Associated with participant(s), other metadata

  35. What’s in a session? • Like a folder • A set of files • Collected at a specific time • Often a single visit or participant • Datafiles, coding spreadsheets layered on later

  36. Each file within a session • Name/description • Home visit, interview, eye-tracking video, motion-tracking, EEG, ... • File format • .pdf, .doc, .csv, .mp4, .opf, .mat, ... • For video or other time series data • Start point in time and length • Identifiable (video) or de-identified?
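
As an illustration of slides 34 to 36, a session and its files might be represented with two small record types. This is a sketch under stated assumptions: the class and field names (`Session`, `SessionFile`, `start_seconds`, and so on) are hypothetical and are not taken from the Databrary schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SessionFile:
    name: str                               # e.g. "Home visit"
    file_format: str                        # e.g. ".mp4", ".csv"
    identifiable: bool                      # video is identifiable; coded data may not be
    start_seconds: Optional[float] = None   # only for time series data
    length_seconds: Optional[float] = None  # only for time series data

    def end_seconds(self) -> Optional[float]:
        """End point on the session timeline, if this file is a time series."""
        if self.start_seconds is None or self.length_seconds is None:
            return None
        return self.start_seconds + self.length_seconds


@dataclass
class Session:
    test_date: date      # date of test
    release_level: str   # participant release level, e.g. "shared"
    files: List[SessionFile] = field(default_factory=list)

    def identifiable_files(self) -> List[SessionFile]:
        """Files that need the stricter handling reserved for identifiable data."""
        return [f for f in self.files if f.identifiable]
```

For example, a session holding one video and one coding spreadsheet would report only the video from `identifiable_files()`.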

  37. What’s in a dataset?

  38. What’s in a dataset? • Top-level, binding information (optional) • Title and short description • Data owners and other users with access • Excerpts • Procedures, stimuli, blank forms, IRB approvals, and other files • Funding information • Set of sessions and metadata

  39. How is a dataset organized? • Many ways to organize a dataset • User-defined groups (labels, tags, annotations) • By participants, conditions, visits, tasks, etc. • Associated with metadata “measures” • Session assigned to arbitrarily many groups • Groups specific to a single dataset

  40. Main grouping: Participants • Each group represents a participant • Includes any number of user-defined “measures” • Participant ID • Birthdate, gender, race/ethnicity • Geographic location, language, school grade, motor experience, disability, IQ, ... • Any other text, dates, numbers, ...
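
The grouping idea in slides 39 and 40 (a session can belong to arbitrarily many groups, and each group carries user-defined measures) can be sketched as follows. The `Dataset` class and its method names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

GroupKey = Tuple[str, str]  # (kind, label), e.g. ("participant", "P01")


class Dataset:
    """Groups are local to one dataset; a session may join any number of them."""

    def __init__(self) -> None:
        self.measures: Dict[GroupKey, Dict[str, object]] = {}
        self.members: Dict[GroupKey, Set[str]] = defaultdict(set)

    def define_group(self, kind: str, label: str, **measures: object) -> None:
        """Create a group with user-defined measures (birthdate, gender, ...)."""
        self.measures[(kind, label)] = dict(measures)

    def assign(self, session_id: str, kind: str, label: str) -> None:
        """Add a session to an existing group."""
        if (kind, label) not in self.measures:
            raise KeyError(f"unknown group: {kind}/{label}")
        self.members[(kind, label)].add(session_id)

    def sessions_in(self, kind: str, label: str) -> List[str]:
        return sorted(self.members.get((kind, label), set()))

    def groups_of(self, session_id: str) -> List[GroupKey]:
        return sorted(k for k, ids in self.members.items() if session_id in ids)
```

Keying groups by kind and label means participants, conditions, visits, and tasks all use the same mechanism, with participants simply being the most common kind.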

  41. Grouping sessions

  42. Grouping sessions

  43. Representing datasets as files • People organize their own datasets in different ways • By using groupings for this organization, can dynamically export/import in many forms

  44. From datasets to studies • Datasets provide organization for labs • Session storage for researchers, labs, and collaborators • Like a lab server, only better • Studies present research data to others • Pull from datasets, organize sessions • Full control over how research is represented • Add additional analyses, coding manuals, spreadsheets, scripts, figures, research products, ...

  45. From datasets to studies

  46. Data ingest: contributor role • Identify data to contribute • Determine organizational structure • Verify participant sharing permissions • Provide additional top-level metadata and files • description/abstract • resulting publications, funding sources • images/figures, procedure documents, stimuli • Set and maintain access restrictions

  47. Data ingest • Organization, upload, and import • Enumerate sessions, groupings (participants, etc.), files (in CSV) • Collect original videos, best quality available • Transcode to standard video formats • MPEG-4, H.264, AAC, ffmpeg • Gradual transition from hand-curation to self-curation
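
The ingest steps above (a CSV listing of sessions and files, then transcoding to H.264 video with AAC audio in an MPEG-4 container) could look roughly like this. It is a sketch, not the project's ingest pipeline: the CSV column names and both function names are hypothetical, and `libx264` is assumed as the H.264 encoder. The code only builds the ffmpeg command line and does not run it.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Dict, List


def read_manifest(text: str) -> List[Dict[str, str]]:
    """Parse a CSV manifest with (assumed) columns session, participant, file."""
    return list(csv.DictReader(io.StringIO(text)))


def transcode_command(source: str, out_dir: str) -> List[str]:
    """Build an ffmpeg command producing H.264 video and AAC audio in .mp4."""
    target = PurePosixPath(out_dir) / (PurePosixPath(source).stem + ".mp4")
    return [
        "ffmpeg",
        "-i", source,       # original video, best quality available
        "-c:v", "libx264",  # H.264 video (assumed encoder choice)
        "-c:a", "aac",      # AAC audio
        str(target),
    ]
```

Each manifest row's `file` value can be passed to `transcode_command`, and the resulting list handed to `subprocess.run` on a machine with ffmpeg installed.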

  48. System Architecture

  49. Looking to Databrary 1.0 • Features • Study views and data re-use • Search • Policy-driven form for user registration • Self curation features • Automatic upload and transcoding • Timeline • Private beta early 2014, public release mid 2014

  50. Building a Community • Creating a community of researchers who share and self-curate • More interesting data • More contributors • More users
