Data standards in pathology informatics and experimental pathology Experimental Biology 2004

Data standards in pathology informatics and experimental pathology
Experimental Biology 2004, April 17, 2004
Association for Pathology Informatics, Data mining session
Jules J. Berman, Ph.D., M.D.
Program Director, Pathology Informatics
Cancer Diagnosis Program, DCTD, NCI, NIH
bermanj@mail.nih.gov

Standards issues
• Standard ways of obtaining medical research data (confidentiality/security methods)
• Standard ways of organizing data (nomenclatures, data structures)
• Standards in professional behavior (submitting primary data with research)

UFO Abductees
• Lots of them
• They often say about the same thing (independent confirmations)
• All walks of life
• Mostly honest and rational people
• Minority are a little crazy
• One problem: no evidence

Researchers who don’t publish their primary data
• Lots of them
• They often say about the same thing (independent confirmations)
• All walks of life
• Mostly honest and rational people
• Minority are a little crazy
• One problem: no evidence
This is why we need to share data

After your research data reaches a certain size, the data becomes the publication, and the journal articles become tiny editorials that describe or interpret the data

Think of the relationship between the Earth and the Sun. Geocentrists did not want to accept that their planet was not the center of the universe. But the Earth is a tiny fraction of the size of the Sun, and people eventually switched to a heliocentric view of reality. In the same way, research papers are mere editorials that revolve around a large central BLOB of data. The database is the publication. Everything else is peripheral.

Examples where the data is the central research object:
• Human Genome Project
• Gene Expression Arrays
• Tissue Microarrays (a thousand cores of tissue)
• Proteomics

Data standards are all about data sharing:
• NIH Statement on Data Sharing
  http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
• National Research Council UPSIDE (Uniform Principle for Sharing Integral Data and Materials Expeditiously)
  http://books.nap.edu/books/0309088593/html/R1.html

NIH Funding for data sharing!!!
• TOOLS FOR COLLABORATIONS THAT INVOLVE DATA SHARING
  http://grants1.nih.gov/grants/guide/pa-files/PAR-03-134.html
• INFRASTRUCTURE FOR DATA SHARING AND ARCHIVING
  http://grants.nih.gov/grants/guide/rfa-files/RFA-HD-03-032.html

What is a standard?
Approval through a standards organization (ISO, IEC)
The standards organization route is less and less appealing for scientists:
• Takes too long (obsolete when complete)
• Enormous bureaucratic overhead
• Doesn’t guarantee acceptance (DICOM visible light)
• Doesn’t guarantee conformance (most ANSI computer languages)
• Doesn’t guarantee availability (MUMPS)
Don’t despair: alternatives are available

OR:
• Find a field where progress is impeded because of lack of standards
• Gather stakeholders
• Have a few open meetings
• Create a product using standard methods
• Open the drafts to public scrutiny
• Publish the product as an open specification (not a standard)
• Publish implementations of the specification

W3C (World Wide Web Consortium) is a good example of a group whose open specifications did not go through the approval process of any standards organization
• Consortium of interested companies
• They promote activities related to the WWW and the internet
• They publish reports
IT SEEMS TO WORK

Real world example: The Tissue Microarray Data Exchange Specification

The greatest value of TMAs is the ability to link TMA data with data from other TMAs and from other databases that inform on the data contained in the TMA database. That value is essentially untapped because there has been no way to publish, exchange, merge and link TMA datasets in a manner that everyone can use and understand.

The basic properties of the TMA specification:
• Self-describing
• Made from commonly understood data structures
• Extremely simple (most of our stakeholders are not sophisticated bioinformaticians, computer scientists, or metadata experts)
• Infinitely scalable (can be endlessly combined with other data sources)
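To make "self-describing" and "endlessly combined" concrete, here is a minimal Python sketch. It assumes an XML encoding, and every tag and attribute name in it (tma_dataset, core, organ, and so on) is hypothetical, chosen only for illustration and not taken from the TMA specification itself. Because each value carries its own tag, records from two sources can be read and pooled without a shared fixed column layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the tag names are illustrative only and are
# NOT the element names defined by the TMA Data Exchange Specification.
BLOCK_A = """
<tma_dataset source="lab_a">
  <core id="A1"><organ>prostate</organ><diagnosis>adenocarcinoma</diagnosis></core>
  <core id="A2"><organ>prostate</organ><diagnosis>benign</diagnosis></core>
</tma_dataset>
"""

BLOCK_B = """
<tma_dataset source="lab_b">
  <core id="B1"><organ>prostate</organ><stain>PSA</stain></core>
</tma_dataset>
"""


def read_cores(xml_text):
    """Return one dict per core, keyed by whatever tags the core carries."""
    root = ET.fromstring(xml_text)
    source = root.get("source")
    records = []
    for core in root.findall("core"):
        record = {"source": source, "core_id": core.get("id")}
        # Self-describing: the field name travels with the value as its tag.
        for field in core:
            record[field.tag] = (field.text or "").strip()
        records.append(record)
    return records


def merge_datasets(*xml_texts):
    """Pool cores from any number of documents into one list of records."""
    merged = []
    for text in xml_texts:
        merged.extend(read_cores(text))
    return merged


merged = merge_datasets(BLOCK_A, BLOCK_B)
for rec in merged:
    print(rec)
```

Note that the second source records a stain but no diagnosis, and the merge still works: the reader never needs to know in advance which fields a given laboratory chose to report.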

Four API (Association for Pathology Informatics) meetings to discuss the TMA specification
• May 30, 2001. Ann Arbor, Michigan. Chair of speaker session: Mark A. Rubin. Speakers: David Rimm, Steve Bova, Matt Van de Rijn, Jules Berman.
• October 6, 2001. Pittsburgh, PA; co-sponsored by the National Cancer Institute. Chair of speaker session: Mary Edgerton. Speakers: Olli Kallioniemi, Chris Chute, Richard Lieberman, Paul Spellman. Chair of Data Exchange Workshop: Mary Edgerton.
• May 22, 2002. Ann Arbor, Michigan; co-sponsored by the National Cancer Institute. Chair of speaker session: Mark A. Rubin. Speakers: James Bacus, Angelo de Marzo, Peggy Porter, David Rimm, Guido Sauter. Chair of Data Exchange Workshop: Mary Edgerton.
• October 4, 2002. Pittsburgh, PA; held in conjunction with Advancing Pathology Informatics, Imaging and the Internet. Chair of speaker session: Mary Edgerton. Speakers: Steve Hewitt, Ulysses Balis. Chair of Data Exchange Workshop: Mary Edgerton.

In brief:
• The TMA Specification is an open access document that can be used without any restriction.
• Its development was sponsored by the NCI and by the Association for Pathology Informatics.
• All the documents and software that you might need to obtain, understand and implement the specification are available in two recently published open access manuscripts.

Basics of the specification:
Jules J Berman, Mary Edgerton and Bruce Friedman. The tissue microarray data exchange specification: a community-based, open source tool for sharing tissue microarray data. BMC Med Inform Decis Mak. 2003 May 23; 3:5

Real-world implementation example:
Jules J Berman, Milton Datta, Andre Kajdacsy-Balla, Jonathan Melamed, Jan Orenstein, Kevin Dobbin, Ashok Patel, Rajiv Dhir, Michael J Becich. The tissue microarray data exchange specification: implementation by the Cooperative Prostate Cancer Tissue Resource. BMC Bioinformatics. 2004 Feb 27; 5:19

What’s next for the Association for Pathology Informatics?

Pathology image specification (part of API Laboratory Digital Imaging Project)
• DICOM provides a visible light image standard
• Nobody likes it (zero implementations of the visible light standard since its release, about 5 years ago)
• It isn’t designed to encapsulate multiplexed images or non-pictorial image data (spectral data, tiled images, histopathological and clinical annotation, data security protocols)
• Let’s make a specification in XML that pathologists can actually implement
• Let’s make tools that can port back and forth to DICOM
• Let’s get community participation along the way
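As a rough illustration of what an XML image record that pathologists could implement might look like, the sketch below builds one with Python's standard library. No such specification exists in this talk yet, so the element and attribute names (pathology_image, annotation, tiles, tile) and the helper build_image_record are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def build_image_record(image_id, tiles, annotation):
    """Build a hypothetical XML record for a tiled, annotated image.

    All element and attribute names here are invented for illustration;
    they do not come from DICOM or from any published API specification.
    """
    root = ET.Element("pathology_image", {"id": image_id})
    # Histopathological annotation travels with the image description.
    ET.SubElement(root, "annotation").text = annotation
    tiles_el = ET.SubElement(root, "tiles")
    for row, col, filename in tiles:
        ET.SubElement(
            tiles_el,
            "tile",
            {"row": str(row), "col": str(col), "file": filename},
        )
    return root


record = build_image_record(
    "slide-001",
    [(0, 0, "t_0_0.jpg"), (0, 1, "t_0_1.jpg")],
    "H&E, prostate core",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A record like this keeps the pixel data in ordinary image files and puts only the description in XML, which is one plausible way to keep the format simple enough for non-specialists to read and write.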

end
