Join Dr. Jules J. Berman and Dr. Bill Moore in this workshop to explore the domain of the pathology data miner, including confidentiality and privacy issues, data sharing and standardization, and data analysis techniques.
Anatomic Pathology Data Mining • Jules J. Berman, Ph.D., M.D., Program Director, Pathology Informatics, Cancer Diagnosis Program, National Cancer Institute • Dr. Bill Moore, Workshop Director • Friday, October 27, 2000, 8:00 A.M. • *All opinions herein are Dr. Berman’s and do not represent those of any federal agency.
Expertise Domain of the Pathology Data Miner • Confidentiality/Privacy Issues • Data Sharing Issues, which include data standardization • Data Analysis
Data Domain of Pathology Data Miner • Pathology Data linked to tissue samples • Any medical record data that can be linked to pathology data (including cancer registry data) • Any other relevant data in existence that can be sensibly linked to pathology records (this usually means the internet)
Confidentiality/privacy • Anyone interested in using confidential information (essentially any data generated in a hospital that is attached to a patient) needs to understand confidentiality and privacy issues. • The fact that you might be using only your department’s data and that you treat the data confidentially will almost never exempt you from existing regulations. • The consequences to you and your institution of ignoring regulations can be profound.
Confidentiality/Privacy Lecture • I am giving a lecture today in the afternoon UAREP focus session on this subject. • The lecture is entitled Bioinformatics Data: Confidentiality Issues and is scheduled for 3:30.
Issues related to data sharing • Nomenclatures and free-text mapping • Common Data Elements • Standard Report Formats • Internet Protocols
CDE for Date of Birth • |birthdate| September 15, 1970 • |birthday| September 15, 1970 • |D.O.B.| September 15, 1970 • |d.o.b.| September 15, 1970 • |date of birth| September 15, 1970 • |date of birth| September 15, 1970 • |date-of-birth| September 15, 1970 • |date_of_birth| September 15, 1970 • |dob| September 15, 1970 • |DOB| September 15, 1970
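The tag variants on this slide can be collapsed to a single common data element name with a small lookup. This is a minimal sketch, not a method described in the talk: the synonym table `CANONICAL_TAGS` and the helper names are hypothetical, and a real registry would hold many more entries.

```python
import re

# Hypothetical synonym table: squashed tag spelling -> canonical CDE name.
CANONICAL_TAGS = {
    "birthdate": "date_of_birth",
    "birthday": "date_of_birth",
    "dob": "date_of_birth",
    "dateofbirth": "date_of_birth",
}

def squash(tag: str) -> str:
    """Lowercase a tag and drop every character that is not a letter."""
    return re.sub(r"[^a-z]", "", tag.lower())

def canonical_tag(tag: str):
    """Return the canonical CDE name, or None if the tag is unknown."""
    return CANONICAL_TAGS.get(squash(tag))

# The variants listed on the slide.
variants = ["birthdate", "birthday", "D.O.B.", "d.o.b.", "date of birth",
            "date-of-birth", "date_of_birth", "dob", "DOB"]
resolved = {canonical_tag(v) for v in variants}
print(resolved)
```

Squashing punctuation and case first keeps the synonym table short: ten spellings reduce to four keys.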
Representation of CDE • |date_of_birth| September 15, 1970 • |date_of_birth| 15, September, 1970 • |date_of_birth| 9/15/70 • |date_of_birth| 15/9/70 • |date_of_birth| 15/09/70 • |date_of_birth| 9/15/1970 • |date_of_birth| 9.15.70 • |date_of_birth| 9,15,70 • |date_of_birth| some delta time
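Even with one agreed tag, the value can be written many ways. The sketch below tries a fixed list of formats and emits an ISO 8601 date. It rests on assumptions the slide does not state: month-first is preferred over day-first when both could apply, month names are English, and two-digit years follow Python's default pivot.

```python
from datetime import datetime

# Formats tried in order. Month-first comes before day-first (an assumption):
# an ambiguous value such as 3/4/70 would be read as March 4.
# %B assumes English month names (default C/English locale).
# %y follows Python's pivot: 69-99 -> 1969-1999, 00-68 -> 2000-2068,
# which can misplace birth years before 1969.
DATE_FORMATS = [
    "%B %d, %Y",    # September 15, 1970
    "%d, %B, %Y",   # 15, September, 1970
    "%m/%d/%Y",     # 9/15/1970
    "%m/%d/%y",     # 9/15/70
    "%d/%m/%y",     # 15/9/70 and 15/09/70
    "%m.%d.%y",     # 9.15.70
    "%m,%d,%y",     # 9,15,70
]

def to_iso_date(value: str) -> str:
    """Return YYYY-MM-DD for the first format that parses the value."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date representation: {value!r}")

slide_values = ["September 15, 1970", "15, September, 1970", "9/15/70",
                "15/9/70", "15/09/70", "9/15/1970", "9.15.70", "9,15,70"]
normalized = {to_iso_date(v) for v in slide_values}
print(normalized)
```

A value such as "some delta time" raises `ValueError` here, which is the point of the slide: without an agreed representation, some values cannot be interpreted at all.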
Annotation/Curation of the CDE • Unique identifier • Creator name • Date of creation • Date of modifications • Exact definition • Hierarchy (if applicable) • List of users or CDE-specific browsers
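One way to picture the annotation fields listed above is as a small record type. This is only an illustrative sketch: the class name, field names, and the sample values are hypothetical and do not come from any existing CDE registry.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class CommonDataElement:
    identifier: str                    # unique identifier
    creator: str                       # creator name
    created: date                      # date of creation
    definition: str                    # exact definition
    modified: List[date] = field(default_factory=list)  # dates of modifications
    parent: Optional[str] = None       # hierarchy, if applicable
    users: List[str] = field(default_factory=list)      # users or CDE-specific browsers

    def record_modification(self, when: date) -> None:
        """Append a modification date so the curation history is kept."""
        self.modified.append(when)

# Hypothetical example values, for illustration only.
dob = CommonDataElement(
    identifier="example-0001",
    creator="example curator",
    created=date(2000, 10, 27),
    definition="Calendar date on which the patient was born",
)
dob.record_modification(date(2000, 11, 1))
```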
Shared Pathology Informatics Network • 5-year project beginning April 2001 • Will develop the tools that will allow about 6 large laboratories to share their data with researchers, using the internet • Basically, it will allow a researcher to interrogate the pathology records at multiple institutions simultaneously and receive a summary report almost instantaneously.
VIRTUAL MODEL (CBCTR) • [Diagram labels: Users, Request, Central Database, Resources 1 to 4, Data, Research Evaluation Panel, Specimens] • Required steps: • Extract patient data from clinical record • Evaluate specimens • Quality control of data and specimens • Audit data and specimen quality • Update central database regularly • Re-evaluate data quality and currency before specimens are shipped
What is so special about anatomic pathology data? • Every anatomic pathology record is linked to the patient identifier and to the tissue blocks for that record • One of the important rate-limiting factors in cancer research today is access to tissues • Access to even a small fraction of the tissues routinely collected by pathology departments (about 40 million each year) would be of enormous research benefit.
Example project: Virtual Precancer Archive • Johns Hopkins Surgical Pathology has cases accrued in electronic form since 1984 • 372,536 is the current (circa Sept. 2000) number of accrued cases • Wouldn’t it be nice to be able to survey the archived precancer cases in a large archive such as the Hopkins Archive?
Step 1. (Drs. Bill Moore and Robert Miller) Build a phrase list from all cases • The text of the reports can be represented as a collection of phrases that contain all of the concepts included in the reports. • The 372,536 records were parsed to find the diagnostic field free-text. • Diagnostic field free-text was parsed into sentences. • Diagnostic field sentences were parsed into phrases and words.
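The parsing chain in Step 1 (free text, then sentences, then phrases and words) can be sketched as follows. The talk does not say how phrases were delimited, so this sketch assumes a phrase is any run of up to `max_words` consecutive words within one sentence; the splitting rules and function names are likewise assumptions.

```python
import re

def split_sentences(text: str):
    """Split free text on periods, semicolons, and newlines (a simplification:
    decimals such as '0.5 cm' would be split too)."""
    return [s.strip() for s in re.split(r"[.;\n]+", text.lower()) if s.strip()]

def sentence_phrases(sentence: str, max_words: int = 4):
    """Return every run of 1..max_words consecutive words in the sentence."""
    words = re.findall(r"[a-z0-9']+", sentence)
    found = set()
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_words, len(words)) + 1):
            found.add(" ".join(words[start:end]))
    return found

def build_phrase_list(reports, max_words: int = 4):
    """Collect the distinct phrases across all reports, sorted alphabetically."""
    phrases = set()
    for text in reports:
        for sentence in split_sentences(text):
            phrases |= sentence_phrases(sentence, max_words)
    return sorted(phrases)

# Hypothetical diagnostic-field text, assembled from phrases shown in the talk.
sample_reports = ["Minimal mononuclear cell infiltrate. Early dysplasia."]
phrase_list = build_phrase_list(sample_reports)
```

Because phrases never cross a sentence boundary, "infiltrate early" is not produced from the sample text, while "minimal mononuclear cell infiltrate" and "early dysplasia" are.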
418,159 phrases represent all the textual concepts in the JHH surgical pathology records, and lie outside the realm of the Common Rule • minimal mononuclear cell infiltrate • minimal mononuclear cell infiltration • minimal mononuclear cell interstitial • minimal mononuclear infiltrate • minimal mononuclear inflammation • minimal mononuclear interstitial infitrates • minimal mononuclear meningeal • minimal morphologic abnormalities
Step 2. Create a precancer terminology • Started with the National Library of Medicine’s UMLS (Unified Medical Language System) • We use the concept list file, which is 113,699,627 bytes and contains 1,598,176 terms. • As an example, rcc (renal cell carcinoma) has about 80 synonymous terms in UMLS
UMLS CUI C0007134: Renal cell carcinoma • carcinoma, renal cell • carcinomas, renal cell • renal cell carcinoma • hypernephroid carcinoma • grawitz tumor • hypernephroma • renal cell adenocarcinoma • rcc
The UMLS precancer terms • 2,984 terms • Includes 221 terms that I added and assigned private J-codes
Step 3. Map the Hopkins phrases to the precancer terms • Start with 418,159 phrases • One by one, try to find a matching term for each phrase in the list of 2,984 precancer terms • Prepare a file of all the matching terms • This step takes 33 seconds to complete with a Perl script running on a 450 MHz desktop computer, i.e., it’s scalable
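The original mapping was done with a Perl script that is not shown. The Python sketch below illustrates one plausible matching rule, assumed here rather than taken from the talk: for each phrase, find the longest run of consecutive words that appears in the precancer term list. The small `PRECANCER_TERMS` table uses term and code pairs that appear in the results of this step.

```python
# Term -> code pairs taken from the match results shown in the talk.
PRECANCER_TERMS = {
    "actinic keratosis": "0022602",
    "adenomatous polyp": "0206677",
    "borderline": "0205189",
    "dysplasia": "0334044",
    "dysplastic": "0334045",
    "metaplasia": "0025568",
}

def best_match(phrase: str, terms: dict):
    """Return (term, code) for the longest term contained in the phrase,
    or None. Longest-first is an assumed tie-breaking rule."""
    words = phrase.split()
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in terms:
                return candidate, terms[candidate]
    return None

def map_phrases(phrases, terms: dict):
    """Emit one 'phrase|term|code' line per phrase that contains a term."""
    lines = []
    for phrase in phrases:
        match = best_match(phrase, terms)
        if match is not None:
            lines.append(f"{phrase}|{match[0]}|{match[1]}")
    return lines

matches = map_phrases(
    ["early actinic keratosis", "early dysplastic change",
     "minimal mononuclear infiltrate"],
    PRECANCER_TERMS,
)
print("\n".join(matches))
```

Each lookup is a dictionary hit, so the work grows with the number of phrases rather than with the product of phrases and terms, which is consistent with the short run time reported for this step.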
The result: 10,310 term matches from 418,159 phrases (a scalable work in progress) • early actinic keratosis|actinic keratosis|0022602 • early adenomatous polyp|adenomatous polyp|0206677 • early borderline rejection|borderline|0205189 • early dysplasia|dysplasia|0334044 • early dysplastic change|dysplastic|0334045 • early dysplastic process|dysplastic|0334045 • early gastric mucin cell metaplasia|metaplasia|0025568 • early gastric mucous cell metaplasia|metaplasia|0025568
Step 4. Give precancer match list to Drs. Bill Moore and Robert Miller to create a concordance • The 10,310 precancer terms occurred in 54,909 accessioned surgical pathology cases between 1984 and 2000. That is, each precancer term was found, on average, in a little more than 5 cases. • The 54,909 cases containing a precancer term represent 54,909 / 372,536 ≈ 15% of all accrued cases.
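The two ratios quoted in Step 4 can be checked directly from the counts given in the talk. The variable names are illustrative only.

```python
matched_terms = 10_310              # precancer term matches (Step 3)
cases_with_precancer_term = 54_909  # cases containing at least one match
all_accrued_cases = 372_536         # Hopkins cases accrued since 1984

cases_per_term = cases_with_precancer_term / matched_terms
fraction_of_archive = cases_with_precancer_term / all_accrued_cases

print(f"{cases_per_term:.2f} cases per matched term")      # 5.33
print(f"{fraction_of_archive:.1%} of all accrued cases")   # 14.7%
```

The first figure is a simple ratio of cases to terms, so it is only a rough average: a single case can contain several precancer terms, and a common term can occur in far more than five cases.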
The concordance looks like this: • C0001815^367220497667008419098^^ • C0002893^394120765570701149177^^ • C0002893^435120960421908784068^^ • C0002893^436410698795906686356^^ • C0002893^445510623875200588234^^
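A concordance in this form can be read back into a concept-to-cases index. The sketch assumes that the first caret-delimited field is the UMLS concept identifier and the second is the encrypted case identifier, with the trailing fields empty; the talk does not spell out the layout.

```python
from collections import defaultdict

def parse_concordance(lines):
    """Group encrypted case identifiers by concept identifier."""
    cases_by_concept = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("^")
        concept_id, case_id = fields[0], fields[1]
        cases_by_concept[concept_id].append(case_id)
    return dict(cases_by_concept)

# The five concordance lines shown on the slide.
concordance_lines = [
    "C0001815^367220497667008419098^^",
    "C0002893^394120765570701149177^^",
    "C0002893^435120960421908784068^^",
    "C0002893^436410698795906686356^^",
    "C0002893^445510623875200588234^^",
]
index = parse_concordance(concordance_lines)
counts = {concept: len(cases) for concept, cases in index.items()}
print(counts)
```

Counting cases per concept in this way is the kind of tally that underlies per-year, per-diagnosis tables.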
Precancer-related cases by year
Year   Cases   Percent
1984   1175    7%
1985   1573    8%
1986   2024    10%
1987   2195    11%
1988   2239    11%
1989   2328    11%
1990   2721    12%
1991   3077    14%
1992   3185    14%
1993   2878    13%
1994   3060    14%
1995   2968    13%
1996   3475    14%
1997   4726    17%
1998   4989    18%
1999   5996    20%
2000   6298    25%
Cases by year of intraductal carcinoma
CUI        Year   Cases
C0007124   1984   14
C0007124   1985   18
C0007124   1986   30
C0007124   1987   31
C0007124   1988   33
C0007124   1989   50
C0007124   1990   42
C0007124   1991   51
C0007124   1992   50
C0007124   1993   80
C0007124   1994   106
C0007124   1995   85
C0007124   1996   100
C0007124   1997   180
C0007124   1998   217
C0007124   1999   228
Cases per year of Barrett’s esophagus
CUI        Year   Cases
C0004763   1984   30
C0004763   1985   35
C0004763   1986   82
C0004763   1987   97
C0004763   1988   106
C0004763   1989   84
C0004763   1990   97
C0004763   1991   100
C0004763   1992   132
C0004763   1993   126
C0004763   1994   144
C0004763   1995   162
C0004763   1996   221
C0004763   1997   307
C0004763   1998   341
C0004763   1999   401
What does this really mean? • With this approach, we can identify all the cases of interest for any diagnosis or diagnoses, stratifying data by year of diagnosis, age, gender, or any other record element • We can determine the encrypted case identifiers for all those cases • We can give those encrypted case numbers back to the laboratory archivist, who can supply us with the (encrypted) tissue blocks belonging to those cases.
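The talk does not say how case identifiers were encrypted. As one illustration of the round trip described above, the sketch below uses a keyed one-way hash (HMAC-SHA256, a swapped-in technique) so that only the laboratory, which holds the key and the lookup table, can turn an encrypted identifier back into an accession number. The `Archivist` class, the key, and the accession numbers are all hypothetical.

```python
import hashlib
import hmac

def pseudonym(accession_number: str, secret_key: bytes) -> str:
    """Derive a stable, non-reversible identifier from an accession number."""
    return hmac.new(secret_key, accession_number.encode("utf-8"),
                    hashlib.sha256).hexdigest()

class Archivist:
    """Hypothetical laboratory-side holder of the key and the lookup table."""

    def __init__(self, secret_key: bytes, accession_numbers):
        self._lookup = {pseudonym(a, secret_key): a for a in accession_numbers}

    def pseudonyms(self):
        """Identifiers that may be released to the researcher."""
        return sorted(self._lookup)

    def resolve(self, requested):
        """Map requested pseudonyms back to accession numbers (lab use only)."""
        return [self._lookup[p] for p in requested if p in self._lookup]

# Hypothetical key and accession numbers, for illustration only.
archivist = Archivist(b"example-key-held-by-the-lab", ["S84-00001", "S84-00002"])
released = archivist.pseudonyms()       # what the researcher sees
pulled = archivist.resolve(released)    # what the archivist retrieves
```

The researcher works only with the released identifiers; the mapping back to real cases never leaves the laboratory.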
Conclusion: • With these techniques, laboratories with good informatics infrastructure can create a virtual omni-archive (at very low cost) that operates within current human subject protection guidelines for minimal-risk de-identified retrospective studies.