Anatomic Pathology Data Mining

Join Dr. Jules J. Berman and Dr. Bill Moore in this workshop to explore the domain of the pathology data miner, including confidentiality and privacy issues, data sharing and standardization, and data analysis techniques.


Presentation Transcript


  1. Anatomic Pathology Data Mining • Jules J. Berman, Ph.D., M.D. • Program Director, Pathology Informatics, Cancer Diagnosis Program, National Cancer Institute • Dr. Bill Moore, Workshop Director • Friday, October 27, 2000, 8:00 A.M. • *All opinions herein are Dr. Berman’s and do not represent those of any federal agency.

  2. Expertise Domain of the Pathology Data Miner • Confidentiality/Privacy Issues • Data Sharing Issues, which include data standardization • Data Analysis

  3. Data Domain of Pathology Data Miner • Pathology Data linked to tissue samples • Any medical record data that can be linked to pathology data (including cancer registry data) • Any other relevant data in existence that can be sensibly linked to pathology records (this usually means the internet)

  4. Confidentiality/privacy • Anyone interested in using confidential information (essentially any data generated in a hospital that is attached to a patient) needs to understand confidentiality and privacy issues. • The fact that you might be using only your department’s data and that you treat the data confidentially will almost never exempt you from existing regulations. • The consequences to you and your institution of ignoring regulations can be profound.

  5. Confidentiality/Privacy Lecture • I am giving a lecture today in the afternoon UAREP focus session on this subject. • The lecture is entitled Bioinformatics Data: Confidentiality Issues and is scheduled for 3:30.

  6. Issues related to data sharing • Nomenclatures and free-text mapping • Common Data Elements • Standard Report Formats • Internet Protocols

  7. CDE for Date of Birth • |birthdate| September 15, 1970 • |birthday| September 15, 1970 • |D.O.B.| September 15, 1970 • |d.o.b.| September 15, 1970 • |date of birth| September 15, 1970 • |date of birth| September 15, 1970 • |date-of-birth| September 15, 1970 • |date_of_birth| September 15, 1970 • |dob| September 15, 1970 • |DOB| September 15, 1970

  8. Representation of CDE • |date_of_birth| September 15, 1970 • |date_of_birth| 15, September, 1970 • |date_of_birth| 9/15/70 • |date_of_birth| 15/9/70 • |date_of_birth| 15/09/70 • |date_of_birth| 9/15/1970 • |date_of_birth| 9.15.70 • |date_of_birth| 9,15,70 • |date_of_birth| some delta time
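The two slides above show the same data element under many tag spellings and value formats. A minimal sketch of how both could be normalized follows. The canonical tag `date_of_birth`, the function names, and the list of accepted formats are assumptions made for illustration, not part of the presentation. Month names are parsed with `%B`, which assumes an English locale.

```python
import re
from datetime import date, datetime
from typing import Optional

CANONICAL_TAG = "date_of_birth"  # assumed canonical name for the CDE

# Tag spellings from the slide, reduced to lowercase letters only.
DOB_KEYS = {"birthdate", "birthday", "dob", "dateofbirth"}

# Formats are tried in order. Month-first is tried before day-first, so an
# ambiguous value such as 5/9/70 is read as May 9; that is a policy choice.
DATE_FORMATS = [
    "%B %d, %Y",   # September 15, 1970
    "%d, %B, %Y",  # 15, September, 1970
    "%m/%d/%Y",    # 9/15/1970
    "%m/%d/%y",    # 9/15/70
    "%d/%m/%y",    # 15/9/70 and 15/09/70
    "%m.%d.%y",    # 9.15.70
    "%m,%d,%y",    # 9,15,70
]


def canonical_tag(tag: str) -> Optional[str]:
    """Map a tag spelling to the canonical CDE name, or None if unknown."""
    key = re.sub(r"[^a-z]", "", tag.lower())
    return CANONICAL_TAG if key in DOB_KEYS else None


def parse_dob(value: str) -> Optional[date]:
    """Return the date for a recognized representation, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None  # e.g. "some delta time" needs a reference date to resolve
```

Values that cannot be parsed, such as a delta time, are returned as `None` rather than guessed, which leaves them for manual curation.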

  9. Annotation/Curation of the CDE • Unique identifier • Creator name • Date of creation • Date of modifications • Exact definition • Hierarchy (if applicable) • List of users or CDE-specific browsers

  10. Shared Pathology Informatics Network • 5-year project beginning April 2001 • Will develop the tools that will allow about 6 large laboratories to share their data with researchers, using the internet • Basically, it will allow a researcher to interrogate the pathology records at multiple institutions simultaneously and receive a summary report almost instantaneously.

  11. VIRTUAL MODEL (CBCTR) • Required steps: • Extract patient data from clinical record • Evaluate specimens • Quality control of data and specimens • Audit data and specimen quality • Update central database regularly • Re-evaluate data quality and currency before specimens are shipped • [Diagram labels: Users, Request, Central Database, Data, Research Evaluation Panel, Specimens, Resources 1 to 4]

  12. What is so special about anatomic pathology data? • Every anatomic pathology record is linked to the patient identifier and to the tissue blocks for that record • One of the important rate-limiting factors in cancer research today is access to tissues • Access to even a small fraction of the tissues routinely collected by pathology departments (about 40 million each year) would be of enormous research benefit.

  13. Example project: Virtual Precancer Archive • Johns Hopkins Surgical Pathology has cases accrued in electronic form since 1984 • 372,536 is the current (circa September 2000) number of accrued cases • Wouldn’t it be nice to be able to survey the archived precancer cases in a large archive such as the Hopkins Archive?

  14. Step 1. (Drs. Bill Moore and Robert Miller) Build a phrase list from all cases • The text of the reports can be represented as a collection of phrases that contain all of the concepts included in the reports. • The 372,536 records were parsed to find the diagnostic field free-text. • Diagnostic field free-text was parsed into sentences. • Diagnostic field sentences were parsed into phrases and words.
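The slides do not describe the parsing rules used in Step 1. As a rough sketch only, the text-to-sentences-to-phrases pipeline might look like the following, assuming sentences end at periods, semicolons, or line breaks, and that phrases are runs of words broken at commas and common stop words. The stop-word list and all function names are hypothetical.

```python
import re

# Hypothetical stop-word list; the original parser's rules are not given.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "no", "of", "or",
              "the", "to", "with"}


def split_sentences(text):
    """Split free text into sentences at periods, semicolons, and newlines."""
    return [s.strip() for s in re.split(r"[.;\n]+", text) if s.strip()]


def split_phrases(sentence):
    """Break a sentence into word runs separated by commas or stop words."""
    phrases, current = [], []
    for token in re.findall(r"[a-z0-9'-]+|,", sentence.lower()):
        if token == "," or token in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


def phrase_list(reports):
    """Collect the sorted set of unique phrases across all report texts."""
    unique = set()
    for text in reports:
        for sentence in split_sentences(text):
            unique.update(split_phrases(sentence))
    return sorted(unique)
```

For example, `split_phrases("Minimal mononuclear cell infiltrate with early dysplasia")` yields two phrases, one on each side of the stop word "with". Because only unique phrases are kept and no record identifiers are attached, the resulting list carries no link back to individual patients.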

  15. 418,159 phrases represent all the textual concepts in the JHH surgical pathology records and lie outside the realm of the Common Rule • minimal mononuclear cell infiltrate • minimal mononuclear cell infiltration • minimal mononuclear cell interstitial • minimal mononuclear infiltrate • minimal mononuclear inflammation • minimal mononuclear interstitial infitrates • minimal mononuclear meningeal • minimal morphologic abnormalities

  16. Step 2. Create a precancer terminology • Started with the National Library of Medicine’s UMLS (Unified Medical Language System) • We use the concept list file, which is 113,699,627 bytes and contains 1,598,176 terms. • As an example, renal cell carcinoma (rcc) has about 80 synonymous terms in UMLS

  17. UMLS CUI C0007134: Renal cell carcinoma • carcinoma, renal cell • carcinomas, renal cell • renal cell carcinoma • hypernephroid carcinoma • grawitz tumor • hypernephroma • renal cell adenocarcinoma • rcc

  18. The UMLS precancer terms • 2,984 terms • Includes 221 terms that I added and assigned private J-codes

  19. Step 3. Map the Hopkins phrases to the precancer terms • Start with 418,159 phrases • One by one, try to find a matching term in the list of 2,984 precancer terms • Prepare a file of all the matching terms • This step takes 33 seconds to complete with a Perl script running on a 450 MHz desktop computer, i.e., it’s scalable
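The original Perl script is not shown. The sketch below illustrates one way the mapping in Step 3 could work, assuming a phrase matches when some contiguous run of its words equals a precancer term, with the longest run preferred. The function names and the `phrase|term|code` output layout (modeled on the result slide) are assumptions. Term lookup uses a dictionary, so the cost per phrase depends on phrase length rather than on the size of the term list.

```python
def longest_term_match(phrase, term_codes):
    """Return (term, code) for the longest word run in phrase that is a
    known term, or None when no run matches."""
    words = phrase.split()
    for length in range(len(words), 0, -1):          # longest runs first
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in term_codes:
                return candidate, term_codes[candidate]
    return None


def map_phrases(phrases, term_codes):
    """Build 'phrase|term|code' lines for every phrase that has a match."""
    lines = []
    for phrase in phrases:
        match = longest_term_match(phrase, term_codes)
        if match is not None:
            term, code = match
            lines.append(f"{phrase}|{term}|{code}")
    return lines


# Small illustrative term list; codes are copied from the result slide.
TERMS = {
    "actinic keratosis": "0022602",
    "adenomatous polyp": "0206677",
    "dysplasia": "0334044",
    "dysplastic": "0334045",
    "metaplasia": "0025568",
}

if __name__ == "__main__":
    sample = ["early actinic keratosis", "minimal mononuclear infiltrate",
              "early dysplastic change"]
    for line in map_phrases(sample, TERMS):
        print(line)
```

Preferring the longest run means a phrase containing "actinic keratosis" is mapped to that term rather than to a shorter term that happens to appear inside it.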

  20. The result: 10,310 term matches from 418,159 phrases (a scalable work in progress) • early actinic keratosis|actinic keratosis|0022602 • early adenomatous polyp|adenomatous polyp|0206677 • early borderline rejection|borderline|0205189 • early dysplasia|dysplasia|0334044 • early dysplastic change|dysplastic|0334045 • early dysplastic process|dysplastic|0334045 • early gastric mucin cell metaplasia|metaplasia|0025568 • early gastric mucous cell metaplasia|metaplasia|0025568

  21. Step 4. Give precancer match list to Drs. Bill Moore and Robert Miller to create a concordance • 10,310 precancer terms occurred in 54,909 accessioned surgical pathology cases between 1984 and 2000. That is, each precancer term was found, on average, in a little more than 5 cases (54,909 / 10,310 ≈ 5.3). • The 54,909 cases containing a precancer term represent 54,909 / 372,536 ≈ 15% of all accrued cases

  22. The concordance looks like this: • C0001815^367220497667008419098^^ • C0002893^394120765570701149177^^ • C0002893^435120960421908784068^^ • C0002893^436410698795906686356^^ • C0002893^445510623875200588234^^
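The field layout of the concordance is not documented on the slide. Reading each line as a caret-delimited record whose first field is a UMLS CUI and whose second field is an encrypted case identifier (an assumption based on the surrounding slides), a minimal parser might be sketched as follows. The meaning of the trailing empty fields is unknown, so they are ignored here.

```python
from collections import defaultdict


def parse_concordance(lines):
    """Group assumed encrypted case identifiers by CUI.

    Each line is expected to look like 'CUI^case_id^^'. Lines without a
    CUI or without a second field are skipped.
    """
    cases_by_cui = defaultdict(list)
    for line in lines:
        fields = line.strip().split("^")
        if len(fields) < 2 or not fields[0]:
            continue
        cases_by_cui[fields[0]].append(fields[1])
    return dict(cases_by_cui)


# The five sample lines shown on the slide.
SAMPLE = [
    "C0001815^367220497667008419098^^",
    "C0002893^394120765570701149177^^",
    "C0002893^435120960421908784068^^",
    "C0002893^436410698795906686356^^",
    "C0002893^445510623875200588234^^",
]

if __name__ == "__main__":
    for cui, cases in parse_concordance(SAMPLE).items():
        print(cui, len(cases))
```

Grouping by CUI is what makes per-diagnosis counts, such as the yearly tallies on the later slides, straightforward to compute once a year field is joined in.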

  23. Precancer-related cases by year (year | cases | percent)
      1984 | 1175 | 7%
      1985 | 1573 | 8%
      1986 | 2024 | 10%
      1987 | 2195 | 11%
      1988 | 2239 | 11%
      1989 | 2328 | 11%
      1990 | 2721 | 12%
      1991 | 3077 | 14%
      1992 | 3185 | 14%
      1993 | 2878 | 13%
      1994 | 3060 | 14%
      1995 | 2968 | 13%
      1996 | 3475 | 14%
      1997 | 4726 | 17%
      1998 | 4989 | 18%
      1999 | 5996 | 20%
      2000 | 6298 | 25%

  24. Precancer-related cases by year

  25. % Precancer-related cases by year

  26. Cases by year of intraductal carcinoma, UMLS CUI C0007124 (year | cases)
      1984 | 14
      1985 | 18
      1986 | 30
      1987 | 31
      1988 | 33
      1989 | 50
      1990 | 42
      1991 | 51
      1992 | 50
      1993 | 80
      1994 | 106
      1995 | 85
      1996 | 100
      1997 | 180
      1998 | 217
      1999 | 228

  27. Cases per year of Barrett’s esophagus, UMLS CUI C0004763 (year | cases)
      1984 | 30
      1985 | 35
      1986 | 82
      1987 | 97
      1988 | 106
      1989 | 84
      1990 | 97
      1991 | 100
      1992 | 132
      1993 | 126
      1994 | 144
      1995 | 162
      1996 | 221
      1997 | 307
      1998 | 341
      1999 | 401

  28. What does this really mean? • With this approach, we can identify all the cases of interest for any diagnosis or diagnoses, stratifying data by year of diagnosis, age, gender, or any other record element • We can determine the encrypted case identifiers for all those cases • We can give those encrypted case numbers back to the laboratory archivist, who can supply us with the (encrypted) tissue blocks belonging to those cases.
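The slides do not say how case identifiers were encrypted. One common option, shown here purely as an assumption, is a keyed one-way hash (HMAC-SHA256): the researcher sees only the hashed identifier, while the laboratory archivist, who holds the secret key, keeps a lookup table that maps each hashed identifier back to the real accession number. The function names and the sample accession numbers are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(case_id: str, secret_key: bytes) -> str:
    """Return a keyed one-way hash of a case identifier as hex text."""
    return hmac.new(secret_key, case_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def build_lookup(case_ids, secret_key: bytes):
    """Archivist-side table from hashed identifier back to the real one."""
    return {pseudonymize(cid, secret_key): cid for cid in case_ids}


if __name__ == "__main__":
    key = b"kept-only-by-the-archivist"      # hypothetical secret
    accessions = ["S84-00123", "S99-04567"]   # hypothetical case numbers
    lookup = build_lookup(accessions, key)

    # The researcher returns a hashed identifier of interest ...
    requested = pseudonymize("S99-04567", key)
    # ... and only the archivist can resolve it to the real case.
    print(lookup[requested])
```

Because the key never leaves the laboratory, a researcher holding the hashed identifiers cannot recompute or reverse them, yet the archivist can still locate the matching tissue blocks.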

  29. Conclusion: • With these techniques, laboratories with good informatics infrastructure can create a virtual omni-archive (at very low cost) that operates within current human subject protection guidelines for minimal-risk de-identified retrospective studies.
