Modern Neoplasm Classification
Dept of Pathology, University of Michigan
October 27, 2005
Jules J. Berman, Ph.D., M.D.
jjberman@alum.mit.edu
What is a [tumor] classification?
A grouped taxonomy [listing of all tumors] with the following properties:
  Inheritance: Hierarchical structure, with each class of tumors inheriting properties of its ancestors
  Uniqueness: Each tumor occurs in only one place in the classification
  Comprehensive: All tumors are included
  Class-intransitive: A tumor from one class does not change into a tumor from another class (e.g. an adenocarcinoma does not become a lymphoma)
Ernst Mayr: The Growth of Biological Thought: Diversity, Evolution, and Inheritance. Cambridge: Belknap Press; 1982.
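The inheritance and uniqueness properties can be illustrated with a small parent table. This is only a sketch: the class names and their placements are illustrative assumptions, not entries copied from the actual classification, and `lineage` and `shares_class` are hypothetical helper names.

```python
# Toy parent table: every class has exactly one parent, so each tumor
# occupies exactly one place (uniqueness). Names and placements are
# illustrative assumptions, not entries from the real classification.
PARENT = {
    "neoplasm": None,
    "endoderm_or_ectoderm": "neoplasm",
    "mesoderm": "neoplasm",
    "adenocarcinoma": "endoderm_or_ectoderm",
    "lymphoma": "mesoderm",
}

def lineage(name):
    """Return the class followed by its ancestors, nearest first (inheritance)."""
    chain = []
    while name is not None:
        chain.append(name)
        name = PARENT[name]
    return chain

def shares_class(tumor_a, tumor_b, class_name):
    """True when both tumors descend from class_name."""
    return class_name in lineage(tumor_a) and class_name in lineage(tumor_b)

print(lineage("adenocarcinoma"))
print(shares_class("adenocarcinoma", "lymphoma", "mesoderm"))
```

Because the table is a dictionary keyed by class name, a second placement for the same tumor cannot be recorded, which is one simple way to enforce uniqueness.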
Problems with current tumor classifications
Mixed bag of tumor classes based on:
  Anatomic site (roughly, the distance from the tumor to the floor, as in “head and neck” tumors)
  Clinical specialty (dermatologic tumors)
  Functional similarity of cell types (e.g. endocrine tumors)
Not based on any describable biologic premise.
Molecular classification of cancer
The so-called molecular classifications (based largely on gene expression arrays of tumors) are simply a way of finding variants within a population.
Mostly, you see experiments designed to cluster out variants of a tumor type (slow-growing, responsive to a specific treatment, prone to metastasize, etc.).
This is simply not classification (it ignores the intransitive law), and in fact no classification has emerged from any of the work that has been done with molecular diagnostics.
My opinion: Gene expression array studies do not create classifications, but they are very useful taxon finders.
Developmental Lineage Classification and Taxonomy of Neoplasms
Similar to (but different from) the classification efforts of the 1950s (particularly Willis).
The old hypothesis (more or less discredited) is that tumor development recapitulates embryologic development.
The new (my) hypothesis is that tumors will tend to inherit the molecular pathways of their developmental ancestors. This may be helpful in selecting classes of tumors responsive to molecular targets.
Despite the difference in hypotheses, either way you end up with a classification that follows embryologic lines and that fits in with the stem cell hypothesis.
Developmental Lineage Classification and Taxonomy of Neoplasms
Now 145,000+ terms (10+ Megabytes)
Publicly available and free
The latest version is at: www.pathologyinformatics.org
53 ways of writing prostate cancer
Prostate cancer is the concept, the 53 synonyms are the terms for the concept, and C4863000 is the code.
<name nci-code = "C4863000">prostate with adenoca</name>
<name nci-code = "C4863000">adenoca arising in prostate</name>
<name nci-code = "C4863000">adenoca involving prostate</name>
<name nci-code = "C4863000">adenoca arising from prostate</name>
<name nci-code = "C4863000">adenoca of prostate</name>
<name nci-code = "C4863000">adenoca of the prostate</name>
<name nci-code = "C4863000">prostate with adenocarcinoma</name>
<name nci-code = "C4863000">adenocarcinoma arising in prostate</name>
<name nci-code = "C4863000">adenocarcinoma involving prostate</name>
<name nci-code = "C4863000">adenocarcinoma arising from prostate</name>
<name nci-code = "C4863000">adenocarcinoma of prostate</name>
<name nci-code = "C4863000">adenocarcinoma of the prostate</name>
<name nci-code = "C4863000">adenocarcinoma arising in the prostate</name>
<name nci-code = "C4863000">adenocarcinoma involving the prostate</name>
<name nci-code = "C4863000">adenocarcinoma arising from the prostate</name>
<name nci-code = "C4863000">prostate with ca</name>
<name nci-code = "C4863000">ca arising in prostate</name>
<name nci-code = "C4863000">ca involving prostate</name>
<name nci-code = "C4863000">ca arising from prostate</name>
<name nci-code = "C4863000">ca of prostate</name>
<name nci-code = "C4863000">ca of the prostate</name>
<name nci-code = "C4863000">prostate with cancer</name>
<name nci-code = "C4863000">cancer arising in prostate</name>
<name nci-code = "C4863000">cancer involving prostate</name>
<name nci-code = "C4863000">cancer arising from prostate</name>
<name nci-code = "C4863000">cancer of prostate</name>
More:
<name nci-code = "C4863000">cancer of the prostate</name>
<name nci-code = "C4863000">cancer arising in the prostate</name>
<name nci-code = "C4863000">cancer involving the prostate</name>
<name nci-code = "C4863000">cancer arising from the prostate</name>
<name nci-code = "C4863000">prostate with carcinoma</name>
<name nci-code = "C4863000">carcinoma arising in prostate</name>
<name nci-code = "C4863000">carcinoma involving prostate</name>
<name nci-code = "C4863000">carcinoma arising from prostate</name>
<name nci-code = "C4863000">carcinoma of prostate</name>
<name nci-code = "C4863000">carcinoma of the prostate</name>
<name nci-code = "C4863000">carcinoma arising in the prostate</name>
<name nci-code = "C4863000">carcinoma involving the prostate</name>
<name nci-code = "C4863000">carcinoma arising from the prostate</name>
<name nci-code = "C4863000">prostate adenoca</name>
<name nci-code = "C4863000">prostate adenocarcinoma</name>
<name nci-code = "C4863000">prostate ca</name>
<name nci-code = "C4863000">prostate cancer</name>
<name nci-code = "C4863000">prostate carcinoma</name>
<name nci-code = "C4863000">prostatic cancer</name>
<name nci-code = "C4863000">prostatic carcinoma</name>
<name nci-code = "C4863000">prostatic adenocarcinoma</name>
<name nci-code = "C4863000">prostate gland adenocarcinoma</name>
<name nci-code = "C4863000">adenocarcinoma of the prostate gland</name>
<name nci-code = "C4863000">adenocarcinoma of prostate gland</name>
<name nci-code = "C4863000">prostate gland carcinoma</name>
<name nci-code = "C4863000">carcinoma of the prostate gland</name>
<name nci-code = "C4863000">carcinoma of prostate gland</name>
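The concept/term/code relationship in the list above can be sketched as a lookup table that maps every synonym to its code. The three sample lines are copied from the list. The regular expression and the `load_terms` helper are assumptions made for illustration, not part of the published taxonomy tools.

```python
import re

# Three of the 53 synonym lines, copied from the list above.
SAMPLE = """<name nci-code = "C4863000">adenoca of prostate</name>
<name nci-code = "C4863000">prostate cancer</name>
<name nci-code = "C4863000">carcinoma of the prostate gland</name>"""

# Captures the code attribute and the term text of each <name> element.
NAME_RE = re.compile(r'<name nci-code\s*=\s*"([^"]*)">([^<]*)</name>')

def load_terms(xml_text):
    """Map each term (synonym) to the code of its concept."""
    return {term.strip(): code for code, term in NAME_RE.findall(xml_text)}

terms = load_terms(SAMPLE)
print(terms["prostate cancer"])
```

Many terms share one code, so looking up any synonym retrieves the same concept.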
Is the taxonomy comprehensive? Let's compare it with SNOMED.
Comparing the Developmental Lineage Classification with SNOMED
1. Used the 2005 version of UMLS (free from www.nlm.nih.gov)
2. UMLS files used (name, size in bytes, date):
   MRCON05  650,948,750    1-18-05
   MRCXT    1,610,612,736  1-18-05
   MRCXT2   1,610,612,736  1-18-05
   MRCXT3   1,610,612,736  1-18-05
   MRCXT4   1,610,612,736  1-18-05
   MRCXT5   1,610,612,736  1-18-05
   MRCXT6   1,610,612,736  1-18-05
   MRCXT7   1,196,031,492  1-18-05
4. Extracted the SNOMED CT terms from mrcon05 using the script:
   MRCON05.PL  2,098  5-30-05
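MRCON05.PL

    #!/usr/bin/perl
    # Extract the English SNOMED CT terms from the UMLS concept file.
    use strict;
    use warnings;

    my $start = time();
    open(my $text, "<", "mrcon05") or die "Cannot open mrcon05: $!";
    open(my $out, ">", "snom05") or die "Cannot open snom05: $!";
    # Read until end of file. The original loop tested $line before
    # reading it, so it processed one undefined line at end of file.
    while (my $line = <$text>)
      {
      chomp $line;
      my @linearray = split(/\|/, $line);
      next if (@linearray < 15);          # skip short or malformed records
      my $cuinumber = $linearray[0];      # concept unique identifier (CUI)
      my $language = $linearray[1];       # language of the term
      my $vocabulary = $linearray[11];    # source vocabulary
      next if ("ENG" ne $language);
      next if ("SNOMEDCT" ne $vocabulary);
      print $out "$cuinumber $linearray[14]\n";
      }
    close($text);
    close($out);
    my $end = time();
    my $total = $end - $start;
    print "\ntotal time was $total seconds\n";
    exit;

Execution time of 132 seconds on a 2.89 GHz PC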
5. This produced a 35+ MByte file:
   SNOM05  35,127,210  5-30-05
6. Created a Perl script, neopull2.pl, that uses the MRCXT "Neoplasm" relationship to identify all the neoplasm CUIs in UMLS and to pull out any of the SNOMED terms that correspond to a neoplasm CUI
7. The output file is:
   SNOM.OUT  567,372  5-30-05
8. This output file contains a lot of redundant terms and plurals, so I wrote snoclean.pl to get rid of the extraneous terms:
   SNOCLEAN.PL  1,092  5-30-05
9. The final output file is:
   SNOCLEAN.OUT  300,834  5-30-05
SNOMED contains 2,673 different neoplasm concepts and 7,696 neoplasm terms
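The cleaning in step 8 might look something like the sketch below. The slides do not show snoclean.pl, so the rules here (case-insensitive de-duplication, and dropping a term ending in "s" when its singular is listed under the same CUI) are assumptions. The CUIs and terms in the example are made up.

```python
def clean_terms(pairs):
    """Return sorted unique (cui, term) pairs with simple plurals removed.

    pairs: iterable of (cui, term) tuples, e.g. parsed from an extract file.
    """
    # Normalizing case and whitespace removes exact duplicates.
    unique = {(cui, term.strip().lower()) for cui, term in pairs}
    kept = set()
    for cui, term in unique:
        # Treat "xs" as a plural of "x" only if "x" is listed for the same CUI.
        if term.endswith("s") and (cui, term[:-1]) in unique:
            continue
        kept.add((cui, term))
    return sorted(kept)

# Hypothetical input rows; the identifiers are placeholders, not real CUIs.
rows = [("C1", "Lipoma"), ("C1", "lipoma"), ("C1", "lipomas"), ("C2", "sarcoma")]
cleaned = clean_terms(rows)
print(cleaned)
```

A rule this simple misses irregular plurals, so a real cleaning pass would need more cases.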
                        Neoplasm concepts   Neoplasm terms
SNOMED                  2,673               7,696
Developmental Lineage   6,193               146,666

The Developmental Lineage has:
  2.3 times as many neoplasm concepts as SNOMED
  19 times as many neoplasm terms as SNOMED
Can one pathologist create a better nomenclature than the CAP? Maybe.
The large curated nomenclatures can't be used for concept matching, and they are fast becoming obsolete for their intended mode of human-based implementation because of the explosive growth of the data domain: terabytes and terabytes every day (think about all types of digital data in medical information systems).
Prakash Nadkarni, MD, Roland Chen, MD, Cynthia Brandt, MD, MPH. UMLS Concept Indexing for Production Databases: A Feasibility Study. J Am Med Inform Assoc. 2001;8:80-91.
Conclusions: Considerable curation needs to be performed to define a UMLS subset that is suitable for concept matching.
What is the value of a comprehensive neoplasm classification?
1. A modern classification is the key to retrieving, organizing, and integrating the data held in biomedical databases (including the data held in hospital information systems).
   Can we use the taxonomy to code our surgical pathology reports and other textual documents?
2. A classification is a hypothesis about the nature of reality.
   Can we use the classification to select classes of tumors (rather than single tumors) for molecular targeted cancer therapy? [We've done this with antibiotics with astounding success.]
   Can we learn something about the biology of tumors by using the classification to stratify the data found in large biological databases and inspecting the results?
Autocoding Surgical Pathology Reports
What is the size of the data domain when we're talking about surgical pathology reports?
About 25 million surgical pathology reports are generated in the U.S. each year (and about 50 million cytology reports).
Autocoding Surgical Pathology Reports
Allowing 1000 bytes per report, these reports occupy 25 Gigabytes of text (25 thousand million bytes).
Here is what 1000 bytes looks like:

  To be, or not to be,--that is the question:--
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune
  Or to take arms against a sea of troubles,
  And by opposing end them?--To die,--to sleep,--
  No more; and by a sleep to say we end
  The heartache, and the thousand natural shocks
  That flesh is heir to,--'tis a consummation
  Devoutly to be wish'd. To die,--to sleep;--
  To sleep! perchance to dream:--ay, there's the rub;
  For in that sleep of death what dreams may come,
  When we have shuffled off this mortal coil,
  Must give us pause: there's the respect
  That makes calamity of so long life;
  For who would bear the whips and scorns of time,
  The oppressor's wrong, the proud man's contumely,
  The pangs of despis'd love, the law's delay,
  The insolence of office, and the spurns
  That patient merit of the unworthy takes,
  When he himself might his quietus make
  With a bare bodkin? who would these fardels bear,
  To grunt and sweat under a weary life,
  But

Compressed, all of the surgical pathology reports produced in the U.S. in one year will fit easily on one DVD (like 10 episodes of I Love Lucy).
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://www.purl.org/dc/elements/1.0/"
         xmlns:v="http://www.pathologyinformatics.org/informatics_r.htm">
  <rdf:Description about="urn:PMID-16160487">
    <dc:title> interobserver and intraobserver variability in the diagnosis of hydatidiform mole </dc:title>
    <v:autocode term="mole" code="C0000000" />
    <v:autocode term="hydatidiform mole" code="" />
    <de_id> * * and * * in the * of hydatidiform mole * *</de_id>
  </rdf:Description>
  <rdf:Description about="urn:PMID-16160486">
    <dc:title> primary glial tumor of the retina with features of myxopapillary ependymoma </dc:title>
    <v:autocode term="tumor" code="C0000000" />
    <v:autocode term="myxopapillary ependymoma" code="C0000000" />
    <v:autocode term="tumor of the retina" code="C0000000" />
    <v:autocode term="glial tumor" code="C3059000" />
    <v:autocode term="ependymoma" code="C0000000" />
    <de_id> * * glial tumor of the retina with * of myxopapillary ependymoma * *</de_id>
  </rdf:Description>
  <rdf:Description about="urn:PMID-16160485">
    <dc:title> cd20-negative t-cell-rich b-cell lymphoma as a progression of a nodular lymphocyte-predominant hodgkin lymphoma treated with rituximab a molecular analysis using laser capture microdissection </dc:title>
    <v:autocode term="lymphoma" code="C0000000" />
    <v:autocode term="hodgkin" code="C0000000" />
    <v:autocode term="b-cell lymphoma" code="C6858100" />
    <v:autocode term="t-cell-rich b-cell lymphoma" code="C9496100" />
    <v:autocode term="hodgkin lymphoma" code="" />
    <de_id> * * t-cell-rich b-cell lymphoma as a * of a * * hodgkin lymphoma * with * a * * using * * * * *</de_id>
  </rdf:Description>
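Output of this kind, with coded terms plus a scrubbed copy of the text, could in principle come from a phrase lookup like the sketch below. This is a brute-force illustration under stated assumptions. It is not the author's autocoder and does not emit RDF. The stop-word list, the `autocode` function, and the lookup table are hypothetical, and only the three codes shown are copied from the sample above.

```python
# Assumed stop words that a scrubber may safely leave in place.
STOP_WORDS = {"a", "and", "as", "in", "of", "the", "with"}

# Hypothetical nomenclature fragment; the codes are copied from the sample.
NOMENCLATURE = {
    "glial tumor": "C3059000",
    "b-cell lymphoma": "C6858100",
    "t-cell-rich b-cell lymphoma": "C9496100",
}

def autocode(text, nomenclature, max_words=5):
    """Return (matches, scrubbed) for one line of text.

    matches: list of (term, code) for every phrase found in the nomenclature.
    scrubbed: the text with every word that is neither a stop word nor part
    of a matched term replaced by "*".
    """
    words = text.lower().split()
    keep = [word in STOP_WORDS for word in words]
    matches = []
    for start in range(len(words)):
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            phrase = " ".join(words[start:end])
            if phrase in nomenclature:
                matches.append((phrase, nomenclature[phrase]))
                for i in range(start, end):
                    keep[i] = True
    scrubbed = " ".join(w if k else "*" for w, k in zip(words, keep))
    return matches, scrubbed

matches, scrubbed = autocode("primary glial tumor of the retina", NOMENCLATURE)
print(matches, scrubbed)
```

Coding and scrubbing share one pass over the words, which is the property the sample output relies on: anything not recognized as a term or a stop word is removed.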
The autocoder prepares an XML file in RDF format (a self-describing document). It autocodes and scrubs text concurrently, at a speed of about 8,000 reports per second, and does an incomparably better job than human coders!
This means that it will code and scrub the 25 million surgical pathology reports produced in the U.S. each year in about an hour, using a desktop PC.
If we had access to a supercomputer (operating more than 3,000 times faster than my desktop PC), we could autocode and scrub every pathology report produced in the country in about a second.
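The timing claims follow from simple arithmetic on the figures quoted above. The variable names are illustrative.

```python
reports_per_year = 25_000_000     # U.S. surgical pathology reports per year
reports_per_second = 8_000        # quoted autocoder speed on a desktop PC
speedup = 3_000                   # quoted supercomputer speed advantage

desktop_seconds = reports_per_year / reports_per_second
desktop_minutes = desktop_seconds / 60
supercomputer_seconds = desktop_seconds / speedup

print(desktop_seconds, round(desktop_minutes, 1), round(supercomputer_seconds, 2))
```

The desktop figure comes to 3,125 seconds, or about 52 minutes, and dividing by 3,000 gives roughly one second.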
Why is it so important to autocode fast?
Because we're not really talking about coding (coded datasets cannot be justified on the basis of their scientific value). We're really talking about re-coding very large datasets as necessary.
You almost always need to re-code!!!
1. Whenever you want to change from one nomenclature to another (eliminates the problem of brand-name loyalty)
2. Whenever you introduce a new version of a nomenclature
3. Whenever you want to use a new coding algorithm (e.g. parsimonious versus comprehensive, or linking a code to a particular extracted portion of the report)
4. Whenever you add legacy data to your LIS
5. Whenever you merge different pathology datasets (forget mapping!!!)
How can we integrate the neoplasm classification with OMIM to discover a new biological observation about tumors?
What is OMIM?
OMIM is a free, comprehensive listing of all the so-called Mendelian inherited diseases.
OMIM is 103,610,906 bytes (over 100 million bytes)
Shakespeare's Hamlet is 180,711 bytes
OMIM is about 573 times larger than Hamlet
Each record of OMIM lists the name of the inherited disease and all the medical conditions (including neoplasms) that may be associated with the condition.
Let's autocode all of OMIM and examine the results:
1. The time to autocode was 92 seconds
2. The number of records in OMIM is 16,785
3. The number of records listing primitive tumors is 348
4. The number of records listing endoderm_or_ectoderm tumors is 1,220
5. The number of records listing mesoderm tumors is 1,766 (completely unlike what you might expect with non-inherited tumors)
6. The number of records listing neuroectoderm tumors is 747
So, because we have a class system, we can look at instance-coded datasets and make observations about CLASS.
Easy to count the three combinations of two-lineage (discordant) records:
The number of OMIM records with neoplasm concepts in the record text is 1,015.
  ectoderm/mesoderm        72 OMIM records
  ectoderm/neuroectoderm   24 OMIM records
  mesoderm/neuroectoderm   39 OMIM records
  total                   135 class-discordant OMIM records
So, 135/1,015 (13%) have a lineage discordance.
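Once each record has been reduced to the set of lineage classes found in it, the discordance count is a short tally. The sketch below assumes that a discordant record is one with exactly two lineages. The record identifiers are made up, and the final lines recompute the percentage from the counts reported above.

```python
from collections import Counter

def discordance_counts(records):
    """Tally records that mention tumors from exactly two lineage classes.

    records: dict mapping a record id to the set of lineage classes
    assigned to the tumors named in that record.
    """
    counts = Counter()
    for classes in records.values():
        if len(classes) == 2:
            counts[tuple(sorted(classes))] += 1
    return counts

# Hypothetical class-coded records; the ids are placeholders.
toy = {
    "rec1": {"ectoderm", "mesoderm"},
    "rec2": {"mesoderm"},
    "rec3": {"mesoderm", "neuroectoderm"},
    "rec4": {"ectoderm", "mesoderm"},
}
counts = discordance_counts(toy)

# Counts reported on the slide.
reported = {"ectoderm/mesoderm": 72, "ectoderm/neuroectoderm": 24,
            "mesoderm/neuroectoderm": 39}
total = sum(reported.values())
percent = 100 * total / 1015
print(counts, total, round(percent, 1))
```

The tally works only because every tumor code maps to a single class, which is what the uniqueness property guarantees.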
Causes for the 135 cases of class discordance:
1. Inherited conditions with an (external) environmental factor
2. Physiologic (internal) effects that cross lineages (breast and ovarian cancers caused by an endocrine sensitivity that extends across lineages)
3. Conditions that include a tumor that occurs too infrequently to be correctly associated with the inherited condition
4. Mistakes in parsing OMIM (finding the name of a tumor in a record that was never intended to indicate that the condition is associated with the tumor)
5. Bad classification
How do you decide? In this case, you go back and read the 135 records and try to understand what went wrong in each case.
Classification papers
Autocoding papers (Doublet Method: 20,000 times faster than other published methods)
Confidentiality/privacy papers
  - De-identification and data scrubbing (Concept Match method)
  - Zero-knowledge reconciliation of identities
  - Threshold method for exchanging pieces of data
Data integration papers
www.pubmed.org, search on: berman jj