
Malignancy Types

Molecular Entity Types. Phenotypic Entity Types. Gene. Differentiation Status. Clinical Stage. Site. Genomic Information. Malignancy Types. Phenomic Information. Histology. Developmental State. Heredity Status. Variation. Genomic Variation associated with Malignancy.


Presentation Transcript


  1. Molecular Entity Types Phenotypic Entity Types Gene Differentiation Status Clinical Stage Site Genomic Information Malignancy Types Phenomic Information Histology Developmental State Heredity Status Variation Genomic Variation associated with Malignancy

  2. Flow Chart for Manual Annotation Process [Flow chart nodes: Auto-Annotated Texts; Biomedical Literature; Machine-learning Algorithm; Annotators (Experts); Manually Annotated Texts; Annotation Ambiguity; Entity Definitions]

  3. Defining biomedical entities A point mutation was found at codon 12 (G → A). → Variation

  4. Defining biomedical entities A point mutation was found at codon 12 (G → A). → Variation. Sub-classification: "point mutation" → Variation.Type; "codon 12" → Variation.Location; "G" → Variation.InitialState; "A" → Variation.AlteredState. Data Gathering; Data Classification
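The sub-classification on this slide can be pictured as labelled character spans over the sentence. The sketch below is illustrative only: the `Span` class, the `variation_spans` helper, and the regular expression are assumptions made for this one example sentence, not the annotation tooling used in the project (the arrow is written as `->` here).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labelled character span; `end` is exclusive."""
    label: str
    start: int
    end: int


# Sub-types of the Variation entity, as named on the slide.
SUBTYPES = ("Type", "Location", "InitialState", "AlteredState")

# Hypothetical pattern that only covers sentences shaped like the slide example.
PATTERN = re.compile(
    r"(?P<Type>point mutation).*?"
    r"(?P<Location>codon \d+)\s*"
    r"\((?P<InitialState>[ACGT])\s*->\s*(?P<AlteredState>[ACGT])\)"
)


def variation_spans(text):
    """Return one Span per Variation sub-type, or [] if the pattern is absent."""
    match = PATTERN.search(text)
    if match is None:
        return []
    return [Span(f"Variation.{name}", *match.span(name)) for name in SUBTYPES]


sentence = "A point mutation was found at codon 12 (G -> A)."
spans = variation_spans(sentence)
labelled = {s.label: sentence[s.start:s.end] for s in spans}
```

Storing offsets rather than strings keeps each annotation tied to its exact position, which matters once the same word (here "A") occurs more than once in a sentence.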

  5. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities

  6. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity

  7. Levels of specificity Gene entity: Gene > Protein kinase (superfamily) > MAPK (gene family) > MAPK10. Malignancy type entity: Cancer/Tumor > Carcinoma > Lung carcinoma > Squamous cell lung carcinoma

  8. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities Symptom: Subjective or objective evidence of disease. Disease: A specific pathological process with a characteristic set of symptoms. Arrhythmia vs. Long QT Syndrome

  9. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities • Domain-specific clarification Gene entity clarification: Regulatory elements -- promoters (e.g., TATA box)

  10. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities • Domain-specific clarification • Syntactical boundaries • Text boundary issues The K-ras gene……

  11. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities • Domain-specific clarification • Syntactical boundaries • Text boundary issues (The K-ras gene) • Pronoun co-reference (this gene, it, they)

  12. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities • Domain-specific clarification • Syntactical boundaries • Text boundary issues (The K-ras gene) • Co-reference (this gene, it, they) • Structural overlap -- entity within entity (same entity type) MAP kinase kinase kinase

  13. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities • Domain-specific clarification • Syntactical boundaries • Text boundary issues (The K-ras gene) • Pronoun co-reference (this gene, it, they) • Structural overlap -- entity within entity (different entity type) Squamous cell lung carcinoma

  14. Defining biomedical entities • Conceptual boundaries • Sub-classification of entities • Levels of specificity • Conceptual overlaps between entities • Domain-specific clarification • Syntactical boundaries • Text boundary issues (The K-ras gene) • Co-reference (this gene, it, they) • Structural overlap -- entity within entity • Discontinuous mentions (N- and K-ras )

  15. Semantic ambiguity challenges • Ambiguity within an entity type CAT: catalase vs. glycine-N-acyltransferase (GLYAT)

  16. Semantic ambiguity challenges • Ambiguity within an entity type • Ambiguity between entity types CAT: Gene entity vs. Organism

  17. Semantic ambiguity challenges • Ambiguity within entity types • Ambiguity between entity types • Gene entity ambiguity • 3% of human genes share aliases • Huge ambiguity of genes between species (mouse and human) • Gene.general, Gene.gene/RNA, Gene.protein

  18. Gene Variation Malignancy Type Gene RNA Protein Type Location Initial State Altered State Site Histology Clinical Stage Differentiation Status Heredity Status Developmental State Physical Measurement Cellular Process Expressional Status Environmental Factor Clinical Treatment Clinical Outcome Research System Research Methodology Drug Effect

  19. http://www.ldc.upenn.edu/mamandel/itre/annotators/onco/definitions.html

  20. Manual Annotation Corpus Release Jena University Language & Information Engineering Lab: http://www.julielab.de K Bretonnel Cohen and Lawrence Hunter, BMC Bioinformatics. 2006; 7(Suppl 3): S5.

  21. Summary -- Entity Definition • Developed iterative process for biomedical entity definition; • Defined genomic and phenotypic entities with distinct conceptual and syntactical boundaries in genomic variation of malignancy; • Constructed a manually annotated corpus with 1442 oncology-focused articles.

  22. Named Entity Extractors Mycn [Gene] is amplified [Variation type] in neuroblastoma [Malignancy type].

  23. Automated Extractor Development • Training and testing data • 1442 cancer-focused MEDLINE abstracts • 70% for training, 30% for testing
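A 70/30 partition of the 1442 abstracts could be produced as sketched below. The slides do not say how documents were assigned to each side, so the seeded random shuffle, the function name `split_corpus`, and the use of integer document ids are all assumptions.

```python
import random


def split_corpus(doc_ids, train_fraction=0.7, seed=0):
    """Shuffle document ids reproducibly and cut them into train/test lists."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)        # seeded so the split is repeatable
    cut = round(len(ids) * train_fraction)  # number of training documents
    return ids[:cut], ids[cut:]


# Hypothetical ids 0..1441 standing in for the 1442 abstracts.
train_ids, test_ids = split_corpus(range(1442))
```

Splitting at the document level, rather than by sentence or mention, keeps every mention from one abstract on the same side of the partition.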

  24. Automated Extractor Development • Training and testing data • 1442 cancer-focused MEDLINE abstracts • 70% for training, 30% for testing • Machine-learning algorithm • Conditional Random Field (CRF) • Sets of Features Lung cancer [MType] is the … of carcinoma [MType] deaths worldwide.

  25. Automated Extractor Development • Training and testing data • 1442 cancer-focused MEDLINE abstracts • 70% for training, 30% for testing • Machine-learning algorithm • Conditional Random Fields (CRFs) • Sets of Features • Orthographic features (capitalization, punctuation, digit/number/alpha-numeric/symbol); • Character-N-grams (N=2,3,4); • Prefix/Suffix: (*oma); • Offset conjunction (3 consecutive word tokens); • Domain-specific lexicon (NCI neoplasm list).
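The feature families listed on this slide can be sketched as a per-token feature function of the kind usually fed to a CRF toolkit. Everything here is an assumption for illustration: the function name `token_features`, the feature names, the reading of the three-token conjunction as a previous/current/next window, and the one-entry lexicon standing in for the NCI neoplasm list.

```python
def token_features(tokens, i, lexicon):
    """Build a feature dict for tokens[i]; `lexicon` holds lower-cased terms."""
    tok = tokens[i]
    low = tok.lower()
    has_digit = any(c.isdigit() for c in tok)
    has_alpha = any(c.isalpha() for c in tok)
    feats = {
        # Orthographic features
        "is_capitalized": tok[:1].isupper(),
        "has_digit": has_digit,
        "is_alphanumeric_mix": has_digit and has_alpha,
        "has_symbol": any(not c.isalnum() for c in tok),
        # Suffix feature (*oma)
        "suffix_oma": low.endswith("oma"),
        # Domain-specific lexicon lookup (single-token terms only in this sketch)
        "in_lexicon": low in lexicon,
    }
    # Character n-grams for N = 2, 3, 4
    for n in (2, 3, 4):
        for j in range(len(low) - n + 1):
            feats[f"{n}gram={low[j:j + n]}"] = True
    # Conjunction over three consecutive word tokens (previous, current, next)
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    feats[f"window={prev_tok}|{low}|{next_tok}"] = True
    return feats


tokens = ["Mycn", "is", "amplified", "in", "neuroblastoma", "."]
toy_lexicon = {"neuroblastoma"}  # hypothetical stand-in for the NCI list
features = [token_features(tokens, i, toy_lexicon) for i in range(len(tokens))]
```

A list of such dicts, one per token, is the input shape accepted by several CRF libraries; the training call itself is left out because the slides do not name the toolkit.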

  26. Extractor Performance • Precision: (true positives)/(true positives + false positives) • Recall: (true positives)/(true positives + false negatives)
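The two formulas above translate directly into code. In this sketch a mention is represented as a `(doc_id, start, end)` tuple and counted as a true positive only on an exact match; that representation, the function names, and the toy spans are assumptions, since the slides do not state the matching criterion.

```python
def precision(tp, fp):
    """true positives / (true positives + false positives)"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """true positives / (true positives + false negatives)"""
    return tp / (tp + fn) if (tp + fn) else 0.0


def score(gold, predicted):
    """Compare two collections of mentions under exact-match scoring."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted and correct
    fp = len(predicted - gold)   # predicted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but missed
    return precision(tp, fp), recall(tp, fn)


# Hypothetical spans, for illustration only.
gold_spans = {("d1", 0, 11), ("d1", 40, 49), ("d2", 5, 18)}
pred_spans = {("d1", 0, 11), ("d2", 5, 18), ("d2", 30, 36)}
p, r = score(gold_spans, pred_spans)
```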

  27. CRF-based Extractor vs. Pattern Matcher • The testing corpus • 39 manually annotated MEDLINE abstracts selected • 202 malignancy type mentions identified • The pattern matching system • 5,555 malignancy types extracted from NCI neoplasm ontology • Case-insensitive exact string matching applied • 85 malignancy type mentions (42.1%) recognized correctly • The malignancy type extractor • 190 malignancy type mentions (94.1%) recognized correctly • Included all the baseline-identified mentions
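A case-insensitive exact string matcher of the kind used as the baseline can be sketched as follows. The three-term vocabulary is a hypothetical stand-in for the 5,555 NCI terms, and the helper names and the longest-term-first ordering are assumptions. The last two lines recompute the percentages reported on the slide from its own counts.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive pattern that matches whole vocabulary terms."""
    # Longer terms first, so "lung carcinoma" is preferred over "carcinoma".
    ordered = sorted({t.lower() for t in terms}, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def find_mentions(text, matcher):
    """Return (start, end, matched_text) for every vocabulary hit."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


toy_terms = ["carcinoma", "lung carcinoma", "neuroblastoma"]  # hypothetical subset
matcher = build_matcher(toy_terms)
hits = find_mentions("Lung carcinoma and neuroblastomas were studied.", matcher)

# Fractions of the 202 annotated mentions recognized, from the slide's counts.
baseline_pct = round(100 * 85 / 202, 1)
extractor_pct = round(100 * 190 / 202, 1)
```

In this toy run the plural "neuroblastomas" is missed because it is not an exact vocabulary string, which is the kind of gap a learned extractor or a normalization step is meant to close.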

  28. The Types of Mentions NOT Identified by Pattern Matching

  29. Normalization Synonyms: abdominal neoplasm; abdomen neoplasm; Abdominal tumour; Abdominal neoplasm NOS; Abdominal tumor; Abdominal Neoplasms; Abdominal Neoplasm; Neoplasm, Abdominal; Neoplasms, Abdominal; Neoplasm of abdomen; Tumour of abdomen; Tumor of abdomen; ABDOMEN TUMOR → Unique Identifier

  30. Normalization UMLS Metathesaurus Concept Unique Identifier (CUI): 19,397 CUIs with 92,414 synonyms. Synonyms: abdominal neoplasm; abdomen neoplasm; Abdominal tumour; Abdominal neoplasm NOS; Abdominal tumor; Abdominal Neoplasms; Abdominal Neoplasm; Neoplasm, Abdominal; Neoplasms, Abdominal; Neoplasm of abdomen; Tumour of abdomen; Tumor of abdomen; ABDOMEN TUMOR → C0000735

  31. Normalization – Computational Procedures • Rule-based algorithm • Applied to both entity mentions and vocabulary terms (UMLS Metathesaurus) • Case insensitivity (carcinoma/Carcinoma) • Space/punctuation removal (lung-cancer/lungcancer) • Stemming (neuroblastoma/neuroblastomas) • Applied to mentions only • First/last character removal (additional space/punctuation) • First/last word removal (translocation lung carcinoma) • Evaluate the accuracy and the priority of the rules • 1,000 randomly selected entity mentions • Choose the best-performing rule combination and sequence
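The rule set on this slide can be sketched as a key function applied to both sides, plus mention-only fallbacks. This is an assumption-laden illustration: the plural-stripping `stem` is a crude substitute for whatever stemmer was used, the rule order is arbitrary (the slides say the order was chosen empirically), and `CUI_PLACEHOLDER` is a made-up identifier. Only C0000735 comes from the slides.

```python
import re


def stem(word):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def canonical(term):
    """Lower-case, stem each word, and drop spaces and punctuation."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return "".join(stem(w) for w in words)


def build_index(vocabulary):
    """Map the canonical key of every synonym to its concept identifier."""
    return {canonical(s): cui for cui, syns in vocabulary.items() for s in syns}


def normalize(mention, index):
    """Try the full mention, then the mention-only first/last word removal."""
    words = mention.split()
    candidates = [mention]
    if len(words) > 1:
        candidates.append(" ".join(words[1:]))   # first word removed
        candidates.append(" ".join(words[:-1]))  # last word removed
    for candidate in candidates:
        key = canonical(candidate)
        if key in index:
            return index[key]
    return None


vocabulary = {
    "C0000735": ["Abdominal Neoplasms", "Abdominal tumor", "Tumor of abdomen"],
    "CUI_PLACEHOLDER": ["Lung carcinoma"],  # hypothetical identifier
}
index = build_index(vocabulary)
```

Because `canonical` already discards punctuation and spaces, the slide's first/last character removal rule is subsumed here rather than implemented separately.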

  32. MEDLINE Data Processing • Tagging MEDLINE pre-2006 abstracts • 15,433,668 MEDLINE abstracts • 9,153,340 redundant and 580,002 distinct malignancy type mentions • ~60% of extracted mentions matched to UMLS CUIs • 1,642 CPU-hours (2.44 days on a 28-CPU cluster) • Infrastructure construction (PostgreSQL database)

  33. Gene-Malignancy-Evidence Matrix

      Gene   | Malignancy                     | Evidence
      -------|--------------------------------|---------
      A1BG   | Adenocarcinoma                 | 1634938
      A1BG   | Adenocarcinoma                 | 2292657
      A1BG   | Adenocarcinoma                 | 3566173
      ……     | ……                             | ……
      ABCC1  | Lung Carcinoma                 | 11156254
      ABCC1  | Lung Carcinoma                 | 11159731
      ABCC1  | Lung Carcinoma                 | 11172691
      ……     | ……                             | ……
      B3GAT1 | Breast Neoplasm                | 6870377
      B3GAT1 | Breast Neoplasm                | 9129046
      B3GAT1 | Breast Neoplasm                | 9701020
      ……     | ……                             | ……
      ERVK6  | Stage IV Melanoma of the Skin  | 9056412
      ERVK6  | Stage IV Melanoma of the Skin  | 9620301
      ERVK6  | Stage IV Melanoma of the Skin  | 9640365
      ……     | ……                             | ……
      NFKB1  | Colon Carcinoma                | 12842827
      NFKB1  | Colon Carcinoma                | 12901803
      NFKB1  | Colon Carcinoma                | 12934082
      ……     | ……                             | ……
      VIM    | Gastrointestinal Stromal Tumor | 12375611
      VIM    | Gastrointestinal Stromal Tumor | 12657940
      VIM    | Gastrointestinal Stromal Tumor | 12673425
      ……     | ……                             | ……

      21,493,687 normalized gene symbols (16,875 unique)

  34. Gene-Malignancy-Evidence Matrix (same table as slide 33) 5,398,954 normalized malignancy types (4,166 CUIs)

  35. Gene-Malignancy-Evidence Matrix (same table as slide 33) 3,100,773 distinct Gene-Malignancy-Evidence relations

  36. Ranked by Frequency
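One plausible reading of this slide is that gene-malignancy pairs are ranked by how many distinct pieces of evidence support them. The sketch below takes that reading as an assumption. The rows are a subset of those shown in the matrix slides, with one deliberate duplicate added to show de-duplication, and `rank_pairs` is a hypothetical helper.

```python
from collections import Counter

# (gene, malignancy, evidence id) rows taken from the matrix slides.
rows = [
    ("A1BG", "Adenocarcinoma", 1634938),
    ("A1BG", "Adenocarcinoma", 2292657),
    ("A1BG", "Adenocarcinoma", 3566173),
    ("ABCC1", "Lung Carcinoma", 11156254),
    ("ABCC1", "Lung Carcinoma", 11159731),
    ("VIM", "Gastrointestinal Stromal Tumor", 12375611),
    ("VIM", "Gastrointestinal Stromal Tumor", 12375611),  # deliberate duplicate
]


def rank_pairs(relations):
    """Count distinct evidence ids per (gene, malignancy) pair, most frequent first."""
    distinct = set(relations)  # collapse repeated relations
    counts = Counter((gene, malignancy) for gene, malignancy, _ in distinct)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_pairs(rows)
```

In a relational database the same ranking would be a grouped count over the relation table; the in-memory version is shown only to keep the example self-contained.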

  37. Summary -- Extractor Development and Application • Developed well-performing automated entity extractors across genomic and phenotypic domains; • Constructed a rule-based computational procedure for normalization; • Applied the extractors and normalizers to all MEDLINE abstracts; • Imported the extracted information into a relational database.

  38. Text Mining Applications -- Hypothesizing NB Candidate Genes

  39. Text Mining Applications -- Hypothesizing NB Candidate Genes Two distinct subtypes of neuroblastoma

  40. Text Mining Applications -- Hypothesizing NB Candidate Genes • Two distinct subtypes of neuroblastoma • Distinct clinical behaviors (favorable vs. unfavorable) • NGF/NTRK1 (TrkA) vs. BDNF/NTRK2 (TrkB) signaling pathways

  41. Text Mining Applications -- Hypothesizing NB Candidate Genes • Two distinct subtypes of neuroblastoma • Distinct clinical behaviors (favorable vs. unfavorable) • NGF/NTRK1 (TrkA) vs. BDNF/NTRK2 (TrkB) signaling pathways • Determine the early response genes differentiating the two pathways • More precise prognosis and clinical intervention

  42. Text Mining Applications -- Hypothesizing NB Candidate Genes [Diagram: NTRK1 / SH-SY5Y / NGF; NTRK2 / SH-SY5Y / BDNF] RNA extraction at 0, 1.5, 4, and 12 hrs. Affymetrix U133A Expression Array (RMAexpress normalization, SAM test). 751 differentially expressed genes

  43. Text Mining Applications -- Hypothesizing NB Candidate Genes Microarray Expression Data Analysis Gene Set 1: NTRK1, NTRK2 468 283 Gene Set 2: NTRK2, NTRK1

  44. Text Mining Applications -- Hypothesizing NB Candidate Genes • Differentially represented genes in biomedical literature • NTRK1 vs. NTRK2 pathway differentially associated genes/proteins based on literature • Preferential association determined by co-occurrence with either receptor 5 times or more over the other • Assumption: the co-occurrence frequency reflects functional correlation
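The preferential-association rule can be sketched as a ratio test on co-occurrence counts. The sketch reads "5 times or more over the other" as a count at least five times the other receptor's count, which is an interpretation. The gene names and counts are placeholders, not data from the study.

```python
def preferential_sets(cooc_a, cooc_b, ratio=5):
    """Split genes by which receptor they co-occur with at least `ratio` times more."""
    set_a, set_b = set(), set()
    for gene in set(cooc_a) | set(cooc_b):
        a = cooc_a.get(gene, 0)
        b = cooc_b.get(gene, 0)
        if a > 0 and a >= ratio * b:
            set_a.add(gene)   # preferentially associated with receptor A
        elif b > 0 and b >= ratio * a:
            set_b.add(gene)   # preferentially associated with receptor B
    return set_a, set_b


# Placeholder co-occurrence counts (gene -> number of co-mentions).
cooc_ntrk1 = {"GENE_A": 10, "GENE_B": 4, "GENE_C": 1}
cooc_ntrk2 = {"GENE_A": 2, "GENE_B": 3, "GENE_C": 7, "GENE_D": 2}
lit_set_1, lit_set_2 = preferential_sets(cooc_ntrk1, cooc_ntrk2)
```

Under this reading a gene that co-occurs with only one receptor always counts as preferential to it, however small the count. A minimum-count threshold might be a sensible addition, but the slides do not mention one.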

  45. Text Mining Applications -- Hypothesizing NB Candidate Genes NTRK1/NTRK2 Preferentially Associated Genes in Literature. LitSet 1: NTRK1 Associated Genes (514); LitSet 2: NTRK2 Associated Genes (157)

  46. Text Mining Applications -- Hypothesizing NB Candidate Genes Microarray Expression Data Analysis NTRK1/NTRK2 Associated Genes in Literature NTRK1 Associated Genes Gene Set 1: NTRK1, NTRK2 18 514 468 4 283 157 NTRK2 Associated Genes Gene Set 2: NTRK2, NTRK1

  47. Functional Pathway Analysis Determine gene enrichment score for six selected functional pathways: CD -- Cell Death; CGP -- Cell Growth and Proliferation; CCSI -- Cell-to-Cell Signaling and Interaction; CM -- Cell Morphology; NSDF -- Nervous System Development and Function; CAO -- Cellular Assembly and Organization.

  48. Functional Pathway Analysis Six selected pathways: CD -- Cell Death; CM -- Cell Morphology; CGP -- Cell Growth and Proliferation; NSDF -- Nervous System Development and Function; CCSI -- Cell-to-Cell Signaling and Interaction; CAO -- Cellular Assembly and Organization. Ingenuity Pathway Analysis Tool Kit
