
Cancer Genome Anatomy Project (CGAP) Informatics Carl F. Schaefer



Presentation Transcript


  1. Cancer Genome Anatomy Project (CGAP) Informatics • Carl F. Schaefer • December 7, 2001

  2. Agenda • Overview • New Features • Future plans • Behind the scenes • URL: http://cgap.nci.nih.gov

  3. CGAP Informatics • Main CGAP • Susan Greenhut (OCG) • Denise Hise (NCICB) • Carl Schaefer (NCICB) • Kotien Wu (NCICB) • GAI • Bob Clifford (LPG) • Michael Edmonson (LPG) • Ying Hu (LPG) • Cu Nguyen (LPG)

  4. Cancer Genome Anatomy Project • Find correlations between ... ? • Gene expression • Polymorphism • Chromosomal aberrations • Topography (tissue) • Morphology (histology) • Earlier informatics segments: • Tumor Gene Index • Cancer Chromosome Aberration Project • Gene Annotation Initiative

  5. History (Simplified)

  6. NCI CGAP Site: Original Goals • Organize by biology rather than by funding • Genes, tissues, chromosomes, reagents, … • Add bio-functional component (e.g. pathways) • Make the site • Consistent: search forms; lists; info pages • Coherent: tied together with internal links

  7. New Features • Expression: DGED; SAGE data • Function: GO and pathways • Structure: protein motifs • Chromosome aberrations in cancer

  8. Measuring Gene Expression • Sequencing • ESTs (100-600 bp end or full sequence of single clone) • SAGE (10 bp “tag” excised with restriction enzymes) • Hybridization • Spotted cDNA arrays • Longer probes, fewer features per slide • “Gene chip” (e.g. Affymetrix) • Multiple shorter probes, more features per chip

  9. Digital Gene Expression Display • Evolved from a concrete request: help find vaccine targets • Similar to NCBI’s DDD but more flexible interface • Queries both EST and SAGE data • But better not to mix

  10. DGED-1

  11. DGED-2

  12. DGED-3

  13. Other Expression Stuff • SAGE data • SAGE libraries accessible in library browser and DGED • Caveat: tag-to-gene mapping is ambiguous both ways • Stay tuned for improvement here • Virtual Northern • For each gene, contrast cancer vs. normal, in ESTs and SAGE, for each of 50 tissues • Ratio: tags for gene G in a given (tissue, histology) divided by total tags in that (tissue, histology) • Convert to decile
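
The ratio and decile steps of the Virtual Northern can be sketched in Python as follows. The slide does not say how deciles are assigned, so the rank-based scheme here is an assumption, as are the function names and the made-up tag counts.

```python
from bisect import bisect_right


def expression_ratio(gene_tags, total_tags):
    """Tags for one gene in a (tissue, histology) cell divided by all tags in that cell."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return gene_tags / total_tags


def to_deciles(ratios):
    """Map each key to a decile 1..10 by rank among the supplied ratios.

    Assumed scheme: rank = number of ratios <= this one, decile = ceil(10 * rank / n).
    Tied ratios therefore share the same decile.
    """
    n = len(ratios)
    ordered = sorted(ratios.values())
    deciles = {}
    for key, value in ratios.items():
        rank = bisect_right(ordered, value)      # count of values <= value
        deciles[key] = (10 * rank + n - 1) // n  # integer ceiling of 10 * rank / n
    return deciles


# Hypothetical (gene tags, total tags) per (tissue, histology) cell; not real CGAP data.
counts = {
    ("brain", "cancer"): (12, 40000),
    ("brain", "normal"): (3, 30000),
    ("colon", "cancer"): (0, 25000),
}
ratios = {cell: expression_ratio(g, t) for cell, (g, t) in counts.items()}
deciles = to_deciles(ratios)
```

Normalizing by total tags before ranking keeps deeply sequenced libraries from dominating the comparison, which is presumably why the slide defines the ratio first.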

  14. Virtual Northern

  15. Functional Information • Ontologies • Set membership • E.g., TP53 is in “DNA binding”, “DNA repair”, “transcription factor”, … • Pathways • Set membership, e.g. TP53 is in “ATM Signaling Pathway”, “p53 Signaling Pathway” … • But also relations among members of a pathway, e.g. “is catalyst for”, “activates”, …
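
The difference between set membership and pathway relations can be illustrated with two small data structures. This is only a sketch: the TP53 memberships come from the slide, while `GENE_A`, `GENE_B`, `REACTION_1`, and the helper names are hypothetical placeholders.

```python
# Set membership: term or pathway name -> set of gene symbols (from the slide).
membership = {
    "DNA binding": {"TP53"},
    "DNA repair": {"TP53"},
    "transcription factor": {"TP53"},
    "ATM Signaling Pathway": {"TP53"},
    "p53 Signaling Pathway": {"TP53"},
}


def sets_containing(gene, membership):
    """Return the names of all sets (ontology terms or pathways) that contain the gene."""
    return sorted(name for name, genes in membership.items() if gene in genes)


# Relations: (subject, relation, object) triples among pathway members.
# The gene and reaction names below are placeholders, not real pathway data.
relations = [
    ("GENE_A", "activates", "GENE_B"),
    ("GENE_B", "is catalyst for", "REACTION_1"),
]


def relations_from(subject, relations):
    """Return every triple whose subject is the given pathway member."""
    return [triple for triple in relations if triple[0] == subject]
```

Membership answers "which sets is this gene in?", whereas the triples can also answer "what does this gene do to which other member?", which a flat set cannot express.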

  16. Gene Ontology • Three top-level categories • Biological process • Molecular function • Cellular component • Given gene may appear in multiple sets • Mouse by JAX; human by Proteome • An evolving vocabulary • The sorry fate of “tumor suppressor” and “oncogenesis”

  17. GO Browser

  18. BioCarta Pathways • 95 pathway diagrams • Artistic rendering … • but dumb • Relations (e.g. “is catalyst for”) are drawn, but are not data

  19. AKT Signaling Pathway

  20. KEGG Pathways • Mainly metabolic pathways; some regulatory • Genes represented by EC numbers • Many can be hyperlinked to CGAP gene info pages • Some refs to non-human organisms • Compounds appear under various names; each has a unique KEGG compound number • Database contains representation of reactions (unlike BioCarta)
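
To make the contrast with drawn-only diagrams concrete, here is a minimal sketch of reactions held as data, with compounds resolved from any of their names to one identifier. The class names, field names, and all identifiers (`CPD_1`, `EC_PLACEHOLDER`, and so on) are hypothetical and do not reflect KEGG's actual schema or numbering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    compound_id: str   # unique compound number (placeholder values below)
    names: tuple       # the various names the compound appears under


@dataclass(frozen=True)
class Reaction:
    ec_number: str     # enzyme classification standing in for the gene
    substrates: tuple  # compound ids consumed
    products: tuple    # compound ids produced


def build_name_index(compounds):
    """Map every known name (case-insensitive) to its unique compound id."""
    index = {}
    for compound in compounds:
        for name in compound.names:
            index[name.lower()] = compound.compound_id
    return index


def reactions_involving(compound_id, reactions):
    """Return reactions in which the compound is a substrate or a product."""
    return [r for r in reactions
            if compound_id in r.substrates or compound_id in r.products]


# Placeholder data for illustration only.
compounds = [
    Compound("CPD_1", ("Compound One", "Cpd-1 alias")),
    Compound("CPD_2", ("Compound Two",)),
]
reactions = [Reaction("EC_PLACEHOLDER", ("CPD_1",), ("CPD_2",))]
name_index = build_name_index(compounds)
```

Because reactions are records rather than drawings, a query such as "which reactions consume this compound?" becomes a simple filter.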

  21. D-Glutamine and D-Glutamate Metabolism

  22. L-Glutamate Compound Info Page

  23. Summary of Functional Information for CASPASE 7

  24. Structure: Protein Motifs • GAI using HMMER to locate Pfam motifs on RefSeq (NM_ …) and MGC (BC…) transcripts • Similarity among transcripts: • Raw sequence • Single motif occurrences • Multiple motif occurrences • E-value: fit of motif to transcript • P-value: relative probability that two transcripts are closely related

  25. Structure: Protein Motifs (Example: ICE_p10, ICE_p20, and CARD among the CASPASes)

  26. Mitelman Database of Chromosomal Aberrations in Cancer • Data culled from literature -- 39,000 cases • Case records: • Clinical/demographic • Topography/morphology • Karyotype • Reference • Recurrent subset • Separate dataset of associations, often to specific genes

  27. Future Plans • Function: smarter pathways • Expression: • New SAGE data and display • Microarray data (NCI 60 cell lines) (see CMAP presentation) • Structure: gene query by motif • Operations on lists of genes • Adding columns of information to gene lists

  28. Genes in AKT Signaling Pathway

  29. Clone List

  30. Genes in GO Apoptosis

  31. Pathways, Ontology, Tissues

  32. Behind the Scenes • The build process (not a pretty sight) • Software architecture

  33. Data Sources/Sizes

  34. Build Process -- Goals • Automated • Current (with respect to external data sources) • Internally consistent (i.e. new UniGene cluster numbers throughout) • Efficient (only recompute when necessary)
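
The "only recompute when necessary" goal is the timestamp test that make applies to targets and prerequisites. As a rough illustration of that idea in Python (not the CGAP build code; the function names are hypothetical):

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if the target is missing or older than any dependency.

    Mirrors make's rule: rebuild when a prerequisite has a newer
    modification time than the target.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build_if_stale(target, dependencies, build):
    """Call build() only when the target is out of date; return whether it ran."""
    if needs_rebuild(target, dependencies):
        build()
        return True
    return False
```

A caller would pass the derived data file as `target`, its input files as `dependencies`, and a function that regenerates the file as `build`.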

  35. Makefile Example

      $(HS_GENE_TISSUE_DAT): $(TISSUE_SELECTION_DAT) $(HS_GXS_DAT)
      	$(GENE_TISSUE_CMD) \
      	  Hs \
      	  $(ALL_LIBRARIES_DAT) \
      	  $(TISSUE_SELECTION_DAT) \
      	  $(LIBRARY_KEYWORD_DAT) \
      	  $(HS_GXS_DAT) \
      	  $(HS_GENE_TISSUE_DAT)

      $(DATA_DIR)/load_hs_gene_tissue.mak: $(HS_GENE_TISSUE_DAT)
      	echo "drop index Hs_Gene_Tissue1;" | sqlplus $(DB_USER)
      	echo "drop index Hs_Gene_Tissue2;" | sqlplus $(DB_USER)
      	sqlldr userid=$(DB_USER) control=$(LOAD_DIR)/Hs_Gene_Tissue.ctl \
      	  >$(LOAD_DIR)/load_hs_gene_tissue.log 2>&1
      	echo "create index Hs_Gene_Tissue1 on Hs_Gene_Tissue(tissue_code);" | sqlplus $(DB_USER)
      	echo "create index Hs_Gene_Tissue2 on Hs_Gene_Tissue(cluster_number);" | sqlplus $(DB_USER)
      	echo "analyze table Hs_Gene_Tissue compute statistics for table;" | sqlplus $(DB_USER)
      	echo "analyze table Hs_Gene_Tissue compute statistics for all indexes;" | sqlplus $(DB_USER)
      	touch $(DATA_DIR)/load_hs_gene_tissue.mak

  36. CGAP Site Architecture (Overview)

  37. Distributed Processing

  38. Application Support
