Cancer Genome Anatomy Project (CGAP) Informatics
Carl F. Schaefer
December 7, 2001
Agenda
• Overview
• New Features
• Future plans
• Behind the scenes
URL: http://cgap.nci.nih.gov
CGAP Informatics
• Main CGAP
  • Susan Greenhut (OCG)
  • Denise Hise (NCICB)
  • Carl Schaefer (NCICB)
  • Kotien Wu (NCICB)
• GAI
  • Bob Clifford (LPG)
  • Michael Edmonson (LPG)
  • Ying Hu (LPG)
  • Cu Nguyen (LPG)
Cancer Genome Anatomy Project
Find correlations between ... ?
• Gene expression
• Polymorphism
• Chromosomal aberrations
• Topography (tissue)
• Morphology (histology)
Earlier informatics segments:
• Tumor Gene Index
• Cancer Chromosome Aberration Project
• Gene Annotation Initiative
NCI CGAP Site: Original Goals
• Organize by biology rather than by funding
  • Genes, tissues, chromosomes, reagents, …
• Add bio-functional component (e.g. pathways)
• Make the site
  • Consistent: search forms; lists; info pages
  • Coherent: tied together with internal links
New Features
• Expression: DGED; SAGE data
• Function: GO and pathways
• Structure: protein motifs
• Chromosome aberrations in cancer
Measuring Gene Expression
• Sequencing
  • ESTs (100-600 bp end or full sequence of single clone)
  • SAGE (10 bp “tag” excised with restriction enzymes)
• Hybridization
  • Spotted cDNA arrays
    • Longer probes, fewer features per slide
  • “Gene chip” (e.g. Affymetrix)
    • Multiple shorter probes, more features per chip
Digital Gene Expression Display
• Evolved from a concrete request: help find vaccine targets
• Similar to NCBI’s Digital Differential Display (DDD), but with a more flexible interface
• Queries both EST and SAGE data
  • But it is better not to mix the two in one query
Other Expression Stuff
• SAGE data
  • SAGE libraries accessible in library browser and DGED
  • Caveat: tag-to-gene mapping is ambiguous both ways (one tag may map to several genes, and one gene to several tags)
  • Stay tuned for improvement here
• Virtual Northern
  • For each gene, contrast cancer vs. normal, in ESTs and SAGE, for each of 50 tissues
  • Ratio: tags for gene G in a given (tissue, histology) divided by total tags in that (tissue, histology)
  • Convert the ratio to a decile
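The Virtual Northern ratio and decile step can be sketched as follows. This is a minimal illustration, not the CGAP implementation: the function names are hypothetical, and the slides do not say how decile boundaries are chosen, so the sketch assumes they are taken from the ranked ratios of all genes in the same (tissue, histology) group.

```python
from bisect import bisect_right


def expression_ratio(gene_tags, total_tags):
    """Tags for gene G in one (tissue, histology) over all tags in that group."""
    if total_tags <= 0:
        return 0.0  # no data for this group; treat as zero expression
    return gene_tags / total_tags


def decile_cutoffs(ratios):
    """Nine cutoffs that split the ranked ratios into ten roughly equal bins.

    Assumption: deciles are defined relative to the other genes' ratios.
    """
    ordered = sorted(ratios)
    n = len(ordered)
    return [ordered[min(n - 1, (k * n) // 10)] for k in range(1, 10)]


def to_decile(ratio, cutoffs):
    """Map a ratio to a decile number from 1 (lowest) to 10 (highest)."""
    return bisect_right(cutoffs, ratio) + 1
```

With cutoffs computed once per (tissue, histology) group, each gene's cancer and normal ratios can be mapped to deciles and compared side by side.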
Functional Information
• Ontologies
  • Set membership
  • E.g., TP53 is in “DNA binding”, “DNA repair”, “transcription factor”, …
• Pathways
  • Set membership, e.g. TP53 is in “ATM Signaling Pathway”, “p53 Signaling Pathway” …
  • But also relations among members of a pathway, e.g. “is catalyst for”, “activates”, …
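The difference between set membership and relations can be sketched in a few lines. The data structures and function names below are hypothetical; the set names and the TP53 memberships come from the slide, while GENE_A, GENE_B, and GENE_C are placeholders, not real pathway facts.

```python
# Set membership: each ontology term or pathway is just a named set of genes.
ONTOLOGY_SETS = {
    "DNA binding": {"TP53"},
    "DNA repair": {"TP53"},
    "transcription factor": {"TP53"},
}
PATHWAY_SETS = {
    "ATM Signaling Pathway": {"TP53"},
    "p53 Signaling Pathway": {"TP53"},
}

# Relations: pathways additionally carry typed edges between members.
# (subject, relation, object) triples; the gene names are placeholders.
PATHWAY_RELATIONS = [
    ("GENE_A", "activates", "GENE_B"),
    ("GENE_A", "is catalyst for", "GENE_C"),
]


def sets_containing(gene, named_sets):
    """Return the sorted names of all sets that contain the gene."""
    return sorted(name for name, members in named_sets.items() if gene in members)


def relations_from(gene, relations):
    """Return (relation, object) pairs in which the gene is the subject."""
    return [(rel, obj) for subj, rel, obj in relations if subj == gene]
```

Membership alone answers "which sets is this gene in?"; only the relation triples can answer "what does this gene do to which other gene?".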
Gene Ontology
• Three top-level categories
  • Biological process
  • Molecular function
  • Cellular component
• Given gene may appear in multiple sets
• Mouse by JAX; human by Proteome
• An evolving vocabulary
  • The sorry fate of “tumor suppressor” and “oncogenesis”
BioCarta Pathways
• 95 pathway diagrams
• Artistic rendering …
• but dumb
  • Relations (e.g. “is catalyst for”) are drawn, but are not data
KEGG Pathways
• Mainly metabolic pathways; some regulatory
• Genes represented by EC numbers
  • Many can be hyperlinked to CGAP gene info pages
  • Some refs to non-human organisms
• Compounds appear under various names; each has a unique KEGG compound number
• Database contains representation of reactions (unlike BioCarta)
Structure: Protein Motifs
• GAI using HMMER to locate Pfam motifs on RefSeq (NM_ …) and MGC (BC…) transcripts
• Similarity among transcripts:
  • Raw sequence
  • Single motif occurrences
  • Multiple motif occurrences
• E-value: fit of motif to transcript
• P-value: relative probability that two transcripts are closely related
Structure: Protein Motifs (Example: ICE_p10, ICE_p20, and CARD among the CASPASes)
Mitelman Database of Chromosomal Aberrations in Cancer
• Data culled from literature -- 39,000 cases
• Case records:
  • Clinical/demographic
  • Topography/morphology
  • Karyotype
  • Reference
• Recurrent subset
• Separate dataset of associations, often to specific genes
Future Plans
• Function: smarter pathways
• Expression:
  • New SAGE data and display
  • Microarray data (NCI 60 cell lines) (see CMAP presentation)
• Structure: gene query by motif
• Operations on lists of genes
• Adding columns of information to gene lists
Behind the Scenes
• The build process (not a pretty sight)
• Software architecture
Build Process -- Goals
• Automated
• Current (with respect to external data sources)
• Internally consistent (i.e. new UniGene cluster numbers throughout)
• Efficient (only recompute when necessary)
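The "only recompute when necessary" goal is what make provides through timestamp comparison: a target is rebuilt when it is missing or older than any of its prerequisites. The sketch below restates that rule in Python purely for illustration; the function name is hypothetical and this is not part of the CGAP build.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if the target is missing or older than any prerequisite.

    Mirrors make's basic rule; assumes every prerequisite file exists.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_mtime for path in prerequisites)
```

A build step would run only when `needs_rebuild` returns True, so an unchanged external data source costs nothing on the next build.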
Makefile Example

# Regenerate the gene/tissue data file whenever either prerequisite changes.
$(HS_GENE_TISSUE_DAT): $(TISSUE_SELECTION_DAT) $(HS_GXS_DAT)
	$(GENE_TISSUE_CMD) \
	    Hs \
	    $(ALL_LIBRARIES_DAT) \
	    $(TISSUE_SELECTION_DAT) \
	    $(LIBRARY_KEYWORD_DAT) \
	    $(HS_GXS_DAT) \
	    $(HS_GENE_TISSUE_DAT)

# Reload the database table: drop indexes, bulk-load with sqlldr, recreate
# indexes, refresh statistics, then touch a marker file to record the load.
$(DATA_DIR)/load_hs_gene_tissue.mak: $(HS_GENE_TISSUE_DAT)
	echo "drop index Hs_Gene_Tissue1;" | sqlplus $(DB_USER)
	echo "drop index Hs_Gene_Tissue2;" | sqlplus $(DB_USER)
	sqlldr userid=$(DB_USER) control=$(LOAD_DIR)/Hs_Gene_Tissue.ctl \
	    >$(LOAD_DIR)/load_hs_gene_tissue.log 2>&1
	echo "create index Hs_Gene_Tissue1 on Hs_Gene_Tissue(tissue_code);" | sqlplus $(DB_USER)
	echo "create index Hs_Gene_Tissue2 on Hs_Gene_Tissue(cluster_number);" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for table;" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for all indexes;" | sqlplus $(DB_USER)
	touch $(DATA_DIR)/load_hs_gene_tissue.mak