Cancer Genome Anatomy Project (CGAP) Informatics
Carl F. Schaefer

Cancer Genome Anatomy Project
Find correlations between ... ?
• Gene expression
• Polymorphism
• Chromosomal aberrations
• Topography (tissue)
• Morphology (histology)
Earlier informatics segments:
• Tumor Gene Index
• Cancer Chromosome Aberration Project
• Gene Annotation Initiative

History (Simplified)

NCI CGAP Site: Original Goals
• Organize by biology rather than by funding
  • Genes, tissues, chromosomes, reagents, …
• Add bio-functional component (e.g. pathways)
• Make the site
  • Consistent: search forms; lists; info pages
  • Coherent: tied together with internal links

New Features
• Expression: DGED; SAGE data
• Function: GO and pathways
• Structure: protein motifs
• Chromosome aberrations in cancer

Measuring Gene Expression
• Sequencing
  • ESTs (100-600 bp end or full sequence of single clone)
  • SAGE (10 bp “tag” excised with restriction enzymes)
• Hybridization
  • Spotted cDNA arrays
    • Longer probes, fewer features per slide
  • “Gene chip” (e.g. Affymetrix)
    • Multiple shorter probes, more features per chip
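To make the SAGE bullet concrete, here is a minimal sketch of tag extraction. It assumes the common convention of taking the 10 bp immediately downstream of the 3'-most anchoring-enzyme site, with NlaIII (CATG) as the anchoring enzyme. The slide names neither the enzyme nor the exact rule, so both are assumptions, and the function name and toy sequence are illustrative only.

```python
ANCHOR_SITE = "CATG"  # assumption: NlaIII site; the slide does not name the enzyme
TAG_LENGTH = 10       # from the slide: 10 bp tag


def extract_sage_tag(transcript, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag following the 3'-most anchor site.

    Returns None when the transcript has no anchor site or when fewer
    than tag_length bases follow the last site.
    """
    seq = transcript.upper()
    pos = seq.rfind(anchor)  # 3'-most occurrence of the anchor site
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Toy transcript (made up) with two CATG sites; only the last one is used.
toy_transcript = "GGGCATGAAACCCGGGTTTCATGACGTACGTAGCTTAAAA"
tag = extract_sage_tag(toy_transcript)
```

Because a tag is this short, distinct transcripts can yield the same tag, which is one source of ambiguity when tags are mapped back to genes.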

Digital Gene Expression Display
• Evolved from a concrete request: help find vaccine targets
• Similar to NCBI’s DDD but with a more flexible interface
• Queries both EST and SAGE data
  • But better not to mix them

DGED-1

DGED-2

DGED-3

Other Expression Stuff
• SAGE data
  • SAGE libraries accessible in library browser and DGED
  • Caveat: tag-to-gene mapping is ambiguous both ways
    • Stay tuned for improvement here
• Virtual Northern
  • For each gene, contrast cancer vs. normal, in ESTs and SAGE, for each of 50 tissues
  • Ratio: tags for gene G in a given tissue and histology, divided by total tags in that tissue and histology
  • Convert to decile
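The Virtual Northern arithmetic can be sketched as follows. The ratio follows the slide directly. The decile step is an assumption: the slide does not say which population the decile is taken over, so this sketch ranks a value against a caller-supplied list of ratios and returns a bin from 1 to 10. The function names are illustrative.

```python
from bisect import bisect_left


def expression_ratio(gene_tags, total_tags):
    """Tags for one gene in a (tissue, histology) pool over all tags in that pool."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return gene_tags / total_tags


def to_decile(value, population):
    """Map value to a decile 1..10 by its rank within population.

    Assumption: deciles are rank-based over whatever reference set of
    ratios the caller supplies; the slide does not define the reference.
    """
    if not population:
        raise ValueError("population must not be empty")
    ranked = sorted(population)
    below = bisect_left(ranked, value)  # number of entries strictly below value
    # Integer arithmetic avoids floating-point surprises at bin boundaries.
    return min(10, (below * 10) // len(ranked) + 1)


# Toy counts (made up): 5 tags for a gene out of 1000 tags in the pool.
ratio = expression_ratio(5, 1000)
```

Binning into deciles gives a coarse display that is easy to compare across tissues whose libraries differ greatly in size.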

Virtual Northern

Functional Information
• Ontologies
  • Set membership
    • E.g., TP53 is in “DNA binding”, “DNA repair”, “transcription factor”, …
• Pathways
  • Set membership, e.g. TP53 is in “ATM Signaling Pathway”, “p53 Signaling Pathway” …
  • But also relations among members of a pathway, e.g. “is catalyst for”, “activates”, …
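The distinction between set membership and relations can be sketched as a data structure. The TP53 memberships come from the slide; the Relation and Pathway classes, the helper name, and the single "ATM activates TP53" edge are illustrative assumptions, not the CGAP schema.

```python
from dataclasses import dataclass, field

# Ontology terms: membership only. Each term is just a set of genes.
ontology_terms = {
    "DNA binding": {"TP53"},
    "DNA repair": {"TP53"},
    "transcription factor": {"TP53"},
}


@dataclass(frozen=True)
class Relation:
    """A typed, directed relation between two pathway members."""
    source: str
    kind: str      # e.g. "is catalyst for", "activates"
    target: str


@dataclass
class Pathway:
    """A pathway carries membership plus relations among its members."""
    name: str
    members: set = field(default_factory=set)
    relations: list = field(default_factory=list)


def terms_for(gene, term_sets):
    """Return, sorted, every term whose gene set contains the gene."""
    return sorted(term for term, genes in term_sets.items() if gene in genes)


# Illustrative pathway content (the single edge is an example, not CGAP data).
atm_pathway = Pathway(
    name="ATM Signaling Pathway",
    members={"ATM", "TP53"},
    relations=[Relation("ATM", "activates", "TP53")],
)

tp53_terms = terms_for("TP53", ontology_terms)
activators_of_tp53 = [r.source for r in atm_pathway.relations
                      if r.kind == "activates" and r.target == "TP53"]
```

A query such as "what activates TP53?" is answerable only from the relation records; membership sets alone cannot express it.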

Gene Ontology
• Three top-level categories
  • Biological process
  • Molecular function
  • Cellular component
• Given gene may appear in multiple sets
• Mouse by JAX; human by Proteome
• An evolving vocabulary
  • The sorry fate of “tumor suppressor” and “oncogenesis”

GO Browser

BioCarta Pathways
• 95 pathway diagrams
• Artistic rendering …
• but dumb
  • Relations (e.g. “is catalyst for”) are drawn, but are not data

AKT Signaling Pathway

KEGG Pathways
• Mainly metabolic pathways; some regulatory
• Genes represented by EC numbers
  • Many can be hyperlinked to CGAP gene info pages
  • Some refs to non-human organisms
• Compounds appear under various names; each has a unique KEGG compound number
• Database contains representation of reactions (unlike BioCarta)
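A small sketch of what "compounds under various names, one compound number" and "reactions as data" might look like in code. The compound numbers and the EC number below are quoted from memory of KEGG and should be checked against the database; the class, index, and function names are illustrative and are not the CGAP or KEGG schema.

```python
from dataclasses import dataclass

# Compound number -> known names (illustrative subset; verify against KEGG).
COMPOUND_NAMES = {
    "C00025": ["L-Glutamate", "L-Glutamic acid", "Glutamate"],
    "C00064": ["L-Glutamine"],
    "C00001": ["H2O", "Water"],
    "C00014": ["NH3", "Ammonia"],
}


def build_name_index(compound_names):
    """Map every lower-cased name to its single compound number."""
    index = {}
    for compound_id, names in compound_names.items():
        for name in names:
            index[name.lower()] = compound_id
    return index


@dataclass(frozen=True)
class Reaction:
    """A reaction stored as data: enzyme (EC number), inputs, outputs."""
    ec_number: str
    substrates: tuple
    products: tuple


# Glutaminase: L-glutamine + H2O -> L-glutamate + NH3
GLUTAMINASE = Reaction("3.5.1.2", ("C00064", "C00001"), ("C00025", "C00014"))


def reactions_producing(compound_name, index, reactions):
    """Return the reactions whose products include the named compound."""
    compound_id = index.get(compound_name.lower())
    return [r for r in reactions if compound_id in r.products]


name_index = build_name_index(COMPOUND_NAMES)
hits = reactions_producing("L-Glutamic acid", name_index, [GLUTAMINASE])
```

Resolving names to compound numbers first means a lookup succeeds whichever synonym a pathway diagram happens to use.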

D-Glutamine and D-Glutamate Metabolism

L-Glutamate Compound Info Page

Summary of Functional Information for CASPASE 7

Structure: Protein Motifs
• GAI using HMMER to locate Pfam motifs on RefSeq (NM_ …) and MGC (BC…) transcripts
• Similarity among transcripts:
  • Raw sequence
  • Single motif occurrences
  • Multiple motif occurrences
• E-value: fit of motif to transcript
• P-value: relative probability that two transcripts are closely related
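The difference between comparing transcripts by single versus multiple motif occurrences can be illustrated with a simple overlap score. This sketch substitutes a Jaccard index for the slide's P-value, whose definition is not given, so it is a stand-in and not the GAI method. The hit lists, E-values, cutoff, and transcript labels are made up for illustration.

```python
from collections import Counter


def motif_counts(hits, max_evalue=1e-5):
    """Count motif occurrences, keeping hits at or below an E-value cutoff.

    hits is a list of (motif_name, evalue) pairs; the cutoff is an assumption.
    """
    return Counter(name for name, evalue in hits if evalue <= max_evalue)


def set_similarity(a, b):
    """Jaccard index on motif presence (each motif counted once)."""
    names_a, names_b = set(a), set(b)
    union = names_a | names_b
    return len(names_a & names_b) / len(union) if union else 0.0


def multiset_similarity(a, b):
    """Jaccard index on motif counts (repeated occurrences matter)."""
    union_size = sum((a | b).values())          # per-motif maximum
    shared_size = sum((a & b).values())         # per-motif minimum
    return shared_size / union_size if union_size else 0.0


# Made-up hit lists for three transcripts.
tx_a = motif_counts([("CARD", 1e-20), ("ICE_p20", 1e-40), ("ICE_p10", 1e-30)])
tx_b = motif_counts([("ICE_p20", 1e-45), ("ICE_p10", 1e-28), ("CARD", 0.5)])
tx_c = motif_counts([("CARD", 1e-18), ("CARD", 1e-15),
                     ("ICE_p20", 1e-40), ("ICE_p10", 1e-30)])

presence_ab = set_similarity(tx_a, tx_b)
presence_ac = set_similarity(tx_a, tx_c)
counts_ac = multiset_similarity(tx_a, tx_c)
```

Here tx_a and tx_c look identical by motif presence, but differ once repeated occurrences are counted, which is the distinction the two bullets draw.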

Structure: Protein Motifs (Example: ICE_p10, ICE_p20, and CARD among the CASPASes)

Mitelman Database of Chromosomal Aberrations in Cancer
• Data culled from literature: 39,000 cases
• Case records:
  • Clinical/demographic
  • Topography/morphology
  • Karyotype
  • Reference
• Recurrent subset
• Separate dataset of associations, often to specific genes

Future Plans
• Function: smarter pathways
• Expression:
  • New SAGE data and display
  • Microarray data (NCI 60 cell lines) (see CMAP presentation)
• Structure: gene query by motif
• Operations on lists of genes
  • Adding columns of information to gene lists
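As a rough idea of what "operations on lists of genes" and "adding columns" could mean, the sketch below intersects two gene lists and attaches a lookup column to the result. The gene lists, the column contents, and the function names are hypothetical examples and do not describe the planned CGAP feature.

```python
def intersect(first, second):
    """Genes present in both lists, in the order they appear in first."""
    wanted = set(second)
    return [gene for gene in first if gene in wanted]


def add_column(rows, column, lookup, default=""):
    """Return new rows with one extra column filled from a symbol lookup."""
    return [{**row, column: lookup.get(row["symbol"], default)} for row in rows]


# Hypothetical gene lists standing in for a pathway list and a GO-term list.
pathway_genes = ["AKT1", "BAD", "CASP9"]
go_term_genes = ["BAD", "CASP7", "CASP9", "TP53"]

shared = intersect(pathway_genes, go_term_genes)
rows = [{"symbol": symbol} for symbol in shared]

# Hypothetical annotation to attach as a column.
motif_lookup = {"CASP9": "CARD"}
annotated = add_column(rows, "motif", motif_lookup)
```

Keeping each row as a small record makes it straightforward to chain further operations, such as adding a second column or filtering on one.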

Genes in AKT Signaling Pathway

Clone List

Genes in GO Apoptosis

Pathways, Ontology, Tissues

Behind the Scenes
• The build process (not a pretty sight)
• Software architecture

Data Sources/Sizes

Build Process -- Goals
• Automated
• Current (with respect to external data sources)
• Internally consistent (i.e. new UniGene cluster numbers throughout)
• Efficient (only recompute when necessary)

Makefile Example

$(HS_GENE_TISSUE_DAT): $(TISSUE_SELECTION_DAT) $(HS_GXS_DAT)
	$(GENE_TISSUE_CMD) \
	    Hs \
	    $(ALL_LIBRARIES_DAT) \
	    $(TISSUE_SELECTION_DAT) \
	    $(LIBRARY_KEYWORD_DAT) \
	    $(HS_GXS_DAT) \
	    $(HS_GENE_TISSUE_DAT)

$(DATA_DIR)/load_hs_gene_tissue.mak: $(HS_GENE_TISSUE_DAT)
	echo "drop index Hs_Gene_Tissue1;" | sqlplus $(DB_USER)
	echo "drop index Hs_Gene_Tissue2;" | sqlplus $(DB_USER)
	sqlldr userid=$(DB_USER) control=$(LOAD_DIR)/Hs_Gene_Tissue.ctl \
	    >$(LOAD_DIR)/load_hs_gene_tissue.log 2>&1
	echo "create index Hs_Gene_Tissue1 on Hs_Gene_Tissue(tissue_code);" | sqlplus $(DB_USER)
	echo "create index Hs_Gene_Tissue2 on Hs_Gene_Tissue(cluster_number);" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for table;" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for all indexes;" | sqlplus $(DB_USER)
	touch $(DATA_DIR)/load_hs_gene_tissue.mak
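The "only recompute when necessary" goal rests on make's timestamp rule, and the second rule above uses a stamp file (the touched .mak file) to record that a database load happened. The rule can be sketched in Python as below. This is an illustration of the idea under simplified assumptions, not part of the CGAP build, and the function names are hypothetical.

```python
import os
from pathlib import Path


def is_stale(target, prerequisites):
    """True if target is missing or any prerequisite is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def rebuild_if_needed(target, prerequisites, action):
    """Run action only when target is stale, then touch target as a stamp.

    Touching mirrors the `touch ...load_hs_gene_tissue.mak` step: the file
    itself holds nothing useful, its timestamp records that the work ran.
    Returns True if the action ran.
    """
    if not is_stale(target, prerequisites):
        return False
    action()
    Path(target).touch()
    return True
```

A stamp file is useful when the real product of a step lives somewhere make cannot see, such as rows in a database table.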

CGAP Site Architecture (Overview)

Distributed Processing

Application Support
