
MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu

CS173. Lecture 14: Personal Genomics, GSEA/GREAT. MW 11:00-12:15 in Beckman B302. Prof: Gill Bejerano. TAs: Jim Notwell & Harendra Guturu. Announcements: The coming Monday (3/4) lecture is again in LK101 (see class website for room reminders).


Presentation Transcript


  1. CS173 Lecture 14: Personal Genomics, GSEA/GREAT. MW 11:00-12:15 in Beckman B302. Prof: Gill Bejerano. TAs: Jim Notwell & Harendra Guturu. http://cs173.stanford.edu [BejeranoWinter12/13]

  2. Announcements
     • The coming Monday (3/4) lecture is again in LK101 (see class website for room reminders).
     • I’ll be working on grad student admissions, so Harendra will lecture about his work (we’ll prepare the ground today).

  3. Quick recap

  4. Sequencing Public project: Celera project:

  5. Human Structural Variation

  6. Human Disease
     • Cancer
     • Congenital defects
     • Disease association studies
     • Genic and cis-regulatory contributions

  7. Personal genomics

  8. Gameplan
     1. As your budget allows, characterize all the variants in an individual’s genome:
        • Against the reference genome.
        • Against variants known in the population.
        • If possible, against unaffected relatives.
     2. Compare the structural variants you observe to the body of knowledge about genome content & function. Seek culprit mutations.
     3. Having detected a smoking-gun mutation, attempt to recreate it in a cell population or organism to obtain a “disease model”.

  9. Targeted Sequencing, or looking under the lamp is 50x cheaper
     Capture Methods vs. Shotgun
     • Targeted sequencing allows for much higher coverage at less cost.
     • It will only capture known sites.
     • These methods also introduce significant capture bias, including failure to capture sites that differ significantly from the reference genome (analogous to microarrays).
     [Figure: exome library vs. shotgun library, prepared from genomic DNA containing Exon 1 and Exon 2]
     Modified from Meyerson et al. 2010. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October): 685-696.

  10. Consumer genomics

  11. Gameplan
      1. Collect scientific literature about all structural variant correlations with human disease & traits.
      2. Genotype customers for as many informative loci as is commercially viable.
      3. Offer counseling for your findings, and their meaning.
      4. Ask customers to phenotype themselves.
      5. Discover new associations!

  12. Pay, send biosample, get genotyped

  13. Trait associations

  14. Disease Risk Alleles

  15. Side Effects: Serious Ethical Issues

  16. Gene set enrichment analysis: The genic version

  17. Imagine you did a microarray experiment

  18. Cluster all genes for differential expression
      [Figure: genes across experiment and control replicates, ordered from most significantly up-regulated genes, through unchanged genes, to most significantly down-regulated genes]

  19. Determine cut-offs, examine individual genes
      [Figure: genes across experiment and control replicates, ordered from most significantly up-regulated genes, through unchanged genes, to most significantly down-regulated genes]

  20. Genes usually work in groups
      • Biochemical pathways, signaling pathways, etc.
      • Asking about the expression perturbation of groups of genes is both more appealing biologically and more powerful statistically (you sum perturbations).

  21. Ask about whole gene sets
      [Figure: Gene Sets 1, 2, and 3 shown against experiment vs. control, scored with a +/- ES/NES statistic. Gene set 3 is up-regulated; gene set 2 is down-regulated.]

  22. One approach: GSEA
      [Figure: number of genes vs. gene expression level, showing the dataset distribution alongside the gene set 1 and gene set 3 distributions]
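
The slides name an ES/NES statistic without spelling it out. As a rough illustration only, the sketch below assumes the classic unweighted, Kolmogorov-Smirnov-like running sum: walk down the ranked gene list, step up at members of the gene set, step down otherwise, and report the largest deviation from zero. The function name and gene names are hypothetical, and the normalization step (NES, via permutations) is not shown.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (an assumed, simplified form).

    ranked_genes: genes ordered from most up-regulated to most down-regulated.
    gene_set: the set of genes being tested.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    step_up = 1.0 / n_hits        # increments sum to +1 over all hits
    step_down = 1.0 / n_misses    # decrements sum to -1 over all misses

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += step_up if is_hit else -step_down
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: g1 is most up-regulated, g10 most down-regulated.
ranked = [f"g{i}" for i in range(1, 11)]
print(enrichment_score(ranked, {"g1", "g2", "g3"}))    # set concentrated at the top
print(enrichment_score(ranked, {"g8", "g9", "g10"}))   # set concentrated at the bottom
```

A set concentrated at the top of the ranking scores close to +1, and one concentrated at the bottom scores close to -1; a set scattered evenly through the list stays near zero.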

  23. Another popular approach: DAVID
      Input: list of genes of interest (without expression values).

  24. Multiple Testing Correction
      Note that statistically you cannot just run individual tests on 1,000 different gene sets. You have to apply further statistical corrections, to account for the fact that even in 1,000 random experiments a handful may come out significant by chance.
      (E.g., experiment = throw a coin 10 times and ask whether it is biased. If you repeat this 1,000 times, you will likely get an all-heads series from a fair coin. You mustn’t deduce that the coin is biased.)
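
The coin example on the multiple-testing slide can be checked numerically. This is a minimal sketch: the function names and the 0.05 significance level are illustrative assumptions, and Bonferroni is used as one simple correction (it is the correction named later in the deck).

```python
def prob_all_heads(tosses: int) -> float:
    """Probability that a fair coin lands heads on every one of `tosses` tosses."""
    return 0.5 ** tosses


def prob_at_least_one(p_single: float, repeats: int) -> float:
    """Probability that an event of probability `p_single` happens at least once
    in `repeats` independent experiments."""
    return 1.0 - (1.0 - p_single) ** repeats


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / num_tests


p_single = prob_all_heads(10)                  # 1/1024 for one experiment
p_any = prob_at_least_one(p_single, 1000)      # at least one all-heads run in 1,000 tries
threshold = bonferroni_threshold(0.05, 1000)   # 0.05 is an assumed significance level

print(f"one experiment, all heads: {p_single:.6f}")
print(f"at least one all-heads run in 1,000 experiments: {p_any:.3f}")
print(f"Bonferroni per-test threshold: {threshold:.5f}")
```

A single all-heads run has probability of about 0.001, which would pass an uncorrected 0.05 cut-off, yet across 1,000 repeats such a run is more likely than not. The corrected per-test threshold of 0.00005 no longer flags it.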

  25. What will you test?
      Also note that this is a very general approach to testing gene lists. Instead of a microarray experiment you can do RNA-seq. Instead of up/down-regulated genes you can test all the genes in a personal genome where you see surprising mutations. Any gene list can be tested.

  26. Gene Sets: Cataloging biological knowledge

  27. Keyword lists are not enough
      [Figure: flat anatomy keywords (organ system, cardiovascular system, heart) vs. an anatomy hierarchy (embryo > organ system > cardiovascular > heart)]
      • Sheer number of terms is too much to remember and sort
      • Need standardized, stable, carefully defined terms
      • Need to describe different levels of detail
      • So… defined terms need to be related in a hierarchy
      • With structured vocabularies/hierarchies:
        • Parent/child relationships exist between terms
        • Increased depth -> increased resolution
        • Can annotate data at appropriate level
        • May query at appropriate level

  28. Annotate genes to most specific terms TJL-2004

  29. General Implementations for Vocabularies
      [Figure: a hierarchy (embryo > organ system > cardiovascular > heart) next to a DAG under molecular function with the terms chaperone regulator, enzyme regulator, enzyme activator, and chaperone activator. A query for a term returns things annotated to its descendants.]
      1. Annotate at appropriate level, query at appropriate level
      2. Queries for higher-level terms include annotations to lower-level terms
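
The two rules on the vocabularies slide (annotate at the most specific level, let higher-level queries include lower-level annotations) can be sketched as a descendant walk over a DAG. The toy term relationships and gene names below are hypothetical, chosen only to echo the terms on the slide; they are not taken from a real ontology release.

```python
from collections import defaultdict

# Hypothetical toy DAG: each term maps to its parent terms.
# A term may have more than one parent, which is what makes this a DAG, not a tree.
PARENTS = {
    "enzyme regulator": ["molecular function"],
    "chaperone regulator": ["molecular function"],
    "enzyme activator": ["enzyme regulator"],
    "chaperone activator": ["chaperone regulator", "enzyme activator"],
}

# Hypothetical annotations: each gene is annotated to its most specific term(s).
ANNOTATIONS = {
    "geneA": ["chaperone activator"],
    "geneB": ["enzyme activator"],
    "geneC": ["chaperone regulator"],
}


def build_children(parents):
    """Invert the child -> parents map into a parent -> children map."""
    children = defaultdict(set)
    for child, parent_terms in parents.items():
        for parent in parent_terms:
            children[parent].add(child)
    return children


def descendants(term, children):
    """Return the term itself plus every term reachable below it."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def query(term, parents, annotations):
    """Genes annotated to `term` or to any of its descendants."""
    terms = descendants(term, build_children(parents))
    return {gene for gene, gene_terms in annotations.items()
            if terms.intersection(gene_terms)}


print(sorted(query("enzyme regulator", PARENTS, ANNOTATIONS)))
print(sorted(query("molecular function", PARENTS, ANNOTATIONS)))
```

Querying the root term returns every annotated gene, while querying a leaf returns only the genes annotated directly to it.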

  30. Gene Sets
      • Gene Ontology (“GO”)
        • Biological Process
        • Molecular Function
        • Cellular Location
      • Pathway Databases
        • KEGG
        • BioCarta
        • Broad Institute

  31. Other Gene Sets
      • Transcription factor targets
        • All the genes regulated by particular TFs
      • Protein complex components
        • Sets of genes whose protein products function together
        • Ion channel receptors
        • RNA / DNA Polymerase
      • Paralogs
        • Families of genes descended (in eukaryotic times) from a common ancestor

  32. Natural Language Processing (NLP) Opportunities
      Map genes to ontology using literature
      [Figure: Ontology, Literature, Genes]

  33. Gene set enrichment analysis: The gene regulatory version

  34. Combinatorial Regulatory Code
      • 2,000 different proteins can bind specific DNA sequences.
      • A regulatory region encodes 3-10 such protein binding sites.
      • When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein.
      [Figure: proteins bound to protein binding sites on the DNA near a gene]

  35. ChIP-Seq: first glimpses of the regulatory genome in action
      [Figure: peak calling; cis-regulatory peak]

  36. What is the transcription factor I just assayed doing?
      • Collect known literature of the form:
        • Function A: Gene1, Gene2, Gene3, ...
        • Function B: Gene1, Gene2, Gene3, ...
        • Function C: ...
      • Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above.
      • Form a hypothesis and perform further experiments.
      [Figure legend: gene transcription start site; cis-regulatory peak]

  37. Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
      • ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]
      • SRF is known as a “master regulator of the actin cytoskeleton”
      • In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
      [Figure legend: gene transcription start site; SRF binding ChIP-seq peak]

  38. Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
      Existing, gene-based method to analyze enrichment:
      • Ignore distal binding events.
      • Count affected genes.
      • Rank by enrichment hypergeometric p-value.
      Worked example, with π denoting an ontology term (e.g. ‘actin cytoskeleton’):
      • N = 8 genes in genome
      • K = 3 genes annotated with π
      • n = 2 genes selected by proximal peaks
      • k = 1 selected gene annotated with π
      • P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
      [Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term π]
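
The hypergeometric tail probability in the worked example can be computed directly from binomial coefficients. This is a minimal sketch using only the standard library; the function name is an illustrative choice, while the symbols N, K, n, and k follow the slide.

```python
from math import comb


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k), where X counts annotated genes among n genes drawn without
    replacement from a genome of N genes, K of which carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)  # X cannot exceed either the sample size or the annotated count
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Values from the slide: 8 genes, 3 annotated, 2 selected, at least 1 annotated hit.
p_value = hypergeom_tail(N=8, K=3, n=2, k=1)
print(f"P = {p_value:.4f}")
```

With these numbers the sum is (15 + 3) / 28, so observing one annotated gene among two selected is entirely unsurprising; the test becomes informative only with realistic genome and gene-set sizes.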

  39. We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for?
      • Pro: A lot of tools out there for the analysis of gene lists.
      • Con: These tools are built for microarray analysis. Does it matter?
      [Figure: microarray data and gene regulation data both fed into a microarray tool]

  40. SRF Gene-based enrichment results
      • Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]
      • SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding
      • Where’s the signal? Top “actin” term is ranked #28 in the list.
      [1] Valouev A. et al., Nat. Methods, 2008

  41. Associating only proximal peaks loses a lot of information
      [Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets]
      Restricting to proximal peaks often leads to complete loss of key enrichments.

  42. Bad Solution: Associating distal peaks brings in many false enrichments
      Why bad?
      • 14% of human genes are tagged ‘multicellular organismal development’, but 33% of base pairs have such a gene nearest upstream/downstream.
      • Large “gene deserts” are often next to key developmental genes.
      • The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
      Term                                    Bonferroni corrected p-value
      nervous system development              5x10^-9
      system development                      8x10^-9
      anatomical structure development        7x10^-8
      multicellular organismal development    1x10^-7
      developmental process                   2x10^-6

  43. Real Solution: Do not convert to gene list. Analyze the set of genomic regions
      GREAT = Genomic Regions Enrichment of Annotations Tool
      Worked example, with π denoting an ontology term (e.g. ‘actin cytoskeleton’):
      • p = 0.33 of genome annotated with π
      • n = 6 genomic regions
      • k = 5 genomic regions hit annotation π
      • P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
      Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at the genome, get NO (false) enrichments.
      [Figure legend: gene regulatory domain; genomic region (ChIP-seq peak); gene transcription start site; ontology term π]
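
The binomial test over genomic regions from the worked example can be evaluated in a few lines. This is a minimal sketch with the standard library; the function name is an illustrative choice, and n, k, and p take the values from the slide.

```python
from math import comb


def binom_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance that at least k of n
    regions land in the annotated fraction p of the genome."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Values from the slide: 6 regions, 5 of them hit a term covering 33% of the genome.
p_value = binom_tail(n=6, p=0.33, k=5)
print(f"P = {p_value:.4f}")

# Under the same null, 2,000 random regions are expected to hit the term
# about 0.33 * 2,000 times, so a hit rate near 33% is not an enrichment.
expected_random_hits = 0.33 * 2000
print(f"expected chance hits among 2,000 random regions: {expected_random_hits:.0f}")
```

Because the null expectation is the fraction of the genome covered by the term rather than the fraction of genes carrying it, terms attached to genes with large flanking deserts are no longer rewarded for simply occupying more sequence.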

  44. How does GREAT know how to assign distal binding peaks to genes?
      • Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
      • Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
      • Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of the transcription start site, which extends to the basal domains of the nearest genes, within 1 Mb.
      • Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks.
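
The default rule described above (a basal domain of 5 kb upstream and 1 kb downstream, extended toward the neighbouring genes' basal domains, up to 1 Mb) can be sketched for a single chromosome as follows. This is not GREAT's actual code: the `Gene` class, function names, and coordinates are hypothetical, the 1 Mb limit is measured from the transcription start site as an assumption, and neighbours are taken in TSS order as a simplification.

```python
from dataclasses import dataclass

BASAL_UPSTREAM = 5_000        # 5 kb upstream of the TSS (from the slide)
BASAL_DOWNSTREAM = 1_000      # 1 kb downstream of the TSS (from the slide)
MAX_EXTENSION = 1_000_000     # 1 Mb limit; measured from the TSS here (an assumption)


@dataclass
class Gene:
    name: str
    tss: int      # transcription start site coordinate
    strand: str   # "+" or "-"


def basal_domain(gene: Gene) -> tuple[int, int]:
    """Strand-aware basal domain as a half-open interval [start, end)."""
    if gene.strand == "+":
        return gene.tss - BASAL_UPSTREAM, gene.tss + BASAL_DOWNSTREAM
    return gene.tss - BASAL_DOWNSTREAM, gene.tss + BASAL_UPSTREAM


def regulatory_domains(genes: list[Gene], chrom_length: int) -> dict[str, tuple[int, int]]:
    """Extend each basal domain to the neighbouring genes' basal domains,
    capped at MAX_EXTENSION and at the chromosome ends."""
    ordered = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in ordered]
    domains = {}
    for i, gene in enumerate(ordered):
        start, end = basal[i]
        left_limit = basal[i - 1][1] if i > 0 else 0
        right_limit = basal[i + 1][0] if i + 1 < len(ordered) else chrom_length
        ext_start = max(left_limit, gene.tss - MAX_EXTENSION)
        ext_end = min(right_limit, gene.tss + MAX_EXTENSION)
        # The basal domain is always kept, even if a neighbour's basal domain overlaps it.
        domains[gene.name] = (max(0, min(start, ext_start)),
                              min(chrom_length, max(end, ext_end)))
    return domains


def genes_for_peak(position: int, domains: dict[str, tuple[int, int]]) -> list[str]:
    """All genes whose regulatory domain contains the peak position."""
    return sorted(name for name, (start, end) in domains.items() if start <= position < end)


# Hypothetical genes on a 5 Mb chromosome.
genes = [Gene("A", 100_000, "+"), Gene("B", 500_000, "-"), Gene("C", 3_000_000, "+")]
domains = regulatory_domains(genes, chrom_length=5_000_000)
for peak in (300_000, 1_800_000, 2_500_000):
    print(peak, genes_for_peak(peak, domains))
```

In this toy layout a peak between two nearby genes is assigned to both, while a peak more than 1 Mb from any transcription start site is assigned to none.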

  45. GREAT infers many specific functions of SRF from its binding profile
      Top GREAT enrichments of SRF (among the top gene-based enrichments of SRF, the top actin-related term is 28th in the list):
      Ontology          Term                     # Genes   Binomial P-value   Experimental support*
      Gene Ontology     actin cytoskeleton       30        7x10^-9            Miano et al. 2007
      Gene Ontology     actin binding            31        5x10^-5            Miano et al. 2007
      Pathway Commons   TRAIL signaling          32        5x10^-7            Bertolotto et al. 2000
      Pathway Commons   Class I PI3K signaling   26        2x10^-6            Poser et al. 2000
      TreeFam           FOS gene family          5         1x10^-8            Chai & Tarnawski 2002
      TF Targets        Targets of SRF           84        5x10^-76           Positive control
      TF Targets        Targets of GABP          28        4x10^-9            ChIP-Seq support
      TF Targets        Targets of YY1           44        1x10^-6            Natesan & Gilman 1995
      TF Targets        Targets of EGR1          23        2x10^-4
      * Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
      Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

  46. Limb P300: I was blind and I can see
      [Figure: Gene List]

  47. GREAT works with ANY cis-regulatory rich set
      Example: GWAS Compendium set
      Height-associated unlinked SNPs

  48. GREAT analysis of histone mark combinations

  49. GREAT includes multiple ontologies
      • Twenty ontologies spanning broad categories of biology
      • 44,832 total ontology terms tested in each GREAT run
      [Figure: term counts for the twenty ontologies: 2,800; 6,700; 5,215; 3,079; 834; 911; 5,781; 615; 427; 19; 456; 222; 9; 150; 1,253; 6,857; 288; 8,272; 706; 238]
      Michael Hiller

  50. Advantages of the GREAT approach
      Tailored to the biology of gene regulation:
      • Distal sites are incorporated, not ignored
      • Variable length gene regulatory domains
      • Multiple bindings next to same target gene rewarded
      • Extensive ontologies, some home-made
