
Stoyan Georgiev advisors: Uwe Ohler and Sayan Mukherjee

Identification of regulatory elements using high-throughput binding evidence. Inference of population structure on large genetic data sets. Stoyan Georgiev advisors: Uwe Ohler and Sayan Mukherjee Computational Biology and Bioinformatics, Duke University February 2011. Outline.


Presentation Transcript


  1. Identification of regulatory elements using high-throughput binding evidence. Inference of population structure on large genetic data sets. Stoyan Georgiev advisors: Uwe Ohler and Sayan Mukherjee Computational Biology and Bioinformatics, Duke University February 2011

  2. Outline • Motif analysis • Transcriptional regulation • genome-wide DNA binding data (Georgiev et al. 2010) • Post-transcriptional regulation • transcriptome-wide RNA binding data (Mukherjee et al., under review; Corcoran* and Georgiev* et al., submitted) • Inference of population structure • randomized algorithm

  3. Motif analysis

  4. Outline • Introduction • Transcriptional regulation • Problem statement • Genomic assays • Statistical framework • Results • Post-transcriptional regulation

  5. Gene regulation [figure labels] • DNA motifs • Transcription • Splicing, Capping, Poly-adenylation • Nucleus • Export • Cytoplasm • Stability • Translation • RNA-binding Proteins (RBP), miRBP, miR-RBP complexes • RNA motifs

  6. Gene regulatory code • Transcriptional regulation: short patterns in DNA (motifs) control the initiation of production of gene transcripts • mechanism: sequence-specific DNA binding proteins (TFs) Motif Discovery Tool: cERMIT (Georgiev et al. 2010) • Post-transcriptional regulation: short patterns in RNA control the utilization of gene transcripts • mechanism: sequence-specific RNA binding proteins (RBPs), or microRNA mediated Motif Analysis Tool (Corcoran* and Georgiev* et al.; Mukherjee et al.)

  7. Transcriptional regulation

  8. Transcriptional regulation • Chromatin arrangement • Activity of transcription factors • intra-cellular environment • cis-regulatory code • DNA methylation • Copy Number Variation

  9. location Simplified abstraction

  10. ChIP-seq

  11. cERMIT • Computational tool for de-novo motif discovery • Predicts the binding motif and functional targets of a specific transcription factor (TF) of interest using genome-wide measurements of binding (e.g. ChIP-seq, ChIP-chip) (Georgiev et al. 2010) • Input: set of sequence regions with assigned binding evidence • Output: ranked list of predicted binding motifs and corresponding target locations

  12. Brief introduction to cERMIT • Binding site representation: consensus sequence • Search for the "best" binding site that explains the genome-wide binding evidence • "best": occurs in regions that tend to have high evidence of being bound (this is formalized as a normalized average score) • can evaluate all possible binding sites up to some reasonable length... in theory • in practice, we try to cover as many as possible • start with all possible 5-mers (AAAAA, AAAAG, AAAAC, ..., TTTTT) • for each, evaluate its "neighbours" and replace it with the "best" one • repeat until no neighbour scores better than the current motif
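The search described on this slide can be sketched as a greedy hill climb over consensus motifs. This is only an illustration under stated assumptions, not the cERMIT implementation: the score is assumed to be a z-like normalized mean of the evidence in regions containing the motif, the neighbourhood is assumed to be single-symbol substitutions plus one-symbol extensions, and the collapsing of reverse-complement seeds is a guess. All function names are hypothetical.

```python
import math
import re
from itertools import product

# IUPAC consensus symbols used in this sketch, as regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}

def revcomp(kmer):
    """Reverse complement of a plain A/C/G/T string."""
    return kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def seed_motifs(k=5):
    """All k-mers, keeping one representative per reverse-complement pair (assumption)."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({min(s, revcomp(s)) for s in kmers})

def matches(motif, seq):
    """True if the consensus motif occurs anywhere in seq (forward strand only)."""
    pattern = "".join(IUPAC[c] for c in motif)
    return re.search(pattern, seq) is not None

def score(motif, regions):
    """Assumed score: normalized average evidence of the regions containing the motif."""
    evidence = [e for _, e in regions]
    mu = sum(evidence) / len(evidence)
    sd = math.sqrt(sum((e - mu) ** 2 for e in evidence) / len(evidence))
    hits = [e for s, e in regions if matches(motif, s)]
    if not hits or sd == 0:
        return float("-inf")
    return (sum(hits) / len(hits) - mu) / (sd / math.sqrt(len(hits)))

def neighbours(motif, max_len=10):
    """Assumed neighbourhood: one-symbol substitutions and one-symbol extensions."""
    out = set()
    for i, c in product(range(len(motif)), IUPAC):
        if c != motif[i]:
            out.add(motif[:i] + c + motif[i + 1:])
    if len(motif) < max_len:
        for c in IUPAC:
            out.update((c + motif, motif + c))
    return out

def evolve(seed, regions):
    """Replace the motif by its best neighbour until no neighbour scores better."""
    current, best = seed, score(seed, regions)
    while True:
        top = max(neighbours(current), key=lambda m: score(m, regions))
        top_score = score(top, regions)
        if top_score <= best:
            return current, best
        current, best = top, top_score

# Toy input: (sequence region, binding evidence) pairs, invented for illustration.
regions = [("GGTGACTCAGG", 9.0), ("ATGACTCATT", 8.0), ("CCCCCCCCCC", 1.0),
           ("ACGTACGTAC", 0.5), ("TTTTGGGGAA", 1.5), ("TGACTAAAAA", 1.0)]
motif, es = evolve("TGACT", regions)
print(motif, round(es, 2))
```

Because the score must strictly increase at every step and the motif length is capped, the loop always terminates. Running `evolve` from every seed and ranking the results by score would give a ranked motif list of the kind the slide describes.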

  13. Algorithmic view [figure] • 512 seed motifs (AAAAA, AAAAG, AAAAC, AAAAT, AAAGA, ..., TTTCT, TTTTA, TTTTG, TTTTC, TTTTT) • evolved motifs (RTGASTCA, TGACTCA, RTGASTCAK, GAWTCAYY, TGAWTCAK, ...) • sequence regions ordered from high evidence to low evidence; example scores ES = 15.0 and ES = 1.5 • ES = normalized average binding evidence

  14. Variable definitions

  15. Motif model

  16. Motif model

  17. ChIP-seq motif discovery input output

  18. Results

  19. ChIP-chip validation • conservation filter improves prediction accuracy (Georgiev et al. 2010)

  20. Example yeast ChIP-chip output GCN4 SKO1

  21. Human ChIP-seq [figure: prediction vs. literature] • CTCF (Barski et al. 2007) • STAT1 (Robertson et al. 2007) • SRF (Valouev et al. 2008)

  22. Post-transcriptional control

  23. Gene regulatory code • Post-transcriptional regulation: short patterns in RNA control the utilization of gene transcripts • mechanism: sequence-specific RNA binding proteins (RBPs), or microRNA mediated to control translation Motif Analysis Tool (Corcoran*, Georgiev* et al.; Mukherjee et al.)

  24. PAR-CLIP • CLIP: Cross linking and immunoprecipitation • a method of transcriptome-wide identification of RNA-protein interaction sites – problem, quite noisy • PAR-CLIP = CLIP + photoactivatable nucleotides • more efficient cross linking • directly observable evidence of Protein-RNA cross linking: upon reverse transcription T->C conversion near or at the interaction site
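The T->C evidence mentioned on this slide can be illustrated with a small helper that lists the positions where a reference T is read as C. This is a minimal sketch assuming an ungapped, same-length alignment on the sense strand; the function name and the toy sequences are hypothetical and not taken from any PAR-CLIP pipeline.

```python
def t_to_c_positions(reference, read):
    """Return 0-based positions where the reference has T and the read has C."""
    if len(reference) != len(read):
        raise ValueError("sketch assumes an ungapped alignment of equal length")
    pairs = zip(reference.upper(), read.upper())
    return [i for i, (ref_base, read_base) in enumerate(pairs)
            if ref_base == "T" and read_base == "C"]

# Toy example: one conversion at position 3; the T at position 4 is unchanged.
print(t_to_c_positions("ACGTTGCA", "ACGCTGCA"))
```

Summing such counts over the reads in a cluster gives a simple per-site measure of cross-linking evidence, which could serve as the binding evidence in a motif search.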

  25. PAR-CLIP 1. culture with 4-SU 2. cross-link 3. Immunoprecipitate & size-select 4. convert into a cDNA library & sequence [Hafner et al. 2010]

  26. RBP motif analysis pipeline RBP cERMIT Motif predictions Motif seeds

  27. Modified motif score

  28. Variable definitions

  29. Motif model

  30. Motif model

  31. Motif model

  32. Results

  33. predicted motif Pumilio • 2 million mapped reads • # clusters with site / total # clusters = 1,162 / 8,483 (Hafner et al. 2010)

  34. Summary • cERMIT: motif discovery using genome-wide binding data • identify motifs that are highly enriched in targets with high binding evidence. • applicable to RNA and DNA binding data • adjust for sequence biases and other potential confounders using linear regression framework • In progress… • Bayesian formulation • improve stability of predictions • more comprehensive search

  35. Inference of population structure and generalized eigendecomposition

  36. Outline • Motivation • Current approaches • Extensions • large data sets • supervised dimension reduction • Empirical results • Wishart simulation • WTCCC Crohn’s disease data set

  37. Motivation • A classic problem in biology and genetics is to study population structure (Cavalli-Sforza 1978, 2003) • Genotype data on millions of loci and thousands of individuals • Can we detect structure based on the genetic data? • infer population demographic histories • correct for population structure in disease association studies • correspondence to geography

  38. Current approaches • Structure (Pritchard et al. 2000): Bayesian model-based clustering of genotype data • Eigenstrat (Patterson et al. 2006): PCA-based inference of axes of genetic variation

  39. Population structure within Europe (Novembre et al. 2008)

  40. Eigenstrat (Patterson et al. 2006) • Combines Principal Component Analysis and Random Matrix Theory

  41. Eigenstrat (Patterson et al. 2006) • Runtime: O(m^2 n) computation • The challenge: future (current?) genetic data sets with n ≥ 500,000 and m ≥ 20,000 (e.g. WTCCC, Nature 2007: 17,000 individuals, 500K SNP array) • Can we extend Eigenstrat so that it runs on such data on a standard desktop? • Assume low rank, k << min(m,n) • Approximate algorithm in O(kmn) computation

  42. Randomized PCA Basic steps: • Random projection (approx. preserves distances) • project data onto low dimensional space • do SVD on Y -- similar to SVD on M • Power method: when spectrum decay is slow
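The basic steps on this slide can be sketched with NumPy. This is a minimal sketch, not the talk's code: M and Y follow the slide's names, while `k`, `oversample`, `n_iter` and the function name are assumptions. It uses a common variant of the randomized range finder in which the SVD is taken of the small matrix Q^T M (Q being an orthonormal basis of Y) rather than of Y itself, and in which the power iterations are re-orthonormalized for numerical stability.

```python
import numpy as np

def randomized_svd(M, k, n_iter=0, oversample=10, seed=0):
    """Approximate the top-k singular triplets of M by random projection."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    width = min(k + oversample, m, n)          # a few extra columns improve accuracy
    Omega = rng.standard_normal((n, width))    # random test matrix
    Y = M @ Omega                              # project onto a low-dimensional space
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis for range(Y)
    for _ in range(n_iter):                    # power iterations for slow spectral decay
        Z, _ = np.linalg.qr(M.T @ Q)
        Q, _ = np.linalg.qr(M @ Z)
    B = Q.T @ M                                # small (width x n) matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Toy check on an exactly rank-5 matrix (invented data).
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_svd(M, k=5)
print(np.allclose(s, np.linalg.svd(M, compute_uv=False)[:5]))
```

Each pass over M costs on the order of kmn operations, so with `n_iter` power iterations the total is O(ikmn) up to constants, compared with O(m^2 n) for the exact decomposition.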

  43. Properties of Randomized PCA • Error bound on the rank-k approximation: power iteration drives the leading constant to one exponentially fast as i increases! • Top k eigenvalues and eigenvectors can be well approximated in time O(ikmn) • rapid convergence when close to low rank structure (i = 1-3) • slowly decaying singular values require more iterations • Clearly no benefit when ik ≈ m << n

  44. Properties of Randomized PCA • Empirical observations • we don’t seem to need power iteration, as random projection is good enough (data is low rank) • eigenvalue estimates can be “sloppy” if the emphasis is on subspace estimation, assuming a spectral gap • often we care mainly about subspace estimation accuracy

  45. Generalized eigendecomposition • (Semi) supervised dimension reduction • add prior information by means of class labels • linear and non-linear variations: (L)SIR (Li et al. 1991, Wu et al. 2010) • (Non-) linear embeddings • Laplacian Eigenmaps (Belkin and Niyogi 2002) • Locality Preserving Projections (He and Niyogi, 2003) • Canonical Correlation Analysis

  46. Empirical results • Wishart Covariance Structure • independent N(0,1) entries for data matrix • The Wellcome Trust Case Control Consortium (Nature 2007) • Crohn’s Disease; 500K SNP array; 5,000 individuals

  47. Subspace distance metric • Exact method -- subspace A, approx. method -- subspace B (consider column spaces) • Construct projection operators • Define distance metric: (Ye and Weiss, 2003)
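The formula for the distance did not survive in this transcript, so the following is a sketch under an explicit assumption rather than the metric used in the talk: it measures the distance between the column spaces of A and B as the spectral norm of the difference of their orthogonal projectors. For subspaces of equal dimension this equals the sine of the largest principal angle, so it lies between 0 (same subspace) and 1. The function names are hypothetical.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto the column space of A (A assumed full column rank)."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

def subspace_distance(A, B):
    """Assumed metric: spectral norm of P_A - P_B (not confirmed by the source)."""
    return np.linalg.norm(projector(A) - projector(B), ord=2)

# Toy check in R^4: same span with a different basis, then an orthogonal subspace.
A = np.eye(4)[:, :2]
same = A @ np.array([[1.0, 1.0], [0.0, 1.0]])
orthogonal = np.eye(4)[:, 2:]
print(subspace_distance(A, same), subspace_distance(A, orthogonal))
```

Because the projector depends only on the subspace and not on the basis chosen for it, the comparison between an exact and an approximate method is unaffected by sign flips or rotations of the estimated eigenvectors.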

  48. Wishart covariance Data matrix: independent N(0,1) entries Runtime improvement over exact

  49. Spiked Wishart (rank = 5)

  50. WTCCC Crohn’s disease data set
