Understanding GWAS SNPs Xiaole Shirley Liu Stat 115/215
GWAS SNPs • Association <> Causal • What’s the most likely causal SNP / Gene in LD with the genotyped SNP? • Use functional genomics to identify the disease tissue of origin • What’s the SNP doing in non-coding regions? RSNPs
Use Literature & Pathway Information to Identify Putative Causal SNPs / Genes
Literature Mining Terms • Corpus: Collection of documents. E.g. all papers in PubMed • Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper • Document frequency: Number of documents a word appears in. E.g. 1234x papers have the word “transcription” • Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789X times in all PubMed-indexed papers • Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an • Stemming: Group together different variations of the same word. E.g. activate vs. activated vs. activating
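The three frequency definitions above can be illustrated on a toy corpus. This is a minimal sketch: the three sentences and the stop-word set are made up for illustration, and stemming is not implemented.

```python
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
corpus = [
    "the polymerase binds the promoter",
    "transcription is activated by the factor",
    "the factor regulates transcription and transcription elongation",
]

# A tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "is", "by", "and", "to", "an"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


docs = [tokenize(d) for d in corpus]

# Term frequency: word counts within each document.
term_freq = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each word.
doc_freq = Counter(w for doc in docs for w in set(doc))

# Collection frequency: total occurrences of each word in the corpus.
coll_freq = Counter(w for doc in docs for w in doc)

print(term_freq[2]["transcription"])  # count within the third document
print(doc_freq["transcription"])      # number of documents containing it
print(coll_freq["transcription"])     # total count across the corpus
```

Document frequency counts each document once (hence the `set`), while collection frequency counts every occurrence.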
Documents Represented as Vectors • A document is summarized as a vector of word counts. • Each dimension contains the number of times a word appears. • Can calculate similarity between two documents by comparing their vectors • “Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.” acid 2 amino 2 analysis 1 comparison 1 control 1 environments 3 […] our 1
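To compare two documents, their word counts have to be laid out on the same dimensions. A minimal sketch, assuming a shared sorted vocabulary and naive punctuation stripping; the two short texts and the helper names are hypothetical:

```python
from collections import Counter
import string


def word_counts(text):
    """Count words after lowercasing and stripping ASCII punctuation."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


def to_vectors(text_a, text_b):
    """Project two documents onto one shared, sorted vocabulary."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for missing words, so absent terms become zeros.
    vec_a = [counts_a[w] for w in vocab]
    vec_b = [counts_b[w] for w in vocab]
    return vocab, vec_a, vec_b


# Hypothetical example texts.
doc1 = "Our analysis includes comparison of amino acid environments."
doc2 = "Random control environments differ from amino acid environments."

vocab, v1, v2 = to_vectors(doc1, doc2)
for word, n1, n2 in zip(vocab, v1, v2):
    print(word, n1, n2)
```

Each position in `v1` and `v2` refers to the same word, which is what makes a vector comparison meaningful.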
Comparing Two Documents • Intuitive comparison between two papers: the correlation coefficient of their word occurrence vectors • Correlation measures the strength of the linear relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b)   # 0.985615 (correlated)
cor(b, c)   # -0.110328 (not correlated)
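The same correlations can be reproduced without R. The sketch below computes the Pearson correlation coefficient from its definition on the slide's three vectors, using only the Python standard library:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)


# Word occurrence vectors from the slide.
a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))  # 0.985615
print(round(pearson(b, c), 6))  # -0.110328
```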
Term Weighting Considerations • Give different terms different weight • Global weight • Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene • Local weight • Term frequency: More frequent, more weight: 1 + log(tf). E.g. progesterone: 10 times in paper1 vs 3 in paper2 • Document length: Less weight for longer document. E.g. paper1 200 pages vs paper2 3 pages
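The two weights with explicit formulas can be written down directly. In this sketch the corpus size and document frequencies are made-up numbers, natural logarithms are assumed, and document-length normalization is left out because the slide gives no formula for it:

```python
import math


def global_weight(n_docs, doc_freq):
    """Global weight log(N / df): rarer terms get more weight."""
    return math.log(n_docs / doc_freq)


def local_weight(term_freq):
    """Local weight 1 + log(tf); zero if the term is absent."""
    return 1 + math.log(term_freq) if term_freq > 0 else 0.0


# Hypothetical corpus statistics, for illustration only.
N_DOCS = 1000
DOC_FREQ = {"progesterone": 10, "gene": 800}

for term, df in DOC_FREQ.items():
    print(term, "global weight:", round(global_weight(N_DOCS, df), 3))

# "progesterone" appears 10 times in paper1 and 3 times in paper2.
print("paper1 local weight:", round(local_weight(10), 3))
print("paper2 local weight:", round(local_weight(3), 3))
```

The logarithm dampens both effects: a term that appears ten times gets more weight than one that appears three times, but not more than three times as much.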
Evaluate Relatedness of Papers • Related Articles • Similarity between two documents: sum over all shared terms of (local wt1 × local wt2 × global wt) • Pre-computed related articles for each citation • Rank ordered by relevance
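Reading the similarity formula as a sum over the terms two documents share, the scoring and ranking step might look like the following sketch. It assumes local weight 1 + log(tf) and global weight log(N / df); the function name, papers, and corpus statistics are all hypothetical.

```python
import math
from collections import Counter


def similarity(doc1, doc2, doc_freq, n_docs):
    """Sum of local_wt1 * local_wt2 * global_wt over terms in both documents."""
    tf1, tf2 = Counter(doc1), Counter(doc2)
    score = 0.0
    for term in tf1.keys() & tf2.keys():
        local1 = 1 + math.log(tf1[term])
        local2 = 1 + math.log(tf2[term])
        global_wt = math.log(n_docs / doc_freq[term])
        score += local1 * local2 * global_wt
    return score


# Hypothetical corpus statistics and tokenized papers.
N_DOCS = 1000
DOC_FREQ = {"progesterone": 10, "receptor": 50, "gene": 800}

query = ["progesterone", "receptor", "gene"]
candidates = {
    "paper_a": ["progesterone", "progesterone", "receptor"],
    "paper_b": ["gene", "gene", "gene"],
}

# Rank candidate papers by similarity to the query, most related first.
ranked = sorted(
    ((name, similarity(query, doc, DOC_FREQ, N_DOCS))
     for name, doc in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(name, round(score, 3))
```

A paper sharing rare terms with the query outranks one that repeats a common term, which is the intended effect of the global weight.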
GRAIL: Gene Relationships Across Implicated Loci (Raychaudhuri et al., PLoS Genetics 2009)
GRAIL on Crohn’s Disease • Use literature / pathways to identify potential causal gene • Find likely reproducible SNP hits, and increase statistical power
GWAS SNPs • Association <> Causal • What’s the most likely causal SNP / Gene in LD with the genotyped SNP? • Use functional genomics to identify the disease tissue of origin • What’s the SNP doing in non-coding regions? RSNPs
Identifying Causal Cell-type for Complex Disease • E.g. Rheumatoid Arthritis (RA) • Many cell types have been implicated over the years, including neutrophils, synoviocytes, and all classes of lymphocytes • It is difficult to establish causality for complex phenotypes in humans • Use expression data: comprehensive, unbiased, and publicly available
Immunological Genome Project • Start with a list of disease SNPs • Find genes near the SNP that are specifically expressed in a cell type • Identify cell types that have many such genes ... more than expected by chance
Identifying Causal Cell-type for Complex Disease From Expression • Negative control: simulation from random set of SNPs • P-value: proportion of simulations exceeding the observed enrichment (Hu et al., American Journal of Human Genetics, 2011)
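The enrichment test can be sketched as a simple simulation. This is an illustration under stated assumptions, not the published method: the SNP-to-gene map, the set of cell-type-specific genes, and all function names are hypothetical, random SNP sets are drawn uniformly without matching on any SNP properties, and "exceeding" is taken as greater than or equal to the observed count.

```python
import random


def count_specific(snps, snp_to_genes, specific_genes):
    """Count SNPs with at least one nearby gene specific to the cell type."""
    return sum(1 for s in snps if snp_to_genes.get(s, set()) & specific_genes)


def enrichment_p_value(disease_snps, all_snps, snp_to_genes, specific_genes,
                       n_sim=1000, seed=0):
    """Empirical p-value: share of random SNP sets scoring >= the observed."""
    rng = random.Random(seed)
    observed = count_specific(disease_snps, snp_to_genes, specific_genes)
    exceed = 0
    for _ in range(n_sim):
        sample = rng.sample(all_snps, len(disease_snps))
        if count_specific(sample, snp_to_genes, specific_genes) >= observed:
            exceed += 1
    return observed, exceed / n_sim


# Toy data: 200 SNPs, each near one gene; 10% of genes are cell-type specific.
all_snps = [f"snp{i}" for i in range(200)]
snp_to_genes = {f"snp{i}": {f"g{i}"} for i in range(200)}
specific_genes = {f"g{i}" for i in range(20)}
disease_snps = [f"snp{i}" for i in range(10)]  # all near specific genes

observed, p_value = enrichment_p_value(
    disease_snps, all_snps, snp_to_genes, specific_genes
)
print("observed:", observed, "empirical p-value:", p_value)
```

Repeating this for each cell type and comparing the p-values points to the cell types whose specifically expressed genes sit near disease SNPs more often than chance predicts.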
GWAS SNPs • Association <> Causal • What’s the most likely causal SNP / Gene in LD with the genotyped SNP? • Use functional genomics to identify the disease tissue of origin • What’s the SNP doing in non-coding regions? eQTL and RSNPs
eQTL • eQTL: use expression as phenotype • Are there SNPs that are associated with expression changes? • Heritable genetic variation for transcription levels
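Treating expression as the phenotype, a single SNP-gene test can be sketched as a linear regression of expression on genotype dosage. The additive 0/1/2 coding, the toy values, and the function name are assumptions for illustration; a real analysis would also compute a p-value and correct for covariates and multiple testing.

```python
def eqtl_fit(genotypes, expression):
    """Least-squares slope and R^2 of expression regressed on dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    s_ge = sum((g - mean_g) * (e - mean_e)
               for g, e in zip(genotypes, expression))
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_ee = sum((e - mean_e) ** 2 for e in expression)
    slope = s_ge / s_gg               # expression change per alternate allele
    r_squared = s_ge ** 2 / (s_gg * s_ee)
    return slope, r_squared


# Hypothetical data: alternate-allele counts and expression for 10 individuals.
genotypes = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
expression = [5.1, 4.9, 6.0, 6.2, 7.1, 6.9, 5.0, 5.9, 7.0, 6.1]

slope, r_squared = eqtl_fit(genotypes, expression)
print("slope:", round(slope, 3), "R^2:", round(r_squared, 3))
```

A slope far from zero with a high R² suggests that the SNP genotype explains much of the variation in expression of that gene in this toy sample.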
RSNPs • A SNP influences TF binding, affecting downstream (disease-related) gene expression
eQTL and RSNPs • eQTL: use expression as phenotype • Are there SNPs that are associated with expression changes? • Heritable genetic variation for transcription levels • RSNP: regulatory SNP • Much of the influential variation is located cis- to the coding locus • In humans, mouse, and maize, 35%-50% of the genetic basis for intraspecific differences in transcription level is cis- to the coding locus (e.g. Morley et al. 2004; Schadt et al. 2003; Stranger et al. 2005; Cheung et al. 2005, etc.).
RSNPs from GWAS • Enriched in regulatory sequences (promoters and enhancers) that are identified through histone mark ChIP-seq or DNase-seq (Maurano et al., Science 2012)
Highest Correlated Genes of Distal DHSs Harboring GWAS Variants
Trans-Effect of Cis-SNPs • Three risk loci for ESR1, MYC, and KLF4 • Effect on TF expression is small, but much stronger when looking at the expression of their downstream target genes (Li et al., Cell 2013)
Useful Tools to Understand RSNPs • Identify putative TFs whose binding might be influenced by SNPs based on ENCODE ChIP-seq / DNase-seq data
Understanding GWAS SNPs • Association <> Causal • Use literature and pathways to identify the putative causal SNP / Gene in LD with the genotyped SNP • Use (cell-type specific) expression and epigenomics to: • Identify the disease tissue of origin • Identify regulatory SNPs that affect TF binding and influence the expression of important downstream disease genes
Acknowledgement • Soumya Raychaudhuri • Manolis Dermitzakis