LS-SNP: Large-scale annotation of coding non-synonymous SNPs based on multiple information sources

Bioinformatics, April 2005

Motivation • Over 9 million SNPs in dbSNP with little functional annotation • nsSNPs are of critical importance for disease and drug sensitivity • Prediction of functional SNPs enables targeting of SNPs to be genotyped in candidate gene studies • Helps identify the causative SNP among SNPs that are in LD

Aims • Identify candidate functional SNPs in a • gene • haplotype • pathway • Map nsSNPs onto protein sequences, functional pathways, and comparative structure models

Predictions of SNP function • Predict positions where nsSNPs • rule based: • destabilize proteins • interfere with formation of domain-domain interfaces • affect protein-ligand binding • supervised learning (SVM): • severely affect human health

Methods - pipeline • SNP-protein mapping • Sequence to structure (experimentally derived) • genomic seq, protein seq, protein structure • SNP prediction annotations combine: • rule based • supervised learning (SVM)

SNP Annotations-rule based • destabilizing (Sunyaev et al., 2001) if: • (1) RSA (relative solvent accessibility) < 25% and difference in accessible surface propensities (knowledge-based hydrophobic potentials) > 0.75 • (2) RSA > 50% and difference in accessible surface propensities > 2 • (3) RSA < 25% and charge change • (4) variant involves a proline in a helix
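The four rules above can be sketched as a small Python check. This is an illustrative sketch only, not the LS-SNP implementation: the `Variant` fields and the function name are hypothetical, the rule numbers simply follow the order on the slide, and the propensity difference is assumed to be passed in as a non-negative magnitude.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    """Hypothetical per-variant features; names are assumptions, not LS-SNP's."""
    rsa: float               # relative solvent accessibility, in percent (0-100)
    propensity_diff: float   # difference in accessible surface propensities
    charge_change: bool      # substitution changes the residue charge
    involves_proline: bool   # wild-type or variant residue is proline
    in_helix: bool           # position lies in a helix


def destabilizing_rules(v: Variant) -> List[int]:
    """Return the numbers of the rules that fire, in slide order."""
    fired = []
    if v.rsa < 25 and v.propensity_diff > 0.75:   # rule 1: buried, propensity change
        fired.append(1)
    if v.rsa > 50 and v.propensity_diff > 2:      # rule 2: exposed, large propensity change
        fired.append(2)
    if v.rsa < 25 and v.charge_change:            # rule 3: buried charge change
        fired.append(3)
    if v.involves_proline and v.in_helix:         # rule 4: proline in a helix
        fired.append(4)
    return fired


buried = Variant(rsa=10, propensity_diff=1.0, charge_change=True,
                 involves_proline=False, in_helix=False)
print(destabilizing_rules(buried))
```

A variant would be flagged as destabilizing when the returned list is non-empty.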

rule based (cont.) • Interference with a domain-domain interface is predicted if: • any of the 4 rules applies and • the residue is within ≤ 6 Å of an atom in an adjacent domain • An effect on protein-ligand binding is predicted if: • any of the 4 rules applies and • the residue is within ≤ 5 Å of a HETATM • (the HETATM is not covalently bonded to the protein, is not one of the 20 amino acids, and is not in a water molecule)
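The distance criteria on this slide can be sketched as follows. This is a minimal sketch under assumptions: coordinates are plain (x, y, z) tuples in Å, the function names are hypothetical, the ligand atoms are assumed to be pre-filtered HETATMs (not covalently bonded, not amino acids, not water), and "any of 4 rules" is read as "at least one of the destabilizing rules fired".

```python
import math
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float, float]

DOMAIN_CUTOFF = 6.0  # Å, distance to an atom in an adjacent domain (from the slide)
LIGAND_CUTOFF = 5.0  # Å, distance to a HETATM (from the slide)


def within(atoms_a: Sequence[Coord], atoms_b: Sequence[Coord], cutoff: float) -> bool:
    """True if any atom pair across the two sets is at most `cutoff` Å apart."""
    return any(math.dist(a, b) <= cutoff for a in atoms_a for b in atoms_b)


def annotate_contacts(rule_fired: bool,
                      residue_atoms: Sequence[Coord],
                      adjacent_domain_atoms: Sequence[Coord],
                      ligand_atoms: Sequence[Coord]) -> Dict[str, bool]:
    """Combine the rule result with the two distance checks."""
    return {
        "domain_interface": rule_fired and within(residue_atoms, adjacent_domain_atoms, DOMAIN_CUTOFF),
        "ligand_binding": rule_fired and within(residue_atoms, ligand_atoms, LIGAND_CUTOFF),
    }


result = annotate_contacts(
    rule_fired=True,
    residue_atoms=[(0.0, 0.0, 0.0)],
    adjacent_domain_atoms=[(0.0, 0.0, 5.5)],
    ligand_atoms=[(0.0, 0.0, 5.5)],
)
print(result)
```

In the toy call, the same 5.5 Å separation passes the 6 Å domain cutoff but fails the 5 Å ligand cutoff.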

SNP Annotations-supervised learning (SVM) • train an SVM to discriminate between monogenic disease nsSNPs from OMIM and neutral SNPs from dbSNP • feature labels on slide: (measure of strain), (chemical similarity)

svm – training dataset • 1457 disease-associated • variants in Swiss-Prot and OMIM • 2504 neutral • neutral variants according to rules 1-4 • 3-fold cross validation • train on subsets 1 and 2, test on subset 3 • repeated 10 times
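The cross-validation scheme on this slide can be sketched with the standard library alone. This is a sketch under assumptions: the function name is hypothetical, each of the three subsets is assumed to take a turn as the test set (the slide only shows one rotation), and the dataset is reshuffled for each of the 10 repeats. No SVM is trained here; only the index splits are produced.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_items: int, k: int = 3, repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)                        # new random partition per repeat
        folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train, test


n_variants = 1457 + 2504  # disease-associated plus neutral variants from the slide
splits = list(repeated_kfold(n_variants))
print(len(splits))
```

Each split would then be used to train on two subsets and test on the third, with the reported error being the average over all splits.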

svm – training dataset (cont.) • the absolute value of the SVM output gives the confidence • low-confidence predictions are excluded • accuracy of 80.5% (±0.3%) • false positives 19.7% (±0.2%) • false negatives 18.7% (±0.8%) • 122 rejected on low confidence
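Rejecting low-confidence predictions and then computing the error rates could look like the sketch below. Everything here is an assumption for illustration: the function name, the sign convention (positive output = disease, labels +1 / -1), the threshold value, and the definitions of the rates (false positives over all neutral variants kept, false negatives over all disease variants kept). The scores are toy values, not data from the paper.

```python
from typing import Dict, Sequence


def evaluate_with_rejection(scores: Sequence[float], labels: Sequence[int],
                            threshold: float) -> Dict[str, float]:
    """Drop predictions with |score| below `threshold`, then score the rest."""
    kept = [(s, y) for s, y in zip(scores, labels) if abs(s) >= threshold]
    tp = sum(1 for s, y in kept if s > 0 and y == 1)
    tn = sum(1 for s, y in kept if s <= 0 and y == -1)
    fp = sum(1 for s, y in kept if s > 0 and y == -1)
    fn = sum(1 for s, y in kept if s <= 0 and y == 1)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "rejected": len(scores) - len(kept),
        "accuracy": ratio(tp + tn, len(kept)),
        "false_pos_rate": ratio(fp, fp + tn),
        "false_neg_rate": ratio(fn, fn + tp),
    }


toy_scores = [2.0, -1.5, 0.1, 1.2, -0.9]   # hypothetical signed SVM outputs
toy_labels = [1, -1, -1, -1, 1]            # +1 disease, -1 neutral
summary = evaluate_with_rejection(toy_scores, toy_labels, threshold=0.5)
print(summary)
```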

Results-mapping • SNP to protein mapping • 28,043 (21,255 dbSNP) validated coding nsSNPs • 70,147 (54,048 dbSNP) including non-validated

Results-structure • 13,391 (53%) proteins have modelled domains with equivalent residues • 13,062 (19%) nsSNPs (all) • 8725 (31%) nsSNPs (validated) • 67 nsSNPs appear in more than one protein (alternative splicing)

Results-function • 1886 destabilizing nsSNPs (structural rules 1-4) • 1317 monogenic disease-associated nsSNPs by SVM • comparative models • conservation • substitution properties

Web resource • http://alto.compbio.ucsf.edu/LS-SNP/ • links to: SCOP • Swiss-Prot • KEGG • UCSC • PDBsum • ModBase • filter by: KEGG pathway, SNP id (rs), HUGO, Swiss-Prot

[Figure: pipeline diagram. Labels: genomic seq, protein seq, structure, SNP prediction annotations]

Discussion-data quality • validated or non-validated SNPs? • a SNP counts as validated by: • multiple independent submissions • submitter confirmation • alleles observed in at least 2 chromosomes • submission to HapMap • LS-SNP reports both non-validated and validated SNPs, with an option to filter

Discussion-ligands • the local structural environment of each SNP-ligand contact cannot be evaluated by the pipeline • all contacts are reported • some will not be biologically interesting • e.g. a SNP in proximity to glycerol will usually have no functional effect • but in glycerol kinase, the SNP could be important

Discussion-structural annotations • ModSNP: 4109 structural annotations, 70% sequence identity cutoff • LS-SNP: 13,062 dbSNP rsIDs (4907 validated) with structural annotations, no sequence identity cutoff • instead, a score (0-1) is given based on sequence identity and model assessment (average identity ~28%)

Discussion-structural annotations • ‘…because structure annotations are models, use properties that depend on correct fold assignment and a good target-template alignment, as opposed to atomic-level structural details such as loss of either salt bridges or hydrogen or disulphide bonds.’

Discussion -structural annotations • not possible to model effects such as changes in backbone geometry • or small side chain alterations

Case study-Glutathione S-Transferase • GSTs play a key role in cellular detoxification • domain interface • buried charge change • unfavourable change in accessible surface potential at a buried position • conserved in mouse, rat, chicken • the combination of information sources builds a convincing case

Caveats • only updated twice a year • dependent on structure (comparative modelling) • allowing predictions without structure data would have increased the numbers • no option to add your own SNPs • no indication of which predictors are best • or which combinations of predictors • domain-domain or ligand binding is reported, but with no indication of how damaging this might be • next version will have HapMap SNPs • SVM covers monogenic disease only • only a small subset of Sunyaev's rules was chosen - conservation?
