Bioinformatics, April 2005 • LS-SNP: Large-scale annotation of coding non-synonymous SNPs based on multiple information sources
Motivation • Over 9 million SNPs in dbSNP, with little functional annotation • nsSNPs are of critical importance for disease and drug sensitivity • Predicting functional SNPs enables targeting of the SNPs to be genotyped in candidate gene studies • Helps identify the causative SNP among SNPs that are in LD (linkage disequilibrium)
Aims • Identify candidate functional SNPs in a • gene • haplotype • pathway • Map nsSNPs onto protein sequences, functional pathways and comparative structure models
Predictions of SNP function • Predict positions where nsSNPs • rule based: • destabilize proteins • interfere with formation of domain-domain interfaces • affect protein-ligand binding • supervised learning (SVM): • severely affect human health
Methods - pipeline • SNP-protein mapping • Sequence to structure (experimentally derived) • genomic seq, protein seq, protein structure • SNP prediction annotations combine: • rule based • supervised learning (SVM)
SNP Annotations - rule based • destabilizing (Sunyaev et al., 2001) if any of rules 1-4 holds: • RSA (relative solvent accessibility) < 25% and difference in accessible surface propensities (knowledge-based hydrophobic potentials) > 0.75 • RSA > 50% and difference in accessible surface propensities > 2 • RSA < 25% and charge change • variant involves a proline in a helix
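The four rules on this slide can be sketched as a small predicate. This is only an illustration under stated assumptions: the `Substitution` record and its field names are hypothetical, RSA is taken as a percentage, and the propensity difference is assumed to be supplied already as the magnitude the slide refers to. It is not the LS-SNP implementation.

```python
from dataclasses import dataclass


@dataclass
class Substitution:
    """Hypothetical per-variant record; field names are assumptions."""
    rsa: float              # relative solvent accessibility, in percent
    propensity_diff: float  # difference in accessible surface propensities
    charge_change: bool     # substitution changes the residue charge
    involves_proline: bool  # wild-type or variant residue is proline
    in_helix: bool          # position lies in a helix


def destabilizing_rules(s: Substitution) -> list[int]:
    """Return the numbers (1-4) of the slide's rules that fire."""
    fired = []
    if s.rsa < 25 and s.propensity_diff > 0.75:
        fired.append(1)  # buried, unfavourable propensity change
    if s.rsa > 50 and s.propensity_diff > 2:
        fired.append(2)  # exposed, large propensity change
    if s.rsa < 25 and s.charge_change:
        fired.append(3)  # buried charge change
    if s.involves_proline and s.in_helix:
        fired.append(4)  # proline in a helix
    return fired


def is_destabilizing(s: Substitution) -> bool:
    """A variant is flagged if at least one rule fires."""
    return bool(destabilizing_rules(s))
```

Returning the rule numbers rather than a single boolean keeps track of which rule triggered the flag, which helps when several information sources are later combined.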
rule based (cont.) • Interference with a domain-domain interface is predicted if: • any of the 4 rules applies and • the residue is within 6 Å of an atom in an adjacent domain • An effect on protein-ligand binding is predicted if: • any of the 4 rules applies and • the residue is within 5 Å of a HETATM (ligand atom) • (ligand not covalently bonded to the protein, not one of the 20 amino acids and not in a water molecule)
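The two distance criteria can be sketched as follows. This is a minimal illustration, assuming atoms are given as (x, y, z) coordinates in ångströms, that the caller has already filtered the ligand atoms (HETATM records that are not covalently bonded, not amino acids and not water), and that "any of the 4 rules" means at least one destabilizing rule fired. The function and constant names are hypothetical.

```python
import math

DOMAIN_CUTOFF_A = 6.0  # cutoff to an atom in an adjacent domain (from the slide)
LIGAND_CUTOFF_A = 5.0  # cutoff to a ligand HETATM (from the slide)


def within(residue_atoms, other_atoms, cutoff):
    """True if any residue atom lies within `cutoff` of any other atom."""
    return any(
        math.dist(a, b) <= cutoff
        for a in residue_atoms
        for b in other_atoms
    )


def contact_annotations(residue_atoms, adjacent_domain_atoms, ligand_atoms, rule_fired):
    """Combine the rule flag with the two proximity checks."""
    return {
        "domain_interface": rule_fired
        and within(residue_atoms, adjacent_domain_atoms, DOMAIN_CUTOFF_A),
        "ligand_binding": rule_fired
        and within(residue_atoms, ligand_atoms, LIGAND_CUTOFF_A),
    }
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a single residue; a spatial index would be the usual choice for whole-proteome runs.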
SNP Annotations - supervised learning (SVM) • train an SVM to discriminate between monogenic disease nsSNPs from OMIM and neutral SNPs from dbSNP • [Figure labels: measure of strain; chemical similarity]
SVM – training dataset • 1457 disease-associated • variants in Swiss-Prot and OMIM • 2504 neutral • variants that are neutral according to rules 1-4 • 3-fold cross-validation • train on subsets 1 and 2, test on subset 3 • repeated 10 times
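The cross-validation scheme can be sketched with the standard library alone. The slide does not say how the subsets were drawn, so this sketch assumes a fresh random shuffle for each of the 10 repeats and rotates which of the three subsets is held out; `three_fold_splits` is a hypothetical helper name.

```python
import random


def three_fold_splits(n_items, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated 3-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # Deal the shuffled indices into three near-equal subsets.
        folds = [indices[i::3] for i in range(3)]
        for held_out in range(3):
            test = folds[held_out]
            train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
            yield train, test
```

With the 1457 disease-associated and 2504 neutral variants from the slide, `three_fold_splits(1457 + 2504)` yields 30 train/test pairs, each training on two thirds of the data and testing on the remaining third.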
SVM – training dataset • the absolute value of the SVM score gives the confidence • exclude low-confidence predictions • accuracy of 80.5% (±0.3%) • false positives 19.7% (±0.2%) • false negatives 18.7% (±0.8%) • 122 rejected on low confidence
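The rejection step and the reported rates can be illustrated as below. The slide gives neither the confidence threshold nor the exact rate definitions, so `min_confidence` is a hypothetical parameter and the false positive and false negative rates are assumed to be FP/(FP+TN) and FN/(FN+TP). Labels are +1 for disease-associated and -1 for neutral.

```python
def evaluate(scores, labels, min_confidence):
    """Reject low-confidence predictions, then compute accuracy and error rates."""
    # Keep only predictions whose absolute SVM score reaches the threshold.
    kept = [(s, y) for s, y in zip(scores, labels) if abs(s) >= min_confidence]
    rejected = len(scores) - len(kept)

    tp = sum(1 for s, y in kept if s > 0 and y == 1)
    tn = sum(1 for s, y in kept if s <= 0 and y == -1)
    fp = sum(1 for s, y in kept if s > 0 and y == -1)
    fn = sum(1 for s, y in kept if s <= 0 and y == 1)

    def ratio(num, den):
        # Avoid division by zero when a class is empty after rejection.
        return num / den if den else float("nan")

    return {
        "rejected": rejected,
        "accuracy": ratio(tp + tn, len(kept)),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }
```

Raising `min_confidence` trades coverage for accuracy: more variants are left unclassified, but the remaining calls sit further from the decision boundary.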
Results - mapping • SNP-to-protein mapping • 28,043 (21,255 dbSNP) validated coding nsSNPs • 70,147 (54,048 dbSNP) including non-validated
Results - structure • 13,391 (53%) proteins have modelled domains with equivalent residues • 13,062 (19%) nsSNPs (all) • 8725 (31%) nsSNPs (validated) • 67 nsSNPs appear in more than one protein (alternative splicing)
Results - function • 1886 destabilizing nsSNPs (structural rules 1-4) • 1317 monogenic disease-associated nsSNPs by SVM • comparative models • conservation • sub properties
Web resource: http://alto.compbio.ucsf.edu/LS-SNP/ • SCOP • Swiss-Prot • KEGG • UCSC • PDBsum • ModBase • filter by KEGG pathway, SNP ID (rs), HUGO, Swiss-Prot
[Figure labels: genomic seq; protein seq]
Discussion - data quality • validated/non-validated SNPs? • multiple independent submissions • submitter confirmation • alleles observed in at least 2 chromosomes • submission to HapMap • report non-validated and validated SNPs, with option to filter
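The "report both, with option to filter" idea can be sketched as a simple predicate over SNP records. The record layout and field names are hypothetical, and the sketch assumes that meeting any one of the four listed criteria is enough for a SNP to count as validated.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    """Hypothetical SNP record; field names are assumptions."""
    rs_id: str
    independent_submissions: int = 1
    submitter_confirmed: bool = False
    min_chromosomes_per_allele: int = 0
    in_hapmap: bool = False


def is_validated(snp: SnpRecord) -> bool:
    """True if the SNP meets at least one of the validation criteria."""
    return (
        snp.independent_submissions >= 2
        or snp.submitter_confirmed
        or snp.min_chromosomes_per_allele >= 2
        or snp.in_hapmap
    )


def select(snps, validated_only=False):
    """Return all SNPs, or only the validated ones when the filter is on."""
    return [s for s in snps if is_validated(s) or not validated_only]
```

Keeping the filter optional mirrors the slide: non-validated SNPs stay in the report, and the user decides whether to hide them.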
Discussion - ligands • the local structural environment of each SNP-ligand contact cannot be evaluated by the pipeline • all contacts are reported • some will not be biologically interesting • e.g. a SNP in proximity of glycerol will have no functional effect • but in glycerol kinase the SNP could be important
Discussion - structural annotations • ModSNP: 4109 structural annotations, 70% sequence identity cutoff • LS-SNP: 13,062 dbSNP rsIDs (4907 validated) with structural annotations, no sequence identity cutoff • instead, a score (0-1) is given based on sequence identity and model assessment (average identity ~28%)
Discussion - structural annotations • ‘…because structure annotations are models, use properties that depend on correct fold assignment and a good target-template alignment, as opposed to atomic-level structural details such as loss of salt bridges, hydrogen bonds or disulphide bonds.’
Discussion -structural annotations • not possible to model effects such as changes in backbone geometry • or small side chain alterations
Case study - Glutathione S-Transferase • GSTs play a key role in cellular detoxification • domain interface • buried charge change • unfavourable change in accessible surface potential at a buried position • conserved in mouse, rat, chicken • the combination of information sources builds a convincing case
Caveats • only updated twice a year • dependent on structure (comparative modelling) • allowing predictions without structure data would have increased numbers • no option to add your own SNPs • no indication of which predictors are best • or which combinations of predictors • domain-domain or ligand binding is flagged, but with no indication of how damaging this might be • next version will have HapMap SNPs • SVM – trained on monogenic disease only • only a small subset of Sunyaev's rules was chosen - conservation?