Explore the latest tools for phylogenetic analysis, multiple sequence alignment, and profile searching in bioinformatics. Learn about sensitivity, specificity, and probabilities in predicting gene functions. Discover limitations and software packages for better gene analysis.
Basic Overview of Bioinformatics Tools and Biocomputing Applications III • Dr Tan Tin Wee, Director, Bioinformatics Centre
More BioComputational Tools • Phylogenetic Analysis • Multiple Sequence Alignment • Profile Searching • Sensitivity, Specificity and Probabilities in the Prediction of Functions
Phylogenetic Analysis • Assumption: evolutionary descent • Divergence • Phylogenetic tree • Rooted and unrooted trees • [Figure: example tree; labels: Species, X, Y, A, B]
Rooted and Unrooted Trees • Rooted: the ancestral state of the evolved organism or gene is known. • A tree branches at bifurcation points until it reaches the terminal branches, or tips/leaves. • Unrooted trees represent the branching order but do not indicate the root, i.e. the position of the last common ancestor.
Phylogenetic inference for genes • In its infancy; an inexact science • Computational tools based on general mathematical and statistical principles • Phylogenetic reconstructions may conflict with common sense. • Incorrect sequence alignments, inadequate models • Sites within sequences evolve at different rates • Unequal rate effects
Some algorithms • Maximum parsimony • Maximum likelihood • Distance methods • UPGMA • Paralinear (LogDet) distances • Software packages: PAUP (Phylogenetic Analysis Using Parsimony), PHYLIP (Phylogeny Inference Package), MacClade, GAMBIT, MEGA/METREE
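To make the distance-method entry concrete, here is a minimal sketch of textbook UPGMA clustering. It is not the implementation used by any of the packages listed; the function name `upgma`, the nested-tuple tree format and the toy distances are all assumptions made for illustration. UPGMA assumes a roughly constant rate of change along all lineages, so the unequal rate effects mentioned earlier can mislead it.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b.
    """
    # Each current cluster maps to the number of original taxa it contains.
    sizes = {name: 1 for name in names}
    d = dict(dist)  # work on a copy so the caller's matrix is untouched

    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = sizes[a], sizes[b]

        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            if c in (a, b):
                continue
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)

        del sizes[a], sizes[b]
        sizes[merged] = na + nb

    return next(iter(sizes))


# Hypothetical toy distances, made up for illustration only.
toy_names = ["A", "B", "C", "D"]
toy_pairs = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
toy_dist = {frozenset(pair): value for pair, value in toy_pairs.items()}
print(upgma(toy_names, toy_dist))
```

The nested tuples only record branching order; branch lengths are left out to keep the sketch short.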
Limitations • Inspection of sequence alignments • Removal of deviant sequences from the phylogenetic inference • Different genes analysed produce different trees • "Bootstrapping" for estimating statistical significance may still have errors in interpretation
Uses • Molecular taxonomy • 16S and 23S rRNA analysis for bacterial classification • 18S rRNA analysis of nematodes, Drosophila • Epidemiological analysis of strain variation, e.g. in infectious pathogens
Multiple Sequence Analysis • Gather a set of sequences of putative similarity or homology • Pairwise comparison of every pair of sequences in the set • Build a "tree" of similarity • Realignment of all sequences based on the "ancestral" sequence, padding with gaps etc. • Used for generating "profiles"
Use • Detection of conserved and variable regions • Infer gene functions • Variable segments: infer that they are dispensable to function, or are antigenic variants • Motifs can be used to analyse an unknown sequence and infer possible function or relatedness • Motifs as a basis for annotation of genome project sequences
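As an illustration of the first use, the sketch below scores each column of an existing alignment by how many sequences share the most common residue. The function names, the gap symbol `-`, the cutoff and the toy alignment are assumptions for illustration; real tools use more careful scoring such as substitution matrices or sequence weighting.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column, the fraction of sequences that share the most common
    non-gap residue (0.0 for a column made only of gaps)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    scores = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(aligned))
    return scores


def conserved_columns(aligned, cutoff=1.0):
    """0-based indices of columns whose conservation reaches the cutoff."""
    return [i for i, s in enumerate(column_conservation(aligned)) if s >= cutoff]


# Hypothetical toy alignment, made up for illustration only.
toy_alignment = [
    "MKT-LLV",
    "MKSALLV",
    "MRT-LIV",
]
print(conserved_columns(toy_alignment))
```

Lowering `cutoff` admits partly conserved columns; the columns left out are the candidates for the variable segments.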
Software • CLUSTALW • Profile software based on hidden Markov model (HMM) statistical models, e.g. HMMer, HMMPro, META-MEME, PROBE, BLOCKS
Example • C. elegans genome project • several large gene families of sequence homology - function unknown. • Now classified as putative G-protein coupled receptors (GPCRs). • Have to detect significant similarity between putative Worm GPCRs and experimentally known GPCRs in other species
Process • Select a typical unknown sequence • BLAST search against the nr database • Inspect hits and E-values • Top-scoring hit: mitochondrial L11 ribosomal protein, E=0.002 (not low enough to be trusted for annotation) • The rest of the top scorers are all nematode-specific unknown sequences • Compare with PSI-BLAST iterative searching at NCBI • Similarity with mammalian GPCRs or with the high-scoring mt rL11 protein?
Further analysis • Gather all nematode-specific sequences • WormPep database of non-redundant sequences • Discard abnormally long or short sequences • Multiple sequence alignment using CLUSTALW • Generate a profile of the multiple alignment using HMMer • Use the profile to search the database again
Results • Similarity with mammalian GPCRs detected at a significant level • The L11 protein also has a very significant score, E=5x10^-49 • Pitfalls of PSI-BLAST: significance of a match to the training set during iteration. • Finally, the L11 protein may be wrongly annotated, with an annotation not based on experimental results
A. Sensitivity and Specificity of a Fairly Good Test • Total real +ve = 73; total real -ve = 27 • Specificity = 25/(2+25) = 0.93: picked up 25 of the 27 negatives; very specific, low false positives • Sensitivity = 70/(70+3) = 0.96: able to pick up 70 of the 73 known positives; quite sensitive, low false negatives • Gold standards
Experimental test result vs. known gold standard (N=100):
                Gold +ve    Gold -ve
Test +ve           70           2
Test -ve            3          25
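The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch; the function and variable names are chosen here for illustration, and the counts are the Case A values from the slide (70 true positives, 2 false positives, 3 false negatives, 25 true negatives).

```python
def sensitivity(tp, fn):
    """Fraction of the real positives that the test picks up."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of the real negatives that the test correctly rejects."""
    return tn / (tn + fp)


# Case A counts taken from the slide.
tp, fp, fn, tn = 70, 2, 3, 25

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```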
B. Increase Sensitivity but Lower Specificity of a Test • Total real +ve = 73; total real -ve = 27 • Specificity = 14/(13+14) = 0.52: picked up 14 of the 27 negatives; not very specific, high false positives • Sensitivity = 72/(72+1) = 0.99: able to pick up 72 of the 73 known positives; super sensitive, low false negatives
Experimental test result vs. known gold standard (N=100):
                Gold +ve    Gold -ve
Test +ve           72          13
Test -ve            1          14
C. Increase Specificity of a Test but Sensitivity May Drop • Total real +ve = 73; total real -ve = 27 • Specificity = 27/(0+27) = 1.0: picked up 27 of the 27 negatives; completely specific. Raising the threshold until there are zero false positives means true positives will drop • Sensitivity = 50/(50+23) = 0.68: able to pick up only 50 of the 73 known positives; not very sensitive, high false negatives
Experimental test result vs. known gold standard (N=100):
                Gold +ve    Gold -ve
Test +ve           50           0
Test -ve           23          27
Trade-off involved • If the threshold of the test is set high, so that all the noise disappears, you may also miss some true positives, get a lot of false negatives and thus lose sensitivity - Case C • If the threshold of the test is set low, so that you catch as many of the positives as you can, i.e. high sensitivity, non-specific false positive hits start appearing - Case B
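Putting the three cases side by side makes the trade-off visible. The sketch below uses only the counts given on slides A, B and C; the dictionary layout and the helper name `rates` are assumptions for illustration.

```python
# Counts from slides A, B and C (N=100 in each case).
CASES = {
    "A": {"tp": 70, "fp": 2, "fn": 3, "tn": 25},
    "B": {"tp": 72, "fp": 13, "fn": 1, "tn": 14},
    "C": {"tp": 50, "fp": 0, "fn": 23, "tn": 27},
}


def rates(tp, fp, fn, tn):
    """Return (sensitivity, specificity) for one 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


for name, counts in CASES.items():
    sens, spec = rates(**counts)
    print(
        f"Case {name}: sensitivity={sens:.2f} specificity={spec:.2f} "
        f"false positives={counts['fp']} false negatives={counts['fn']}"
    )
```

Case B gains about three points of sensitivity over Case A at the cost of 11 extra false positives, while Case C removes every false positive but misses 23 of the 73 real positives.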
Computational Predictions of Gene Function • Sensitivity and specificity have similar trade-offs. • Cutoff threshold values have to be empirically determined or arbitrarily chosen, depending on the situation
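One way to determine a cutoff empirically is to sweep candidate values over hits whose true status is already known and watch both rates. The sketch below is an illustration only: the function name `sweep`, the scores and the labels are hypothetical, and it assumes a higher score means a stronger hit (for E-values the comparison would be reversed, since lower is better).

```python
def sweep(scored, cutoffs):
    """For each cutoff, call a hit positive when score >= cutoff and return
    a list of (cutoff, sensitivity, specificity) tuples.

    scored: list of (score, is_real_positive) pairs; it must contain at
    least one real positive and one real negative.
    """
    results = []
    for cutoff in cutoffs:
        tp = sum(1 for score, real in scored if real and score >= cutoff)
        fn = sum(1 for score, real in scored if real and score < cutoff)
        tn = sum(1 for score, real in scored if not real and score < cutoff)
        fp = sum(1 for score, real in scored if not real and score >= cutoff)
        results.append((cutoff, tp / (tp + fn), tn / (tn + fp)))
    return results


# Hypothetical scored hits, made up for illustration only.
toy_hits = [
    (95, True), (80, True), (60, True), (55, False),
    (40, True), (30, False), (10, False),
]

for cutoff, sens, spec in sweep(toy_hits, [20, 50, 70]):
    print(f"cutoff={cutoff}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

As the cutoff rises, specificity climbs and sensitivity falls, mirroring the move from Case B to Case C.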