BioInformatics Consultation Practice 8 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu
Content of the Practice
• Multiple sequence alignment
  • Basic terms
    • Searching Conserved Regions/Domains/Motifs/Patterns
      • Purposes
      • Degrees of similarity
    • Searching rapidly changing regions
  • Clustering methods
    • Similarity metrics
      • Partial Proximity
      • Multivariate Homolog Proximity
      • Complex Proximity Metrics
    • Algorithm types
      • K-means
        • Evaluation
      • Hierarchic Distributive
      • Hierarchic Agglomerative
        • Steps
        • Evaluation
        • Resolving value concentration problem
  • Phylogenetic tree analysis
    • Basic terms
    • Obstacles
  • Software:
    • ClustalW2
      • Main screen
      • Outputs
• References
Multiple sequence alignment: Basic terms
• Multiple sequence alignment compares a large number of sequences to deal with the following things:
  • Conserved Regions/Domains/Motifs/Patterns (Konzervált régiók/domének/motívumok):
    • They are coding parts of structurally sensitive proteins under considerable evolutionary pressure: the vast majority of mutations destroy them, therefore they have to be well conserved in survivors. We search for them for 4 possible purposes:
      • At Chromosome Walking (Kromoszóma lépkedés) we search for matching fragment ends to assemble the longest possible contig
      • At Gene Search (Génkeresés) we look for Expressed Sequence Tags (EST)
      • In recognized genes, we can search for the parts coding enzymes/active regions
      • Creating a Phylogenetic Tree (Gén-családfa) based on the similarity of sequences, inferring the descent of relatively distant organisms from each other. Grades of Similarity (Hasonlóság) are:
        • Analogy (Analógia): sequences code the same function, but have different origins
        • Homology (Homológia): sequences have a common origin, and typically code the same or a similar function
        • Paralogy (Paralógia): sequences code the same (or a slightly modified) function and have a common origin, but they separated by Gene Duplication in the common ancestor, so they are in different regions of the genome
        • Orthology (Ortológia): sequences code the same function and have a common origin, but they separated by speciation, so they typically sit in the same place of the genome in the different species
  • Comparing rapidly changing regions:
    • They are non-coding parts, or they code structurally less sensitive proteins (eg. fibrin), so survivability is sustained even at rapid variation
    • Therefore we use them for analyzing the descent of closely related organisms
Multiple sequence alignment: Clustering methods: Similarity metrics 1
• Grouping of sequences is based on the Clustering Methods (Klaszterezési Módszerek) of multivariate statistics, which create groups from observed objects.
• Clustering methods are based on Similarity/Matching/Proximity Metrics (Hasonlósági metrikák): all of them try to put similar objects into one group and dissimilar objects into separate groups. But how can we measure similarity?
• Univariate, Partial Proximity (Egy változós, Parciális hasonlóság): how can we compare a given position in 2 sequences?
  • Nominal (Nominális): {0,1} discrete-valued distance:
    • Eg. whether the nucleotides in a given position of a nucleotide sequence are identical or not
  • Cardinal (Kardinális): [0,1] continuous interval distance:
    • Eg. for nucleotides, we can give a fractional distance if at least the pyrimidine/purine group of the nucleotides is identical
    • Eg. for amino acids, we can set up continuous distance scores from physical properties
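The two partial proximity types for nucleotides could be sketched as follows. This is only a minimal illustration: the function names and the 0.5 partial-match value are assumptions made for the example, not values given in the practice.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def nominal_distance(a: str, b: str) -> float:
    """Nominal {0,1} distance: 0 if the two nucleotides are identical, else 1."""
    return 0.0 if a == b else 1.0


def cardinal_distance(a: str, b: str, same_group: float = 0.5) -> float:
    """Cardinal [0,1] distance with partial credit inside a base group.

    same_group=0.5 is an assumed example weight for two different bases
    that are both purines or both pyrimidines.
    """
    if a == b:
        return 0.0
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return same_group
    return 1.0


print(nominal_distance("A", "G"))   # 1.0
print(cardinal_distance("A", "G"))  # 0.5
print(cardinal_distance("A", "C"))  # 1.0
```

The cardinal version keeps identical bases at distance 0 and unrelated bases at distance 1, and only fills the interval in between.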
Multiple sequence alignment: Clustering methods: Similarity metrics 2
[Figure: Decision Space plots with coordinate axes Pos1, Pos2 and Pos3, each taking the values A, G, C, T, *; the example sequences CC, GTG and TGC appear as points]
• Multivariate Homolog Proximity Measures (Több változós hasonlóság homológ objektumok közt): homologous sequences have numerous positions (i=1..n). How can we aggregate position-wise distances into a single distance metric?
• It is easier to imagine this graphically: the value of each position forms a coordinate axis in an n-dimensional coordinate system called Decision Space (Döntési Tér). Each compared sequence appears there as 1 coordinate point. We try to measure the distance between 2 coordinate points with alternative methods:
  • We can summarize the mismatch of sequences i=1..n positions long with the Manhattan Distance (Manhattan távolság):
    $$\text{Mismatch}\,\% = \frac{1}{n}\sum_{i=1}^{n}\text{Score}_i \qquad (8.1)$$
  • Because of the simple summing up, it is called a Fully Compensatory (Teljesen Kompenzáló) distance metric, but it can be misleading! If E.T.'s genome also has many repetitive parts, even he can be considered similar to humans because of numerous small random matches, which fully compensate for the essential longer non-matching parts!
  • Graphically, it is the distance between points when moving on a "grid"
  • Equally distant points lie on a square rotated by 45°
  • The distance function from a point forms a pyramid rotated by 45°
[Figure: two example sequences compared position by position (?, ≠, =):
CGTAGCGTAGCGTAGCGTAGCGTAGCGTAGCGTAGCGTAGACACGATACGTACGGCCTGATAC
ATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGCTTATCGTACGTACTTGAACTGA]
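Equation (8.1) could be sketched in a few lines. The sketch assumes equal-length sequences and nominal {0,1} position scores; the function names are illustrative only.

```python
def position_scores(seq_a: str, seq_b: str) -> list[float]:
    """Nominal mismatch score per position: 0 for a match, 1 for a mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("position-by-position comparison needs equal lengths")
    return [0.0 if a == b else 1.0 for a, b in zip(seq_a, seq_b)]


def manhattan_mismatch(scores: list[float]) -> float:
    """Equation (8.1): sum of the position scores divided by n."""
    return sum(scores) / len(scores)


scores = position_scores("AGCT", "AGGT")
print(scores)                      # [0.0, 0.0, 1.0, 0.0]
print(manhattan_mismatch(scores))  # 0.25
```

Because every position contributes linearly, four scattered mismatches and one block of four adjacent mismatches give the same result, which is the compensation effect described above.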
Multiple sequence alignment: Clustering methods: Similarity metrics 3
[Figure: the same Decision Space plots (axes Pos1, Pos2, Pos3 with values A, G, C, T, *; points CC, GTG, TGC) and the same example sequence pair as on the previous slide]
• Alternatively, we can compute the Euclidean Distance (Euklideszi távolság) for sequences n positions long:
  $$\text{Mismatch}\,\% = \left(\frac{1}{n}\sum_{i=1}^{n}\text{Score}_i^{2}\right)^{0.5} \qquad (8.2)$$
  • Please note that it squares the error scores, making big differences even bigger. This way a lot of small random matches cannot compensate for a longer mismatching region, because the Euclidean Distance is a Not Fully Compensatory (Nem teljesen kompenzáló) distance metric
  • Graphically, it is the "straight" distance between points
  • Equally distant points lie on a circle
  • The distance function from a point forms a cone
• Complex Proximity Metrics (Komplex hasonlósági metrikák): unfortunately, frameshift mutations can insert/delete nucleotides in originally homologous sequences, therefore they cannot be compared position by position:
  • We use BLASTP-type word search algorithms to find the most compatible parts between 2 sequences, and the BLASTP matching score is used as the basis of the distance
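A small numeric comparison of equations (8.1) and (8.2) may help here. The score vectors are invented for illustration and assume cardinal [0,1] position scores; with purely binary {0,1} scores the square of each score equals the score itself, so (8.2) reduces to the square root of (8.1).

```python
import math


def manhattan_mismatch(scores: list[float]) -> float:
    """Equation (8.1): mean of the position scores."""
    return sum(scores) / len(scores)


def euclidean_mismatch(scores: list[float]) -> float:
    """Equation (8.2): square root of the mean of the squared position scores."""
    return math.sqrt(sum(s * s for s in scores) / len(scores))


spread_out = [0.25, 0.25, 0.25, 0.25]    # many small partial mismatches
concentrated = [1.0, 0.0, 0.0, 0.0]      # one full mismatch

print(manhattan_mismatch(spread_out), manhattan_mismatch(concentrated))  # 0.25 0.25
print(euclidean_mismatch(spread_out), euclidean_mismatch(concentrated))  # 0.25 0.5
```

Both vectors have the same Manhattan value, but the Euclidean value is higher for the concentrated mismatch, so small scattered matches cannot hide it as easily.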
Multiple sequence alignment: Clustering methods: Algorithm types 1
[Figure: two Decision Space plots (axes Pos1, Pos2, Pos3 with values A, G, C, T, *) showing the sample sequences CGT, CCT, CTT, TTT, GTG, G*G and GTA]
• K-means Clustering (K-közép klaszterezés)
  • It can be used only if we know the number of groups to create (k)
  • It starts with k random sequences of equal length (n) as group centroids. They are coordinates in the n-dimensional space
  • During numerous iterations the group centroids repel each other in space (ie. they try to be as different sequences as possible), however they are attracted by large groups of coordinate points of the m observed sequences and move towards them in decreasing steps (ie. they try to become the compromise sequence of large groups)
  • Iteration stops when the aggregated movement of the group centroids does not exceed a threshold anymore, reaching Lyapunov Stability (Ljapunov-stabilitás): they can oscillate back and forth, but basically will not move.
  • Finally, the group centroids distribute the observed sequences: each sequence is assigned to the nearest group centroid, forming groups
• Evaluation of K-means Clustering:
  • It is less sensitive to outlier sequences (eg. results of sequencing errors or frameshift mutations)
  • It has a low computational requirement, growing linearly with the number of observed sequences m and the number of groups k
  • It can delimit only Compact Shaped Clusters (Kompakt csoportok) in decision space and gets uncertain in case of strong mutation cross-effects across sequence positions, which result in Elongated Clusters (Elnyújtott Alakú Klaszter) in decision space (eg. some of the sample sequences were formed from other samples by translocation mutation)
  • All groups will be on the same level of grouping, they cannot be ordered into a hierarchy
  • It can be used only if we know the number of groups in advance. In phylogenetic analysis, this is not the usual situation!
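As a rough illustration, the sketch below runs the standard textbook K-means iteration (Lloyd's algorithm) on one-hot encoded sequences. It is a simplification of the description above: it has no explicit repulsion between centroids and no decreasing step size, and it seeds the centroids with the first k sequences instead of random ones so that the run is repeatable. All names are illustrative.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector, four coordinates per position."""
    return [1.0 if base == letter else 0.0 for base in seq for letter in ALPHABET]


def sq_dist(u, v):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(seqs, k, max_iter=100, tol=1e-9):
    points = [one_hot(s) for s in seqs]
    # Assumption: the first k sequences serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every sequence to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        moved = 0.0
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # an empty group keeps its old centroid
            new = [sum(col) / len(members) for col in zip(*members)]
            moved += sq_dist(new, centroids[c])
            centroids[c] = new
        if moved <= tol:  # centroids no longer move: stop iterating
            break
    return labels, centroids


labels, _ = k_means(["CGT", "GTG", "CCT", "CTT", "GTA"], k=2)
print(labels)  # [0, 1, 0, 0, 1]
```

Each centroid ends up as a per-position frequency profile of its group, which matches the idea of a "compromise sequence".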
Multiple sequence alignment: Clustering methods: Hierarchic: Agglomerative 1
• Hierarchic Clustering (Hierarchikus klaszterezés): it creates a hierarchy of groups
  • Distributive (Divisive) Method (Disztributív módszer): at the beginning it treats the whole sample of sequences as one group, then it splits it into subgroups
  • Agglomerative Method (Agglomeratív módszer): just the reverse, and more usable in practice:
    • STEP 0: At start, the m observed sequences form m separate groups
    • STEP 1: It compares the still existing groups pairwise and the nearest two are agglomerated into one group. There are 2 methods to compute the distance of groups:
      • Nearest Neighbor Method (Legközelebbi szomszéd): the distance of 2 groups that already contain several sequences is defined by their closest members:
        • This detects elongated clusters well
        • But it is extremely sensitive to Outlier (Kilógó) sequences (eg. results of sequencing errors): they form a separate group against the rest of the sample, which is misleading
      • Ward Method (Ward-módszer): the distance of 2 groups is given by the increase of variance of their joint members in decision space.
        • This can detect only more compact clusters than nearest neighbor
        • But it can still detect more elongated clusters than K-means
        • Less sensitive to outlier sequences
        • Higher computational requirement
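STEP 0 and STEP 1 with the Nearest Neighbor Method could be sketched like this. The sketch assumes equal-length sequences and uses the number of mismatching positions (Hamming distance) as the sequence distance; it is a naive implementation for illustration, not an efficient one.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_agglomeration(seqs):
    """Merge m single-sequence groups in m-1 steps; return the merge history."""
    clusters = [[s] for s in seqs]          # STEP 0: every sequence is a group
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest neighbor: group distance = distance of closest members.
                d = min(hamming(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))   # STEP 1: agglomerate
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


for distance, left, right in nearest_neighbor_agglomeration(["CGT", "CCT", "GTG", "GTA"]):
    print(distance, left, right)
```

When several pairs share the smallest distance, this sketch simply takes the first one found, which is exactly the arbitrary choice that becomes a problem when many distances are identical.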
Multiple sequence alignment: Clustering methods: Hierarchic: Agglomerative 2
[Figure: dendrogram over the leaves GTG, G*G, GTA, TTT, CCT, CTT, CGT with a Distance axis and a "Stop!" cut line; scree plot with axes Distance and Clusters left (7 to 1)]
    • STEP 2: It records how much distance between the 2 groups disappeared through their agglomeration, and plots it on a Scree Plot (Könyök diagram): a line chart representing the info loss in a given iteration of the algorithm
    • STEP 3: It records on a binary Dendrogram (Bináris fa diagram) which two groups (Leaf elements, Levél elem) were aggregated into a new group (Branch element, Ág elem). The length of the branches expresses the distance that disappeared
    • STEP 4: GOTO STEP 1 until the m initial groups are aggregated into 1 in m-1 iterations
    • Termination: of course, agglomerating everything into 1 group does not make any sense. However, during the m-1 iterations we observe the scree plot:
      • If the info loss suddenly jumps up after a given iteration, it signals that the further iterations should be discarded. This is how the method detects the number of groups to keep.
      • If the scree plot has no break at all and resembles a mirrored 1/x function, there are no distinctive groups, or the clusters were too elongated to detect
      • Multiple big "steps" on the scree plot show a multi-level hierarchic group structure
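The termination rule can be approximated by looking for the largest jump between consecutive merge distances. Taking the single largest jump is an assumed simplification of "suddenly jumps up"; in practice the scree plot is usually inspected by eye.

```python
def clusters_to_keep(merge_distances):
    """Return the number of groups left just before the largest jump.

    merge_distances holds the distance that disappeared in each of the
    m-1 agglomeration steps, in the order the steps were executed.
    Returns None if the scree plot has no break at all.
    """
    m = len(merge_distances) + 1
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    if not jumps or max(jumps) <= 0:
        return None
    last_good_step = jumps.index(max(jumps))   # 0-based index of the last kept step
    return m - last_good_step - 1              # each kept step removes one group


print(clusters_to_keep([1, 1, 3]))        # 2
print(clusters_to_keep([1, 1, 2, 2, 6]))  # 2
print(clusters_to_keep([1, 1, 1]))        # None
```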
Multiple sequence alignment: Clustering methods: Hierarchic: Agglomerative 3
[Figure: example sequences CG, CC, GC, GT and GG]
• Evaluation of agglomerative hierarchic clustering:
  • It can create a binary group hierarchy. From the dendrogram we can read the Phylogenetic Tree (filogenetikus fa)
  • It detects the number of groups automatically
  • It has an exact termination criterion instead of the Lyapunov stability threshold of K-means clustering
  • Unfortunately it has a much higher computational requirement: it grows quadratically with the number of observations m. CURE: if we have to group a large number of sequences (eg. 10000s), we pre-group them with K-means into eg. 1000 groups, and from each group we select the sequence nearest to the group centroid. These selected sequences are weighted with the pre-group sizes and grouped further with the hierarchic agglomerative method
  • It is more sensitive to outliers than K-means: in case of many outliers it typically creates very unevenly sized, very Unstable Clusters (Instabil csoport): adding/removing one sequence to/from the sample leads to totally different results. Therefore it is important to check the group sizes even if we are primarily interested in the dendrogram, because unbalanced group sizes warn of a distorted, unbalanced, useless dendrogram! CURE: outliers should be removed from the sample before clustering (E.T., go home!)
  • It does not really work well if there is a Value Concentration Problem (Érték-koncentrációs probléma): there is only a low number of possible values in one sequence position (eg. A, C, T, G for nucleotides), so partial distances are not continuous but discrete, even {0,1} binary valued. This results in a huge number of identical distances among the observed sequences, and the agglomeration cannot decide which of them to agglomerate first. Therefore the grouping will again be very unstable. CURE: we can resolve this only with painfully computation-intensive auxiliary methods embedded in the agglomerative clustering algorithm:
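The value concentration problem is easy to see by counting how often each pairwise distance occurs. The sketch below assumes nominal {0,1} position scores (Hamming distance) and uses five short two-letter example sequences.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_histogram(seqs):
    """Count how many sequence pairs share each pairwise distance value."""
    return Counter(hamming(a, b) for a, b in combinations(seqs, 2))


hist = distance_histogram(["CG", "CC", "GC", "GT", "GG"])
print(dict(hist))  # 6 pairs at distance 1 and 4 pairs at distance 2
```

With 10 pairs but only 2 distinct distance values, the first agglomeration step already has 6 equally good candidate pairs to choose from.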
Multiple sequence alignment: Clustering methods: Hierarchic: Resolving value concentration 1
[Figure: grouping alternatives (Group / NotGrp) for the example sequences GG, GT, GC, CC and CG; example alignment string: cCTt]
• Agglomeration optimization with dynamic programming:
  • The optimal pair to agglomerate from many identical distances could be determined by a Dynamic Integer Linear Programming Model, solved with the Branch and Bound (B&B) algorithm
  • This would give exactly the optimal grouping and hierarchy of sequences
  • It has an infeasibly colossal computational requirement
• Progressive Method:
  • At first, sequences are pairwise matched with a BLASTP-type word search algorithm
  • For the best matching pair an Alignment String (Illeszkedési Sztring) is determined, which represents the agglomerated group. In further iterations, alignment strings are used to pairwise match groups.
  • By default the BLASTP algorithm has a higher computational requirement than distance computation, but much lower than B&B optimization
  • Moreover, as shorter alignment strings are matched instead of whole sequences, it reduces the computational requirement considerably
  • In a given position of the alignment strings there are many more possible values, because of partial matches, than in the positions of nucleotide sequences. This resolves the value concentration problem
  • It can also handle frameshift mutations
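Why a word search copes with inserted or deleted nucleotides while a position-by-position comparison does not can be shown with a toy score. The shared-word fraction below is only an assumed stand-in for a real BLAST-type matching score, and the word length k=3 and the sequences are arbitrary examples.

```python
def words(seq, k):
    """Set of all k-letter words (k-tuples) occurring in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_word_score(a, b, k=3):
    """Fraction of distinct k-letter words that the two sequences share."""
    wa, wb = words(a, k), words(b, k)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def positional_mismatch(a, b):
    """Fraction of mismatching positions over the overlapping length."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n


original = "ACGTACGGTCA"
shifted = "A" + original          # one nucleotide inserted at the front

print(positional_mismatch(original, shifted))  # high: almost every position shifts
print(shared_word_score(original, shifted))    # still high: nearly all words survive
```

A single insertion shifts every later position, so the positional comparison reports the sequences as mostly different, while almost all 3-letter words are still shared.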
Multiple sequence alignment: Clustering methods: Hierarchic: Resolving value concentration 2
[Figure: example alignment strings with weights (+1:*TT, +1:cCTt, +1:cCTt) and an EST weight matrix over the nucleotides A, G, C, T]
• Weighted Progressive Method:
  • Same as above, except that it computes a weight for each sample sequence: the average of its BLASTP match scores against all the other m-1 sequences
  • Higher weighted sample sequences/groups tend to be near the group centroids, so they are preferred at agglomeration
  • It has a very slightly higher computational requirement than the Progressive Method
  • But it is even more effective
• Iterative Method:
  • Same as above, but it also allows comparing the alignment string of a sample sequence/group with already clustered sequences/subgroups
  • It has a considerably higher computational requirement than the progressive methods, growing with the third power of the number of sample sequences m
  • It can perform well if the dendrogram has many, almost equally good alternative solutions because of elongated clusters
• Motif Method:
  • Same as the Progressive Method, but a search for matching Expressed Sequence Tags (EST) is executed before the agglomeration decision. ESTs are described by weight matrices specific to the organisms of origin of the sample sequences. ESTs are searched with a Hidden Markov Model (HMM) algorithm.
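The sequence weights of the Weighted Progressive Method could be computed as below. The fraction of identical positions is used here as an assumed placeholder for the BLASTP match score, and the sequences are invented examples of equal length.

```python
def identity_score(a, b):
    """Placeholder match score: fraction of identical positions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_weights(seqs):
    """Weight of each sequence = average match score against the other m-1."""
    m = len(seqs)
    return [
        sum(identity_score(s, t) for j, t in enumerate(seqs) if j != i) / (m - 1)
        for i, s in enumerate(seqs)
    ]


seqs = ["CCT", "CGT", "CTT", "GTA"]
weights = sequence_weights(seqs)
for seq, weight in zip(seqs, weights):
    print(seq, round(weight, 3))
```

In this example CTT gets the highest weight because it shares positions with every other sequence, while GTA, which resembles almost nothing else, gets the lowest.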
Multiple sequence alignment: Phylogenetic tree analysis
• Basic terms:
  • A phylogenetic tree is a tree structure whose leaves are the analysed sequences
  • It is defined by a hierarchic clustering method using a given distance metric of sequences
  • From similarities/distances we try to infer into which other sequences the Most Recent Common Ancestor (Legutóbbi Közös Ős) sequence branched
  • We use the Molecular Clock (Molekuláris Óra) hypothesis, which assumes that forming a given quantity of genetic modification (mutations, recombinations) requires a given number of generations on average. This way, we can compute the Coalescence Time (Szétválási Idő) from the distances, so the tree can be represented on a time axis
• Obstacles of phylogenetic analysis:
  • Deviant Sequence (Deviáns szekvencia): a sequence containing a disproportionately large number of Match Gaps (Illesztési rés) within the given group; it is best to omit it from the analysis
  • Long branch effects (Ál-távoli rokon effektus): proven closely related organisms sometimes show up much more distant in the tree, resulting in long branches, because the following factors can have different speeds in the analysed sub-species. Tree-building methods may then wrongly group such long branches together, which is known as Long Branch Attraction:
    • Evolution Rate (Evolúciós Ráta):
      $$\text{Evolution Rate}\,\% = \text{mutation probability}\,\% \times \text{survival probability}\,\% \qquad (8.3)$$
    • Speed of DNA Repair (DNS javító) mechanisms
  • Silent Mutation (Csendes mutáció): many mutations fall in non-coding parts, or do not change the coded amino acid, so they are without evolutionary pressure
  • Gene Convergence (Génkonvergencia): originally very distant organisms living in the same niche conditions tend to express very similar proteins. Eg. the hearing protein prestin shows convergent sequence changes in echolocating bats and dolphins, regardless of their distant relationship
  • Horizontal Gene Transfer (Horizontális géntranszfer): Retroviruses (Retrovírus) write their genome back into the cellular genome, creating gene transfer within one generation, without any descent, which creates pseudo-paralogy between very distant organisms
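Equation (8.3) and the molecular clock conversion can be written out as a small calculation. The sketch assumes a strict clock with a constant rate, takes the distance as substitutions per site between two present-day sequences, and divides by two because both lineages accumulate changes since the split. The numeric values are invented for illustration.

```python
def evolution_rate(mutation_probability, survival_probability):
    """Equation (8.3): mutation probability times survival probability."""
    return mutation_probability * survival_probability


def coalescence_time(distance, rate_per_generation):
    """Generations since the split under a strict molecular clock.

    Assumes distance = 2 * rate * time, because changes accumulate
    independently on both branches leading to the two sequences.
    """
    return distance / (2.0 * rate_per_generation)


# Hypothetical numbers: 2% divergence, 1e-8 fixed changes per site per generation.
print(coalescence_time(0.02, 1e-8))   # roughly one million generations
print(evolution_rate(0.001, 0.1))     # roughly 1e-4
```

If the rate differs between lineages, as listed among the obstacles above, a single conversion factor like this no longer gives reliable times.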
Multiple sequence alignment: Software: ClustalW2: Main screen 1
• http://www.ebi.ac.uk/Tools/clustalw2/index.html : At the Main Screen:
  • Set of sequences: copy them one after another in FASTA format, give your e-mail and the TITLE of the analysis
  • Alignment: Full/Fast: you can select between full distance computation and word search
  • KTUP word size: def/1..5: word size at word search; def everywhere means automatic optimization!
  • Window length: def/0..10: max length of HSP (High-scoring Segment Pair) windows
  • Score type: absolute/percent: scoring matrix values are used as they are, or their sum is normalized to 100%
  • Top diag: def/1..10: max number of insertion mutations between sequence pairs that can be handled at the same time
  • Matrix: def/blosum30/pam350/gonnet250: type of score matrix
  • Pairgap: def/1..500: penalty weight of a gap in the starting position of matching parts in a pair
Multiple sequence alignment: Software: ClustalW2: Main screen 2
[Figure: iteration example stepping 2 levels back in the tree]
  • GAP Open/Extend/Distance: penalty weight for opening/extending a gap, or for a continued match at large distance within HSPs
  • No end gap: Yes/No: gaps are not allowed at the end of the sequence (means equal length sequences)
  • Iteration, Numiter: use match iteration, and the max number of levels to step back in the tree for comparison
  • Run button: execute
• At proteins, we should consider:
  • Matrix selection: a lower BLOSUM or a higher PAM detects more Divergent (Szétszórt) matches in a large region, the opposite detects a Convergent (Összetartó) match in one block
  • Synchronizing the matrix with the other parameters (see table)
• At nucleotides, we can select:
  • Nucleotide Similarity Matrix (Nukleotid hasonlósági mátrix): no partial match, or
  • PUPPY: partial match inside the Purine bases (Purin bázis): (A, G) and the Pyrimidine bases (Pirimidin bázis): (C, T, U) group
Multiple sequence alignment: Software: ClustalW2: Outputs 1
• Summary:
  • Overview table
  • Jalview graphic match browser:
    • Matches can be overridden manually
    • The Calculate|Tree menu shows the tree
  • Score table
  • Detailed matches:
    • Amino acids are colored with standard colors based on their biochemical properties
Multiple sequence alignment: Software: ClustalW2: Outputs 2
• Overview trees:
  • Phylogram tree with/without distance data: shows the proportional length of branches
  • Cladogram tree with/without distance data: branch distances are equalized at the leaf elements to eliminate graphically disturbing long branch effects
  • Tree structure as standard DND script:
    (((S.meliloti:0.09650, R.sp:0.07189):0.06304, A.ehrlichei:0.13385):0.00826, (P.stutzeri:0.09343, P.aerugi:0.13196):0.00761, M.algicola:0.14653);
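The DND script uses the nested-parentheses Newick notation, where each leaf is written as name:branch_length. A minimal way to pull the leaves and their branch lengths out of the tree shown on this slide is sketched below; the regular expression is an illustrative shortcut that ignores the nesting, not a full Newick parser.

```python
import re

NEWICK = ("(((S.meliloti:0.09650,R.sp:0.07189):0.06304,A.ehrlichei:0.13385)"
          ":0.00826,(P.stutzeri:0.09343,P.aerugi:0.13196):0.00761,"
          "M.algicola:0.14653);")


def leaf_branch_lengths(newick):
    """Return {leaf name: branch length}, skipping unnamed internal branches."""
    # A leaf is a run of name characters directly followed by ':' and a number.
    pairs = re.findall(r"([^\s(),:;]+):([0-9.eE+-]+)", newick)
    return {name: float(length) for name, length in pairs}


leaves = leaf_branch_lengths(NEWICK)
for name, length in leaves.items():
    print(f"{name:12s} {length:.5f}")
```

Internal branches such as ":0.06304" follow a closing parenthesis and carry no name, so the pattern does not pick them up.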
References
• Multiple sequence alignment:
  • http://en.wikipedia.org/wiki/Multiple_sequence_alignment
  • Links: http://pbil.univ-lyon1.fr/alignment.html
• Clustering methods:
  • http://en.wikipedia.org/wiki/Cluster_analysis
  • www.iis.sinica.edu.tw/~hil/summer/sorin/sorin2-2.ppt
  • http://www.technion.ac.il/docs/sas/stat/chap42/sect11.htm
• Phylogenetic tree analysis:
  • http://en.wikipedia.org/wiki/Phylogenetic_tree
  • http://www.cs.huji.ac.il/course/2006/cbio/Scribes/lect13-yaar/lect13.pdf
• Phylogenetic software:
  • EBI ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
  • BCM: http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
  • STRAP: http://www.bioinformatics.org/strap/