UNIVERSITA’ DI MILANO-BICOCCA LAUREA MAGISTRALE IN BIOINFORMATICA

UNIVERSITA’ DI MILANO-BICOCCALAUREA MAGISTRALE IN BIOINFORMATICA Corso di BIOINFORMATICA: TECNICHE DI BASE Prof. Giancarlo Mauri Lezione 8 Allineamento multiplo di sequenze

Perché l’allineamento multiplo? • Analisi filogenetica • Proteine con funzioni simili in specie diverse • L’allineamento multiplo ottimale fornisce informazioni sulla storia evolutiva delle specie • Scoperta di irregolarità (es. gene della Fibrosi Cistica) • Individuazione regioni conservate • L’allineamento multiplo locale rivela regioni conservate • Le regioni conservate di solito sono regioni chiave per la funzionalità • Le regioni conservate sono target prioritari per lo sviluppo di farmaci

Definizione del problema INPUT: un insieme S = {S1,S2,…,Sn} di n sequenze definite su un alfabeto S euna matrice di punteggiod: ({-})2 R OUTPUT: un insieme S’ = {S’1,S’2,…,S’n} di n sequenze sull’alfabeto SU{} con le seguenti proprietà: • Si’| = Li • eliminando gli spazi da Si’ si ottiene Sii • il punteggio di allineamento A è massimo

Esempio 1 1 2 3 4 5 6 7 Indice di colonna M Q P I L L L M L R - L L - M K - I - L L M P P V I L L

Esempio 2 SUGAR- SUC-RE ------- SUG/C?R? -SUGAR- -SUC-RE AZUCAR- -ZUCKER- -SAKARI -ZUCKERO -SOKKER- -------- -SUCKAR- E SUGAR SUCRE AZUCAR ZUCKER SAKARI ZUCCHERO SOKKER

Complessità del problema Tutte le formulazioni del problema di allineamento multiplo sono NP-difficili Occorre trovare algoritmi che producano un buon allineamento e siano efficienti rispetto al tempo utilizzato

Punteggio di un allineamento multiplo • La valutazione di un allineamento multiplo è basata su una funzione di punteggio predefinita; l’obiettivo è l’ottenimento di uno fra gli allineamenti di punteggio massimo • Il punteggio relativo ad un allineamento S’ di S, è dato dalla somma dei punteggi associati a tutte le colonne di S’ • Dovremmo dare una funzione w: (SU{})k Ra k argomenti, e quindi considerare un numero esponenziale di casi |S|k • Sappiamo calcolare il punteggio dell’allineamento a coppie, per somma dei punteggi delle colonne; come estendere all’allineamento multiplo? (N.B: diamo un punteggio anche all’allineamento tra due spazi, ad esempio d(,) = 0).

La misura SP (Sum of Pairs) • Il punteggio relativo ad una colonna dell’allineamento S’ è dato dalla somma dei punteggi di tutte le coppie di simboli nella colonna (incluse coppie di gap) • Il punteggio dell’allineamento S’ è la somma dei punteggi di tutte le colonne • Euristiche per accelerare l’algoritmo di programmazione dinamica

Sequenza di consenso • Dato un allineamento S’ di un insieme S di n sequenze si definiscee la sequenza di consenso Sc per S’ nel seguente modo • Sc ha la stessa lunghezza l è la della generica sequenza S’i in S’ • Il simbolo i-esimo in Sc è uguale al simbolo più frequente nella colonna i-esima di S’ • Il punteggio dell’allineamento S’ è la somma dei punteggi degli allineamenti di ciascuna sequenza S’i in S’ con Sc

Allineamento con albero • Dato l’insieme S = {S1,S2,…,Sn} e l’albero T che ha S come insieme dei nodi, si determinano gli allineamenti ottimi tra le coppie (Si,Sj) che appartengono all’insieme degli archi di T • A T si associa un punteggio P dato dalla somma dei punteggi degli allineamenti di cui al punto 1 (somma dei punteggi degli archi) • Si ricostruisce l’allineamento multiplo S’ di S a partire dagli allineamenti ottimi determinati al punto 1

Metodo Star Alignment • Dato l’insieme S = {S1,S2,…,Sn}, si determinano gli allineamenti ottimi tra tutte le coppie di sequenze (Si,Sj) (j=1,2,…,n e j  i) • Ad ogni Sisi associa un punteggio Pidato dalla somma dei punteggi degli allineamenti di cui al punto 1 con tutte le altre sequenze Sj • Si considera l’indice i per cui il valore di Piè ottimo (minimo o massimo) • Si ricostruisce l’allineamento multiplo S’ di S a partire dagli allineamenti ottimi determinati al punto 1 per la stella di indice i, aggiungendo man mano gaps a Si

S1 S2 S3 S4 S5 S1 7 -2 0 -3 S2 7 -2 0 -4 S3 -2 -2 0 -7 S4 0 0 0 -3 S5 -3 -4 -7 -3 Pi 2 1 -11 -3 -17 Star Alignment: esempio S = { ATTGCCATT, ATGGCCATT, ATCCAATTTT, ATCTTCTT, ACTGACC } Schema di punteggio: d(x,x) = 1 d(x,y) = -1 d(x,-) = d(-,x) = -2

Metodo Star Alignment: esempio Centro stella  S1=ATTGCCATT Ricostruzione dell’allineamento multiplo ATTGCCATT-- ATGGCCATT-- ATC-CAATTTT ATCTTC-TT-- ACTGACC----

Sofware disponibili • CLUSTAL • Basato su algoritmo di Feng-Doolittle • Idea: • allineare a coppie le sequenze del set di input S • utilizzare l’insieme dei punteggi trovati come matrice delle distanze del metodo neighbor-joining per costruire un albero filogenetico per le sequenze in S • allineare le sequenze secondo l’ordine fissato dall’albero filogenetico (prima le sequenze più simili) • DiAlign • Idea: • individuare diagonali (sottosequenze allineate senza spazi) • costruire l’allineamento a partire dalle diagonali

Sofware disponibili • CLUSTAL-W • Standard popular software • It does multiple alignment as follows: • Align 2 • Repeat: keep on adding a new sequence to the alignment until no more, or do tree-like heuristics • Problem: It is simply a heuristics • Alternative: dynamic programming nk for k sequences. This is simply too slow • We need to understand the problem and solve it right

Making the problem simpler! • Multiple alignment is very hard • For k sequences, nk time, by dynamic programming • NP hard in general, not clear how to approximate • Popular practice -- alignment within a band: the p-th letter in one sequence is not more than c places away from the p-th letter in another sequence in the final alignment – the alignment is along a diagonal bandwidth 2c • Used in final stage of FASTA program

In literature • NP hardness under various models • Wang-Jiang (JCB) • Li-Ma-Wang (STOC99) • Just • Approximation results • Gusfield (2- 1/L) • Bafna, Lawler, Pevzner (CPM94, 2-k/L) • star alignment • Sankoff, Kruskal discussed “within a band” • Pearson showed alignment within a band gives very good results for a lot protein superfamilies • Altschul and Lipman, Chao-Pearson-Miller, Fickett, Ukkonen, Spouge (survey) all have studied alignment within a band

SP-Alignment NP hard PTAS for constant band PTAS for constant number of insertion/deletion gaps per sequence on average (for coding regions, this assumption makes a lot of sense) Star-Alignment PTAS in constant band PTAS for constant number of insertion/deletion gaps per sequence on average The following were proved

We will do only SP-alignment • Notation: in an alignment, a block of inserted “---” is called a gap. If a multiple alignment has c gaps on the average for each sequence, we call it average c-gap alignment • We first design a PTAS for the average c-gap SP alignment • Then using the PTAS for the average c-gap SP alignment, we design a PTAS for SP-alignment within a band

Average c-gap SP Alignment • Key Idea: choose r representative sequences, we find their “correct” alignment in the optimal alignment, by exhaustive search. Then we use this alignment as reference • Then we align every other sequence against this alignment • Then choose the best • All we have to show is that there are r sequences whose letter frequencies in each column of their alignment approximates the complete alignment

Some over-simplified reasoning • If M is optimal average c-gap SP alignment • In this alignment, many sequences have less than cl gaps. • So if we take r of these sequences, and try every possibly way, one way coincides with M • Then hopefully, its letter frequencies in each column “more or less” approximates that of M’s • Then we can simply optimally align all the rest of the sequences one by one according to this frequency matrix

j j Sampling r sequences Complete Alignment Alignment with r sequences We also expect this column has ~ k percent a’s If this column has k percent a’s

for L=m to nm { for any r sequences { for all possible alignment M’ of length L and with no more than cl gaps { align all other sequences to M’//one alignment }}} Output the best alignment AverageSPAlign

Basic Idea Dynamically cut sequences into segments Each segment satisfies the average c-gap condition. Hence use previous algorithm Then assemble the segments together Divide and Conquer Cutting these sequences into 6 segments, each segment has c-gaps per sequence on average in optimal alignment SP Alignment within c-Band

while (not finished) { find a maximum prefix for each sequence (same length) such that AverageSPAlign returns “low”cost. Keep the multiple alignment for this segment } Concatenate the multiple alignments for all segments together to as final alignment The final algorithm: diagonalAlign

UNIVERSITA’ DI MILANO-BICOCCA LAUREA MAGISTRALE IN BIOINFORMATICA