1.62k likes | 1.63k Views
Explore the best techniques for multiple sequence alignment in the genomic era, including tools for comparing different types of sequences such as genes, genomes, promoters, motifs, nucleosomes, and more.
E N D
最佳的多重序列比對方法針對基因組領域 Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program
In- SilVo Biology • In Silico Biology • Making Sense of digital data • In Vivo Biology • Recording data in a living Cell • In SilVo Biology • Connect In-Vitro and In-Vivo • In-Vivo: High-throughput recording • In-Silico: High-Throughput analysis
Is it Possible to Compare all Types of Sequences ? • Non Transcribed World • Genes/Full Genomes • Lagan, TBA • Promoter Regions • Meta-Aligner • Motifs Finders • Nucleosome • ??? • Multiple Genome Aligners • Not Very Accurate • Very Fast • Deal with rearrangements
Multiple Genome Alignments andre-sequencing • Before • Re-sequence Human Genomes • Map the Reads onto the reference genome • Now • Re-sequence • Assemble • Align • Non trivial with very large datasets
Is it Possible to Compare all Types of Sequences ? • RNA Comparison • Less Accurate than Proteins • Secondary Structures • ncRNA World • Sankoff • Time O(L2n) • Space O(L3n) • Consan • R-Coffee
Is it Possible to Compare all Types of Sequences ? • Protein Comparisons • Very Accurate • 3D-Structure Improves it • Protein Aligners • ClustalW • T-Coffee • 3D-Coffee
What Changes with 1000 Genomes? Phylogeny
Phylogeny Vs Function • Function • Low level => Biochemistry => Protein Domains • High Level => Metabolic Pathway => Orthology • Orthology • Phylogenetic Analysis • Phylogenetic Analysis =>Accurate Alignments
apparent one2one one2many one2one many2many one2one Duplication node Speciation node or leaf (Adpated from “Going beyond AGC and T, E. Birney)
Using The tree Correct Tree Correct Orthologous Assignment Correct Functional Prediction
Genomic Era: The Goal • 10.000 Sequences: interspecies • 1 Billion: Re-sequencing • Incorporation of ALL experimental Data • Structure, Genomic, ChIp-Chip, ChIp-Seq… • Alignments suitable for all applications of comparative genomics • Homology Modeling (function) • Functional Analysis • Phylogenetic Reconstruction • 3D-Modelling • Accurate Alignments for ALL kind of data • Non Transcribed DNA • Transcribed DNA • Translated DNA
Accuracy Proteins: 30% is the limit DNA/RNA 70% is the limit Scale With too many sequences algorithms lose in accuracy Data Integration Structure Homology Genomic Structure Function Proteomics Methods Wealth of alternative methods Poorly Characterized Genomic Era Challenges
Consistency and Data Integration • Most methods rely on the progressive algorithm • Consistency based methods have been designed as an extension • Consistency based alignment methods have been designed to: • Better extract the signal contained in the data • Integrate/Confront existing methods • Integrate/Confront heterogeneous types of Information
T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE FAST CAT SeqC GARFIELD THE VERY FAST CAT SeqD THE FAT CAT SeqA GARFIELD THE LAST FA-T CAT SeqB GARFIELD THE FAST CA-T --- SeqC GARFIELDTHEVERY FAST CAT SeqD -------- THE ---- FA-T CAT
T-Coffee and Concistency… SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELDTHE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100 SeqD -------- THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CATPrim. Weight =100 SeqC GARFIELDTHEVERY FAST CAT SeqC GARFIELDTHEVERY FAST CAT Prim. Weight =100 SeqD -------- THE ---- FA-T CAT
SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELDTHE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100 SeqD -------- THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CATPrim. Weight =100 SeqC GARFIELDTHEVERY FAST CAT SeqC GARFIELDTHEVERY FAST CAT Prim. Weight =100 SeqD -------- THE ---- FA-T CAT SeqA GARFIELD THE LAST FAT CAT Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELDTHE VERY FAST CAT SeqB GARFIELD THE ---- FAST CAT SeqA GARFIELD THE LAST FA-T CAT Weight =100 SeqD -------- THE ---- FA-T CAT SeqB GARFIELD THE ---- FAST CAT T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT Weight =88 SeqB GARFIELD THE FAST CAT --- SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELDTHE VERY FAST CAT SeqB GARFIELD THE ---- FAST CAT SeqA GARFIELD THE LAST FA-T CAT Weight =100 SeqD -------- THE ---- FA-T CAT SeqB GARFIELD THE ---- FAST CAT T-Coffee and Concistency…
Methods Scalability Data
A Brief History of Consistency A Long Chain of Small Contributions…
Gotoh (1990) Iterative strategy using consistency Martin Vingron (1991) Dot Matrices Multiplications Accurate but too stringeant Dialign (1996, Morgenstern) Concistency Agglomerative Assembly T-Coffee (2000, Notredame) Concistency Progressive algorithm ProbCons (2004, Do) T-Coffee with a Bayesian Treatment Biphasic Gap Penalty AMAP (Schwarz, 2007) ProbCons Consistency Replace Progressive alignment with simulated Annealing Hard to distinguish from ProbCons FSA ( Patcher, 2009) AMAP with automated parameter estimation Hard to distinguish from ProbCons Consistency Based Algorithms
Combining Many MSAs into ONE ClustalW MAFFT T-Coffee MUSCLE ???????
Integrating New Types of DataTemplate Based Sequence Alignments
Templates Templates Template Aligner TARGET TARGET TARGET Experimental Data … Experimental Data … Template Alignment Template-Sequence Alignment Template based Alignment of the Sequences Primary Library
Expresso: Finding the Right Structure Sources BLAST BLAST SAP Templates Templates Template Alignment Source Template Alignment Library Remove Templates
What is Homology Extension ? -Simple scoring schemes result in alignment ambiguities L ? L L
What is Homology Extension ? L L Profile 1 L L L L L L L L L L L I L Profile 2 V L I L L L
What is Homology Extension ? L L Profile 1 L L L L L L L L L L L I L V L Profile 2 I L L L
PSI-Coffee: Homology Extension Sources BLAST BLAST Profile Aligner Templates Templates Template Alignment Source Template Alignment Library Remove Templates
Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Consistency Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Homology Extension Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Structural Extension Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
T-Coffee and The World -Some Templates are obtained with a BLAST -Queries can be sent to the EBI or the NCBI -No Need for a Local BLAST installation BLAST/ SOAP Users sequences