Ivan G. Costa Filho igcf@cin.ufpe.br Centro de Informática, Universidade Federal de Pernambuco. String Processing. Topics. Biological Strings Basic Problems pairwise/multiple alignment motif finding modeling of protein families Methods
Ivan G. Costa Filho, igcf@cin.ufpe.br, Centro de Informática, Universidade Federal de Pernambuco. String Processing
Topics • Biological strings • Basic problems • pairwise/multiple alignment • motif finding • modeling of protein families • Methods • Dynamic programming algorithms • hidden Markov models • probabilistic methods
Course • Lectures: March/April • introduction to basic concepts/methods • Practical classes • Seminars: April/May • presentation of course topics • Individual: graduate students • pairs: undergraduates • Project: May to June • analysis of real data (from the papers discussed), in groups
Assessment • 40%: seminar presentations • evaluated by classmates, plus attendance • 20%: exercise lists • 40%: group project • individual grade: each group is responsible for describing each member's participation
Bibliography • R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998. • Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004. • See the course page for the specific literature for each lecture … • www.cin.ufpe.br/~igcf
Understanding life at the cellular level • How genetic information is inherited • How genetic information influences cellular processes • How genes work together to carry out a cellular function
Genetic Information - DNA • DNA (deoxyribonucleic acid) • Chain of nucleotides • 4 types: A, C, G, T • forms a double strand through base complementarity • A pairs with T, and C pairs with G
Central Dogma - Transcription • Transcription • DNA to RNA • RNA (ribonucleic acid) • single-stranded • 4 types: A, C, G, U • Unstable molecules • Carries information from the nucleus to the cytoplasm
Central Dogma - Transcription • Transcription copies the base sequence of the DNA into RNA (with U instead of T).
Central Dogma - Translation • Translation • RNA -> Proteins • carried out by the ribosome • Genetic code • Proteins • chain of amino acids • 20 different types • folds into a three-dimensional structure • the functional entities of the cell
Translation - Genetic Code • Each codon (a triplet of 3 bases) encodes one of the 20 amino acids.
Central Dogma • Dogma: flow of information DNA → mRNA → Protein • Gene: segment of DNA encoding a protein. • Transcript: segment of RNA transcribed from a gene. • In this simplified view, one gene corresponds to one protein and one cellular function.
Control of Gene Expression • How is gene expression controlled? • Certain proteins, called transcription factors, bind to DNA and are responsible for initiating transcription.
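The base-pairing rule on this slide can be illustrated with a small sketch. The function name `reverse_complement` is chosen here for illustration and is not from the slides; it assumes an uppercase DNA string containing only A, C, G and T.

```python
# Complement of each base under the A-T / C-G pairing rule.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    # The two strands run antiparallel, so the complement is reversed.
    return "".join(COMPLEMENT[base] for base in reversed(dna))

if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Reversing matters because the two strands of the double helix are read in opposite directions.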
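As a rough sketch of the rule stated on this slide, transcription can be modeled as a character substitution. The helper `transcribe` is a hypothetical name, and the sketch assumes the input is the coding strand written 5'->3' in uppercase, so the mRNA has the same sequence with U in place of T.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same bases, with T replaced by U."""
    return coding_strand.replace("T", "U")

if __name__ == "__main__":
    print(transcribe("ATGGTGCTG"))
```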
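The codon-to-amino-acid mapping can be sketched as a lookup table. The code below builds the standard genetic code from a compact 64-letter string (bases ordered U, C, A, G; `*` marks a stop codon). The names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes translation starts at the first base and ends at the first stop codon.

```python
BASES = "UCAG"
# Amino acids for the 64 codons UUU, UUC, UUA, UUG, UCU, ... GGG (standard code).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: the ribosome releases the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("AUGGUGCUGUCUGCC"))
```

With 64 codons and only 20 amino acids, several codons map to the same amino acid, which is visible as repeated letters in `AMINO`.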
Bioinformatics • Manage molecular biological data • Store in databases, organise, formalise, describe... • Compare molecular biological data • Find patterns in molecular biological data • phylogenies • correlations (sequence / structure / expression / function / disease) Goals: • characterise biological patterns & processes • predict biological properties • low level data ⇒ high level properties (e.g., sequence ⇒ function)
Bioinformatics: neighbour disciplines • Computational biology • Broader concept: includes computational ecology, physiology, neurology etc... • -omics: • Genomics • Transcriptomics • Proteomics • Systems biology • Putting it all together... • Building models, identifying control & regulation
Molecular biology data... • DNA sequences
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA
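The two records on this slide use the FASTA convention: a header starting with `>` followed by the sequence, usually wrapped over several lines. A minimal reader might look like the sketch below. The function name `parse_fasta` and the toy input are illustrative only; real files may need more care (comments, duplicate names, very large inputs).

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record name (header without '>') to its joined sequence."""
    chunks: dict[str, list[str]] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)  # sequence lines are concatenated later
    return {record: "".join(parts) for record, parts in chunks.items()}

if __name__ == "__main__":
    toy = ">alpha-D\nATGCTG\nACCGAC\n>alpha-A\nATGGTG\n"
    for record, sequence in parse_fasta(toy).items():
        print(record, len(sequence), sequence)
```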
Molecular biology data... • Amino acid sequences • Protein structure: • X-ray crystallography • NMR
Cell biology & proteomics data... • Subcellular localization
Prediction Methods • Homology / Alignment • Simple pattern (“word”) recognition • Statistical methods • Weight matrices: calculate amino acid probabilities • Other examples: Regression, variance analysis, clustering • Machine learning • Like statistical methods, but parameters are estimated by iterative training rather than direct calculation • Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM) • Combinations
Similarity between sequences If two sequences look similar, the explanation may be: • Homology (common descent) • Convergent evolution (common function → common selective pressure) • Chance!
Sequences are related • Darwin: all organisms are related through descent with modification • => Sequences are related through descent with modification • => Similar molecules have similar functions in different organisms Phylogenetic tree based on ribosomal RNA: three domains of life
Sequences are related II Phylogenetic tree of globin-type proteins found in humans
Why compare sequences? • Determination of evolutionary relationships • Prediction of protein function and structure (database searches). Example: Protein 1 binds oxygen; Protein 2 has a similar sequence; does Protein 2 also bind oxygen?
Biological Databases • Vast biological and sequence data is freely available through online databases • Use computational algorithms to efficiently store large amounts of biological data Examples • NCBI GenBank http://ncbi.nih.gov Huge collection of databases, the most prominent being the nucleotide sequence database • Protein Data Bank http://www.pdb.org Database of protein tertiary structures • SWISSPROT http://www.expasy.org/sprot/ • Database of annotated protein sequences • PROSITE http://kr.expasy.org/prosite Database of protein active site motifs
BLAST • A computational tool that allows us to compare query sequences with entries in current biological databases. • A great tool for predicting functions of an unknown sequence based on alignment similarities to known genes.
Some Early Roles of Bioinformatics • Sequence comparison • Searches in sequence databases
Biological Sequence Comparison • Needleman-Wunsch, 1970 • Dynamic programming algorithm to align sequences
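To make the dynamic programming idea concrete, here is a minimal sketch of global alignment in the Needleman-Wunsch style. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption made for this example and is not taken from the slides; real applications typically use substitution matrices and tuned gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.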
Protein sorting in eukaryotes • Proteins belong in different organelles of the cell – and some even have their function outside the cell • Günter Blobel was in 1999 awarded The Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"
Protein sorting: secretory pathway / ER Secretory proteins have a signal peptide Initially, they are transported across the ER membrane
Signal peptides A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region. Signal peptides differ between proteins, and can be hard to recognize.
Simple pattern ("word") recognition Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence ("KDEL-signal"). Pattern: [KRHQSA]-[DENQ]-E-L NB: only yes/no answers!
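A pattern like the one on this slide translates almost directly into a regular expression. The sketch below handles only a small subset of PROSITE syntax (literal residues, `[...]` alternatives, `{...}` exclusions and `x` for any residue), and the helper names are hypothetical. Anchoring the match at the end of the sequence is an assumption added here, on the grounds that the KDEL-type signal sits at the C-terminus.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                          # any residue
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # any residue except these
        else:
            parts.append(element)                      # literal or [..] set
    return "".join(parts)

ER_TARGET = prosite_to_regex("[KRHQSA]-[DENQ]-E-L")

def has_er_signal(protein: str) -> bool:
    """Yes/no answer: does the chain end with a match to the pattern?"""
    return re.search(ER_TARGET + "$", protein) is not None

if __name__ == "__main__":
    print(ER_TARGET)
    print(has_er_signal("MKTLLVAKDEL"), has_er_signal("MKDELTTTAA"))
```

As the slide notes, the answer is binary: a sequence one residue away from the pattern is rejected just as firmly as an unrelated one.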
Statistical Methods • Estimate probabilities for nucleotides / amino acids • Information content in sequences; logos; Position Weight Matrices. • Quantitative answers. ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
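The ten peptides on this slide can be turned into per-position amino acid frequencies, the raw material for a position weight matrix or a sequence logo. The sketch below is a simplified illustration: it uses plain relative frequencies with no pseudocounts, sequence weighting or small-sample correction, and measures information content against a uniform background over 20 amino acids. The function names are chosen here for illustration.

```python
import math
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def column_frequencies(seqs: list[str]) -> list[dict[str, float]]:
    """For each position, the relative frequency of every observed residue."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [
        {aa: count / len(seqs) for aa, count in Counter(s[i] for s in seqs).items()}
        for i in range(length)
    ]

def information_content(freqs: dict[str, float], alphabet_size: int = 20) -> float:
    """Bits of information in one column: log2(alphabet) minus Shannon entropy."""
    return math.log2(alphabet_size) + sum(p * math.log2(p) for p in freqs.values())

if __name__ == "__main__":
    for position, freqs in enumerate(column_frequencies(PEPTIDES), start=1):
        top = max(freqs, key=freqs.get)
        print(position, top, round(freqs[top], 2), round(information_content(freqs), 2))
```

Unlike a yes/no pattern, these frequencies allow a candidate peptide to be scored position by position, giving a quantitative answer.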
Random Sample atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtacatgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttataggtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
Implanting Motif AAAAAAAAGGGGGGG atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Where is the Implanted Motif? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
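When the motif is implanted without mutations, it can be found by looking for a k-mer shared by every sequence. The sketch below shows this on three short toy sequences invented for the example (they are not the sequences on the slide); `common_kmers` is a hypothetical helper name.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def common_kmers(seqs: list[str], k: int) -> set[str]:
    """k-mers that occur, exactly, in every sequence."""
    shared = kmers(seqs[0], k)
    for seq in seqs[1:]:
        shared &= kmers(seq, k)
    return shared

if __name__ == "__main__":
    motif = "AAAAAAAAGGGGGGG"
    # Toy sequences: arbitrary flanks around an exact copy of the motif.
    toy = ["ACGT" + motif + "TTCA", "GGCA" + motif + "CATG", "TTAC" + motif + "ACGG"]
    print(common_kmers(toy, 15))
```

This exact-match strategy stops working as soon as each implanted copy carries its own mutations.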
Implanting Motif AAAAAAAAGGGGGGG with Four Mutations atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Where is the Motif??? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
Why is Finding a (15,4) Motif Difficult? atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
Next Lecture • Read chapter 1 of Durbin • Introduction to dynamic programming algorithms (10/08)
Acknowledgments • Some slides taken from • Biological Sequence Analysis course, CBS, Technical University of Denmark • Neil Jones, University of California at San Diego
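The difficulty can be quantified with the Hamming distance. In the sketch below (function and variable names are illustrative), each of the two instances compared on this slide differs from the 15-letter motif in 4 positions, yet they differ from each other in 7, so pairwise comparison of instances shows much less similarity than comparison against the unknown motif.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

MOTIF = "AAAAAAAAGGGGGGG"
# The two mutated instances aligned on the slide, uppercased for comparison.
INSTANCE_1 = "AgAAgAAAGGttGGG".upper()
INSTANCE_2 = "cAAtAAAAcGGcGGG".upper()

if __name__ == "__main__":
    print(hamming(MOTIF, INSTANCE_1), hamming(MOTIF, INSTANCE_2))
    print(hamming(INSTANCE_1, INSTANCE_2))
```

In general, two instances with 4 mutations each can differ in up to 8 of the 15 positions, which is why the implanted pattern is hard to see by eye or by pairwise matching.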