
Ivan G. Costa Filho igcf@cin.ufpe.br Centro de Informática Universidade Federal de Pernambuco


Presentation Transcript


  1. Ivan G. Costa Filho, igcf@cin.ufpe.br, Centro de Informática, Universidade Federal de Pernambuco • String Processing

  2. Topics • Biological strings • Basic problems • pairwise/multiple alignment • motif search • modeling of protein families • Methods • Dynamic programming algorithms • hidden Markov models • probabilistic methods

  3. Course • Lectures – March/April • introduction of basic concepts/methods • Practical classes • Seminars – April/May • presentation of course topics • Individual – graduate students • pairs – undergraduates • Project – May to June • analysis of real data (from the papers discussed), in groups

  4. Assessment • 40% - seminar presentation • evaluation by classmates and attendance • 20% - exercise lists • 40% - group project • individual grade - each group is responsible for describing each member's participation

  5. Bibliography • R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998 • N. Jones, P. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004 • See the course page for the specific literature for each lecture • www.cin.ufpe.br/~igcf

  6. Molecular Biology

  7. Understanding life at the cellular level • How genetic information is inherited • How genetic information influences cellular processes • How genes work together to carry out a cellular function

  8. Genetic Information - DNA • DNA (deoxyribonucleic acid) • Chain of nucleotides • 4 types: A, C, G, T • forms a double strand through complementarity • A pairs with T, and C pairs with G
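
The pairing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `reverse_complement` is hypothetical, and the reversal reflects the fact that the two strands of the double helix run in opposite directions.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    # Complement every base, then reverse because the strands are antiparallel.
    return dna.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCTGACC"))
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.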

  9. Central Dogma - Transcription • Transcription • DNA to RNA • RNA (ribonucleic acid) • single strand • 4 types: A, C, G, U • Unstable molecules • Carries information from the nucleus to the cytoplasm

  10. Central Dogma - Transcription • Transcription: copies the base sequence of the DNA into RNA (with U instead of T).
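
As a rough sketch of the rule "copy the bases, with U instead of T": the helper below is hypothetical and assumes the input is the coding (sense) strand, so that the RNA has the same sequence apart from the T to U substitution.

```python
def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    # Assumption: `dna` is the coding strand, so no complementing is needed.
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    # First nine bases of the alpha-D sequence shown later in the slides.
    print(transcribe("ATGCTGACC"))
```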

  11. Central Dogma - Translation • Translation • RNA -> Proteins • carried out by the ribosome • Genetic code • Proteins • chain of amino acids • 20 different types • folds into a three-dimensional structure • functional entities of the cell

  12. Translation - Genetic Code • Each codon (a combination of 3 bases) encodes one of the 20 amino acids.
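
A minimal sketch of codon-by-codon translation, assuming the standard genetic code. The table is built from the usual compact layout (bases in the order U, C, A, G, with `*` marking stop codons); the names `CODON_TABLE` and `translate` are chosen here for illustration and do not come from the slides.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G
# (third base varies fastest). '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate an mRNA, reading from position 0 until a stop codon."""
    mrna = mrna.upper()
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGUGCUGUAA"))
```

The sketch always starts at the first base; locating the correct reading frame and start codon in a real transcript is a separate problem.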

  13. Central Dogma • Dogma: information flows DNA → mRNA → Protein • Gene: DNA segment encoding a protein. • Transcript: RNA segment transcribed from a gene. • A gene corresponds to a protein and a cellular function.

  14. Control of Gene Expression • How is gene expression controlled? • Certain proteins, called transcription factors, bind to the DNA and are responsible for initiating transcription.

  15. Control of Gene Regulation

  16. Bioinformatics • Manage molecular biological data • Store in databases, organise, formalise, describe... • Compare molecular biological data • Find patterns in molecular biological data • phylogenies • correlations (sequence / structure / expression / function / disease) • Goals: • characterise biological patterns & processes • predict biological properties • low level data ⇒ high level properties (e.g., sequence ⇒ function)

  17. Bioinformatics: neighbour disciplines • Computational biology • Broader concept: includes computational ecology, physiology, neurology etc. • -omics: • Genomics • Transcriptomics • Proteomics • Systems biology • Putting it all together... • Building models, identifying control & regulation

  18. Molecular biology data... • DNA sequences >alpha-D ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA >alpha-A ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG CACCGTCCTTACTGCCAAGTACCGTTAA
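
The two sequences on this slide are written in FASTA format: a header line starting with `>` followed by one or more lines of sequence. A minimal parsing sketch is given below; `parse_fasta` is a hypothetical helper written for illustration, and the example input uses only the first bases of each record.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {record name: sequence} dictionary."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line: start a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence line of the current record
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    example = ">alpha-D\nATGCTG\nACC\n>alpha-A\nATGGTG\n"
    for record_name, sequence in parse_fasta(example).items():
        print(record_name, len(sequence))
```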

  19. Molecular biology data... • Amino acid sequences • Protein structure: • X-ray crystallography • NMR

  20. Cell biology & proteomics data... • Subcellular localization

  21. Prediction Methods • Homology / Alignment • Simple pattern ("word") recognition • Statistical methods • Weight matrices: calculate amino acid probabilities • Other examples: Regression, variance analysis, clustering • Machine learning • Like statistical methods, but parameters are estimated by iterative training rather than direct calculation • Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM) • Combinations

  22. Similarity between sequences • If two sequences look similar, the explanation may be: • Homology (common descent) • Convergent evolution (common function → common selective pressure) • Chance!

  23. Sequences are related • Darwin: all organisms are related through descent with modification • => Sequences are related through descent with modification • => Similar molecules have similar functions in different organisms Phylogenetic tree based on ribosomal RNA: three domains of life

  24. Sequences are related II Phylogenetic tree of globin-type proteins found in humans

  25. Why compare sequences? • Determination of evolutionary relationships • Prediction of protein function and structure (database searches) • Example: Protein 1 binds oxygen; if Protein 2 has a similar sequence, does it also bind oxygen?

  26. Biological Databases • Vast amounts of biological and sequence data are freely available through online databases • Computational algorithms are used to store large amounts of biological data efficiently • Examples • NCBI GenBank http://ncbi.nih.gov • Huge collection of databases, the most prominent being the nucleotide sequence database • Protein Data Bank http://www.pdb.org • Database of protein tertiary structures • SWISSPROT http://www.expasy.org/sprot/ • Database of annotated protein sequences • PROSITE http://kr.expasy.org/prosite • Database of protein active site motifs

  27. Sequence Alignment

  28. BLAST • A computational tool that compares query sequences with entries in current biological databases. • A useful tool for predicting the function of an unknown sequence based on alignment similarity to known genes.

  29. BLAST

  30. Some Early Roles of Bioinformatics • Sequence comparison • Searches in sequence databases

  31. Biological Sequence Comparison • Needleman-Wunsch, 1970 • Dynamic programming algorithm to align sequences
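
To make the dynamic programming idea concrete, here is a small sketch of global alignment in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -1) are assumptions picked for illustration; the slides do not specify a scoring scheme, and real applications typically use substitution matrices and different gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
    print(best)
    print(row_a)
    print(row_b)
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, so the running time and memory both grow with the product of the two sequence lengths.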

  32. Searching for Localization Signals

  33. Protein sorting in eukaryotes • Proteins belong in different organelles of the cell, and some even have their function outside the cell • Günter Blobel was awarded the 1999 Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

  34. Protein sorting: secretory pathway / ER • Secretory proteins have a signal peptide • Initially, they are transported across the ER membrane

  35. Signal peptides • A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region. • Signal peptides differ between proteins, and can be hard to recognize.

  36. Simple pattern ("word") recognition • Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence ("KDEL-signal"). • Pattern: [KRHQSA]-[DENQ]-E-L • NB: only yes/no answers!
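
The PROSITE pattern on this slide maps directly onto a regular expression. The sketch below handles only the two elements that appear in it (single residues and bracketed sets separated by `-`), not the full PROSITE syntax; the function names and the test peptides are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern (residues and [..] sets) to a regex."""
    # '-' only separates positions, so dropping it yields a valid regex
    # for patterns made of literal residues and bracketed alternatives.
    return "".join(pattern.split("-"))


ER_TARGET = re.compile(prosite_to_regex("[KRHQSA]-[DENQ]-E-L"))


def has_er_signal(protein: str) -> bool:
    """Yes/no answer: does the pattern occur anywhere in the sequence?"""
    return ER_TARGET.search(protein.upper()) is not None


if __name__ == "__main__":
    print(has_er_signal("MKTAYIAKDEL"))  # made-up peptide ending in KDEL
    print(has_er_signal("MKTAYIAK"))     # made-up peptide without the signal
```

As the slide notes, the result is a plain yes/no: the pattern gives no indication of how good or how borderline a match is.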

  37. Statistical Methods • Estimate probabilities for nucleotides / amino acids • Information content in sequences; logos; Position Weight Matrices • Quantitative answers • Example peptides: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
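
As an illustration of "estimating probabilities", the sketch below computes per-position amino acid frequencies and the information content of each position for the ten peptides listed on this slide. It is a simplified version under stated assumptions: no pseudocounts, no sequence weighting, and a uniform background over 20 amino acids. The function names are hypothetical.

```python
from collections import Counter
from math import log2

PEPTIDES = (
    "ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV "
    "GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV"
).split()


def position_frequencies(seqs):
    """Return one {amino acid: observed frequency} dict per alignment column."""
    length = len(seqs[0])
    return [
        {aa: n / len(seqs) for aa, n in Counter(s[i] for s in seqs).items()}
        for i in range(length)
    ]


def information_content(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * log2(p) for p in column.values())
    return log2(alphabet_size) - entropy


if __name__ == "__main__":
    for position, column in enumerate(position_frequencies(PEPTIDES), start=1):
        best = max(column, key=column.get)
        print(position, best, round(information_content(column), 2))
```

Because the first five peptides are nearly identical, raw counts overstate how conserved some positions are, which is one reason practical weight matrices usually add sequence weighting and pseudocounts.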

  38. Motif Search

  39. Random Sample atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtacatgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttataggtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

  40. Implanting Motif AAAAAAAAGGGGGGG atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

  41. Where is the Implanted Motif? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

  42. Implanting Motif AAAAAAAAGGGGGGG with Four Mutations atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

  43. Where is the Motif??? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

  44. Why Is Finding a (15,4) Motif Difficult? atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
      AgAAgAAAGGttGGG
      ..|..|||.|..|||
      cAAtAAAAcGGcGGG
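
The alignment of the two implanted instances can be checked numerically. Each instance differs from the motif AAAAAAAAGGGGGGG in 4 positions, yet they differ from each other in 7 (in general, two such instances can differ in up to 8), so comparing instances pairwise gives only a weak signal. The helpers below are a hypothetical sketch; the short test string in the usage example is made up around one instance from the slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Case is ignored: the slides use upper/lower case only as highlighting.
    return sum(x != y for x, y in zip(a.lower(), b.lower()))


def approximate_occurrences(text: str, pattern: str, max_mismatches: int):
    """Start positions of all windows within max_mismatches of pattern."""
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming(text[i:i + k], pattern) <= max_mismatches
    ]


if __name__ == "__main__":
    motif = "AAAAAAAAGGGGGGG"
    first, second = "AgAAgAAAGGttGGG", "cAAtAAAAcGGcGGG"
    print(hamming(first, motif), hamming(second, motif), hamming(first, second))
    print(approximate_occurrences("cc" + first + "cc", motif, 4))
```

Note that `approximate_occurrences` assumes the motif is already known; in the motif-finding problem posed by these slides, the motif itself is the unknown.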

  45. Next Lecture • Read chapter 1 of Durbin • Introduction to dynamic programming algorithms (10/08)

  46. Acknowledgements • Some slides taken from • Biological Sequence Analysis course, CBS, Technical University of Denmark • Neil Jones, University of California at San Diego
