Bioinformatics. Mehdi Sadeghi, National Institute of Genetic Engineering and Biotechnology; Biochemistry and Biophysics Research Center, University of Tehran
Bioinformatics is interdisciplinary • Computer Science • Biology • Information Management • Biochemistry • Molecular Biology • Theoretical CS • Machine Learning • Data Mining • Biophysics • Applied Mathematics & Statistics
bio – informatics: bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from disciplines such as applied maths, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.
Flow of information • DNA → RNA → protein sequence → protein structure → protein function → …
Reading the Genetic Code • Three nucleotides form a codon
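Reading a sequence codon by codon can be sketched in a few lines. The example below is only an illustration: it assumes the standard genetic code, a DNA coding strand read in frame from its first base, and the hypothetical function name `translate`; none of these details come from the slides.

```python
# Standard genetic code, listed in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna):
    """Translate a DNA coding sequence, three nucleotides at a time.

    '*' marks a stop codon; trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE[dna[start:start + 3]])
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Each step of the loop consumes exactly one codon, which is why the reading frame (where the first codon starts) matters.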
genome protein
Protein structures are depicted in a variety of ways • Backbone only • Ribbon • Space-filling • Space-filling, with surface charge (blue = negative charge, red = positive charge)
Recent Trends • A great surge in genomics • The Human Genome Project • Genome projects for ~400 organisms • >1000 completed, published genomes • Recent advances in molecular genetics technologies, especially microarrays • Push to analyze genes and gene products, and to determine protein structure/function relationships • High-throughput biology, large-scale data analysis
Aims of bioinformatics • First: organize data so that researchers can access existing information and submit new entries • Second: develop tools and resources that aid in the analysis of data • Third: interpret the results in a biologically meaningful manner
General Types of “Informatics Techniques” • Geometry - Robotics - Graphics (Surfaces, Volumes) - Comparison and 3D Matching (Vision, recognition) • Physical Simulation - Newtonian Mechanics - Electrostatics - Numerical Algorithms - Simulation • Databases - Building, Querying - Object DB • Text String Comparison - Text Search - 1D Alignment - Significance Statistics • Finding Patterns - AI / Machine Learning - Clustering - Datamining
Bioinformatics Tools and Services • Databases: text, sequence, structure • Text searches of database annotations • Sequence similarity search tools • Gene finding • Sequence and structure analysis tools • Structure prediction tools • 3D structure visualization tools • Phylogenetic analysis tools • Metabolic analysis tools
Sequence comparison: Gene sequences can be aligned to see similarities between genes from different sources
768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
Multiple sequence alignment: Sequences of proteins from different organisms can be aligned to see similarities and differences
Three sequence recurrence relation
S(i,j,k) = max[
  S(i-1, j-1, k-1) + m(i,j) + m(i,k) + m(j,k),
  S(i-1, j-1, k) + m(i,j) + g,
  S(i-1, j, k-1) + m(i,k) + g,
  S(i, j-1, k-1) + m(j,k) + g,
  S(i-1, j, k) + g + g,
  S(i, j-1, k) + g + g,
  S(i, j, k-1) + g + g ]
m(i,j) = similarity matrix, e.g. BLOSUM
g = gap penalty
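The recurrence can be turned into a small dynamic program. The sketch below fills the three-dimensional table exactly as the seven cases are written on the slide. Several things in it are assumptions made for illustration only: the function names, the toy match/mismatch scorer standing in for a BLOSUM matrix, and the gap penalty of -2.

```python
def simple_score(x, y):
    """Toy similarity function standing in for a real matrix such as BLOSUM."""
    return 1 if x == y else -1


def align3_score(a, b, c, m=simple_score, g=-2):
    """Best alignment score of three sequences under the slide's recurrence."""
    la, lb, lc = len(a), len(b), len(c)
    neg_inf = float("-inf")
    # S[i][j][k] = best score for prefixes a[:i], b[:j], c[:k]
    S = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    S[0][0][0] = 0
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                if i == j == k == 0:
                    continue
                options = []
                if i and j and k:  # all three residues aligned
                    x, y, z = a[i - 1], b[j - 1], c[k - 1]
                    options.append(S[i - 1][j - 1][k - 1] + m(x, y) + m(x, z) + m(y, z))
                if i and j:  # gap in the third sequence
                    options.append(S[i - 1][j - 1][k] + m(a[i - 1], b[j - 1]) + g)
                if i and k:  # gap in the second sequence
                    options.append(S[i - 1][j][k - 1] + m(a[i - 1], c[k - 1]) + g)
                if j and k:  # gap in the first sequence
                    options.append(S[i][j - 1][k - 1] + m(b[j - 1], c[k - 1]) + g)
                if i:  # one residue against two gaps
                    options.append(S[i - 1][j][k] + g + g)
                if j:
                    options.append(S[i][j - 1][k] + g + g)
                if k:
                    options.append(S[i][j][k - 1] + g + g)
                S[i][j][k] = max(options)
    return S[la][lb][lc]


print(align3_score("ACG", "ACG", "ACG"))
```

The three nested loops visit every cell of the grid once, which is the cost discussed on the next slide.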
Dynamic programming time increases exponentially • Clearly, for N sequences, with sequence i being L_i characters long, the time required is $O\left(\prod_{i=1}^{N} L_i\right)$ • For sequences of similar length L this is exponential in the number of sequences: $O(L^N)$ • We need to fill out each ‘box’ in the grid
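To get a feel for this growth, the number of grid cells can be counted directly. This is a rough illustration only: the function name is hypothetical, and the count includes the extra row of each dimension that holds the empty prefix.

```python
import math


def dp_grid_cells(lengths):
    """Number of cells in the dynamic programming grid for the given sequence lengths."""
    return math.prod(length + 1 for length in lengths)


for n_sequences in (2, 3, 4, 5):
    print(n_sequences, dp_grid_cells([100] * n_sequences))
```

Every added sequence of length 100 multiplies the number of cells by about 100, so exact alignment becomes impractical after only a few sequences.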
Motifs and Transcriptional Start Sites [Figure: five genes, each with a motif instance near its start site: ATCCCG, TTCCGG, ATCCCG, ATGCCG, ATGCCC]
Motif Logo • Motifs can mutate at unimportant bases • The five motifs in five different genes have mutations in positions 3 and 5 • Representations called motif logos illustrate the conserved and variable regions of a motif TGGGGGA TGAGAGA TGGGGGA TGAGAGA TGAGGGA
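One common way to quantify what a logo draws is the information content of each column, measured in bits. The sketch below applies it to the five motif instances on this slide. It assumes the simple formula 2 minus the column's Shannon entropy, without the small-sample correction that logo tools often apply, and the function name is hypothetical.

```python
import math
from collections import Counter

INSTANCES = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]


def column_information(instances):
    """Information content (bits) of each column: 2 - Shannon entropy."""
    bits = []
    for column in zip(*instances):
        total = len(column)
        entropy = -sum(
            (count / total) * math.log2(count / total)
            for count in Counter(column).values()
        )
        bits.append(2.0 - entropy)
    return bits


for position, value in enumerate(column_information(INSTANCES), start=1):
    print(position, round(value, 3))
```

Fully conserved columns reach the maximum of 2 bits; only positions 3 and 5 fall below it, matching the mutations noted above.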
Random Sample atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtacatgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttataggtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
Implanting Motif AAAAAAAAGGGGGGG atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Where is the Implanted Motif? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
Implanting Motif AAAAAAAAGGGGGGG with Four Mutations atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Where is the Motif??? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
Why Is Finding a (15,4) Motif Difficult? atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Each instance differs from the motif in 4 positions, so two instances can differ from each other in up to 8 positions. The first two instances match at only 8 of 15 positions:
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
Challenge Problem • Find a motif in a sample of - 20 “random” sequences (e.g. 600 nt long), - each sequence containing an implanted pattern of length 8, - each pattern appearing with 2 mismatches, i.e. an (8,2)-motif.
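A test set for this challenge can be generated along the following lines. This is a minimal sketch under stated assumptions: uniformly random background sequences, one implanted instance per sequence at a random position, exactly 2 substitutions per instance, and a fixed seed for reproducibility. The function name and its parameters are hypothetical.

```python
import random


def implant_motif(t=20, n=600, l=8, d=2, seed=0):
    """Return (motif, sequences, starts) for an (l, d) implanted-motif sample.

    starts holds the 0-based position of the implanted instance in each sequence.
    """
    rng = random.Random(seed)
    alphabet = "acgt"
    motif = "".join(rng.choice(alphabet) for _ in range(l))
    sequences, starts = [], []
    for _ in range(t):
        background = [rng.choice(alphabet) for _ in range(n)]
        instance = list(motif)
        # Mutate d distinct positions, each to a different nucleotide.
        for pos in rng.sample(range(l), d):
            instance[pos] = rng.choice([x for x in alphabet if x != motif[pos]])
        start = rng.randrange(n - l + 1)
        background[start:start + l] = instance
        sequences.append("".join(background))
        starts.append(start)
    return motif, sequences, starts


motif, sequences, starts = implant_motif()
print(motif, len(sequences), len(sequences[0]))
```

Keeping the true start positions alongside the sequences makes it possible to check later whether a motif finder recovered the implanted instances.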
Identifying Motifs: Complications • We do not know the motif sequence • We do not know where it is located relative to the gene’s start • Motifs can differ slightly from one gene to the next • How to discern it from “random” motifs?
The Motif Finding Problem (cont’d) • The patterns revealed with no mutations:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Consensus string: acgtacgt
The Motif Finding Problem (cont’d) • The patterns with 2 point mutations:
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
• Can we still find the motif, now that we have 2 mutations?
Defining Motifs • To define a motif, let’s say we know where the motif starts in each sequence • The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)
Motifs: Profiles and Consensus • Line up the patterns by their start indexes s = (s1, s2, …, st) • Construct matrix profile with frequencies of each nucleotide in columns • Consensus nucleotide in each position has the highest score in column
Alignment    a G g t a c T t
             C c A t a c g t
             a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
Profile   A  3 0 1 0 3 1 1 0
          C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _______________
Consensus    A C G T A C G T
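The profile and consensus on this slide can be recomputed from the five aligned patterns. The sketch below is only an illustration: the function names are hypothetical, letter case is ignored, and ties (none occur in this example) would be broken in A, C, G, T order.

```python
ALIGNED = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
NUCLEOTIDES = "ACGT"


def build_profile(patterns):
    """Count each nucleotide in every column of the aligned patterns."""
    width = len(patterns[0])
    profile = {nt: [0] * width for nt in NUCLEOTIDES}
    for pattern in patterns:
        for column, letter in enumerate(pattern.upper()):
            profile[letter][column] += 1
    return profile


def consensus(profile):
    """Pick the most frequent nucleotide in each column."""
    width = len(profile["A"])
    return "".join(
        max(NUCLEOTIDES, key=lambda nt: profile[nt][column])
        for column in range(width)
    )


profile = build_profile(ALIGNED)
for nt in NUCLEOTIDES:
    print(nt, *profile[nt])
print(consensus(profile))
```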
Consensus • Think of consensus as an “ancestor” motif, from which mutated motifs emerged • The distance between a real motif and the consensus sequence is generally less than that for two real motifs
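This claim can be checked on the five mutated patterns from the profile slide. The sketch below assumes Hamming distance (number of mismatching positions, ignoring case) as the distance measure; the function name is hypothetical.

```python
from itertools import combinations

PATTERNS = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
CONSENSUS = "ACGTACGT"


def hamming(x, y):
    """Number of positions at which two equal-length strings differ (case-insensitive)."""
    return sum(a != b for a, b in zip(x.upper(), y.upper()))


to_consensus = [hamming(p, CONSENSUS) for p in PATTERNS]
pairwise = [hamming(p, q) for p, q in combinations(PATTERNS, 2)]

mean_to_consensus = sum(to_consensus) / len(to_consensus)
mean_pairwise = sum(pairwise) / len(pairwise)
print(mean_to_consensus, mean_pairwise)
```

Each pattern is 2 mismatches away from the consensus, while two patterns are on average further apart from each other, because their mutations rarely fall on the same positions.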
Evaluating Motifs • We have a guess about the consensus sequence, but how “good” is this consensus? • Need to introduce a scoring function to compare different guesses and choose the “best” one.
Defining Some Terms • t - number of sample DNA sequences • n - length of each DNA sequence • DNA - sample of DNA sequences (t x n array) • l - length of the motif (l-mer) • si - starting position of an l-mer in sequence i • s=(s1, s2,… st) - array of motif’s starting positions
Parameters • l = 8 • t = 5 • n = 69 • s = (s1, s2, s3, s4, s5) = (26, 21, 3, 56, 60)
DNA
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
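Given these parameters, the l-mers selected by s can be read straight out of the DNA array. The sketch below assumes the start positions are 1-based, which is the reading under which they reproduce the patterns aligned on the profile slide; the function name is hypothetical.

```python
DNA = [
    "cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc",
]
S = (26, 21, 3, 56, 60)  # 1-based start positions s1..s5
L = 8                    # motif length l


def extract_lmers(dna, starts, l):
    """Return the l-mer beginning at each 1-based start position."""
    return [seq[start - 1:start - 1 + l] for seq, start in zip(dna, starts)]


for lmer in extract_lmers(DNA, S, L):
    print(lmer)
```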