
Molecular Data


noam

Presentation Transcript


  1. Molecular Data • DNA/RNA • Protein • Expression • Interaction

  2. A sequence • A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

  3. Character representation of sequences • DNA or RNA • use 1-letter codes (e.g., A,C,G,T) • protein • use 1-letter codes • can convert to/from 3-letter codes http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
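The conversion between 1-letter and 3-letter protein codes can be sketched as a pair of lookup tables. This is a minimal illustration, not part of the original slides: the helper names `to_three_letter` and `to_one_letter` and the hyphen separator are assumptions made for the example, and only the 20 standard amino acids are covered.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Reverse table, built from the forward one so the two cannot disagree.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein):
    """Convert a 1-letter protein string to hyphen-separated 3-letter codes."""
    return "-".join(ONE_TO_THREE[aa] for aa in protein.upper())


def to_one_letter(three_letter):
    """Convert hyphen-separated 3-letter codes back to a 1-letter string."""
    return "".join(THREE_TO_ONE[code.capitalize()]
                   for code in three_letter.split("-"))
```

For example, `to_three_letter("MSEP")` gives `"Met-Ser-Glu-Pro"`, and `to_one_letter` reverses it. Unknown symbols raise a `KeyError` in this sketch.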

  4. The I.U.B. Code • Proposed by the International Union of Biochemistry • A, C, G, T, U • R = A, G (puRine) • Y = C, T (pYrimidine) • S = G, C (Strong hydrogen bonds) • W = A, T (Weak hydrogen bonds) • M = A, C (aMino group) • K = G, T (Keto group) • B = C, G, T (not A) • D = A, G, T (not C) • H = A, C, T (not G) • V = A, C, G (not T/U) • N = A, C, G, T/U (iNdeterminate) • X or - are sometimes used
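The code table above can be written down as a dictionary and used to enumerate the concrete sequences that an ambiguous pattern stands for. This is a sketch under stated assumptions: the names `IUB` and `expand` are made up for the example, U is treated as T (a DNA alphabet), and the occasional X and - symbols are left out.

```python
from itertools import product

# Each I.U.B. symbol maps to the set of nucleotides it may stand for.
# Assumption: U is treated as T so that everything lives in a DNA alphabet.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(pattern):
    """List every concrete DNA sequence matched by an I.U.B. pattern."""
    choices = [IUB[symbol] for symbol in pattern.upper()]
    return ["".join(combo) for combo in product(*choices)]
```

For instance, `expand("ART")` yields `["AAT", "AGT"]`. The number of expansions grows multiplicatively with each ambiguous position, so this is only practical for short patterns.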

  5. DNA code

  6. Fasta format >gi|17978494|ref|NM_078467.1| Homo sapiens cyclin-dependent kinase inhibitor AGCTGAGGTGTGAGCAGCTGCCGAAGTCAGTTCCTTGTGGAGCCGGAGCTGGGCGCGGATTCGCCGAGGC ACCGAGGCACTCAGAGGAGGTGAGAGAGCGGCGGCAGACAACAGGGGACCCCGGGCCGGCGGCCCAGAGC CGAGCCAAGCGTGCCCGCGTGTGTCCCTGCGTGTCCGCGAGGATGCGTGTTCGCGGGTGTGTGCTGCGTT CACAGGTGTTTCTGCGGCAGGCGCCATGTCAGAACCGGCTGGGGATGTCCGTCAGAACCCATGCGGCAGC AAGGCCTGCCGCCGCCTCTTCGGCCCAGTGGACAGCGAGCAGCTGAGCCGCGACTGTGATGCGCTAATGG CGGGCTGCATCCAGGAGGCCCGTGAGCGATGGAACTTCGACTTTGTCACCGAGACACCACTGGAGGGTGA CTTCGCCTGGGAGCGTGTGCGGGGCCTTGGCCTGCCCAAGCTCTACCTTCCCACGGGGCCCCGGCGAGGC CGGGATGAGTTGGGAGGAGGCAGGCGGCCTGGCACCTCACCTGCTCTGCTGCAGGGGACAGCAGAGGAAG ACCATGTGGACCTGTCACTGTCTTGTACCCTTGTGCCTCGCTCAGGGGAGCAGGCTGAAGGGTCCCCAGG TGGACCTGGAGACTCTCAGGGTCGAAAACGGCGGCAGACCAGCATGACAGATTTCTACCACTCCAAACGC CGGCTGATCTTCTCCAAGAGGAAGCCCTAATCCGCCCACAGGAAGCCTGCAGTCCTGGAAGCGCGAGGGC CTCAAAGGCCCGCTCTACATCTTCTGCCTTAGTCTCAGTTTGTGTGTCTTAATTATTATTTGTGTTTTAA TTTAAACACCTCCTCATGTACATACCCTGGCCGCCCCCTGCCCCCCAGCCTCTGGCATTAGAATTATTTA AACAAAAACTAGGCGGTTGAATGAGAGGTTCCTAAGAGTGCTGGGCATTTTTATTTTATGAAATACTATT TAAAGCCTCCTCATCCCGTGTTCTCCTTTTCCTCTCTCCCGGAGGTTGGGTGGGCCGGCTTCATGCCAGC TACTTCCTCCTCCCCACTTGTCCGCTGGGTGGTACCCTCTGGAGGGGTGTGGCTCCTTCCCATCGCTGTC ACAGGCGGTTATGAAATTCACCCCCTTTCCTGGACACTCAGACCTGAATTCTTTTTCATTTGAGAAGTAA ACAGATGGCACTTTGAAGGGGCCTCACCGAGTGGGGGCATCATCAAAAACTTTGGAGTCCCCTCACCTCC TCTAAGGTTGGGCAGGGTGACCCTGAAGTGAGCACAGCCTAGGGCTGAGCTGGGGACCTGGTACCCTCCT GGCTCTTGATACCCCCCTCTGTCTTGTGAAGGCAGGGGGAAGGTGGGGTCCTGGAGCAGACCACCCCGCC TGCCCTCATGGCCCCTCTGACCTGCACTGGGGAGCCCGTCTCAGTGTTGAGCCTTTTCCCTCTTTGGCTC CCCTGTACCTTTTGAGGAGCCCCAGCTACCCTTCTTCTCCAGCTGGGCTCTGCAATTCCCCTCTGCTGCT GTCCCTCCCCCTTGTCCTTTCCCTTCAGTACCCTCTCAGCTCCAGGTGGCTCTGAGGTGCCTGTCCCACC CCCACCCCCAGCTCAATGGACTGGAAGGGGAAGGGACACACAAGAAGAAGGGCACCCTAGTTCTACCTCA GGCAGCTCAAGCAGCGACCGCCCCCTCCTCTAGCTGTGGGGGTGAGGGTCCCATGTGGTGGCACAGGCCC CCTTGAGTGGGGTTATCTCTGTGTTAGGGGTATATGATGGGGGAGTAGATCTTTCTAGGAGGGAGACACT 
GGCCCCTCAAATCGTCCAGCGACCTTCCTCATCCACCCCATCCCTCCCCAGTTCATTGCACTTTGATTAG CAGCGGAACAAGGAGTCAGACATTTTAAGATGGTGGCAGTAGAGGCTATGGACAGGGCATGCCACGTGGG CTCATATGGGGCTGGGAGTAGTTGTCTTTCCTGGCACTAACGTTGAGCCCCTGGAGGCACTGAAGTGCTT AGTGTACTTGGAGTATTGGGGTCTGACCCCAAACACCTTCCAGCTCCTGTAACATACTGGCCTGGACTGT TTTCTCTCGGCTCCCCATGTGTCCTGGTTCCCGTTTCTCCACCTAGACTGTAAACCTCTCGAGGGCAGGG ACCACACCCTGTACTGTTCTGTGTCTTTCACAGCTCCTCCCACAATGCTGAATATACAGCAGGTGCTCAA TAAATGATTCTTAGTGACTTTAAAAAAAAAAAAAAAAAAAA
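Reading a record like the one above takes only a few lines. The following is a minimal sketch, assuming the usual FASTA layout of a header line starting with `>` followed by one or more sequence lines; `parse_fasta` is a hypothetical helper name, and only the first two sequence lines of the record are reproduced in the example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|17978494|ref|NM_078467.1| Homo sapiens cyclin-dependent kinase inhibitor
AGCTGAGGTGTGAGCAGCTGCCGAAGTCAGTTCCTTGTGGAGCCGGAGCTGGGCGCGGATTCGCCGAGGC
ACCGAGGCACTCAGAGGAGGTGAGAGAGCGGCGGCAGACAACAGGGGACCCCGGGCCGGCGGCCCAGAGC
"""
records = parse_fasta(example)
header, sequence = records[0]
```

Here `header` holds the text after `>` and `sequence` is the two lines joined into one string with no line breaks.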

  7. Sequence Content • Mononucleotide frequencies • GC content • Dinucleotide frequencies • CpG islands

  8. GC content is non-random (Lander et al.)

  9. GC content and expression

  10. Determining mononucleotide frequencies • Alphabet: A T C G • Count how many times each nucleotide appears in the sequence • Divide (normalize) by the total number of nucleotides • fA: mononucleotide frequency of A (frequency with which A is observed) • pA: mononucleotide probability of A (probability that a nucleotide will be an A) http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
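The counting and normalizing steps above, together with GC content, can be sketched as follows. The function names are illustrative, and the sketch assumes a non-empty DNA sequence; symbols outside the alphabet are ignored when computing the total.

```python
from collections import Counter


def mononucleotide_frequencies(seq, alphabet="ACGT"):
    """Count each nucleotide and normalize by the number of nucleotides."""
    counts = Counter(seq.upper())
    # Only symbols in the alphabet contribute to the total.
    total = sum(counts[base] for base in alphabet)
    return {base: counts[base] / total for base in alphabet}


def gc_content(seq):
    """Fraction of the sequence that is G or C."""
    freqs = mononucleotide_frequencies(seq)
    return freqs["G"] + freqs["C"]


freqs = mononucleotide_frequencies("ATTCGACCAGAG")
```

For the 12-nucleotide sequence `ATTCGACCAGAG` the counts are A = 4, C = 3, G = 3, T = 2, so fA = 1/3 and the GC content is 0.5.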

  11. Determining dinucleotide frequencies • Make a 4 x 4 matrix, one element for each ordered pair of nucleotides • Set all elements to zero • Go through the sequence linearly, adding one to the matrix entry corresponding to the pair of sequence elements observed at that position • Divide by the total number of dinucleotides • fAC: dinucleotide frequency of AC (frequency with which AC is observed out of all dinucleotides) http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

  12. Dinucleotide counts • Create a 4 x 4 matrix • Set all cells to zero • Use a window of size 2 and add 1 to the matching cell of the matrix each time a dinucleotide is encountered • Example sequence: ATTCGACCAGAG

  13. Dinucleotide counts ATTCGACCAGAG
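The sliding-window count on the example sequence can be sketched like this. The row and column order A, C, G, T is a choice made for the example, and the function names are illustrative; the sequence is assumed to contain only those four letters.

```python
BASES = "ACGT"  # row = first nucleotide, column = second nucleotide


def dinucleotide_counts(seq):
    """4 x 4 matrix of counts for each ordered pair of adjacent nucleotides."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = [[0] * 4 for _ in range(4)]  # all cells start at zero
    # Slide a window of size 2 along the sequence.
    for first, second in zip(seq, seq[1:]):
        matrix[index[first]][index[second]] += 1
    return matrix


def dinucleotide_frequencies(seq):
    """Counts divided by the total number of dinucleotides."""
    counts = dinucleotide_counts(seq)
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


counts = dinucleotide_counts("ATTCGACCAGAG")
```

A sequence of length 12 contains 11 overlapping dinucleotides. For `ATTCGACCAGAG` the rows (A, C, G, T) come out as `[0, 1, 2, 1]`, `[1, 1, 1, 0]`, `[2, 0, 0, 0]`, `[0, 1, 0, 1]`.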

  14. Observed and expected frequencies http://www.maths.lth.se/bioinformatics/publications/BasicE_2005.pdf

  15. Observed and expected frequencies http://www.maths.lth.se/bioinformatics/publications/BasicE_2005.pdf
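The slides do not reproduce the formula, so the following is only a sketch of one common convention: take the expected frequency of a dinucleotide XY to be the product of the mononucleotide frequencies fX and fY (an independence assumption), and compare it with the observed dinucleotide frequency. The function name is hypothetical.

```python
from collections import Counter


def observed_expected_ratio(seq, dinucleotide):
    """Observed dinucleotide frequency divided by the frequency expected
    if neighbouring nucleotides were independent (assumed convention)."""
    seq = seq.upper()
    first, second = dinucleotide
    mono = Counter(seq)
    pairs = Counter(a + b for a, b in zip(seq, seq[1:]))
    observed = pairs[dinucleotide] / (len(seq) - 1)
    expected = (mono[first] / len(seq)) * (mono[second] / len(seq))
    return observed / expected


ratio = observed_expected_ratio("ATTCGACCAGAG", "CG")
```

For `ATTCGACCAGAG`, CG is observed once in 11 dinucleotides, while fC * fG = 1/16, so the ratio is 16/11. A ratio well below 1 would indicate depletion of that dinucleotide. The sketch does not guard against a nucleotide that is absent from the sequence, in which case the expected frequency is zero.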

  16. Dinucleotide frequencies in genome http://www.lapcs.univ-lyon1.fr/~piau/mps/Poster-CpG.pdf

  17. Sequence features • A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

  18. Sequence features • promoters • transcription initiation sites • transcription termination sites • polyadenylation sites • ribosome binding sites • protein features http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

  19. Consensus sequences • A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature • Consensus sequences are regular expressions http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
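Because a consensus written in I.U.B. codes is a regular expression in disguise, it can be translated into one mechanically. The sketch below uses Python's standard `re` module; the table and the function name `consensus_to_regex` are assumptions made for the example, and unambiguous letters are passed through unchanged.

```python
import re

# Ambiguity codes become character classes; plain A, C, G, T are kept as-is.
AMBIGUITY = {
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "M": "[AC]",
    "K": "[GT]", "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an I.U.B. consensus string into a regular expression."""
    return re.compile("".join(AMBIGUITY.get(s, s) for s in consensus.upper()))


acci = consensus_to_regex("GTMKAC")
match = acci.search("TTGTCGACAA")
```

The pattern for `GTMKAC` is `GT[AC][GT]AC`, which matches `GTCGAC` starting at index 2 of the example string.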

  20. Occurrences • Example: recognition site for a restriction enzyme • EcoRI recognizes GAATTC • AccI recognizes GTMKAC • Basic Algorithm • Start with first character of sequence to be searched • See if enzyme site matches starting at that position • Advance to next character of sequence to be searched • Repeat previous two steps until all positions have been tested http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
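The basic algorithm above can be sketched as an explicit position-by-position scan. The function name `find_sites` and the 0-based start positions are choices made for the example; the site may contain I.U.B. ambiguity codes, and overlapping occurrences are all reported.

```python
# Nucleotides allowed by each I.U.B. symbol (DNA alphabet).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(sequence, site):
    """Return the 0-based start positions where the site matches."""
    sequence = sequence.upper()
    allowed = [IUB[symbol] for symbol in site.upper()]
    hits = []
    # Test every position at which the whole site still fits.
    for start in range(len(sequence) - len(site) + 1):
        window = sequence[start:start + len(site)]
        if all(base in options for base, options in zip(window, allowed)):
            hits.append(start)
    return hits
```

For example, `find_sites("AAGAATTCTTGAATTC", "GAATTC")` returns `[2, 10]`, and `find_sites("TTGTCGACAA", "GTMKAC")` returns `[2]`. The scan costs on the order of sequence length times site length, which is fine for short recognition sites.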

  21. Statistics of pattern appearance • Goal: Determine the significance of observing a feature (pattern) • Method: Estimate the probability that a pattern would occur randomly in a given sequence. Three different methods • Assume all nucleotides are equally frequent • Use measured frequencies of each nucleotide (mononucleotide frequencies) • Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies) http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

  22. Example 1 • What is the probability of observing the sequence feature ART (A followed by a purine, either A or G, followed by a T)? • Using observed mononucleotide frequencies: • pART = pA (pA + pG) pT • Using equal mononucleotide frequencies • pA = pC = pG = pT = 1/4 • pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

  23. Example 1: using mononucleotide frequencies • Using equal mononucleotide frequencies • pA = pC = pG = pT = 1/4 • pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 • Using observed mononucleotide frequencies: • pART = pA (pA + pG) pT
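The mononucleotide calculation generalizes to any pattern: multiply, position by position, the summed probabilities of the nucleotides allowed there. The sketch below uses exact fractions so that results such as 1/32 are not blurred by floating point. The names `CODES` and `pattern_probability` are illustrative, and the "observed" frequencies are those of the short sequence ATTCGACCAGAG from the dinucleotide-count slides, used here purely as sample input.

```python
from fractions import Fraction

# Nucleotides allowed by each symbol used in the examples (R = purine, Y = pyrimidine).
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}


def pattern_probability(pattern, p):
    """Probability of the pattern, treating positions as independent."""
    prob = Fraction(1)
    for symbol in pattern:
        # An ambiguous position contributes the sum over its allowed bases.
        prob *= sum(p[base] for base in CODES[symbol])
    return prob


equal = {base: Fraction(1, 4) for base in "ACGT"}
# Sample observed frequencies from ATTCGACCAGAG: A = 4/12, C = 3/12, G = 3/12, T = 2/12.
observed = {"A": Fraction(4, 12), "C": Fraction(3, 12),
            "G": Fraction(3, 12), "T": Fraction(2, 12)}

p_art_equal = pattern_probability("ART", equal)
p_art_observed = pattern_probability("ART", observed)
```

With equal frequencies `p_art_equal` is 1/32, matching the slide. With the sample observed frequencies, pART = 1/3 * (1/3 + 1/4) * 1/6 = 7/216. The same function gives 1/64 for the pattern ARYT under equal frequencies.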

  24. Example 1: using dinucleotide frequencies • pART = pA (p*AA p*AT + p*AG p*GT)
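The slide does not define the starred terms. The sketch below assumes that p*XY is the conditional probability that Y follows X, estimated as the count of XY divided by the count of all dinucleotides that start with X. Under that assumption the formula reads: start with an A, then sum over the two purine choices the probability of stepping to that purine and on to T. The function names and the sample sequence are made up for the example.

```python
from collections import Counter
from fractions import Fraction


def transition_probabilities(seq):
    """Assumed meaning of p*XY: P(next is Y | current is X), from counts."""
    pairs = Counter(zip(seq, seq[1:]))
    row_totals = Counter()
    for (first, _second), n in pairs.items():
        row_totals[first] += n
    return {(a, b): Fraction(n, row_totals[a]) for (a, b), n in pairs.items()}


def p_art(seq):
    """pART = pA (p*AA p*AT + p*AG p*GT), estimated from one sequence."""
    p_a = Fraction(seq.count("A"), len(seq))
    trans = transition_probabilities(seq)

    def step(a, b):
        # Pairs never observed get probability zero.
        return trans.get((a, b), Fraction(0))

    return p_a * (step("A", "A") * step("A", "T")
                  + step("A", "G") * step("G", "T"))


result = p_art("AGTAATAGT")
```

In `AGTAATAGT`, pA = 4/9, p*AA = 1/4, p*AT = 1/4, p*AG = 1/2 and p*GT = 1, so pART = 4/9 * (1/16 + 1/2) = 1/4.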

  25. Example 2: • What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)? • Using equal mononucleotide frequencies • pA = pC = pG = pT = 1/4 • pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64 http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
