DNA sequence analysis IT Carlow Bioinformatics October 2006
A, T/U, C, G • Simple code, lots of sequence • Sequence analysis • Computer intensive • BLAST homology searching • Gene/exon prediction • Multiple sequence alignment • Alignments in general • “Trivial”
Trivial • Could be done by hand • Computers • Quicker • More reliable • Examples • Translate DNA • Restriction sites • Synonymous codon usage
Sequence formats
• Fasta Format
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
• Phylip Format
4 131
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT
• CLUSTAL W(1.4) multiple sequence alignment
IXI_234   TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235   TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236   TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237   TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
• http://thr.cit.nih.gov/molbio/readseq/
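Of the formats above, Fasta is the simplest to read by program: a header line starting with ">" followed by one or more lines of sequence. A minimal sketch, assuming plain text input; the function name parse_fasta is illustrative and not part of any package mentioned in these slides.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-format text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    demo = ">seq1 toy example\nACGT\nTT\n>seq2\nGG\n"
    for name, seq in parse_fasta(demo):
        print(name, len(seq))
```

Joining the sequence lines is the only real work; the Phylip and Clustal layouts need more care because names and residues share a line.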
DNA sequence analysis • Look for EMBOSS • A suite of programs with the same look and feel • http://bioweb.pasteur.fr/seqanal/dna/intro-uk.html
Translation
• DNA strands are anti-parallel
• One strand read 5’–3’ pairs with the complementary strand read 3’–5’
• Translation, transcription always 5’–3’
• Six possible translations, three on each strand
• Frame 1: ATG CCC GCA TTT GAA [TAA]
• Frame 2: A TGC CCG CAT TTG AAT AA
• Frame 3: AT GCC CGC ATT [TGA] ATA A
• Stop codons shown in [brackets]
• Frameshift errors
• Frameshift mutations
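The six reading frames can be sketched in a few lines: three frames on the given strand and three on its reverse complement. This is an illustrative sketch only; the function names are assumptions, and the stop codons are those of the standard code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}  # standard-code stop codons


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq, frame):
    """Split seq into complete codons starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to lists of codons."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = codons(seq, f)
        frames[f"-{f + 1}"] = codons(rc, f)
    return frames


def stop_positions(codon_list):
    """Indices of stop codons within a list of codons."""
    return [i for i, c in enumerate(codon_list) if c in STOPS]


if __name__ == "__main__":
    for label, cods in six_frames("ATGCCCGCATTTGAATAA").items():
        print(label, " ".join(cods), "stops at", stop_positions(cods))
```

A single inserted or deleted base moves everything downstream into a different frame, which is why frameshift errors in a sequence change which stops are seen.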
Genetic code
The “Universal” Genetic Code.
Phe UUU   Ser UCU   Tyr UAU   Cys UGU
    UUC       UCC       UAC       UGC
Leu UUA       UCA   ter UAA   ter UGA
    UUG       UCG   ter UAG   Trp UGG
Leu CUU   Pro CCU   His CAU   Arg CGU
    CUC       CCC       CAC       CGC
    CUA       CCA   Gln CAA       CGA
    CUG       CCG       CAG       CGG
Ile AUU   Thr ACU   Asn AAU   Ser AGU
    AUC       ACC       AAC       AGC
    AUA       ACA   Lys AAA   Arg AGA
Met AUG       ACG       AAG       AGG
Val GUU   Ala GCU   Asp GAU   Gly GGU
    GUC       GCC       GAC       GGC
    GUA       GCA   Glu GAA       GGA
    GUG       GCG       GAG       GGG
Exceptions to the code
• #1: Yeast Mitochondrial Code: CUN=T AUA=M UGA=W
• #2: Mitochondrial Code of Vertebrates: AGR=* AUA=M UGA=W
• #3: Mitochondrial Code of Filamentous fungi: UGA=W
• #4: Mitochondrial Code of Insects and platyhelminths: AUA=M UGA=W AGR=S
• #5: Nuclear Code of Candida cylindracea: CUG=S (*)
• #6: Nuclear Code of Ciliata: UAR=Q
• #7: Nuclear Code of Euplotes: UGA=C
• #8: Mitochondrial Code of Echinoderms: UGA=W AGR=S AAA=N
• #9: Mitochondrial Code of Ascidiacea: UGA=W AGR=G AUA=M
• #10: Mitochondrial Code of Platyhelminthes: UGA=W AGR=S UAA=Y AAA=N
• #11: Nuclear Code of Blepharisma: UAG=Q (*) (see Nature 341:164)
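In code, a variant genetic code is just the standard table with a few entries overridden. The sketch below builds the standard table from a compact 64-letter string (bases ordered T, C, A, G) and derives the vertebrate mitochondrial code from exception #2 above. The names STANDARD, VERTEBRATE_MITO and translate are illustrative, and "X" for an unrecognised codon is an assumption of this sketch.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; "*" marks a stop (ter).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Exception #2: AGR=* AUA=M UGA=W (written here with T instead of U).
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}


def translate(seq, table=STANDARD):
    """Translate frame 1 of a DNA or RNA sequence, one letter per codon."""
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


if __name__ == "__main__":
    print(translate("ATGCCCGCATTTGAATAA"))
    print(translate("AUAAGAUGA"), translate("AUAAGAUGA", VERTEBRATE_MITO))
```

Choosing the wrong table silently changes the protein, so the organism and the compartment (nuclear or mitochondrial) need to be known before translating.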
Start codons • ATG is the “universal” start codon … but • 10% of E. coli genes start with GTG • 1% start with TTG • Bioinformaticians only make predictions • Molecular biologists verify
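A naive open reading frame scan shows what allowing GTG and TTG as starts means in practice. This is a sketch under simplifying assumptions: forward strand only, the first start codon in a frame opens the ORF, and the next in-frame stop closes it. The function name find_orfs and the min_codons parameter are illustrative; the output is a list of predictions, not verified genes.

```python
STARTS = {"ATG", "GTG", "TTG"}   # ATG plus the alternative starts above
STOPS = {"TAA", "TAG", "TGA"}    # standard-code stop codons


def find_orfs(seq, min_codons=2):
    """Return (start, end, start_codon) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:start + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCGTGAAACCCTAGTT"))
```

Restricting STARTS to {"ATG"} would miss the GTG-initiated ORF in the example, which is the point of the percentages quoted above.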
Restriction sites
• Essential for the construction of plasmids
• A key tool for molecular biology
• Hundreds available commercially
• Need to decide which to order
• Costs range from $3.80 to $500 per 1000 units
• http://tools.neb.com/NEBcutter2/index.php
• Usually need an enzyme that cuts once
Enzyme   Top strand    Bottom strand
EcoRI    5' G^AATTC    3' CTTAA^G
BamHI    5' G^GATCC    3' CCTAG^G
AluI     5' AG^CT      3' TC^GA
(^ marks the cut position)
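Finding an enzyme that cuts once reduces to counting occurrences of each recognition site. The sketch below uses only the three sites shown above; because they are palindromic, scanning one strand is enough. It assumes a linear sequence, and the names find_sites and single_cutters are illustrative.

```python
import re

# Recognition sites from the slide (all palindromic).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "AluI": "AGCT"}


def find_sites(seq, site):
    """Return 0-based start positions of site in seq, overlaps included."""
    # A lookahead matches without consuming, so overlapping sites are found.
    return [m.start() for m in re.finditer(f"(?={site})", seq.upper())]


def single_cutters(seq, sites=SITES):
    """Names of enzymes whose site occurs exactly once in seq."""
    return sorted(name for name, site in sites.items()
                  if len(find_sites(seq, site)) == 1)


if __name__ == "__main__":
    demo = "TTGAATTCAAGGATCCAAGGATCCAGCT"
    for name, site in SITES.items():
        print(name, find_sites(demo, site))
    print("cut once:", single_cutters(demo))
```

For a circular plasmid, sites spanning the join between the end and the start of the sequence would also need checking.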
Promoter Prediction • To find the start of the transcript (97% of the human genome is non-coding) • False positive rate is too high • Programs predict about 1 promoter per kb; a reasonable figure is 1 per 30 kb • RNA pol II transcribes DNA into RNA • Needs general transcription factors (GTFs) • Also specific TFs (species, tissue, developmental stage) • TF binding sites are short and “fuzzy” • 7% of vertebrate genes encode TFs
Promoters 2
NF-AT4 matrix (3 known sites) and consensus:
Position   1 2 3 4 5 6 7 8
A          0 0 3 3 3 0 0 1
C          1 2 0 0 0 0 0 2
G          0 0 0 0 0 1 1 0
T          2 1 0 0 0 2 2 0
Consensus  T C A A A T T C
Predicts five sites in 3 kb of human IL-11:
bp 7     TTAAAGGC
bp 248   ACAAATTC
bp 1959  GAGTTTGA
bp 2154  TCAAAGGA
bp 2181  GACTTTTA
• Ask whether a TF site relevant to your cell type is present
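A count matrix like the NF-AT4 one can be used to score every 8-base window of a sequence. The sketch below simply sums the counts for the base seen at each position. That scoring rule is an assumption made for illustration; the slide does not say how the five sites were predicted, and real tools typically use log-odds weights and also scan the reverse strand.

```python
# NF-AT4 count matrix from the slide: one list of 8 counts per base.
COUNTS = {
    "A": [0, 0, 3, 3, 3, 0, 0, 1],
    "C": [1, 2, 0, 0, 0, 0, 0, 2],
    "G": [0, 0, 0, 0, 0, 1, 1, 0],
    "T": [2, 1, 0, 0, 0, 2, 2, 0],
}
WIDTH = 8


def consensus(counts=COUNTS):
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=lambda b: counts[b][i])
                   for i in range(WIDTH))


def score(window, counts=COUNTS):
    """Sum of matrix counts for the bases in an 8-base window."""
    return sum(counts[base][i] for i, base in enumerate(window))


def scan(seq, min_score, counts=COUNTS):
    """Return (position, window, score) for windows scoring >= min_score."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - WIDTH + 1):
        window = seq[i:i + WIDTH]
        if set(window) <= set("ACGT"):  # skip windows with N etc.
            s = score(window, counts)
            if s >= min_score:
                hits.append((i, window, s))
    return hits


if __name__ == "__main__":
    print(consensus(), score(consensus()))
    print(scan("GGTCAAATTCGG", min_score=16))
```

With only three known sites behind the matrix, lowering min_score quickly admits many windows, which illustrates why short, fuzzy sites give a high false positive rate.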
Primer design • You will be asked to design primers for sequencing, PCR etc. • Manual pages cover this • Computationally trivial, so there are many websites to choose from
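To illustrate why primer checks count as computationally trivial, here is a sketch of two common quick calculations: GC content and the Wallace rule of thumb for the melting temperature of short oligos, Tm = 2(A+T) + 4(G+C) in degrees C. Neither is mentioned in the slides; the function names are illustrative, and primer design websites generally use more detailed models.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rule-of-thumb melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCCCGCATTTGAATAA"
    print(primer, round(gc_fraction(primer), 2), wallace_tm(primer))
```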
Trivial but time-consuming • Genome trawls for repeats • LINEs • SINEs • Microsatellites • Masking genomic sequence prior to gene finding • Codon usage • Codons, codonw, gcua
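A microsatellite trawl is a good example of a simple but repetitive search. The sketch below uses a regular expression with a back-reference to find a short unit repeated in tandem. The unit sizes and repeat threshold are arbitrary assumptions, the function name is illustrative, and this is not how dedicated repeat-finding or masking programs work.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, repeats) for perfect tandem repeats in seq."""
    # ([ACGT]{2,6}?) captures the shortest unit; \1{n,} demands more copies.
    pattern = r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    hits = []
    for m in re.finditer(pattern, seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    print(find_microsatellites("TTTCACACACAGGG"))
```

Only perfect repeats are reported; a single mismatch inside a repeat run splits or ends the match.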
Non-trivial • Nucleic acid secondary structure • EMBOSS einverted for short palindromes • http://bioweb.pasteur.fr/seqanal/interfaces/einverted.html • Huge database of 16S rRNA structures
Secondary Structure • DNA (and RNA) can form base-pairs. • Not all of these are with complementary strands. • [Figure labels: “Closer to reality”; “Bioinformatic view = a cartoon”]
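Base-pairing within a single strand can be illustrated by searching for a short stem whose reverse complement occurs a few bases downstream, giving a candidate hairpin. This is a deliberately crude sketch: exact Watson-Crick pairs only (no G-U pairs, mismatches or energy model), with arbitrary default stem and loop sizes. The function name is illustrative and this is not the algorithm used by EMBOSS einverted.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def find_hairpins(rna, stem=4, min_loop=3, max_loop=8):
    """Return (left_start, right_start, left_stem, loop_length) tuples."""
    rna = rna.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        left = rna[i:i + stem]
        # The right arm must be the reverse complement of the left arm.
        target = left.translate(COMPLEMENT)[::-1]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if rna[j:j + stem] == target:
                hits.append((i, j, left, loop))
    return hits


if __name__ == "__main__":
    print(find_hairpins("AAGGGCUUUUGCCCAA"))
```

Real structure prediction has to weigh many competing pairings against each other, which is what makes the problem non-trivial.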
16S rRNA • [Figure panels: Gram -ve; Gram +ve] • Evolutionary consequences? • Coordinated/dependent mutational change
RDP • Ribosomal Database Project-II Release 9 Notes • RDP Release 9.42 (Release 9, update 42) consists of 262,030 aligned and annotated 16S rRNA sequences, along with five online analysis tools. Update 42 was released on Sep 14, 2006.