Lecture 2: Introduction to Computational Biology

Alexei Drummond

Outline • Sequences and sequence databases • Similarity and Homology • Sequence alignment • Dot plots • Database searches for similar sequences CS369 2007

Sequence • Definition: A sequence S is an ordered list of n characters (si) representing nucleotides or amino acids: S = (s1, s2, …, sn-1, sn) • DNA is composed of four nucleotides or bases: si ∈ {A, C, G, T} • RNA is composed of four nucleotides: si ∈ {A, C, G, U} (T is transcribed as U) • Proteins are composed of twenty amino acids

Biomolecular sequences • DNA: 5’-ACGATCGACTGGTATATCGATGCT-3’ • RNA: 5’-ACGAUCGACUGGUAUAUCGAUGCU-3’ • Protein: MFINRWLFSTNHKDIGTLYLLFGAW
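The definitions above can be sketched in a few lines of Python, assuming sequences are held as plain strings. The names `DNA_ALPHABET`, `RNA_ALPHABET`, `PROTEIN_ALPHABET`, `is_valid` and `transcribe` are illustrative and do not come from the lecture.

```python
# Alphabets for the three kinds of biomolecular sequence.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the twenty standard amino acids


def is_valid(seq, alphabet):
    """Return True if every character of seq belongs to alphabet."""
    return all(ch in alphabet for ch in seq)


def transcribe(dna):
    """Rewrite a DNA sequence as RNA by replacing every T with U."""
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("not a DNA sequence")
    return dna.replace("T", "U")


dna = "ACGATCGACTGGTATATCGATGCT"
rna = transcribe(dna)
protein = "MFINRWLFSTNHKDIGTLYLLFGAW"
print(rna)
print(is_valid(protein, PROTEIN_ALPHABET))
```

Applied to the DNA example on this slide, `transcribe` reproduces the RNA example.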

What is a gene?
[Figure: double-stranded DNA (5’ to 3’ and 3’ to 5’) showing intergenic DNA, a start codon, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, splice sites and a stop codon]
• Both the exons and introns are transcribed into the primary RNA transcript (5’ to 3’) • The introns are removed to give messenger RNA (mRNA) • The mRNA is translated to protein
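A minimal sketch of the intron-removal step, assuming the exon boundaries are already known. The transcript and the coordinates are toy values made up for illustration (exons in upper case, introns in lower case), and `splice` is a hypothetical helper name.

```python
def splice(transcript, exons):
    """Join the exon intervals (0-based, end-exclusive) of a primary transcript."""
    return "".join(transcript[start:end] for start, end in exons)


# Toy primary transcript: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3.
primary = "AUGGCC" + "guaaag" + "UUUGGA" + "guccag" + "AAAUAA"
exons = [(0, 6), (12, 18), (24, 30)]

mrna = splice(primary, exons)
print(mrna)
```

In a real gene the splice sites have to be found first; here they are simply given.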

Eukaryotes versus Prokaryotes
Note: There is no cellular biology in the exam!
Prokaryotes • Bacteria and Archaea • Small • No nucleus • No introns • Not much intergenic DNA • Typically 1-10 Mb genomes
Eukaryotes • Plants, animals and fungi • Larger cells, often multicellular • Well defined nucleus, and specialized organelles • Introns • Lots of intergenic DNA • 100 Mb-100 Gb genomes
Graphics from MIT: http://web.mit.edu/hst.035/labs/labs.html

Sequence databases • Where do biologists store their data? • Databases: public, private, proprietary; general, specialist • Hard drive: chromatograms/electropherograms; flat file sequence formats (Fasta, Genbank et cetera); flat file alignment formats (Nexus, ClustalX, GCG et cetera)


NCBI Nucleotide database

Searching by accession number

Genbank record

Genbank headers
LOCUS       X00166                   711 bp    DNA     linear   PHG 10-FEB-1999
DEFINITION  Bacteriophage lambda cI gene encoding the repressor protein for
            transcriptional control of tetracycline resistance on plasmid pTR
            262.
ACCESSION   X00166
VERSION     X00166.1  GI:15056
KEYWORDS    repressor; tetracycline resistance.
SOURCE      Enterobacteria phage lambda
  ORGANISM  Enterobacteria phage lambda
            Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae;
            Lambda-like viruses.
REFERENCE   1  (bases 1 to 711)
  AUTHORS   Nilsson,B., Uhlen,M., Josephson,S., Gatenbeck,S. and Philipson,L.
  TITLE     An improved positive selection plasmid vector constructed by
            oligonucleotide mediated mutagenesis
  JOURNAL   Nucleic Acids Res. 11 (22), 8019-8030 (1983)
   PUBMED   6316281

Genbank feature table
FEATURES             Location/Qualifiers
     source          1..711
                     /organism="Enterobacteria phage lambda"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10710"
     CDS             1..>711
                     /note="unnamed protein product; coding sequence cI gene"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="CAA24991.1"
                     /db_xref="GI:15057"
                     /db_xref="GOA:P03034"
                     /db_xref="InterPro:IPR001387"
                     /db_xref="InterPro:IPR006198"
                     /db_xref="InterPro:IPR010982"
                     /db_xref="InterPro:IPR011056"
                     /db_xref="PDB:1F39"
                     /db_xref="PDB:1GFX"
                     /db_xref="PDB:1J5G"
                     /db_xref="PDB:1LLI"
                     /db_xref="PDB:1LMB"
                     /db_xref="PDB:1LRP"
                     …

Genbank sequence
ORIGIN
        1 atgagcacaa aaaagaaacc attaacacaa gagcagcttg aggacgcacg tcgccttaaa
       61 gcaatttatg aaaaaaagaa aaatgaactt ggcttatccc aggaatctgt cgcagacaag
      121 atggggatgg ggcagtcagg cgttggtgct ttatttaatg gcatcaatgc attaaatgct
      181 tataacgccg cattgcttgc aaaaattctc aaagttagcg ttgaagaatt tagcccttca
      241 atcgccagag aaatctacga gatgtatgaa gcggttagta tgcagccgtc acttagaagt
      301 gagtatgagt accctgtttt ttctcatgtt caggcaggga tgttctcacc tgagcttaga
      361 acctttacca aaggtgatgc ggagagatgg gtaagcacaa ccaaaaaagc cagtgattct
      421 gcattctggc ttgaggttga aggtaattcc atgaccgcac caacaggctc caagccaagc
      481 tttcctgacg gaatgttaat tctcgttgac cctgagcagg ctgttgagcc aggtgatttc
      541 tgcatagcca gacttggggg tgatgagttt accttcaaga aactgatcag ggatagcggt
      601 caggtgtttt tacaaccact aaacccacag tacccaatga tcccatgcaa tgagagttgt
      661 tccgttgtgg ggaaagttat cgctagtcag tggcctgaag agacgtttgg c
//

Fasta format
>gi|15056|emb|X00166.1| Bacteriophage lambda cI gene encoding the…
ATGAGCACAAAAAAGAAACCATTAACACAAGAGCAGCTTGAGGACGCACGTCGCCTTAAAGCAATTTATG
AAAAAAAGAAAAATGAACTTGGCTTATCCCAGGAATCTGTCGCAGACAAGATGGGGATGGGGCAGTCAGG
CGTTGGTGCTTTATTTAATGGCATCAATGCATTAAATGCTTATAACGCCGCATTGCTTGCAAAAATTCTC
AAAGTTAGCGTTGAAGAATTTAGCCCTTCAATCGCCAGAGAAATCTACGAGATGTATGAAGCGGTTAGTA
TGCAGCCGTCACTTAGAAGTGAGTATGAGTACCCTGTTTTTTCTCATGTTCAGGCAGGGATGTTCTCACC
TGAGCTTAGAACCTTTACCAAAGGTGATGCGGAGAGATGGGTAAGCACAACCAAAAAAGCCAGTGATTCT
GCATTCTGGCTTGAGGTTGAAGGTAATTCCATGACCGCACCAACAGGCTCCAAGCCAAGCTTTCCTGACG
GAATGTTAATTCTCGTTGACCCTGAGCAGGCTGTTGAGCCAGGTGATTTCTGCATAGCCAGACTTGGGGG
TGATGAGTTTACCTTCAAGAAACTGATCAGGGATAGCGGTCAGGTGTTTTTACAACCACTAAACCCACAG
TACCCAATGATCCCATGCAATGAGAGTTGTTCCGTTGTGGGGAAAGTTATCGCTAGTCAGTGGCCTGAAG
AGACGTTTGGC
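Both flat file formats are simple enough to read by hand. The sketch below is an illustration only, not a full parser for either format: `parse_fasta` and `origin_to_sequence` are hypothetical helper names, and the inputs are short excerpts of the X00166 record shown above (the shortened header is an assumption made to keep the example small).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def origin_to_sequence(origin_lines):
    """Drop position numbers and spaces from Genbank ORIGIN lines."""
    letters = (ch for line in origin_lines for ch in line if ch.isalpha())
    return "".join(letters).upper()


fasta_text = ">X00166.1 excerpt\nATGAGCACAAAAAAGAAACC\nATTAACACAA\n"
origin_lines = ["        1 atgagcacaa aaaagaaacc attaacacaa"]

records = parse_fasta(fasta_text)
print(records)
print(origin_to_sequence(origin_lines))
```

Both routes give the same nucleotide string, which is the sense in which the Genbank and Fasta records hold the same sequence.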

Hepatitis C sequence database • Specialist databases usually refer to sequences in the public databases, but have extra information and search criteria specific to the domain.

Hepatitis C sequence database

Problem 1: detecting sequence similarity between two sequences • Biologists often want to detect whether two sequences are similar • How is sequence similarity defined? • What is it used for? • Are there different types of similarity?

How is sequence similarity defined? • The number of matching nucleotides (when aligned)? • The amount of shared information? • The “distance” between the two sequences under some metric? [Figure: example alignment in which 38 out of 60 sites are identical]

How is sequence similarity defined? • A1 is 42 nucleotides long • A2 is 60 nucleotides long • So 38/42 ≈ 90% of A1 is “explained” by A2 • Whereas 38/60 ≈ 63% of A2 is “explained” by A1
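The arithmetic on this slide can be sketched as follows. The example alignment itself is not reproduced in these notes, so the code takes the counts 38, 42 and 60 from the slide and, as a stand-in, counts matches on the small alignment from the pairwise alignment slide. `identity_stats` and `fraction_explained` are illustrative names.

```python
def identity_stats(a1, a2, gap="-"):
    """Return (identical columns, ungapped length of a1, ungapped length of a2)."""
    if len(a1) != len(a2):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(a1, a2) if x == y and x != gap)
    len1 = sum(1 for x in a1 if x != gap)
    len2 = sum(1 for y in a2 if y != gap)
    return matches, len1, len2


def fraction_explained(matches, length):
    """Fraction of a sequence of the given length covered by identical sites."""
    return matches / length


# Counts quoted on the slide: 38 identical sites, lengths 42 and 60.
print(round(fraction_explained(38, 42), 3))
print(round(fraction_explained(38, 60), 3))

# Toy alignment of S1 = acggt and S2 = aggctt.
print(identity_stats("acgg-t-", "a-ggctt"))
```

The two fractions differ because the same number of matches is divided by two different lengths, so this measure of similarity is not symmetric.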

What is similarity used for? • Detecting homology (shared evolutionary history) • Reconstructing evolutionary history to better understand biology • Determining the structure and function of new sequences, by matching them with sequences of known structure/function • Grouping sequences together to increase statistical power of single-sequence analyses • Many many more uses…

Are there different types of similarity? • Chance similarity • For example: if you compare two long random sequences of DNA you will almost always find some small region containing the same sequence. • Similarity due to a common origin, followed by divergent/independent evolution (called homology) • Similarity due to convergence • Bird wings and bat wings • Lysozyme gut enzyme in cows and colobus monkeys
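Chance similarity is easy to see in a small experiment. The sketch below generates two unrelated random DNA sequences and counts the short words (k-mers) they share; the sequence length of 1000, the seed and the word lengths are arbitrary assumptions, and `kmers` and `shared_kmers` are illustrative names.

```python
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(s1, s2, k):
    """Return the k-mers that occur in both sequences."""
    return kmers(s1, k) & kmers(s2, k)


rng = random.Random(369)  # fixed seed so the run is repeatable
seq_a = "".join(rng.choice("ACGT") for _ in range(1000))
seq_b = "".join(rng.choice("ACGT") for _ in range(1000))

# Short words are shared in abundance; longer ones become rare.
for k in (4, 8, 12):
    print(k, len(shared_kmers(seq_a, seq_b, k)))
```

Since the two sequences have no common origin, every shared word here is similarity by chance alone.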

Sequence Homology • Homologous protein or DNA sequences share common ancestry • A statement of homology is therefore an evolutionary hypothesis • Homology need not imply similar function • Homology is a binary property: a pair of sequences is either homologous or not homologous • There is no such thing as degree of homology • Homology is often inferred from sequence similarity [Figure: over time t, sequences a and b descend from a common ancestor x (a, b homologous), or from separate ancestors x and y (a, b not homologous)]

Origin of similar genes • Similar genes in the same genome arise by gene duplication • Similar genes in different genomes arise from common ancestry • A copy of a gene might be inserted next to the original • Two copies mutate independently • Each can take on separate functions • All or part can be transferred from one part of genome to another [Figure: gene A is duplicated to give A and B; speciation then gives A, B in Species I and A’, B’ in Species II]

Orthology and paralogy "Where the homology is a result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact)." Fitch WM. Distinguishing homologous from analogous proteins. Systematic Zoology 1970 Jun;19(2):99-113.

Orthology and paralogy

Orthology, paralogy and multigene families [Figure reproduced from NCBI education website]

Solution 1: Pairwise sequence alignment • Definition: a procedure for optimizing a score function on a pair of sequences S1 and S2 by introducing gap characters into one or both of the sequences (or into subsequences of them), so as to construct aligned sequences A1 and A2. The objective is to find the regions of similarity in the two sequences. • A1 and A2 will be the same length. • Once gap characters are removed, Ai will consist only of a subsequence of Si.

Pairwise sequence alignment
Sequences
  S1 = a c g g t
  S2 = a g g c t t
Alignment
  A1 = a c g g - t -
       |   | |   |
  A2 = a - g g c t t
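One standard way to carry out this optimization for whole sequences is the Needleman-Wunsch dynamic programming algorithm, which these slides do not name. The sketch below is a minimal version of it under an assumed score function (match +1, mismatch -1, gap -1); the scoring values and the function name `needleman_wunsch` are illustrative choices, not taken from the lecture.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (score, a1, a2) for one optimal global alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    # F[i][j] = best score for aligning s1[:i] with s2[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s1[i-1] with s2[j-1]
                          F[i - 1][j] + gap,      # gap in s2
                          F[i][j - 1] + gap)      # gap in s1
    # Trace back from the bottom-right corner to recover one alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                a1.append(s1[i - 1])
                a2.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


score, a1, a2 = needleman_wunsch("acggt", "aggctt")
print(score)
print(a1)
print(a2)
```

Under this scoring the alignment on the slide (four matches, three gaps) scores 1. Several alignments can tie for the best score, so the one returned need not be identical to the one shown.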

Global versus Local Alignment • We distinguish • Global alignment algorithms, which optimize the overall alignment between two sequences • Local alignment algorithms, which seek only highly similar subsequences • Alignment stops at the ends of regions of strong similarity • Favors finding conserved patterns in otherwise dissimilar sequences

Global vs. Local Alignment
• Global
  LGPSSKQTGKGS-SRIWDN
  |    |  |||   |  |
  LN-ITKSAGKGAIMRLGDA
• Local
  --------GKG--------
          |||
  --------GKG--------
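Local alignment is commonly computed with the Smith-Waterman algorithm, which the slides do not name. The sketch below is a minimal version under the same assumed scoring as before (match +1, mismatch -1, gap -1); scores are never allowed to drop below zero, which is what lets the alignment stop at the ends of a region of strong similarity. The function name and scoring values are illustrative.

```python
def smith_waterman(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (score, a1, a2) for one best-scoring local alignment."""
    n, m = len(s1), len(s2)
    # H[i][j] = best score of a local alignment ending at s1[i-1], s2[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# The two protein fragments from the slide, with gap characters removed.
p1 = "LGPSSKQTGKGSSRIWDN"
p2 = "LNITKSAGKGAIMRLGDA"
local_score, l1, l2 = smith_waterman(p1, p2)
print(local_score, l1, l2)
```

Real protein comparisons would normally use an amino acid substitution matrix rather than a flat match/mismatch score; the flat score is used here only to keep the sketch short.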

Solution 2: The dot plot [Figure: dot plot with window size = 1 and matches = 1; cells are labelled 0/1 or 1/1 by the number of matches in the window]

Filtering the dot plot [Figure: dot plot with window size = 3 and matches = 2; cells are labelled 0/3, 1/3, 2/3 or 3/3 by the number of matches in the window]
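The window and match threshold can be sketched directly: a dot is placed at (i, j) when the windows starting at position i of one sequence and position j of the other agree in at least the required number of places. `dot_plot` and `render` are illustrative names, and the parameter name `stringency` follows the Geneious settings mentioned later in these slides.

```python
def dot_plot(s1, s2, window=1, stringency=1):
    """Return the set of (i, j) window starts with at least stringency matches."""
    dots = set()
    for i in range(len(s1) - window + 1):
        for j in range(len(s2) - window + 1):
            matches = sum(1 for k in range(window) if s1[i + k] == s2[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(dots, rows, cols):
    """Draw the dots as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(cols))
        for i in range(rows)
    )


s1 = "ACGATCGACTGG"
s2 = "ACGTTCGACAGG"
unfiltered = dot_plot(s1, s2, window=1, stringency=1)
filtered = dot_plot(s1, s2, window=3, stringency=2)
print(render(unfiltered, len(s1), len(s2)))
print()
print(render(filtered, len(s1) - 2, len(s2) - 2))
```

Raising the window size and stringency removes isolated chance matches and leaves the diagonal runs that indicate real similarity.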

Dot plots • The dot plot is a graphical method that can be tuned [Figure: dot plot panels labelled 1,1 and 2,2]

Dot plots [Figure: dot plot panels labelled 3,3 and 5,22]

Dot matrix analysis with Geneious • Get phage λ cI and phage P22 c2 repressor sequences from Genbank Nucleotide database • Accessions X00166 and V01153 respectively • Use Geneious 2.5.4 (http://www.geneious.com) • Use window size of 11 and stringency of 7 • See figure 3.X in Mount

Dot matrix analysis with Geneious

Dot matrix analysis with Geneious (2) • Get human LDL receptor protein sequence from Genbank (accession P01130) • Make a copy, and look at self-similarity • Use window size of 1 and stringency of 1 • Use window size of 23 and stringency of 7

Human LDL receptor self similarity [Figure: dot plot panels labelled 23,7 and 1,1, matching the window size and stringency settings above]

Dot plots • Two 100 nucleotide fragments of the nef gene • The low complexity repetitive region is visible as a dense region of parallel lines

Which alignment is best?

Problem 2: finding similar sequences in a database using a query sequence • Biologists often want to find known sequences that are similar to a newly obtained sequence • How to rapidly compare the new sequence to the hundreds of billions of bases already sequenced? • Pairwise align the new sequence to all the sequences in the database? • Which database to search?

Similarity searching • Many heuristic algorithms: BLAST, FASTA • Exact algorithms: pairwise alignment on all database entries; only possible for small databases
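The idea behind the heuristic approach can be sketched with a word index: instead of aligning the query against every entry, index the short words (k-mers) of the database once, then look up the query's words to shortlist candidates. This is only an illustration of the seeding idea, not the BLAST or FASTA algorithm itself; the toy database, the word length and the function names are all assumptions.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the set of database entry names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k):
    """Rank database entries by the number of distinct query k-mers they share."""
    counts = defaultdict(int)
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            counts[name] += 1
    # Most shared words first; ties broken by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy database of three short sequences.
database = {
    "seq1": "ACGATCGACTGG",
    "seq2": "TTTTTTTTTTTT",
    "seq3": "GGACGATCAAAA",
}
index = build_kmer_index(database, k=4)
hits = candidate_hits("ACGATCGA", index, k=4)
print(hits)
```

Only the shortlisted entries would then need a full pairwise alignment, which is what makes this kind of search practical on large databases at the cost of possibly missing some matches.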

BLAST

