What is Bioinformatics?
• The Data
• The Analysis
• Comparison
• Evolution
  • Long Distance: Comparative Genomics
  • Short Distance: Variation Analysis
• Homology
• Non-homology
• Physical/Chemical/Statistical
• Mathematical Modelling
The Data & its growth
• 1976/77: the first viral genomes - MS2 / φX174
• 1995: the first prokaryotic genome - H. influenzae
• 1996: the first unicellular eukaryotic genome - yeast
• 1998: the first multicellular eukaryotic genome - C. elegans
• 2001: the human genome, 3 Gb
Known as of 1.5.03:
• >1000 viral genomes
• 96 prokaryotic genomes
• 16 archaebacterial genomes
A series of multicellular genomes is coming.
A general increase in data involving higher structures and dynamics of biological systems.
Genomes & Tree of Life
• 3.5-3.8 Gyr: Origin of Life
• 3+ Gyr: LUCA
• ~1.4 Gyr: Origin of Eukaryotes
• 500-600 Myr: Origin of Vertebrates
• 200+ Myr: Origin of Mammals
• 80-100 Myr: Mouse Mammalian Split
• 5-7 Myr: Chimp-Human Split
• 100 Kyr - Myr: Age of Polymorphisms
From Janssen, 2003
Comparison of Evolutionary Objects
• Sequences (e.g. ACTGT vs. ACTCCT)
• RNA (secondary) structure
• Gene order/orientation (e.g. cabbage vs. turnip)
• Protein structure (e.g. renin vs. HIV proteinase)
• Interaction networks
• Any graph
• Gene structure
General theme:
• Formal model of structure
• Stochastic model of structure evolution
The Phylogeny for Evolutionary Objects
[Figure: a tree with the time direction marked. The root is the MRCA (Most Recent Common Ancestor), state unknown ("?"). Unobservable evolutionary paths lead to three observable sequences, each shown as ATTGCGTATATAT….CAG.]
Parameters: time, rates, selection
3 Problems:
i. Test all possible relationships.
ii. Examine unknown internal states.
iii. Explore unknown paths between states at nodes.
Gene and Genome Evolution
[Figure labels: Chimp, Mouse, Fish, Higher Cells, E. coli; example sequences TGTGTATA and TGCGTATC.]
Basic Events
• Substitutions
• Insertions/deletions
• Chromosome-level events: inversions, duplications, transpositions, ..
Average Number of Mitoses
• Per male generation: (15:35 .. 20:150)
• Per female generation: ~24
• Single nucleotide substitutions: ~10^-7
• Microsatellites (~100,000): ~10^-2
• Small insertions/deletions: ~10^-8
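Principles of String Comparison: Alignment
Sequences: ACTGT and ACTCCT
Two alignments:
ACTG-T    ACT-GT
ACTCCT    ACTCCT
Cost: 2
[Figure: paths from ACTGT to ACTCCT through the intermediates ACTGCT and ACTCGT, with labels .41 and .41.]
Probability: e^-16.47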
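The cost of 2 quoted for ACTGT versus ACTCCT can be checked with a small dynamic-programming sketch. It assumes unit costs for a substitution and for a single-base gap. The slide does not state its cost scheme, so `sub_cost` and `gap_cost` are illustrative parameters rather than the author's values, and the function name is hypothetical.

```python
def edit_distance(a: str, b: str, sub_cost: int = 1, gap_cost: int = 1) -> int:
    """Minimal alignment cost between a and b (assumed unit costs)."""
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap_cost]  # aligning a[:i] against the empty string
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            gap_in_b = prev[j] + gap_cost      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap_cost  # b[j-1] aligned to a gap
            curr.append(min(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(edit_distance("ACTGT", "ACTCCT"))  # 2: one substitution plus one gap
```

Both alignments on the slide (ACTG-T and ACT-GT against ACTCCT) reach this minimum, which is why a cost-based comparison alone cannot choose between them.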
Maximum likelihood phylogeny and alignment
Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song
Sequences: human alpha hemoglobin; human beta hemoglobin; human myoglobin; bean leghemoglobin
• Probability of data: e^-1560.138
• Probability of data and alignment: e^-1593.223
• Probability of alignment given data: 4.279 * 10^-15 = e^-33.085
• Ratio of insertion-deletions to substitutions: 0.0334
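The third figure follows from the first two by the definition of conditional probability, P(alignment | data) = P(data and alignment) / P(data). A short check in log space, using only the numbers quoted on the slide (the variable names are illustrative):

```python
import math

# Log-probabilities as quoted on the slide.
log_p_data = -1560.138
log_p_data_and_alignment = -1593.223

# Division of probabilities becomes subtraction of logs.
log_p_alignment_given_data = log_p_data_and_alignment - log_p_data
p_alignment_given_data = math.exp(log_p_alignment_given_data)

print(round(log_p_alignment_given_data, 3))  # -33.085
print(f"{p_alignment_given_data:.3e}")       # about 4.279e-15
```

Working with logs matters here: e^-1560 underflows to zero in double precision, so the two probabilities cannot be divided directly.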
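Rooting using irreversibility (Lunter)
[Figure: tree diagrams whose probabilities factorize as P( ) = P( ) * P( ) * P( ); the arguments were drawings on the slide.]
• Reversibility: [diagram] = [diagram]
• The Pulley Principle: [diagram] = [diagram]
• Contagious Dependence
• CG avoidance creates irreversibility
Lunter and Hein, ISMB 2004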
Comparison of Evolutionary Objects
[Figure: RNA sequence C C A A G C A U U, with observable and unobservable levels marked.]
References: Goldman, Thorne & Jones, 96; Knudsen & Hein, 99; Eddy & co.; Meyer and Durbin 02; Pedersen & Hein, 03; Siepel & Haussler 03
The Rise of Comparative Genomics
Lander et al. (2001), Figure 25A
Recursive Definition of Strings
Gene Grammar [Figure: a gene with Exon 1, Exon 2, Exon 3 and a parse tree over ATG ... GAG]
S -> E | I
E -> eE | eI | e
I -> iE | iI | e
RNA Grammar [Figure: structure string s d d d d s s s]
S -> sS | Ss | dSd | SS | e
Derivation steps shown: S, sS, ssS, ssdSd, ssddSdd, ssddSdds
Stochastic Grammars
S -> (0.3) aSa | (0.5) bSb | (0.1) aa | (0.1) bb
Example derivation: S -> aSa -> abSba -> abaaba, with probability 0.3 * 0.5 * 0.1 = 0.015
If a string has a unique (1-1) derivation, its probability is the product of the probabilities of the applied rules.
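A minimal sketch of the product rule for this particular grammar. It relies on the fact that the grammar generates only even-length palindromes over {a, b}, each with exactly one derivation, so the string can be peeled from the outside in. The names `RULES` and `string_probability` are illustrative and not from the slides.

```python
# For each letter x: (probability of S -> xSx, probability of S -> xx).
RULES = {"a": (0.3, 0.1), "b": (0.5, 0.1)}


def string_probability(s: str) -> float:
    """Probability that the grammar on the slide generates s."""
    # Only even-length strings with matching outer letters can be derived.
    if len(s) < 2 or len(s) % 2 or s[0] != s[-1] or s[0] not in RULES:
        return 0.0
    wrap, stop = RULES[s[0]]
    if len(s) == 2:
        return stop  # terminating rule S -> xx
    # Recursive rule S -> xSx: multiply by the probability of the inside.
    return wrap * string_probability(s[1:-1])


print(string_probability("abaaba"))  # 0.3 * 0.5 * 0.1, about 0.015
```

For ambiguous grammars, where a string can have several derivations, the probabilities of all derivations have to be summed instead, which this sketch does not attempt.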
Grammars: Finite Set of Rules for Generating Strings
• A starting symbol
• A set of substitution rules applied to variables in the present string
• Finished: no variables
Types: Regular, Context Free, Context Sensitive, General (also erasing)
Structure Dependent Evolution: RNA
[Figure: the RNA sequence U A C A C C G U with positions numbered 1-8, shown with its secondary structure.]
From Bjarne Knudsen
RNA Structure Application From Knudsen & Hein (1999) Knudsen & Hein, 2003
Observing Evolution has 2 parts
[Figure: RNA sequence C C A A G C A U U]
• P(x)
• P(Further history of x)
Inter- and Intra-species Comparisons
At shorter time scales:
• For sequences sampled within a population, their relationship is determined by population structure. There is no analogue of this for interspecies sequences.
• Is within-species variation a short time slice of long-term variation?
• Where do the species and population perspectives meet?
Short Time Evolution: Population Genetics and History
[Figure: Ancestral Recombination Graph traced back through time in a population of size N.]
Three large areas of application:
Interpretation of Variation
• Human Population History
• Gene Mapping
• Pathogen Evolution
Cardon, Donnelly, Griffiths, McVean, Wiuf, Song, Schierup
Time slices
[Figure: a population of size N traced back in time, with two time slices marked.]
• All positions have found a common ancestor on one sequence
• All positions have found a common ancestor
Applications to Human Genome (Chr 1) (Wiuf and Hein, 97)
[Figure: chromosome 1 at three magnifications: 0-260 Mb (axis 0-52,000); *35 to 0-7.5 Mb (axis 6890-8360); *250 to 30 kb; time axis 0 to 4Ne.]
20,000 segments, 52,000 ancestors, 6,800
A randomly picked ancestor: (ancestral material comes in batteries!)
The Origin of Variation
[Figure: nucleotide variation (C, G, T, A) arising over time in a population of size N.]
International SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature 409: 928-33
Slice in Space Time N 1
Minimal ARGs and Haplotype Blocks (Song) a: (3,4) b: (3,4) c: (15,16) d: (16,17) e: (35,36) f: (35,36) g: (36,37)
Genotype and Phenotype Covariation: Gene Mapping
References: Reich et al. (2001); Rafnar et al. (2004); Morris et al. (2001)
Finding Homologies
[Figure: a new sequence is compared against a database; the score is a ratio of probabilities, P( ) / (P( ) * P( )), whose arguments were drawings on the slide.]
R. Doolittle et al. (1983)
New Sequence: Simian Sarcoma Virus onc Gene
Similar Sequence: Platelet-Derived Growth Factor
P28SIS 51 GGELESLARGSLGSLSVAEPAMIAECKTRTEVFEISAALIDATNANFLVWPPCVEVQACSGCCNNRN..
PDGF-1  1 ----------SLGSLTIAEPAMIAECKTREEVCFCIAAL?DA????????PPCVEVKACTGCCNNRN..
                    *****  ************ **    *** **        ****** ** *******
Properties of the known sequence are transferred to the new sequence, immediately yielding biological hypotheses about the new sequence.
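The strength of the match in the alignment above can be summarized as the fraction of identical columns. The sketch below is only an illustration: skipping columns that contain a gap ("-") or an undetermined residue ("?") is an assumption made here, not a scoring rule stated on the slide, and `identity_columns` is a hypothetical helper name.

```python
# Aligned fragments as printed on the slide (trailing ".." omitted).
P28SIS = "GGELESLARGSLGSLSVAEPAMIAECKTRTEVFEISAALIDATNANFLVWPPCVEVQACSGCCNNRN"
PDGF_1 = "----------SLGSLTIAEPAMIAECKTREEVCFCIAAL?DA????????PPCVEVKACTGCCNNRN"


def identity_columns(top: str, bottom: str, skip: str = "-?") -> tuple[int, int]:
    """Return (identical columns, comparable columns) for two aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    for x, y in zip(top, bottom):
        if x in skip or y in skip:
            continue  # assumption: gaps and unknown residues are not scored
        compared += 1
        matches += x == y
    return matches, compared


matches, compared = identity_columns(P28SIS, PDGF_1)
print(matches, compared, f"{100 * matches / compared:.1f}%")  # 39 48 81.2%
```

The 39 identical columns correspond to the 39 asterisks in the match line of the alignment.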
“Knowledge Based..”: The Products of Evolution - An Example (D. Baker)
Sequence, Structure
Make a List: Choose global structure that doesn’t create new local structures!
What is Bioinformatics?
• The Data
• The Analysis
• Comparison
• Evolution
  • Long Distance: Comparative Genomics
  • Short Distance: Variation Analysis
• Homology
• Non-homology
• Physical/Chemical/Statistical
• Mathematical Modelling
Funding: MRC & EPSRC
Jotun Hein, Alexei Drummond, Roald Forsberg, Bjarne Knudsen, Istvan Miklos, Jakob Skou Pedersen, Santiago Schnell, Carsten Wiuf, ….
Gerton Lunter, Rune Lyngsoe, Irmtraud Meyer, Yun Song, Jennifer Taylor, Lizhong Hao, Ben Holtom, Stephen McCauley
Methodology
• Evolutionary Models
• Alignment
• Expression Data
• Genome and Gene Evolution
• Sequence Variation Data & Recombination
• RNA Secondary Structure and Evolution
• …………
Collaborations
• William Cookson (WCHG)
• John Hancock (Harwell MRC)
• Peter Simmonds (Edinburgh)
• Bioinformatics Research Centre, Dk
• ………
Homepage: http://www.stats.ox.ac.uk/mathgen/bioinformatics/