Introduction to bioinformatics (I617). Haixu Tang School of Informatics Email: hatang@indiana.edu Office: EIG 1008 Tel: 812-856-1859. Textbook. A Primer of Genome Science (2nd Edition) by Greg Gibson, Spencer V. Muse, Sinauer Associates, 2004
Textbook • A Primer of Genome Science (2nd Edition) by Greg Gibson, Spencer V. Muse, Sinauer Associates, 2004 • Suggested reading materials will be posted on the class wiki page: http://cheminfo.informatics.indiana.edu/djwild/I617_2006_wiki/index.php/Main_Page • Office hours: MW 11:00-12:00, EIG 1008, or by appointment
Grading • Class project: selected from one of four covered areas (Bioinformatics, Chemical informatics, Laboratory informatics and Health informatics) 25% • Suggested Bioinformatics topics will be posted on the class wiki page • Homework: 25% in Bioinformatics • 4 assignments, each 6.25%
Bioinformatics = BIOlogy + informatics? • Not really: it is a term (somewhat arbitrarily chosen) to define a multi-disciplinary area that combines life sciences, physical sciences and computer science / informatics; • It addresses biological problems using theoretical informatics approaches, not vice versa; • It is transforming classical Biology into an Information Science.
The birth of bioinformatics • A revolution in biology research: the emergence of Genome Science • Technology advancement in both biology and information science
Genome science: a revolution of biology • Classical Biology: hypothesis driven approach (Hypothesis → Data → Knowledge) • Genome Science: data driven approach (Data → Hypothesis → Knowledge)
Bioinformatics: from data analysis to data mining • Classical Biology: low throughput data, used for hypothesis confirmation / rejection • Genome Science: high throughput data, used for hypothesis generation
Bioinformatics: in the driver’s seat • Classical Biology: data analysis • Genome Science: data mining
Key technology advancements • High throughput biotechnologies • Genome sequencing techniques • DNA microarray • Mass spectrometry • Large-scale experiments • HGP, HapMap • Omics / Systems Biology • Massive data generation, storage, exchange and analysis • CPU, storage, etc. • High speed network (Internet) • Bioinformatics
Bioinformatics: mutually beneficial • For biologists • Fragment assembly in genome sequencing • Genome comparison • Gene clustering in DNA microarray analysis • Protein identification in proteomics • For computer scientists • String algorithms / Tree algorithms • Alternative Eulerian path (BEST theorem) • Reversal distances • Probabilistic graphical models (HMMs, BNs, etc.)
Two origins of bioinformatics • Combinatorial pattern matching in theoretical computer science • DNA and protein sequence analysis • Physical and analytical chemistry of biomolecules • Protein structure analysis → Structural bioinformatics • Bio-analytical chemistry → Proteomics
Bioinformatics addresses computational challenges in life and medical sciences • New computational problems for automatic data analysis • Genome sequencing • Proteomics • Transcriptomics • Data representation and visualization • Genome Browser • Solving biological problems by in silico approaches • Reformulation of old problems using new high throughput data • Gene finding • Protein structure and function • Formulating new problems using high throughput data • Comparative genomics • Polymorphisms / Population genetics • Systems Biology
Bioinformatics resources • Databases • Nucleic Acids Research (NAR) annual database issue • Organization • ISCB (International Society for Computational Biology) • Conferences • ISMB • RECOMB • Many other smaller or regional conferences, e.g. ECCB, CSB, PSB, etc., including the local Indiana Bioinformatics conference
A case study • How does bioinformatics help and transform classical biological topics? • Molecular evolutionary studies: from anatomical features to molecular evidence • Genome evolution: comparison of gene orders
Early Evolutionary Studies • Anatomical features were the dominant criteria used to derive evolutionary relationships between species from Darwin's time until the early 1960s • The evolutionary relationships derived from these relatively subjective observations were often inconclusive. Some of them were later proved incorrect
Evolution and DNA Analysis: the Giant Panda Riddle • For roughly 100 years scientists were unable to figure out which family the giant panda belongs to • Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate
Evolution and DNA Analysis: the Giant Panda Riddle • In 1985, Steven O’Brien and colleagues solved the giant panda classification problem using DNA sequences and bioinformatics algorithms
Evolutionary Trees: DNA-based Approach • 40 years ago: Emile Zuckerkandl and Linus Pauling brought reconstructing evolutionary relationships with DNA into the spotlight • In the first few years after Zuckerkandl and Pauling proposed using DNA for evolutionary studies, the possibility of reconstructing evolutionary trees by DNA analysis was hotly debated • Now it is a dominant approach to studying evolution.
Evolutionary Trees How are these trees built from DNA sequences? • leaves represent existing species • internal vertices represent ancestors • root represents the common evolutionary ancestor
Rooted and Unrooted Trees • In the unrooted tree the position of the root (“common ancestor”) is unknown. Otherwise, they are like rooted trees
Distances in Trees • Edges may have weights reflecting: • Number of mutations on the evolutionary path from one species to another • Time estimate for evolution of one species into another • In a tree $T$, we often compute $d_{ij}(T)$, the length of the path between leaves $i$ and $j$ (the tree distance between $i$ and $j$)
Distance in Trees: an Example • $d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$
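The sum on this slide can be reproduced with a short sketch. The slide's figure is not available here, so the tree below is a hypothetical one: only the five edge weights on the path from leaf 1 to leaf 4 (12, 13, 14, 17, 12) come from the slide, while the vertex names, the other leaves and their weights are assumptions made for illustration.

```python
from collections import defaultdict


def tree_distance(edges, i, j):
    """Return the sum of edge weights on the unique path between i and j."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    # Depth-first search; a tree has exactly one simple path between
    # any two vertices, so the first time we reach j we are done.
    stack = [(i, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError("no path between the given vertices")


# Hypothetical tree: integer leaves, internal vertices "a".."d".
# Only the weights along the 1 -> 4 path are taken from the slide.
edges = [
    (1, "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", 4, 12),
    ("a", 2, 5), ("b", 3, 7), ("c", 5, 9), ("d", 6, 4),
]

print(tree_distance(edges, 1, 4))
```

Storing the tree as a weighted edge list keeps the sketch independent of any rooting, which matches the unrooted view of trees used in these slides.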
Distance Matrix • Given $n$ species, we can compute the $n \times n$ distance matrix $D_{ij}$ • $D_{ij}$ may be defined as the edit distance between a gene in species $i$ and species $j$, where the gene of interest is sequenced for all $n$ species.
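As a minimal sketch of how such a matrix could be filled in, the code below uses the standard dynamic-programming edit distance with unit costs for insertion, deletion and substitution. The slide does not say which cost scheme is intended, so unit costs are an assumption, and the three short sequences are made-up placeholders rather than real gene data.

```python
def edit_distance(a, b):
    """Unit-cost edit (Levenshtein) distance between strings a and b."""
    # prev[y] holds the distance between the processed prefix of a and b[:y].
    prev = list(range(len(b) + 1))
    for x, ca in enumerate(a, start=1):
        curr = [x]
        for y, cb in enumerate(b, start=1):
            curr.append(min(
                prev[y] + 1,               # delete ca
                curr[y - 1] + 1,           # insert cb
                prev[y - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Return the n x n matrix of pairwise edit distances."""
    n = len(seqs)
    return [[edit_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]


# Made-up sequences standing in for the same gene in three species.
genes = ["ACGT", "ACGA", "TCGA"]
for row in distance_matrix(genes):
    print(row)
```

The resulting matrix is symmetric with zeros on the diagonal, which is the form a tree-fitting algorithm would take as input.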
Fitting Distance Matrix • Given $n$ species, we can compute the $n \times n$ distance matrix $D_{ij}$ • Evolution of these genes is described by a tree that we don’t know. • We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$
Reconstructing a 3 Leaved Tree • Tree reconstruction for any 3x3 matrix is straightforward • We have 3 leaves $i$, $j$, $k$ and a center vertex $c$ • Observe: $d_{ic} + d_{jc} = D_{ij}$, $d_{ic} + d_{kc} = D_{ik}$, $d_{jc} + d_{kc} = D_{jk}$
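Solving the three equations for the three unknown edge lengths gives d_ic = (D_ij + D_ik - D_jk) / 2, d_jc = (D_ij + D_jk - D_ik) / 2 and d_kc = (D_ik + D_jk - D_ij) / 2. The sketch below simply evaluates these formulas; the function name and the example distances are illustrative assumptions, not taken from the slides.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c.

    Obtained by solving
        d_ic + d_jc = D_ij
        d_ic + d_kc = D_ik
        d_jc + d_kc = D_jk
    for d_ic, d_jc and d_kc.
    """
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Made-up distances: D_ij = 5, D_ik = 7, D_jk = 8.
print(three_leaf_tree(5, 7, 8))
```

The edge lengths are non-negative only when the three distances satisfy the triangle inequality, so a negative result signals that no such tree fits the matrix.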
Turnip vs Cabbage: Look and Taste Different • Although cabbages and turnips share a recent common ancestor, they look and taste different
Turnip vs Cabbage: Comparing Gene Sequences Yields No Evolutionary Information
Turnip vs Cabbage: Almost Identical mtDNA gene sequences • In the 1980s Jeffrey Palmer studied the evolution of plant organelles by comparing mitochondrial genomes of the cabbage and turnip • 99% similarity between genes • These nearly identical gene sequences nevertheless differed in gene order • This study helped pave the way to analyzing genome rearrangements in molecular evolution
Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison (figure: gene order before and after rearrangement) • Evolution is manifested as the divergence in gene order
Transforming Cabbage into Turnip Reversal distance
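Reversal distance is the minimum number of segment reversals needed to turn one gene order into another. The sketch below finds it by brute-force breadth-first search over signed permutations, where the sign marks gene orientation and a reversal flips both the order and the signs of a segment. This is an illustration for small inputs, not the efficient algorithm used in practice, and the five-gene orders are an illustrative assumption because the slide's figure did not survive extraction.

```python
from collections import deque


def reversals(perm):
    """Yield every signed permutation reachable from perm by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            # Reverse the segment perm[i..j] and flip the sign of each gene.
            flipped = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + flipped + perm[j + 1:]


def reversal_distance(source, target):
    """Minimum number of reversals turning source into target (BFS)."""
    source, target = tuple(source), tuple(target)
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        for nxt in reversals(perm):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("target is not a signed permutation of source")


# Illustrative gene orders (assumed, not read from the slides).
cabbage = (1, -5, 4, -3, 2)
turnip = (1, 2, 3, 4, 5)
print(reversal_distance(cabbage, turnip))
```

The search space grows as n! * 2^n, so this approach is only practical for a handful of genes; it is meant to make the definition concrete.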
History of Chromosome X • Rat Consortium, Nature, 2004