Biology 162: Computational Genetics Fall 2004. Todd Vision Assistant Professor Department of Biology, UNC Chapel Hill. Bioinformatics vs computational genetics. Bioinformatics : The application of computing technology to molecular biology
Biology 162: Computational Genetics, Fall 2004. Todd Vision, Assistant Professor, Department of Biology, UNC Chapel Hill
Bioinformatics vs computational genetics • Bioinformatics: The application of computing technology to molecular biology • Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics
Course emphasis • Data analysis in molecular genetics • We will not cover • Developments in IT hardware • Analysis of protein structure • Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e. systems biology)
Prerequisites • Bio 50: Molecular Biology and Genetics • Gene/protein structure and expression • Principles of inheritance • Comp Sci 14: Introduction to Programming • Algorithms and their design • Fundamental programming skills • Stat 31: Introduction to Statistics • Probability and distributions • Hypothesis testing and parameter estimation
Related courses at UNC • Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio) • Summer courses in • Computer Science • Graduate courses in • Bioinformatics and Computational Biology • Biostatistics • School of Pharmacy
Readings • Gibson and Muse, A Primer of Genome Science, Sinauer Associates. • Available in Student Bookstore • Primarily covers genomic technologies • Brief on computational/statistical aspects • Supplemental papers • Handed out in class or posted on Blackboard • Includes • More detail on computational/statistical aspects • Papers which you will review for class assignments
Computer labs / Problem sets • Thursdays 3:30-4:30 in Wilson 132 • Assignments are due the following Tuesday • Purpose: • Familiarity with genomic databases and tools • Functional and evolutionary sequence analysis • Gene expression analysis • Mapping of genomes and complex traits • Comfort with command-line tools and computing • Exercise of scientific reasoning and biological judgement • No programming required (but learn Perl anyway!)
Research paper • Critical review of the computational challenges involved in assembly of the human genome • Based on opposing articles from the main players in the drama • Paper will be judged on • Understanding of content • Critical and synthetic reasoning • Clarity of scientific writing
Late policy • Assignments are due at the beginning of class on the due date • Late assignments receive half-credit • Exceptions can be made but require more than 24 hours' notice
Group work • You are encouraged to work together on most assignments (some exceptions) • What you turn in should be your own • Show your work • Be able to defend your answers • Know and love the UNC Honor Code • http://honor.unc.edu
Exams • Two midterms • Final exam will be cumulative • May include material from labs/problem sets, readings and lectures • Most questions will be similar to those on lab/problem sets • You will receive a study guide in advance
Grading • 10 Labs/problem sets - 50% (5% each) • Review paper - 10% • Midterms - 20% (10% each) • Final exam - 20% • Final grades • No curve, point divisions at discretion of instructor • Different divisions for undergraduate/graduate students
Computer lab server: Biolinux • All necessary analysis software is installed • Dell PowerEdge server • Red Hat Linux operating system • 2 Xeon processors • 2 GB RAM • 60 GB disk space • Requires an ONYEN for login • Uses AFS file space
Connecting to Biolinux • biolinux.bio.unc.edu (IP 152.2.66.25) • Windows • Zip archive contains necessary connection software • Mac OS X • X11 for graphical sessions • Fugu for secure ftp • Linux/Solaris/etc. • Should work as is
Cretaceous Park? • In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil. • DNA was extracted • Care was taken to prevent contamination • Specific regions were amplified • 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB • How would you design primers for dinosaur DNA? • All yielded products in mammals, birds and reptiles • Only one cytB pair yielded a product from the fossil • Negative controls did not reveal contamination
Cretaceous Park? • One cytB fragment amplified • 9 sequences obtained from two bone samples • Variability was present within and between the two samples; no two sequences were identical • Consensus sequences used to search for homologs • GenBank (215,000 sequences) • BLAST • Measured percent identity • Closest matches were ~70% identical • Equidistant to mammals, birds, and reptiles
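The percent identity figure quoted above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same length with "-" marking gaps, and that gapped columns are excluded from the denominator. That is only one of several conventions in use, and the function name percent_identity is hypothetical, not taken from the course materials or from BLAST.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment.

    Assumption: both strings come from the same alignment, so they have
    equal length, and columns containing a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        return 0.0  # nothing to compare
    return 100.0 * matches / compared


# Toy alignment: 4 matches out of 5 ungapped columns.
print(percent_identity("ACGT-A", "ACCTGA"))  # 80.0
```

A value near 70%, equidistant from several groups, says little on its own; the choice of denominator and of alignment method can shift the number, which is part of why the reanalyses described on the next slide reached a different conclusion.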
Cretaceous Park? • One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians • Other authors reanalyzed the data • Multiple alignment • Protein sequence scoring matrix • Phylogenetic analysis • All concluded that the DNA was clearly mammalian, possibly human • One group showed that similar sequences could be amplified from human nuclear DNA
Cretaceous Park? • Three possibilities • Preparation of human nuclear DNA could have been contaminated by dinosaur DNA • Dinosaurs and humans might have hybridized during the Cretaceous • Dinosaur extracts were contaminated by human DNA • Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs • Lesson learned: naïve computational analysis can lead to very misguided conclusions!
Discussion question • You are given the sequence of a new gene and asked to determine its function. • How would you begin? • What ‘wet lab’ approaches are possible? • What ‘in silico’ approaches are possible? • What approaches might require both wet lab and in silico components?
Biological topics • Sequence alignment and assembly • Sequence homology searching • Sequence evolution and phylogenetics • Finding genes and other features • Patterns of gene expression • Genetic mapping • Dissecting genetic diseases and quantitative traits
Computational topics • Dynamic programming • Regular expressions and suffix trees • Markov chains • Hidden Markov models and machine learning • Techniques for clustering and classification • Maximum likelihood and Bayesian statistics • Graph traversal
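As a small taste of the regular-expression topic, the sketch below scans a DNA string for simple open reading frames: a start codon, zero or more codons, then the first in-frame stop codon. It is illustrative only and makes several loud assumptions: forward strand only, uppercase A/C/G/T input, non-overlapping matches, and no minimum length. The name find_orfs is hypothetical.

```python
import re

# Start codon, then as few codons as possible, then an in-frame stop codon.
# The lazy quantifier (*?) makes the match end at the FIRST in-frame stop.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_orfs(dna):
    """Return (start_index, orf_sequence) pairs for non-overlapping matches.

    Assumption: only the forward strand is scanned; a real gene finder
    would also check the reverse complement and apply length filters.
    """
    return [(m.start(), m.group(0)) for m in ORF_PATTERN.finditer(dna.upper())]


print(find_orfs("CCATGAAATGATAGCC"))  # [(2, 'ATGAAATGA')]
```

Pattern matching of this kind finds candidate features quickly, but it has no notion of probability; the hidden Markov models listed above address that limitation.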
Some informatics tools • GenBank, UniProt, and major sequence repositories • InterPro and protein signature dBs • Gene Ontology • Model organism genome databases (SGD, FlyBase, Ensembl) • A sampling of software programs • Chosen primarily for pedagogical utility
Genomics • Genetics on lots of genes? • Hypothesis-free science? • Some technologies • Enabled by • Robotics • Computers
Genome database examples • Primary databases • GenBank/EMBL/DDBJ • Secondary databases • Pfam (protein domains) • Organism-specific • SGD (yeast genomics) • Specialized dBs • OMIM (human genetic disorders) • Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/
First bacterial genome: 1995 • Haemophilus influenzae (TIGR) • 1.8 x 10^6 bp shotgun assembly • Required 9 months of computer time • Now there are hundreds • 160 Bacterial • 19 Archaeal • 32 Eukaryotic • Over a thousand projects ongoing • And a bacterial genome takes only days to sequence and assemble
Other types of genomic data • Spatiotemporal gene expression • Alternative transcription • Genetic knockout/overexpression phenotypes • Genetic variability • Molecular polymorphism • Phenotypic variation / disease • Comparative data / molecular evolution • Protein • Structure, including modifications • Interactions with other molecules • Metabolic profiling, etc., etc.
Algorithmic/statistical innovations • The most fundamental and heavily used application in the field is pairwise alignment • Smith-Waterman algorithm (1981) • Still too slow for general database search • BLAST (1990) • Made database search of 10^7-10^8 sequences feasible • Statistical ranking of each alignment • Statistical methods in molecular evolution <25 yrs old • Modern genetic mapping methods ~15 yrs old
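To make the pairwise alignment idea concrete, here is a minimal sketch of Smith-Waterman local alignment scoring by dynamic programming. The scoring values (match +2, mismatch -1, linear gap penalty -1) are illustrative assumptions, not parameters from the course, and the function returns only the best local score rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    Assumptions: simple match/mismatch scoring and a linear gap penalty.
    Practical tools use substitution matrices and affine gap costs.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in b
            left = H[i][j - 1] + gap  # gap in a
            # The floor at 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])

    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
```

The nested loops fill a table with one cell per pair of positions, so the work grows with the product of the two sequence lengths. Repeating that for every entry in a large database is what makes the exact algorithm slow and heuristic search attractive.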
Things to review • Chemical differences among amino acids • Prokaryotic and eukaryotic gene structure • The central dogma • Anatomy of a typical protein
Reading for Thursday • Gibson and Muse, Ch.1 Genome Projects, pgs. 1-58.