
An Introduction to Bioinformatics

An Introduction to Bioinformatics. Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April 24, 2007 Some material has been adapted from course notes from IBIOS 551: Genomics and BIOL 597F: Bioinformatics I.


Presentation Transcript


  1. An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April 24, 2007 Some material has been adapted from course notes from IBIOS 551: Genomics and BIOL 597F: Bioinformatics I

  2. What is Bioinformatics? • Simplest definition: • The use of computers to study biology (particularly molecular biology and genetics) • Highly interdisciplinary • Mathematics, statistics, computer science, biology, engineering

  3. Applications & Subfields of Bioinformatics • Genomics • Mapping & sequencing of entire genomes (all the DNA on all the chromosomes in an organism) • Functional genomics (sometimes called “phenomics”): deducing information about the function of DNA sequences • Proteomics • Prediction of protein structure and function from protein sequence • Systems biology • Study of the dynamics with which genes and gene products interact with each other • Other applications • Enzyme design/re-design • Quantitative image analysis

  4. Outline for lecture • Some basic definitions • How are genomes sequenced? • What are some of the ethical and social concerns in bioinformatics and genomics? • What are the key computational skills & methods used in bioinformatics? • How do I use some of the more popular bioinformatics tools?

  5. Some basic definitions • DNA - a double-stranded biological macromolecule (deoxyribonucleic acid) whose sequence is built from 4 nucleotides: • A = Adenine • C = Cytosine • G = Guanine • T = Thymine • In double-stranded DNA, each nucleotide base-pairs with a complementary nucleotide: • A base-pairs with T • C base-pairs with G Image source: Wikipedia
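
The base-pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and reading the partner strand in reverse (the "reverse complement") is a standard convention that the slide does not state.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction.

    Assumption (not on the slide): the two strands run in opposite
    directions, so the complement is reversed.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("AACCGT"))          # TTGGCA
    print(reverse_complement("AACCGT"))  # ACGGTT
```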

  6. Definitions, cont’d • mRNA (messenger RNA) - the single-stranded “transcribed” form of DNA, consisting of the nucleotides A, C, G, and U (uracil, which takes the place of T) • mRNA is transcribed by an enzyme (catalytic protein) called RNA polymerase • Gene - a sequence of DNA that contains coding elements (exons) interspersed with noncoding elements (introns) • Mature mRNA contains only the exons – the parts of the gene that “code” for a protein
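
As a rough sketch of the two ideas on this slide, the Python below rewrites a DNA coding strand as RNA (T becomes U) and then keeps only the exons. The toy gene, the exon coordinates, and the function names are all made up for illustration; real transcription and splicing are carried out by cellular machinery, not by string slicing.

```python
def transcribe(coding_strand: str) -> str:
    """Write the coding strand as RNA: same letters, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (0-based start, end-exclusive)."""
    return "".join(pre_mrna[start:end] for start, end in exons)


if __name__ == "__main__":
    # Hypothetical gene: exon 1 + intron + exon 2.
    exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GAATAA"
    gene = exon1 + intron + exon2
    pre_mrna = transcribe(gene)
    mature = splice(pre_mrna, [(0, 6), (17, 23)])
    print(pre_mrna)  # AUGGCCGUAAGUUUCAGGAAUAA
    print(mature)    # AUGGCCGAAUAA
```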

  7. http://138.192.68.68/bio/Courses/biochem2/GeneIntro/GeneIntroResources/

  8. Definitions, cont’d • Protein - a macromolecule produced by the translation of the mRNA sequence • Translation is mediated by tRNA (transfer RNA) and rRNA (ribosomal RNA) • Proteins consist of a combination of 20 different amino acids linked by peptide bonds • A sequence of three nucleotides is called a codon, each of which corresponds to a specific amino acid • Proteins carry out most of the functions of a cell

  9. Codon table
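
The codon table itself did not survive in this transcript. As a stand-in, the sketch below rebuilds the standard genetic code in Python and uses it to translate a short mRNA. The compact 64-letter string is the usual way of listing the standard code with bases ordered U, C, A, G; the function name and the choice to start reading at the first base (rather than searching for a start codon) are simplifying assumptions.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon.

    Simplification: reading starts at position 0 of the input.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(CODON_TABLE["AUG"])         # M (methionine, the usual start codon)
    print(translate("AUGGCCGAAUAA"))  # MAE
```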

  10. Central Dogma of Molecular Biology • DNA acts as a template to replicate itself • DNA is also transcribed into RNA • RNA is translated into protein

  11. Genotype and Phenotype • Genotype refers to the specific hereditary genetic makeup of an individual organism • Homozygous: both copies of a gene (or part of a gene) are identical • Heterozygous: offspring inherits one version of the gene from one parent, and another version of the gene from the other parent • Phenotype refers to an organism’s observable trait or other characteristic that results from the interaction of genotype and environment

  12. The Human Genome Project (HGP) • Coordinated by DOE and NIH, begun in 1990 • Objectives: • Identify all the genes in human DNA and how they vary within our species • Determine the sequences of the 3 billion nucleotide base pairs that make up human DNA • Store this information in well-designed databases for easy retrieval • Develop improved tools for analysis of gene sequence data • Address the ethical, legal, and social issues (ELSI) that may arise from the project • Private-sector effort conducted in parallel by Celera Genomics (headed by Craig Venter) • Working draft announced in 2000; finished sequence completed in 2003

  13. The HGP approach to sequencing the human genome • Painstakingly precise • Small pieces of DNA were “clipped” from the 23 pairs of human chromosomes, which were individually separated out of human blood and sperm cells • Each of these short DNA pieces was individually sequenced using electrophoresis gels • Each piece of sequenced DNA was matched up with the DNA on either side of it in the chromosomal sequence • Analogous to taking out one page of an encyclopedia at a time, ripping that page up, and then putting it together again

  14. The Celera Genomics approach to sequencing the human genome • “Shotgun” sequencing strategy • All the DNA in all chromosomes is “torn up” simultaneously and the fragments are individually sequenced • Computational methods are used to look for overlaps in the sequence fragments to rebuild them into a whole genome • Analogous to ripping up all pages of an entire encyclopedia at once and then attempting to put it all back together • Much faster than traditional sequencing methods, but prone to incorrect assembly of “random” fragments
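
The "look for overlaps and rebuild" step can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions (error-free fragments, all from the same strand, no repeats); it is not the method Celera actually used, and the function names and the example sequence are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return frags  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


if __name__ == "__main__":
    reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
    print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

With repeated sequence, several fragments can overlap equally well and a greedy merge may join the wrong pair, which is one way the incorrect assemblies mentioned on the slide can arise.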

  15. What are some of the ethical and social implications and concerns of the human genome project outcomes? • Fair use: • Who should have access to personal genetic information, and how will it be used? • Privacy and confidentiality: • Who owns and controls genetic information? • Psychological impact and stigmatization: • How does personal genetic information affect an individual and society's perceptions of that individual? • How does genomic information affect members of minority communities? Source: http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml

  16. What are some of the ethical and social implications and concerns of the human genome project outcomes? • Clinical issues: • How will genetic tests be evaluated and regulated for accuracy, reliability, and utility? • How do we prepare healthcare professionals for the new genetics? • How do we prepare the public to make informed choices? • How do we as a society balance current scientific limitations and social risk with long-term benefits? • Uncertainties: • Should testing be performed when no treatment is available? • Should parents have the right to have their minor children tested for adult-onset diseases? • Are genetic tests reliable and interpretable by the medical community? Source: http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml

  17. What are some of the ethical and social implications and concerns of the human genome project outcomes? • Conceptual and philosophical implications • Do people's genes make them behave in a particular way? • Can people always control their behavior? • What is considered acceptable diversity? • Where is the line between medical treatment and enhancement? • Reproductive rights and decision making: • Do healthcare personnel properly counsel parents about the risks and limitations of genetic technology? • How reliable and useful is fetal genetic testing? • What are the larger societal issues raised by new reproductive technologies? Source: http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml

  18. deCODE: A Case Study in Ethics • In 1996, Kari Stefansson started his company, deCODE Genetics, with a mission to use population genetics to discover new genes associated with human disease • Target population: 275,000 living Icelanders • Iceland’s government had originally endorsed deCODE’s effort to obtain medical records of all Icelanders as well as the creation of “genomic fingerprints” from every citizen

  19. What are the advantages of such a plan? • Iceland’s population is highly homogeneous • The vast majority have descended from a few European explorers arriving in Iceland 1,000 years ago • Icelanders have a strong tradition of maintaining family trees • Single healthcare provider, so all medical records are in one database • Family relationships can thus be easily correlated with medical records • Therefore, finding significant genetic differences that lead to certain medical conditions, such as cardiovascular disease, cancer, and schizophrenia, is likely to be easier than in a heterogeneous population (like that of the U.S.)

  20. Why was there opposition? • Method for obtaining data and medical records was “opt-out” (informed dissent) rather than “opt-in” (informed consent) • Records and other data could be sold to other companies that wanted to use this information to help develop new drugs • Some felt patient-physician confidentiality was compromised, and doctors worried that patients would be less forthcoming about their illnesses • Iceland’s supreme court ultimately ruled against the default of automatic inclusion in deCODE’s database • Court based its decision on complaints from a minor who objected to her dead father’s information being included in the database • Theoretically possible to use the father’s medical data to make inferences about the daughter; could lead to unfairly assessed insurance premiums

  21. Interesting fiction about the ethics of genetics and genomics… Only $18.45 at Amazon!

  22. Outline for lecture • Some basic definitions • How are genomes sequenced? • What are some of the ethical and social concerns in bioinformatics and genomics? • What are the key computational skills & methods used in bioinformatics? • How do I use some of the more popular bioinformatics tools?

  23. Knowing where to look: Using public databases and data formats • PubMed: For surveying biological/medical literature • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi • GenBank: Nucleic acid & protein sequences • http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide • http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein • SWISS-PROT at ExPASy: Protein sequences • http://us.expasy.org/sprot/ • PFAM: Database of alignments of protein families • http://www.sanger.ac.uk/Software/Pfam/ • Protein Data Bank (PDB): Protein structure • http://www.pdb.org • Gene Ontology (GO): A standardized vocabulary for describing protein functions • http://www.geneontology.org/ • OMIM (Online Mendelian Inheritance in Man): Catalog of genes and associated disorders • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM • PhenomicDB: Simultaneously compare phenotypes of several organisms sharing homologous genes • http://www.phenomicdb.de

  24. Computational Methods in Bioinformatics • Sequence alignment & sequence searching • BLAST: Basic Local Alignment Search Tool • http://www.ncbi.nlm.nih.gov/BLAST/ • Whole genome analysis • UCSC Genome Browser • http://genome.ucsc.edu • Gene prediction • GenScan: searches for putative (hypothetical) genes • http://genes.mit.edu/GENSCAN.html • Multiple sequence alignment • ClustalW • http://www.ebi.ac.uk/clustalw/

  25. A multiple sequence alignment (MSA) Image source: http://www.biochemj.org/bj/370/0651/bj3700651.htm

  26. Computational Methods in Bioinformatics • Phylogenetic Analysis • Attempts to describe the evolutionary relationships within a group of sequences • Uses a “tree” or “cladogram” to represent relationships • PHYLIP • http://evolution.genetics.washington.edu/phylip.html Image source: http://www.nature.com/ng/journal/v33/n3s/fig_tab/ng1113_F1.html

  27. Computational Methods in Bioinformatics • Protein structure visualization • RCSB-PDB Explorer: • http://www.rcsb.org/pdb/home/home.do • Protein sequence analysis, structure prediction, and structural analysis • ExPASy: • http://us.expasy.org/ • Protein structural alignment and comparison • Combinatorial Extension of the Optimal Path (CE): • http://cl.sdsc.edu/ Image source: http://www.p450.kvl.dk/gallery/

  28. Two “ubiquitous” bioinformatics tools • BLAST: Basic Local Alignment Search Tool (Altschul et al, 1990) • Genome Browser at University of California–Santa Cruz (Kent et al, 2002)

  29. BLAST: Basic Local Alignment Search Tool • Co-developed by Prof. Webb Miller, director of bioinformatics at PSU • Initially conceived to visualize DNA sequences retrieved from a database and identify local alignments to a query sequence • Break the query and database sequences into “words” of gene or protein letters, then seek matches between fragments • Uses “substitution matrices” and dynamic programming to calculate alignment scores • http://www.ncbi.nlm.nih.gov/BLAST/
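
The "words" idea can be sketched as follows: index every word of length w in a database sequence, then look up each word of the query. This shows exact-word seeding only; the real BLAST program does considerably more (for example, extending and scoring the seeds), and the function names, word length, and sequences here are illustrative assumptions.

```python
from collections import defaultdict


def index_words(sequence: str, w: int) -> dict[str, list[int]]:
    """Map every length-w word in the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_seeds(query: str, subject: str, w: int = 4) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = index_words(subject, w)
    return [
        (q, s)
        for q in range(len(query) - w + 1)
        for s in index.get(query[q:q + w], [])
    ]


if __name__ == "__main__":
    seeds = find_seeds("GATTACA", "TTGATTAGGATTACA", w=4)
    print(seeds)  # [(0, 2), (0, 8), (1, 3), (1, 9), (2, 10), (3, 11)]
    # Seeds sharing the same offset (subject_pos - query_pos) lie on one diagonal
    # and hint at a longer ungapped match; here four seeds share offset 8.
```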

  30. Similarity and homology • Sequences (or structures or other objects) that look like each other are similar. • If that similarity results from their having a common ancestor, then those sequences are homologous. • If the homologs have diverged because of a speciation event, the sequences are orthologous • Ex: Human hemoglobin vs. mouse hemoglobin • If the homologs have diverged because of gene duplication, the sequences are paralogous • Ex: Different versions of hemoglobin in human (adult vs. fetal) • If the similarity results from convergent evolution from ancestrally different sequences, then the sequences are analogous.

  31. Definition of alignments • Alignment • A mapping of one sequence onto at least one other sequence to bring out similarities • An alignment column can contain matches, mismatches, or gaps • Global alignment • The mapping extends throughout the sequences • Appropriate when the sequences are homologous throughout their lengths • Local alignment • The mapping is limited to the regions of highest similarity • Most appropriate for database searches
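
The difference between global and local alignment can be made concrete with the two classic dynamic-programming recurrences, Needleman-Wunsch (global) and Smith-Waterman (local). The sketch below computes scores only, not the alignments themselves, and assumes a simple scheme of +1 for a match and -1 for a mismatch or gap; the function name and test sequences are hypothetical.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score of a and b.

    local=False: Needleman-Wunsch (both sequences aligned end to end).
    local=True:  Smith-Waterman (best-scoring pair of substrings).
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    x, y = "CCCCGATTACA", "GATTACATTTT"
    print(alignment_score(x, y, local=True))   # 7: the shared GATTACA
    print(alignment_score(x, y, local=False))  # lower: the flanks must be aligned too
```

The shared region scores well locally even though the sequences differ elsewhere, which is why local alignment suits database searches.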

  32. Making a local alignment • An alignment of two sequences (frequently called a local alignment) can be obtained as follows: • Extract a segment from each sequence • Add dashes (gap symbols) to each segment to create equal-length sequences • Place one “padded” segment over the other • For example: AACC-GTACTTG A-CAGGTGG-TG

  33. Alignment scores • To distinguish between “good” and “bad” alignments, we need a rule that assigns a numerical score to any alignment. The higher the score, the better the alignment. • Example of a simple scoring rule: • Match scores +1 • Mismatch or gap scores -1 • The following alignment scores 0 total (6 matches, 6 mismatches/gaps) AACC-GTACTTG A-CAGGTGG-TG +-+--++---++
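
The simple scoring rule (+1 per match, -1 per mismatch or gap) is easy to apply column by column. The sketch below uses a separate short alignment as its example; the function name is hypothetical.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, penalty: int = -1) -> tuple[int, str]:
    """Score two equal-length padded sequences column by column.

    Returns the total score and a string marking each column
    "+" (match) or "-" (mismatch or gap).
    """
    if len(top) != len(bottom):
        raise ValueError("padded sequences must have equal length")
    marks = "".join(
        "+" if x == y and x != "-" else "-" for x, y in zip(top, bottom)
    )
    score = sum(match if mark == "+" else penalty for mark in marks)
    return score, marks


if __name__ == "__main__":
    # 4 matches and 2 gaps: 4 - 2 = +2
    print(score_alignment("ACGT-A", "AC-TTA"))  # (2, '++-+-+')
```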

  34. Substitution Matrices (also called “scoring matrices”) • Scores depend on “evolutionary distance” • Example at right shows scores used in a human-mouse alignment

  35. Amino Acid scoring matrices • This is the BLOSUM62 amino acid scoring matrix, which is derived from a database of aligned protein sequences in which sequences sharing 62% or greater identity are clustered together • Each score in the matrix is a “log odds” score • Positive score: In an alignment of two protein sequences, this amino acid pair is found more often than by chance • Negative score: less often than by chance • Zero score: same as expected by chance • More weight is given to the rarer amino acids, such as sulfur-containing residues (e.g., cysteine, C) or very large amino acids like tryptophan (W) Image source: Bioinformatics: Sequence and Genome Analysis by David Mount (2nd ed., 2004)
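
For reference, a "log odds" score of the kind described above can be written as follows. The symbols are introduced here for illustration and do not appear on the slide: q_ij is the observed frequency of amino acids i and j being aligned, and e_ij is the frequency expected by chance from the individual amino acid frequencies. The factor of 2 and the base-2 logarithm reflect the half-bit units in which BLOSUM62 is conventionally reported.

```latex
S_{ij} = \operatorname{round}\!\left( 2 \log_2 \frac{q_{ij}}{e_{ij}} \right)
\qquad
\begin{cases}
S_{ij} > 0 & \text{if } q_{ij} > e_{ij} \text{ (pair seen more often than by chance)} \\
S_{ij} = 0 & \text{if } q_{ij} \approx e_{ij} \\
S_{ij} < 0 & \text{if } q_{ij} < e_{ij} \text{ (pair seen less often than by chance)}
\end{cases}
```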

  36. Can you score this alignment? MREQH aligned with MSCQH

  37. M R E Q H aligned with M S C Q H: 5 - 1 - 4 + 5 + 8 = +13. More closely related than by chance!
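
The same sum can be reproduced in Python. Only the five BLOSUM62 entries needed for this alignment are included, with the values shown on the slide; the dictionary and function names are hypothetical, and a real program would load the full matrix.

```python
# The five BLOSUM62 entries used on the slide (not the full matrix).
BLOSUM62_SUBSET = {
    ("M", "M"): 5,
    ("R", "S"): -1,
    ("E", "C"): -4,
    ("Q", "Q"): 5,
    ("H", "H"): 8,
}


def pair_score(a: str, b: str) -> int:
    """Look up a pair in either order; raises KeyError if it is not in the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def score_ungapped(top: str, bottom: str) -> int:
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(top) != len(bottom):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


if __name__ == "__main__":
    print(score_ungapped("MREQH", "MSCQH"))  # 13
```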

  38. So let’s see it in action: the BLAST tutorial • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html

  39. The UCSC Genome Browser • Much more interactive than BLAST and most other bioinformatics tools • Quick demonstration using HBB (human beta hemoglobin, a blood protein) • http://genome.ucsc.edu

  40. Programming • BioPerl • Open-source Perl tools for bioinformatics & genomics • Includes a collection of modules that facilitate the development of scripts for bioinformatics applications • http://www.bioperl.org • Online course: http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/ • Other languages: • Java (for developing your own bioinformatics GUIs) • Rapid prototyping languages: R, Matlab • Matlab Bioinformatics Toolbox

  41. Summary • Now you know: • What bioinformatics is • Where to look for biological data • What kinds of skills and methods are used to analyze biological data • How to query BLAST and the UCSC Genome Browser • What a sequence alignment is • How the human genome was sequenced • Some of the questions surrounding the ethical and social implications of the Human Genome Project • To learn more, consider taking BIOL 597F (Bioinformatics I) or IBIOS 551 (Genomics) in the fall
