CIS 595 Bioinformatics Lecture 2 Introduction to Bioinformatics A number of slides taken/modified from: Russ B. Altman (http://www.smi.stanford.edu/projects/helix/bmi214/) Patrik Medstrand (www.cmb.lu.se/devbiol/bioinfo/ old/download/intro2003/databases_handouts.pdf) Mark Gerstein (http://bioinfo.mbb.yale.edu/mbb452a/2002/sequences2002.pdf)
What is Bioinformatics? • Every application of computer science to biology • Sequence analysis, image analysis, sample management, population modeling, … • Analysis of data coming from large-scale biological projects • Genomes, transcriptomes, proteomes, metabolomes, etc.
The New Biology • Traditional biology • Small team working on a specialized topic • Well-defined experiment to answer precise questions • New “high-throughput” biology • Large international teams, with cutting-edge technology defining the project • Raw results are released to the scientific community without any underlying hypothesis
Examples of “High-Throughput” • Complete genome sequencing • Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE) • Large-scale sampling of the proteome • Large-scale protein-protein interaction analysis by two-hybrid screens (yeast, worm) • Large-scale 3D structure production (yeast) • Metabolism modeling • Biodiversity
Role of Bioinformatics • Control and management of the data • Sequence, Structure and Function analysis • Analysis of primary data e.g. • Mass spectra analysis • DNA microarrays image analysis • Statistics • Database storage and access • Interpreting results in a biological context
Sequence, Structure and Function Analysis To gain insight into the ways in which genes and gene products (proteins) function, perform: • SEQUENCE ANALYSIS: Analyze DNA and protein sequences, searching for clues about structure, function, and control. • STRUCTURE ANALYSIS: Analyze biological structures, searching for clues about sequence, function, and control. • FUNCTION ANALYSIS: Understand how the sequences and structures lead to the functions.
Evolution and Bioinformatics • Common descent of organisms implies that they will share many “basic technologies.” • Development of new phenotypes in response to environmental pressure can lead to “specialized technologies.” • More recent divergence implies more shared technologies between species. • All of biology is about two things: understanding shared or unshared features.
Biology is Fundamentally Information Science Where is the information? • DNA sequences • GenBank release 128 (2/02) contains 17,089,143,893 bases in 1,546,532 sequences • Protein sequences • PIR or Swiss-Prot (as of 3/02): 106,736 sequences, 39,242,287 total amino acids • Protein 3D structures • Protein Data Bank (PDB), as of March 2002: 17,679 coordinate entries (15,855 proteins, 1,060 nucleic acids, 746 protein/nucleic acid complexes, 18 carbohydrates)
Biology is Fundamentally Information Science Where is the information? • Online access to DNA microarray data • http://smd.stanford.edu/; 10,000 to 40,000 genes per chip; each set of experiments involves 3 to 100 “conditions” • Medical literature online • Online database of published literature since 1966 = Medline = PubMed resource: 4,600 journals; 11,000,000+ articles (most with abstracts) • Etc.
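As a rough sense of scale, the counts quoted on this slide imply average entry sizes. The sketch below only does arithmetic on the slide's own figures; the variable names are illustrative.

```python
# Figures quoted on the slide (GenBank release 128; PIR/Swiss-Prot, 3/02; PDB, March 2002).
genbank_bases = 17_089_143_893
genbank_sequences = 1_546_532
protein_residues = 39_242_287
protein_sequences = 106_736
pdb_breakdown = {
    "proteins": 15_855,
    "nucleic acids": 1_060,
    "protein/nucleic acid complexes": 746,
    "carbohydrates": 18,
}

# Average entry sizes implied by the totals.
avg_dna_length = genbank_bases / genbank_sequences
avg_protein_length = protein_residues / protein_sequences

# The PDB categories should add up to the quoted number of coordinate entries.
pdb_total = sum(pdb_breakdown.values())

print(f"average GenBank entry: {avg_dna_length:,.0f} bases")
print(f"average protein entry: {avg_protein_length:.0f} residues")
print(f"PDB entries accounted for: {pdb_total:,}")
```

The averages come out at roughly 11,000 bases per GenBank entry and a little under 370 residues per protein, and the four PDB categories sum to the quoted 17,679 entries.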
Topics • Sequence Alignment; Sequence Motifs; Gene Finding • Computing with Biological Structures • Phylogenetic Algorithms • Microarray Data Analysis • Genetic Networks • Comparative Genomics • Proteomics • Biological Ontologies; Biological Text Mining
Sequence Alignment • What is sequence alignment? • Given two sequences and a scoring scheme, find the optimal pairing of letters. RKVA--GMAKPNM RKIAVAAASKPAV • Why align sequences? • A few sequences have known structure and function; many more have unknown properties. • If one sequence has known structure/function, then aligning it to another yields insight about the other • Similarity may be used as evidence of homology, but does not necessarily imply homology
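To make "scoring scheme" concrete, here is a minimal sketch that scores the example alignment above column by column. The scheme used (+1 for a match, -1 for a mismatch, -2 per gap character) is an assumption chosen for illustration, not the one used in the lecture, and `score_alignment` is a hypothetical helper name.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores of two already-aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # a letter paired with a gap
        elif x == y:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


# The alignment shown on the slide.
print(score_alignment("RKVA--GMAKPNM", "RKIAVAAASKPAV"))
```

With this toy scheme the alignment has 5 matches, 6 mismatches and 2 gap columns. An alignment algorithm searches for the placement of gaps that maximizes such a score.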
Sequence Alignment Types of alignment: • Local vs. global; • Pairwise vs. multiple d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF
Sequence Alignment How to measure the alignment quality? • Define scoring matrix (PAM250)
Sequence Alignment Alignment algorithms: • Dot matrix • Dynamic programming • FASTA • BLAST • PSI-BLAST • Clustal Similarity strength: • Percent identity • E-value (statistical measure)
Sequence Motifs • A motif is a subsequence that occurs in multiple sequences and has biological importance. • Protein motifs often result from structural features • DNA motifs are often sequences that provide signals for protein binding or nucleic acid folding
Sequence Motifs • PROSITE database: a collection of motifs (1135 different motifs) • A manually created collection of regular expressions associated with different protein families/functions. • Globin sequence signature (PDOC00933): F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H
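The dynamic programming entry in the list above can be sketched as a Needleman-Wunsch style global alignment that returns only the optimal score. The linear gap penalty and the +1/-1/-2 values are assumptions for illustration. A substitution matrix such as PAM250 would replace the match/mismatch term with a table lookup. The `percent_identity` helper uses alignment length as the denominator, which is one of several conventions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b, computed row by row."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # ca aligned to a gap
            left = cur[j - 1] + gap   # cb aligned to a gap
            cur.append(max(diagonal, up, left))
        prev = cur
    return prev[-1]


def percent_identity(row_a, row_b):
    """Percentage of alignment columns holding identical letters."""
    same = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * same / len(row_a)


print(global_alignment_score("ACGT", "AGT"))
print(percent_identity("RKVA--GMAKPNM", "RKIAVAAASKPAV"))
```

Keeping only two rows of the table is enough for the score. Recovering the alignment itself would need the full table, or traceback pointers.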
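Because PROSITE patterns are regular expressions in a compact notation, the globin signature above can be translated mechanically into a Python regular expression. The converter below is a minimal sketch covering only the elements that appear in this pattern plus `{...}` exclusions. It ignores anchors and other PROSITE features, `prosite_to_regex` is a hypothetical helper name, and the test sequence is made up.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


GLOBIN = "F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H"
globin_regex = prosite_to_regex(GLOBIN)
print(globin_regex)

# A made-up sequence constructed to contain one occurrence of the signature.
toy_sequence = "MM" + "FL" + "AAAAA" + "GP" + "AAAA" + "GK" + "A" + "L" + "AAA" + "H" + "MM"
hit = re.search(globin_regex, toy_sequence)
print(hit.start() if hit else None)
```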
Gene Finding • Problem : Identify the genes within raw genomic DNA sequence • Input: Raw DNA sequence • Output: Location of gene elements in the raw sequence (including exons, introns, other sequence annotations)
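As a toy illustration of the input/output contract above, the sketch below scans the forward strand for open reading frames (an ATG followed in-frame by a stop codon). This is a deliberate simplification and an assumption of this example. Real gene finders must handle both strands, introns, and statistical signals. The name `find_orfs` and the sample sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive.

    min_codons counts the codons before the stop codon, including the ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGG"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```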
Computing with Biological Structures • General Issues • How do we represent structure for computation? • How do we compare structures? • How can we summarize structural families?
Computing with Biological Structures Applications: • Structure alignment • Build fold library (Figure: alignment of individual structures, Hb and Mb, fusing into a single fold “template”)
Computing with Biological Structures Why align structures: • Provides the “gold standard” for sequence alignment • For nonhomologous proteins, identify common substructures of interest • Classify proteins into clusters, based on structural similarity (SCOP)
Computing with Biological Structures Applications: • Predicting RNA Secondary Structure (the MFOLD Program http://www.bioinfo.rpi.edu/applications/mfold/old/rna/)
Computing with Biological Structures Protein secondary structure prediction Sequence RPDFCLEPPYTGPCKARIIRYFYNAKAGLVQTFVYGGCRAKRNNFKSAEDAMRTCGGA Structure CCGGGGCCCCCCCCCCCEEEEEEETTTTEEEEEEECCCCCTTTTBTTHHHHHHHHHCC
Phylogenetic Algorithms Why build evolutionary trees? • Understand the lineage of different species. • Have an organizing principle for sorting species into a taxonomy. • Understand how various functions evolved. • Understand forces and constraints on evolution. • Guide multiple alignment.
Phylogenetic Algorithms Multiple Alignment and Trees • Progressive alignment methods do multiple alignment and evolutionary tree construction at the same time. • Sequence alignment provides scores which can be interpreted as inversely related to distances in evolution. • Distances can be used to build trees. • Trees can be used to give multiple alignments via common parents.
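As one concrete instance of "distances can be used to build trees", here is a minimal sketch of UPGMA (average-linkage clustering). UPGMA is chosen here for brevity. The slides do not name a specific tree-building method, and the distance values in the usage are made up. The tree is returned as nested tuples without branch lengths.

```python
def upgma(names, matrix):
    """Build a rooted tree (nested tuples) from a symmetric distance matrix."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, leaf count)
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)               # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster:
        # average of the two old distances, weighted by cluster size.
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Made-up distances between four hypothetical sequences A, B, C, D.
toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(toy_names, toy_matrix))
```

In a progressive alignment, a tree like this would serve as the guide: the closest sequences are aligned first, and the resulting groups are then aligned to each other in the order the tree dictates.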
Microarray Data Analysis Experimental Protocol
Microarray Data Analysis What are expression arrays good for? • Follow population of (synchronized) cells over time, to see how expression changes (vs. baseline). • Expose cells to different external stimuli and measure their response (vs. baseline). • Take cancer cells (or other pathology) and compare to normal cells. • (Also some non-expression uses, such as assessing presence/absence of sequences in the genome)
Microarray Data Analysis Preprocessing • Score differential hybridization • Merging replicate experiments • Background correction • Cy5/Cy3 normalization • Data input • Duplicate spot variability • Replicate experiment variability • Spot quality • Artifactual regions
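Two of the preprocessing steps named above, background correction and Cy5/Cy3 normalization, can be sketched for a single two-channel array. The approach shown (subtracting a constant background, taking log2 ratios, then centering on the median) is one common simple choice and an assumption of this example, not necessarily the procedure used in the lecture. The intensities are made up.

```python
import math
import statistics


def normalized_log_ratios(cy5, cy3, background5=0.0, background3=0.0):
    """Background-correct both channels and return median-centered log2(Cy5/Cy3).

    Intensities must remain positive after background subtraction.
    """
    ratios = [
        math.log2((red - background5) / (green - background3))
        for red, green in zip(cy5, cy3)
    ]
    center = statistics.median(ratios)
    # Centering assumes that most genes are not differentially expressed.
    return [value - center for value in ratios]


# Made-up spot intensities for three genes.
toy_cy5 = [200.0, 400.0, 800.0]
toy_cy3 = [100.0, 100.0, 100.0]
print(normalized_log_ratios(toy_cy5, toy_cy3))
```

After centering, positive values indicate higher signal in the Cy5 channel than is typical for the array, and negative values indicate lower signal.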
Microarray Data Analysis Convert microarray images to data
Microarray Data Analysis Clustering: • If two genes are expressed in the same way, they may be functionally related. • If a gene has unknown function, but clusters with genes of known function, this is a way to assign its general function. • We may be able to look at high resolution measurements of expression and figure out which genes control which other genes. • E.g. peak in cluster 1 always precedes peak in cluster 2 => cluster 1 turns cluster 2 on?
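The notion of genes being "expressed in the same way" is often quantified with a correlation between expression profiles. The sketch below uses Pearson correlation and looks up the known gene whose profile is most similar to a query profile. The gene names and values are hypothetical, and a full clustering method (hierarchical, k-means, and so on) would build on the same similarity measure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def most_similar_gene(query, profiles):
    """Name of the profile with the highest correlation to the query."""
    return max(profiles, key=lambda name: pearson(query, profiles[name]))


# Hypothetical genes measured under four conditions.
known_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 1.0, 2.0, 1.0],
}
unknown_profile = [2.0, 4.0, 6.0, 8.5]
print(most_similar_gene(unknown_profile, known_profiles))
```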
Microarray Data Analysis Classification: • Uses known groups of interest (from other sources) to • learn the features associated with these groups in the primary data, • create rules for associating the data with the groups of interest. • Often called “supervised machine learning.”
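A nearest-centroid classifier is about the simplest example of learning features from known groups and then creating a rule for new data. It is used here purely as an illustration; the slides do not prescribe a particular classifier, and the samples and labels below are made up.

```python
def train_centroids(samples, labels):
    """Average the expression vectors of each labeled group."""
    sums, counts = {}, {}
    for vector, label in zip(samples, labels):
        total = sums.setdefault(label, [0.0] * len(vector))
        for index, value in enumerate(vector):
            total[index] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def classify(vector, centroids):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical two-gene expression measurements with known group labels.
training_samples = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
training_labels = ["normal", "normal", "tumor", "tumor"]
centroids = train_centroids(training_samples, training_labels)
print(classify([4.5, 4.9], centroids))
```

The training step corresponds to learning the features associated with each group, and `classify` is the resulting rule for associating new data with a group.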
Genetic Networks What is a genetic network? • Individual genes have a function (e.g. transforming a substance or binding to a substance) • Sets of functions when sequenced can produce pathways (e.g. output of one transformation is the input to another) • Sets of pathways, as they interact with other pathways, create a genetic network of interactions.
Genetic Networks Reconstructing Genetic Regulatory Networks: • Hard problem. • Given N genes, the number of possible pairwise connections grows quadratically with N, and the number of possible networks grows exponentially. • Relationships are not generally simply +/- but are continuous valued. • Must use knowledge about expected function and membership in pathways to prune the list of possible network interactions.
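To see how quickly the search space grows, count the candidate networks. N genes admit N(N-1) ordered gene pairs (excluding self-regulation, an assumption made here), and if each pair is either connected or not, there are 2^(N(N-1)) possible directed networks. Allowing each connection to be activating, repressing, or absent gives 3^(N(N-1)). The function names are illustrative.

```python
def possible_edges(n_genes):
    """Number of ordered gene pairs (regulator, target), excluding self-loops."""
    return n_genes * (n_genes - 1)


def possible_networks(n_genes, states_per_edge=2):
    """Number of distinct networks when each edge takes one of the given states."""
    return states_per_edge ** possible_edges(n_genes)


for n in (2, 3, 5, 10):
    digits = len(str(possible_networks(n)))
    print(f"{n} genes: {possible_edges(n)} possible edges, "
          f"a {digits}-digit number of on/off networks")
```

Even before continuous-valued relationships are considered, exhaustive search is hopeless for more than a handful of genes, which is why prior knowledge is needed to prune the candidates.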
Comparative Genomics • Large scale comparison of genomes to • understand the biology of individual genomes • extract general principles applying to groups of genomes. • Assumption: • many biological sequences, structures, and functions are shared across organisms, • the signal from these organisms can be increased by combining them in analyses.
Comparative Genomics Important issues for Comparative Genomics • Aligning very large sequences • Comparative approaches to gene finding • Comparative approaches to assigning function • Comparative approaches to identifying key regulatory regions
Comparative Genomics Example: Assigning protein functions
Proteomics • What is PROTEOMICS? • -OMICS has become the suffix to denote the study of the entire set of something • Genomics: study of all genes • Proteomics: study of all proteins • Transcriptomics: study of all mRNA transcripts • Metabolomics: study of all metabolites in a cell
Proteomics Proteomics questions • Which proteins are made from the genome? • What is their 3D structure? • Where are they? • What do they do? • Which other proteins do they interact with? • Are they modified in the cell post-translationally?
Proteomics Key proteomic technologies • 3D structure determination (X-ray/NMR) • 2D Gels to assess all the proteins in a cell. • Mass spectrometry to identify proteins, protein modifications. • Yeast-Two-Hybrid systems to assess protein-protein interactions • Protein Arrays to assess all proteins in a cell using antibodies or other recognition technology.
Biomedical Ontologies • In order to communicate effectively we need: • common language • basic knowledge • Example: • Metabolic Pathways: • language: names of products, enzymes, substrates and pathways • knowledge: what is a reaction, how do enzymes and substrates participate, what are the legal components of a pathway
Biomedical Ontologies Gene Ontology (http://www.geneontology.org/) • Used to classify gene function. • A controlled listing of three types of function: • Molecular Function • Biological Process • Cellular Component
Biological Text Mining • Literature in Biomedicine • Much literature generated quickly. • 11 million citations in MEDLINE. • 400,000 added yearly. • Need methods to deal with data. • Query • Summarize • Organize • Understand