Overture (Motivating bioinformatics)

Overture(Motivating bioinformatics) CS 498 SS Saurabh Sinha

Special issue of journal Science, July 1, 2005.

>What Is the Universe Made Of?>What is the Biological Basis of Consciousness?>Why Do Humans Have So Few Genes?>To What Extent Are Genetic Variation and Personal Health Linked?>Can the Laws of Physics Be Unified?>How Much Can Human Life Span Be Extended?>What Controls Organ Regeneration?>How Can a Skin Cell Become a Nerve Cell?>How Does a Single Somatic Cell Become a Whole Plant?>How Does Earth's Interior Work?>Are We Alone in the Universe?>How and Where Did Life on Earth Arise?>What Determines Species Diversity?>What Genetic Changes Made Us Uniquely Human?>How Are Memories Stored and Retrieved?>How Did Cooperative Behavior Evolve?>How Will Big Pictures Emerge from a Sea of Biological Data?>How Far Can We Push Chemical Self-Assembly?>What Are the Limits of Conventional Computing?>Can We Selectively Shut Off Immune Responses?>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?>Is an Effective HIV Vaccine Feasible?>How Hot Will the Greenhouse World Be?>What Can Replace Cheap Oil -- and When?>Will Malthus Continue to Be Wrong?

Where does bioinformatics come in ?

“Why do humans have so few genes ?”

A simple organism Environmental signal GENE Response (protein) Raw materials

A simple organism GENE1 GENE2 GENE3

GENE1 GENE2 GENE3 GENE4 GENE5 GENE6 GENE7 GENE8 GENE9 GENE10 A simple organism

GENE1 GENE2 GENE3 GENE4 GENE5 GENE6 GENE7 GENE8 GENE9 GENE10 A complex organism Complex circuit of interactions

Regulatory networks • Genes are switches, transcription factors are (one type of) input signals, proteins are outputs • Proteins (outputs) may be transcription factors and hence become signals for other genes (switches) • This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity) • Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information

Decoding the regulatory network: Method 1 • Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance • These are likely to be binding sites for transcription factors • Knowing these can tell us if a gene is regulated by a transcription factor (i.e., the “switch”)

How to find motifs ? • One method called “Gibbs sampling” • A special kind of “Markov chain Monte Carlo” sampling technique popular in physics and computer science • We shall see such an approach (Lawrence et al. 1992) later in the course

Decoding the regulatory network: Method 2

Common problem in bio- informatics: Integration of different types of information, in principled manner More on method 2 • Paper from Nature journal, Sep 2004. • We shall see this later in the course • Integrates multiple, heterogeneous sources of information • Multiple species conservation • What is functional is probably conserved • How to find “conserved” sequence ? (Next topic) • “ChIP-on-chip” data • Which transcription factors bind which sequences ? A high throughput, low resolution measurement

Sequence alignment • The staple of a bioinformatician’s diet • Are two sequences similar ? Are portions of two sequences similar ? Which portions are these ? • Several algorithms that use dynamic programming, a popular technique in computer science • Realistic versions of the problem are NP-hard, approximation algorithms abound. • Can we handle sequences of length ~109 ?

Sequence alignment • Blanchette et al. (2004) in the journal Genome Research • We shall see this paper later in the course • Efficient algorithm for genome-wide multiple sequence alignment • Uses the “divide and conquer” approach, as well as “dynamic programming” • Can help in reconstruction of “ancestral genomes” • Jurassic Park MMVI ? ! ? !

On counting genes • The original question was “Why do humans have so few genes?” • How do we know how many genes there are in the human genome ? (And where they are in the genome) • Experiments can be designed, but bioinformatics plays a major role

Gene prediction • The task of predicting the locations of genes in a new genome (“annotation”) • Several gene prediction software exist • We shall see one such approach (Lukashin & Borodovsky 1998, in the journal Nucleic Acids Research) later in the course. • The more sophisticated ones use Hidden Markov models (HMM) and multiple species comparison

“What controls organ regeneration ?”“How does a single somatic cell become a whole plant ?”

Regeneration is mysterious, but do we understand the “generation” part of “regeneration” • Developmental biology • The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult • A paradox : All cells in the adult body have the same DNA, then how come different cells are different ?

… and to this ? Drosophila (fruitfly) How does a single cell lead to this ? …

Answer: Regulatory networks (Again !) • Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo • Hidden Markov models, a popular technique in signal processing, used to detect such regions • Rajewsky et al. (2002) in the journal “BMC Bioinformatics” • Multiple species comparison aids discovery • Evolutionary models to integrate multiple species information in the HMM framework • Sinha et al. (2004) in the journal “BMC Bioinformatics”

“How did cooperative behavior evolve?”

A related question • What is the genetic (molecular) basis of social behavior ? • Social behavior in honey bees • Young worker bees are nurses in the hive; older ones go out to forage • This behavioral maturation is determined by needs of colony • Shortage of foragers => some nurses will become foragers prematurely • Bees respond to social cues. • What is the genetic basis of this ?

Bioinformatics of social behavior • UIUC team (Sinha, Ling, Whitfield, Zhai, Robinson) scanned the genome to understand this • Attend the BIO seminar to learn about this • Regulatory network of social behavior • Statistical tools, such as Hypergeometric test, partial correlation analysis, threshold-based classification, support vector machine classifier, etc. used for this project

“How will big pictures emerge from a sea of biological data?”

The sea • Genomes: 3 x 109 bp of human genome • Similar numbers for other genomes: mouse, rat, dog, chicken, chimp etc. • Microarray: snapshots of 1000s of genes’ activities at one time and condition. Thousands of microarrays. • ChIP-on-chip data: measurements of a transcription factor’s binding affinity for 1000s of genes (promoters).

Big pictures • An example: Segal et al. (2004) in the journal Nature Genetics • We shall see this work later in the course • “A module map showing conditional activity of expression modules in cancer” (Segal et al.) • Integrates sequence, motif, and microarray data through statistical tests to predict involvement of particular genes in particular types of cancer

The sea of biological data • Biological literature, capturing decades of painstaking experimental work on genetics and molecular biology • Can we glean useful information from this vast body of knowledge ? • Biological literature mining. • Natural language processing • Text Information Retrieval (statistical approaches) • Ling et al. (2005) in Pacific Symposium on Biocomputing. On “Gene summarization” • We shall see this paper later in the course

Some other challenges • Protein structure prediction • Can we predict the 3-D structure of a protein from its amino acid sequence ? • Why ? • One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function • We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design !! • One approach: neural networks, a popular technique in comptuer science, applied to this problem (Jones, 1999 in the J. of Mol Bio.)

Some other challenges • “Metagenomics” • Most studies to date are on genomes of one species • A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these ? • Bioinformatics is indispensable !! • New type of data, new types of algorithms

Many more challenges • New types of data come due to technological breakthroughs in biology • High throughput data carries unprecedented amount of information • Too much noise • Bioinformatics removes the noise and reveals the truth

Bioinformatics • Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.) • Is about a family of very different problems, all related to biology, all related to each other • How can computers help solve any of this family of problems ?

Bioinformatics and You • You can learn the tools of bioinformatics • These tools owe their origin to computer science, information theory, probability theory, statistics, etc. • You can learn the language of biology, enough to understand what the problems are • You can apply the tools to these problems and contribute to science

Overture (Motivating bioinformatics)

Overture (Motivating bioinformatics)

Presentation Transcript

OVERTURE

Motivating People

Overture to “ Candide ”

The 2011 Overture:

Bioinformatics

The CMBI: Bioinformatics

William Tell Overture

William Tell Overture

Overture

Motivating students through project work

Motivating Operations, Rule Setting, and Stimulus Control

Bioinformatics and Machine Learning

The French Overture

Overture

OVERTURE

Overture: Jeremiah

The CMBI: Bioinformatics

(Now Playing – “Lilith Overture”)

Lecture 7. Basic statistical modeling

Lecture 6. Hidden Markov Models

The 2011 Overture:

CS 177 Introduction to Bioinformatics Fall 2004