Bioinformatics

Gil McVean, Department of Statistics

What is it to be a human?

What is it to be an individual?
Photos from UN photo gallery: www.un.org/av/photo

Is it your genes?

Is it your transcripts?

Is it your proteins?

Is it your protein interactions?

Is it your systems?

Bioinformatics and genome biology
• Bioinformatics is the analytical wing of genome biology
• It concerns itself with large amounts of data (more than you can look at!)
• It uses computers and efficient algorithms
• It is
  • Data assembly
  • Data summary
  • Data modelling
  • Data analysis

The raw material

The output

Classical bioinformatics I: DNA and protein sequence alignment

Classical bioinformatics II: Genome assembly

Classical bioinformatics III: Gene finding

Classical bioinformatics IV: Protein structure prediction

Bioinformatics of genetic variation
• An area of considerable current attention is human genetic variation
• The aim of current experiments is to map the genetic basis of human phenotypic variation
  • Disease susceptibility
  • Normal variation
• It is challenging because of
  • The scale of the data
  • The structure of the data
  • The underlying processes that shape variation
• Bioinformatics is needed to
  • Assemble, collate, check and summarise data
  • Model the data
  • Make inferences

What does the data look like?
• Single Nucleotide Polymorphisms (SNPs)
• Insertion-Deletion Polymorphisms (INDELs)

TGCTTGGCAGGGCAGACTGACTGT
TGCTTGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAGACTGACTGT
   SNP         INDEL
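As a rough illustration of how the two kinds of variant can be read off an alignment, the sketch below scans the six aligned sequences from the slide column by column. The function name `find_variants` and the rule "a column containing a gap is an INDEL, any other variable column is a SNP" are assumptions made for this example, not a description of any production variant caller. Positions are 0-based.

```python
# Aligned sequences copied from the slide; "-" marks a gap.
ALIGNED = [
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
]


def find_variants(aligned):
    """Return (position, kind, alleles) for every variable alignment column.

    Hypothetical helper: assumes all sequences have the same length and
    that the alignment itself is already correct.
    """
    variants = []
    for pos, column in enumerate(zip(*aligned)):
        alleles = set(column)
        if len(alleles) == 1:
            continue  # every sequence agrees here, so nothing to report
        kind = "INDEL" if "-" in alleles else "SNP"
        variants.append((pos, kind, sorted(alleles)))
    return variants


if __name__ == "__main__":
    for pos, kind, alleles in find_variants(ALIGNED):
        print(pos, kind, alleles)
```

Real data are much harder than this toy case, because the alignment has to be inferred and sequencing errors can look like variants.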

Collections of SNPs
[Figure: SNPs across population samples; panel labels: HCB, JPT, YRI, CEU]

Engineering challenges
• Identifying SNPs
• Working out which SNPs will work on a given platform
• Controlling the genotyping work-flow
• Controlling the output quality
• Performing quality-assurance exercises
• Identifying problems, gaps and inconsistencies

A Bioinformatics problem: How small is my P-value?
• The basic idea of association studies is to look for genetic differences between groups
• It is easy to ask the question “Is there a significant difference in the frequency of a mutation between groups?”
[Figure: cases (D) and controls (C) compared at the locus of interest]
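One common way to ask that question for a single mutation is a chi-square test on the 2x2 table of allele counts in cases and controls. The sketch below is a minimal version using only the standard library. The function name and the example counts are invented for illustration, and the slides do not say which test the author has in mind. The P-value uses the identity that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Hypothetical helper for illustration only.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts: the mutation is on 60 of 100 case chromosomes
    # and on 40 of 100 control chromosomes.
    print(allele_chi_square(60, 40, 40, 60))
```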

The problems
• In a study of several hundred thousand mutations (or even millions) it is unlikely that we have actually typed the causal variant(s).
• In a study of several hundred thousand mutations (or even millions), even if NONE of them are causal, a lot of them will show significance at the 5%, 1% or even 0.01% level
• Differences in the frequency of disease incidence between groups (for example African Americans and European Americans) will be associated with ANY genetic difference between them
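The second problem is a matter of arithmetic: if none of the mutations is causal, the expected number that pass a significance level alpha is simply the number of tests times alpha. The sketch below works this through for 500,000 tests, a figure chosen here to stand in for "several hundred thousand". The Bonferroni threshold is included as one standard correction; the slides do not name a specific correction, so treat it as an assumption of this example.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of tests significant at level alpha when no effect is real."""
    return num_tests * alpha


def bonferroni_threshold(num_tests, family_alpha=0.05):
    """Per-test level that keeps the chance of any false positive near family_alpha."""
    return family_alpha / num_tests


if __name__ == "__main__":
    num_tests = 500_000  # assumed study size, for illustration
    for alpha in (0.05, 0.01, 0.0001):
        print(alpha, expected_false_positives(num_tests, alpha))
    print("Bonferroni per-test level:", bonferroni_threshold(num_tests))
```

Under these assumptions, roughly 25,000 mutations would pass the 5% level, about 5,000 the 1% level and about 50 the 0.01% level, all by chance alone.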

What we really want to ask
• “Does any of the genome show an association with disease over and above any effect I might expect from the correlation between genotype and environmental risk?”
• “If so, what is the most likely position for the causal mutation(s)?”
• Answering these questions is difficult, but a natural way to approach the problem is to model the process

Modelling genetic variation
[Figure: the model links evolutionary parameters to the population through a stochastic evolutionary process (selection, mutation, genetic drift, recombination, migration), and the population to the sample through a stochastic sampling process; inference is made from the sample.]

Sample:
ATGCATGGGCTATTGGACCT
ATGGATGGGCTATTGCACCT
ATGCATGGGCAATTGCACCT
ATGCATGGGCAATTGGACCT
ATGGATGGGCTATTGCACCT

Genes in populations
[Figure; label: Present day]

Ancestry of current population
[Figure; label: Present day]

Ancestry of sample
[Figure; label: Present day]

The coalescent: samples in populations
[Figure: genealogy of a sample along a time axis ending at the present day; labels: ancestral lineages, coalescence, most recent common ancestor (MRCA)]
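The genealogy in the figure can be simulated directly. The sketch below assumes the standard neutral coalescent for a constant-size population, which the slide does not spell out: while k ancestral lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, in rescaled coalescent time units. Under that assumption the expected time to the MRCA of n samples is 2(1 - 1/n). The function names are invented for this example.

```python
import random


def simulate_tmrca(sample_size, rng):
    """Simulate one time to the most recent common ancestor (coalescent units)."""
    time = 0.0
    lineages = sample_size
    while lineages > 1:
        # Any of the k*(k-1)/2 pairs of lineages may coalesce next.
        rate = lineages * (lineages - 1) / 2
        time += rng.expovariate(rate)
        lineages -= 1  # two lineages merge into one ancestor
    return time


def expected_tmrca(sample_size):
    """Theoretical mean TMRCA under the same assumptions."""
    return 2 * (1 - 1 / sample_size)


if __name__ == "__main__":
    rng = random.Random(1)
    n, replicates = 10, 20_000
    mean = sum(simulate_tmrca(n, rng) for _ in range(replicates)) / replicates
    print("simulated mean:", mean, "theory:", expected_tmrca(n))
```

Converting these units into generations or years needs an effective population size, which is itself something to be inferred from data.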

How does this help us to think about mapping disease?
• Individuals are related to each other through their genealogical history
• Two nearby points on the genome will have similar genealogical histories, so mutations at these positions will also be correlated
• Understanding how genealogical history changes along the genome (through recombination) and between populations (through historical demography) will allow us to
  • Construct more powerful tests for disease association
  • Localise disease-associated mutations
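The correlation between mutations at nearby positions is usually summarised as linkage disequilibrium. One widely used measure is r squared, the squared correlation between alleles at two sites. The sketch below computes it from a list of haplotypes, each coded as a pair of 0/1 alleles. The function name and the toy haplotypes are assumptions made for illustration, and the slides do not commit to a particular measure.

```python
def r_squared(haplotypes):
    """Squared allelic correlation between two biallelic sites.

    haplotypes: list of (allele_at_site_a, allele_at_site_b) pairs, coded 0 or 1.
    Hypothetical helper; assumes phased haplotypes with no missing data.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # departure from independence between the sites
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    linked = [(0, 0), (1, 1), (0, 0), (1, 1)]    # alleles always travel together
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]  # every combination equally common
    print(r_squared(linked), r_squared(unlinked))
```

A value near 1 means a typed SNP can stand in for an untyped neighbour, which is why an association signal can appear even when the causal variant itself has not been typed.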

The bioinformatics module
• Genomic technologies
• Annotating genomes
• Modelling gene evolution
• Mapping disease genes
• Measuring gene and protein expression
• Predicting protein structure
