Bioinformatics. Gil McVean, Department of Statistics. What is it to be a human? What is it to be an individual? Photos from UN photo gallery www.un.org/av/photo. Is it your genes? Is it your transcripts? Is it your proteins? Is it your protein interactions? Is it your systems?
Bioinformatics Gil McVean Department of Statistics
What is it to be an individual? Photos from UN photo gallery www.un.org/av/photo
Bioinformatics and genome biology
• Bioinformatics is the analytical wing of genome biology
• It concerns itself with large amounts of data (more than you can look at!)
• It uses computers and efficient algorithms
• It is
  • Data assembly
  • Data summary
  • Data modelling
  • Data analysis
Classical bioinformatics I: DNA and protein sequence alignment
Bioinformatics of genetic variation
• An area of considerable current attention is human genetic variation
• The aim of current experiments is to map the genetic basis of human phenotypic variation
  • Disease susceptibility
  • Normal variation
• It is challenging because of
  • The scale of the data
  • The structure of the data
  • The underlying processes that shape variation
• Bioinformatics is needed to
  • Assemble, collate, check and summarise data
  • Model the data
  • Make inferences
What does the data look like?
• Single Nucleotide Polymorphisms (SNPs)
• Insertion-Deletion Polymorphisms (INDELs)

TGCTTGGCAGGGCAGACTGACTGT
TGCTTGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAGACTGACTGT
   ^           ^
   SNP         INDEL
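As a rough illustration of how the two kinds of variant can be read off an alignment like the one above, here is a minimal Python sketch. The function name `classify_columns` is invented for this example; it assumes the sequences are already aligned to equal length and that `-` marks a gap.

```python
def classify_columns(alignment):
    """Return (position, kind, alleles) for each variable column.

    Positions are 1-based. A column containing a gap is called an INDEL,
    any other variable column a SNP (a simplifying assumption).
    """
    variants = []
    for pos, column in enumerate(zip(*alignment), start=1):
        alleles = set(column)
        if len(alleles) == 1:
            continue  # every sequence agrees here, so nothing to report
        kind = "INDEL" if "-" in alleles else "SNP"
        variants.append((pos, kind, "".join(sorted(alleles))))
    return variants


# The six aligned sequences from the slide.
alignment = [
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
]

for pos, kind, alleles in classify_columns(alignment):
    print(pos, kind, alleles)
```

On this alignment the sketch reports a SNP at position 4 (A/T) and an INDEL at position 16. Real SNP discovery also has to deal with sequencing error and alignment uncertainty, which this ignores.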
Collections of SNPs
[Figure: SNPs compared across the HCB, JPT, YRI and CEU samples]
Engineering challenges
• Identifying SNPs
• Working out which SNPs will work on a given platform
• Controlling the genotyping work-flow
• Controlling the output quality
• Performing quality-assurance exercises
• Identifying problems, gaps and inconsistencies
A Bioinformatics problem: How small is my P-value?
• The basic idea of association studies is to look for genetic differences between groups
• It is easy to ask the question “Is there a significant difference in the frequency of a mutation between groups?”
[Figure: cases (D) and controls (C) compared at a locus of interest]
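One common way to ask that question at a single locus is a chi-square test on a 2x2 table of allele counts. The slide does not name a test, so the sketch below is only one possible choice. The function name `allele_chi_square` and the counts are hypothetical, and the test assumes all expected counts are reasonably large.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: 60 of 200 case alleles carry the mutation,
# against 40 of 200 control alleles.
chi2, p_value = allele_chi_square(60, 140, 40, 160)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

With these made-up counts the P-value falls just above 0.02, which would pass a 5% threshold for one test. Whether that is small enough once many loci are tested is a separate matter.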
The problems
• In a study of several hundred thousand mutations (or even millions), it is unlikely that we have actually typed the causal variant(s)
• In a study of several hundred thousand mutations (or even millions), even if NONE of them is causal, many will still show significance at the 5%, 1% or even 0.01% level
• Differences in disease incidence between groups (for example, African Americans and European Americans) will be associated with ANY genetic difference between them
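The second point is simple arithmetic: if no mutation is causal, the expected number of tests passing a threshold is the number of tests times the threshold. The sketch below works this through for a hypothetical study of 500,000 mutations. The Bonferroni threshold is included as one standard correction; the slides do not propose it.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests with P < alpha when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


n_tests = 500_000  # hypothetical study size ("several hundred thousand")

for alpha in (0.05, 0.01, 0.0001):  # the 5%, 1% and 0.01% levels
    count = expected_false_positives(n_tests, alpha)
    print(f"alpha = {alpha}: about {count:.0f} false positives expected")

print(f"Bonferroni per-test threshold: {bonferroni_threshold(n_tests):.1e}")
```

Even at the 0.01% level, about 50 mutations would look significant by chance alone. Bonferroni is conservative when tests are correlated, as they are for nearby mutations.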
What we really want to ask
• “Does any of the genome show an association with disease over and above any effect I might expect from the correlation between genotype and environmental risk?”
• “If so, what is the most likely position for the causal mutation(s)?”
• Answering these questions is difficult, but a natural way to approach the problem is to model the process
Modelling genetic variation
[Diagram: Evolutionary parameters → Population, through a stochastic evolutionary process (selection, mutation, genetic drift, recombination, migration); Population → Sample, through a stochastic sampling process; both steps are labelled MODEL. Inference runs back from the sample.]
Sample:
ATGCATGGGCTATTGGACCT
ATGGATGGGCTATTGCACCT
ATGCATGGGCAATTGCACCT
ATGCATGGGCAATTGGACCT
ATGGATGGGCTATTGCACCT
Genes in populations
[Figure label: Present day]
Ancestry of current population
[Figure label: Present day]
Ancestry of sample
[Figure label: Present day]
The coalescent: samples in populations
[Figure: ancestral lineages of a sample traced back in time from the present day, through coalescence events, to the most recent common ancestor (MRCA)]
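A minimal Python sketch of the idea in the figure, under the standard coalescent assumptions, which are not stated on the slide: a neutral, constant-size population, no recombination, and time measured in coalescent units. While k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2. The function name `simulate_tmrca` is invented for this example.

```python
import random


def simulate_tmrca(n_samples, rng):
    """Simulate the time back to the MRCA of n_samples lineages (coalescent units)."""
    time = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
        time += rng.expovariate(rate)  # waiting time while k lineages remain
    return time


rng = random.Random(1)  # fixed seed so the run is repeatable
n_samples = 10
replicates = 20_000

mean_tmrca = sum(simulate_tmrca(n_samples, rng) for _ in range(replicates)) / replicates
expected_tmrca = 2 * (1 - 1 / n_samples)  # theoretical mean under these assumptions

print(f"simulated mean TMRCA:   {mean_tmrca:.3f}")
print(f"theoretical mean TMRCA: {expected_tmrca:.3f}")
```

The simulated mean should sit close to the theoretical value of 2(1 - 1/n). The sketch tracks only coalescence times, not which lineages merge.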
How does this help us to think about mapping disease?
• Individuals are related to each other through their genealogical history
• Two nearby points on the genome will have similar genealogical histories, so mutations at these positions will also be correlated
• Understanding how genealogical history changes along the genome (through recombination) and between populations (through historical demography) will allow us to
  • Construct more powerful tests for disease association
  • Localise disease-associated mutations
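The correlation between mutations at two positions is often summarised by the squared correlation coefficient r² between alleles, a standard measure of linkage disequilibrium that the slides do not name. The sketch below computes it for the five sample sequences on the modelling slide. It treats them as phased haplotypes and assumes both sites are biallelic; the function name `r_squared` is invented for this example.

```python
def r_squared(haplotypes, i, j):
    """Squared allelic correlation between sites i and j (0-based indices).

    Assumes phased haplotypes of equal length and two alleles at each site.
    """
    n = len(haplotypes)
    allele_i = haplotypes[0][i]  # arbitrary reference allele at site i
    allele_j = haplotypes[0][j]  # arbitrary reference allele at site j

    p_i = sum(h[i] == allele_i for h in haplotypes) / n
    p_j = sum(h[j] == allele_j for h in haplotypes) / n
    p_ij = sum(h[i] == allele_i and h[j] == allele_j for h in haplotypes) / n

    denominator = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denominator == 0:
        raise ValueError("both sites must be polymorphic in the sample")

    d = p_ij - p_i * p_j  # the disequilibrium coefficient D
    return d * d / denominator


# The five sample sequences from the modelling slide.
haplotypes = [
    "ATGCATGGGCTATTGGACCT",
    "ATGGATGGGCTATTGCACCT",
    "ATGCATGGGCAATTGCACCT",
    "ATGCATGGGCAATTGGACCT",
    "ATGGATGGGCTATTGCACCT",
]

# The variable sites are positions 4, 11 and 16 (0-based indices 3, 10, 15).
for i, j in [(3, 10), (3, 15), (10, 15)]:
    print(f"sites {i + 1} and {j + 1}: r^2 = {r_squared(haplotypes, i, j):.3f}")
```

With only five sequences the estimates are very noisy, so the example shows the calculation rather than a meaningful result.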
The bioinformatics module
• Genomic technologies
• Annotating genomes
• Modelling gene evolution
• Mapping disease genes
• Measuring gene and protein expression
• Predicting protein structure