Learn the fundamental principles of genetics including DNA structure, gene anatomy, genetic code, and gene expression. Discover how DNA sequencing and analysis reveal genetic variations affecting phenotypes. Explore statistical techniques like probability distributions, hypothesis testing, and conditional probability applied in genetics research.
Intro. Bioinformatics • Spencer Muse, NCSU Statistics • Hamid Ashrafi, NCSU Horticultural Science • Fred Wright, NCSU Statistics/Biological Sciences • Block 1: DNA Sequence Analysis, 8/17 – 9/21, Spencer Muse
Basic Concepts For most organisms • DNA is the genetic material • DNA composes chromosomes • Chromosomes are found in the nucleus of cells • Chromosomes are inherited by offspring from parents
DNA, the Genetic Material • DNA is a chain of nucleotides, or bases. • DNA has 4 different nucleotides: • A: adenine • C: cytosine • G: guanine • T: thymine • In RNA, uracil (U) takes the place of thymine. • DNA is often found in a double-stranded form. A pairs with T, C pairs with G.
ATGCTACTTCACTGA
|||||||||||||||
TACGATGAAGTGACT
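The pairing rule on this slide (A with T, C with G) can be sketched in a few lines of Python. The function name `complement` and the `PAIRS` table are illustrative choices, not part of the original slides.

```python
# Watson-Crick pairing rules from the slide: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base pairing partner of a DNA strand."""
    return "".join(PAIRS[base] for base in strand)


top = "ATGCTACTTCACTGA"
bottom = complement(top)
print(top)
print("|" * len(top))
print(bottom)  # TACGATGAAGTGACT
```

Applying `complement` twice returns the original strand, since each base has exactly one partner.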
Genes A gene is a small region of a chromosome (and is thus simply a string of nucleotides). ATGCTACTTCACTGA
The Genetic Code • Protein-coding genes are composed of triplets of nucleotides called codons. • Each codon encodes one of 20 possible amino acids or a stop signal. • Chains of amino acids form proteins.
ATG CTA CTT CAC TGA
Met Leu Leu His Stop
M   L   L   H   *
Central Dogma
DNA:     ATG CTA CTT CAC TGA
  (Transcription)
RNA:     AUG CUA CUU CAC UGA
  (Translation)
Protein: M   L   L   H
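As a rough sketch, the two steps of the central dogma can be written as two small functions applied to the example gene. The codon table below is an assumption made for illustration: it contains only the five codons used in this example, whereas the full genetic code has 64.

```python
# Partial codon table (illustration only): just the codons in the example.
CODON_TABLE = {"AUG": "M", "CUA": "L", "CUU": "L", "CAC": "H", "UGA": "*"}


def transcribe(dna: str) -> str:
    """Write the coding-strand DNA sequence as RNA by replacing T with U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Read codons left to right and stop at the first stop codon (*)."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGCTACTTCACTGA")
print(rna)             # AUGCUACUUCACUGA
print(translate(rna))  # MLLH
```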
Anatomy of a Gene [Figure: gene diagram with the promoter, exons, and introns labeled]
Keywords: DNA, RNA, Nucleotide, Base, Transcription, Translation, Intron, Exon, RNA polymerase, Promoter, Chromosome, Gene, Protein, Amino acid, Splicing, Nucleus
Phenotypes • Which genes affect a phenotype? • Relating genetic variation to phenotypic variation
SNPs • Single Nucleotide Polymorphisms • Very dense SNP maps are currently being produced (1,000,000+ in humans) • Fast, cheap to score
Gene Expression Profiling using DNA Microarrays • Each spot corresponds to a single human gene • Signal color and intensity reveal changes in gene activity [Figure: microarray image with spots labeled Plasminogen activator inhibitor-2 and HMG CoA reductase]
Other Markers • SSRs (Simple Sequence Repeats; microsatellites) • RFLPs (Restriction Fragment Length Polymorphisms) • SSCPs (Single-Strand Conformation Polymorphisms)
Overview • Random Variables and Probability • Probability Distributions • Parameter Estimation • Hypothesis Testing • Likelihood • Conditional Probability • Stochastic Processes • Inference for Stochastic Processes
Probability The probability of a particular event occurring is the frequency of that event over a very long series of repetitions. • P(tossing a head) = 0.50 • P(rolling a 6) = 0.167 • P(average age in a population sample is greater than 21) = 0.25
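The long-run frequency idea can be checked by simulation. The sketch below is illustrative only; the helper `long_run_frequency`, the number of trials, and the fixed seed are assumptions chosen for the example.

```python
import random


def long_run_frequency(event, trials: int, seed: int = 0) -> float:
    """Estimate P(event) as the fraction of trials in which it occurs."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if event(rng))
    return hits / trials


# A fair coin and a fair six-sided die, each repeated 100,000 times.
heads = long_run_frequency(lambda rng: rng.random() < 0.5, 100_000)
six = long_run_frequency(lambda rng: rng.randint(1, 6) == 6, 100_000)
print(heads, six)
```

With more trials, the two estimates should settle closer to 0.50 and 0.167.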
Random Variables A random variable is a quantity that cannot be measured or predicted with absolute accuracy.
Probability Distributions • The distribution of a random variable describes the possible values of the variable and the probabilities of each value. • For discrete random variables, the distribution can be enumerated; for continuous ones we describe the distribution with a function.
Examples of Distributions • Binomial (discrete) • Normal (continuous)
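For a discrete random variable, the distribution can be listed value by value. A minimal sketch for the binomial case follows; the choice of n = 4 and p = 0.5 is an arbitrary example.

```python
from math import comb


def binomial_pmf(n: int, p: float) -> list:
    """Return [P(X = 0), ..., P(X = n)] for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


pmf = binomial_pmf(4, 0.5)
for k, prob in enumerate(pmf):
    print(k, prob)
print(sum(pmf))  # the probabilities of all possible values sum to 1
```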
Parameter Estimation One of the primary goals of statistical inference is to estimate unknown parameters. For example, using a sample taken from the target population, we might estimate the population mean using several different statistics: the sample mean, the sample median, or the sample mode. Different statistics have different sampling properties.
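One way to see that different statistics have different sampling properties is to draw many samples and compare how much each estimate varies. The sketch below assumes a normal population with mean 10 and standard deviation 2, purely for illustration. The mode is left out because it is not well defined for continuous measurements.

```python
import random
import statistics


def sampling_spread(estimator, n: int = 25, replicates: int = 2000, seed: int = 1):
    """Return the mean and standard deviation of an estimator across samples."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(10.0, 2.0) for _ in range(n)])
        for _ in range(replicates)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


mean_center, mean_sd = sampling_spread(statistics.mean)
median_center, median_sd = sampling_spread(statistics.median)
print("sample mean:  ", mean_center, mean_sd)
print("sample median:", median_center, median_sd)
```

Under these assumptions both statistics center near 10, but the sample median varies more from sample to sample than the sample mean.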
Hypothesis Testing A second goal of statistical inference is testing the validity of hypotheses about parameters using sample data. For example, for a success probability p we might test the null hypothesis $H_0: p = 0.5$ against the alternative $H_A: p > 0.5$. If the observed frequency is much greater than 0.5, we should reject the null hypothesis in favor of the alternative hypothesis. How do we decide what “much greater” is?
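One common way to make “much greater” precise is a p-value: the probability, if the null value 0.5 were true, of seeing a frequency at least as large as the one observed. The sketch below computes this exactly for a binomial count. The counts used (60 and 52 successes out of 100) are hypothetical.

```python
from math import comb


def upper_tail_p_value(successes: int, n: int, p0: float = 0.5) -> float:
    """P(X >= successes) when X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


print(upper_tail_p_value(60, 100))  # small: 60/100 is unusual if p = 0.5
print(upper_tail_p_value(52, 100))  # large: 52/100 is unremarkable
```

A small p-value is taken as evidence against the null hypothesis; the cutoff (often 0.05) is a convention chosen before looking at the data.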
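Likelihood For our purposes, it is sufficient to define the likelihood function as $L(\theta) = P(\text{data} \mid \theta)$, the probability of the observed data viewed as a function of the unknown parameter $\theta$. Analyses based on the likelihood function are well-studied, and usually have excellent statistical properties.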
Maximum Likelihood Estimation The maximum likelihood estimate of an unknown parameter is defined to be the value of that parameter that maximizes the likelihood function: $\hat{\theta} = \arg\max_{\theta} L(\theta)$. We say that $\hat{\theta}$ is the maximum likelihood estimate of $\theta$.
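Example: Binomial Probability If $X \sim \text{Binomial}(n, p)$, then $L(p) = \binom{n}{x} p^{x} (1-p)^{n-x}$. Some simple calculus shows that the MLE of $p$ is $\hat{p} = x/n$, the frequency of “successes” in our sample of size n. If we had been unable to do the calculus, we could still have found the MLE by plotting the likelihood. [Figure: binomial likelihood plotted against p]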
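Finding the MLE without calculus can be sketched as a grid search: evaluate the binomial likelihood at many candidate values of p and keep the largest. The data (7 successes in 20 trials) and the grid resolution are hypothetical choices for illustration.

```python
from math import comb


def binomial_likelihood(p: float, x: int, n: int) -> float:
    """Likelihood of success probability p given x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def grid_mle(x: int, n: int, steps: int = 1000) -> float:
    """Return the grid value of p in [0, 1] with the highest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: binomial_likelihood(p, x, n))


print(grid_mle(7, 20))  # agrees with the calculus answer x/n = 7/20
```

The grid answer is only as precise as the grid spacing, which is the price of skipping the calculus.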
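Likelihood Ratio Tests Consider testing the hypothesis $H_0: \theta \in \Theta_0$ versus $H_A: \theta \in \Theta_A$. The likelihood ratio test statistic is $\Lambda = \max_{\theta \in \Theta_0} L(\theta) \,/\, \max_{\theta \in \Theta_0 \cup \Theta_A} L(\theta)$.
Distribution of the Likelihood Ratio Test Statistic Under quite general conditions, $-2 \ln \Lambda \sim \chi^2_{n-1}$, where n-1 is the difference between the number of free parameters in the two hypotheses.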
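A minimal sketch of a likelihood ratio test for a binomial success probability p, testing p = 0.5 against an unrestricted p. Here the two hypotheses differ by one free parameter, so minus twice the log of the likelihood ratio is compared to a chi-square distribution with 1 degree of freedom. The function name `binomial_lrt` and the data (60 successes in 100 trials) are hypothetical.

```python
from math import erfc, log, sqrt


def binomial_lrt(x: int, n: int, p0: float = 0.5):
    """Return (-2 ln Lambda, approximate p-value) for H0: p = p0 vs. p free."""
    p_hat = x / n  # maximum likelihood estimate under the alternative

    def log_lik(p: float) -> float:
        # The binomial coefficient cancels in the ratio, so it is omitted.
        # Terms with a zero count contribute 0 (this avoids log(0)).
        total = 0.0
        if x > 0:
            total += x * log(p)
        if x < n:
            total += (n - x) * log(1 - p)
        return total

    # Clamp at 0 to guard against tiny negative values from rounding.
    statistic = max(0.0, -2.0 * (log_lik(p0) - log_lik(p_hat)))
    # For 1 degree of freedom, P(chi-square > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


print(binomial_lrt(60, 100))
```

The chi-square comparison is an approximation that improves as the sample size grows.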
Conditional Probability The conditional probability of event A given that event B has happened is $P(A \mid B) = P(A \cap B) \,/\, P(B)$, provided $P(B) > 0$.
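Conditional probability can be illustrated by counting outcomes. The sketch below uses two fair dice, an example that is not from the original slides, and divides the probability that both events occur by the probability of the conditioning event.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event) -> Fraction:
    """Probability of an event as (favorable outcomes) / (all outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def conditional(a, b) -> Fraction:
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    return prob(lambda o: a(o) and b(o)) / prob(b)


def sum_is_8(o):
    return o[0] + o[1] == 8


def first_is_even(o):
    return o[0] % 2 == 0


print(prob(sum_is_8))                        # 5/36
print(conditional(sum_is_8, first_is_even))  # 1/6
```

Knowing that the first die is even raises the probability of a total of 8 from 5/36 to 1/6.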
Stochastic Processes A stochastic process is a series of random variables measured over time. Values in the future typically depend on current values. • Closing value of the stock market • Annual per capita murder rate • Current temperature
ACGGTTACGGATTGTCGAA  t = 0
ACaGTTACGGATTGTCGAA  t = 1
ACaGTTACGGATgGTCGAA  t = 2
ACcGTTACGGATgGTCGAA  t = 3
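A sequence changing over time can be simulated as a simple stochastic process. The sketch below assumes, purely for illustration, that exactly one randomly chosen site changes to a different base at each time step; real models of sequence evolution let the number of changes vary. Changed sites are shown in lowercase, as in the example above.

```python
import random

BASES = "ACGT"


def mutate_once(sequence: str, rng: random.Random) -> str:
    """Replace one randomly chosen site with a different base (lowercased)."""
    site = rng.randrange(len(sequence))
    current = sequence[site].upper()
    new_base = rng.choice([b for b in BASES if b != current])
    return sequence[:site] + new_base.lower() + sequence[site + 1:]


def simulate(sequence: str, steps: int, seed: int = 0) -> list:
    """Return the sequence at times 0, 1, ..., steps."""
    rng = random.Random(seed)
    history = [sequence]
    for _ in range(steps):
        history.append(mutate_once(history[-1], rng))
    return history


for t, seq in enumerate(simulate("ACGGTTACGGATTGTCGAA", 3)):
    print(seq, " t =", t)
```

Each state depends only on the state before it, which is the sense in which future values depend on current values.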
Inference for Stochastic Processes We often need to make inferences that involve the changes in molecular genetic sequences over time. Given a model for the process of sequence evolution, likelihood analyses can be performed.
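As one concrete possibility, the sketch below assumes the Jukes-Cantor model (every base changes to each other base at the same rate), a model the slides do not name, and estimates the evolutionary distance d between two sequences by maximizing the likelihood over a grid. The two sequences are the t = 0 and t = 3 sequences from the stochastic process example.

```python
from math import exp, log


def jc69_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of distance d (expected substitutions per site).

    A constant factor of 1/4 per site is omitted; it does not affect the MLE.
    """
    decay = exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay       # both sequences show the same base
    p_each_diff = 0.25 - 0.25 * decay  # second shows one specific other base
    return (n_sites - n_diff) * log(p_same) + n_diff * log(p_each_diff)


def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of sites at which two aligned sequences differ (case ignored)."""
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


start = "ACGGTTACGGATTGTCGAA"
end = "ACcGTTACGGATgGTCGAA"
n = len(start)
k = count_differences(start, end)

# Grid search over d, then the known closed-form answer for comparison.
grid = [i / 1000 for i in range(1, 2001)]
d_grid = max(grid, key=lambda d: jc69_log_likelihood(d, n, k))
d_closed = -0.75 * log(1 - 4 * (k / n) / 3)
print(k, d_grid, d_closed)
```

Only two sites differ between the two sequences even though three changes occurred, because one site changed twice. The estimated distance is slightly larger than the observed proportion of differences, since the model allows for such hidden changes.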