BIOL3014 Review Advanced Bioinformatics
Proteins are linear polymers that fold up by themselves…mostly.
The amino acids • They can be grouped in many ways according to the chemical and physical properties (e.g. size) of the side chain. • Here is one grouping based on chemical properties: • Basic: proton acceptors • Acidic: proton donors • Uncharged polar: have polar groups like CONH2 or CH2OH • Nonpolar: tend to be hydrophobic • Weird: proline’s side chain links back to the N in the main chain • Strong: cysteine can make “disulphide bridges”
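Alpha Helix • 3.6 amino acids (residues) per turn • O(i) hydrogen bonds to N(i+4): the backbone carbonyl oxygen of residue i accepts a hydrogen bond from the backbone amide N-H of residue i+4 • Figure sources: Wikipedia; from book…correct?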
Beta Sheet • A. Three strands shown • B. Anti-parallel sheet • C. Parallel sheet • Sheets are usually curved and can even form barrels.
Beta Turns: getting around tight corners • Steric hindrance determines whether a tight turn is possible • R3’s side chain is usually hydrogen (i.e. R3 is usually glycine)
X-ray crystallography • Needs crystallized proteins • Hard to get crystals • Very tough for hydrophobic (e.g. transmembrane) proteins • Better accuracy than NMR • Expensive: $100,000/protein
NMR spectroscopy • Protons resonate at a frequency that depends on their chemical environment. • This can be used to determine structure. • Does not require crystallization; protein may be in solution. • Lower resolution than X-ray crystallography
Predict what? • There are many types of secondary structure. • Which do we want to predict? • Alpha helix • Beta strand • Beta turn • Random coil • Pi-helices • 3₁₀-helices • Type I turns • …
Start with some proteins of known structure • Get some good X-ray or NMR models of proteins. • Since we know their tertiary structures, certainly we can assign each residue in each protein a secondary state. • Or can we?
DSSP to the rescue! • In 1983 Kabsch and Sander introduced DSSP (Dictionary of Protein Secondary Structure…not a typo). • It automated the assignment of secondary structure from tertiary structure to make it less arbitrary.
Rules • Chou-Fasman: created tables of breaking/forming propensity and the relative frequency of each residue type in helices and strands. • Self information (what the identity of a residue tells you about its likely secondary structure state) is not the only thing we can extract from the known structures. • Maybe certain residues have a strong influence on (or are strongly correlated with) what the secondary state is several residues away. So, look at “long-distance” relationships: • Directional information: information about the conformation at position i carried by the residue at position j, where i≠j, independent of the type of residue at position i. • Pair information: like directional information, but takes account of the type of residue at position i.
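A minimal sketch of how a propensity table in the spirit of Chou-Fasman could be estimated from proteins with known secondary structure. It assumes each protein comes as a sequence string and an equally long state string; the toy data, the state letters (H, E, C) and the function name `propensities` are illustrative only, and the propensity is taken here to be P(residue | state) / P(residue).

```python
from collections import Counter

def propensities(sequences, states):
    """Return {(residue, state): P(residue | state) / P(residue)}."""
    pair_counts = Counter()
    residue_counts = Counter()
    state_counts = Counter()
    for seq, ss in zip(sequences, states):
        if len(seq) != len(ss):
            raise ValueError("sequence and state strings must have equal length")
        for aa, s in zip(seq, ss):
            pair_counts[(aa, s)] += 1
            residue_counts[aa] += 1
            state_counts[s] += 1
    total = sum(residue_counts.values())
    table = {}
    for (aa, s), n in pair_counts.items():
        p_aa_given_state = n / state_counts[s]
        p_aa = residue_counts[aa] / total
        table[(aa, s)] = p_aa_given_state / p_aa
    return table

# Hypothetical toy data: H = helix, E = strand, C = coil.
seqs = ["AEAAKGVVV", "AALEGTVIV"]
sss = ["HHHHHCEEE", "HHHHCCEEE"]
table = propensities(seqs, sss)
print(table[("A", "H")], table[("V", "E")])
```

A value above 1 means the residue is over-represented in that state relative to its overall frequency (a "former"); a value below 1 suggests a "breaker". Real tables need far more data than this toy set.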
Don’t forget about evolution! • Sequence evolves faster than structure. • So, imagine a position in an alpha helix (or other conformation) that recently mutated. • If we could find the orthologous residue in the same protein in other species, those residues would give us a much better picture. • So, we should look at the distribution of residues at that position, not just the residue in a particular protein.
PSI-BLAST is often used to get residue distributions • The simplest way to get an estimate of the distribution of residues at each position in the protein we are trying to predict is to use PSI-BLAST. • PSI-BLAST will output a “profile” containing an estimate of the residue distribution at each position in the query protein. • Each column of the profile is a multinomial probability vector. • The PSI-BLAST profile can be used in place of the protein in prediction rules. • PSI-BLAST also outputs a multiple alignment, and it, too, can be used in prediction rules. • You could predict the secondary structure for each protein in the alignment, and choose the “majority” or “average” prediction.
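To make the idea of a profile column as a multinomial probability vector concrete, here is a simplified sketch that estimates residue distributions from a small multiple alignment. It is not PSI-BLAST's actual procedure (which also weights sequences, among other things); the alignment, the flat pseudocount and the name `column_profile` are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(alignment, pseudocount=1.0):
    """Return one probability vector (as a dict) per alignment column.

    Gap characters and anything outside the 20 standard residues are ignored.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

# Hypothetical alignment of the query (first row) with three homologs.
alignment = ["MKVLA", "MRVLA", "MKIL-", "MKVMA"]
profile = column_profile(alignment)
print(round(profile[0]["M"], 3), round(profile[1]["R"], 3))
```

Each column sums to 1, so a prediction rule can use the whole vector at a position instead of the single residue found in the query protein.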
Why use HMMs for transmembrane topology? • Transmembrane proteins have a simple, repetitive topology. • The topology can be subdivided into a small set of regions. • Helices • Inside • Outside • Tails/Caps (at ends of helices) • The helices tend to have lengths in a limited range.
HMM design: Modeling sequences of varying lengths • (Figure: a single state with self-loop probability p and exit probability 1-p.) • Self-loops can model sequences of length 1 to infinity: L = [1,…,infinity] • Each time through the self-loop generates one more letter. • This 1-state model generates sequences of length L with probability $\Pr(L) = p^{L-1}(1-p)$. • So, you control the length of the sequences (sort of…).
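The length distribution of a self-looping state is geometric, which the following sketch checks numerically. It assumes a single state that emits one letter, then loops back with probability p or exits with probability 1-p; the value p = 0.9 and the function names are illustrative.

```python
import random

def length_probability(L, p):
    """Probability that a state with self-loop probability p emits exactly L letters."""
    if L < 1:
        return 0.0
    return p ** (L - 1) * (1 - p)

def sample_length(p, rng):
    """Simulate the state: emit one letter, then keep looping with probability p."""
    length = 1
    while rng.random() < p:
        length += 1
    return length

p = 0.9
expected_length = 1 / (1 - p)  # mean of this geometric distribution
rng = random.Random(0)
samples = [sample_length(p, rng) for _ in range(20000)]
mean_length = sum(samples) / len(samples)
print(expected_length, mean_length)
```

The "sort of" in the slide shows up here: p fixes the mean length at 1/(1-p), but the most probable length is always 1 and the spread is wide. That is one reason helix regions of limited length are often modeled with chains of several states rather than one self-loop.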
Grouping states • To avoid over-fitting, we want to reduce the number of parameters. • Each emitting state has nineteen free parameters (one for each amino acid - 1). • If a group of states are modeling regions with very similar amino acid preferences, why not require that they all use the same parameters? • If you tie n states together, you “save” 19(n-1) parameters (n separate sets of 19 collapse into one shared set of 19), so the model is less prone to over-fitting when you train it.
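The parameter bookkeeping for tied states can be sketched as below. The state count (25 helix states) is a made-up example, and `emission_parameters` is a hypothetical helper, not part of any HMM library.

```python
def emission_parameters(n_states, n_tied_groups=None, alphabet_size=20):
    """Count free emission parameters.

    Each distinct emission distribution over the alphabet has
    alphabet_size - 1 free parameters (the probabilities must sum to 1).
    States in the same tied group share one distribution.
    """
    free_per_distribution = alphabet_size - 1
    n_distributions = n_states if n_tied_groups is None else n_tied_groups
    return n_distributions * free_per_distribution

n = 25  # hypothetical number of helix states
untied = emission_parameters(n)
tied = emission_parameters(n, n_tied_groups=1)
saved = untied - tied  # equals 19 * (n - 1)
print(untied, tied, saved)
```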
Generalization • We want to know how well a model will generalize to data it has never “seen”. • If we test (measure accuracy) on the same data we trained on: • We overestimate the generalization accuracy • We will tend to over-fit the training data (by adjusting the model design to fit it)
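One common way to keep test data "unseen" is to hold part of the data out of training, for example with k-fold cross-validation. The sketch below only produces the splits; training and scoring a model on each split is left out, and the protein names and the choice k = 5 are placeholders.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists so that every item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical data set of ten proteins with known structure.
proteins = [f"protein_{i}" for i in range(10)]
splits = list(k_fold_splits(proteins, k=5))
for train, test in splits:
    print(len(train), len(test))
```

If the model design is tuned by looking at these test folds, the estimate becomes optimistic again, so a final untouched test set is still worth keeping aside. For proteins, homologs of a test protein should also be kept out of the training folds.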
Sample questions 1. Obtaining protein secondary structure • a. Define the protein secondary structure task. • b. List five types of secondary structure element. • c. Describe what is meant by the ideas of “self information”, “directional information” and “pair information” when predicting secondary structure using a sliding-window method. • d. What is a PSI-BLAST profile and why is it used in secondary structure prediction? • e. What kinds of proteins are HMMs particularly suited to modeling?
The Goals • Functional Genomics: • To know when, where and how much genes are expressed. • To know when, where, what kind and how much of each protein is present. • Systems Biology: • To understand the transcriptional and translational regulation of RNA and proteins in the cell.
Measuring Gene Expression • What we want to do is measure the number of copies of each RNA transcript in a cell at a given point in time. • Extract the RNA from the cell. • Measure each type of transcript quantitatively. • How do you measure it? • Sequence it in a quantitative way • But sequencing is (used to be) very expensive • So, use technology and tricks…
Low-throughput Sequencing • qPCR (quantitative PCR, also called real-time PCR or rtPCR) allows you to accurately measure a given transcript. • But you have to decide which transcript you want to measure and make primers for it. • So it is very expensive and low-throughput. • So the “array technologies” were born…
Gene Arrays • Put a bunch of different, short single-stranded DNA sequences at predefined positions on a substrate. • Let the unknown mixture of tagged DNA or RNA molecules hybridize to the DNAs. • Measure the amount of hybridized material.
Measuring Protein Expression • In order to measure all the types of protein in a cell we must: • Extract the proteins • Purify the proteins • Identify the individual proteins • How do we accomplish purification and identification of proteins?
The Technologies: Protein Expression • Low-throughput • 2D Gel Electrophoresis + Mass Spectrometry • Liquid chromatography + Mass Spectrometry • Protein microarrays • Limited in application at this point • Can be used for things other than protein expression like protein-protein interactions
Separating the Proteins: 2D Gel Electrophoresis • First step: pI/pH • Proteins are introduced to a gel with an immobilized pH gradient. • A voltage is applied. • Proteins migrate until the pH causes them to lose their net charge (isoelectric point) and then stop. • Second step: mass • First gel transferred to second gel • SDS (detergent) breaks structure and charges the proteins in proportion to their mass.
Steps of Mass Spectrometry • Digest: • Sample (spot) is digested with a proteolytic enzyme • Spectrum: • Peaks correspond to the mass-to-charge ratio of protein fragments • These provide a fingerprint • Identify: • Compare fingerprint to theoretical fingerprints • Post-translational modifications complicate identification because they shift fragment masses.
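The digest, fingerprint and identify steps can be sketched in a few lines. Everything specific here is an assumption for illustration: the enzyme is taken to be trypsin (cutting after K or R unless the next residue is P), peaks are treated as neutral monoisotopic peptide masses rather than raw mass-to-charge values, and the candidate sequences and tolerance are made up.

```python
import re

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # each free peptide carries one extra water

def tryptic_digest(protein):
    """Assumed enzyme: trypsin, cutting after K or R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def fingerprint(protein):
    """Theoretical fingerprint: sorted masses of the digest peptides."""
    return sorted(peptide_mass(p) for p in tryptic_digest(protein))

def match_fraction(observed, theoretical, tolerance=0.01):
    """Fraction of observed masses within `tolerance` Da of a theoretical mass."""
    if not observed:
        return 0.0
    hits = sum(any(abs(o - t) <= tolerance for t in theoretical) for o in observed)
    return hits / len(observed)

# Hypothetical candidate database and a pretend measurement.
candidates = {"protein_A": "MKWVTFISLLR", "protein_B": "AKRPGKGGSR"}
observed = fingerprint("MKWVTFISLLR")
scores = {name: match_fraction(observed, fingerprint(seq))
          for name, seq in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A modified residue changes the mass of its peptide, so that peak no longer lines up with the unmodified theoretical mass and the match fraction drops.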
Goals • We’ve measured the expression of genes or proteins using the technologies discussed previously. • What can we do with that information? • Identify significant differences in expression • Identify similar patterns of expression (clustering)
Analysis steps • Data normalization • Statistical Analysis • Cluster Analysis
Data Normalization • Why normalize? • Removes systematic errors • Makes the data easier to analyze statistically
Sources of Error • Measurements always contain errors. • Systematic (oops) • Random (noise!) • Subtracting the background level can remove some systematic error • Using the ratio in two-channel experiments does this • Subtracting the overall average intensity can be used with one-channel data. • Taking averages over replicates of the experiment reduces the random error.
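The three corrections listed above can be sketched as small helper functions. This assumes raw spot intensities are available as plain numbers; the function names, the use of log2 and the example values are illustrative choices rather than a prescribed pipeline.

```python
import math

def log_ratio(red, green, red_background=0.0, green_background=0.0):
    """Two-channel data: background-corrected log2 ratio for one spot."""
    r = red - red_background
    g = green - green_background
    if r <= 0 or g <= 0:
        raise ValueError("background-corrected intensities must be positive")
    return math.log2(r / g)

def center_on_mean(values):
    """One-channel data: subtract the overall average intensity."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def average_replicates(replicates):
    """Average each gene over replicate experiments (rows = replicates)."""
    return [sum(column) / len(column) for column in zip(*replicates)]

ratio = log_ratio(250.0, 150.0, red_background=50.0, green_background=50.0)
centered = center_on_mean([1.0, 2.0, 3.0])
averaged = average_replicates([[1.0, 2.0], [3.0, 4.0]])
print(ratio, centered, averaged)
```

Working with the log of the ratio makes two-fold up and two-fold down symmetric (+1 and -1), which suits the statistical tests applied later.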
Statistical Analysis • Determining what differences in expression are statistically significant • Controlling false positives
When are two measurements significantly different? • We want to say that an expression ratio is significant if it is big enough (>1) or small enough (<1). • A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small. • The significance is related to the area of the overlap of the underlying distributions.
The Z-test • If the data is approximately normal, convert it to a Z-score: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ • X can be the log expression ratio ($\bar{X}$ is its mean over the repeats); $\mu$ is then 0 • $\sigma$ is the sample standard deviation; $n$ is the number of repeats • Under the null hypothesis, the Z-score is distributed N(0,1) (standard normal). • The significance level is the area in the tail(s) of the standard normal distribution.
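A minimal sketch of the Z-test on replicated log ratios, assuming the usual definition Z = (mean - mu) / (s / sqrt(n)) with mu = 0 under the null hypothesis of no change. The replicate values are invented, and the two-tailed area is computed from the standard normal using `math.erfc`.

```python
import math

def z_score(values, mu=0.0):
    """Z = (sample mean - mu) / (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - mu) / math.sqrt(variance / n)

def two_tailed_p(z):
    """Area in both tails of the standard normal beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical log2 expression ratios for one gene over three repeats.
log_ratios = [1.0, 2.0, 3.0]
z = z_score(log_ratios)
p_value = two_tailed_p(z)
print(z, p_value)
```

With only a handful of repeats the sample standard deviation is itself uncertain, so the normal approximation tends to be too optimistic; that is the usual motivation for the t-test.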
The t-test • The t-test makes fewer assumptions about the data than the Z-test • It can be applied to compare two average measurements which can have • Different variances • Different numbers of observations
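For two sets of measurements with different variances and different numbers of observations, one standard choice is Welch's version of the t-test. The sketch below computes only the statistic and the Welch-Satterthwaite degrees of freedom; the sample values are invented, and a p-value would still need the t distribution.

```python
import math

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(a), len(b)
    mean_a = sum(a) / na
    mean_b = sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se2_a = var_a / na  # squared standard error of each mean
    se2_b = var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Hypothetical log expression levels under two conditions.
condition_a = [1.0, 2.0, 3.0]
condition_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(condition_a, condition_b)
print(t_stat, dof)
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.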
Cluster Analysis • Similar expression patterns • Groups of genes/proteins with similar expression profiles • Similar expression sub-patterns • Groups of genes/proteins with similar expression profiles in a subset of conditions
Distance Measures Between Pairs of Points • In order to cluster the points (genes or conditions), we need some concept of which points are “close” to each other. • So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n by m matrix. • We can then compute all the pair-wise distances between rows (or columns).
Standard Distance Measures • Euclidean Distance • Pearson Correlation Coefficient • Mahalanobis Distance
Euclidean Distance • Standard, everyday distance • Treats all dimensions equally • If some genes vary more than others (have higher variance), they influence the distance more.
Mahalanobis Distance • The “normalized” Euclidean distance • Scales each dimension by the variance in that dimension. • This is useful if the genes tend to vary much more in one sample than in others since it reduces the effect of that sample on the distances.
Pearson Correlation Coefficient • Distances are small when two genes have similar patterns of change even if the sizes of the changes are different. • This is accomplished by centering each gene’s expression levels on their mean and scaling by their sample standard deviation across conditions.
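The three measures can be compared side by side in a short sketch. Two simplifications are assumed: the Mahalanobis distance is written in its diagonal form (each dimension divided by its variance, ignoring correlations between dimensions), and the Pearson coefficient r is turned into a distance as 1 - r. The vectors are toy expression profiles.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def scaled_euclidean(x, y, variances):
    """Diagonal Mahalanobis: each squared difference is divided by that dimension's variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

def pearson_distance(x, y):
    """1 - r, so profiles with the same shape are at distance ~0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

gene_a = [1.0, 2.0, 3.0]
gene_b = [10.0, 20.0, 30.0]   # same pattern, ten times larger changes
gene_c = [3.0, 2.0, 1.0]      # opposite pattern

print(euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print(euclidean(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

Here gene_b is far from gene_a in Euclidean terms but at Pearson distance close to 0, while gene_c is close in Euclidean terms but at Pearson distance close to 2, which is the behavior the slides describe.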
Types of Linkage • A. Single Linkage • B. Complete Linkage • C. Centroid Method
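The linkage types differ only in how the distance between two clusters is derived from distances between points. A sketch using the standard definitions (single = closest pair, complete = farthest pair, centroid = distance between cluster means) with Euclidean distance; the two clusters are toy 2-D points and `cluster_distance` is an illustrative name.

```python
import math

def centroid(cluster):
    """Mean point of a cluster of equal-length tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def cluster_distance(cluster_a, cluster_b, linkage="single"):
    pairwise = [math.dist(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "single":
        return min(pairwise)      # closest pair of points
    if linkage == "complete":
        return max(pairwise)      # farthest pair of points
    if linkage == "centroid":
        return math.dist(centroid(cluster_a), centroid(cluster_b))
    raise ValueError(f"unknown linkage: {linkage}")

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 2.0)]
for linkage in ("single", "complete", "centroid"):
    print(linkage, cluster_distance(cluster_a, cluster_b, linkage))
```

Hierarchical clustering repeatedly merges the two clusters with the smallest such distance, so the choice of linkage changes which clusters merge first and therefore the shape of the tree.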
Sample Questions 1. Gene expression analysis • a. What kind of molecules do expression microarrays measure? • b. Expression microarray data is known to be “noisy”. Describe as many ways as you can of reducing this problem. • c. What experimental technique is commonly used to validate the results of expression microarrays? • d. The “Z-test” or “t-test” is usually applied to expression microarray data. Why is this done and what do these tests tell us? • e. Principal component analysis is often applied to microarray data as well. What is its purpose and what can it tell us? • f. Name two types of distance measures that can be used with microarray data for clustering expression profiles.
Overview • Evolution and sequence variation • Phylogenetic trees • The meaning of distance • Evolutionary sequence models • Constructing trees
Rooted and Unrooted Trees • “Leaves” are extant species • Internal nodes are ancestral species • Adding a root gives time a direction