Introduction to Pattern Recognition

Prediction in Bioinformatics • What do we want to predict? • Features from sequence • Data mining • How can we predict? • Homology / Alignment • Pattern Recognition / Statistical Methods / Machine Learning • What is prediction? • Generalization / Overfitting • Preventing overfitting: Homology reduction • How do we measure prediction? • Performance measures • Threshold selection
Henrik Nielsen, Center for Biological Sequence Analysis, Technical University of Denmark

Sequence → structure → function

Prediction from DNA sequence • Protein-coding genes • transcription factor binding sites • transcription start/stop • translation start/stop • splicing: donor/acceptor sites • Non-coding RNA • tRNAs • rRNAs • miRNAs • General features • Structure (curvature/bending) • Binding (histones etc.)

Prediction from amino acid sequence • Folding / structure • Post-Translational Modifications • Attachment: phosphorylation, glycosylation, lipid attachment • Cleavage: signal peptides, propeptides, transit peptides • Sorting: secretion, import into various organelles, insertion into membranes • Interactions • Function • Enzyme activity • Transport • Receptors • Structural components • etc...

Protein sorting in eukaryotes • Proteins belong in different organelles of the cell, and some even have their function outside the cell • In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Data: UniProt annotation of protein sorting Annotations relevant for protein sorting are found in: • the CC (comments) lines • cross-references (DR lines) to GO (Gene Ontology) • the FT (feature table) lines
ID INS_HUMAN Reviewed; 110 AA.
AC P01308;
...
DE Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN Name=INS;
...
CC -!- SUBCELLULAR LOCATION: Secreted.
...
DR GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT SIGNAL 1 24
3 types of non-experimental qualifiers in the CC and FT lines: • Potential: Predicted by sequence analysis methods • Probable: Inconclusive experimental evidence • By similarity: Predicted by alignment to proteins with known location
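The three annotation sources listed above can be pulled out of a flat-file record with a few lines of code. The following is a minimal sketch, assuming the simplified line layout shown in the INS_HUMAN excerpt (two-letter line code, then content); the function names `group_by_line_code` and `sorting_annotation` are hypothetical, and the column layout of current UniProt releases may differ.

```python
from collections import defaultdict

# Lines copied from the INS_HUMAN excerpt on the slide; elided parts are left out.
RECORD = """\
ID INS_HUMAN Reviewed; 110 AA.
AC P01308;
CC -!- SUBCELLULAR LOCATION: Secreted.
DR GO; GO:0005576; C:extracellular region; IC:UniProtKB.
FT SIGNAL 1 24
"""


def group_by_line_code(record_text):
    """Group the content of each line under its two-letter line code."""
    groups = defaultdict(list)
    for line in record_text.splitlines():
        if len(line) < 2:
            continue
        groups[line[:2]].append(line[2:].strip())
    return groups


def sorting_annotation(record_text):
    """Collect sorting-related annotation from the CC, DR and FT lines."""
    groups = group_by_line_code(record_text)

    # CC lines: free-text subcellular location (single-line comments only).
    marker = "SUBCELLULAR LOCATION:"
    locations = [c.split(marker, 1)[1].strip().rstrip(".")
                 for c in groups["CC"] if marker in c]

    # DR lines: GO cellular-component terms carry the "C:" prefix.
    go_components = []
    for c in groups["DR"]:
        fields = [f.strip() for f in c.rstrip(".").split(";")]
        if fields[0] == "GO" and len(fields) >= 3 and fields[2].startswith("C:"):
            go_components.append((fields[1], fields[2][2:]))

    # FT lines: signal peptide start and end positions.
    signals = []
    for c in groups["FT"]:
        parts = c.split()
        if len(parts) >= 3 and parts[0] == "SIGNAL":
            signals.append((int(parts[1]), int(parts[2])))

    return {"locations": locations,
            "go_components": go_components,
            "signal_peptides": signals}


result = sorting_annotation(RECORD)
```

The sketch ignores the non-experimental qualifiers (Potential, Probable, By similarity) and does not join comments that continue over several CC lines, so it would need extending before use on real entries.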

Problems in database parsing Extreme example: A4_HUMAN, Alzheimer disease amyloid protein
CC -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC protein. Note=Cell surface protein that rapidly becomes
CC internalized via clathrin-coated pits. During maturation, the
CC immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC to the Golgi complex where complete maturation occurs (O-
CC glycosylated and sulfated). After alpha-secretase cleavage,
CC soluble APP is released into the extracellular space and the C-
CC terminal is internalized to endosomes and lysosomes. Some APP
CC accumulates in secretory transport vesicles leaving the late Golgi
CC compartment and returns to the cell surface. Gamma-CTF(59) peptide
CC is located to both the cytoplasm and nuclei of neurons. It can be
CC translocated to the nucleus through association with Fe65. Beta-
CC APP42 associates with FRPL1 at the cell surface and the complex is
CC then rapidly internalized. APP sorts to the basolateral surface in
CC epithelial cells. During neuronal differentiation, the Thr-743
CC phosphorylated form is located mainly in growth cones, moderately
CC in neurites and sparingly in the cell body. Casein kinase
CC phosphorylation can occur either at the cell surface or within a
CC post-Golgi compartment.
...
DR GO; GO:0009986; C:cell surface; IDA:UniProtKB.
DR GO; GO:0005576; C:extracellular region; TAS:ProtInc.
DR GO; GO:0005887; C:integral to plasma membrane; TAS:ProtInc.

Prediction methods • Homology / Alignment • Simple pattern recognition • Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence. Pattern: [KRHQSA]-[DENQ]-E-L> • Statistical methods • Weight matrices: calculate amino acid probabilities • Other examples: Regression, variance analysis, clustering • Machine learning • Like statistical methods, but parameters are estimated by iterative training rather than direct calculation • Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
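The PROSITE pattern above is an example of simple pattern recognition that maps directly onto a regular expression. The sketch below is a hypothetical helper, `prosite_to_regex`, that covers only the basic pattern elements (separators, residue sets, wildcards, repeats, terminal anchors); it is not a full PROSITE parser and does not handle anchors placed inside brackets.

```python
import re

ER_TARGET = "[KRHQSA]-[DENQ]-E-L>"  # PS00014 pattern as given on the slide


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handled elements: '-' separators, [..] allowed residues,
    {..} forbidden residues, 'x' wildcards, (n) / (n,m) repeats,
    and the '<' / '>' N- and C-terminal anchors.
    """
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("x", ".")
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def has_er_target(sequence):
    """True if the sequence ends with a match to the ER_TARGET pattern."""
    return re.search(prosite_to_regex(ER_TARGET), sequence) is not None
```

For example, `has_er_target("MKTAYIAKDEL")` is true because the sequence ends in KDEL, while the same four residues in the middle of a sequence do not match, since `>` anchors the pattern to the C-terminus.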

Prediction of subcellular localisation from sequence • Homology: threshold 30%-70% identity • Sorting signals ("zip codes") • N-terminal: secretory (ER) signal peptides, mitochondrial & chloroplast transit peptides. • C-terminal: peroxisomal targeting signal 1, ER-retention signal. • internal: Nuclear localisation signals, nuclear export signals. • Global properties • amino acid composition, aa pair composition • composition in limited regions • predicted structure • physico-chemical parameters • Combined approaches

Signal-based prediction • Signal peptides • von Heijne 1983, 1986 [WM] • SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM] • Mitochondrial & chloroplast transit peptides • Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters] • ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN] • iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters] • Protein Prowler* (Hawkins & Bodén 2006) [NN] * = also includes signal peptides • Nuclear localisation signals • PredictNLS (Cokol et al. 2000) [regex] • NucPred (Heddad et al. 2004) [regex, GA]
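The weight-matrix approach marked [WM] above can be illustrated with a small log-odds matrix. This is a generic sketch, not a reconstruction of any of the cited methods: the aligned toy sites, the uniform background frequency, and the pseudocount of 1 are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical aligned examples of a short signal (toy data, not real sites).
TOY_SITES = ["VLA", "ALA", "VFA", "ASA"]


def build_weight_matrix(aligned, pseudocount=1.0):
    """Return one {amino acid: log-odds weight} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    matrix = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        weights = {}
        for aa in AMINO_ACIDS:
            prob = (column.count(aa) + pseudocount) / total
            weights[aa] = math.log2(prob / background)
        matrix.append(weights)
    return matrix


def score(matrix, window):
    """Sum the position-specific weights over a window of matching width."""
    return sum(weights[aa] for weights, aa in zip(matrix, window))


def best_window(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    return max((score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


matrix = build_weight_matrix(TOY_SITES)
```

All parameters are calculated directly from the counts in the alignment, which is the distinction the slides draw between statistical methods and iteratively trained machine learning methods.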

Composition-based prediction • Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics] • ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance] • Chou and Elrod 1998 [12 categories; covariant discriminant] • NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN] • SubLoc (Hua and Sun 2001) [4 categories; SVM] • PLOC (Park and Kanehisa 2003) [12 categories; SVM] • LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles] • BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles] Pro: • does not require knowledge of signals • works even if N-terminus is wrong Con: • cannot identify isoform differences
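The input to the composition-based methods above is a fixed-length feature vector computed from the whole sequence. A minimal sketch of the two simplest features, amino acid composition and amino acid pair composition, could look as follows; the function names are hypothetical and non-standard residue letters are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the 20 amino acid frequencies in the order of AMINO_ACIDS."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_composition(sequence):
    """Return frequencies of the adjacent residue pairs that occur."""
    seq = sequence.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                    if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS)
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}
```

Because the vector summarizes the whole sequence, a few wrong residues at the N-terminus change it only slightly, which matches the "Pro" point above; for the same reason, isoforms that differ only in a short region get nearly identical vectors.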

A simple statistical method: Linear regression Observations (training data): a set of x values (input) and y values (output). Model: y = ax + b (2 parameters, which are estimated from the training data). Prediction: Use the model to calculate a y value for a new x value. Note: the model does not fit the observations exactly. Can we do better than this?
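The two parameters of the linear model can be calculated directly from the training data by ordinary least squares. The sketch below uses the standard closed-form solution; the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate a and b in y = ax + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, x):
    """Use the fitted model to calculate a y value for a new x value."""
    return a * x + b


# Made-up training data (illustration only).
train_x = [0, 1, 2, 3, 4]
train_y = [1.1, 2.9, 5.2, 7.1, 8.8]

a, b = fit_line(train_x, train_y)
y_new = predict(a, b, 5)
```

The fitted line does not pass through the training points exactly, which is the observation that motivates the question of whether a more flexible model would do better.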

Overfitting y = ax + b 2 parameter model Good description, poor fit y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g 7 parameter model Poor description, good fit Note: It is not interesting that a model can fit its observations (training data) exactly. To function as a prediction method, a model must be able to generalize, i.e. produce sensible output on new data.
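The contrast between the 2-parameter and the 7-parameter model can be reproduced on synthetic data. In the sketch below, seven training points are generated from a straight line plus fixed, made-up noise. With seven points, the 7-parameter polynomial is the unique degree-6 polynomial through all of them, so it is evaluated here in Lagrange form rather than fitted numerically. The data, noise values, and variable names are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimate of a and b in y = ax + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial through all (xs, ys) points at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        term = yj
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total


def mse(predicted, observed):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# Synthetic data: true relation y = 2x + 1, plus fixed made-up noise.
TRAIN_X = [0, 1, 2, 3, 4, 5, 6]
NOISE = [0.5, -0.4, 0.3, -0.6, 0.4, -0.3, 0.5]
TRAIN_Y = [2 * x + 1 + e for x, e in zip(TRAIN_X, NOISE)]
TEST_X = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]      # new data, not used for fitting
TEST_Y = [2 * x + 1 for x in TEST_X]

a, b = fit_line(TRAIN_X, TRAIN_Y)
line_train = mse([a * x + b for x in TRAIN_X], TRAIN_Y)
line_test = mse([a * x + b for x in TEST_X], TEST_Y)
poly_train = mse([lagrange_eval(TRAIN_X, TRAIN_Y, x) for x in TRAIN_X], TRAIN_Y)
poly_test = mse([lagrange_eval(TRAIN_X, TRAIN_Y, x) for x in TEST_X], TEST_Y)
```

On this data set the polynomial reproduces the training points exactly (`poly_train` is zero), yet its error on the new points is far larger than that of the straight line, because it has fitted the noise rather than the underlying relation.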

A classification problem • How complex a model should we choose? This depends on: • The real complexity of the problem • The size of the training data set • The amount of noise in the data set

How to estimate parameters for prediction?

Model selection Linear Regression Quadratic Regression Join-the-dots

The test set method

Cross Validation

Which kind of Cross Validation? Note: Leave-one-out is also known as jack-knife
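Both k-fold cross validation and leave-one-out follow the same loop: hold out one fold, estimate the parameters on the rest, and measure the error on the held-out examples. The following is a minimal sketch with a straight-line model as the example; the function names are hypothetical, the folds are assigned without shuffling, and leave-one-out is simply the case where k equals the number of examples.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, k, fit, predict):
    """Return the mean squared error over all held-out examples."""
    squared_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)          # parameters from training part only
        for i in test_idx:
            squared_errors.append((predict(model, xs[i]) - ys[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)


def fit_line(xs, ys):
    """Least-squares estimate of (a, b) in y = ax + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def predict_line(model, x):
    a, b = model
    return a * x + b


# Made-up data (illustration only).
xs = [0, 1, 2, 3, 4, 5]
ys = [1.2, 2.8, 5.1, 7.0, 9.1, 10.9]
error_3fold = cross_validate(xs, ys, 3, fit_line, predict_line)
error_loo = cross_validate(xs, ys, len(xs), fit_line, predict_line)
```

Every example is used for testing exactly once, so the error estimate uses all the data without ever testing a model on an example it was trained on.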

Problem: sequences are related • If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Solution: Homology reduction (the Hobohm algorithm) • Calculate all pairwise similarities in the data set • Define a threshold for being "neighbours" (too closely related) • Calculate the number of neighbours for each example, and remove the example with the most neighbours • Repeat until there are no examples with neighbours left Alternative: Homology partitioning • keep all examples, but cluster them so that no neighbours end up in the same fold • Should be combined with weighting
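The reduction steps above can be sketched in a few lines. This is a simplified illustration, not a production implementation: similarity is taken as the fraction of identical positions between equal-length peptides (a real data set would need pairwise alignment), and the threshold of 0.8 is an arbitrary assumption. The peptides are the ten example sequences from the previous slide.

```python
PEPTIDES = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def homology_reduce(sequences, threshold):
    """Remove the example with most neighbours until no neighbours remain."""
    n = len(sequences)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if identity(sequences[i], sequences[j]) > threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    while neighbours:
        worst = max(neighbours, key=lambda i: len(neighbours[i]))
        if not neighbours[worst]:
            break                      # nobody has neighbours left
        for j in neighbours.pop(worst):
            neighbours[j].discard(worst)

    return [sequences[i] for i in sorted(neighbours)]


kept = homology_reduce(PEPTIDES, threshold=0.8)
```

With this threshold the five ALAKAAAA peptides, which differ only in the last position, are reduced to a single representative, while the five unrelated peptides are all kept.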

Defining a threshold for homology reduction First approach: two sequences are too closely related if the prediction problem can be solved by alignment. The Sander/Schneider curve: For protein structure prediction, 70% identical classification of secondary structure means prediction by alignment is possible. This corresponds to 25% identical amino acids in a local alignment longer than 80 positions.

Defining a threshold for homology reduction Second approach: two sequences are too closely related, if their homology is statistically significant The Pedersen / Nielsen / Wernersson curve: Use the extreme value distribution to define the BLAST score at which the similarity is stronger than random
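The idea of a score threshold derived from the extreme value distribution can be sketched with the general Karlin-Altschul statistics used by BLAST, where the expected number of chance hits with score at least S is E = K·m·n·exp(-λS). This is the generic form and not necessarily the exact curve referred to above; the values of λ and K in the usage example are illustrative placeholders and should be replaced by the ones reported by the actual BLAST run.

```python
import math


def expected_hits(score, query_len, target_len, lam, k):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * target_len * math.exp(-lam * score)


def p_value(score, query_len, target_len, lam, k):
    """Probability of at least one chance alignment scoring >= `score`."""
    return -math.expm1(-expected_hits(score, query_len, target_len, lam, k))


def score_threshold(e_cutoff, query_len, target_len, lam, k):
    """Raw score at which the expected number of chance hits equals e_cutoff."""
    return math.log(k * query_len * target_len / e_cutoff) / lam


# Illustrative placeholder parameters (assumptions, not taken from the slides).
LAMBDA, K = 0.267, 0.041
threshold = score_threshold(0.01, query_len=100, target_len=1000,
                            lam=LAMBDA, k=K)
```

Two sequences whose alignment score exceeds `threshold` would then be treated as neighbours in the homology reduction, because a similarity that strong is unlikely to arise by chance.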
