T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models) Morten Nielsen, CBS, BioCentrum, DTU

Processing of intracellular proteins MHC binding http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

What makes a peptide a potential and effective epitope? • Part of a pathogen protein • Successful processing • Proteasome cleavage • TAP binding • Binds to MHC molecule • Protein function • Early in replication • Sequence conservation in evolution SARS virus

From proteins to immunogens: 20% processed, 0.5% bind MHC, 50% CTL response (Lauemøller et al., 2000) => 1/2000 peptides are immunogenic
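
The 1/2000 figure follows from multiplying the three percentages on the slide. A minimal check, assuming the three steps act as independent filters (the slide does not state this explicitly):

```python
from fractions import Fraction

# Fractions quoted on the slide (Lauemøller et al., 2000).
processed = Fraction(20, 100)     # 20% of peptides are processed
bind_mhc = Fraction(5, 1000)      # 0.5% bind MHC
ctl_response = Fraction(50, 100)  # 50% give a CTL response

# Assumption: the steps are independent, so the fractions multiply.
immunogenic = processed * bind_mhc * ctl_response
print(immunogenic)  # 1/2000
```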

Location of class I epitopes GP1200 protein Structure (1GM9)

Anchor positions MHC class I with peptide http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

Prediction of HLA binding specificity • Simple Motifs • Allowed/non allowed amino acids • Extended motifs • Amino acid preferences (SYFPEITHI) • Anchor/Preferred/other amino acids • Hidden Markov models • Peptide statistics from sequence alignment • Neural networks • Can take sequence correlations into account

Where to get data? • SYFPEITHI database • 3500 peptides known to bind to MHC class I and II • Only published data • MHCpep • 13000 peptides known to bind to MHC class I and II • Published data and direct submission • No update since 1998 • Binding affinity assays • Quantitative data. How strongly does a peptide bind to the MHC molecule? • Costly, and people do not publish negative results.

Databases and web resources • HLA Informatics Group, ANRI (HLA sequence database) • IMGT/HLA Database (HLA sequence database) • SYFPEITHI (Database of HLA Class I and II peptides) • MHCPEP (Database of HLA Class I and II peptides) • BIMAS (HLA Class I predictor) • SYFPEITHI (HLA Class I predictor) • NetMHC (HLA Class I prediction)

Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF 
SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence logo MHC class I HLA-A0201 • Height of a column equals log 20 + Σ p log p (sum over the 20 amino acids) • Relative height of a letter is p • Highly useful tool to visualize sequence motifs High information positions http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
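
The column height in a sequence logo can be sketched as follows. This is an illustration only: it assumes base-2 logarithms (heights in bits) and raw frequencies without pseudo counts or small-sample correction, and the function names are made up for this example.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Height of one logo column: log2(20) + sum over amino acids of p*log2(p)."""
    n = len(column)
    entropy_term = sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) + entropy_term

def letter_heights(column):
    """Each letter gets the fraction p of the column height."""
    info = column_information(column)
    n = len(column)
    return {aa: info * c / n for aa, c in Counter(column).items()}

# A fully conserved column reaches the maximum height, log2(20) bits.
print(column_information("LLLLLLLL"))
# A column with all 20 amino acids equally frequent carries no information.
print(column_information("ARNDCQEGHILKMFPSTWYV"))
```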

Characterizing a binding motif • 10 peptides known to bind MHC: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV • What can we learn? • A at P1 favors binding? • I is not allowed at P9? • K at P4 favors binding?

Description of binding motif • Example (position P1): PA = 6/10, PG = 2/10, PT = PK = 1/10, PC = PD = … = PV = 0 • Problems: few data; data redundancy/duplication • Sequence information: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
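
The example frequencies can be reproduced by raw counting over the ten peptides. A minimal sketch (the function name is chosen for this example):

```python
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(peptides, position):
    """Raw amino acid frequencies at a 1-based position (P1..P9)."""
    column = [p[position - 1] for p in peptides]
    return {aa: n / len(column) for aa, n in Counter(column).items()}

p1 = position_frequencies(PEPTIDES, 1)
print(p1)  # {'A': 0.6, 'G': 0.2, 'T': 0.1, 'K': 0.1}

# I never occurs at P9 in this small sample, so its raw frequency is 0.
print(position_frequencies(PEPTIDES, 9).get("I", 0.0))  # 0.0
```

Amino acids that never occur get frequency 0, which is exactly the "few data" problem: absence from ten peptides does not prove that an amino acid is forbidden.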

Sequence information: raw sequence counting • ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Pseudo-count and sequence weighting • Poor or biased sampling of sequence space • I is not found at position P9. Does this mean that I is forbidden? • No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9 • Similar sequences (ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV): weight 1/5 each • Remaining peptides: GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

The Blosum matrix

Sequence weighting • ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
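
One way to down-weight redundant peptides is to cluster similar sequences and give each member a weight of 1/(cluster size). The sketch below is an assumption about how this could be done: the single-linkage clustering and the 0.6 identity threshold are illustrative choices, not necessarily the scheme used in the lecture, and Blosum-based pseudo counts are not covered here.

```python
from collections import defaultdict

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(peptides, threshold=0.6):
    """Single-linkage clustering; weight = 1 / size of the peptide's cluster."""
    cluster_of = list(range(len(peptides)))
    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if identity(peptides[i], peptides[j]) >= threshold:
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    sizes = defaultdict(int)
    for c in cluster_of:
        sizes[c] += 1
    return [1 / sizes[c] for c in cluster_of]

def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    totals = defaultdict(float)
    for pep, w in zip(peptides, weights):
        totals[pep[position - 1]] += w
    total = sum(totals.values())
    return {aa: t / total for aa, t in totals.items()}

weights = cluster_weights(PEPTIDES)
print(weights)  # the five ALAKAAAA* peptides get 0.2 each, the rest 1.0
print(weighted_frequencies(PEPTIDES, weights, 1))
```

With these weights the frequency of A at P1 drops from 6/10 to about 1/3, because the five near-identical peptides together count as a single observation.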

Pseudo counts • Sequence weighting and pseudo count • Prediction accuracy 0.60 • Motif found on all data (485) • Prediction accuracy 0.79

Weight matrices • Estimate amino acid frequencies from alignment • Now a weight matrix is given as Wij = log(pij/qj) • Here i is a position in the motif, and j an amino acid. qj is the background frequency for amino acid j. • W is an L x 20 matrix, where L is the motif length • Score a sequence against the weight matrix by looking up and adding L values from the matrix
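
A log-odds matrix of this form can be sketched in a few lines. Several choices here are assumptions made for the example: a flat background of 1/20, a simple add-one pseudo count standing in for Blosum-based pseudo counts (so that log(0) never occurs), and the natural logarithm.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def weight_matrix(peptides, pseudocount=1.0, background=None):
    """Return W as a list of L dicts with W[i][aa] = log(p_i(aa) / q(aa))."""
    if background is None:
        # Assumption: flat background frequencies.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    matrix = []
    for i in range(len(peptides[0])):
        column = [p[i] for p in peptides]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            p = (column.count(aa) + pseudocount) / total
            row[aa] = math.log(p / background[aa])
        matrix.append(row)
    return matrix

def score(peptide, matrix):
    """Sum of the L matrix entries selected by the peptide."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

W = weight_matrix(PEPTIDES)
print(len(W), len(W[0]))  # 9 20
print(score("ALAKAAAAV", W), score("CCCCCCCCC", W))
```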

Scoring sequences to a weight matrix
      A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
1   0.6   0.4  -3.5  -2.4  -0.4  -1.9  -2.7   0.3  -1.1   1.0   0.3   0.0   1.4   1.2  -2.7   1.4  -1.2  -2.0   1.1   0.7
2  -1.6  -6.6  -6.5  -5.4  -2.5  -4.0  -4.7  -3.7  -6.3   1.0   5.1  -3.7   3.1  -4.2  -4.3  -4.2  -0.2  -5.9  -3.8   0.4
3   0.2  -1.3   0.1   1.5   0.0  -1.8  -3.3   0.4   0.5  -1.0   0.3  -2.5   1.2   1.0  -0.1  -0.3  -0.5   3.4   1.6   0.0
4  -0.1  -0.1  -2.0   2.0  -1.6   0.5   0.8   2.0  -3.3   0.1  -1.7  -1.0  -2.2  -1.6   1.7  -0.6  -0.2   1.3  -6.8  -0.7
5  -1.6  -0.1   0.1  -2.2  -1.2   0.4  -0.5   1.9   1.2  -2.2  -0.5  -1.3  -2.2   1.7   1.2  -2.5  -0.1   1.7   1.5   1.0
6  -0.7  -1.4  -1.0  -2.3   1.1  -1.3  -1.4  -0.2  -1.0   1.8   0.8  -1.9   0.2   1.0  -0.4  -0.6   0.4  -0.5  -0.0   2.1
7   1.1  -3.8  -0.2  -1.3   1.3  -0.3  -1.3  -1.4   2.1   0.6   0.7  -5.0   1.1   0.9   1.3  -0.5  -0.9   2.9  -0.4   0.5
8  -2.2   1.0  -0.8  -2.9  -1.4   0.4   0.1  -0.4   0.2  -0.0   1.1  -0.5  -0.5   0.7  -0.3   0.8   0.8  -0.7   1.3  -1.1
9  -0.2  -3.5  -6.1  -4.5   0.7  -0.8  -2.5  -4.0  -2.6   0.9   2.8  -3.0  -1.8  -1.4  -6.2  -1.9  -1.6  -4.9  -1.6   4.5
ILYQVPFSV  15.0
ALPYWNFAT  -3.4
MTAQWWLDA   0.8
Which peptide is most likely to bind? Which peptide second?
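
The three peptide scores can be recomputed by looking up one value per position and adding them. The matrix values below are copied from the slide; the variable and function names are chosen for this example.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows are positions 1..9; columns follow the order in AMINO_ACIDS.
W = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0, 0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0, 5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0, 0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1, -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2, -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8, 0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6, 0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0, 1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9, 2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]

def score(peptide):
    """Add the matrix entry for each (position, amino acid) pair."""
    return sum(W[i][AMINO_ACIDS.index(aa)] for i, aa in enumerate(peptide))

for peptide in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(peptide, round(score(peptide), 1))
```

ILYQVPFSV and ALPYWNFAT reproduce the slide's 15.0 and -3.4. MTAQWWLDA comes out at about 0.7 rather than 0.8, presumably because the displayed matrix entries are rounded to one decimal. The ranking is unaffected: ILYQVPFSV first, MTAQWWLDA second.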

How to predict • The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations). • Two adjacent amino acids may for example compete for the space in a pocket in the MHC molecule. • Artificial neural networks (ANN) are ideally suited to take such correlations into account

Neural networks • Neural networks can learn higher order correlations! • What does this mean? • AA => 0, AC => 1, CA => 1, CC => 0 • No linear function can learn this pattern
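
The pattern is the XOR function. A single linear threshold unit w1*x1 + w2*x2 + b cannot produce it: AA => 0 requires b <= 0, AC and CA => 1 require w1 + b > 0 and w2 + b > 0, and adding those two gives w1 + w2 + b > -b >= 0, which contradicts CC => 0. One hidden layer is enough, as the hand-set network below shows. The encoding A = 0, C = 1 and all weights are chosen by hand for illustration; nothing here is trained.

```python
def step(x):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0

ENCODING = {"A": 0, "C": 1}  # assumption: arbitrary binary encoding

def xor_network(first, second):
    x1, x2 = ENCODING[first], ENCODING[second]
    h_or = step(x1 + x2 - 0.5)   # hidden unit 1: fires if at least one input is C
    h_and = step(x1 + x2 - 1.5)  # hidden unit 2: fires only if both inputs are C
    return step(h_or - h_and - 0.5)  # output: OR but not AND

for pair in ("AA", "AC", "CA", "CC"):
    print(pair, "=>", xor_network(pair[0], pair[1]))
```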

Neural networks [Figure: network diagram with weights w11, w12, w21, w22, v1, v2]

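Evaluation of prediction accuracy • TP = true positives, FP = false positives, AP = actual positives, AN = actual negatives • True positive proportion = TP/AP • False positive proportion = FP/AN • Pearson correlation • ROC curves (examples shown: Aroc = 0.8 and Aroc = 0.5)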
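
These measures can be sketched with the standard library only. The scores and labels below are made-up toy values, not data from the lecture, and the area under the ROC curve is computed through its rank interpretation (the probability that a random binder scores higher than a random non-binder) rather than by integrating the curve.

```python
import math

def proportions(scores, labels, threshold):
    """Return (TP/AP, FP/AN) when everything scoring >= threshold is predicted positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    ap = sum(labels)
    an = len(labels) - ap
    return tp / ap, fp / an

def aroc(scores, labels):
    """Area under the ROC curve; ties between a positive and a negative count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predictions (scores) and measured binder labels (1 = binder).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(proportions(scores, labels, 0.5))
print(aroc(scores, labels))
```

An Aroc of 0.5 corresponds to random ranking and 1.0 to perfect separation of binders from non-binders.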

Epitope predictions. Sequence motif and HMMs Sequence motif HMM cc: 0.80 Aroc: 0.95 cc: 0.76 Aroc: 0.92

Epitope prediction. Neural Networks cc: 0.91 Aroc: 0.98

Evaluation of prediction accuracy

Hepatitis C virus. Epitope predictions

Proteasomal cleavage • NetChop (http://www.cbs.dtu.dk/services/NetChop-3.0/) • Epitopes have strong C terminal cleavage • Epitopes can have strong internal cleavage sites • Selection strategy • High binding peptides • High cleavage probability at C terminal NMVPFFPPV ..S.....S
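
The selection strategy can be sketched as a simple filter. The code assumes that an "S" in the annotation string marks a predicted cleavage site after that residue and "." marks no cleavage; the function names and the boolean `is_high_binder` flag are hypothetical and do not come from any NetChop or NetMHC interface.

```python
def cleavage_sites(annotation, marker="S"):
    """1-based positions marked as predicted cleavage sites."""
    return [i + 1 for i, ch in enumerate(annotation) if ch == marker]

def select(peptide, annotation, is_high_binder):
    """Keep high binders with a predicted cleavage site at the C terminal.

    Internal sites do not disqualify a peptide, since epitopes can
    have strong internal cleavage sites.
    """
    return is_high_binder and len(peptide) in cleavage_sites(annotation)

print(cleavage_sites("..S.....S"))                          # [3, 9]
print(select("NMVPFFPPV", "..S.....S", is_high_binder=True))  # True
```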

What next? • 29 March. Introduction to hidden Markov models and weight matrices • 5 April. Introduction to neural networks • 12 April. Introduction to the project • 10 May. Hand in the project
