Rational HIV vaccine design Nebojsa Jojic and David Heckerman Machine Learning and Applied Statistics Microsoft Research
Collaborators • Vladimir Jojic, Microsoft/U Toronto • Carl Kadie, Microsoft • Jennifer Listgarten, Microsoft/U Toronto • Chris Meek, Microsoft • Brendan Frey, Microsoft/U Toronto • Bette Korber, Los Alamos National Laboratory • Christian Brander, Harvard/MGH • Nicole Frahm, Harvard/MGH • Simon Mallal, Royal Perth Hospital • Jim Mullins, University of Washington
Epitome as a model of diversity in natural signals A set of image patches Input image Epitome
Using the epitome for recognition The smiling point Epitome of 295 face images Images with the highest total posterior at the “smiling point” Images with the lowest total posterior at the “smiling point”
Epitomes may also allow some variability Epitome e: Mean Variances
Epitomes can be computed for ordered datasets (e.g., 1-D arrays, or 2-D, 3-D or n-D matrices) with arbitrary measurement types: • Intensities • R, G, B values • Gradient values • Wavelet coefficients • Spectral energies • Nucleotide or amino acid content … • We even played with text and MIDI files
AIDS 101 • AIDS (acquired immune deficiency syndrome) was first described in the early 1980s • HIV (human immunodeficiency virus), which causes AIDS, was isolated in 1983; 40 million people are now infected • HIV is an RNA virus: protein coat + copying proteins + regulatory proteins + RNA • Copying proteins + RNA enter the cell • RNA is reverse transcribed to DNA • DNA inserts into the cell's DNA and is transcribed and translated into more HIV protein • Infected cell assembles more copies of HIV • Cell bursts, releasing many new copies of HIV
The map of HIV From http://www.mcld.co.uk/hiv (A simplified version of the LANL detailed map)
HIV diversity (LANL database) HIV is encoded in an RNA sequence of about 10000 nucleotides, divided into several genes. NEF is one of the shorter and moderately variable ones. [Figure labels: the NEF length in the strain; the 73 nucleotides of the NEF gene] Note the insertions, deletions and mutations. A triplet of nucleotides encodes one amino acid. A change in a single amino acid may lower the cellular immunity to the virus in one patient and increase it in another.
MHC-I Molecule Epitope
Epitopes in variable regions Colors signify different human immune types
Immunology 101 “Train and kill” mechanism • Immune system sees a virus and trains “killer cells” (T cells) to kill any cell showing a pattern from the virus • Patterns are short peptides (8-11 amino acids long) called epitopes: 3D structure of an epitope as presented by an infected cell to the killer cells SLYNTVATL Amino-acid pattern (peptide)
But, HIV is variable… The train-and-kill mechanism doesn’t work as well for HIV – the virus adapts through rapid mutation. As soon as the killer cells get the upper hand, the epitopes start changing. Possible solution: • Find epitopes that occur frequently across a *population* of HIV viruses • Compact these epitopes into a small vaccine (small is good: long vaccines are hard to deliver, and less likely to be effective)
Colors: Different patients Sequence data VLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF LSGGKLDRWEKIRLR KKKYQLKHIVW KKKYRLKHIVW Epitome
Machine Learning Approach to Vaccine Design • Use sample HIV strains from multiple patients • Build models that compactly encode as many epitopes (or likely epitopes) as possible • Learning techniques: • Myopic • Split and merge • Expectation Maximization
Coverage of all 10aa blocks from 245 Gag proteins (Perth data)
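A minimal Python sketch of what "coverage of all 10aa blocks" could mean: the fraction of length-10 windows in the patient strains that also occur somewhere in a candidate epitome. The function names `blocks` and `coverage` are hypothetical, and counting every window occurrence equally is an assumption; the slides do not say how coverage was weighted.

```python
def blocks(seq, k):
    """Return the set of all length-k substrings (blocks) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def coverage(epitome, strains, k=10):
    """Fraction of length-k windows in the strains that appear in the epitome.

    Every window occurrence counts once (an assumption, not the slides' rule).
    """
    available = blocks(epitome, k)
    total = 0
    covered = 0
    for strain in strains:
        for i in range(len(strain) - k + 1):
            total += 1
            if strain[i:i + k] in available:
                covered += 1
    return covered / total if total else 0.0


# Toy example with k=3 so the arithmetic is easy to follow.
toy_strains = ["ABCDEF", "BCDEFG"]
full = coverage("ABCDEFG", toy_strains, k=3)   # every window is present
partial = coverage("ABCD", toy_strains, k=3)   # only ABC and BCD are present
```

A longer epitome can only raise this number, which is why the design problem trades coverage against length.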
A Vaccine for HIV/AIDS • Typical vaccines are near copies of the virus that is being vaccinated against • HIV mutates at a high rate – can’t use traditional techniques • Machine learning allows us to build a compact “pseudo-virus” that covers the diversity of HIV (or rather a pseudo-protein that covers the diversity of a particular HIV protein) • This pseudo-protein, which we call the epitome, is much shorter than the concatenation of all strains
Expected (weighted) coverage optimization We have algorithms to predict this! p(T), p(S): Cleavage, MHC binding, transport P(XS|ET): T-cell cross-reactivity We have some idea about this, too.
MHC-I Molecule Peptide Finding Epitopes and their MHC-I counterparts
Important to find both epitopes and the MHC-I types that can present them • Each patient has six MHC-I types (2 As, 2 Bs, 2 Cs) • Most epitopes can be presented by only a few MHC-I molecules • Different populations (China, India, South Africa, etc.) have different MHC-I frequencies
Finding Epitopes and their MHC-I counterparts Existing methods: • Trial and error in the wet lab • Machine learning Our methods: • More machine learning • Machine learning + physics • Machine learning + wet lab
Machine Learning • Examples where the peptide is an epitope for the MHC-I type • Examples where the peptide is NOT an epitope for the MHC-I type • Classifier: • Logistic regression • SVM • Neural net • Etc.
Issues (from experience) • Amount of data • Feature extraction • Algorithm choice
Simple feature extraction SLYNTVATL, A02 • Amino acid at position 1=S • Amino acid at position 2=L • Amino acid at position 3=Y • … • Amino acid at position 9=L • MHC-I type=A02
Better feature extraction SLYNTVATL, A02 • Previously mentioned features • Amino acid at position 1 = S & MHC-I = A02 • Amino acid at position 2 = L & MHC-I = A02 • … • Amino acid at position 9 = L & MHC-I = A02
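The two feature sets above can be sketched in Python as sets of string-valued indicator features. The function names and the `pos1=S` string encoding are hypothetical choices for illustration; the slides only list the features themselves.

```python
def simple_features(peptide, mhc):
    """One feature per peptide position, plus one for the MHC-I type."""
    feats = {f"pos{i}={aa}" for i, aa in enumerate(peptide, start=1)}
    feats.add(f"mhc={mhc}")
    return feats


def better_features(peptide, mhc):
    """Simple features plus position-and-MHC conjunction features."""
    feats = simple_features(peptide, mhc)
    feats |= {f"pos{i}={aa}&mhc={mhc}" for i, aa in enumerate(peptide, start=1)}
    return feats


simple = simple_features("SLYNTVATL", "A02")
better = better_features("SLYNTVATL", "A02")
```

The conjunction features let a linear classifier such as logistic regression learn that a given amino acid at a given position matters for one MHC-I type but not for another.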
Machine learning + physics with David Baker and Ora Furman, UW
Machine learning + wet lab With Christian Brander & Nicole Frahm, Harvard; Jennifer Listgarten, U. Toronto • If a patient’s blood reacts with a peptide, then it is very likely that some subsequence of the peptide is an epitope for at least one of the patient’s six MHC-I types • From observations for many patients, tease out the responsible MHC-I type(s) • Find the subsequence in the lab peptide, e.g., NYTSLIYTLIEESQNQQEK … Pt1 Pt2 Pt3 Pt4 PtN
What makes a good solution for a peptide? • The fewer the responsible MHC-I types the better • An MHC-I type gets “points” for appearing in reacting patients and loses “points” for appearing in non-reacting patients
Not easy… • Lots of noise: p(react | is epitope)~0.25 • “Leaks”: may see a reaction even when the peptide is not an epitope for any MHC-I type of the patient • “Explaining away”: When a patient has two MHC-I types that can be responsible for a reaction, those two get less credit • Don’t actually know • p(react | is epitope) • Leak probabilities • Example solution: A B C reacting patients non-reacting patients A B C
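One way to read the noise and leak terms together is as a noisy-OR: each responsible MHC-I type a patient carries triggers a reaction independently with probability p, and a leak triggers one with probability p0. This is a hedged sketch of that reading, not a confirmed statement of the authors' model; `p_react` is a hypothetical name and the value 0.05 for p0 is purely illustrative.

```python
def p_react(patient_types, responsible, p, p0):
    """Noisy-OR probability that a patient reacts to the peptide.

    patient_types: the patient's MHC-I types (up to six)
    responsible:   MHC-I types assumed able to present an epitope in the peptide
    p:             p(react | is epitope) for a single responsible type
    p0:            leak probability (reaction with no responsible type)
    """
    n = len(set(patient_types) & set(responsible))
    return 1.0 - (1.0 - p0) * (1.0 - p) ** n


# p = 0.25 as on the slide; p0 = 0.05 is an illustrative assumption.
leak_only = p_react(["A01", "B07"], {"A02"}, 0.25, 0.05)
one_type = p_react(["A02", "B07"], {"A02"}, 0.25, 0.05)
two_types = p_react(["A02", "B07"], {"A02", "B07"}, 0.25, 0.05)
```

With these numbers a patient carrying one responsible type still fails to react most of the time, which is why many patients are needed to tease out the responsible types.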
Graphical model for a peptide A01 A02 A03 B01 B02 B03 C01 C02 C03 A02c A01c A03c A03c … B01c B02c B02c B03c C01c C01c OR OR C03c C02c pt1 reacts pt2 reacts leak leak p0 p0
(Directed Acyclic) Graphical Models Variables: Fuel (F), Battery (B), TurnOver (T), Gauge (G), Start (S) p(F,B,T,G,S) = p(F) p(B|F) p(T|F,B) p(G|F,B,T) p(S|F,B,T,G) = ∏_vars p(var | parents)
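A small Python sketch of the product-of-conditionals form of the joint distribution. The arrows of the diagram did not survive in the text, so the parent sets in `PARENTS` and every number in `P_TRUE` are assumptions made up for illustration; only the five variable names come from the slide.

```python
from itertools import product

# Hypothetical parent sets (assumed; the slide's arrows are not recoverable).
PARENTS = {"F": (), "B": (), "T": ("B",), "G": ("F",), "S": ("F", "T")}

# Illustrative values of p(var = True | parent values); all numbers are made up.
P_TRUE = {
    "F": {(): 0.9},
    "B": {(): 0.95},
    "T": {(True,): 0.97, (False,): 0.0},
    "G": {(True,): 0.99, (False,): 0.05},
    "S": {(True, True): 0.99, (True, False): 0.0,
          (False, True): 0.0, (False, False): 0.0},
}


def joint(assign):
    """p(F,B,T,G,S) as the product over variables of p(var | parents)."""
    prob = 1.0
    for var, parents in PARENTS.items():
        p_true = P_TRUE[var][tuple(assign[q] for q in parents)]
        prob *= p_true if assign[var] else 1.0 - p_true
    return prob


# Summing the joint over all 2^5 assignments should give 1.
total = sum(
    joint(dict(zip(PARENTS, values)))
    for values in product([True, False], repeat=len(PARENTS))
)
```

The point of the factorization is economy: each table is indexed only by a variable's parents rather than by all earlier variables in the chain rule.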
Graphical model for a peptide A01 A02 A03 B01 B02 B03 C01 C02 C03 … … …
Graphical model for a peptide A01 A02 A03 B01 B02 B03 C01 C02 C03 p A02c A03c B01c B02c C01c C03c
Graphical model for a peptide A01 A02 A03 B01 B02 B03 C01 C02 C03 p p p A02c p p A03c p B01c B02c C01c C03c
Graphical model for a peptide A01 A02 A03 B01 B02 B03 C01 C02 C03 A02c A03c B01c B02c C01c OR C03c pt1 reacts leak p0
Graphical model for a peptide A01 A02 A03 B01 B02 B03 C01 C02 C03 A02c A01c A03c A03c … B01c B02c B02c B03c C01c C01c OR OR C03c C02c pt1 reacts pt2 reacts leak leak p0 p0
Solving the model • Principle: find the p, p0 and MHC-I assignments that maximize the likelihood of the data • Algorithm: guess p, p0, then iterate: • Use a relaxation method to find the maximum-likelihood MHC-I assignments • Use gradient ascent (equivalently, gradient descent on the negative log-likelihood) to find the values of p, p0 that maximize the likelihood
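The alternating scheme can be sketched in Python under simplifying assumptions. This sketch swaps in simpler techniques than the ones named on the slide: greedy single-type flips stand in for the relaxation method, and a coarse grid search stands in for the gradient step. The noisy-OR likelihood, the names `log_lik` and `fit`, and the toy patient data are all assumptions for illustration.

```python
import math
from itertools import product


def log_lik(patients, responsible, p, p0):
    """Log-likelihood of (mhc_types, reacted) pairs under a noisy-OR with leak."""
    ll = 0.0
    for types, reacted in patients:
        n = len(set(types) & responsible)
        pr = 1.0 - (1.0 - p0) * (1.0 - p) ** n
        ll += math.log(pr if reacted else 1.0 - pr)
    return ll


def fit(patients, n_rounds=10):
    """Alternate between updating MHC-I assignments and updating (p, p0)."""
    all_types = sorted({t for types, _ in patients for t in types})
    grid = [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95, avoids log(0)
    p, p0 = 0.25, 0.05                     # initial guess
    responsible = set()
    for _ in range(n_rounds):
        # Step 1: flip one MHC-I type at a time while the likelihood improves.
        improved = True
        while improved:
            improved = False
            best = log_lik(patients, responsible, p, p0)
            for t in all_types:
                candidate = responsible ^ {t}
                ll = log_lik(patients, candidate, p, p0)
                if ll > best + 1e-12:
                    responsible, best, improved = candidate, ll, True
        # Step 2: grid search over (p, p0) with the assignments held fixed.
        new_p, new_p0 = max(
            product(grid, grid),
            key=lambda g: log_lik(patients, responsible, g[0], g[1]),
        )
        if (new_p, new_p0) == (p, p0):
            break
        p, p0 = new_p, new_p0
    return responsible, p, p0


# Toy data (made up): only patients carrying A02 ever react.
toy = (
    [(["A02", "B07"], True)] * 3 + [(["A02", "B08"], True)] * 3
    + [(["A02", "B07"], False)] * 2 + [(["A02", "B08"], False)] * 2
    + [(["A01", "B07"], False)] * 10 + [(["A03", "B08"], False)] * 10
)
found, p_hat, p0_hat = fit(toy)
```

Each step can only keep or raise the likelihood because the current assignment and the starting (p, p0) are always among the candidates, so the loop settles at a local optimum rather than a guaranteed global one.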
Status Most likely assignments have been confirmed
Summary • HIV vaccine design is a data-intensive problem • Data is in the form of discrete sequences, making it ideal for computer-science/machine-learning analysis • Machine learning approaches are instrumental in finding epitopes and in vaccine compression • Work in progress: Our vaccine designs are scheduled to be tested in vitro at Mass General this summer