1 / 16

Data driven method training

Stabilization matrix method ( Rigde regression) Morten Nielsen Department of Systems Biology , DTU. Data driven method training. A prediction method contains a very large set of parameters A matrix for predicting binding for 9meric peptides has 9x20=180 weights Over fitting is a problem.

melody
Download Presentation

Data driven method training

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Stabilization matrix method(Rigde regression)Morten NielsenDepartment of Systems Biology, DTU

  2. Data driven method training • A prediction method contains a very large set of parameters • A matrix for predicting binding for 9meric peptides has 9x20=180 weights • Over fitting is a problem Temperature years

  3. Stop training Model over-fitting (early stopping) Evaluate on 600 MHC:peptide binding data PCC=0.89

  4. Stabilization matrix method The mathematics y = ax + b 2 parameter model Good description, poor fit y = ax6+bx5+cx4+dx3+ex2+fx+g 7 parameter model Poor description, good fit

  5. Stabilization matrix method The mathematics y = ax + b 2 parameter model Good description, poor fit y = ax6+bx5+cx4+dx3+ex2+fx+g 7 parameter model Poor description, good fit

  6. SMM training Evaluate on 600 MHC:peptide binding data L=0: PCC=0.70 L=0.1 PCC = 0.78

  7. Stabilization matrix method.The analytic solution Each peptide is represented as 9*20 number (180) H is a stack of such vectors of 180 values t is the target value (the measured binding) l is a parameter introduced to suppress the effect of noise in the experimental data and lower the effect of overfitting

  8. SMM - Stabilization matrix method Linear function I1 I2 Sum over weights Sum over data points w1 w2 o

  9. Per target error: SMM - Stabilization matrix method Global error: Linear function Sum over weights Sum over data points I1 I2 w1 w2 o

  10. SMM - Stabilization matrix methodDo it yourself Linear function l per target I1 I2 w1 w2 o

  11. And now you

  12. SMM - Stabilization matrix method Linear function l per target I1 I2 w1 w2 o

  13. SMM - Stabilization matrix method Linear function I1 I2 w1 w2 o

  14. SMM - Stabilization matrix methodMonte Carlo Global: Linear function • Make random change to weights • Calculate change in “global” error • Update weights if MC move is accepted I1 I2 w1 w2 o Note difference between MC and GD in the use of “global” versus “per target” error

  15. Training/evaluation procedure • Define method • Select data • Deal with data redundancy • In method (sequence weighting) • In data (Hobohm) • Deal with over-fitting either • in method (SMM regulation term) or • in training (stop fitting on test set performance) • Evaluate method using cross-validation

  16. A small doittcshscript/home/projects/mniel/ALGO/code/SMM #! /bin/tcsh -f set DATADIR = /home/projects/mniel/ALGO/data/SMM/ foreach a ( A0101 A3002 ) mkdir -p $a cd $a foreach n ( 0 1 2 3 4 ) mkdir -p n.$n cd n.$n cat $DATADIR/$a/f00$n | gawk '{print $1,$2}' > f00$n cat $DATADIR/$a/c00$n | gawk '{print $1,$2}' > c00$n touch $$.perf rm -f $$.perf foreach l ( 0 0.05 0.1 0.5 1.0 2.0 5.0 10.0 20.0 50.0 ) smm-nc 500 -l $l f00$n > mat.$l pep2score -mat mat.$l c00$n > c00$n.pred echo $l `cat c00$n.pred | grep -v "#" | args 2,3 | gawk 'BEGIN{n+0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}' ` >> $$.perf end set best_l = `cat $$.perf | sort -nk2 | head -1 | gawk '{print $1}'` echo "Allele" $a "n" $n "best_l" $best_l pep2score -mat mat.$best_l c00$n > c00$n.pred cd .. end echo $a `cat n.?/c00?.pred | grep -v "#" | args 2,3 | xycorr | grep -v "#"` \ `cat n.?/c00?.pred | grep -v "#" | args 2,3 | \ gawk 'BEGIN{n+0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}' ` cd .. end

More Related