
DC2 C++ decision trees


Presentation Transcript


  1. DC2 C++ decision trees
  • Quick review of classification (or decision) trees
  • Training and testing
  • How Bill does it with Insightful Miner
  • Applications to the “good-energy” trees: how does it compare?
  Toby Burnett, Frank Golf
  DC2 at GSFC - 28 Jun 05 - T. Burnett

  2. Quick Review of Decision Trees
  • Introduced to GLAST by Bill Atwood, using Insightful Miner
  • Each branch node is a predicate, or cut on a variable, like CalCsIRLn > 4.222
  • If true, this defines the right branch; otherwise the left branch.
  • If there is no branch, the node is a leaf; a leaf contains the purity of the sample that reaches that point
  • Thus the tree defines a function of the event variables used, returning a value for the purity from the training sample (see the sketch below)
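For concreteness, here is a minimal sketch of such a tree and its evaluation. This is illustrative only, not the actual classifier package code: branch nodes hold a variable name and a cut value, leaves hold the training-sample purity.

    // Minimal sketch, not the actual classifier package code.
    #include <map>
    #include <memory>
    #include <string>

    struct Node {
        std::string variable;         // branch: variable to cut on, e.g. "CalCsIRLn"
        double      cut = 0.0;        // branch: cut value, e.g. 4.222
        double      purity = 0.0;     // leaf: purity of the training sample reaching here
        std::unique_ptr<Node> left;   // taken when the predicate is false
        std::unique_ptr<Node> right;  // taken when the predicate is true

        bool isLeaf() const { return !left && !right; }
    };

    // Follow the predicates down to a leaf and return its purity.
    double evaluate(const Node& node, const std::map<std::string, double>& event)
    {
        if (node.isLeaf()) return node.purity;
        bool pass = event.at(node.variable) > node.cut;
        return evaluate(pass ? *node.right : *node.left, event);
    }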

  3. Training and testing procedure
  • Analyze a training sample containing a mixture of “good” and “bad” events: I use the even events in order to have an independent set for testing
  • Choose a set of variables and find the optimal cut such that the left and right subsets are purer than the original. Two standard criteria for this: “Gini” and entropy. I currently use the former.
  • WS: sum of signal weights
  • WB: sum of background weights
  Gini = 2 WS WB / (WS + WB)
  Thus Gini wants to be small. Actually maximize the improvement: Gini(parent) - Gini(left child) - Gini(right child) (see the code sketch below).
  • Apply this recursively until too few events remain (100 for now)
  • Finally test with the odd events: measure the purity for each node
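A small sketch of the Gini figure of merit and the improvement that the cut search maximizes; the struct and function names here are illustrative, not from the classifier package.

    // Sketch of the Gini criterion above; WS and WB are the signal and
    // background weight sums in a node.
    struct NodeWeights {
        double ws = 0.0;  // sum of signal weights
        double wb = 0.0;  // sum of background weights
    };

    // Gini = 2 * WS * WB / (WS + WB): small when the node is pure.
    double gini(const NodeWeights& w)
    {
        double total = w.ws + w.wb;
        return total > 0.0 ? 2.0 * w.ws * w.wb / total : 0.0;
    }

    // Quantity to maximize when choosing a cut:
    // Gini(parent) - Gini(left child) - Gini(right child).
    double giniImprovement(const NodeWeights& parent,
                           const NodeWeights& left,
                           const NodeWeights& right)
    {
        return gini(parent) - gini(left) - gini(right);
    }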

  4. Evaluate by comparing with IM results
  From Bill’s Rome ’03 talk: the “good cal” analysis
  [Plots: CAL-Low and CAL-High CT probability distributions, each showing all, good, and bad events]

  5. Compare with Bill, cont.
  • Since Rome:
  • Three energy ranges, three trees.
  • CalEnergySum: 5-350; 350-3500; >3500
  • Resulting decision trees implemented in Gleam by a “classification” package: results saved to the IMgoodCal tuple variable.
  • Development until now for training and comparison: the all_gamma from v4r2
  • 760 K events, E from 16 MeV to 160 GeV (uniform in log(E)) and 0 < θ < 90 (uniform in cos(θ))
  • Contains IMgoodCal for comparison
  • Now:

  6. Bill-type plots for all_gamma

  7. Another way of looking at performance
  Define efficiency as the fraction of good events remaining after a given cut on node purity; determine the bad fraction for that cut (a sketch of this calculation follows below).
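A sketch of that calculation, assuming each event carries the tree output (the leaf purity) and a truth label; the types are illustrative, not the package's.

    // Sketch of the efficiency / bad-fraction definition above.
    #include <vector>

    struct Event { double purity; bool good; };          // illustrative event record
    struct Performance { double efficiency; double badFraction; };

    // For a given cut on node purity: efficiency = fraction of good events kept,
    // badFraction = fraction of bad events kept.
    Performance measure(const std::vector<Event>& events, double purityCut)
    {
        double goodPass = 0, goodTotal = 0, badPass = 0, badTotal = 0;
        for (const Event& e : events) {
            if (e.good) { ++goodTotal; if (e.purity > purityCut) ++goodPass; }
            else        { ++badTotal;  if (e.purity > purityCut) ++badPass;  }
        }
        return { goodTotal > 0 ? goodPass / goodTotal : 0.0,
                 badTotal  > 0 ? badPass  / badTotal  : 0.0 };
    }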

  8. The classifier package
  • Properties
  • Implements Gini separation (entropy now implemented)
  • Reads ROOT files
  • Flexible specification of variables to use
  • Simple and compact ASCII representation of decision trees (an illustrative example follows below)
  • Recently implemented: multiple trees, boosting
  • Not yet: pruning
  • Advantages vs. IM:
  • Advanced techniques involving multiple trees, e.g. boosting, are available
  • Anyone can train trees, play with variables, etc.
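The slide does not show the package's actual file format; purely as an illustration, a pre-order, one-node-per-line dump such as the following would be one simple and compact ASCII representation of a tree.

    // Illustration only: one possible compact ASCII layout (pre-order, one node
    // per line). The real format used by the classifier package may differ.
    #include <iostream>
    #include <memory>
    #include <string>

    struct TreeNode {
        std::string variable;            // empty for a leaf
        double      value = 0.0;         // cut value for a branch, purity for a leaf
        std::unique_ptr<TreeNode> left, right;
    };

    void writeAscii(const TreeNode& n, std::ostream& out)
    {
        if (!n.left && !n.right) {
            out << "leaf " << n.value << '\n';                      // leaf: purity
        } else {
            out << "cut " << n.variable << ' ' << n.value << '\n';  // branch: variable and cut
            writeAscii(*n.left, out);
            writeAscii(*n.right, out);
        }
    }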

  9. Growing trees
  • For simplicity, run a single tree: initially try all of Bill’s classification tree variables (list at right).
  • The run over the full 750 K events, trying each of the 70 variables, takes only a few minutes!

  10. Performance of the truncated list
  1691 leaf nodes, 3381 nodes total
  Note: this is cheating: the evaluation uses the same events as the training.

  11. Separate tree comparison
  [Plots: IM vs. classifier]

  12. Current status: boosting works!
  Example using D0 data (see the sketch below).
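As a rough sketch of the idea only, not the package's interface: a boosted classifier combines many trees, each contributing its response with a weight determined during training.

    // Sketch of evaluating a boosted ensemble: weighted average of tree responses.
    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    using Event        = std::map<std::string, double>;
    using TreeResponse = double (*)(const Event&);   // stand-in for one trained tree

    struct BoostedTrees {
        std::vector<TreeResponse> trees;
        std::vector<double>       weights;  // e.g. boosting weights from training

        double evaluate(const Event& event) const
        {
            double sum = 0.0, norm = 0.0;
            for (std::size_t i = 0; i < trees.size(); ++i) {
                sum  += weights[i] * trees[i](event);
                norm += weights[i];
            }
            return norm > 0.0 ? sum / norm : 0.0;
        }
    };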

  13. Current status, next steps
  • The current classifier “goodcal” single-tree algorithm applied to all energies is slightly better than the three individual IM trees
  • Boosting will certainly improve the result
  • Done:
  • One-track vs. vertex: which estimate is better?
  • In progress in Seattle (as we speak)
  • PSF tail suppression: 4 trees to start.
  • In progress in Padova (see F. Longo’s summary)
  • Good-gamma prediction, or background rejection
  • Switch to new all_gamma run, rename variables.
