Genetic Programming for Mining DNA Chip Data from Cancer Patients. W.B. Langdon & B.F. Buxton. Genetic Programming and Evolvable Machines, 5 (3): 251-257, September 2004. Presenter: John Dynan.
Why Genetic Programming? • Applies the principles of Darwinism to AI • Allows natural selection of the fittest models • Iterative process that evolves numerous solutions • Similar to the biology of genetics • Addresses the overfitting issue found in other approaches • Suits DNA arrays with limited data sets (<100 tissues) • Exposes the predictive nature of low-expression genes • Disease treatment and prevention
What is Genetic Programming (GP)? • Replicates genetic processes: • Crossover (recombination) • Duplication • Mutation • Production • Deletion • DNA string of elements (A, C, G, U=T)
What it is not • K-means clustering • Heuristic combination of fixed rules • Single set of features • Sequential learning process for features • Optimal solution • Controlled feature deletion or addition
History • Extension of Holland's (1975) genetic algorithms work (Stanford): • Structures are programs • Syntax trees • Nodes • Functions (Mul, Add, Div, Sub, Exp, ..) • Terminals (attributes, gene expression, ..) • GP is a search over terminals and functions
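The syntax-tree idea on this slide can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding, the `evaluate` helper, and the protected division that returns 1.0 on a zero divisor are assumptions, not details taken from the paper. Only the binary functions Mul, Add, Div and Sub are included; Exp is left out to keep the sketch short. The two expression values are the first numbers listed for each gene on the data snippet slide.

```python
import operator

def protected_div(a, b):
    # Assumed convention: division by zero yields 1.0 instead of raising.
    return a / b if b != 0 else 1.0

# Function set: node name -> binary callable.
FUNCTIONS = {
    "Add": operator.add,
    "Sub": operator.sub,
    "Mul": operator.mul,
    "Div": protected_div,
}

def evaluate(tree, sample):
    """Recursively evaluate a syntax tree on one sample.

    A tree is either a terminal (a gene ID string or a numeric constant)
    or a function node written as (function_name, left_subtree, right_subtree).
    """
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, sample), evaluate(right, sample))
    if isinstance(tree, str):
        return sample[tree]      # terminal: look up the gene expression value
    return float(tree)           # terminal: numeric constant

# Illustrative tree: U41737_at + 2 * U08998_at
tree = ("Add", "U41737_at", ("Mul", 2.0, "U08998_at"))
sample = {"U41737_at": 15.0, "U08998_at": 206.0}
print(evaluate(tree, sample))  # 427.0
```

Searching over trees then means searching over which function sits at each node and which terminal sits at each leaf.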
µarray Problem • Pomeroy Data Set (url) • 7129 gene expressions • 60 patients • 39 survivors (Cancer Tissues) • 21 terminal (Non Cancer) • Compared with Pomeroy (K=5 and 8 genes)
Pomeroy Data Set Snippet
• Sample columns: Brain_MD_30 Brain_MD_31 Brain_MD_32 Brain_MD_33 Brain_MD_34 Brain_MD_35 Brain_MD_36 Brain_MD_37 Brain_MD_38 Brain_MD_39 Brain_MD_40 Brain_MD_41 Brain_MD_42 Brain_MD_43 Brain_MD_44 Brain_MD_45 Brain_MD_46 Brain_MD_47 Brain_MD_48 Brain_MD_49 Brain_MD_50 Brain_MD_51 Brain_MD_52 Brain_MD_53 Brain_MD_54 Brain_MD_55 Brain_MD_56 Brain_MD_57 Brain_MD_58 Brain_MD_59 Brain_MD_60
• U08998_at TAR RNA binding protein (TRBP) mRNA (60 values):
206.0 55.0 106.0 323.0 209.0 88.0 179.0 -493.0 -40.0 60.0 -200.0 312.0
-26.0 -234.0 127.0 10.0 135.0 -72.0 46.0 -77.0 50.0 375.0 -252.0 -189.0
-112.0 -931.0 193.0 -125.0 -1244.0 -470.0 -683.0 -261.0 -18.0 -90.0 -3.0 -57.0
-201.0 50.0 -197.0 -141.0 -353.0 -132.0 -408.0 -262.0 20.0 239.0 -232.0 -593.0
-443.0 6.0 -316.0 116.0 -7.0 169.0 -260.0 -137.0 17.0 100.0 -954.0 -353.0
• U41737_at Pancreatic beta cell growth factor (INGAP) mRNA (60 values):
15.0 -87.0 11.0 173.0 177.0 -105.0 35.0 13.0 53.0 8.0 25.0 28.0
21.0 61.0 -8.0 75.0 24.0 -135.0 55.0 162.0 139.0 22.0 -89.0 13.0
-177.0 -384.0 45.0 -38.0 -38.0 -136.0 -152.0 -42.0 -85.0 -31.0 70.0 -76.0
-74.0 -50.0 29.0 -81.0 145.0 42.0 -79.0 25.0 18.0 -20.0 44.0 -78.0
192.0 -66.0 -73.0 -39.0 57.0 -122.0 -90.0 25.0 -10.0 -80.0 -306.0 -3.0
• Class label file:
60 2 1
# class0 class1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
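The class label block at the end of the snippet can be read with a short parser. The sketch below assumes the three-part layout visible on the slide (a header with the sample count and class count, a `#` line naming the classes, then one label per sample); `parse_cls` is a hypothetical helper name, and the label string is rebuilt from the counts on the slide (21 ones followed by 39 zeros).

```python
from collections import Counter

def parse_cls(text):
    """Parse a class label block: header line, '#' class-name line, labels."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = [int(x) for x in " ".join(lines[2:]).split()]
    if len(labels) != n_samples:
        raise ValueError(f"expected {n_samples} labels, found {len(labels)}")
    if len(class_names) != n_classes:
        raise ValueError(f"expected {n_classes} class names, found {len(class_names)}")
    return class_names, labels

# Label block as shown on the slide: 21 ones, then 39 zeros.
CLS_TEXT = "60 2 1\n# class0 class1\n" + " ".join(["1"] * 21 + ["0"] * 39)

class_names, labels = parse_cls(CLS_TEXT)
counts = Counter(labels)
print(class_names, counts[0], counts[1])  # ['class0', 'class1'] 39 21
```

The 39/21 split matches the patient counts given on the µarray Problem slide; which label denotes survivors is not stated in the snippet itself.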
Method • Each individual consists of 5 trees (mating pools) • An N=60 fold run generates 60 random models • The N=60 fold run is repeated 10 times • Result: 600 predictive patient survival models • If Tree(i=1..5) > 0, the GP model is positive (node) • Genetic modifications in trees 1 and 2 • Trees may specialize (tissue) • Program fitness: (Pos/Neg) accuracy > 0.5
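The count of 600 models follows from 60 folds times 10 repetitions. The sketch below reads "N=60 fold" as 60-fold cross-validation over the 60 patients, which is an assumption about the slide's wording; `repeated_kfold` is a hypothetical helper, and the GP training step that would run on each split is omitted.

```python
import random

def repeated_kfold(n_samples, n_folds, n_repeats, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                    # new random partition per repeat
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]   # every n_folds-th sample
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, test_idx

splits = list(repeated_kfold(n_samples=60, n_folds=60, n_repeats=10))
print(len(splits))  # 600
```

With as many folds as patients, each split trains on 59 patients and holds out one, so every one of the 600 models is tested on a patient it never saw.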
GP Conditions • Terminals (µarray data) • Functions (+, -, /, *, exp, <, >, ..) • Fitness measurement (data) • Program control (loop, time) • Termination (generations)
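The five ingredients above map onto a small evolutionary loop. What follows is a minimal sketch, not the authors' system: it evolves a single tree per individual (the Method slide describes five), uses tournament selection with elitism, and every parameter value (population size, mutation rate, tree depth) is an arbitrary assumption. Terminals are feature indices or constants, the function set is +, -, *, and a protected /, fitness is classification accuracy, and termination is a fixed number of generations.

```python
import operator
import random

def pdiv(a, b):
    # Protected division (assumed convention): 1.0 when the divisor is ~0.
    return a / b if abs(b) > 1e-9 else 1.0

FUNCS = [operator.add, operator.sub, operator.mul, pdiv]

def random_tree(rng, n_terminals, depth):
    """int = terminal (feature index), float = constant, tuple = function node."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return rng.randrange(n_terminals)
        return rng.uniform(-1.0, 1.0)
    return (rng.choice(FUNCS), random_tree(rng, n_terminals, depth - 1),
            random_tree(rng, n_terminals, depth - 1))

def evaluate(tree, row):
    if isinstance(tree, tuple):
        func, left, right = tree
        return func(evaluate(left, row), evaluate(right, row))
    return row[tree] if isinstance(tree, int) else tree

def fitness(tree, X, y):
    """Accuracy of the rule 'tree output > 0 means class 1'."""
    hits = sum((evaluate(tree, row) > 0) == bool(label) for row, label in zip(X, y))
    return hits / len(y)

def random_subtree(tree, rng):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = tree[rng.choice((1, 2))]
    return tree

def crossover(p1, p2, rng):
    """Graft a random subtree of p2 into a random position of p1."""
    if not isinstance(p1, tuple) or rng.random() < 0.3:
        return random_subtree(p2, rng)
    func, left, right = p1
    if rng.random() < 0.5:
        return (func, crossover(left, p2, rng), right)
    return (func, left, crossover(right, p2, rng))

def mutate(tree, rng, n_terminals):
    """Replace a randomly chosen subtree with a fresh random one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, n_terminals, 2)
    func, left, right = tree
    if rng.random() < 0.5:
        return (func, mutate(left, rng, n_terminals), right)
    return (func, left, mutate(right, rng, n_terminals))

def evolve(X, y, pop_size=30, generations=10, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    pop = [random_tree(rng, n, 3) for _ in range(pop_size)]
    history = []                                   # best fitness per generation
    for _ in range(generations):                   # termination: fixed generations
        scored = sorted(((fitness(t, X, y), i, t) for i, t in enumerate(pop)),
                        reverse=True)
        history.append(scored[0][0])
        children = [scored[0][2]]                  # elitism: keep the best tree
        while len(children) < pop_size:
            mother = max(rng.sample(scored, 3))[2]  # tournament of size 3
            father = max(rng.sample(scored, 3))[2]
            child = crossover(mother, father, rng)
            if rng.random() < 0.2:
                child = mutate(child, rng, n)
            children.append(child)
        pop = children
    best = max((fitness(t, X, y), i, t) for i, t in enumerate(pop))
    return best[2], best[0], history

# Toy data (made up): class 1 when feature 0 exceeds feature 1.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(30)]
y = [int(row[0] > row[1]) for row in X]
best, score, history = evolve(X, y)
print(round(score, 2))
```

Because the best tree is copied unchanged into each new generation, the best fitness never decreases from one generation to the next.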
GP 1st/2nd Data Mining • 600 GP models • 6970 of 7129 attributes appear in GP models • 404 genes appear in ten or more GP models • These 404 genes were used in the 2nd GP run • Two genes appear in more than 100 GP models • U08998 - 182 GP models • U41737 - 193 GP models
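The filtering step between the first and second runs amounts to counting, for each gene, how many models use it. A minimal sketch, assuming each model has already been reduced to the set of gene IDs it references (`genes_in_at_least` and the three toy models are hypothetical):

```python
from collections import Counter

def genes_in_at_least(models, min_models):
    """Return {gene: model_count} for genes used by at least min_models models."""
    counts = Counter()
    for genes in models:
        counts.update(set(genes))   # count each gene once per model
    return {gene: n for gene, n in counts.items() if n >= min_models}

# Toy example with made-up models; the slide's threshold is ten or more of 600.
toy_models = [
    ["U08998_at", "U41737_at"],
    ["U41737_at", "U41737_at", "geneX"],
    ["U08998_at", "U41737_at"],
]
print(genes_in_at_least(toy_models, min_models=2))
# {'U08998_at': 2, 'U41737_at': 3}
```

Applied to the 600 first-run models with `min_models=10`, this kind of count would give the reduced attribute set for the second run.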
Gene Biology • Genes NOT highly expressed • Not found in Pomeroy k-means cluster analysis • U08998_at • TAR RNA binding protein: promotes cancer • TARBP1 GeneCard • U41737_at • Pancreatic beta cell growth factor • REG3A GeneCard
Final GP • Limited number of functions • Single IF statements (<, >, ≤) • Random generation of functions and genes • N=60 fold, repeated 10 times: accuracy = 68% • 147 of 192 were incorrect predictors • 39 of 192 were correct two-gene predictors
Two Gene Outcome • Survived / predicted correctly: TP • Failed treatment / predicted wrongly: FP • Survived / predicted wrongly: FN • Failed treatment / predicted correctly: TN • Darkened points are poor predictors • GP model predictor: • -42 < U41737_at + 2*U08998_at
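The two-gene rule and the four outcome labels can be tied together in a short sketch. Two assumptions are made here and are not stated on the slide: that the inequality being true means "predicted to survive", and that the gene IDs are U41737_at and U08998_at as listed on the data mining slide. The four samples are made-up values chosen to produce one case of each outcome.

```python
def predicts_survival(u41737, u08998):
    # Rule from the slide; mapping "true" to survival is an assumption.
    return -42 < u41737 + 2 * u08998

def confusion(samples):
    """samples: iterable of (U41737_at value, U08998_at value, survived)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for u41737, u08998, survived in samples:
        predicted = predicts_survival(u41737, u08998)
        if survived and predicted:
            counts["TP"] += 1       # survived, predicted correctly
        elif not survived and predicted:
            counts["FP"] += 1       # failed treatment, predicted wrongly
        elif survived and not predicted:
            counts["FN"] += 1       # survived, predicted wrongly
        else:
            counts["TN"] += 1       # failed treatment, predicted correctly
    return counts

# Hypothetical samples, not patient data.
toy_samples = [
    (15.0, 206.0, True),      # 15 + 412 = 427  > -42 -> TP
    (100.0, 10.0, False),     # 100 + 20 = 120  > -42 -> FP
    (-3.0, -353.0, True),     # -3 - 706 = -709 < -42 -> FN
    (-306.0, -954.0, False),  # -306 - 1908 = -2214   -> TN
]
counts = confusion(toy_samples)
accuracy = (counts["TP"] + counts["TN"]) / len(toy_samples)
print(counts, accuracy)  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 1} 0.5
```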
Limitations • Extensive computer resources (exponential) • NP solution • Solution is only heuristically optimal • Repeating the random selection process with different rates of genetic evolutionary change can produce different results
Bioinformatics • Allows low-expression genes to be selected into the predictive model • New information can be harvested by repeating execution of GP • The 5 trees can act as isolated members for different organ tissues • Disease treatment, prediction and cure