Pat Langley Center for the Study of Language and Information

Computational Discovery of Communicable Scientific Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
langley@csli.stanford.edu
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, S. Dzeroski, J. Sanchez, Oren Shiran, and L. Todorovski for their contributions to this research, which is funded by a grant from the National Science Foundation.

Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit knowledge from data:
• Data mining generates knowledge cast as decision trees, logical rules, or other notations invented by AI researchers;
• Computational scientific discovery instead uses equations, structural models, reaction pathways, or other formalisms invented by scientists and engineers.
Both approaches draw on heuristic search to find regularities in data, but they differ considerably in their emphases.

Lesson 1: Traditional notations from machine learning are not communicated easily to domain scientists.
Ecosystem model:
NPPc = Σ_month max (E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR (UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)
Gene regulation model:
[Figure: network of signed (+/-) links among Light, NBLR, NBLA, PBS, DFR, RR, psbA1, psbA2, cpcB, Photo, and Health]
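As a rough illustration of how the light-use-efficiency part of the ecosystem model reads when written as code, the sketch below transcribes the equations for T1, T2, W, E, and NPPc. It assumes the garbled exponents in the slide are `e^(...)` and `Topt^2`; the function names are illustrative and do not come from the presentation.

```python
import math

def t1(topt):
    # T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    # T2 = 1.18 / [(1 + e^(0.2 (Topt - Tempc - 10))) (1 + e^(0.3 (Tempc - Topt - 10)))]
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def w(eet, pet):
    # W = 0.5 + 0.5 * EET / PET (caller must supply a non-zero PET)
    return 0.5 + 0.5 * eet / pet

def efficiency(topt, tempc, eet, pet):
    # E = 0.56 * T1 * T2 * W
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet)

def nppc(monthly_e, monthly_ipar):
    # NPPc = sum over months of max(E * IPAR, 0)
    return sum(max(e * ipar, 0.0) for e, ipar in zip(monthly_e, monthly_ipar))
```

Calling `efficiency(20.0, 20.0, 2.0, 2.0)` evaluates E at the optimal temperature with EET equal to PET, where T1 and W both equal 1.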

Lesson 2: Scientists often have initial models that should influence the discovery process.
[Figure: observations and an initial model are input to a discovery step that outputs a revised model, illustrated with the gene regulation network]

Lesson 3: Scientific data are often rare and difficult to obtain rather than being plentiful.
Ecosystem model: 8 variables, 11 equations, 20 parameters, 303 samples.
Gene regulation model: 9 variables, 11 initial links, 70 possible links, 20 samples.

Lesson 4: Scientists want models that move beyond description to provide explanations of their data.
[Figure: the ecosystem model drawn as a graph over NPPc, E, IPAR, e_max, W, T1, T2, SOLAR, FPAR, A, PET, EET, Topt, SR, AHI, PETTWM, Tempc, NDVI, and VEG, beside the gene regulation network]

Lesson 5: Scientists want computational assistance rather than automated discovery systems.
[Figure: observations and an initial model are input to a discovery step that outputs a revised model, illustrated with the gene regulation network]

The Nature of Systems Science
Disciplines like Earth science and computational biology differ from traditional fields in that they:
• focus on synthesis rather than analysis in their operation;
• rely on computer modeling as one of their central methods;
• develop system-level models with many variables and relations;
• require that models make contact with known mechanisms.
However, existing methods for computational scientific discovery were not designed with systems science in mind.

Time Series from the Ross Sea Ecosystem

Inductive Process Modeling
Our approach is to design and implement computational methods for inductive process modeling, which:
• represent scientific models as sets of quantitative processes;
• use these models to predict and explain observational data;
• search a space of process models to find good candidates;
• utilize background knowledge to constrain this search.
This framework has great potential both for modeling scientific reasoning and aiding practicing scientists.

Existing Formalisms Are Inadequate
[Figure: four existing formalisms, each with a small example]
• regression trees (tests such as B>6, C>0, C>4, with numeric leaves 14.3, 18.7, 11.5, 16.9)
• systems of equations: d[ice_mass,t] = -(18 · heat) / 6.02; d[water_mass,t] = (18 · heat) / 6.02
• hidden Markov models
• Horn clause programs: gcd(X,X,X). gcd(X,Y,D) :- X<Y, Z is Y-X, gcd(X,Z,D). gcd(X,Y,D) :- Y<X, gcd(Y,X,D).

A Process Model for an Aquatic Ecosystem
model AquaticEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro
process phyto_loss
  equations: d[phyto,t,1] = -0.307 · phyto
             d[residue,t,1] = 0.307 · phyto
process zoo_loss
  equations: d[zoo,t,1] = -0.251 · zoo
             d[residue,t,1] = 0.251 · zoo
process zoo_phyto_grazing
  equations: d[zoo,t,1] = 0.615 · 0.495 · zoo
             d[residue,t,1] = 0.385 · 0.495 · zoo
             d[phyto,t,1] = -0.495 · zoo
process nitro_uptake
  conditions: nitro > 0
  equations: d[phyto,t,1] = 0.411 · phyto
             d[nitro,t,1] = -0.098 · 0.411 · phyto
process nitro_remineralization
  equations: d[nitro,t,1] = 0.005 · residue
             d[residue,t,1] = -0.005 · residue
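One way to read the model above is that each variable's derivative is the sum of the contributions from every process whose conditions hold. The sketch below simulates the model on that reading with a fixed-step Euler integrator. Several things are assumptions rather than details from the slides: the additive combination rule, the Euler scheme, the step size, the signs of the loss terms, and the `0.251 * zoo` form of the residue term in `zoo_loss`.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
# A process pairs a condition with one rate function per affected variable.
Process = Tuple[Callable[[State], bool], Dict[str, Callable[[State], float]]]

def always(state: State) -> bool:
    return True

PROCESSES: Dict[str, Process] = {
    "phyto_loss": (always, {
        "phyto": lambda s: -0.307 * s["phyto"],
        "residue": lambda s: 0.307 * s["phyto"],
    }),
    "zoo_loss": (always, {
        "zoo": lambda s: -0.251 * s["zoo"],
        "residue": lambda s: 0.251 * s["zoo"],  # assumed form
    }),
    "zoo_phyto_grazing": (always, {
        "zoo": lambda s: 0.615 * 0.495 * s["zoo"],
        "residue": lambda s: 0.385 * 0.495 * s["zoo"],
        "phyto": lambda s: -0.495 * s["zoo"],
    }),
    "nitro_uptake": (lambda s: s["nitro"] > 0, {
        "phyto": lambda s: 0.411 * s["phyto"],
        "nitro": lambda s: -0.098 * 0.411 * s["phyto"],
    }),
    "nitro_remineralization": (always, {
        "nitro": lambda s: 0.005 * s["residue"],
        "residue": lambda s: -0.005 * s["residue"],
    }),
}

def derivatives(state: State) -> State:
    """Sum the rate contributions of all active processes per variable."""
    rates = {var: 0.0 for var in state}
    for condition, equations in PROCESSES.values():
        if condition(state):
            for var, rate in equations.items():
                rates[var] += rate(state)
    return rates

def simulate(state: State, dt: float, steps: int) -> List[State]:
    """Fixed-step Euler integration; returns the trajectory including the start."""
    trajectory = [dict(state)]
    for _ in range(steps):
        rates = derivatives(state)
        state = {var: state[var] + dt * rates[var] for var in state}
        trajectory.append(dict(state))
    return trajectory
```

For example, `simulate({"phyto": 1.0, "zoo": 0.1, "nitro": 1.0, "residue": 0.0}, 0.1, 100)` returns 101 states; only `phyto` and `nitro` would be compared against data, since those are the declared observables.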

Advantages of Quantitative Process Models
Process models offer scientists a promising framework because:
• they embed quantitative relations within qualitative structure;
• they refer to notations and mechanisms familiar to experts;
• they provide dynamical predictions of changes over time;
• they offer causal and explanatory accounts of phenomena;
• they retain the modularity needed for induction/abduction.
Quantitative process models provide an important alternative to formalisms used currently in computational discovery.

Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
• process models characterize behavior of dynamical systems;
• variables are continuous but can have discontinuous behavior;
• observations are not independently and identically distributed;
• models may contain unobservable processes and variables;
• multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and the availability of background knowledge.

Encoding Background Knowledge
To constrain candidate models, we can utilize available background knowledge about the domain. Previous work has encoded background knowledge in terms of:
• Horn clause programs (e.g., Towell & Shavlik, 1990)
• context-free grammars (e.g., Dzeroski & Todorovski, 1997)
• prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations is familiar to domain scientists, which suggests the need for another approach.

Generic Processes as Background Knowledge
We cast background knowledge as generic processes that specify:
• the variables involved in a process and their types;
• the parameters appearing in a process and their ranges;
• the forms of conditions on the process; and
• the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose a specific process model.

Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 · α · S
             d[D,t,1] = α · S
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ · ρ · S1
             d[D,t,1] = (1 - γ) · ρ · S1
             d[S2,t,1] = -1 · ρ · S1
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ · S
             d[N,t,1] = -1 · β · μ · S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π · D
             d[D,t,1] = -1 · π · D
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
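To make the idea of a generic process concrete, the sketch below encodes the exponential_loss process as a small Python data structure and instantiates it with specific variables and a parameter value. The class, the `instantiate` helper, and the parameter name `rate` are all hypothetical; the slides do not say how the authors' system represents generic processes internally.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Values = Dict[str, float]

@dataclass
class GenericProcess:
    name: str
    roles: Tuple[Tuple[str, str], ...]             # (role name, required type)
    param_ranges: Dict[str, Tuple[float, float]]   # parameter -> (low, high)
    # role -> rate function of (values keyed by role, parameter values)
    equations: Dict[str, Callable[[Values, Values], float]]

EXPONENTIAL_LOSS = GenericProcess(
    name="exponential_loss",
    roles=(("S", "species"), ("D", "detritus")),
    param_ranges={"rate": (0.0, 1.0)},  # "rate" is a placeholder parameter name
    equations={
        "S": lambda v, p: -p["rate"] * v["S"],
        "D": lambda v, p: p["rate"] * v["S"],
    },
)

def instantiate(generic: GenericProcess, binding: Dict[str, str],
                params: Values, types: Dict[str, str]) -> Callable[[Values], Values]:
    """Bind roles to concrete variables, enforcing types and parameter ranges."""
    for role, required in generic.roles:
        if types[binding[role]] != required:
            raise TypeError(f"{binding[role]} is not of type {required}")
    for name, (low, high) in generic.param_ranges.items():
        if not low <= params[name] <= high:
            raise ValueError(f"{name}={params[name]} outside [{low}, {high}]")

    def rates(state: Values) -> Values:
        by_role = {role: state[binding[role]] for role, _ in generic.roles}
        return {binding[role]: eq(by_role, params)
                for role, eq in generic.equations.items()}
    return rates

TYPES = {"phyto": "species", "zoo": "species",
         "nitro": "nutrient", "residue": "detritus"}
phyto_loss = instantiate(EXPONENTIAL_LOSS, {"S": "phyto", "D": "residue"},
                         {"rate": 0.307}, TYPES)
```

Here `phyto_loss` plays the role of the specific process of the same name in the aquatic ecosystem model: given a state, it returns the rate contributions to `phyto` and `residue`.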

Inducing Process Models
[Figure: training data and a set of generic processes are input to an induction step, which outputs a process model]
Generic processes shown: exponential_growth, logistic_growth, constant_inflow, consumption, no_saturation, saturation.
Process model shown: AquaticEcosystem, with variables nitro, phyto, zoo, nutrient_nitro, nutrient_phyto; observables nitro, phyto, zoo; and processes phyto_exponential_growth, zoo_logistic_growth, phyto_nitro_consumption, phyto_nitro_no_saturation (nutrient_nitro = nitro), zoo_phyto_consumption, and zoo_phyto_saturation (nutrient_phyto = phyto / (phyto + 0.5)).

A Method for Process Model Construction
The IPM algorithm constructs explanatory models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models subject to additional constraints (e.g., number of processes);
3. For each generic model, carry out search through parameter space to find good coefficients;
4. Return the parameterized model with the best overall score.
Our typical evaluation metric is squared error, but we have also explored other measures of explanatory adequacy.
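The enumeration in stages 1 and 2, and the selection in stage 4, can be sketched as follows. This is a minimal illustration and not the IPM implementation: the role types are taken from the aquatic ecosystem generic processes, the only structural constraint is a cap on the number of processes, and `fit` is a hypothetical callback standing in for the stage 3 parameter search.

```python
from itertools import combinations, permutations

# Variable types and the role types each generic process requires.
TYPES = {"phyto": "species", "zoo": "species",
         "nitro": "nutrient", "residue": "detritus"}
GENERIC = {
    "exponential_loss": ("species", "detritus"),
    "grazing": ("species", "species", "detritus"),
    "nutrient_uptake": ("species", "nutrient"),
    "remineralization": ("nutrient", "detritus"),
    "constant_inflow": ("nutrient",),
}

def instantiations(generic, types):
    """Stage 1: bind each generic process to distinct variables of matching type."""
    result = []
    for name, role_types in generic.items():
        for variables in permutations(types, len(role_types)):
            if all(types[v] == t for v, t in zip(variables, role_types)):
                result.append((name, variables))
    return result

def candidate_structures(instances, max_processes):
    """Stage 2: every set of instantiated processes up to a size limit."""
    for size in range(1, max_processes + 1):
        yield from combinations(instances, size)

def best_model(structures, fit):
    """Stages 3 and 4: fit each structure, keep the lowest-scoring one.

    `fit(structure)` must return (parameters, score), lower being better.
    """
    best = None
    for structure in structures:
        params, score = fit(structure)
        if best is None or score < best[2]:
            best = (structure, params, score)
    return best
```

With these four variables the five generic processes yield eight instantiated processes, so capping models at two processes already gives 36 candidate structures. This growth is why constraints on the structure search matter.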

Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the IPM algorithm:
1. Selects random initial values that fall within ranges specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum;
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search;
4. Restarts the search from a new random initial point if no improvement occurs after N jumps.
This multi-level method gives reasonable fits to time-series data from a number of domains, but it is computationally intensive.
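The control structure of this multi-level search can be sketched in pure Python. To stay dependency-free, the sketch swaps the Levenberg-Marquardt step for a simple coordinate (compass) search, so it shows only the restart-and-jump logic and not the authors' optimizer. The toy model `d[x,t] = -a * x + b`, its data, and all function names and default settings are assumptions made for illustration.

```python
import random

def local_search(loss, x, bounds, step=0.1, tol=1e-5):
    """Compass search within bounds (stand-in for Levenberg-Marquardt)."""
    x, best = list(x), loss(x)
    while step > tol:
        improved = False
        for i, (low, high) in enumerate(bounds):
            for delta in (step, -step):
                cand = list(x)
                cand[i] = max(low, min(high, x[i] + delta))
                value = loss(cand)
                if value < best:
                    x, best, improved = cand, value, True
        if not improved:
            step /= 2  # refine only when no neighbor is better
    return x, best

def estimate(loss, bounds, restarts=3, max_failed_jumps=5, seed=0):
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(restarts):
        # Step 1: random initial values inside the allowed ranges.
        x = [rng.uniform(low, high) for low, high in bounds]
        # Step 2: local improvement to a local optimum.
        x, val = local_search(loss, x, bounds)
        failed = 0
        while failed < max_failed_jumps:
            # Step 3: random jump along one dimension, then continue the search.
            i = rng.randrange(len(bounds))
            cand = list(x)
            cand[i] = rng.uniform(*bounds[i])
            cand, cand_val = local_search(loss, cand, bounds)
            if cand_val < val:
                x, val, failed = cand, cand_val, 0
            else:
                failed += 1
        # Step 4: after too many failed jumps, fall through to a fresh restart.
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

def simulate(a, b, x0=1.0, dt=0.1, steps=50):
    """Euler simulation of the toy model d[x,t] = -a * x + b."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * (-a * xs[-1] + b))
    return xs

DATA = simulate(0.3, 0.1)  # synthetic observations with known parameters

def sse(params):
    """Squared error between simulated and observed trajectories."""
    return sum((m - d) ** 2 for m, d in zip(simulate(params[0], params[1]), DATA))
```

Calling `estimate(sse, [(0.0, 1.0), (0.0, 1.0)])` searches both parameters within the range [0, 1]. Every loss evaluation requires a full simulation, which is consistent with the slide's remark that the method is computationally intensive.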

Observations from the Ross Sea

Results on Training Data from Ross Sea

Results on Test Data from Ross Sea

Results on a Protist Ecosystem

Results on Ringkobing Fjord

Results on Biochemical Kinetics
[Figure: observed trajectories and predicted trajectories]

Interfacing with Scientists
Because few scientists want to be replaced, we are developing an interactive environment, PROMETHEUS, that lets users:
• specify a quantitative process model of the target system;
• display and edit the model’s structure and details graphically;
• simulate the model’s behavior over time and situations;
• compare the model’s predicted behavior to observations;
• invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and evaluating models but lets the user retain control.

Viewing a Process Model Graphically

Indicating Processes to Consider Adding

Specifying Data and Search Parameters

Inspecting Revised Process Models

Intellectual Influences
Our approach to computational discovery incorporates ideas from many traditions:
• computational scientific discovery (e.g., Langley et al., 1983);
• theory revision in machine learning (e.g., Towell, 1991);
• qualitative physics and simulation (e.g., Forbus, 1984);
• languages for scientific simulation (e.g., STELLA, MATLAB);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, AI, programming languages, and human-computer interaction.

Contributions of the Research
In summary, our work on computational scientific discovery has, in responding to various challenges, produced:
• a new formalism for representing scientific process models;
• a computational method for simulating these models’ behavior;
• an encoding for background knowledge as generic processes;
• an algorithm for inducing process models from time-series data;
• an interactive environment for model construction/utilization.
We have demonstrated this approach to model creation on domains from Earth science, microbiology, and engineering.

Some Recent Extensions
In recent work, we have extended our approach to incorporate:
• heuristic beam search through the space of process models;
• hierarchical generic processes that further constrain search;
• an ensemble-like method that mitigates overfitting effects;
• metrics for explanatory adequacy based on trajectory shapes.
Inductive process modeling has great potential to speed progress in systems science and engineering.

End of Presentation
