Tailoring Data-Intensive Systems via Source Code Generation
Disclaimer
I’m not a database guy! This is a “position talk”!
Generating Data-Intensive Systems
• My view: Data Management = DBMS + Application Code
• My view: Most tailoring on application side!
• My view: Tailoring = Source Code Generation from Models
Generating Data-Intensive Systems • Object-Relational Mappings (ORM)
Generating Data-Intensive Systems
• Object-Relational Mappings (ORM)

create table `Department` (
  `ID` int not null auto_increment,
  `DeptName` varchar(255),
  primary key (`ID`)
)

create table `Staff` (
  `ID` int not null auto_increment,
  `Name` varchar(255),
  `Address` varchar(255),
  `Phone` varchar(255),
  `Email` varchar(255),
  `OfficeExtension` varchar(255),
  `DepartmentID` int not null,
  primary key (`ID`)
)

alter table `Staff`
  add index `FK_Staff_4271` (`DepartmentID`),
  add constraint `FK_Staff_4271` foreign key (`DepartmentID`) references `Department` (`ID`)
Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • based on model transformations • archetypical MDA application • implemented in gazillions of free/academic/commercial tools
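To make the mapping concrete, here is a hypothetical source-side class model (my sketch, not from the slides) from which an ORM tool could derive DDL like the statements above — each class becomes a table, each field a column, and the association a foreign key:

#include <string>

// Illustrative model only; names mirror the generated tables above.
struct Department {
    int id;                       // -> `Department`.`ID` (primary key, auto_increment)
    std::string deptName;         // -> `DeptName` varchar(255)
};

struct Staff {
    int id;                       // -> `Staff`.`ID` (primary key, auto_increment)
    std::string name, address, phone, email, officeExtension;
    Department* department;       // -> `DepartmentID` column + foreign key constraint
};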
Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating complete Database Application Code (CRUD)
Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating complete Database Application Code (CRUD) • often based on template techniques • implemented in gazillions of free/academic/commercial tools
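As an illustration of the template techniques mentioned above (my own sketch, not any particular tool's API), a generator might expand a simple text template over a table description to produce access code:

#include <string>
#include <vector>

// Hypothetical table description; a real generator would read this from a model.
struct Table {
    std::string name;
    std::vector<std::string> columns;
};

// Instantiate a minimal "read by primary key" template for the given table.
std::string generateSelectById(const Table& t) {
    std::string cols;
    for (size_t i = 0; i < t.columns.size(); ++i)
        cols += (i ? ", " : "") + t.columns[i];
    return "SELECT " + cols + " FROM `" + t.name + "` WHERE `ID` = ?";
}

// Example: generateSelectById({"Staff", {"ID", "Name", "Email"}})
// yields "SELECT ID, Name, Email FROM `Staff` WHERE `ID` = ?"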
Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating Database Access Code (CRUD) • Generating Web Applications (WebDSL)
Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating Database Access Code (CRUD) • Generating Web Applications (WebDSL) • variety of techniques • implemented in free/academic/commercial tools
Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating Database Access Code (CRUD) • Generating Web Applications (WebDSL) • Generating Advanced Applications (RBAC) • …
Code Generation for Data Analysis
• Ground cover map:
  • multiple Landsat bands
  • estimate pixel classes
  • estimate class parameters
• Implementation problems:
  • which model?
  • which algorithm?
  • efficient C/C++ code?
  • correctness?
Initial model:

model landsat as 'Landsat Clustering'.
const nat N as 'number of pixels'.
const nat B as 'number of bands'.
const nat C := 5 as 'number of classes' where C << N.
double phi(1..C) as 'class weights' where 1 = sum(I := 1..C, phi(I)).
double mu(1..C), sig(1..C) where 0 < sig(_).
int c(1..N) as 'class assignments'.
c(_) ~ discrete(phi).
data double x(1..N, 1..B) as 'pixels'.
x(I,_) ~ gauss(mu(c(I)), sig(c(I))).
max pr(x | {phi,mu,sig}) for {phi,mu,sig}.

Model refinements:

sig(_) ~ invgamma(delta/2+1, sig0*delta/2).

Model changes:

x(I,_) ~ cauchy(mu(c(I)), sig(c(I))).
x(I,_) ~ mix(c(I) cases 1 -> gauss(0, error), _ -> cauchy(mu(c(I)), sig(c(I)))).

Code Generation for Data Analysis
• Ground cover map:
  • multiple Landsat bands
  • estimate pixel classes
  • estimate class parameters
• Implementation problems:
  • which model?
  • which algorithm?
  • efficient C/C++ code?
  • correctness?
//----------------------------------------------------------------------------//---------------------------------------------------------------------------- // Code file generated by AutoBayes V0.0.1 // (c) 1999-2004 ASE Group, NASA Ames Research Center // Problem: Mixture of Gaussians // Source: examples/cover.ab // Command: ./autobayes // examples/cover.ab // Generated: Wed Mar 10 16:07:54 2004 //---------------------------------------------------------------------------- #include "autobayes.h" //---------------------------------------------------------------------------- // Octave Function: cover //---------------------------------------------------------------------------- DEFUN_DLD(cover,input_args,output_args, "") { octave_value_list retval; if (input_args.length () != 4 || output_args != 3 ){ octave_stdout << "wrong usage"; return retval; } //-- Input declarations ---------------------------------------------------- // Number of classes octave_value arg_n_classes = input_args(0); if (!arg_n_classes.is_real_scalar()){ gripe_wrong_type_arg("n_classes", (const string &)"Scalar expected"); return retval; } double n_classes = (double)(arg_n_classes.double_value()); octave_value arg_x = input_args(1); if (!arg_x.is_real_matrix()){ gripe_wrong_type_arg("x", (const string &)"Matrix expected"); return retval; } Matrix x = (Matrix)(arg_x.matrix_value()); // Iteration tolerance for convergence loop octave_value arg_tolerance1 = input_args(2); if (!arg_tolerance1.is_real_scalar()){ gripe_wrong_type_arg("tolerance1", (const string &)"Scalar expected"); return retval; } double tolerance1 = (double)(arg_tolerance1.double_value()); // maximal number of EM iterations octave_value arg_maxiteration = input_args(3); if (!arg_maxiteration.is_real_scalar()){ gripe_wrong_type_arg("maxiteration", (const string &)"Scalar expected"); return retval; } double maxiteration = (double)(arg_maxiteration.double_value()); //-- Constant declarations ------------------------------------------------- // Number of data dimensions int n_dim = arg_x.rows(); // Number of data points int n_points = arg_x.columns(); // tolerance for appr. 
600 data points double tolerance = 0.0003; //-- Output declarations --------------------------------------------------- Matrix mu(n_dim, n_classes); ColumnVector rho(n_classes); Matrix sigma(n_dim, n_classes); //-- Local declarations ---------------------------------------------------- // class assignment vector ColumnVector c(n_points); // Label: label0 // class membership table used in Discrete EM-algorithm Matrix q(n_points, n_classes); // local centers used for center-based initialization Matrix center(n_classes, n_dim); // Random index of data point int pick; // Loop variable int pv68; int pv65; int pv66; int pv42; int pv43; int pv13; int pv24; int pv56; int pv80; // Summation accumulator double pv83; double pv84; double pv85; // Lagrange-multiplier double l; // Common subexpression double pv52; // Memoized common subexpression ColumnVector pv58(n_classes); // Common subexpression double pv60; int pv70; int pv69; int pv71; Matrix muold(n_dim, n_classes); ColumnVector rhoold(n_classes); Matrix sigmaold(n_dim, n_classes); int pv73; int pv74; int pv75; int pv76; int pv77; // convergence loop counter int loopcounter; // sum up the Diffs double pv86; // Summation accumulator double pv92; double pv94; double pv95; double pv99; double pv98; double pv100; int pv26; int pv49; int pv48; int pv50; // Product accumulator double pv96; int pv53; int pv55; int pv89; int pv88; int pv87; int pv90; int pv91; // Check constraints on inputs ab_assert( 0 < n_classes ); ab_assert( 10 * n_classes < n_points ); ab_assert( 0 < n_dim ); ab_assert( 0 < n_points ); // Label: label1 // Label: label2 // Label: label4 // Discrete EM-algorithm // // The model describes a discrete latent (or hidden) variable problem with // the latent variable c and the data variable x. The problem to optimize // the conditional probability pr(x | {mu, rho, sigma}) w.r.t. the variables // mu, rho, and sigma can thus be solved by an application of the (discrete) // EM-algorithm. // The algorithm maintains as central data structure a class membership // table q (see "label0") such that q(pv13, pv61) is the probability that // data point pv13 belongs to class pv61, i.e., // q(pv13, pv61) == pr([c(pv13) == pv61]) // The algorithm consists of an initialization phase for q (see "label2"), // followed by a convergence phase (see "label5"). // // Initialization // The initialization is center-based, i.e., for each class (i.e., value of // the hidden variable c) a center value center is chosen first // (see "label4"). Then, the values for the local distribution are // calculated as distances between the data points and these center values // (see "label6"). // // Random initialization of the centers center with data points; // note that a data point can be picked as center more than once. 
for( pv65 = 0;pv65 <= n_classes - 1;pv65++ ) { pick = uniform_int_rnd(n_points - 1 - 0); for( pv66 = 0;pv66 <= n_dim - 1;pv66++ ) center(pv65, pv66) = x(pv66, pick); } // Label: label6 for( pv13 = 0;pv13 <= n_points - 1;pv13++ ) for( pv68 = 0;pv68 <= n_classes - 1;pv68++ ) { pv83 = 0; for( pv69 = 0;pv69 <= n_dim - 1;pv69++ ) pv83 += (center(pv68, pv69) - x(pv69, pv13)) * (center(pv68, pv69) - x(pv69, pv13)); pv85 = 0; for( pv70 = 0;pv70 <= n_classes - 1;pv70++ ) { pv84 = 0; for( pv71 = 0;pv71 <= n_dim - 1;pv71++ ) pv84 += (center(pv70, pv71) - x(pv71, pv13)) * (center(pv70, pv71) - x(pv71, pv13)); pv85 += sqrt(pv84); } q(pv13, pv68) = sqrt(pv83) / pv85; } // Label: label5 // EM-loop // // The EM-loop iterates two steps, expectation (or E-Step) (see "label7"), // and maximization (or M-Step) (see "label8"); however, due to the form of // the initialization used here, the are ordered the other way around. The // loop runs until convergence in the values of the variables mu, rho, and // sigma is achieved. loopcounter = 0; // repeat at least once pv86 = tolerance1; while( ((loopcounter < maxiteration) && (pv86 >= tolerance1)) ) { loopcounter = 1 + loopcounter; if ( loopcounter > 1 ) { // assign current values to old values for( pv73 = 0;pv73 <= n_dim - 1;pv73++ ) for( pv74 = 0;pv74 <= n_classes - 1;pv74++ ) muold(pv73, pv74) = mu(pv73, pv74); for( pv75 = 0;pv75 <= n_classes - 1;pv75++ ) rhoold(pv75) = rho(pv75); for( pv76 = 0;pv76 <= n_dim - 1;pv76++ ) for( pv77 = 0;pv77 <= n_classes - 1;pv77++ ) sigmaold(pv76, pv77) = sigma(pv76, pv77); } else ; // Label: label8 label3 // M-Step // // Decomposition I // The problem to optimize the conditional probability pr({c, x} | {mu, // rho, sigma}) w.r.t. the variables mu, rho, and sigma can under the // given dependencies by Bayes rule be decomposed into two independent // subproblems: // max pr(c | rho) for rho // max pr(x | {c, mu, sigma}) for {mu, sigma} // The conditional probability pr(c | rho) is under the dependencies // given in the model equivalent to // prod([pv17 := 0 .. n_points - 1], pr(c(pv17) | rho)) // The probability occuring here is atomic and can thus be replaced by // the respective probability density function given in the model. // Summing out the expected variable c(pv13) yields the log-likelihood // function // sum_domain([pv13 := 0 .. n_points - 1], // [pv18 := 0 .. n_classes - 1], [c(pv13)], q(pv13, pv18), // log(prod([pv17 := 0 .. n_points - 1], rho(c(pv17))))) // which can be simplified to // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) // This function is then optimized w.r.t. the goal variable rho. // The expression // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) // is maximized w.r.t. the variable rho under the constraint // 1 == sum([pv23 := 0 .. n_classes - 1], rho(pv23)) // using the Lagrange-multiplier l. l = n_points; for( pv24 = 0;pv24 <= n_classes - 1;pv24++ ) { // The summand // l // is constant with respect to the goal variable rho(pv24) and can // thus be ignored for maximization. The function // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * // sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) - // l * sum([pv23 := 0 .. n_classes - 1], rho(pv23)) // is then symbolically maximized w.r.t. the goal variable rho(pv24). // The differential // sum([pv17 := 0 .. n_points - 1], q(pv17, pv24)) / rho(pv24) - l // is set to zero; this equation yields the solution // sum([pv26 := 0 .. 
n_points - 1], q(pv26, pv24)) / l pv92 = 0; for( pv26 = 0;pv26 <= n_points - 1;pv26++ ) pv92 += q(pv26, pv24); rho(pv24) = pv92 / l; } // The conditional probability pr(x | {c, mu, sigma}) is under the // dependencies given in the model equivalent to // prod([pv32 := 0 .. n_dim - 1, pv33 := 0 .. n_points - 1], // pr(x(pv32, pv33) | {c(pv33), mu(pv32, *), sigma(pv32, *)})) // The probability occuring here is atomic and can thus be replaced by // the respective probability density function given in the model. // Summing out the expected variable c(pv13) yields the log-likelihood // function // sum_domain([pv13 := 0 .. n_points - 1], // [pv34 := 0 .. n_classes - 1], [c(pv13)], q(pv13, pv34), // log(prod([pv32 := 0 .. n_dim - 1, // pv33 := 0 .. n_points - 1], // exp((x(pv32, pv33) - mu(pv32, c(pv33))) ** 2 / // -2 / sigma(pv32, c(pv33)) ** 2) / // (sqrt(2 * pi) * sigma(pv32, c(pv33)))))) // which can be simplified to // -((n_dim * n_points * (log(2) + log(pi)) + // sum([pv33 := 0 .. n_points - 1, pv34 := 0 .. n_classes - 1], // q(pv33, pv34) * // sum([pv32 := 0 .. n_dim - 1], // (x(pv32, pv33) - mu(pv32, pv34)) ** 2 / // sigma(pv32, pv34) ** 2))) / 2 + // sum([pv34 := 0 .. n_classes - 1], // sum([pv32 := 0 .. n_dim - 1], log(sigma(pv32, pv34))) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv34)))) // This function is then optimized w.r.t. the goal variables mu and // sigma. // The summands // n_dim * n_points * log(2) / -2 // n_dim * n_points * log(pi) / -2 // are constant with respect to the goal variables mu and sigma and can // thus be ignored for maximization. // Index decomposition // The function // -(sum([pv33 := 0 .. n_points - 1, pv34 := 0 .. n_classes - 1], // q(pv33, pv34) * // sum([pv32 := 0 .. n_dim - 1], // (x(pv32, pv33) - mu(pv32, pv34)) ** 2 / // sigma(pv32, pv34) ** 2)) / 2 + // sum([pv34 := 0 .. n_classes - 1], // sum([pv32 := 0 .. n_dim - 1], log(sigma(pv32, pv34))) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv34)))) // can be optimized w.r.t. the variables mu(pv42, pv43) and // sigma(pv42, pv43) element by element (i.e., along the index variables // pv42 and pv43) because there are no dependencies along thats dimensions. for( pv42 = 0;pv42 <= n_dim - 1;pv42++ ) for( pv43 = 0;pv43 <= n_classes - 1;pv43++ ) // The factors // n_classes // n_dim // are non-negative and constant with respect to the goal variables // mu(pv42, pv43) and sigma(pv42, pv43) and can thus be ignored for // maximization. // The function // -(log(sigma(pv42, pv43)) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43)) + // sum([pv33 := 0 .. n_points - 1], // (x(pv42, pv33) - mu(pv42, pv43)) ** 2 * q(pv33, pv43)) / // (2 * sigma(pv42, pv43) ** 2)) // is then symbolically maximized w.r.t. the goal variables // mu(pv42, pv43) and sigma(pv42, pv43). The partial differentials // df / d_mu(pvar(42), pvar(43)) == // (sum([pv33 := 0 .. n_points - 1], // q(pv33, pv43) * x(pv42, pv33)) - // mu(pv42, pv43) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43))) / // sigma(pv42, pv43) ** 2 // df / d_sigma(pvar(42), pvar(43)) == // sum([pv33 := 0 .. n_points - 1], // (x(pv42, pv33) - mu(pv42, pv43)) ** 2 * q(pv33, pv43)) / // sigma(pv42, pv43) ** 3 - // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43)) / // sigma(pv42, pv43) // are set to zero; these equations yield the solutions // mu(pv42, pv43) == // cond(0 == sum([pv46 := 0 .. n_points - 1], q(pv46, pv43)), // fail(division_by_zero), // sum([pv48 := 0 .. n_points - 1], // q(pv48, pv43) * x(pv42, pv48)) / // sum([pv47 := 0 .. 
n_points - 1], q(pv47, pv43))) // sigma(pv42, pv43) == // cond(0 == sum([pv49 := 0 .. n_points - 1], q(pv49, pv43)), // fail(division_by_zero), // sqrt(sum([pv50 := 0 .. n_points - 1], // (x(pv42, pv50) - mu(pv42, pv43)) ** 2 * // q(pv50, pv43)) / // sum([pv51 := 0 .. n_points - 1], q(pv51, pv43)))) { // Initialization of common subexpression pv52 = 0; for( pv49 = 0;pv49 <= n_points - 1;pv49++ ) pv52 += q(pv49, pv43); if ( 0 == pv52 ) { ab_error( division_by_zero ); } else { pv94 = 0; for( pv48 = 0;pv48 <= n_points - 1;pv48++ ) pv94 += q(pv48, pv43) * x(pv42, pv48); mu(pv42, pv43) = pv94 / pv52; } if ( 0 == pv52 ) { ab_error( division_by_zero ); } else { pv95 = 0; for( pv50 = 0;pv50 <= n_points - 1;pv50++ ) pv95 += (x(pv42, pv50) - mu(pv42, pv43)) * (x(pv42, pv50) - mu(pv42, pv43)) * q(pv50, pv43); sigma(pv42, pv43) = sqrt(pv95 / pv52); } } // Label: label7 // E-Step // Update the current values of the class membership table q. for( pv13 = 0;pv13 <= n_points - 1;pv13++ ) { // Initialization of common subexpression for( pv56 = 0;pv56 <= n_classes - 1;pv56++ ) { pv96 = 1; for( pv53 = 0;pv53 <= n_dim - 1;pv53++ ) pv96 *= exp((x(pv53, pv13) - mu(pv53, pv56)) * (x(pv53, pv13) - mu(pv53, pv56)) / (double)(-2) / (sigma(pv53, pv56) * sigma(pv53, pv56))) / (sqrt(M_PI * (double)(2)) * sigma(pv53, pv56)); pv58(pv56) = pv96 * rho(pv56); } pv60 = 0; for( pv55 = 0;pv55 <= n_classes - 1;pv55++ ) pv60 += pv58(pv55); for( pv80 = 0;pv80 <= n_classes - 1;pv80++ ) // The denominator pv60 can become zero due to round-off errors. // In that case, each class is considered to be equally likely. if ( pv60 == 0 ) q(pv13, pv80) = (double)(1) / (double)(n_classes); else q(pv13, pv80) = pv58(pv80) / pv60; } if ( loopcounter > 1 ) { pv98 = 0; for( pv87 = 0;pv87 <= n_dim - 1;pv87++ ) for( pv88 = 0;pv88 <= n_classes - 1;pv88++ ) pv98 += abs(mu(pv87, pv88) - muold(pv87, pv88)) / (abs(mu(pv87, pv88)) + abs(muold(pv87, pv88))); pv99 = 0; for( pv89 = 0;pv89 <= n_classes - 1;pv89++ ) pv99 += abs(rho(pv89) - rhoold(pv89)) / (abs(rho(pv89)) + abs(rhoold(pv89))); pv100 = 0; for( pv90 = 0;pv90 <= n_dim - 1;pv90++ ) for( pv91 = 0;pv91 <= n_classes - 1;pv91++ ) pv100 += abs(sigma(pv90, pv91) - sigmaold(pv90, pv91)) / (abs(sigma(pv90, pv91)) + abs(sigmaold(pv90, pv91))); pv86 = pv98 + pv99 + pv100; } else ; } retval.resize(3); retval(0) = mu; retval(1) = rho; retval(2) = sigma; return retval; } //-- End of code --------------------------------------------------------------- Code Generation for Data Analysis • Generated program: • ~ 1sec. generation time • ≥ 600 lines • ≥ 1:30 leverage • fully documented • deeply nested loops • complex calculations • “correct-by-construction” . . . • Ground cover map: • multiple Landsat-bands • estimate pixel classes • estimate class parameters • Implementation problems: • which model? • which algorithm? • efficient C/C++ code? • correctness?
Transformation Schemas
• Big-step transformations
  • horizontal (model decompositions / transformations)
  • vertical (domain-specific algorithms)
• Implemented as combination of techniques
  • meta-program (check conditions)
  • graph rewriting (transform model)
  • templates (represent code fragments)
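A rough sketch (mine, not the actual tool interface) of how such a big-step schema could be organized — a guard checked by the meta-program, a model rewrite, and code templates:

#include <memory>
#include <string>

struct Model;                          // intermediate model representation (assumed)
struct CodeFragment { std::string text; };

// One transformation schema: applicable() implements the meta-level condition
// check, rewrite() performs the graph rewriting on the model, and emit()
// instantiates the code-fragment templates.
struct Schema {
    virtual ~Schema() = default;
    virtual bool applicable(const Model& m) const = 0;
    virtual std::unique_ptr<Model> rewrite(const Model& m) const = 0;
    virtual CodeFragment emit(const Model& m) const = 0;
};

// An EM schema, for example, would be applicable when the goal maximizes
// pr(x | theta) over a model with a discrete latent variable, and would
// decompose the problem into the E-step and M-step subproblems.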
Generator Assurance
Should you trust a code generator?
• Correctness of the generated code depends on correctness of the generator
• Correctness of the generator is difficult to show
  • very large
  • very complicated
  • very dynamic
So what do you do?

Model:

model mog as 'Mixture of Gaussians'.
[...]
% Class probabilities
double rho(1..c) where 1 = sum(I:=1..c, rho(I)).
% Class parameters
double mu(1..c).
double sigma(1..c) where 0 < sigma(_).
% Hidden variable
nat z(1..n) ~ discrete(rho).
% Data
data double x(1..n).
x(I) ~ gauss(mu(z(I)), sigma(z(I))).
% Goal
max pr(x | {rho,mu,sigma}) for {rho,mu,sigma}.

Generated code:

[...]
// Initialization
for( v44 = 0; v44 <= n-1; v44++ )
  for( v45 = 0; v45 <= c-1; v45++ )
    q(v44, v45) = 0;
for( v46 = 0; v46 <= n-1; v46++ )
  q(v46, z(v46)) = 1;
[...]
for( v12 = 0; v12 <= n-1; v12++ )
  for( v13 = 0; v13 <= c-1; v13++ ) {
    v68 = 0;
    for( v41 = 0; v41 <= c-1; v41++ )
      v68 += exp((x(v12) - mu(v41)) * (x(v12) - mu(v41)) / (double)(-2) / [...]);
[...]
Trust me – I’m a doctor
Generator Assurance Approaches (I)
Correctness-by-construction: Generator is based on a logical framework; code is derived by correctness-preserving transformations
Techniques:
• deductive program synthesis
• refinement and transformation systems
Advantages:
• highest degree of confidence (“proofs as programs”)
Disadvantages:
• expensive – systems difficult to build and maintain
• opaque – correctness argument convoluted (and buried in the generator)
Trust me – I’m an engineer
Generator Assurance Approaches (II)
Generator Qualification: Generator is tested to the same level of criticality as the generated code
Advantages:
• currently the only technique allowed by the FAA
• state of the practice
Disadvantages:
• expensive – testing effort very high
• expensive – re-qualification required after changes
• limited – only partial assurance
• opaque – no explicit correctness argument
Generator Assurance
Should you trust a code generator?
• Correctness of the generated code depends on correctness of the generator
• Correctness of the generator is difficult to show
  • very large
  • very complicated
  • very dynamic
So what???
• Don’t care whether the generator is buggy for other people as long as it works for me now!
Certifiable Code Generation
Don’t trust me – I’m a computer scientist!
Generator Assurance Approaches (III)
Certifiable Code Generation: Generator is extended to support independent post-generation verification
• certify generated programs, not the generator
• product-oriented approach, not process-oriented
Advantages:
• no need to re-certify the generator
• minimizes the trusted component base
(Diagram: the generator produces Code plus Proofs from the Model)
Generator Assurance Approaches (III) Certifiable Code Generation: Generator is extended to support independent post-generation verification • generate code and annotations in parallel
Generator Assurance Approaches (III) Certifiable Code Generation: Generator is extended to support independent post-generation verification • generate code and annotations in parallel Errm… Can you do this? What about errors in the extension? What about the effort?
Generator Assurance Approaches (III)
Certifiable Code Generation: Generator is extended to support independent post-generation verification
• generate code and annotations in parallel
• use standard program verification techniques
  • annotations (pre-/postconditions, loop invariants) give hints only
  • proofs are independently verifiable evidence (certificates)
  • keeps certification independent from code generation
• focus on specific safety properties
  • language-specific: array bounds, partial operators, …
  • domain-specific: units, frames, vector norm, matrix symmetry, …
  • keeps annotation generation tractable
  • keeps certification tractable
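As a concrete, hand-written illustration (not actual AutoBayes output), annotations of this kind for a loop fragment in the style of the generated code above — variable names pv13, pv80, n_points, n_classes, q are borrowed from it — might look as follows, targeting the array-bounds and init-before-use properties:

// pre:  n_points > 0 && n_classes > 0
for( pv13 = 0; pv13 <= n_points - 1; pv13++ )
    // invariant: 0 <= pv13 && pv13 <= n_points
    for( pv80 = 0; pv80 <= n_classes - 1; pv80++ )
        // invariant: 0 <= pv80 && pv80 <= n_classes
        // safety obligation (array bounds): pv13 < n_points && pv80 < n_classes
        q(pv13, pv80) = (double)(1) / (double)(n_classes);
// post: q(i, j) assigned for all 0 <= i < n_points, 0 <= j < n_classes

The annotations only hint at why the accesses are safe; the prover still has to discharge the generated obligations, so a wrong annotation can make certification fail but not make an unsafe program pass.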
Certification Framework Safety property: formal characterization of intuitively safe programs “All automatic variables shall have been assigned a value before being used” (MISRA 9.1) Formal: • introduce “shadow variables” to record safety information • extend operational semantics by effects on shadow variables:
Certification Framework Safety property: formal characterization of intuitively safe programs “All automatic variables shall have been assigned a value before being used” (MISRA 9.1) Formal: • introduce “shadow variables” to record safety information • extend operational semantics by effects on shadow variables • define semantic safety judgements on expressions/statements:
Certification Framework Safety property: formal characterization of intuitively safe programs “All automatic variables shall have been assigned a value before being used” (MISRA 9.1) Formal: • introduce “shadow variables” to record safety information • extend operational semantics by effects on shadow variables • define semantic safety judgements on expressions/statements • prove safety reduction (i.e., consistency of safety property): “safe programs don’t go wrong”
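Purely as an illustration (the slides' original formulas are not reproduced in this transcript), for the init-before-use property one could introduce a Boolean shadow variable $x_{\mathit{init}}$ for every program variable $x$ and extend the semantics and judgements roughly as follows:

\[
\begin{aligned}
\text{declaration of } x:&\quad \sigma' = \sigma[x_{\mathit{init}} \mapsto \mathit{false}]\\
\text{assignment } x := e:&\quad \sigma' = \sigma[x \mapsto \sigma(e),\ x_{\mathit{init}} \mapsto \mathit{true}]\\
\text{safety judgement:}&\quad \sigma \vDash \mathrm{safe}(e) \iff \sigma(y_{\mathit{init}}) = \mathit{true}\ \text{for every variable } y \text{ read by } e
\end{aligned}
\]

Under such a semantics, "safe programs don't go wrong": a program in which every read satisfies the safety judgement never uses an uninitialized value.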
Certification Framework Safety policy: proof rules designed to show that safety property holds for program • responsible for • maintenance of shadow variables • construction of safety obligations • extend Hoare-rules by safety predicate and shadow variables:
Certification Framework Safety policy: proof rules designed to show that safety property holds for program • responsible for • maintenance of shadow variables • construction of safety obligations • extend Hoare-rules by safety predicate and shadow variables • define safety predicate corresponding to safety judgements:
Certification Framework Safety policy: proof rules designed to show that safety property holds for program • responsible for • maintenance of shadow variables • construction of safety obligations • extend Hoare-rules by safety predicate and shadow variables • define safety predicate corresponding to safety judgements • prove soundness and completeness:
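Again only as an illustration (not the original slide content), the Hoare rule for assignment, extended with the safety predicate and the shadow-variable update for init-before-use, could be sketched as:

\[
\{\ \mathrm{safe}(e) \wedge P[e/x,\ \mathit{true}/x_{\mathit{init}}]\ \}\ \ x := e\ \ \{\ P\ \}
\qquad\text{with}\qquad
\mathrm{safe}(e) \equiv \bigwedge_{y \in \mathrm{vars}(e)} y_{\mathit{init}} = \mathit{true}
\]

Soundness here would mean that discharging the generated obligations implies the semantic safety judgement above; completeness, that every semantically safe program can be annotated and certified this way.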
Certification Architecture Approach: • generate and refine annotations in parallel with code • generate safety obligations from annotated code(i.e., VCG applies safety policy to program) • simplify, prove, & check
Conclusions • Data Management = DBMS + Application Code • Most tailoring is on application side • Tailoring via source code generation from models • Generator assurance