240 likes | 390 Views
The History of Datalog. Origins Failure Resurrection. An Odd Encounter. Several years ago, I met a colleague, Monica Lam, in the hallway at Stanford. “I hear you were involved in the early work on Datalog.”
E N D
The History of Datalog Origins Failure Resurrection
An Odd Encounter • Several years ago, I met a colleague, Monica Lam, in the hallway at Stanford. • “I hear you were involved in the early work on Datalog.” • She had discovered this work and used it in her system for large-scale data-flow analysis.
Odd Encounter – (2) • The application is naturally recursive. • Very large-scale (analyzed code of 800K lines). • They (Monica and her student John Whaley) had an implementation bddbddb that compiled Datalog rules into BDD’s (binary decision diagrams).
Where Did Datalog Come From? • Codd’s tuple and domain calculus (1972). • Gallaire and Minker’s “Logic and Databases” (1978). • Prolog (1976).
Codd’s Logics • TRC. { t | R(r) and S(s) and t.A = r.A and r.B = s.B and t.C = s.C } • Implemented by Stonebraker as QUEL. • DRC. { ac | R(ab) and S(bc) } • Implemented by Zloof as Query-by-Example.
“Logic and Databases” • Viewed queries as the result of an entire logical theory. • Thus allows recursion, negation, theories with multiple minimal models. • Closed/open-world evaluations.
Prolog • A conventional programming language with predicates as function calls. • Bizarre execution rule. • Example: you have to write TC as: path(X,Y) :- arc(X,Y). path(X,Y) :- arc(X,Z), path(Z,Y).
Implementation of Logical Query Languages for Databases • In 1984 I took sabbatical at Hebrew University and wrote a paper with the above title. • It has some crazy stuff that makes me wonder “what was I thinking?” • Much was fixed by others, later. • Published in SIGMOD (no real theorems!).
Implementation – (2) • Key idea: Prolog notation + Horn-clause, unique fixedpoint semantics. • Key idea: It’s about algorithms for query execution, not logical models. • Original thought in that direction was really by Henschen and Naqvi.
Enter “Datalog” • The term “Datalog” to refer to positive Horn clauses without function symbols was first proposed by Dave Maier and David S. (“the other”) Warren. • Appears in their book Programming with Logic (1988), but in common use before that.
Good Implementation Ideas • Seminaive evaluation (Bancilhon and Ramakrishnan, 1986 – also in SIGMOD). • Specialized linear-recursion implementations (many people including Naughton, Ramakrishnan, Sagiv, Vardi,…). • Magic sets (Beeri and Ramakrishnan, 1987 – finally something got into PODS).
Magic Sets • A query-rewriting scheme. • Similar in effect to a number of query-execution ideas such as • Query-Subquery (Rohmer, Lescoeur, and Kerasit, 1986). • Memoing (Dietrich and Warren, 1985).
Negation • With negated subgoals in Datalog • Example: bachelor(X) :- male(X), NOT married(X,Y) you run the risk of multiple minimal models. • Stratified model (Chandra-Harel, 1982; Apt, Blair, Walker, 1985). • Well-founded semantics (Van Gelder, Ross, Schlipf, 1988).
The Death of Datalog • Recursion turned out not to be all that important in the world of the 1980’s. • In the AI community, where logic was taken more seriously than in DB, the emphasis was on expressiveness, not tractability.
The Rebirth • Datalog slept, but nothing could take away its important virtues: • Simplicity and declarativeness. • Tractability. • Simple execution engine. • While “rule-based systems” were long an AI staple, they never got these features of Datalog.
bddbddb • Why did Monica Lam think of Datalog for data-flow analysis? • Classical DFA was for code optimization. • Only inner loops are important, so data never needed to get really large.
bddbddb – (2) • Monica was looking at a different application: software security. • Example: can a string read at one point be passed to a SQL call without first being the argument of a function that checks safety? • Entire program analyzed as a whole. • Example: 800K lines of Apache. • Now it’s a database problem.
Overlog and Dedalus • At about the same time, Joe Hellerstein was experimenting with Datalog, first for prototyping and later for the real implementation. • General direction: protocols for distributed systems.
Overlog and Dedalus – (2) • Two important additions: time and space as first-class concepts. • Example (space): Assume each node has a table of arcs out. • arc(@n, h) means the table at node n contains an arc to node h.
Example – Continued • Each node n computes the set of nodes it can reach by consulting the reach sets for the nodes to which n has arcs. reach(@n, m) :- arc(@n, h), reach(@h, m).
Some Other Datalog Directions • Webdamlog (Abiteboul et al., these proceedings). • Adds creation of rules at remote sites. • PrPl (Lam et al.). • Social networking in Datalog. • SecPAL (Becker et al.). • Microsoft authorization language translated to Datalog.
Other Directions – (2) • LogicBlox (Molham Aref, CEO). • Startup in Atlanta GA. • One of several Datalog-based startups. • Uses Datalog for customized decision-support systems. • Many extensions, including controlled 2nd –order predicates. • Still has a tractable, straightforward execution model.
Conclusions • Too early to tell how important Datalog will be. • Will simplicity and tractability beat expressiveness? • But moving in the right direction(s) now. • From Datalog 2.0 Workshop: needs an open-source standard, like mySQL.