600 likes | 700 Views
ADBIS - DASFAA'2000 Fourth International Symposium on Advances in Databases and Information Systems September 5-8, 2000, Prague, Czech Republic. Tutorial: Object-Oriented Query Languages and Views Part 1: Basic concepts and issues. Lecturer: Kazimierz Subieta
E N D
ADBIS - DASFAA'2000 Fourth International Symposium on Advances in Databases and Information Systems September 5-8, 2000, Prague, Czech Republic Tutorial:Object-Oriented Query Languagesand ViewsPart 1: Basic concepts and issues Lecturer:Kazimierz Subieta Polish-Japanese Institute of Information Technology, Warsaw, Poland Institute of Computer Science Polish Academy of Sciences, Warsaw, Poland subieta@ipipan.waw.pl http://www.ipipan.waw.pl/~subieta
What Is a Query Language? A lot of views... User-friendly: A language for a person who does not want to knows anything about databases, but wants to operate with them. Theoretical achievement: A syntactic variant of some famous mathematical theory, e.g. logic. Used mainly to produce a next paper to a next conference (:-)). Ad hoc interactive database language: A facility for quick retrieval and simple updates through formalized commands (select...from..., update...set..., ...) or through some simple visual interfaces: forms, graphs, menus, etc. A very-high-level programming construct to be embedded into a popular programming language, e.g. SQL embedded into C. A very-high-level programming constructintegrated with a new programming language, e.g. PL/SQL, many 4GLs, and (recently) SQL3.
Properties of Query Languages Abstract, conceptual, data independent: no concepts related to physical level of data (file organizations, indices, moving data between disk and main memory, etc. Declarative (non-procedural): determines what is to be done rather than how. Macroscopic: from the point of the user a query determines simultaneous (parallel) actions on many data. Natural: supporting a natural way of thinking of the user, supporting conceptual modeling, easy to learn and use. Efficient: acceptable response time (performance) through automatic query optimization. Universal: giving the possibility to express every (reasonable) request. Independent of an application domain: supporting all applications of the given DBMS. Lately bound, interpreted: supporting ad hoc queries and various abstractions stored in a database (views, stored procedures, active rules, etc.).
A Query Language in a Database Environment A tool for an end user enabling him/her interactive (ad hoc) querying, generating reports and updating data stored in a database. Very-high-level conceptual language constructs for database programming. Defining integrity constraints preventing illegal operations/states. Defining subschemas and access restrictions. Defining virtual views, materialized views, derived (replicated) data, and procedures stored in a database. Components of scripts in 4GL-s and RAD tools. Defining active rules (triggers) and deductive rules. Determining data to be selected and transmitted in distributed databases; interoperability between heterogeneous/remote databases (ODBC, JDBC, ...). ...... several other applications
Query Optimization A query language is unacceptable without automatic optimization of queries. Typical optimization methods: Rewriting. A query q1 is substituted with a semantically equivalent query q2 promising better performance. E.g. performing selections before joins or removing dead (not used) parts of queries. The methods have no negative side effects. They require, however, regularity and homogeneity of the language definition and deep understanding of formal semantics; Auxiliary data structures and special data organizations: indices, tables of pointers, hash coding, etc; Caching results of queries and then reusing them (materialized views); Simultaneous optimization of many queries; Selecting an optimal query evaluation plan. Optimization of object-oriented QLs is in infancy. There is a lot of wishful thinking and poorly justified hopes concerning the optimization potential of particular theories. This is one of the reasons of slow adoption of object QLs in the commercial world.
Optimizable and Non-optimizable Queries An entire query language Efficient (optimizable) queries Most useful queries In any query language there are non-optimizable queries. The non-optimizable part of a QL should be as small as possible. In real QLs not all useful queries are optimizable. Poorly defined, informal or irregular features of a QL decrease the optimization potential.
OQL and SQL3 on One Slide OQL is a part of the ODMG standard. It is claimed to be a compatible extension of SQL, but actually OQL retains some syntactic patterns of SQL only. Semantically OQL is very different from SQL, because it follows an object model, which is incompatible with the relational model. OQL does not deal with updating and does not define SQL-like facilities such as views, triggers and stored procedures. OQL statements can be embedded into Java, C++ and Smalltalk, with a lot of impedance mismatch. The semantics of OQL is defined poorly and inconsistently, thus probably it is not fully implementable. SQL3 is a new SQL standard developed by ANSI and ISO. In contrast to its predecessors SQL3 is assumed to be a programming language with full computational power. The main data structure is a table, equipped however with a lot of options (thus using the term “relational” makes no sense). SQL3 supports user-defined abstract data types, including methods, object identifiers, subtypes, inheritance and polymorphism. Further facilities include control statements and parameterized types. Together with an extremely rich collection of various features, SQL3 is claimed to be downward compatible with SQL-92 and follows the sweet select...from...where... syntax. The standard is eclectic and extremely huge (more than 1000 pages), thus probably it is not fully implementable.
Is Java an Alternative to Query Languages? There is a lot of discussion around the role of Java in database programming. Java is a very important language. But... Java offers rather classical low-level (object-oriented) programming. The portability of Java bytecode is low-level. There are many details of database interfaces outside the Java bytecode standard. The Java object model is not powerful enough to be a database model. The Java database interfaces (JDBC, SQLJ,...) are not object-oriented. They present SQL interfaces to relational databases, wrapped into the Java syntax. Java does not solve the problem of storing important abstractions within a database (views, database procedures, triggers, etc.) Java itself is not an answer to database problems. It presents much lover level of database programming than programming via query languages. Java + any query language do not avoid the impedance mismatch.
Requirements to Object Query Languages Conceptual simplicity, generality and minimality: clean and precise semantics, a small set of semantic primitives, no redundant constructs. Pragmatic universality: the possibility to formulate any request. Universality of an approach to semantics: no black holes in the semantics, treating every semantic inconsistency or irregularity as a very big problem. Compositionality and orthogonality: no big syntactic constructs, every reasonable combination of constructs should be allowed. Modularity: the possibility to create complex encapsulated reusable units. Homogeneous and consistent approach to all concepts of the underlying object model: complex objects, classes, types, interfaces, ADTs, inheritance, methods, encapsulation, polymorphism, etc. Homogeneous and consistent approach to integration of the query language with programming constructs (updating, etc.) and abstractions (views, methods, stored procedures, parameters of procedures, etc.). High potential for query optimization.
Coupling a QL with a PL Loose coupling (“embedding”): A QL is developed independently of a PL. An additional interface (“glue”) is implemented, enabling the use of QL within the underlying PL. The complexity of the interface presents a problem. The incompatibility between QL and PL is referred to as impedance mismatch. Typical cases: SQL + C, JDBC+Java, OQL + C++, OQL + Java Advantages: the programmers can deal with their favorite PL, the API to a database is independent of PLs. Disadvantages: aesthetically ugly, a lot of limitations, programs are longer than necessary, tricky programming, poor maintainability. Tight coupling (“seamless integration”): Queries are building blocks for programming constructs. No special interface between QL and programming constructs is provided. Typical cases: Oracle PL/SQL, many 4GLs, DBPL, LOQIS, SQL3 Advantages: a consistent homogeneous solution, no impedance mismatch. Disadvantages: implies a new programming language, which is difficult to promote in the current commercial world.
What is Impedance Mismatch? Incompatibility between a PL and a QL to be embedded. It concerns: Syntax: Two different grammars within one programming interface; Type systems: different types, lack of bulk types in PL, no static typing of QL; Semantics and pragmatics: declarative QL vs. procedural PL; Abstraction levels: data independence of QL vs. deep data dependence of PL; Binding phases and mechanisms: late binding of QL vs. early binding of PL; Name spaces and scoping rules: two incompatible name spaces the programmer deals with, stack based scoping rules of PL are not respected by QL; Null values: they are ignored in PL, special mapping tricks are required; Iteration mechanisms: implicit in QL (selection, projection,...), explicit in PL; Persistence: QL - only persistent data, PL - only volatile data; in PL special mechanisms are required to copy persistent data into volatile memory and v/v. Generic programming: reflection in QL (see dynamic SQL), other techniques (e.g. casting, templates) in PL. Looking at the above, claims that ODMG Java + OQL avoid the impedance mismatch are at least dubious.
Object-Orientedness in Databases The commercial world: manifestos, standards and products - no agreement. OODB Manifesto, 3rd Generation DB Manifesto, Third Manifesto, ODMG standard, SQL3 standard (SQL 1999, SQL 2000), persistent Java, XML as a database model, CORBA as a database model, Gemstone, Versant, O2, ObjectStore, Poet, Objectivity/DB, Uni SQL, Oracle 8, Informix Dynamic Server, and others. Useful. But... Eclectic solutions, legacy burden, design monsters, underspecified semantics, non-universal solutions, redundant constructs, many inconsistencies. The academic world: theories and prototypes - no agreement. Nested relational algebras, F-logic, comprehensions, monoid calculus, object algebras, object calculi, structural recursion, functional approaches, and others. Theories neglect vital aspects of object-orientedness, present wishful thinking (e.g. concerning a mapping from OQL), are limited, are conceptually or mathematically invalid (e.g. object algebras). False stereotypes inherited from the relational model (e.g. concerning the role of an algebra in query optimization). The today’s state-of-the-art is PREMATURE (despite thousands of papers). The stability - not sooner than after 5-10 years. Thus in this tutorial we will discuss concepts without referring to a particular standard, product or theory.
Complex Objects NAME Smith ENO E127 JOB designer JOB analyst WORKS_IN • Many data hierarchy levels (no limitations); • Nested repeating attributes (collections); • Large objects (BLOBs) as regular values; • References (pointers) to other objects. Conceptual modeling EMPLOYEE COMPANY IBM PREV_JOB WHEN 1975-77 COMPANY ICL PHOTO PREV_JOB WHEN 1977-90
Classes Bad definitions: A class is a collection of objects (wrong: what about methods, inheritance, ...?); A class is a blueprint for objects (wrong: only creation of objects is regarded). • Correct definition: • A class is a conceptual (imaginary) entity storing invariant properties of objects. • (Invariant properties are factored out from objects to their classes.) • Typical invariant properties include: • Names and types of objects’ attributes (i.e. a type of an object); • Methods that can be applied to an object. • But invariant properties can be of various kind: • A name of an object (ODMG); • Rules for events or exceptions that can occur on objects (CORBA, ODMG); • Links (pointers, references relationships) to other objects (ODMG); • Active rules (triggers) and integrity constraints; • Default values for attributes; • ... • Sometimes a class is a regular run-time object (Self).
Encapsulation, Interfaces Encapsulation and information hiding are basic principles of any engineering, including software engineering. A TV set encapsulates a lot details, which the user does not need to know. De-encapsulation of these details may result in an electric shock of the user. A similar threat concerns a programmer working on non-encapsulated software. Interface - general definition: It consists of everything that the programmer can use or has to know in order to correctly process the object. An interface should not include unnecessary (physical) details of objects’ construction or operation. Interface - particular definition (CORBA, ODMG, Java): An interface a specification of all public properties (attributes and methods) of an object that the programmer can use in a particular context. Interface is a concept different from class (e.g. classes can be sold, interfaces - not); Interface is a concept different from type (e.g. exceptions are not relevant to types). Bad understanding of encapsulation (all attributes are private, only some methods are public) has led to the absurd thesis on contradiction between encapsulation and query languages.
Inheritance Person name date of birth age Employee name date of birth salary age salary net Student name date of birth faculty grades age average grade Employee salary salary net Student faculty grades average grade If two or more classes have common invariants, then they can be factored out to another class. Hence classes are organized in a hierarchy. An object inherits invariants from its class and from all its superclasses. Multi-inheritance – many superclases are allowed.
Methods and Messages A method is a procedure stored within a class. It acts on an environment consisting of: internal environment (attributes) of the currently processed object; private and public properties of the same class and public properties of all its superclasess; base environment, which includes database, volatile variables/objects of the user session and global environment (environment variables, libraries); public properties of currently active program modules. A message is a call of a method. Message passing does not mean parallel asynchronous communication between autonomous agents (this was false association made by OO pioneers). The usual syntax for messages: object . methodName [( parameters )] e.g. (Person where name = “Smith”). age Query languages can introduce other syntax for messages, e.g. e.g. (Person where age > 30) . name
Types A type is an expression, which constraints the content of objects or value. Types determine input/output properties of operators, functions, procedures and methods. Specification of types is obligatory in strongly typed languages. Types formally restrict a context of the use of objects, operators, methods, etc. Strong (static) type checking (strong typing): every use of objects, operators, functions, methods, etc. is checked at compilation time. Typing safety: more than 80% of programmer’s errors are detected at compilation. Dynamic type checking (dynamic typing): types are checked at run-time. Less efficient w.r.t. detecting errors. ODMG OQL is strongly typed, but the typing system is inconsistent (S.Alagic). SQL and SQL3 are dynamically typed. Strong typing decreases the power of a language (generic programming) and presents a difficulty for developers of DBMS. Thus the attitude of the commercial world to strong typing is undetermined (officially approved, practically neglected).
Links (references, pointers) EMPLOYEE EMPLOYEE EMPLOYEE NAME Jones NAME Smith NAME Brown SALARY 2500 SALARY 2000 SALARY 3500 WORKS_IN WORKS_IN WORKS_IN COMPANY BOSS EMPLOYS EMPLOYS NAME Syntex LOC London EMPLOYS Objects can be connected by explicit pointer links. Links support conceptual modeling (see OMT, UML, etc.). Links can be directly used in queries (through path expressions). They much simplify queries and are more efficient than joins. A path expression in SBQL: Name of the Smith’s boss: (EMPLOEE where NAME = “Smith”).WORKS_IN.COMPANY.BOSS.EMPLOYEE.NAME
Class Extents An extent is a named collection of objects being current members of a class. The concept has roots in the relational model, where a declaration of a table (a protoplast of the class concept) is sticked with creation of the table (i.e. extent). In PLs declarations of types/classes are separated from declaration/creation of corresponding variables/objects. An extent is a bit doubtful concept. For example, having a class Person and its subclass Employee the extent for Person and the extent for Employee has a non-empty intersection: some parts of objects Employee are “virtual” parts of the extent for Person. This can be the source of inconsistencies and programmers’ errors. It is also easy to imagine the situation, where a class must have not one but many extents. E.g. the class EmployeePhotoAlbum has an extent for each employee. Probably, the extent concept gives very little for the database designers and requires additional attention of programmers. In my opinion, it should be dropped.
Collections The relational model deals only with one collection - a table (relation). Object models (ODMG) assume more collections, in particular: sets (no duplicates, no order); bags (duplicates are allowed, no order); sequences (duplicates are allowed, the order is informative); arrays - as sequences, no inserting/deleting elements except top, access through order numbers. Collections can be nested, e.g. collection-valued attributes are allowed. Moreover, collection-valued pointer links are allowed too. Collections have a big meaning for conceptual modeling. Nested collections simplify queries (navigations instead of joins). Typical object-oriented programming languages have no explicit collection types; collections must be modelled by some tricks.
The OODBMS Ideals Orthogonal persistence: the same types can be applied to persistent and volatile objects. Ergo: a query language should process persistent and volatile data uniformly. The ideal much reduces the complexity of a QL. Object relativism: each object consists of objects; there are no other concepts that are used for description of objects. Ergo: (atomic) attributes are objects, links are objects, each object can be a component of a higher-level object, etc. The ideal much reduces the complexity of a QL. Total internal identification: each run-time program entity, which can be separately retrieved, bound, updated, inserted, indexed, protected, locked, etc., must possess an unique internal identifier. Internal identifiers can be used as references (l-values) by various language constructs (e.g. updating). The ideal much simplifies semantics of a QL and makes it consistent. Unfortunately, the above ideals are not respected by commercial OODBMS-s and standards.
Syntax, Semantics and Pragmatics of Languages Each language, including query languages, has three aspects: Syntax: describes how to build correct expressions of the language from the alphabet (basic symbols). Semantics: describes what expressions of the language denote. Semantics is the basis for implementation, in particular for query optimization. Semantics is usually syntax-driven: semantic rules are built on top of syntax rules. Pragmatics: describes how to use the language to accomplish practical needs. It deals with mapping concrete problems or tasks onto expressions of the language. Pragmatics is informal, important for teaching, frequently the most difficult for users. (Even the developers of SQL3 have problems how to use their own creature!) Many languages are explained through syntax and pragmatics. Few languages are specified through precise formal semantics. If semantics is underspecified, then each implementation of the language augments the specification on its own way. This is the reason of low portability of languages.
Semantics of Query Languages Semantics of a query is a function which maps a state into a result. For any query language, what we have to define? Query - a syntactic domain of all queries; Result - a domain containing all possible results of queries; State - a domain containing all possible states. General definitions of semantics: sem : Query (State Result) sem : Query (State (Result State)) sem : Query (State State)) For queries with no side effects; For queries with side effects; For imperative queries (e.g. the update clause of SQL). Formal semantics can be different from implementation (see SQL). Stateless approaches to query languages (logic, algebra) have severe limitations.
Four sides of a query language As follows from the previous slide, description of any query language must involve four sides: Description of data stored in a database that are to be queried, i.e. the description of data/object model (formal definition of the set State); Description of syntax (formal definition of the set Query); Description of results returned by queries (formal definition of the set Result); Description of the mapping from queries into results (formal definition of the mapping sem). Usually in practical languages, these definitions are incomplete, inconsistent or even sloppy. If one has to be serious with implementation and query optimization all these definitions must be as clean and precise as possible. Unfortunately, it is not easy to present all sides in detail during 180 minutes of the tutorial.
What is “state”? For real QLs state is more than database state. A state includes: database: all data, objects, classes, methods, etc. in the database; volatile objects/variables/... of the run-time environment of a user session; local objects/variables/... of all currently executed procedures, functions, methods; global environment: environment variables, libraries, files, etc. A state includes temporary internal structures of the query processing machine: Get 10 best-paid employees: select * from Employee as x where count( select * from Employee as y where y.salary > x.salary ) < 10 If the subquery select * from Employee as y where y.salary > x.salary has to be evaluated independently of the context, then the “free” variable x must be stored on an internal structure of the query processing machine. This internal structure (stack) augments the concept of “state”.
The Closure Property It is a property of a query language (or a theoretical framework) saying that the input and output of queries should belong to the same formal domain. For relational QLs, the input consists of tables and the output is a table. For object-oriented QLs the input is a set of objects and the output is a set of objects too. According to its advocates, the closure property is a condition for nesting queries. I disagree. Essentially, the closure property is a false inconsistent stereotype inherited from the relational model. The closure property does not hold even for SQL. E.g. input tables are named and output tables are unnamed; hence there is a big semantic difference between input and output tables. For object-oriented QLs the closure property is a conceptual nonsense. In particular, it leads to subdivision of queries onto “object preserving” and “object generating”, which is a nonsense too. We will formulate QLs semantics in consistent and formally correct terms, without subdividing queries onto “object preserving” and “object generating”.
Results of Queries In ODMG terms, queries return literals. We generalize this concept. The recursive definition below defines the domain Result: Each atomic value Result. Each reference (to an object, attribute, link, method, view, etc.) Result. If v Result, n is a name, then n(v) Result. Such results will be called binders. If v1, v2, v3, ... Result, then row{ v1, v2, v3, ...} Result. In general, the order of elements in the row is essential. This construct generalizes a tuple known from relational systems. If v1, v2, v3, ... Result, then set{ v1, v2, v3, ... }, bag{ v1, v2, v3, ... }, sequence{ v1, v2, v3, ... }, ... Result. There is no other results. In our terms queries never return objects, but can return references to objects, or more precisely, some structures built upon references, values and names.
Example results of queries 4 i33 i40 i47 “Russell” “Jones” “Black” i15 i2 52 lect(i21) i8 44 lect(i26) p(i1) p(i7 ) • Atomic: • 25, "Smith", i11, i18 • Complex: • row{i1, i21} • bag{ row{i1, i21}, row{i7, i26} } • bag{row{ 2, Lecture(i26), Stud( bag{ • row{ n("Russel"), y(i36) }, • row{ n("Black"), y(i30) }})} i1,i2 ,... - references We present bags of rows as rectangular tables (similarly to relational tables), for example:
Tutorial:Object-Oriented Query Languages and ViewsPart 2: The Stack-Based Approach ADBIS - DASFAA'2000 Fourth International Symposium on Advances in Databases and Information Systems September 5-8, 2000, Prague, Czech Republic Lecturer:Kazimierz Subieta Polish-Japanese Institute of Information Technology, Warsaw, Poland Institute of Computer Science Polish Academy of Sciences, Warsaw, Poland subieta@ipipan.waw.pl http://www.ipipan.waw.pl/~subieta
Why the stack-based approach (SBA) to QLs? The motto of SBA (frequently neglected by database researchers): Each, even apparently small semantic problem is a big problem. We would like to achieve: Universality concerning both data structures an QL/PL functionalities; Modularity and compositionality (the hierarchy of conceptual abstractions); Regularity, full orthogonality of the concepts; Minimality of semantic primitives; Clean and precise (formal) semantics; Uniform approach and full integration with procedural capabilities: updating, procedures, views, methods, etc.; Precise treatment of object-oriented concepts (classes, encapsulation, ...); High potential for query optimization.
The Environment Stack (ES) In PLs the environment stack is a basic mechanism to accomplish: Abstraction: the programmer can abstract from internal details of procedures. Semantic independence and program reuse: the meaning and the behaviour of procedural abstractions is independent on the context of its use. Recursion: A procedure (function, method, view) can call other procedures, in particular, can call itself. Encapsulation of local environments is preserved. Consistent binding: Name x is bound to the most local definition or declaration of x. Other definitions or declarations of the name x should be allowed. Parameter passing: The stack makes it possible to store and manage parameters of procedures and to accomplish consistently parameter passing methods. Proper scoping: a program entity should act only on the data environment and name space that the programmer who has programmed the entity was aware of. In SBA the environment stack has a new role: consistent semantic mechanism for definition and implementation of query operators.
Assumptions of the SBA: syntax Unification of PL expressions and queries: 2+2 (x + y) * z (x + (EMP where (NAME = “Smith”)).SAL) * z EMP, NAME, SAL persistent data x, y, z volatile data } 2, “Smith”, 1000,... x, y, z, EMP, NAME, SAL, ... +,- ,*, /, =, >, where, ., ... binary operators sin, sqrt, sum, count, distinct,... unary operators atomic queries q1, q2 are queries, is a binary operator q1 q2 is a query q is a query, is a unary operator (q) is a query Abstract syntax and compositionality: no big syntactic/semantic constructs, e.g. no famous select ... from ... where ... group by ... having ... order by ... Big constructs decrease orthogonality, maintainability, reusability and optimization potential, are more difficult in implementation, are the reason of irregularities.
Assumptions of the SBA: semantics The naming-scoping-binding principle: Each name occurring in a query is bound to run-time database/program entities (persistent objects, volatile objects, attributes, procedures, parameters, views, methods,...) according to the actual scope for the name. • This concerns: • names of persistent objects; • names of objects’ attributes, sub-attributes, ...; • auxiliary names (“variables”) defined within a query; • names of transient objects, programming variables and their attributes; • names of procedures, methods, operators, ...; • names of parameters of procedures, methods,...; • ..... any other name. • Scopes are organized in ES with the “search-from-the-top” rule. • ES is separated from the object store. • Binding of a name can be multi-valued (macroscopic binding). In SBA the domain State consists of: Object Store + Environment Stack.
An Abstract Store Model later The component of a ‘state”. atomic object link object complex object I - a set of internal identifiers N - a set of external names V - a set of atomic values, blobs, compiled procedures, ... < i, n, v > < i1, n, i2 > < i, n, T > T is a set of objects some obvious constraints (uniqueness of identifiers, referential integrities) A set of objects + A set of identifiers (roots) + Store: No record, tuple, array, set, and bag constructors in the model: essentially all of them are collections of objects (“environments”). No uniqueness of external names on any level of data hierarchy: modeling bulk data. Uniform treatment of relational, object-relational and pure object databases. Classes, inheritance and encapsulation require extension.
Tiny Database DEPT[0..*] DNAME LOC[1..*] EMP[0..*] NAME SAL JOB[0..1] age i5 EMP i1 EMP i9 EMP i6 NAME Smith i10 NAME Jones i2 NAME Brown i7 SAL 2000 i11 SAL 1500 i3 SAL 2500 WORKS_IN i8 WORKS_IN i17 i12 WORKS_IN i17 i4 WORKS_IN i13 i13 DEPT i17 DEPT i18 DNAME Sales i14 DNAME Toys i19 LOC Berlin i15 LOC Paris i16 LOC London
Universality of the Store Model Complex hierarchical objects can be defined (no limits of levels); programming variables are treated as objects; Orthogonal persistence: we abstract from the persistence status of objects and variables, i.e. we define in the same way persistent and transient objects; Object relativity: no difference in treatment of objects on any hierarchy level - big advantage for the universality, minimality and simplicity of semantics. Total internal identification: each entity stored in the store model, including attributes, links, BLOBs, methods, views, etc. has a unique internal identifier; Binary relationships (associations) can be defined via link objects; as in the ODMG model we do not deal with ternary and higher order relationships and attributes of relationships (they must be reduced to binary ones); Bulk data: we deal with sets/bags. They are modeled by the same name assigned to many objects on the same hierarchy level; Relational structures: each tuple is understood as an object with subobjects;
What is binding? Binding is substituting a name occurring in a query or a program by a run-time program entity (entities). • For example: • procedure name occurring in a program is substituted by a call of a machine code; • variable name is substituted by an address of a main memory • attribute name is substituted by an offset relatively to the beginning of a structure; • object name is substituted by an object identifier. Binding is early or static, if the substitution is made before the program is executed (i.e. during compilation and linking). Binding is late or dynamic, if the substitution is made during run time. In query languages binding is usually dynamic, because of dynamic database features (inserting new data, removing data, creating/removing views, etc.) Static binding is sometimes used for optimization.
What is binder? Binder is an internal structure to determine bindings. A binder consists of two parts: An external name defined by the application designer or programmer An internal run-time program entity, e.g. an object identifier, a value, a procedure code. This is an abstract view. In implementation binders may be not so explicit. Binding: for each external name occurring in a query a proper binder is found; then the name is substituted by the corresponding internal entity. Binders are written as n(x), where n is an external name, x is an internal entity. For a binder n( i ) name n may be different from the name of the object identified by i. In query languages binders have additional roles. Binders can be nested, i.e. x may consist of binders. In general, a binder will be understood as a query result equipped with a name. General definition of binders: n N, r Result n( r ) is a binder
The Environment Stack It consists of sections. Each section is a set of binders. The stack is growing and shrinking according to program/query nesting. The most local data are at the top. ...... Binders to local entities of currently executed method ..... NAME(i2) SAL(i3) WORKS_IN(i4) Binders to global entities of the user session EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) Binders to entities of the global environment The section of the currently processed object The section of the “Tiny database” The most global data are at the bottom.
Binding through the environment stack ...... G(“Mary”) X(i221) ..... NAME(i2) SAL(i3) WORKS_IN(i4) ... EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) ... Binding a name - search from the top: Binding ( G ) = “Mary” Binding ( X ) = i221 Binding( SAL ) = i3 Binding( EMP ) = {i1, i5, i9 } Binding( DEPT ) = {i13, i17} • First the top section is visited, then lower sections are visited. • The search is finished after a binder with the proper name is found. • All binders with the proper name form the result of the search.
Opening a new scope by a query operator NAME(i2) SAL(i3) WORKS_IN(i4) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) In PLs opening a new scope (an activation record) at the top of an environment stack is associated with an activation of a block or a procedure. In SBA a new scope at the top of the environment stack is opened to evaluate a query component in the context determined by another component. A context Operator A subquery evaluated in the context EMP where SAL > 1000 The ES state (in one iteration): The new scope opened by where The ES state: EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) SAL is bound to i3 EMP is bound to {i1, i5, i9}
Function “nested” Given identifieriof a complex object, the function nestedreturns binders to direct sub-objects of the object identified by i. For “Tiny database”: A context A subquery evaluated in the context EMP where SAL > 1000 nested( i1 ) = { NAME( i2 ), SAL( i3 ), WORKS_IN( i4 )} nested( i5 ) = { NAME( i6 ), SAL( i7 ), WORKS_IN( i8 )} yields {i1, i5, i9} Function nested determines a new scope - the environment in which the subquery will be evaluated. The scope is pushed at the top of the environment stack. Function nested is naturally generalized for any r Result. If l is a link, then nested( l) returns the binder to the entity that the link points to. If b is a binder, then nested ( b ) returns b (no change). For rows, nested ( row{v1, v2, ...} ) = nested(v1) nested(v1) ...
The Language SBQL A formalized variant of SQL-like languages, including ODMG OQL and SQL3. It is relevant to relational, object-relational and object-oriented models. Abstract syntax, free of sugar. Syntax: Literals, names, unary or binary operators, parentheses. Orthogonality: Examples queries: 1000 EMP SAL 2+2 SAL > 1000 EMP where (SAL > 1000) ((( EMP where (SAL > 1000)) . WORKS_IN ) . DEPT ) . (DNAME, LOC) Relativity: Each query is evaluated relatively to the state of the environment stack. Thus queries such as SAL > 1000 have semantics independent from the context (providing ES contains a SAL binder). Query operators are subdivided into algebraic and non-algebraic. Algebraic operators do not deal with the environment stack. Non-algebraic operators operate on the environment stack.
SBQL - Algebraic Operators • Numerical and string comparisons, operators and functions: • =, <, +, *, concatenation, sqrt, sin, log, ... • Boolean and, or, and not • Aggregate arithmetic functions sum, max, min, avg • Function count, function for removing duplicates, function exists • Equality of complex query results (shallow, deep) • Dereferencing operator (usually implicit) • Coercion operators (changing representation and types); • Operators for bags (union, intersections, difference, equality, ...) • Operators for sets (union, intersections, difference, equality, containment) • Operators for sequences (concatenation, i-th element, sorting,...) • Cartesian product • ... many other operators NAME = “Smith” and SAL > 1000 Examples of use: 2+2 SAL > 1000 implicit dereferencing
SBQL - Declaration of an Auxiliary Name Let return a bag: q Then returns the bag of binders: qasn Unary algebraic operator n - auxiliary name, q - a query returning a single-column table, e.g. bag{x1, x2, ... xn} Syntax: qasn Semantics: x1 x2 ... xk n(x1 ) n(x2 ) ... n(xk ) Each value returned by q is equipped with the name n. Applications: SQL, OQL correlation variables (“synonyms”): Variables bound by quantifiers: Cursors in “for each” statements: Virtual names in SQL-like views. EMP as e DEPT as d EMP as e ( ... ) for each EMP as x do ...
SBQL - Non-algebraic Operators If is a non-algebraic operator, then in q1 q2 the evaluation order of q1 and q2 is essential. Hence non-algebraic operators do not follow basic properties of algebraic expressions. In contrast to relational/object algebras, our non-algebraic operators are not indexed by (informal) meta-language expressions. No informal treatment of names: each name in a query, including names of attributes, links, etc. precisely follows the same scoping-binding discipline. (ordering) q1where q2 q1 ( q2 ) q1order by q2 Syntax: (transitive closure) q1 . q2 q1 ( q2 ) q1closed by q2 q1join q2 dependent join (plus possibly other operators) Semantics - the uniform homogeneous idea: For each elementr of the collection returned by q1 the query q2 is evaluated with the environment stack augmented by nested( r ). A partial result is a combination of r and the result returned by q2. All partial results are merged into the final result. All these operators are implemented in SBQL.
SBQL: Selection NAME( i2 ) SAL( i3 ) WORKS_IN( i4 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) NAME( i6 ) SAL( i7 ) WORKS_IN( i8 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) NAME( i10 ) SAL( i11 ) WORKS_IN( i12 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) For each row r returned by q1 query q2 is evaluated with the stack ES augmented by nested( r ). q1where q2 The row r belongs to the final result, iffq2returns TRUE for it. The final result EMP where SAL > 1800 i1 i5 i9 i3 1800 i1 i5 TRUE (2500) i7 1800 TRUE ES (2000) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) i11 1800 FALSE (1500)
SBQL: Projection, navigation NAME( i2 ) SAL( i3 ) WORKS_IN( i4 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) NAME( i6 ) SAL( i7 ) WORKS_IN( i8 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) NAME( i10 ) SAL( i11 ) WORKS_IN( i12 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) For each row r returned by q1 query q2 is evaluated with the stack ES augmented by nested( r ). q1. q2 The final result is the union of tables returned byq2. EMP . SAL The final result i1 i5 i9 i3 i3 i7 i11 i7 ES EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) i11
SBQL: Navigational (Dependent) Join NAME( i2 ) SAL( i3 ) WORKS_IN( i4 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) NAME( i6 ) SAL( i7 ) WORKS_IN( i8 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) NAME( i10 ) SAL( i11 ) WORKS_IN( i12 ) EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) For each row r returned by q1 query q2 is evaluated with the stack ES augmented by nested( r ). q1joinq2 A partial result is a concatenation of r with each row returned by q2 . The final result is the union of partial results. EMPjoinSAL The final result i1 i5 i9 i3 i1 i3 i5 i7 i9 i11 i7 ES EMP(i1) EMP(i5) EMP(i9) DEPT(i13) DEPT(i17) i11