1 / 38

CS 519: Big Data Exploration and Analytics

CS 519: Big Data Exploration and Analytics. Relational Query Languages. Announcements . The due data for selecting papers is tomorrow. Your top 3 choices First come, first server Some project ideas will be posted on the course Web site tomorrow. Presentations .

kolina
Download Presentation

CS 519: Big Data Exploration and Analytics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 519: Big Data Exploration and Analytics Relational Query Languages

  2. Announcements • The due data for selecting papers is tomorrow. • Your top 3 choices • First come, first server • Some project ideas will be posted on the course Web site tomorrow.

  3. Presentations • 1 or 2 papers: 45 - 50 minutes presentation. • 20 - 25 minutes discussion • 5 minutes quiz. • Posted articles on how to effectively read scientific papers. • Which parts to read more carefully than others? • Read and analyze the paper critically.

  4. Presentations • The story: • Clear definition of the problem and its applications: Is it important? • What makes this paper different from others • Different version of the problem. Should we care? • Novel solution: Is it sufficiently novel? • Solution • Give a general overview. • Highlight the interesting and important parts.

  5. Presentations • Evaluation, evaluation, evaluation! • Setting: Is it realistic? • Results: Does the paper cover both advantages and disadvantages? • Conclusion and discussion • Strong points (3 or more). How to use them in problems? • Weak points (3 or more) How to address the weak points? • Possible future directions? • Connections with your project/ research?

  6. Presentations • David Paterson’s guide on “How to give a bad presentation” http://www.cs.berkeley.edu/~pattrsn/talks/BadTalk.pdf

  7. Relational Query Languages

  8. Relational model and query languages • Relational model defines data organization • Relational query languages define data retrieval/manipulation operations.

  9. Relational model Relation name Attribute names Book: Title Price Category Year MySQL $102.1 computer 2001 Cell biology $201.69 biology 1954 French cinema $53.99 art 2002 NBA History $63.65 sport 2010 tuples

  10. Relational Model • Attributes • Atomic values • Domain: string, integer, real, • Keys: no duplicate values • Each relation must have keys • A relation does not contain duplicate tuples.

  11. Database Schema vs. Database Instance • Schema of a Relation • Names of the relation and their attributes. • E.g.: Person (Name, Address, SSN) • Domains of / constraints on the attributes. • Schema of the database • Set of relation schemata • E.g.: Person (Name, Address, SSN) Employment(Company, SSN)

  12. Database Schema vs. Database Instance • Schema: S:Book(Title, Price, Category, Year) • Instance: • Values of each attribute A in I: active domain of A, adom(A) Title Price Category Year MySQL $102.1 computer 2001 Cell biology $201.69 biology 1954 French cinema $53.99 art 2002 NBA History $63.65 sport 2010

  13. SQL • A declarative language for querying data stored in relational databases. • Much easier to use than procedural languages. • Say what instead of how SELECT returned attribute(s) FROM table(s) WHERE conditions on the tuples of the table(s)

  14. SQL Example Movie(id, title, year, total-gross) Actor(id, name, b-year) Plays(mid,aid) What movies are made in 1998? SELECT title FROM movie WHERE year = 1998;

  15. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) SQL Example • Find actors who played in a movie whose total gross is more than $2,000,000. SELECT * FROM Actor, Movie, Plays WHERE Movie.id= Plays.midAND Plays.aid= Actor.id AND total-gross > 2000000;

  16. Formal Relational Query Languages • Formal languages that express queries over relational schemas. • Relational Algebra • Datalog (recursion-free with negation) • Relational calculus • Used to explore the properties of relational model. • Easier to use than SQL in some application domains.

  17. Named versus unnamed perspective • Named perspective Movie(id, title, year, total-gross) Actor(id, name, b-year) Plays(mid,aid) • Unnamed perspective Movie: arity 4 Actor: arity 3 Plays: arity 2

  18. Relational Algebra (RA) • Used by RDBMS to execute queries • Six operators • Selection σ • Projection Π • Join ∞ • Union • Difference – • Renaming ρ (for named perspective)

  19. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Relational Algebra • Selection σ • Named perspective σtitle=‘Fargo’ (Movie) • Unnamed perspective σ2=‘Fargo’ (Movie) • Projection Π • Named perspective Πb-year (Actor) • Unnamed perspective Π3 (Actor) • Join ∞ • Named perspective Movie ∞id=mid Plays • Unnamed perspective Movie ∞1=1Plays

  20. Datalog • First created to support recursive queries over relational databases. • Easier to use than relational algebra. • Used extensively in research and industry • Data integration, networking, logic programming, learning, distributed processing, … • We talk about the recursion-free datalog.

  21. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog • Each tuple in database is a fact Movie(236878, ‘Godfather I’, 1972, 40000000) Movie(879900, ‘Godfather II’, 1974, 3900000) Actor(090988,’Robert De Niro’, 1943) • Each query is a rule Movies that were produced in 1998 and made more than $2,000. Q1(y):- Movie(x,y,1998,z),z > 2000.

  22. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog Example • Actors who played in a movie whose total gross is more than $2,000. Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000. • Actors who played in a movie whose total gross is more than $2,000 and a movie made in 1998. Q3(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000, Plays(g,x), Movie(g,l,1998,h).

  23. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000. y: head variable; x, z, t : existential variables • Extensional Database Predicates (EDB) • Movie, Actor, Plays • Intentional Database Predicate (IDB) • Q2 atom atom body head

  24. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog programs • A collection of rules: union • Actors who played in a movie with gross of more than $2,000 or a movie made after 1990. Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000. Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), w > 1990.

  25. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Views • Similar to views in SQL • Actors who played in a movie with gross of more than $2000 and in a movie with ‘Robert De Niro’. V(x,y,z):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 20000. Q5(y):- V(x,y,z),Plays(t,x),Plays(t,f),Actor(f,’Robert De Niro’,g, h). • Unfolding Q5(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 20000,Plays(u,x),Plays(u,f),Actor(f,’Robert De Niro’,g, h). Definition of V

  26. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog with negation • All actors who did not play in a movie with ‘Robert De Niro’. U(x,y,z):- Actor(x,y,z),Plays(t,x),Plays(t,f), Actor(f,’Robert De Niro’,g). Q6(y):- Actor(x,y,z), not U(x,y,z).

  27. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Safe datalog rules • Unsafe rules: V(x,y,z):- Actor(x,y,1998), z > 200. W(x,y,z):- Actor(x,y,z), not Plays(t,x). • A datalog rule is safe if every variable appear in at least one positive predicate

  28. Datalog to SQL • Non-recursive datalog with negation represents the core functionalities of SQL • We can translate each non-recursive datalog program to a core SQL query and vice versa.

  29. Relational calculus • First order logic or predicate calculus • Two approaches: • Domain Relational Calculus (DRC): Unnamed perspective. • Our focus in this lecture. • Tuple Relational Calculus (TRC): Named perspective.

  30. Relational calculus • Each relational predicate P is: • Atom • P ∧ P • P ∨ P • P P • not (P) • Each query Q(x1, …, xn) is a predicate P.

  31. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Relational calculus • Actors who played in a movie with total gross of $2,000. • Actors who played only in movies produced in 1990.

  32. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Relational calculus • Actors who played in some movies with only one actor.

  33. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Domain independent RC • Similar to datalog, one may write unsafe queries in RC, called domain dependent. • RC queries should be domain independent.

  34. Equivalency Theorem • RA, non-recursive datalog with negation and RC express the same set of queries. • Relational queries.

  35. Conjunctive queries (CQ) • One datalog rule. • SELECT-DISTINCT-FROM-WHERE. • Select/project/join (σ, Π, ∞) fragment of RA. • Existential/ conjunctive fragment of RC • There is not any comparison operator (<, ≠, …) in CQ. If used the family is called CQ<, CQ≠, …

  36. Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) CQexamples Actors who played in “LTR”. Q7(y):- Actor(x,y,z),Plays(t,x),Movie(t,’LTR’,w,f). Actors who played in a movie with total gross of $2000. Non-CQ: Actors who played in some movies with only one actor.

  37. Query equivalency and containment • Interesting and long standing problems in database management. • Queries q1 and q2 are equivalent if and only if for every database instance I, q1(I) = q2(I) • Shown as q1 q2 • Query q1is contained in q2if and only if for every database instance I, q1(I) q2(I) • Shown as q1 q2

  38. Containment examples Is q1 q2? q1(x):- R(x,y),R(y,z),R(z,w). q2(x):- R(x,y),R(y,z). q1(x):-R(x,y),R(y,’Joe’). q2(x):-R(x,y),R(y,z). q1(x):- R(x,y),R(y,z),R(z,x). q2(x):- R(x,y),R(y,x).

More Related