Readings in Data Management, Spring 2008 Computer Science Department Rutgers University
Seminar Information • Web page: http://www.cs.rutgers.edu/~amelie/courses/dbseminar.html • Meets Thursday 1-2:30pm in CoRE A
Organization • Weekly presentation on a DB topic (30 minutes) • We will select 2-3 topics to focus on over the course of the semester • For each topic • First week: overview paper (survey, influential work) • Subsequent weeks: more complex papers on the subject • Possibly a few external presentations such as: • Students preparing for DB conference talks or quals • Invited speakers • Discussion on the paper
Topics • First Topic: Probabilistic Databases • We will select next topics from (non-exhaustive list): • Question answering • Web Search • Personal Information Spaces • Query Optimization • Data Cleaning • Data Integration • Data Mining • Query Processing Techniques • Adaptive, Automatic, Autonomic Systems • OLAP • Stream Aggregation • Storage, Indexing, and System Architecture • XML Processing • Preference functions • Spatial and High-Dimensional Data • Recovery • Privacy in DBMS • …
What I expect from you • 1-2 presentations over the course of the semester • First-year students will be given “overview” presentation assignments at the beginning of each topic • More senior students will present more research-focused papers • Number of presentations depends on the number of students in the seminar • Everyone should read the paper in advance and prepare 1-2 questions/discussion topics • Participation in discussion • There are no “stupid” questions! If you did not understand something, chances are others did not either
Presentations • I will select a list of papers to present for each topic • Start with an introductory paper • Then papers that go deeper into one or more aspects of the problem • You are welcome to suggest papers on the topic, as long as they are related (so that we can have more meaningful discussions) • Papers that I have overlooked • Papers on a different aspect of the topic that you would like to focus on
First topic: Probabilistic Databases • Uncertainty/Imprecision in data • Query Semantics • Probabilistic Data Representation Next few slides from Dan Suciu’s tutorial, more at
Databases Today are Deterministic • An item either is in the database or is not • A tuple either is in the query answer or is not • This applies to all variety of data models: • Relational, E/R, NF2, hierarchical, XML, …
What is a Probabilistic Database? • “An item belongs to the database” is a probabilistic event • “A tuple is an answer to the query” is a probabilistic event • Can be extended to all data models
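The two probabilistic events above can be made concrete with a small sketch. It assumes a tuple-independent model (each row is present independently with its own probability), which is only one of several possible representations. The table `SIGHTINGS`, its probabilities, and the helper names are hypothetical and chosen purely for illustration. The sketch enumerates every possible world and sums the probability of the worlds in which a tuple appears in the query answer.

```python
from itertools import product

# Hypothetical tuple-independent table: (row, probability that the row exists).
SIGHTINGS = [
    (("alice", "lab"), 0.9),
    (("bob", "lab"), 0.5),
    (("bob", "cafe"), 0.4),
]

def possible_worlds(table):
    """Yield (world, probability) for every subset of the table's rows."""
    for mask in product([True, False], repeat=len(table)):
        world = [row for (row, _), keep in zip(table, mask) if keep]
        prob = 1.0
        for (_, p), keep in zip(table, mask):
            # Independence assumption: multiply per-row probabilities.
            prob *= p if keep else 1.0 - p
        yield world, prob

def answer_probabilities(table, query):
    """For each answer tuple, sum the probability of the worlds that return it."""
    result = {}
    for world, prob in possible_worlds(table):
        for answer in set(query(world)):
            result[answer] = result.get(answer, 0.0) + prob
    return result

def people_seen(world):
    """Example query: project the person column of a deterministic world."""
    return [person for person, _ in world]

probs = answer_probabilities(SIGHTINGS, people_seen)
for person, p in sorted(probs.items()):
    print(f"{person}: {p:.2f}")
```

Enumerating worlds takes time exponential in the number of rows, so this is useful only as a definition of the semantics on toy data, not as an evaluation strategy.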
Two Types of Probabilistic Data • Database is deterministic; query answers are probabilistic • Database is probabilistic; query answers are probabilistic
Long History Probabilistic relational databases have been studied from the late 80’s until today: • Cavallo&Pitarelli:1987 • Barbara,Garcia-Molina, Porter:1992 • Lakshmanan,Leone,Ross&Subrahmanian:1997 • Fuhr&Roellke:1997 • Dalvi&S:2004 • Widom:2005
So, Why Now ? Application pull: • The need to manage imprecisions in data Technology push: • Advances in query processing techniques
Application Pull Need to manage imprecisions in data • Many types: non-matching data values, imprecise queries, inconsistent data, misaligned schemas, etc. The quest to manage imprecisions = major driving force in the database community • Ultimate cause for many research areas: data mining, semistructured data, schema matching, nearest neighbor
Technology Push Processing probabilistic data is fundamentally more complex than other data models • Some previous approaches sidestepped complexity There exists a rich collection of powerful, non-trivial techniques and results, some old, some very recent, that could lead to practical management techniques for probabilistic databases.
Suggested Papers to Discuss • Nilesh Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB 2004. • Minos Garofalakis et al.: Probabilistic Data Management for Pervasive Computing: The Data Furnace Project. IEEE Data Eng. Bull. 29(1), 2006. • Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Jennifer Widom: An Introduction to ULDBs and the Trio System. IEEE Data Eng. Bull. 29(1), 2006. • Prithviraj Sen, Amol Deshpande: Representing and Querying Correlated Tuples in Probabilistic Databases. ICDE 2007.