Introduction to Stanford DB Group Research Li Ruixuan http://cs.hust.edu.cn/rxli/ rxli@public.wh.hb.cn
Contents • Introduction • Past projects • Current projects • Events • References • Links
The Stanford Database Group • “Mainstream” faculty • Hector Garcia-Molina • Jennifer Widom • Jeff Ullman • Gio Wiederhold • “Adjunct” faculty • Chris Manning (natural language processing) • Rajeev Motwani (theory) • Terry Winograd (human-computer interaction) • A.k.a. Stanford InfoLab
Database Group (cont’d) • Approximately 25 Ph.D. students • Varying numbers of M.S. and undergraduate students • Handful of visitors • One senior research associate • One systems administrator, one programmer • Excellent administrative staff • Resident photographer
Research Areas (very coarse) • Digital libraries • Peer-to-peer systems • Data streams • Replication, caching, archiving, broadcast, … • The Web • Ontologies, semantic Web • Data mining • Miscellaneous
Past Projects • LIC: Large-Scale Interoperation and Composition (1999) – mediator (SKC, OntoWeb, CHAIMS, SimQL, image DB) • SKC: Scalable Knowledge Composition (2000) – semantic heterogeneity • TID: Trusted Image Distribution (2001) – image filtering for secure distribution of medical information • Image Database: Content-Based Image Retrieval (2003) • SimQL: Simulation Access Language (2001) – software modules in manufacturing, acquisition, and planning systems
Past Projects (cont’d) • TSIMMIS: Wrapping and mediation for heterogeneous information sources (1998) • Lore: A Database Management System for XML (2000) • WHIPS: WareHouse Information Prototype at Stanford (1998) – data warehouse creation and maintenance • MIDAS: Mining Data at Stanford (1999) • WSQ: Web-Supported Queries (2000) – integrating database queries and Web searches
Current Projects • WebBase: Crawling, storage, indexing, and querying of large collections of Web pages (Garcia-Molina) • STREAM: A Database Management System for Data Streams (Widom) • Peers: Building primitives for peer-to-peer systems (Garcia-Molina) • Digital Libraries: Interoperating on-line services for end-user support (TID, WebBase, OntoAgents) (Garcia-Molina) • TRAPP: Approximate data caching – trading precision for performance (Widom) • CHAIMS: Compiling High-level Access Interfaces for Multi-site Software (1999) (Wiederhold) • OntoAgents: Ontology-based Infrastructure for Agents (2002) (Wiederhold)
WebBase: Objectives • Provide a storage infrastructure for Web-like content • Store a sizeable portion of the Web • Enable researchers to easily build indexes of page features across large sets of pages • Distribute WebBase content via multicast channels • Support structure- and content-based querying over the stored collection
WebBase: Architecture (diagram) – a crawler fetches pages from the WWW into a page repository; indexing and analysis modules build retrieval indexes and a feature repository; a multicast module and query engine serve clients through the Index API and the WebBase API.
WebBase: Current Status • Efficient “smart” crawler • Parallelism • Freshness & Relevance • Efficient and scalable indexing • Distributed Web-scale content indexes • Indexes over graph structure • Unicast dissemination • Within Stanford • External clients: Columbia, U. Washington, U.C. Berkeley
WebBase: In Progress • WebBase Infrastructure • Multicast dissemination • Complex queries • Other work • PageRank extensions • Clustering and similarity search • Structured data extraction • Hidden Web crawling
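The indexing work above includes indexes over the Web's graph structure and PageRank extensions. As a rough illustration of the kind of computation involved, here is a minimal power-iteration PageRank sketch; the graph, damping factor, and function are invented for this example and are not WebBase code.

```python
# Minimal PageRank power iteration -- an illustrative sketch only, not WebBase code.
# The graph is a dict: page -> list of pages it links to.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if not links:                          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy three-page Web: A and B link to each other, C links to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```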
Data Streams: Motivation • Traditional DBMS -- data stored in finite, persistent data sets • New applications -- data as multiple, continuous, rapid, time-varying data streams • Network monitoring and traffic engineering • Security applications • Telecom call records • Financial applications • Web logs and click-streams • Sensor networks • Manufacturing processes
STREAM: Architecture (diagram) – users register queries with the DSMS; input streams and stored relations feed the system, with an archive and a scratch store holding state; answers are delivered as streamed results or stored results.
STREAM: Challenges • Multiple, continuous, rapid, time-varying streams of data • Queries may be continuous (not just one-time) • Evaluated continuously as stream data arrives • Answer updated over time • Queries may be complex • Beyond element-at-a-time processing • Beyond stream-at-a-time processing
DBMS versus DSMS • DBMS: persistent relations; one-time queries; random access; access plan determined by query processor and physical DB design; “unbounded” disk store • DSMS: transient streams (and persistent relations); continuous queries; sequential access; unpredictable data arrival and characteristics; bounded main memory
STREAM: Current Status • Data streams and stored relations • Declarative language for registering continuous queries (illustrated by the sketch below) • Flexible query plans • Designed to cope with high data rates and query workloads • Graceful approximation when needed • Careful resource allocation and usage • Relational, centralized (for now)
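As a minimal illustration of the continuous-query model (the answer is re-evaluated as each stream element arrives), here is a sliding-window average sketch; the function and window size are invented for this example and do not reflect STREAM's declarative language or query plans.

```python
from collections import deque

def continuous_window_avg(stream, window_size=5):
    """Continuously emit the average of the last `window_size` elements.

    Illustrative only: a real DSMS registers a declarative query and builds
    a plan; this just shows 'answer updated as stream data arrives'.
    """
    window = deque(maxlen=window_size)
    for value in stream:              # the stream may be unbounded
        window.append(value)
        yield sum(window) / len(window)

# Example: feed a small finite stream through the continuous query.
for answer in continuous_window_avg([3, 1, 4, 1, 5, 9, 2, 6]):
    print(answer)
```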
STREAM: Ongoing Work • Algebra for streams • Semantics for continuous queries • Synopses and algorithmic issues • Memory management issues • Exploiting constraints on streams • Approximation in query processing • Distributed stream processing • System development
STREAM: Related Work • Amazon/Cougar (Cornell) – sensors • Aurora (Brown/MIT) – sensor monitoring, dataflow • Hancock (AT&T) – telecom streams • Niagara (OGI/Wisconsin) – Internet XML databases • OpenCQ (Georgia) – triggers, incremental view maintenance • STREAM (Stanford) – general-purpose DSMS • Tapestry (Xerox) – pub/sub content-based filtering • Telegraph (Berkeley) – adaptive engine for sensors • Tribeca (Bellcore) – network monitoring
Peer-To-Peer Systems • Multiple sites (at edge) • Distributed resources • Sites are autonomous (different owners) • Sites are both clients and servers • Sites have equal functionality
P2P Benefits • Pooling available (inexpensive) resources • High availability and fault-tolerance • Self-organization
P2P Challenges • Security & Privacy • Anonymity • Reputation • Accountability • Information Preservation • Information Quality • Trust • Denial of service attacks • Search (see the toy sketch below) • Query Expressiveness • Comprehensiveness • Topology • Data Placement • Message Routing • Resource Management • Fairness • Load balancing
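To make the search and message-routing challenges concrete, the sketch below runs a Gnutella-style, TTL-bounded flood over a toy overlay; the topology, TTL value, and function names are invented for illustration and are not the Peers project's protocols.

```python
def flood_search(overlay, start, key, ttl=3):
    """Breadth-first flooding with a hop limit (TTL).

    `overlay` maps peer -> (neighbors, locally stored keys).
    Returns the set of peers that answered the query. Toy sketch only.
    """
    hits, visited = set(), {start}
    frontier = [start]
    for _ in range(ttl):
        next_frontier = []
        for peer in frontier:
            neighbors, keys = overlay[peer]
            if key in keys:                 # this peer can answer the query
                hits.add(peer)
            for n in neighbors:             # forward the query to unvisited neighbors
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return hits

overlay = {
    "A": (["B", "C"], {"song1"}),
    "B": (["A", "D"], set()),
    "C": (["A"], {"song2"}),
    "D": (["B"], {"song2"}),
}
print(flood_search(overlay, "A", "song2"))   # peers C and D hold the key
```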
Peers: Stanford Research • New Architectures • Performance Modeling and Optimization • Security and Trust • Distributed Resource Management • Applications
Digital Library Project: Overview (diagram) – user interfaces and annotations reach libraries, payment institutions, copyright services, and commercial information brokers & providers over the Internet, relying on query/data conversion, search agents, and protocols such as HTTP, Telnet, and Z39.50.
DigLib Projects: DLI1, DLI2 • Resource Discovery • Retrieving Information • Interpreting Information • Managing Information • Sharing Information
DigLib: Resource Discovery • Geographic Views (tools to help users systematically locate different types of information from a large and diverse set of information sources)
DigLib: Retrieving Information • Information Tiling • PalmPilot Infrastructure (PDA) • Power Browsing (PDA) • Query Translator • SDLIP (Simple Digital Library Interoperability Protocol) • Value Filtering • WebBase
DigLib: Interpreting Information • Murals (Tools to help a user interpret and organize search results) • Web Clustering
DigLib: Managing Information • Archival Repositories • Archiving Movie • InterBib (a tool for maintaining bibliographic information) • Medical Transport Info • PhotoBrowser
DigLib: Sharing Information • Diet ORB (PDA, based on MICO) • Digital Wallets • Mobile Info Delivery • Mobile Security • Multicasting
DLI1 Projects (95-99) • RManage/FIRM • SenseMaker • SCAM • Shopping Models, U-PAI • SONIA • STARTS • WebWriter • AHA • ComMentor • DLITE • Google • GLOSS • FAB • Grassroots • Metadata Architecture
TRAPP: Overview • TRAPP: Tradeoff in Replication Precision and Performance • A.k.a: Approximate Data Caching • Project goal: investigating techniques to permit controlled and explicit relaxation of data precision in exchange for improved performance
TRAPP: Motivation • Transactional consistency too expensive • Even nontransactional propagation of every update still too expensive in many cases • Solution: Approximate Caching • Exploit the fact that many applications do not require exact consistency • Avoid propagating insignificant updates • Trade cache precision for network load
Example: TRAPP Over Numeric Data (diagram: a cache holds the intervals [2, 5] and [-1, 0.8]; the sources hold the exact values 3.9 and 0.2 and send refreshes to the cache) • Caches store intervals that bound the exact source values • Sources refresh when a value leaves its interval • Query answers are intervals • Precision constraints specify maximum width
Example (cont’d): Querying in TRAPP (diagram: query X + Y within precision 2, with X cached in [2, 5], Y cached in [-1, 0.8], and exact source values X = 3.9, Y = 0.2; probing X gives the answer [2.9, 4.7], as reproduced in the sketch below) For one-time aggregation queries: • Answers computed by combining approximate cached data and exact source data • At query time: find a low-cost subset of sources to probe so the final answer has adequate precision • Algorithm determined by aggregation function • Some easy, some hard
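The numbers above can be reproduced with simple interval arithmetic; the sketch below mirrors the slide's SUM query over X and Y. It is a worked illustration only, not TRAPP's probe-selection algorithm (which, as noted, depends on the aggregation function); the function name and structure are invented for the example.

```python
def sum_query(bounds, exact, precision, probe):
    """Answer a SUM query over cached intervals, using exact values for probed sources.

    bounds: name -> (low, high) cached interval; exact: name -> exact source value.
    Returns (answer interval, whether the precision constraint is met).
    Illustrative only -- TRAPP picks the probe set with a cost-based algorithm.
    """
    low = high = 0.0
    for name, (lo, hi) in bounds.items():
        if name in probe:                 # pay the probe cost, collapse to the exact value
            lo = hi = exact[name]
        low, high = low + lo, high + hi
    return (low, high), (high - low) <= precision

bounds = {"X": (2.0, 5.0), "Y": (-1.0, 0.8)}
exact = {"X": 3.9, "Y": 0.2}

print(sum_query(bounds, exact, 2.0, probe=set()))    # cached data alone: [1, 5.8], too wide
print(sum_query(bounds, exact, 2.0, probe={"X"}))    # probe X: [2.9, 4.7], matches the slide
```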
TRAPP: Approximate Caching Two common scenarios: • Minimize bandwidth usage, precision fixed • TRAPP: caches store bounds as approximations • Queries select combination of cached & source data • Adaptive bound adjustment for good precision level • Bandwidth fixed, maximize precision • Best-Effort Synchronization: caches store stale copies • Refreshing based on priority scheduling • Global priority order via threshold • Adaptive threshold setting for flow control
TRAPP: Status • Past work: an approximate data caching architecture that permits fine-grained control of the precision-performance tradeoff for numerical data. • Current work: applying these techniques, and others, to more complex data such as Web pages.
CHAIMS: Overview • CHAIMS: Compiling High-level Access Interfaces for Multi-site Software • Objective: Investigate revolutionary approaches to large-scale software composition. • Approach: Develop and validate a composition-only language, a protocol for large, distributed, heterogeneous and autonomous megamodules, and a supporting system. • Planned contributions: • Asynchrony by splitting up the CALL statement (sketched below). • Hardware and software platform independence. • Potential for multi-site dataflow optimization. • Performance optimization by invocation scheduling.
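As a rough analogue of splitting up the CALL statement, the sketch below uses Python futures so that several long-running invocations proceed asynchronously while the caller keeps working; the stub megamodule and step labels are invented for illustration and are not the actual CHAIMS primitives or protocol.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def pricing_megamodule(order):
    """Stand-in for a remote megamodule; invented for this illustration."""
    time.sleep(1)                     # pretend this is a long remote computation
    return {"order": order, "price": 42.0}

executor = ThreadPoolExecutor()

# A monolithic CALL would block here. Splitting invocation from result
# extraction lets the megaprogram start several invocations at once:
invocation = executor.submit(pricing_megamodule, "order-17")   # "invoke" step
other_work = executor.submit(pricing_megamodule, "order-18")   # runs concurrently

# ... the megaprogram can do unrelated work here ...

print(invocation.result())            # "extract" step: collect results when needed
print(other_work.result())
executor.shutdown()
```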
CHAIMS: Overview (diagram) – a megaprogram for composition, written by a domain programmer, is turned by the CHAIMS system into an automatically generated client for the distributed system, which invokes megamodules provided by various megamodule providers.
CHAIMS: Architecture (diagram) – the megamodule provider wraps non-CHAIMS-compliant megamodules using wrapper templates and adds information to the CHAIMS repository; the megaprogrammer writes a megaprogram in the CHAIMS language; the CHAIMS compiler uses the repository information to generate a CSRT (compiled megaprogram), which calls the megamodules through a distribution system (CORBA, RMI, …).
OntoAgents: Objective • OntoAgents goal: establish an agent infrastructure on the WWW or WWW-like networks • Such an agent infrastructure requires an information food chain: every part of the food chain provides information, which enables the existence of the next part.
OntoAgents: Architecture • End User • Ontology Articulation Toolkit • Ontology Construction Tool • Community Portal • Agents • Ontologies • Inference Engine • Metadata Repository • Annotated Webpages • Webpage Annotation Tool
Events: DB Seminars • 2002/2003 – Fall: no seminar; Winter: Database Seminar (CS545); Spring: Genome Databases (CS545G) • 2001/2002 – Fall: Past, Present, and Future of Database Technology; Winter: Genome Databases; Spring: Database Seminar to come • 2000/2001 – Fall: Interoperation, Databases and the Semantic Web; Winter: Image Databases; Spring: Databases and the Semantic Web • 1999/2000 – Fall: Ontologies, E-Commerce, XML & Metadata; Winter: n/a; Spring: Ontologies, E-Commerce, XML & Metadata • 1998/1999 – Fall: Digital Libraries; Winter: Image Databases; Spring: Internet and Databases • 1997/1998 – Fall: Data Warehousing; Winter: Image Databases; Spring: Internet and Databases • 1996/1997 – Fall: Fall Quarter 96; Winter: Image Databases; Spring: Spring Quarter 97
Events: Meetings • Stanford Computer Science Forum – Annual Affiliates Meeting, Stanford, May 2003. • SWiM (the Stream Winter Meeting): about 35 researchers in the data streams area came together at Stanford for SWiM, Jan. 2003. • Stream Team: a few data streams research groups held informal get-togethers, 2002. • Conference talks: ACM SIGMOD/PODS, VLDB, ICDT, ICDE, ICDCS, CIDR
References: WebBase • Junghoo Cho and Hector Garcia-Molina. "Parallel Crawlers," In Proceedings of the Eleventh International World Wide Web Conference, May 2002. • Taher Haveliwala, Aristides Gionis, et al. "Evaluating Strategies for Similarity Search on the Web," In Proceedings of the Eleventh International World Wide Web Conference, May 2002. • Taher Haveliwala. "Topic-Sensitive PageRank," In Proceedings of the Eleventh International World Wide Web Conference, May 2002.
References: STREAM • R. Motwani, J. Widom, et al. "Query Processing, Resource Management, and Approximation in a Data Stream Management System," In Proc. of the 2003 Conference on Innovative Data Systems Research (CIDR), January 2003. • A. Arasu, B. Babcock, et al. "STREAM: The Stanford Stream Data Manager," In Proc. of the ACM Intl. Conf. on Management of Data (SIGMOD 2003), June 2003. • B. Babcock, S. Babu, et al. "Models and Issues in Data Stream Systems," invited paper, In Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.
References: Peers • Neil Daswani, Hector Garcia-Molina, and Beverly Yang. "Open Problems in Data-Sharing Peer-to-Peer Systems," In ICDT, 2003. • Hector Garcia-Molina. "Peer-to-Peer Data Management," keynote, In ICDE, 2002. • Hrishikesh Deshpande, Mayank Bawa, and Hector Garcia-Molina. "Streaming Live Media over a Peer-to-Peer Network."
References: TRAPP • C. Olston and J. Widom. "Best-Effort Cache Synchronization with Source Cooperation," In ACM SIGMOD 2002 International Conference on Management of Data, Madison, Wisconsin, June 2002, pp. 73-84. • C. Olston, B. T. Loo, and J. Widom. "Adaptive Precision Setting for Cached Approximate Values," In ACM SIGMOD 2001 International Conference on Management of Data, Santa Barbara, California, May 2001, pp. 355-366.
Useful Links • Database Group: http://www-db.stanford.edu/ • STREAM: http://www-db.stanford.edu/stream/ • Peers: http://www-db.stanford.edu/peers/ • DigLib: http://www-diglib.stanford.edu/ • TRAPP: http://www-db.stanford.edu/trapp/ • WebBase: http://www-diglib.stanford.edu/~testbed/doc2/WebBase/