Introduction to Stanford DB Group Research Li Ruixuan http://cs.hust.edu.cn/rxli/ rxli@public.wh.hb.cn
Contents • Introduction • Past projects • Current projects • Events • References • Links
The Stanford Database Group • “Mainstream” faculty • Hector Garcia-Molina • Jennifer Widom • Jeff Ullman • Gio Wiederhold • “Adjunct” faculty • Chris Manning (natural language processing) • Rajeev Motwani (theory) • Terry Winograd (human-computer interaction) • A.k.a. Stanford InfoLab
Database Group (cont’d) • Approximately 25 Ph.D. students • Varying numbers of M.S. and undergraduate students • Handful of visitors • One senior research associate • One systems administrator, one programmer • Excellent administrative staff • Resident photographer
Research Areas (very coarse) • Digital libraries • Peer-to-peer systems • Data streams • Replication, caching, archiving, broadcast, … • The Web • Ontologies, semantic Web • Data mining • Miscellaneous
Past Projects • LIC: Large-Scale Interoperation and Composition (1999) – mediator (SKC, OntoWeb, CHAIMS, SimQL, image DB) • SKC: Scalable Knowledge Composition (2000) – semantic heterogeneity • TID: Trusted Image Distribution (2001) – image filtering for secure distribution of medical information • Image Database: Content-Based Image Retrieval (2003) • SimQL: Simulation Access Language (2001) – software modules in manufacturing, acquisition, and planning systems
Past Projects (cont’d) • TSIMMIS: Wrapping and mediation for heterogeneous information sources (1998) • Lore: A Database Management System for XML (2000) • WHIPS: WareHouse Information Prototype at Stanford (1998) – data warehouse creation and maintenance • MIDAS: Mining Data at Stanford (1999) • WSQ: Web-Supported Queries (2000) – integrating database queries and Web searches
Current Projects • WebBase: Crawling, storage, indexing, and querying of large collections of Web pages (Garcia-Molina) • STREAM: A Database Management System for Data Streams (Widom) • Peers: Building primitives for peer-to-peer systems (Garcia-Molina) • Digital Libraries: Interoperating on-line services for end-user support (TID, WebBase, OntoAgents) (Garcia-Molina) • TRAPP: Approximate data caching – trading precision for performance (Widom) • CHAIMS: Compiling High-level Access Interfaces for Multi-site Software (1999) (Wiederhold) • OntoAgents: Ontology-based Infrastructure for Agents (2002) (Wiederhold)
WebBase: Objectives • Provide a storage infrastructure for Web-like content • Store a sizeable portion of the Web • Enable researchers to easily build indexes of page features across large sets of pages • Distribute WebBase content via multicast channels • Support structure- and content-based querying over the stored collection
WebBase: Architecture (diagram) – a crawler fetches pages from the WWW into a page repository; indexing and analysis modules build retrieval indexes and a feature repository; a multicast module and query engine serve clients through the Index API and the WebBase API.
WebBase: Current Status • Efficient “smart” crawler • Parallelism • Freshness & Relevance • Efficient and scalable indexing • Distributed Web-scale content indexes • Indexes over graph structure • Unicast dissemination • Within Stanford • External clients: Columbia, U. Washington, U.C. Berkeley
WebBase: In Progress • WebBase Infrastructure • Multicast dissemination • Complex queries • Other work • PageRank extensions • Clustering and similarity search • Structured data extraction • Hidden Web crawling
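The indexing work above includes indexes over the Web's graph structure and PageRank extensions. As a rough illustration of the kind of computation involved, here is a minimal power-iteration PageRank sketch; the graph, damping factor, and function are invented for this example and are not WebBase code.

```python
# Minimal PageRank power iteration -- an illustrative sketch only, not WebBase code.
# The graph is a dict: page -> list of pages it links to.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if not links:                          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy three-page Web: A and B link to each other, C links to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```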
Data Streams: Motivation • Traditional DBMS -- data stored in finite, persistent data sets • New applications -- data as multiple, continuous, rapid, time-varying data streams • Network monitoring and traffic engineering • Security applications • Telecom call records • Financial applications • Web logs and click-streams • Sensor networks • Manufacturing processes
STREAM: Architecture (diagram) – users register queries with the DSMS; input streams and stored relations feed the system, with an archive and a scratch store holding state; answers are delivered as streamed results or stored results.
STREAM: Challenges • Multiple, continuous, rapid, time-varying streams of data • Queries may be continuous (not just one-time) • Evaluated continuously as stream data arrives • Answer updated over time • Queries may be complex • Beyond element-at-a-time processing • Beyond stream-at-a-time processing
DBMS versus DSMS • DBMS: persistent relations; one-time queries; random access; access plan determined by query processor and physical DB design; “unbounded” disk store • DSMS: transient streams (and persistent relations); continuous queries; sequential access; unpredictable data arrival and characteristics; bounded main memory
STREAM: Current Status • Data streams and stored relations • Declarative language for registering continuous queries (illustrated by the sketch below) • Flexible query plans • Designed to cope with high data rates and query workloads • Graceful approximation when needed • Careful resource allocation and usage • Relational, centralized (for now)
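As a minimal illustration of the continuous-query model (the answer is re-evaluated as each stream element arrives), here is a sliding-window average sketch; the function and window size are invented for this example and do not reflect STREAM's declarative language or query plans.

```python
from collections import deque

def continuous_window_avg(stream, window_size=5):
    """Continuously emit the average of the last `window_size` elements.

    Illustrative only: a real DSMS registers a declarative query and builds
    a plan; this just shows 'answer updated as stream data arrives'.
    """
    window = deque(maxlen=window_size)
    for value in stream:              # the stream may be unbounded
        window.append(value)
        yield sum(window) / len(window)

# Example: feed a small finite stream through the continuous query.
for answer in continuous_window_avg([3, 1, 4, 1, 5, 9, 2, 6]):
    print(answer)
```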
STREAM: Ongoing Work • Algebra for streams • Semantics for continuous queries • Synopses and algorithmic issues • Memory management issues • Exploiting constraints on streams • Approximation in query processing • Distributed stream processing • System development
STREAM: Related Work • Amazon/Cougar (Cornell) – sensors • Aurora (Brown/MIT) – sensor monitoring, dataflow • Hancock (AT&T) – telecom streams • Niagara (OGI/Wisconsin) – Internet XML databases • OpenCQ (Georgia) – triggers, incremental view maintenance • STREAM (Stanford) – general-purpose DSMS • Tapestry (Xerox) – pub/sub content-based filtering • Telegraph (Berkeley) – adaptive engine for sensors • Tribeca (Bellcore) – network monitoring
Peer-To-Peer Systems • Multiple sites (at edge) • Distributed resources • Sites are autonomous (different owners) • Sites are both clients and servers • Sites have equal functionality
P2P Benefits • Pooling available (inexpensive) resources • High availability and fault-tolerance • Self-organization
P2P Challenges • Security & Privacy • Anonymity • Reputation • Accountability • Information Preservation • Information Quality • Trust • Denial of service attacks • Search (see the toy sketch below) • Query Expressiveness • Comprehensiveness • Topology • Data Placement • Message Routing • Resource Management • Fairness • Load balancing
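To make the search and message-routing challenges concrete, the sketch below runs a Gnutella-style, TTL-bounded flood over a toy overlay; the topology, TTL value, and function names are invented for illustration and are not the Peers project's protocols.

```python
def flood_search(overlay, start, key, ttl=3):
    """Breadth-first flooding with a hop limit (TTL).

    `overlay` maps peer -> (neighbors, locally stored keys).
    Returns the set of peers that answered the query. Toy sketch only.
    """
    hits, visited = set(), {start}
    frontier = [start]
    for _ in range(ttl):
        next_frontier = []
        for peer in frontier:
            neighbors, keys = overlay[peer]
            if key in keys:                 # this peer can answer the query
                hits.add(peer)
            for n in neighbors:             # forward the query to unvisited neighbors
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return hits

overlay = {
    "A": (["B", "C"], {"song1"}),
    "B": (["A", "D"], set()),
    "C": (["A"], {"song2"}),
    "D": (["B"], {"song2"}),
}
print(flood_search(overlay, "A", "song2"))   # peers C and D hold the key
```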
Peers: Stanford Research • New Architectures • Performance Modeling and Optimization • Security and Trust • Distributed Resource Management • Applications
Digital Library Project: Overview (diagram) – user interfaces and annotations reach libraries, payment institutions, copyright services, and commercial information brokers & providers over the Internet, relying on query/data conversion, search agents, and protocols such as HTTP, Telnet, and Z39.50.
DigLib Projects: DLI1, DLI2 • Resource Discovery • Retrieving Information • Interpreting Information • Managing Information • Sharing Information
DigLib: Resource Discovery • Geographic Views (tools to help users systematically locate different types of information from a large and diverse set of information sources)
DigLib: Retrieving Information • Information Tiling • PalmPilot Infrastructure (PDA) • Power Browsing (PDA) • Query Translator • SDLIP (Simple Digital Library Interoperability Protocol) • Value Filtering • WebBase
DigLib: Interpreting Information • Murals (Tools to help a user interpret and organize search results) • Web Clustering
DigLib: Managing Information • Archival Repositories • Archiving Movie • InterBib (a tool for maintaining bibliographic information) • Medical Transport Info • PhotoBrowser
DigLib: Sharing Information • Diet ORB (PDA, based on MICO) • Digital Wallets • Mobile Info Delivery • Mobile Security • Multicasting
DLI1 Projects (95-99) • RManage/FIRM • SenseMaker • SCAM • Shopping Models, U-PAI • SONIA • STARTS • WebWriter • AHA • ComMentor • DLITE • Google • GLOSS • FAB • Grassroots • Metadata Architecture
TRAPP: Overview • TRAPP: Tradeoff in Replication Precision and Performance • A.k.a: Approximate Data Caching • Project goal: investigating techniques to permit controlled and explicit relaxation of data precision in exchange for improved performance
TRAPP: Motivation • Transactional consistency too expensive • Even nontransactional propagation of every update still too expensive in many cases • Solution: Approximate Caching • Exploit the fact that many applications do not require exact consistency • Avoid propagating insignificant updates • Trade cache precision for network load
Example: TRAPP Over Numeric Data (diagram: a cache holds the intervals [2, 5] and [-1, 0.8]; the sources hold the exact values 3.9 and 0.2 and send refreshes to the cache) • Caches store intervals that bound the exact source values • Sources refresh when a value leaves its interval • Query answers are intervals • Precision constraints specify maximum width
Example (cont’d): Querying in TRAPP (diagram: query X + Y within precision 2, with X cached in [2, 5], Y cached in [-1, 0.8], and exact source values X = 3.9, Y = 0.2; probing X gives the answer [2.9, 4.7], as reproduced in the sketch below) For one-time aggregation queries: • Answers computed by combining approximate cached data and exact source data • At query time: find a low-cost subset of sources to probe so the final answer has adequate precision • Algorithm determined by aggregation function • Some easy, some hard
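The numbers above can be reproduced with simple interval arithmetic; the sketch below mirrors the slide's SUM query over X and Y. It is a worked illustration only, not TRAPP's probe-selection algorithm (which, as noted, depends on the aggregation function); the function name and structure are invented for the example.

```python
def sum_query(bounds, exact, precision, probe):
    """Answer a SUM query over cached intervals, using exact values for probed sources.

    bounds: name -> (low, high) cached interval; exact: name -> exact source value.
    Returns (answer interval, whether the precision constraint is met).
    Illustrative only -- TRAPP picks the probe set with a cost-based algorithm.
    """
    low = high = 0.0
    for name, (lo, hi) in bounds.items():
        if name in probe:                 # pay the probe cost, collapse to the exact value
            lo = hi = exact[name]
        low, high = low + lo, high + hi
    return (low, high), (high - low) <= precision

bounds = {"X": (2.0, 5.0), "Y": (-1.0, 0.8)}
exact = {"X": 3.9, "Y": 0.2}

print(sum_query(bounds, exact, 2.0, probe=set()))    # cached data alone: [1, 5.8], too wide
print(sum_query(bounds, exact, 2.0, probe={"X"}))    # probe X: [2.9, 4.7], matches the slide
```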
TRAPP: Approximate Caching Two common scenarios: • Minimize bandwidth usage, precision fixed • TRAPP: caches store bounds as approximations • Queries select combination of cached & source data • Adaptive bound adjustment for good precision level • Bandwidth fixed, maximize precision • Best-Effort Synchronization: caches store stale copies • Refreshing based on priority scheduling • Global priority order via threshold • Adaptive threshold setting for flow control
TRAPP: Status • Past work: an approximate data caching architecture that permits fine-grained control of the precision-performance tradeoff for numerical data. • Current work: applying these techniques, and others, to more complex data such as Web pages.
CHAIMS: Overview • CHAIMS: Compiling High-level Access Interfaces for Multi-site Software • Objective: Investigate revolutionary approaches to large-scale software composition. • Approach: Develop and validate a composition-only language, a protocol for large, distributed, heterogeneous and autonomous megamodules, and a supporting system. • Planned contributions: • Asynchrony by splitting up the CALL statement (sketched below). • Hardware and software platform independence. • Potential for multi-site dataflow optimization. • Performance optimization by invocation scheduling.
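As a rough analogue of splitting up the CALL statement, the sketch below uses Python futures so that several long-running invocations proceed asynchronously while the caller keeps working; the stub megamodule and step labels are invented for illustration and are not the actual CHAIMS primitives or protocol.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def pricing_megamodule(order):
    """Stand-in for a remote megamodule; invented for this illustration."""
    time.sleep(1)                     # pretend this is a long remote computation
    return {"order": order, "price": 42.0}

executor = ThreadPoolExecutor()

# A monolithic CALL would block here. Splitting invocation from result
# extraction lets the megaprogram start several invocations at once:
invocation = executor.submit(pricing_megamodule, "order-17")   # "invoke" step
other_work = executor.submit(pricing_megamodule, "order-18")   # runs concurrently

# ... the megaprogram can do unrelated work here ...

print(invocation.result())            # "extract" step: collect results when needed
print(other_work.result())
executor.shutdown()
```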
CHAIMS: Overview (diagram) – a megaprogram for composition, written by a domain programmer, is turned by the CHAIMS system into an automatically generated client for the distributed system, which invokes megamodules provided by various megamodule providers.
CHAIMS: Architecture (diagram) – the megamodule provider wraps non-CHAIMS-compliant megamodules using wrapper templates and adds information to the CHAIMS repository; the megaprogrammer writes a megaprogram in the CHAIMS language; the CHAIMS compiler uses the repository information to generate a CSRT (compiled megaprogram), which calls the megamodules through a distribution system (CORBA, RMI, …).
OntoAgents: Objective • OntoAgents goal: establish an agent infrastructure on the WWW or WWW-like networks • Such an agent infrastructure requires an information food chain: every part of the food chain provides information, which enables the existence of the next part.
OntoAgents: Architecture • End User • Ontology Articulation Toolkit • Ontology Construction Tool • Community Portal • Agents • Ontologies • Inference Engine • Metadata Repository • Annotated Webpages • Webpage Annotation Tool
Events: DB Seminars • 2002/2003 – Fall: no seminar; Winter: Database Seminar (CS545); Spring: Genome Databases (CS545G) • 2001/2002 – Fall: Past, Present, and Future of Database Technology; Winter: Genome Databases; Spring: Database Seminar to come • 2000/2001 – Fall: Interoperation, Databases and the Semantic Web; Winter: Image Databases; Spring: Databases and the Semantic Web • 1999/2000 – Fall: Ontologies, E-Commerce, XML & Metadata; Winter: n/a; Spring: Ontologies, E-Commerce, XML & Metadata • 1998/1999 – Fall: Digital Libraries; Winter: Image Databases; Spring: Internet and Databases • 1997/1998 – Fall: Data Warehousing; Winter: Image Databases; Spring: Internet and Databases • 1996/1997 – Fall: Fall Quarter 96; Winter: Image Databases; Spring: Spring Quarter 97
Events: Meetings • Stanford Computer Science Forum – Annual Affiliates Meeting, Stanford, May 2003. • SWiM (the Stream Winter Meeting): about 35 researchers in the data streams area came together at Stanford for SWiM, Jan. 2003. • Stream Team: a few data streams research groups held informal get-togethers, 2002. • Conference talks: ACM SIGMOD/PODS, VLDB, ICDT, ICDE, ICDCS, CIDR
References: WebBase • Junghoo Cho and Hector Garcia-Molina. "Parallel Crawlers," In Proceedings of the Eleventh International World Wide Web Conference, May 2002. • Taher Haveliwala, Aristides Gionis, et al. "Evaluating Strategies for Similarity Search on the Web," In Proceedings of the Eleventh International World Wide Web Conference, May 2002. • Taher Haveliwala. "Topic-Sensitive PageRank," In Proceedings of the Eleventh International World Wide Web Conference, May 2002.
References: STREAM • R. Motwani, J. Widom, et al. "Query Processing, Resource Management, and Approximation in a Data Stream Management System," In Proc. of the 2003 Conference on Innovative Data Systems Research (CIDR), January 2003. • A. Arasu, B. Babcock, et al. "STREAM: The Stanford Stream Data Manager," In Proc. of the ACM Intl. Conf. on Management of Data (SIGMOD 2003), June 2003. • B. Babcock, S. Babu, et al. "Models and Issues in Data Stream Systems," invited paper, In Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.
References: Peers • Neil Daswani, Hector Garcia-Molina, and Beverly Yang. "Open Problems in Data-Sharing Peer-to-Peer Systems," In ICDT, 2003. • Hector Garcia-Molina. "Peer-to-Peer Data Management," keynote, In ICDE, 2002. • Hrishikesh Deshpande, Mayank Bawa, and Hector Garcia-Molina. "Streaming Live Media over a Peer-to-Peer Network."
References: TRAPP • C. Olston and J. Widom. "Best-Effort Cache Synchronization with Source Cooperation," In ACM SIGMOD 2002 International Conference on Management of Data, Madison, Wisconsin, June 2002, pp. 73-84. • C. Olston, B. T. Loo, and J. Widom. "Adaptive Precision Setting for Cached Approximate Values," In ACM SIGMOD 2001 International Conference on Management of Data, Santa Barbara, California, May 2001, pp. 355-366.
Useful Links • Database Group: http://www-db.stanford.edu/ • STREAM: http://www-db.stanford.edu/stream/ • Peers: http://www-db.stanford.edu/peers/ • DigLib: http://www-diglib.stanford.edu/ • TRAPP: http://www-db.stanford.edu/trapp/ • WebBase: http://www-diglib.stanford.edu/~testbed/doc2/WebBase/