
Introduction to Stanford DB Group Research


Presentation Transcript


  1. Introduction to Stanford DB Group Research Li Ruixuan http://cs.hust.edu.cn/rxli/ rxli@public.wh.hb.cn

  2. Contents • Introduction • Past projects • Current projects • Events • References • Links

  3. The Stanford Database Group • “Mainstream” faculty • Hector Garcia-Molina • Jennifer Widom • Jeff Ullman • Gio Wiederhold • “Adjunct” faculty • Chris Manning (natural language processing) • Rajeev Motwani (theory) • Terry Winograd (human-computer interaction) • A.k.a. Stanford InfoLab

  4. Database Group (cont’d) • Approximately 25 Ph.D. students • Varying numbers of M.S. and undergraduate students • Handful of visitors • One senior research associate • One systems administrator, one programmer • Excellent administrative staff • Resident photographer

  5. Research Areas (very coarse) • Digital libraries • Peer-to-peer systems • Data streams • Replication, caching, archiving, broadcast, … • The Web • Ontologies, semantic Web • Data mining • Miscellaneous

  6. Past Projects • LIC: Large-Scale Interoperation and Composition (1999) – mediator (SKC, OntoWeb, CHAIMS, SimQL, image DB) • SKC: Scalable Knowledge Composition (2000) – semantic heterogeneity • TID: Trusted Image Distribution (2001) – image filtering for secure distribution of medical information • Image Database: Content-based Image Retrieval (2003) • SimQL: Simulation Access Language (2001) – software modules in manufacturing, acquisition, and planning systems

  7. Past Projects (cont’d) • TSIMMIS: Wrapping and mediation for heterogeneous information sources (1998) • Lore: A Database Management System for XML (2000) • WHIPS: WareHouse Information Prototype at Stanford (1998) – data warehouse creation and maintenance • MIDAS: Mining Data at Stanford (1999) • WSQ: Web-Supported Queries (2000) – integrating database queries and Web searches

  8. Current Projects • WebBase: Crawling, storage, indexing, and querying of large collections of Web pages (Molina) • STREAM: A Database Management System for Data Streams (Widom) • Peers: Building primitives for peer-to-peer systems (Molina) • Digital Libraries: Interoperating on-line services for end-user support (TID, WebBase, OntoAgents) (Molina) • TRAPP: Approximate data caching: trading precision for performance (Widom) • CHAIMS: Compiling High-level Access Interfaces for Multi-site Software (1999) (Wiederhold) • OntoAgents: Ontology-based Infrastructure for Agents (2002) (Wiederhold)

  9. WebBase: Objectives • Provide a storage infrastructure for Web-like content • Store a sizeable portion of the Web • Enable researchers to easily build indexes of page features across large sets of pages • Distribute WebBase content via multicast channels • Support structure- and content-based querying over the stored collection

  10. WebBase: Architecture • Diagram: the Crawler fetches pages from the WWW into a Page Repository; Indexing and Analysis Modules build Retrieval Indexes and a Feature Repository; an Index API and the WebBase API expose the collection to the Query Engine, the Multicast Module, and clients

  11. WebBase: Current Status • Efficient “smart” crawler • Parallelism • Freshness & relevance • Efficient and scalable indexing • Distributed Web-scale content indexes • Indexes over graph structure • Unicast dissemination • Within Stanford • External clients: Columbia, U. Washington, U.C. Berkeley

  12. WebBase: In Progress • WebBase Infrastructure • Multicast dissemination • Complex queries • Other work • PageRank extensions • Clustering and similarity search • Structured data extraction • Hidden Web crawling
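The "PageRank extensions" work above builds on basic PageRank. A minimal sketch of the underlying power iteration, on a made-up three-page link graph (the graph and iteration count are illustrative, not from the slides):

```python
# Sketch of basic PageRank by power iteration; the link graph below
# is a made-up illustration, not WebBase data.
DAMPING = 0.85

def pagerank(links, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Start each page with the "teleport" share, then add link mass.
        new = {p: (1 - DAMPING) / n for p in pages}
        for p, outs in links.items():
            if not outs:                      # dangling page: spread uniformly
                for q in pages:
                    new[q] += DAMPING * rank[p] / n
            else:
                share = DAMPING * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)  # ranks sum to ~1; "c" (most linked-to) scores highest
```

Topic-sensitive PageRank (cited in the references below) replaces the uniform teleport term with a topic-biased distribution.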

  13. Data Streams: Motivation • Traditional DBMS -- data stored in finite, persistent data sets • New applications -- data as multiple, continuous, rapid, time-varying data streams • Network monitoring and traffic engineering • Security applications • Telecom call records • Financial applications • Web logs and click-streams • Sensor networks • Manufacturing processes

  14. STREAM: Architecture • Diagram: input streams enter the DSMS; users register queries, whose answers are delivered as a streamed result and/or a stored result; the DSMS draws on a scratch store, an archive, and stored relations

  15. STREAM: Challenges • Multiple, continuous, rapid, time-varying streams of data • Queries may be continuous (not just one-time) • Evaluated continuously as stream data arrives • Answer updated over time • Queries may be complex • Beyond element-at-a-time processing • Beyond stream-at-a-time processing
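A continuous query is evaluated incrementally as elements arrive, with its answer updated over time. A minimal sketch of the idea (the query, class, and stream contents are illustrative, not STREAM's actual interface):

```python
# Minimal sketch of a continuous query: a running per-source average,
# re-evaluated incrementally on each arriving element rather than once
# over a stored table. Names and data are illustrative.
from collections import defaultdict

class ContinuousAvg:
    """Continuous query: SELECT source, AVG(value) ... GROUP BY source."""
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_arrival(self, source, value):
        # Incremental update -- no rescan of past stream elements.
        self.count[source] += 1
        self.total[source] += value
        return {s: self.total[s] / self.count[s] for s in self.count}

q = ContinuousAvg()
stream = [("sensor1", 10.0), ("sensor2", 4.0), ("sensor1", 20.0)]
for src, val in stream:
    answer = q.on_arrival(src, val)   # answer updated over time
print(answer)  # {'sensor1': 15.0, 'sensor2': 4.0}
```

"Beyond element-at-a-time processing" means real queries also need windows, joins across streams, and shared state, which this one-operator sketch omits.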

  16. DBMS versus DSMS • DBMS: persistent relations • one-time queries • random access • access plan determined by query processor and physical DB design • “unbounded” disk store • DSMS: transient streams (and persistent relations) • continuous queries • sequential access • unpredictable data arrival and characteristics • bounded main memory

  17. STREAM: Current Status • Data streams and stored relations • Declarative language for registering continuous queries • Flexible query plans • Designed to cope with high data rates and query workloads • Graceful approximation when needed • Careful resource allocation and usage • Relational, centralized (for now)

  18. STREAM: Ongoing Work • Algebra for streams • Semantics for continuous queries • Synopses and algorithmic issues • Memory management issues • Exploiting constraints on streams • Approximation in query processing • Distributed stream processing • System development

  19. STREAM: Related Work • Amazon/Cougar (Cornell) – sensors • Aurora (Brown/MIT) – sensor monitoring, dataflow • Hancock (AT&T) – telecom streams • Niagara (OGI/Wisconsin) – Internet XML databases • OpenCQ (Georgia) – triggers, incr. view maintenance • STREAM (Stanford) – general-purpose DSMS • Tapestry (Xerox) – pub/sub content-based filtering • Telegraph (Berkeley) – adaptive engine for sensors • Tribeca (Bellcore) – network monitoring

  20. Peer-To-Peer Systems • Multiple sites (at edge) • Distributed resources • Sites are autonomous (different owners) • Sites are both clients and servers • Sites have equal functionality

  21. P2P Benefits • Pooling available (inexpensive) resources • High availability and fault-tolerance • Self-organization

  22. P2P Challenges • Security & Privacy • Anonymity • Reputation • Accountability • Information Preservation • Information Quality • Trust • Denial-of-service attacks • Search • Query Expressiveness • Comprehensiveness • Topology • Data Placement • Message Routing • Resource Management • Fairness • Load Balancing

  23. Peers: Stanford Research • New Architectures • Performance Modeling and Optimization • Security and Trust • Distributed Resource Management • Applications

  24. Digital Library Project: Overview • Diagram: user interfaces and annotations connect over the Internet (via HTTP, Telnet, Z39.50, query/data conversion, and search agents) to commercial information brokers & providers, libraries, payment institutions, and copyright services

  25. DigLib Projects: DLI1, DLI2 • Resource Discovery • Retrieving Information • Interpreting Information • Managing Information • Sharing Information

  26. DigLib: Resource Discovery • Geographic Views (Tools to assist you in more systematically locating different types of information from a large and diverse number of information sources)

  27. DigLib: Retrieving Information • Information Tiling • PalmPilot Infrastructure (PDA) • Power Browsing (PDA) • Query Translator • SDLIP (Simple Digital Library Interoperability Protocol) • Value Filtering • WebBase

  28. DigLib: Interpreting Information • Murals (Tools to help a user interpret and organize search results) • Web Clustering

  29. DigLib: Managing Information • Archival Repositories • Archiving Movie • InterBib (a tool for maintaining bibliographic information) • Medical Transport Info • PhotoBrowser

  30. DigLib: Sharing Information • Diet ORB (PDA, based on MICO) • Digital Wallets • Mobile Info Delivery • Mobile Security • Multicasting

  31. DLI1 Projects (95-99) • RManage/FIRM • SenseMaker • SCAM • Shopping Models, U-PAI • SONIA • STARTS • WebWriter • AHA • ComMentor • DLITE • Google • GLOSS • FAB • Grassroots • Metadata Architecture

  32. TRAPP: Overview • TRAPP: Tradeoff in Replication Precision and Performance • A.k.a: Approximate Data Caching • Project goal: investigating techniques to permit controlled and explicit relaxation of data precision in exchange for improved performance

  33. TRAPP: Motivation • Transactional consistency too expensive • Even nontransactional propagation of every update still too expensive in many cases • Solution: Approximate Caching • Exploit the fact that many applications do not require exact consistency • Avoid propagating insignificant updates • Trade cache precision for network load

  34. Example: TRAPP Over Numeric Data • Diagram: two sources hold exact values 3.9 and 0.2; the cache holds the bounding intervals [2, 5] and [-1, 0.8]; sources send refreshes when a value leaves its interval • Caches store intervals that bound the exact source values • Sources refresh when value leaves interval • Query answers are intervals • Precision constraints specify maximum width

  35. Example (cont’d): Querying in TRAPP • Diagram: query X + Y (within 2) over cached intervals X = [2, 5] and Y = [-1, 0.8]; probing source X (exact value 3.9) yields the answer [2.9, 4.7] • For one-time aggregation queries: • Answers computed by combining approximate cached data and exact source data • At query time: find a low-cost subset of sources to probe so the final answer will have adequate precision • Algorithm determined by aggregation function • Some easy, some hard
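The querying idea above can be sketched for a SUM query: combine cached intervals with exact probes of a few sources until the answer interval meets the precision constraint. The greedy widest-first probe rule below is an illustrative simplification; as the slide notes, the real probe-selection algorithm depends on the aggregation function and source costs.

```python
# Hedged sketch of TRAPP-style querying for a one-time SUM query.
# The widest-first probe heuristic is an assumption for illustration.
def answer_sum(cached, exact, precision):
    """cached: {src: (lo, hi)} intervals; exact: {src: value} at sources."""
    bounds = dict(cached)

    def width():
        # Width of the answer interval for a SUM is the sum of widths.
        return sum(hi - lo for lo, hi in bounds.values())

    # Probe the source with the widest interval until precise enough.
    while width() > precision:
        src = max(bounds, key=lambda s: bounds[s][1] - bounds[s][0])
        v = exact[src]            # probe: fetch the exact source value
        bounds[src] = (v, v)      # collapsed interval after the probe

    lo = sum(l for l, _ in bounds.values())
    hi = sum(h for _, h in bounds.values())
    return (lo, hi)

cached = {"X": (2.0, 5.0), "Y": (-1.0, 0.8)}
exact = {"X": 3.9, "Y": 0.2}
print(answer_sum(cached, exact, precision=2.0))  # close to (2.9, 4.7)
```

With the slide's numbers, probing X alone suffices: the remaining interval for Y has width 1.8 ≤ 2, so only one source is contacted.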

  36. TRAPP: Approximate Caching Two common scenarios: • Minimize bandwidth usage, precision fixed • TRAPP: caches store bounds as approximations • Queries select combination of cached & source data • Adaptive bound adjustment for good precision level • Bandwidth fixed, maximize precision • Best-Effort Synchronization: caches store stale copies • Refreshing based on priority scheduling • Global priority order via threshold • Adaptive threshold setting for flow control

  37. TRAPP: Status • Past work: an approximate data caching architecture that permits fine-grained control of the precision-performance tradeoff for numerical data • Current work: applying these techniques, and others, to more complex data such as Web pages

  38. CHAIMS: Overview • CHAIMS: Compiling High-level Access Interfaces for Multi-site Software • Objective: Investigate revolutionary approaches to large-scale software composition. • Approach: Develop and validate a composition-only language, a protocol for large, distributed, heterogeneous and autonomous megamodules, and a supporting system. • Planned contributions: • Asynchrony by splitting up the CALL statement. • Hardware and software platform independence. • Potential for multi-site dataflow optimization. • Performance optimization by invocation scheduling.

  39. CHAIMS: Overview • Diagram: a megaprogram for composition, written by a domain programmer, composes megamodules provided by various megamodule providers; the CHAIMS system automates generation of the client for the distributed system

  40. CHAIMS: Architecture • Diagram: the megaprogrammer writes a megaprogram (in the CHAIMS language); the megamodule provider wraps non-CHAIMS-compliant megamodules using wrapper templates; both add information to the CHAIMS Repository; the CHAIMS Compiler uses that information to generate a CSRT (compiled megaprogram), which invokes megamodules through a distribution system (CORBA, RMI, …)

  41. OntoAgents: Objective • OntoAgents goal: establish an agent infrastructure on the WWW or WWW-like networks • Such an agent infrastructure requires an information food chain: every part of the food chain provides information, which enables the existence of the next part.

  42. OntoAgents: Architecture • End User • Ontology Articulation Toolkit • Ontology Construction Tool • Community Portal • Agents • Ontologies • Inference Engine • Metadata Repository • Annotated Webpages • Webpage Annotation Tool

  43. Events: DB Seminars
  Academic Year | Fall | Winter | Spring
  2002/2003 | no seminar in the fall quarter | Database Seminar (CS545) | Genome Databases (CS545G)
  2001/2002 | Past, Present, and Future of Database Technology | Genome Databases | Database Seminar to come
  2000/2001 | Interoperation, Databases and the Semantic Web | Image Databases | Databases and the Semantic Web
  1999/2000 | Ontologies, E-Commerce, XML & Metadata | n/a | Ontologies, E-Commerce, XML & Metadata
  1998/1999 | Digital Libraries | Image Databases | Internet and Databases
  1997/1998 | Data Warehousing | Image Databases | Internet and Databases
  1996/1997 | Fall Quarter 96 | Image Databases | Spring Quarter 97

  44. Events: Meetings • Stanford Computer Science Forum – Annual Affiliates Meeting, Stanford, May 2003. • SWiM (the Stream Winter Meeting): about 35 researchers in the data streams area came together at Stanford for SWiM, Jan. 2003. • Stream Team: a few data streams research groups held informal get-togethers, 2002. • Conference talks: ACM SIGMOD/PODS, VLDB, ICDT, ICDE, ICDCS, CIDR

  45. References: WebBase • Junghoo Cho, Hector Garcia-Molina. "Parallel Crawlers." In Proceedings of the Eleventh International World Wide Web Conference, May 2002. • Taher Haveliwala, Aristides Gionis, et al. "Evaluating Strategies for Similarity Search on the Web." In Proceedings of the Eleventh International World Wide Web Conference, May 2002. • Taher Haveliwala. "Topic-Sensitive PageRank." In Proceedings of the Eleventh International World Wide Web Conference, May 2002.

  46. References: STREAM • R. Motwani, J. Widom, et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proc. of the 2003 Conference on Innovative Data Systems Research (CIDR), January 2003. • A. Arasu, B. Babcock, et al. STREAM: The Stanford Stream Data Manager. In Proc. of the ACM Intl. Conf. on Management of Data (SIGMOD 2003), June 2003. • B. Babcock, S. Babu, et al. Models and Issues in Data Stream Systems. Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.

  47. References: Peers • Neil Daswani, Hector Garcia-Molina and Beverly Yang. Open Problems in Data-Sharing Peer-to-Peer Systems. In ICDT, 2003. • Hector Garcia-Molina. Peer-To-Peer Data Management. Keynote at ICDE, 2002. • Hrishikesh Deshpande, Mayank Bawa, and Hector Garcia-Molina. Streaming Live Media over a Peer-to-Peer Network.

  48. References: TRAPP • C. Olston and J. Widom. Best-Effort Cache Synchronization with Source Cooperation. ACM SIGMOD 2002 International Conference on Management of Data, Madison, Wisconsin, June 2002, pp. 73-84. • C. Olston, B. T. Loo and J. Widom. Adaptive Precision Setting for Cached Approximate Values. ACM SIGMOD 2001 International Conference on Management of Data, Santa Barbara, California, May 2001, pp. 355-366.

  49. Useful Links • Database Group: http://www-db.stanford.edu/ • STREAM: http://www-db.stanford.edu/stream/ • Peers: http://www-db.stanford.edu/peers/ • DigLib: http://www-diglib.stanford.edu/ • TRAPP: http://www-db.stanford.edu/trapp/ • WebBase: http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
