
Calder: OGSA-DAI access to data in streams

Calder: OGSA-DAI access to data in streams. Beth Plale Computer Science Dept. Indiana University. Contributors. PhD students: Nithya Vijayakumar Ying Liu Funding agencies: Department of Energy Early Career grant National Science Foundation LEAD project. Problem Statement.


Presentation Transcript


  1. Calder: OGSA-DAI access to data in streams Beth Plale Computer Science Dept. Indiana University

  2. Contributors • PhD students: • Nithya Vijayakumar • Ying Liu • Funding agencies: • Department of Energy • Early Career grant • National Science Foundation • LEAD project

  3. Problem Statement • Data streams (e.g., from sensor networks, instruments) are growing in pervasiveness and importance • Bringing streaming sources to a grid by wrapping each sensor and instrument as a grid or web service is a naïve solution. • It is the data in the streams that is of value to growing groups of users, not the instruments. • A select few will ‘steer’ instruments • Need a fresh solution that provides access to collections of data streams.

  4. Outline • Characterization of data stream systems

  5. Data streams: • Indefinite sequence of events (or messages, tuples) • Often time marked • Generation time, that is, timestamp, and • Logical time • Events continuously generated • pushed or pulled from providers to remote consumers

  6. Types of Data Stream Systems (diagram of three overlapping classes; axis labels: timeliness demands on response, size and number of stream “chunks” analyzed) • Stream detection systems: detect unusual behavior • Data manipulation systems: process large amounts of data • Stream routing systems: delivery of events

  7. Stream Routing Systems • Known by various names • Publish/subscribe, selective data dissemination, document filtering, message oriented middleware (MOM) • Decisions made event-by-event • Set of queries (usually a very large number) managed over a long time duration; each arriving event is matched against the set of queries. • Stream routing projects • Xfilter (UMaryland), Xyleme (INRIA), XPushMachine (UWashington), NaradaBroker (IndianaU), Bayou (XeroxParc), Echo (GeorgiaTech)

  8. Stream Routing Example (diagram: stock quote stream) • Long-standing queries are added through a user interface • Each arriving event is matched against the set of queries • Results are multicast to owners of satisfied queries • Timeliness requirement necessitates focus on efficient matching
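
The matching step on this slide can be sketched in a few lines of Python. This is only an illustration, not code from Calder or any of the routing projects listed above: the `Quote` and `StandingQuery` types, the `route` function, and the sample owners are all hypothetical, and the linear scan over queries stands in for the indexed matching that a real system with a very large query set would need.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Quote:
    """One event on a (hypothetical) stock quote stream."""
    symbol: str
    price: float


@dataclass
class StandingQuery:
    """A long-standing query: an owner plus a predicate over events."""
    owner: str
    predicate: Callable[[Quote], bool]


def route(event: Quote, queries: List[StandingQuery]) -> List[str]:
    """Return the owners of every standing query that the event satisfies."""
    return [q.owner for q in queries if q.predicate(event)]


# Queries registered ahead of time (on the slide, through a user interface).
queries = [
    StandingQuery("alice", lambda e: e.symbol == "IBM" and e.price > 100.0),
    StandingQuery("bob", lambda e: e.symbol == "IBM"),
    StandingQuery("carol", lambda e: e.symbol == "SUNW"),
]

# Each arriving event is matched against the whole query set.
stream = [Quote("IBM", 101.5), Quote("SUNW", 4.2), Quote("IBM", 99.0)]
deliveries = [(event, route(event, queries)) for event in stream]
```

Each entry in `deliveries` pairs an event with the owners it would be multicast to. The cost of `route` grows with the number of queries, which is why the slide stresses efficient matching.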

  9. Data Manipulation Systems • Event streams subject to transformation, filtering, aggregation. • Looser timeliness requirements on results • Long-running queries, often periodic (based on assumption of synchronous streams) • Results in generation of new streams • Projects: • Antarctic Monitoring (UNottingham), sensor network query layer (Cornell), dQUOB (IndianaU), STREAM (Stanford), Fjords (Berkeley), NiagaraCQ (UWisconsin)
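
As a rough illustration of filtering plus periodic aggregation producing a new stream, the sketch below averages values over fixed-length periods. It is not taken from dQUOB or the other projects named on the slide; the `(timestamp, value)` event shape, the `windowed_mean` name, and the threshold filter are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, measured value)


def windowed_mean(
    events: Iterable[Event], period: float, threshold: float
) -> Iterator[Event]:
    """Drop values below `threshold`, then emit one mean per `period`.

    The output is itself a stream of (window end time, mean) events.
    Periods that contain no surviving events emit nothing.
    """
    window_end = None
    total, count = 0.0, 0
    for ts, value in events:
        if window_end is None:
            window_end = ts + period  # first window starts at first event
        while ts >= window_end:  # close every period that has elapsed
            if count:
                yield (window_end, total / count)
            total, count = 0.0, 0
            window_end += period
        if value >= threshold:  # filtering step
            total += value
            count += 1
    if count:  # flush the final, partially filled period
        yield (window_end, total / count)


readings = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (10.0, 7.0), (11.0, 0.5), (25.0, 9.0)]
new_stream = list(windowed_mean(readings, period=10.0, threshold=1.0))
```

Because the function is a generator, it can consume an indefinite input stream and yield results as each period closes, which fits the looser timeliness requirement noted above.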

  10. Stream Detection Systems • Event-oriented (versus periodic) • Less predictable, asynchronous streams • Intent is to detect anomalous behavior • Timeliness is critical, time markers key to decision making • Result is notification message • Examples: • R-GMA (EU DataGrid), dQUOB (IndianaU), Conquer (GeorgiaTech), Gigascope (AT&T), Fjords (Berkeley)

  11. Claim: stream systems that qualify as Data Stream Resource are only those circled (diagram of overlapping classes: stream detection systems, data manipulation systems, stream routing systems) • Justification: • A data resource has coherence and meaning • Both systems qualify as ‘data resource’ because: -- distributed global snapshot on stream behavior alone -- distributed global snapshot has meaning and coherence

  12. OGSA-DAI: OGSI access to data sources (diagram labels: grid data service registry, Grid Data Service (GDS), R-DBMS, rowset grid data service) • Data description - database description • Data access - query/update access • Data management - monitor service • Data factory - create rowset instance

  13. Database Access using OGSA-DAI OGSI (diagram labels: grid data service registry; grid data service factory; handle of factory; create GDS; handle of GDS; Grid Data Service (GDS); query; R-DBMS; rowset grid data service; response (row-by-row))

  14. Database Access using OGSA-DAI OGSI (diagram labels: grid data service registry; grid data service factory; handle of factory; create GDS; handle of GDS; Grid Data Service (GDS); query; rowset grid data service; response (row-by-row))

  15. Calder: presents db interface to application and supports multiple stream systems (like ogsa-dai)

  16. Specifying Long-Running Queries: An Example • SELECT (caps, acars, nexradII) WHERE REGION(90W, 30N, 62.5KM) START now EXPIRE 1hr RANGE 6 min • Select from 3 data types: CAPS radar, ACARS data, and NEXRAD Doppler level II • Filter out events not falling within 80 mile radius around New Orleans • Execute beginning immediately and terminate execution in 1 hour • Set sliding window size (RANGE) as window over which joins are carried out
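
To make the clauses of the example query concrete, the sketch below encodes them as a per-event acceptance test. It is an illustration only and not Calder's query planner or language implementation: the `ContinuousQuery` class, its field names, the haversine distance check, and the representation of START and EXPIRE as seconds are all assumptions. The RANGE value is stored but the windowed join it controls is not implemented here.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


@dataclass(frozen=True)
class ContinuousQuery:
    """Hypothetical in-memory form of the SELECT ... WHERE REGION ... query."""
    sources: FrozenSet[str]   # SELECT list
    center_lat: float         # REGION centre, degrees north
    center_lon: float         # REGION centre, degrees east (west is negative)
    radius_km: float          # REGION radius
    start: float              # START, seconds
    expire: float             # EXPIRE, seconds
    range_s: float            # RANGE (join window), seconds; unused in this sketch

    def accepts(self, source: str, lat: float, lon: float, ts: float) -> bool:
        """True if the event is from a selected source, in time, and in region."""
        if source not in self.sources:
            return False
        if not (self.start <= ts < self.expire):
            return False
        distance = great_circle_km(self.center_lat, self.center_lon, lat, lon)
        return distance <= self.radius_km


# The slide's query with START taken as t = 0: 1 hour lifetime, 6 minute range.
query = ContinuousQuery(
    sources=frozenset({"caps", "acars", "nexradII"}),
    center_lat=30.0, center_lon=-90.0, radius_km=62.5,
    start=0.0, expire=3600.0, range_s=360.0,
)
```

A stream processor would call `query.accepts(...)` on each arriving event and pass accepted events on to the join over the RANGE window.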

  17. Retrieving results from rowset service: Rowset Request API • getTuple(timestamp, ringBufferID) • getRangeTuple(startTS, endTS, ringBufferID) • getMostRecent(lastRecent, num_events, ringBufferID) • getStream(ringbufferID) • Results obtained through request to rowset service of form: -- single event based on timestamp, -- range of events bounded by time range, -- most recent n events, or -- stream of events.
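
The four request forms can be mimicked with an in-memory ring buffer, as in the sketch below. Only the four method names and their parameter names come from the slide; everything else is an assumption made for illustration, including the `RowsetService` class itself, the `create_ring_buffer` and `append` helpers, the `(timestamp, payload)` event shape, and in particular the reading of `lastRecent` as the timestamp of the newest event the caller has already seen. `getStream` here replays the current buffer contents rather than blocking for future events.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List, Optional, Tuple

Event = Tuple[float, str]  # (timestamp, payload); shape is an assumption


class RowsetService:
    """Toy stand-in for a rowset service holding one ring buffer per query."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._buffers: Dict[int, Deque[Event]] = {}

    def create_ring_buffer(self) -> int:
        """Hypothetical helper: allocate a buffer and return its handle."""
        ring_id = len(self._buffers)
        # deque(maxlen=...) discards the oldest event once the buffer is full.
        self._buffers[ring_id] = deque(maxlen=self._capacity)
        return ring_id

    def append(self, ringBufferID: int, timestamp: float, payload: str) -> None:
        """Hypothetical helper: store one result event."""
        self._buffers[ringBufferID].append((timestamp, payload))

    def getTuple(self, timestamp: float, ringBufferID: int) -> Optional[Event]:
        """Single event with exactly this timestamp, or None."""
        for event in self._buffers[ringBufferID]:
            if event[0] == timestamp:
                return event
        return None

    def getRangeTuple(self, startTS: float, endTS: float, ringBufferID: int) -> List[Event]:
        """Events with startTS <= timestamp <= endTS, oldest first."""
        return [e for e in self._buffers[ringBufferID] if startTS <= e[0] <= endTS]

    def getMostRecent(self, lastRecent: float, num_events: int, ringBufferID: int) -> List[Event]:
        """Up to num_events newest events that are newer than lastRecent."""
        if num_events <= 0:
            return []
        newer = [e for e in self._buffers[ringBufferID] if e[0] > lastRecent]
        return newer[-num_events:]

    def getStream(self, ringBufferID: int) -> Iterator[Event]:
        """Iterate over a snapshot of the buffer, oldest first."""
        yield from list(self._buffers[ringBufferID])


service = RowsetService(capacity=3)
ring = service.create_ring_buffer()
for t in (1.0, 2.0, 3.0, 4.0):
    service.append(ring, t, f"e{int(t)}")  # event at t = 1.0 is overwritten
```

With a capacity of three, the fourth append pushes out the oldest event, so a request for a timestamp that has already left the buffer returns nothing.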

  18. Experimental Environment • GDS, rowset server, stream processing server • Dual Dell 2.8 GHz Precision workstation, 2 GB memory, RHEL • Planner • Solaris UltraSPARC 502 MHz, 1 GB memory, SunOS 5.8 • Computational mesh (1 node) • Xeon Intel 2.8 GHz, 2 GB RAM, RedHat 8.0 • 1 Gbps switched Ethernet network • OGSA-DAI OGSI v4.0 • dQUOB v1.0

  19. Evaluation • Benchmark following steps • GDS setup - ogsa-dai factory call • Query plan time • Plan how query is to be distributed • Query compile - compile into portable script form • Distribute query - deploy script into computational mesh (1 node in test) • Ring buffer setup - allocate space in rowset service, return handle to GDS.

  20. Benchmark results: average taken over 100 different runs

  21. Time interval (a) obtained by getRangeTuple(startTS, endTS, ringBufferID) drops as # queries at node increases; turnaround increases

  22. Related Research • Common Middleware Instrument Architecture (CIMA), McMullen, Bramley et al. • GATES, • Aggarwal, HPDC04 • Data Cutter, Saltz • Grid Stream Database Manager (GSDM) • Koparanova and Risch, AxGrids 2003 • Stores into O-O DBMS • DQP • Manchester and Newcastle • DB community: Widom, Borealis, NiagaraCQ

  23. Summary (figure labels: On-Demand Resource Scheduling; Real-Time WRF executed on Grid when environment primed and storms present) • Model of data stream store provides conceptual framework for retrieving data from streams by means of rich and meaningful queries • GGF DAIS framework of OGSA-DAI is intuitive abstraction for accessing data stream store • It is not about the instruments and sensors … it is about the streams!

  24. Parting Views • Our work brings data streams to the grid in the way that users intuitively think about accessing data resources • Querying heterogeneous data management tools is a difficult problem. Heterogeneous query languages will exist • I.e., continuous query languages will always be different from database query languages. • Scientists need a lot of coaching to understand why their monolithic “pile of perl” is not a grid service, and further why they should invest in breaking up that “pile of perl”.

  25. http://www.cs.indiana.edu/dde/projects/Calder.html Calder extends dQUOB (http://www.cs.indiana.edu/dde/projects/dquob.html). dQUOB v1.0, which is available for release as open source, includes the stream processing system components of Calder.
