The Sibdata Revolution. Nick Roussopoulos DCS & UMIACS & Univ. of Maryland. September 2009. Data Management: Past to Current. Structured Data Structured architectures. Data Management: Huh???. The Landscape.
The Sibdata Revolution
Nick Roussopoulos
DCS & UMIACS & Univ. of Maryland
September 2009

Data Management: Past to Current
• Structured data
• Structured architectures

Data Management: Huh???

The Landscape
Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
• Mainframes: 1960s
• Minicomputers: 1970s
• Microcomputers/PCs: 1980s
• Web-based computing: 1990s
• Devices (smart phones, PDAs, wireless sensors, RFID): 2000s
Each class enables a new generation of applications that mandate new data management methods & tools.

Data Then and Now
• The “Data Industrial Revolution”: data used to be “hand-crafted”; now it’s generated by computers!!!
• The Data Integration quagmire: 40 years of continuous successes (sic) and still a long way to the end.
• Structure provides crucial understanding for making data usable and leads to discovery/innovation.

Data Streaming
Data Explosion
• Exponential data growth
• New challenges: continuous, inter-connected, distributed, physical
• Shrinking business cycles
• More complex decisions
[Figure: data sources: PoS systems, barcodes, phones, sensors, RFID, inventory, transactional systems, telematics, clickstream]

The Structure Spectrum
• Structured data (schema-first)
  • Regular, known, conforming, …
  • e.g., relational databases
• Unstructured data (schema-never)
  • Freeform, irregular
  • e.g., plain text, images, audio, …
• Semi-structured data (schema-later)
  • Provides structural information, but is less constrained
  • e.g., XML, tagged text/media

Data Integration
• Integration is the ultimate schema-first problem.
• Requires complete understanding & disambiguation
• Structure (semantics) is both a key enabler and a key impediment here.

Structured Data: How Much?
• Conventional wisdom: ~20% of data is currently structured.
• Consumer apps, enterprise search, and multimedia apps are placing downward pressure on this share.

State of the Art: Integration-in-the-large
• Team work, huge & expensive effort, excruciating pain
• Extremely long time lag between data generation and availability
• Custom-coded implementations that are often unsuccessful
• Clearing house of already discovered knowledge (the high overhead is for disambiguating the semantics of the heterogeneous data)

Future: Integration-in-the-small
• End-user, limited in scope, requires training
• Continuous as the data sources and equipment evolve
• End-user tools are needed
• Small cost, enormous opportunity for discovery and innovation

Sibling Data
• Aggregation and naming of disparate data regardless of location
• Includes actual data, references to external data, queries that generate data, & programs to process data
• May include other sibdata
• Open vs. closed
  • Open: continuous accumulation
  • Closed: fixed snapshot (archival)
• Location-independent semantics

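The member kinds and the open/closed distinction listed above can be pictured as a small container type. This is only a sketch under stated assumptions: the class names (`Data`, `Reference`, `Query`, `Program`, `Sibdata`) and the `snapshot` method are hypothetical and do not come from the slides, which define no concrete API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Union


@dataclass
class Data:            # actual data held inside the sibdata
    value: Any


@dataclass
class Reference:       # pointer to external data
    url: str


@dataclass
class Query:           # a query that generates data when run
    text: str


@dataclass
class Program:         # a program that processes data
    run: Callable[[Any], Any]


# Hypothetical union of everything a sibdata may contain, including other sibdata.
Member = Union[Data, Reference, Query, Program, "Sibdata"]


@dataclass
class Sibdata:
    name: str
    is_open: bool = True
    members: List["Member"] = field(default_factory=list)

    def add(self, member: "Member") -> None:
        # Open sibdata keep accumulating; closed ones are fixed snapshots.
        if not self.is_open:
            raise ValueError(f"sibdata {self.name!r} is closed")
        self.members.append(member)

    def snapshot(self) -> "Sibdata":
        # Return a closed (archival) copy of the current membership.
        return Sibdata(self.name, is_open=False, members=list(self.members))


movies = Sibdata("moore-movies")
movies.add(Reference("http://www.michaelmoore.com/"))
movies.add(Query("SELECT m.title FROM Yahoo_Movies m"))
archive = movies.snapshot()
movies.add(Sibdata("nested-sibdata"))   # the open original keeps growing
```

The closed copy stays at two members while the open original grows to three, which is the open-versus-closed behaviour the slide describes.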
Web Search Results

Content vs. URL
• Content
• http://www.michaelmoore.com/

Deep-Web Queries
SELECT m.title FROM Yahoo_Movies m WHERE m.title LIKE '%Moore%';

Result vs. Query
• Results are associated with the time the query was run.
• Queries can be captured in sibdata and executed at will; such a sibdata is open and captures a different result each time it executes.

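The difference between storing a result and storing the query can be illustrated with an in-memory SQLite database. The `CapturedQuery` class is a hypothetical name for this sketch, and the one-column `Yahoo_Movies` table and its rows are placeholder assumptions, not data from the slides.

```python
import sqlite3
from datetime import datetime, timezone


class CapturedQuery:
    """A query stored in a sibdata; each execution is stamped with its run time."""

    def __init__(self, sql, params=()):
        self.sql = sql
        self.params = params

    def execute(self, conn):
        rows = conn.execute(self.sql, self.params).fetchall()
        return datetime.now(timezone.utc), rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Yahoo_Movies (title TEXT)")   # assumed schema
conn.execute("INSERT INTO Yahoo_Movies VALUES ('Movie A')")

query = CapturedQuery("SELECT m.title FROM Yahoo_Movies m ORDER BY m.title")

first_time, first_rows = query.execute(conn)    # result as of the first run
conn.execute("INSERT INTO Yahoo_Movies VALUES ('Movie B')")
second_time, second_rows = query.execute(conn)  # same query, newer result
```

A stored copy of `first_rows` would behave like a closed snapshot, while the captured query sees the second row on its next run.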
Queries to Relational Databases
[Figure: the Yahoo_Actors relation]

Sibdata
• Deal with all the data from everywhere & in whatever form they come
• Data co-existence: no integrated schema, no single warehouse
• Expand as you go
• Integrate little by little, as you need
• ETL: data mapping and integration as you add more data

Sibdata Properties
• Lightweight
  • Metadata captures the encapsulation, name, and provenance data
• Location-independent
  • Accessible from anywhere
• Isolated
  • Generated with no interference
• Durable
  • Persists until dropped
• Secure
  • Guarantees the security defined by the creators and sources
  • Composes multiple levels of security across its components

Comparison to Transactions
• Transactions
  • Grouping of many actions into an atomic transaction: ACID properties
  • Substrate: database
• Sibdata
  • Grouping of data into an atomic sibdata: LLADS
  • Substrate: actions/transactions/data generators

Sibdata Infrastructure

Sibdata Servers
• Establish a global sibdata ID and name
• Create and maintain metadata with provenance, users, security, etc.
• Provide a searchable catalog
• Provide storage for non-sib-compliant data sources
• Fault tolerance (replication)

Sib Protocols
• Establish the Sibdata protocol
• Concurrency-consistency issues (?)
• Sharing of data
• Name conventions
• Dispute resolution
• Distributed logging
• Security using chits
• Group and multi-valued ownership and visibility

User Interface
• Simple OS support
• Query languages
• Graphical languages
• ETL tools
• Extra functionality
  • High-dimensional indexing
  • Mining

Conclusions
• Need to build the Sib infrastructure
• Refine the sibdata semantics
• Refine the security protocols
  • For data aggregates
  • User groups
• Great opportunities for innovation

Presentations & Project
• 3 × 7 students = 21 presentations, ~2 per lecture
• Lecture dates
  • Sep: 15, 22, 29
  • Oct: 6, 13, 20, 27
  • Nov: 3, 10, 17, 24
  • Dec: 1, 8
• Project: proposal due Sep 29
• Discussion: at every lecture, be prepared to give a 2-3 min progress report (papers found, etc.)

Network Data Independence (Hellerstein, Berkeley)
• Physical data independence
  • Decoupling data from layout (no hard-coded applications)
  • Permits reorganization of data w/o affecting the apps
  • Declarative query languages
  • Using the schema
• Distributed databases
  • Transparency hides location from the user, who acts as if accessing a centralized database
  • Limited number of sites: cannot accommodate mobility and constant change of the configuration

Pillars of Data Independence
[Figure: table R and its occurrence file]
• Indexes: offer indirection, allowing modification of the underlying structure
• Schema-based, declarative query languages & optimization

Sibdata Independence
• Encapsulation of dissimilar data
  • Data can be moved, rearranged, altered
  • Additional indices built on top of a sibdata become part of that sibdata
• Naming and provenance data are fixed
  • They do not change to the outside world
• Containment information (sibdata encapsulation within other sibdata) is guaranteed

DHT (Chord)
• Data-centric distribution
  • According to content: total data independence
  • Very large number of distributed servers
• Configuration changes rapidly (although this may not really be that important)
• Fault tolerance (extra machines)
• Limited to single-key searches (not range or join queries)

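The content-based placement described above can be sketched with a hash ring in which each key belongs to the first node at or after its hash. This is a simplified illustration, not Chord itself: it keeps the whole ring in one sorted list and omits finger tables and routing, and the names `ring_id` and `Ring` are assumptions made for the example.

```python
import bisect
import hashlib


def ring_id(name: str) -> int:
    # Place node names and keys on the same identifier ring via SHA-1.
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class Ring:
    def __init__(self, nodes):
        self._ids = sorted((ring_id(n), n) for n in nodes)

    def successor(self, key: str) -> str:
        # The owner of a key is the first node clockwise from the key's hash.
        i = bisect.bisect_left(self._ids, (ring_id(key), ""))
        return self._ids[i % len(self._ids)][1]

    def join(self, node: str) -> None:
        bisect.insort(self._ids, (ring_id(node), node))

    def leave(self, node: str) -> None:
        self._ids.remove((ring_id(node), node))


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
keys = [f"key-{i}" for i in range(200)]
before = {k: ring.successor(k) for k in keys}

ring.leave("node-c")                      # a configuration change
after = {k: ring.successor(k) for k in keys}
```

Only keys that were owned by the departed node change owner, which is why rapid configuration changes are tolerable. A lookup needs the exact key, so hashing gives no help with range or join queries, as the last bullet notes.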
Network Names & Services
• Internet Indirection Infrastructure (i3)
  • Triggers (id, r), where id = global ID and r is an address to which packets are forwarded
  • When a mobile user moves to r’, he modifies his trigger to (id, r’)
  • It also supports 1-to-n mappings (anycast)
• Content Distribution Networks (Akamai)
  • Replicate heavy data (images, videos) to multiple sites and redirect user accesses to those that are closer (indirection via location independence)

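The trigger mechanism above amounts to a table from IDs to forwarding addresses that receivers update as they move. The sketch below is a single-process illustration with hypothetical names (`I3Overlay`, `insert_trigger`, `move`, `send`); it is not the i3 API, and for simplicity it forwards a packet to every address registered under an ID.

```python
from collections import defaultdict


class I3Overlay:
    def __init__(self):
        self._triggers = defaultdict(set)   # id -> set of forwarding addresses
        self.delivered = []                 # (address, packet) pairs, in order

    def insert_trigger(self, id_, addr):
        self._triggers[id_].add(addr)

    def remove_trigger(self, id_, addr):
        self._triggers[id_].discard(addr)

    def move(self, id_, old_addr, new_addr):
        # A mobile receiver replaces (id, r) with (id, r').
        self.remove_trigger(id_, old_addr)
        self.insert_trigger(id_, new_addr)

    def send(self, id_, packet):
        # Senders address the ID only; the overlay resolves current addresses.
        for addr in sorted(self._triggers.get(id_, ())):
            self.delivered.append((addr, packet))


overlay = I3Overlay()
overlay.insert_trigger("id42", "r")
overlay.send("id42", "p1")
overlay.move("id42", "r", "r-prime")
overlay.send("id42", "p2")
```

The sender never learns that the receiver moved, which is the same kind of indirection that location-independent sibdata names rely on.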
Relevant DB Technologies
• Distributed aggregation
  • Monitor networks (collecting stats)
  • Compute synopses and pass them along
• Adaptive execution plans
  • Feedback to the execution
  • Commutative tasks to avoid extended delays
• Range search over DHT
  • Trie hashing
  • Still limited
• P2P & mobile databases

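As one way to picture "compute synopses and pass them along", the sketch below computes a network-wide average by merging small (sum, count) summaries up a tree instead of shipping raw readings. The tree shape, node names, and readings are made-up assumptions for illustration.

```python
def merge(a, b):
    # Combine two (sum, count) synopses into one.
    return (a[0] + b[0], a[1] + b[1])


def aggregate(node, children, readings):
    # Each node summarizes its own reading plus its children's synopses
    # and passes a single pair upward.
    synopsis = (readings[node], 1)
    for child in children.get(node, []):
        synopsis = merge(synopsis, aggregate(child, children, readings))
    return synopsis


children = {"root": ["a", "b"], "a": ["c"]}          # assumed topology
readings = {"root": 10, "a": 20, "b": 30, "c": 40}   # assumed local stats

total, count = aggregate("root", children, readings)
average = total / count
```

Each link carries one pair regardless of how many nodes sit below it, which is the bandwidth argument for in-network aggregation.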
Pier: A P2P in situ Query Engine
Goals
• Massively distributed processing
• Scalability
• Relaxed consistency (best effort)
Architecture
• P2P, built on top of a DHT
• Multicast to all related nodes (lscan)
• Pipelining of the intermediate results

Pier Joins
• Stored in DHT
  • Namespace = relation (NR, NS)
  • resourceID = primary key (PK)
  • instanceID = tuple # if not a PK
• Assume R and S are already DHT-hashed using <NR,PKR,1> and <NS,PKS,1>
• Symmetric join, building phase
  • lscan NR and NS; eliminate unqualified tuples and unneeded attributes
  • Rehash all of the above tuples using
    • namespace NQ
    • resourceID = R.pkey*S.pkey
  • Tuples are tagged with the relation name
• Symmetric join, probing phase
  • Probing runs locally, in parallel with building (with callbacks)
  • Satisfying tuples are either sent to the Qsite or DHT-ed for the pipelined op
• Consumes a lot of bandwidth

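The build and probe phases above follow the general symmetric hash join pattern, which can be sketched in a single process. This is not Pier's code: the DHT rehash into namespace NQ is replaced by two local hash tables, and the function name and tuple layout are assumptions for the example.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples as they arrive; stream yields ('R', tup) or ('S', tup)."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for rel, tup in stream:                  # tuples are tagged with relation name
        other = "S" if rel == "R" else "R"
        k = keys[rel](tup)
        tables[rel][k].append(tup)           # build: insert into own table
        for match in tables[other].get(k, []):   # probe: look up the other table
            yield (tup, match) if rel == "R" else (match, tup)


arrivals = [
    ("R", (1, "a")),
    ("S", (1, "x")),
    ("S", (2, "y")),
    ("R", (2, "b")),
    ("R", (1, "c")),
]
results = list(symmetric_hash_join(arrivals, lambda t: t[0], lambda t: t[0]))
```

Because every tuple is both stored and probed on arrival, results stream out without waiting for either input to finish. In the distributed setting every qualifying tuple of both relations is rehashed across the network, which is where the bandwidth cost comes from.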
Better Joins
• Fetch Matches
  • Hash only S
  • lscan R and fetch NS tuples
• Rewriting the join using a 2-way semijoin
  • Project R & S on their PK and joining attribute
  • Do a symmetric join on these projections
• Rewriting the join using Bloom filters
  • Create and DHT the Bloom filters
  • Do lscan and access the Bloom filter to eliminate non-joinable tuples

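The Bloom filter rewrite can be sketched locally: build a filter over the join keys of S, use it to discard R tuples that cannot join, and join only the survivors. The `BloomFilter` class, its sizes, and the sample relations are assumptions for illustration; in Pier the filters would be published through the DHT rather than held in one process.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit array stored in one integer

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> p) & 1 for p in self._positions(item))


R = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
S = [(2, "x"), (4, "y"), (4, "z")]

s_keys = BloomFilter()
for key, _ in S:
    s_keys.add(key)

# Only R tuples that might join are shipped; the rest are dropped in place.
candidates = [r for r in R if r[0] in s_keys]
joined = [(r, s) for r in candidates for s in S if r[0] == s[0]]
```

False positives only cost some extra shipped tuples, since the final equality check removes them, so the join result stays exact.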
Conclusions for Pier
• P2P brings massive parallelism
• Repetitive data comparison over the DHT wastes massive amounts of bandwidth
• Smarter in situ distillation (2-way semijoins, Bloom filters) works better