580 likes | 589 Views
Explore the use of query languages and scalable systems for detecting complex patterns in databases and data streams. Learn about efficient algorithms for mining large amounts of data and extracting useful patterns.
E N D
UCLA Computer Science Department High-performancePattern Detection and Discoveryfor Databases and Data Streams Barzan Mozafari Adviser: Prof. Carlo Zaniolo Committee Members: Prof. Junghoo Cho, Prof. D. Stott Parker, and Prof. Mark Hansen Winter 2011
Big Picture • Query Languages that allow for the expression of complex patterns • Scalable Systems that support such languages and can handle massive, high-arrival data • Efficient, One-pass Algorithms that can mine large amounts of stored or streaming data and extract useful patterns Query Patterns Matches Data Mining Data
Overview • Introduction • Query Languages for Pattern Detection • Kleene-* Constructs in SQL • Nested Words[SIGMOD’10, VLDB’10] • Optimization [Work in progress] • XSeq [Work in progress] • Conclusion
Complex Event Patterns • Sequences in DBs and CEP over data streams • Academic and industrial interest: • SQL-TS [PODS ‘01] • SASE [2006], SASE+ [2008] • SQL Change proposal, 2007 (by Oracle, IBM and Streambase) • Other industrial and academic languages: • Cayuga & CEL • CEDR • Microsoft CEP & LINQ
Our Contribution: K*SQL • A powerful language for: i. Expressing more complex patterns on relational streams and sequences ii. Querying data with more complex structures, e.g, XML and genomic data • A unifying engine for sequence patterns and XML • New optimization techniques • pattern search over nested words • Efficient query execution backend for other languages • XSeq: An XPath-resembling language to bring Kleene-* to XML applications
Regular Expressions in SQL rfid_readings (Time, SensorType, ensorId, ItemId)
Employees who spend >1 hour in the lab but leave without going to decontamination room Lab Lab Room2 Room12 Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L Lab Room2 Room12 Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( L ) WHEREL.room = ‘Lab’ Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab Room2 Room12 Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( L+ ) WHERE L.room = ‘Lab’ Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab O Room2 O+ O Room12 O Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( L+ O+ ) WHERE L.room = ‘Lab’ ANDO.room != ‘Decontamination’ Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( (R: L+ O+) ) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( (R: L+ O+)+ ) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( (R: L+ O+)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ ANDX.room = ‘Exit’ X Exit
Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( (R: L+ O+)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ AND X.room = ‘Exit’ ANDsum(R.Last(L).timestamp – R.First(L).timestamp) > 3600 X Exit
Strictly More Expressive, through:(i)Nested Kleene-*, (ii) Labels, i.e. Aliases SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( (R: L+ O*)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ AND X.room = ‘Exit’ AND sum(R.Last(L).timestamp – R.First(L).timestamp) > 3600
Strictly More Expressive, through:(i)Nested Kleene-*, (ii) Labels, i.e. Aliases L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID, Last(R).Last(L).timestamp – First(R).First(L).timestamp) FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( (R: L+ O*)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ AND X.room = ‘Exit’ AND sum(R.Last(L).timestamp – R.First(L).timestamp) > 3600 X Exit
K*SQL Checkpoint • A powerful language with a very efficient implementation based on FSA • Subsumes SQL-MR, SASE+, Cayuga, SQL-TS • Many interesting applications • including queries on semistructured documents Very natural question: Can we handle full XML?
Automata and XML • Word Automata (FSA): only linear structure is explicit, cannot model parenthesis languages • Ordered Tree Automata (OTA): only hierarchical structure is explicit, exponentially less succinct for word queries • Pushdown Automata (PDA): Many problems are undecidable; expensive complexity
Advances in the Automata World Nested Words [Alur’06] • Linear sequence + well-nested edges • Positions labeled with symbols in S a2 a1 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 Positions classified as: • Call positions: both linear and hierarchical successors • Return positions: both linear and hierarchical predecessors • Internal positions: otherwise 20
Nested Word Applications U A U A U A G G G G G C C C C C XML Document Program RNA Sequence <conference> <name> CAV 2006 </name> <location> <city> Seattle </city> <hotel> Sheraton </hotel> </location> <sponsor> MSR </sponsor> <sponsor> Cadence </sponsor> </conference> global int x; bool P() { … x = 3; if Q x = 1 ; … } bool Q () { local int y; … x = y; return (x==0); } Primary structure: Linear sequence of nucleotides (A, C, G, U) Secondary structure: Hydrogen bonds between nucleotides
Odious Comparison NWA is exponentially more succinct than Tree Automata No query language has been proposed for NW
XML Sigmod Record:SAX-3 <!ELEMENTSigmodRecord(issue)* > <!ELEMENTissue (volume,number,articles) > <!ELEMENTvolume (#PCDATA)> <!ELEMENTnumber (#PCDATA)> <!ELEMENTarticles (article)* > <!ELEMENTarticle(title,initPage,endPage,authors) > <!ELEMENTtitle (#PCDATA)> <!ELEMENTinitPage (#PCDATA)> <!ELEMENTendPage (#PCDATA)> <!ELEMENTauthors (author)* > <!ELEMENTauthor (#PCDATA)> <!ATTLISTauthorposition CDATA #IMPLIED>
XPath <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. Find articles of Carlo Zanioloas the 2nd co-author //article[authors/author [@position = "01" and text()="Carlo Zaniolo"] ]/title/text()
K*SQL Question: Can we query nested words in K*SQL? In particular: can we express traditional XML queries • i.e. those often expressed via XPath/XQuery:
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <aut hors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN ( ) WHERE
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt ) WHEREOpArt.value = ‘<article>’
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt ) WHEREOpArt = open(‘article’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.tokenAS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) ANDisElement(E)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDClArt = close(‘article’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) ANDpos.type = ‘attr’ AND pos.value = ’01’ AND pos.token = ‘position’
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) ANDpos = attribute (‘position’, ’01’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) ANDauthor.token = `Carlo Zaniolo’
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.value = `Carlo Zaniolo’ ANDClAu = close(‘author’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu E* ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.value = `Carlo Zaniolo’ ANDClAu = close(‘author’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu E* ClAuths ClArt) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.token = `Carlo Zaniolo’ ANDClAu = close(‘author’) ANDClAuths = close(‘authors’) ANDClArt = close(‘article’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu E* ClAuths ClArt) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.token = `Carlo Zaniolo’ ANDClAu = close(‘author’) ANDClAuths = close(‘authors’) ANDClArt = close(‘article’)
Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu E* ClAuths ClArt) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.token = `Carlo Zaniolo’ ANDClAu = close(‘author’) ANDClAuths = close(‘authors’) ANDClArt = close(‘article’)
Sequence Queries over XML: ‘W’-Patterns in Stocks <!ELEMENTStocks(Stock)* > <!ELEMENTStock(symbol, date, price, volume)> <!ELEMENTsymbol (#PCDATA)> <!ELEMENTdate (#PCDATA)> <!ELEMENTprice (#PCDATA)> <!ELEMENTvolume (#PCDATA)>
W-patterns in NASDAQ transactions with volume>1000 <Stock symbol=“YHOO” date=“01-01-2010 23:10:00”> <price> 18.50 </price> <volume> 21 </volume> </Stock> <Stock symbol=“YHOO” date=“01-01-2010 23:16:00”> <price> 18.70 </price> <volume> 11 </volume> </Stock> … SELECT FIRST(Z).FIRST(X).Sym.token FROM Nasdaq PARTITION BY Y.X.Sym.token AS PATTERN (Z: (X: OpSt Sym Date OP Price1 CP OpV Volume ClV ClSt)* (Y: OpSt Sym Date OP Price2 CP OpV Volume ClV ClSt)* )^2 WHERE OpSt = open(‘Stock’) AND ClSt = open(‘Stock’) AND OP = open(‘price’) AND CP = close(‘price’) AND OpV = open(‘volume’) AND ClV = close(‘volume’) AND INT(volume.token) >= 100 AND Z.X.price1.token < Z.PREV(X).price1.token AND Z.Y.price2.token > Z.PREV(Y).price2.token
W-patterns in NASDAQ transactions with volume>1000 <Stock symbol=“YHOO” date=“01-01-2010 23:10:00”> <price> 18.50 </price> <volume> 21 </volume> </Stock> <Stock symbol=“YHOO” date=“01-01-2010 23:16:00”> <price> 18.70 </price> <volume> 11 </volume> </Stock> … SELECT FIRST(Z).FIRST(X).Sym.token FROM Nasdaq PARTITION BY Y.X.Sym.token AS PATTERN (Z: (X: OpSt Sym Date OP Price1 CP OpV Volume ClV ClSt)* (Y: OpSt Sym Date OP Price2 CP OpV Volume ClV ClSt)* )^2 WHERE OpSt = open(‘Stock’) AND ClSt = open(‘Stock’) AND OP = open(‘price’) AND CP = close(‘price’) AND OpV = open(‘volume’) AND ClV = close(‘volume’) AND INT(volume.token) >= 100 AND Z.X.price1.token < Z.PREV(X).price1.token AND Z.Y.price2.token > Z.PREV(Y).price2.token
W-patterns in NASDAQ transactions with volume>1000 Y* Y* X* X* <Stock symbol=“YHOO” date=“01-01-2010 23:10:00”> <price> 18.50 </price> <volume> 21 </volume> </Stock> <Stock symbol=“YHOO” date=“01-01-2010 23:16:00”> <price> 18.70 </price> <volume> 11 </volume> </Stock> … SELECT FIRST(Z).FIRST(X).Sym.token FROM Nasdaq PARTITION BY Y.X.Sym.token AS PATTERN (Z: (X: OpSt Sym Date OP Price1 CP OpV Volume ClV ClSt)* (Y: OpSt Sym Date OP Price2 CP OpV Volume ClV ClSt)* )^2 WHERE OpSt = open(‘Stock’) AND ClSt = open(‘Stock’) AND OP = open(‘price’) AND CP = close(‘price’) AND OpV = open(‘volume’) AND ClV = close(‘volume’) AND INT(volume.token) >= 100 AND Z.X.price1.token < Z.PREV(X).price1.token AND Z.Y.price2.token > Z.PREV(Y).price2.token
Optimization in K*SQL • Compile-Time: • Inferring inter-predicate implications • Query re-writing, e.g. adding more constrainst • Greedy predicate assignment • Run-Time: Avoiding unnecessary backtracks • VPSearch: Extending KMP search algorithm to nested words and visibly pushdown words • Optimizing non-determinisitc queries • i.e. all-match query modes
References • [1] Data mining: Staking a claim on your privacy. Information and Privacy Commissioner, Ontario, Jan. 1998. • [2] Directive on privacy protection. European Union, Oct. 1998. • [3] The end of privacy. The Economist, May 1999. • [4] Daniel J. Abadi, Donald Carney, Ugur C etintemel, Mitch Cherniack, Christian Convey, C. Erwin, Eduardo F. Galvez, M. Hatoun, Anurag Maskey, Alex Rasin, A. Singer, Michael Stonebraker, Nesime Tatbul, Ying Xing, R. Yan, and Stanley B. Zdonik. Aurora: A data stream management system. In SIGMOD Conference, page 666, 2003. • [5] Mads Sig Ager, Olivier Danvy, and Henning Korsholm Rohde. Fast partial evaluation of pattern matching in strings. In PEPM, 2003. • [6] Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil Immerman. Ecient pattern matching over event streams. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 147{160, New York, NY, USA, 2008. ACM.