Learning Semantic String Transformations from Examples. Rishabh Singh and Sumit Gulwani. FlashFill. Transformations. Syntactic Transformations Concatenation of regular expression based substring “VLDB2012” “VLDB” Semantic Transformations More than just characters
Learning Semantic String Transformations from Examples Rishabh Singh and Sumit Gulwani
Transformations • Syntactic Transformations • Concatenation of regular expression based substrings • “VLDB2012” → “VLDB” • Semantic Transformations • More than just characters • “1/5/2010” → “May 1st 2010”
Semantic Transformations • Semantic information as relational tables • 1 → January, 2 → February • Learn table lookup queries • VLOOKUP macro 2nd most problematic
Outline • Lookup Transformations • Lookup + Syntactic Transformations • Case Studies
Demo Table Lookup Transformations
Learning Framework • Input strings → F → Output string (F ranges over F1 … Fn) • 1. Domain-specific language L • 2. Algorithm to learn all Fs from (i, o)
Example - Lookup Select(Name, EmpRecord, (SSN = v1))
Example – Transitive Lookup Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, Item = v1)))
Learn Query Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, Item = v1)))
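The lookup expressions on the preceding slides can be read as nested function calls. A minimal Python sketch, assuming small in-memory tables; the table rows, the `select` helper, and the wrapper function names are hypothetical illustrations, not the talk's data or implementation:

```python
# Hypothetical toy tables; the rows are illustrative only.
TABLES = {
    "EmpRecord": [{"SSN": "044-58-3429", "Name": "Steve Russell"}],
    "ItemRec": [{"Item": "Stroller", "ItemId": "S30"}],
    "PriceRec": [{"ItemId": "S30", "Price": "$145.67"}],
}


def select(column, table, key_column, key_value):
    """Return `column` of the single row whose `key_column` equals `key_value`.

    Returns None when no row, or more than one row, matches.
    """
    rows = [r for r in TABLES[table] if r[key_column] == key_value]
    return rows[0][column] if len(rows) == 1 else None


def lookup_name(v1):
    # Select(Name, EmpRecord, (SSN = v1))
    return select("Name", "EmpRecord", "SSN", v1)


def lookup_price(v1):
    # Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, Item = v1)))
    return select("Price", "PriceRec", "ItemId",
                  select("ItemId", "ItemRec", "Item", v1))
```

The transitive lookup is simply a `select` whose key value is itself the result of another `select`, which is why the inner expression appears in the key position.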
Synthesis Algorithm • Input: (input state, output string) • Output: all conforming expressions • Reachability algorithm from input strings
Reachable strings: strings in table rows of visited nodes • Repeat until k steps or fixpoint
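The reachability idea can be sketched as a bounded breadth-first search. This is a simplified illustration under stated assumptions, not the algorithm from the talk: it treats every column as a possible key (the slides speak of candidate keys), matches cells by exact equality, and keeps only the first expression found for each string instead of all conforming expressions. The function name and the toy tables are hypothetical.

```python
def reachable_lookups(inputs, tables, k):
    """Map each string reachable within k lookup steps to one expression for it."""
    reached = {value: f"v{i + 1}" for i, value in enumerate(inputs)}
    frontier = dict(reached)
    for _ in range(k):
        new = {}
        for value, expr in frontier.items():
            for table_name, rows in tables.items():
                for row in rows:
                    for key_col, cell in row.items():
                        if cell != value:
                            continue
                        # The row is visited: every other cell becomes reachable.
                        for out_col, out in row.items():
                            if out_col != key_col and out not in reached and out not in new:
                                new[out] = f"Select({out_col}, {table_name}, {key_col} = {expr})"
        if not new:  # fixpoint: nothing new was reached in this step
            break
        reached.update(new)
        frontier = new
    return reached


# Hypothetical toy tables.
ITEM_TABLES = {
    "ItemRec": [{"Item": "Stroller", "ItemId": "S30"}],
    "PriceRec": [{"ItemId": "S30", "Price": "$145.67"}],
}
two_steps = reachable_lookups(["Stroller"], ITEM_TABLES, k=2)
```

With these toy tables the price only becomes reachable in the second step, through the item id found in the first.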
Sound and k-complete • t: number of reachable strings • p: number of candidate keys • m: maximum size of a candidate key
Data structure • Maintains tree structure • share common sub-expressions • CNF of Boolean Conditionals • independent column predicates
Synthesize Procedure
Synthesize((i1,o1), …, (in,on)):
  P = GenerateStr_t(i1, o1)
  for j = 2 to n:
    P' = GenerateStr_t(ij, oj)
    P = Intersect_t(P', P)
  return P
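The shape of this loop can be illustrated with a deliberately simplified Python sketch. It assumes an explicit set of depth-1 lookups instead of the shared data structure the talk describes, so `generate_str` and `intersect` here are hypothetical stand-ins, and the table `T` is made up:

```python
def generate_str(tables, inp, out):
    """All depth-1 lookups (out_col, table, key_col) that map `inp` to `out`."""
    programs = set()
    for table_name, rows in tables.items():
        for row in rows:
            for key_col, cell in row.items():
                if cell != inp:
                    continue
                for out_col, value in row.items():
                    if out_col != key_col and value == out:
                        programs.add((out_col, table_name, key_col))
    return programs


def intersect(p, q):
    """With explicit sets, intersection is plain set intersection."""
    return p & q


def synthesize(tables, examples):
    (i1, o1), *rest = examples
    programs = generate_str(tables, i1, o1)
    for inp, out in rest:
        programs = intersect(generate_str(tables, inp, out), programs)
    return programs


# Hypothetical table in which columns X and Y agree on the first row only.
EXAMPLE_TABLES = {"T": [{"Id": "a", "X": "1", "Y": "1"},
                        {"Id": "b", "X": "2", "Y": "3"}]}
one_example = synthesize(EXAMPLE_TABLES, [("a", "1")])
two_examples = synthesize(EXAMPLE_TABLES, [("a", "1"), ("b", "2")])
```

One example leaves two consistent lookups; the second example rules out the lookup through column Y, which is the role intersection plays in the procedure.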
Demo Semantic String Transformations
Syntactic String Language [GulwaniPOPL11]
Combined Language Syntactic manipulations over lookup outputs Syntactic manipulations before indexing
Synthesis Algorithm: • Reachability based on syntactic string matches • Boolean conditionals
{ “SSN: 044-58-3429”, “044-58-3429”, “1125”, “Steve Russell” } Set of reachable strings
{ “SSN: 044-58-3429”, “044-58-3429”, “1125”, “Steve Russell” } and in paper
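One way to picture reachability with syntactic matches is to visit a table row whenever one of its cells occurs as a substring of an already reached string. The sketch below assumes that relaxation, which is much weaker than the full syntactic language, and a made-up schema (`EmpSSN`, `EmpName`) arranged so that the four strings above come out; only the strings themselves appear on the slide. Cells are assumed to be non-empty.

```python
def reachable_with_substrings(input_string, tables, k):
    """Strings reachable in at most k steps, visiting rows via substring matches."""
    reached = [input_string]
    for _ in range(k):
        snapshot = list(reached)  # only strings known before this step can trigger a visit
        added = False
        for rows in tables.values():
            for row in rows:
                if any(cell in s for cell in row.values() for s in snapshot):
                    for cell in row.values():
                        if cell not in reached:
                            reached.append(cell)
                            added = True
        if not added:  # fixpoint
            break
    return reached


# Hypothetical schema chosen to reproduce the set of strings shown above.
EMP_TABLES = {
    "EmpSSN": [{"SSN": "044-58-3429", "EmpId": "1125"}],
    "EmpName": [{"EmpId": "1125", "Name": "Steve Russell"}],
}
reached_k2 = reachable_with_substrings("SSN: 044-58-3429", EMP_TABLES, k=2)
```

The first step strips the "SSN: " prefix implicitly by matching the table cell inside the input; the second step follows the employee id to the name.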
Experiments • 50 benchmark problems • 12 , 38 • ~10^20 consistent expressions • Size of data structure: ~2000 • Performance: 96% in less than 1 second • Ranking: at most 3 examples (95% 2 examples)
Related Work • Matching strings for table joins • Record Matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06] • Schema Matching [Dhamankar et al. SIGMOD04, Warren & Tompa VLDB06] • Query Synthesis • from representative view [Das Sarma et al. ICDT10, Tran et al. SIGMOD09] • Text-editing by example • QuickCode [Gulwani POPL11] • SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]
Thanks! Algorithm Designers Software Developers End-Users Large potential
Semantic String Transformations =TEXT(C,"00 00")+0
Idea 1: Share sub-expressions e Select(C2, T1, C1=v1) Select(C3, T2, C1=e) Select(C2, T3, C1=Select(C2,T2,C1=e))
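Sharing can be pictured as interning: each distinct expression is stored once and referred to by an id wherever it recurs. The sketch below is one hypothetical way to do this (the `ExprPool` class is not from the talk), and it assumes that `e` on the slide names the expression Select(C2, T1, C1=v1):

```python
class ExprPool:
    """Stores each distinct expression once and hands out integer ids."""

    def __init__(self):
        self._ids = {}
        self.nodes = []

    def _intern(self, node):
        if node not in self._ids:
            self._ids[node] = len(self.nodes)
            self.nodes.append(node)
        return self._ids[node]

    def var(self, name):
        return self._intern(("var", name))

    def select(self, out_col, table, key_col, key_expr_id):
        # key_expr_id refers to an already interned sub-expression.
        return self._intern(("select", out_col, table, key_col, key_expr_id))


pool = ExprPool()
v1 = pool.var("v1")
e = pool.select("C2", "T1", "C1", v1)         # assumed meaning of e
first = pool.select("C3", "T2", "C1", e)      # Select(C3, T2, C1=e)
inner = pool.select("C2", "T2", "C1", e)      # Select(C2, T2, C1=e)
second = pool.select("C2", "T3", "C1", inner) # Select(C2, T3, C1=Select(C2,T2,C1=e))
```

Both larger expressions point at the same node for `e`, so it is stored once no matter how many expressions contain it.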
YouTube Videos French Polish Urdu German Serbian Russian http://bit.ly/flashfill
: string value : set of lookup programs to generate
Related Work • Record Matching • Similarity functions for matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06] • Customizable similarity function [Arasu et al. VLDB09] • Learning Schema Matches • iMAP [Dhamankar et al. SIGMOD04]: concatenation of column strings using domain-specific knowledge • [Warren & Tompa VLDB06]: concatenation of column substrings, single table
Related Work • Query Synthesis [Das Sarma et al. ICDT10, Tran et al. SIGMOD09] • Infer relation from large representative example view • no joins or projections • Text-editing using examples • QuickCode [Gulwani POPL11]: string transformations • SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]: programming by demonstration
General Framework • A Domain-specific Transformation Language L • Expressive and succinct • Efficient Data structures for set of expressions • Version-space algebra • GenerateStr • All sets of expressions from I-O example • Intersect • Intersect two sets of expressions
Example – Transitive Lookups Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, Item = v1)))