Senbazuru: A Prototype Spreadsheet Database Management System

Senbazuru: A Prototype Spreadsheet Database Management System • Zhe Chen, Michael Cafarella, Jun Chen, Daniel Prevo, Junfeng Zhuang • University of Michigan • Ashleigh Davis

Outline • Introduction • System framework • User Interface • Conclusion

Introduction • Spreadsheets • Critical data management tool • Used for tasks usually associated with relational systems • Lack explicit relational metadata • Senbazuru • Web crawl using the ClueWeb09 corpus

Extracting relational data from spreadsheets allows database management tools to be applied to spreadsheet data • Recent work: • Perform DB-style operations inside the spreadsheet interface • Transform spreadsheet data into the relational model • Some extraction systems require explicit rules from the user

Technical challenges in relational data extraction: • Extraction • Repair • Senbazuru: • Three functional components: • Search • Extract • Query • Desktop web application • iPad application

System Framework • Search component: • Aids in retrieval of relevant data sets in a large corpus • Indexes more than 1,800 spreadsheets collected from the U.S. Census Bureau • Indexer • Python to extract text from each cell • Apache Lucene to index the text • Searcher • Uses an inverted index and TF-IDF-style ranking to sort datasets by relevance • Results comprise potentially useful objects
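The searcher's two ideas, an inverted index and TF-IDF-style ranking, can be sketched in a few lines of plain Python. This is an illustration only: it is not Lucene's scoring formula, and the file names, cell text, and the helper names `tokenize`, `build_index`, and `search` are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic whitespace tokenizer; a real indexer would do much more.
    return text.lower().split()


def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms get a higher weight than common ones.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical text extracted from the cells of three spreadsheets.
docs = {
    "pop.xls": "population by state and year",
    "income.xls": "median household income by state",
    "housing.xls": "housing units by county",
}
index = build_index(docs)
ranking = search(index, len(docs), "population state")
```

Here `pop.xls` matches both query terms and ranks first, `income.xls` matches only the more common term "state", and `housing.xls` is not returned at all.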

System Framework • Extract component: • Frame finder • Receives the raw spreadsheet • Identifies data frame structures • Uses a conditional random field (CRF) • Passes the data frame to the next stage • Hierarchy extractor • Recovers attribute hierarchies • Determines which attributes describe which other attributes
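To make "which attributes describe which other attributes" concrete, the sketch below recovers a parent for each attribute label from its indentation level. Indentation is an assumed, simplified signal chosen for illustration; the slides do not say which features the hierarchy extractor uses, and the labels and the function name `build_hierarchy` are hypothetical.

```python
def build_hierarchy(rows):
    """rows: list of (indent_level, label) in top-to-bottom order.

    Returns {label: parent_label or None}. Labels are assumed unique.
    """
    parents = {}
    stack = []  # (indent, label) of the currently open ancestors
    for indent, label in rows:
        # Close every ancestor that is not strictly less indented.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents[label] = stack[-1][1] if stack else None
        stack.append((indent, label))
    return parents


# Hypothetical left-hand attribute column of a census-style table.
rows = [
    (0, "Total"),
    (1, "Male"),
    (2, "Under 18"),
    (2, "18 and over"),
    (1, "Female"),
]
parents = build_hierarchy(rows)
```

In this toy input, "Under 18" and "18 and over" are described by "Male", while "Male" and "Female" are both described by "Total".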

System Framework • Extract component: • Tuple builder • Generates a relational tuple for each value in the value region • Values are annotated with relevant attributes from the attribute regions • Relation constructor • Assembles relational tuples into relational tables • Clusters attributes in different tuples into consistent columns and labels the columns
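The tuple builder step can be sketched as follows: each cell in the value region becomes one tuple, annotated with the attributes of its row and its column. The data layout (`left_paths`, `top_labels`, `values`) and all numbers are invented for illustration and are not taken from the system.

```python
def build_tuples(left_paths, top_labels, values):
    """Emit one relational tuple per cell of the value region.

    left_paths[i]: attributes (root first) annotating row i.
    top_labels[j]: attribute annotating column j.
    values[i][j]:  the cell at row i, column j of the value region.
    """
    tuples = []
    for i, row in enumerate(values):
        for j, value in enumerate(row):
            tuples.append(tuple(left_paths[i]) + (top_labels[j], value))
    return tuples


# Toy data frame: two rows under "Male", two year columns.
left_paths = [["Male", "Under 18"], ["Male", "18 and over"]]
top_labels = ["2009", "2010"]
values = [[10, 12], [30, 31]]

tuples = build_tuples(left_paths, top_labels, values)
```

The first tuple produced is `("Male", "Under 18", "2009", 10)`. A relation constructor would then align these positions into labeled columns.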

System Framework • Query component: • User is allowed to apply relational operators on derived data • Implemented operators: • Select • Join
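A minimal sketch of the two operators over extracted relations, assuming each relation is a list of dictionaries keyed by column name. The representation, the function names `select` and `join`, and the toy values are assumptions for illustration, not the system's implementation.

```python
def select(relation, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]


def join(left, right, key):
    """Equi-join on a column name shared by both relations (hash join)."""
    buckets = {}
    for r in right:
        buckets.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in buckets.get(l[key], []):
            merged = dict(r)
            merged.update(l)  # left-hand values win on column-name clashes
            out.append(merged)
    return out


# Toy relations with made-up values.
population = [
    {"state": "MI", "population": 100},
    {"state": "OH", "population": 200},
]
income = [
    {"state": "MI", "income": 45},
    {"state": "TX", "income": 50},
]

filtered = select(population, lambda row: row["population"] > 150)
joined = join(population, income, "state")
```

Here `filtered` keeps only the OH row, and `joined` contains a single MI row carrying both `population` and `income`.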

User Interface • Search • User enters keywords in the search box • System returns a list of relevant spreadsheets • User can examine the top hits or other relevant spreadsheets • Extract • After selecting the most relevant spreadsheet, the user selects extract to transform the data into a relational table • Data tree • User can repair extraction errors

User Interface • Query • Select • Called filter • User specifies filter conditions • Join • Integrates two arbitrary spreadsheet-derived relations • System suggests possible joins and indicates the join key
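The slides do not say how candidate joins are found. One plausible approach, sketched here purely as an assumption, is to compare the value sets of every column pair and suggest pairs whose Jaccard overlap is high. The function name `suggest_join_keys`, the threshold, and the data are hypothetical.

```python
def suggest_join_keys(left, right, min_overlap=0.5):
    """Suggest (left_col, right_col, score) pairs that could serve as join keys.

    left, right: {column_name: list of values}.
    score is the Jaccard overlap of the two columns' distinct values.
    """
    suggestions = []
    for lcol, lvals in left.items():
        for rcol, rvals in right.items():
            a, b = set(lvals), set(rvals)
            if not a or not b:
                continue
            score = len(a & b) / len(a | b)
            if score >= min_overlap:
                suggestions.append((lcol, rcol, score))
    # Best candidates first.
    return sorted(suggestions, key=lambda s: -s[2])


# Two toy relations whose key columns are named differently.
left = {"state": ["MI", "OH", "TX"], "population": [100, 200, 300]}
right = {"State name": ["MI", "OH", "CA"], "income": [45, 50, 55]}

suggestions = suggest_join_keys(left, right)
```

With this input the only suggestion is to join `state` with `State name`, since the two columns share two of four distinct values.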

Conclusion • Senbazuru searches numerous Web crawl spreadsheets • Automatically transforms spreadsheets into relations • Allows users to fix extraction errors • Supports selection queries on resulting relations • Supports join queries to integrate arbitrary spreadsheets

References • Z. Chen, M. Cafarella, J. Chen, D. Prevo, J. Zhuang. Senbazuru: A Prototype Spreadsheet Database Management System. In VLDB ’13: Proceedings of the VLDB Endowment, Vol. 6, No. 12, Riva del Garda, Trento, Italy. VLDB Endowment. • http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/praveen-clueweb.pdf
