Advanced Algorithms for Massive Dataset Compression and Indexing

Rossano Venturini
Dipartimento di Informatica, Università di Pisa

Algorithms and Data Structures for Massive Datasets (Acube Lab)
Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia, Giovanni Micale, Francesco Piccinno, Giorgio Audrito

A3 Lab (acube.di.unipi.it)
Algorithms and data structures for massive datasets
• Data Compression
• Compressed Indexing
• Web or arbitrary texts
• Storage and analysis of massive graphs
• Information Retrieval on news, tweets, …
Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent

Social Networks and Social Data
• Given an idea, you need the right platform to implement it:
  • HW + SW (IT Center)
  • Algorithms (our Lab)
• Graph structure + Textual Content
  • Nodes = users (~1 billion)
  • Explicit edges = friend, follower, retweet, +1, … (~10 billion)
  • Implicit edges = similarity, co-occurrence, click, … (» 100 billion)

Data Compression: Theory & Engineering
J. ACM ’05, ACM-SIAM SODA ’09-’14, ACM WSDM ’10, ESA ’11-’14, Algorithmica ’12, SIAM J. Computing ’13
Key issue:
• Minimize space occupancy
• Maximize decompression speed
A new algorithmic concept: multi-objective design of compressors
Two interesting scenarios:
• Energy-efficiency issues
• Cloud computing
Can we fix the space occupancy and minimize the decompression time? Or, vice versa?
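The question above (fix one resource, optimize the other) can be illustrated with a small selection problem. This is only a sketch under stated assumptions: off-the-shelf codecs from the Python standard library stand in for the compressors, and the helper names `profile`, `fastest_within_budget` and `smallest_within_time` are hypothetical. It does not reproduce the multi-objective compressor design the slide refers to.

```python
import bz2
import lzma
import time
import zlib


def profile(data):
    """Compress `data` with a few stdlib codecs.

    Returns a list of (name, compressed_size_in_bytes, decompression_seconds).
    """
    codecs = {
        "zlib-1": (lambda d: zlib.compress(d, 1), zlib.decompress),
        "zlib-9": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = []
    for name, (compress, decompress) in codecs.items():
        blob = compress(data)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        assert restored == data  # sanity check: the codec is lossless
        results.append((name, len(blob), elapsed))
    return results


def fastest_within_budget(results, max_bytes):
    """Fix the space occupancy, minimize the decompression time."""
    feasible = [r for r in results if r[1] <= max_bytes]
    return min(feasible, key=lambda r: r[2]) if feasible else None


def smallest_within_time(results, max_seconds):
    """Vice versa: fix the decompression time, minimize the space."""
    feasible = [r for r in results if r[2] <= max_seconds]
    return min(feasible, key=lambda r: r[1]) if feasible else None
```

Timings from `profile` depend on the machine and the input, so the choice returned for a given budget is not fixed in advance.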

Compressed Indexing: Theory & Engineering
J. ACM ’05, ACM SIGIR ’07, J. ACM ’09, ACM Trans. Algo. ’10, ESA ’13, ACM-SIAM SODA ’13, … and many others
Key issue:
• Minimize space occupancy
• Maximize substring-search throughput
Suffix-array compressible «-» Bzip searchable
December 2003
• Performance over hundreds of MBs on a commodity PC
• Count(P) takes 5 microsecs/char, taking about bzip’s space
• Locate(P) outputs 100K occ/sec, taking +10% space
• This may be 4x faster than IL, within <35% space occupancy
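To make the two queries concrete, here is a minimal sketch of Count(P) and Locate(P) over a plain, uncompressed suffix array. It is an assumption-laden illustration: the function names are hypothetical, the construction is naive, and it does not achieve the compressed space bounds quoted above. It only shows what the queries return.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text.

    Quadratic-ish cost; fine for a small illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _pattern_range(text, sa, pattern):
    """Return [first, last) such that sa[first:last] are the suffixes
    prefixed by `pattern`, found with two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def count(text, sa, pattern):
    """Count(P): number of occurrences of `pattern` in `text`."""
    first, last = _pattern_range(text, sa, pattern)
    return last - first


def locate(text, sa, pattern):
    """Locate(P): sorted starting positions of `pattern` in `text`."""
    first, last = _pattern_range(text, sa, pattern)
    return sorted(sa[first:last])
```

For example, with `text = "mississippi"` and `sa = build_suffix_array(text)`, `count(text, sa, "ssi")` is 2 and `locate(text, sa, "ssi")` is `[2, 5]`.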

Compressed Indexing: Theory & Engineering
NoSQL DB
The <key,value> problem:
• Trie: 14x more space than input data
• Front-coding & two-level indexing:
  • 110% of input data
  • 4 microsecs/char
• Our Compressed Permuterm:
  • < 25% of input data, i.e. close to bzip2
  • 1060 microsecs/char
• So, time close to FC but one-fourth of its space
Under Y!-patenting
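Front-coding (FC), the baseline mentioned above, stores each key of a sorted dictionary as the length of the prefix it shares with the previous key plus the remaining suffix. The sketch below is a minimal illustration with hypothetical function names; the two-level indexing layer (bucketing with an uncompressed first key per bucket) is omitted, and the Compressed Permuterm is not shown.

```python
def front_encode(sorted_keys):
    """Encode lexicographically sorted keys as (shared_prefix_len, suffix) pairs."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        # Length of the longest common prefix with the previous key.
        lcp = 0
        limit = min(len(prev), len(key))
        while lcp < limit and prev[lcp] == key[lcp]:
            lcp += 1
        encoded.append((lcp, key[lcp:]))
        prev = key
    return encoded


def front_decode(encoded):
    """Rebuild the original keys by replaying the pairs in order."""
    keys = []
    prev = ""
    for lcp, suffix in encoded:
        key = prev[:lcp] + suffix
        keys.append(key)
        prev = key
    return keys
```

With the keys `["compress", "compression", "compressor", "index"]`, `front_encode` yields `[(0, "compress"), (8, "ion"), (8, "or"), (0, "index")]`. Decoding must scan from the start of the list, which is the reason practical layouts split keys into buckets.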

We know how to “manage” everything…

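Information Retrieval
Vector Space model (TF-IDF vectors)
[Figure: vectors v and w in the term space with axes t1, t2, t3, separated by angle a; sample vector entries 2.2, 5.1, 9.1, 1.0, 0.1]
Similarity(v,w) ≈ cos(a)
Example text: “Diego Maradona won against Mexico”
Dictionary: against, Diego, Maradona, Mexico, won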
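The vector space model above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and one common smoothed IDF variant, log((1 + N) / (1 + df)) + 1; the slide does not specify a weighting, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: freq * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, freq in tf.items()
        })
    return vectors


def cosine(v, w):
    """Similarity(v, w) = cos(a) for the angle a between the two vectors."""
    dot = sum(weight * w.get(term, 0.0) for term, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0
    return dot / (norm_v * norm_w)
```

Because the model treats a text as a bag of words, “Diego Maradona won against Mexico” and “Mexico won against Diego Maradona” get identical vectors and a cosine of 1, even though they state opposite outcomes.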

Topic Annotators
• “Diego Maradona won against Mexico”
  • “Diego Maradona” → the soccer player
  • “Mexico” → Mexico soccer team
• Detect mentions and annotate them with an entity/topic extracted from a catalog: Wikipedia!
• We serve about 170k requests/day
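The mention-detection step can be illustrated with a toy dictionary-based spotter. This is only a sketch under loud assumptions: `CATALOG` is a tiny hand-made stand-in for a Wikipedia-derived catalog, `annotate` is a hypothetical name, and the greedy longest-match strategy is a simplification. It performs no disambiguation (for instance, choosing between Mexico the country and the soccer team), which is the hard part a real annotator has to solve.

```python
# Hypothetical, hand-made catalog: lowercase surface form -> entity label.
CATALOG = {
    "diego maradona": "Diego Maradona",
    "mexico": "Mexico national football team",
}


def annotate(text, catalog, max_len=3):
    """Greedy longest-match spotting of catalog mentions in `text`.

    Returns a list of (mention, entity) pairs in order of appearance.
    """
    tokens = text.split()
    annotations = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                matched = n
                break
        # Skip past the matched span, or move on by one token.
        i += matched if matched else 1
    return annotations
```

Calling `annotate("Diego Maradona won against Mexico", CATALOG)` returns the two mentions “Diego Maradona” and “Mexico” paired with their catalog entries.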

Details at http://acube.di.unipi.it/tagme
• Paper at ACM WSDM 2012
• Paper at IEEE Software 2012
• Paper at ECIR 2012
