Succinct Data Structures
Group: Veli Mäkinen, Niko Välimäki, Jouni Siren
Collaboration: Gonzalo Navarro, Paolo Ferragina, Giovanni Manzini, Johannes Fischer, Wolfgang Gerlach, ...
Funding
• Academy of Finland:
  • "Theory and Practice of Succinct Data Structures", 2005-2007
  • "Self-indexes in permanent storage", 2007-2012
• FDK/ALGODAN
http://www.cs.helsinki.fi/group/suds
Motivation: Philosophy, Practice, ...
• "Maps bigger than the empire" paradox (Apostolico, 2001):
  • Analyzing data typically requires data structures that occupy more space than the data being analyzed.
• Practical demand in applications:
  • The suffix tree of the human genome takes >200 GB of main memory.
Motivation: ..., Theory
• Studying the space demand sometimes reveals fundamental new principles:
  • "Backwards search" (Ferragina & Manzini, 2000)
  • "Indexing equals compression" (Grossi, Gupta, Vitter, 2004)
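The "backwards search" idea can be illustrated with a minimal Python sketch. It counts the occurrences of a pattern using only the Burrows-Wheeler transform (BWT) of the text and a table of symbol counts, processing the pattern from its last symbol to its first. This is a teaching sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the suffix array is built naively, `rank` is a linear scan rather than a succinct structure, and the text is assumed to contain no `$` and only symbols that sort after `$`.

```python
def bwt_via_suffix_array(text):
    """Return the BWT of text + '$', using a naively sorted suffix array."""
    t = text + "$"  # assumption: '$' sorts before every symbol of text
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # The BWT symbol of a suffix is the symbol preceding it (cyclically).
    return "".join(t[i - 1] for i in sa)


def build_counts(bwt):
    """C[c] = number of symbols in the text that are strictly smaller than c."""
    counts = {}
    total = 0
    for c in sorted(set(bwt)):
        counts[c] = total
        total += bwt.count(c)
    return counts


def rank(bwt, c, i):
    """Occurrences of c in bwt[:i]. Linear scan, for illustration only."""
    return bwt[:i].count(c)


def backward_search(bwt, counts, pattern):
    """Count occurrences of pattern, reading it from right to left."""
    sp, ep = 0, len(bwt)  # half-open range of suffix array rows
    for c in reversed(pattern):
        if c not in counts:
            return 0
        sp = counts[c] + rank(bwt, c, sp)
        ep = counts[c] + rank(bwt, c, ep)
        if sp >= ep:
            return 0
    return ep - sp


bwt = bwt_via_suffix_array("ACACACGT")
counts = build_counts(bwt)
print(backward_search(bwt, counts, "ACACAC"))  # number of occurrences of ACACAC
```

Each pattern symbol costs two `rank` queries, so the search time depends on how fast `rank` is answered on the BWT, not on the text length directly.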
Motivation: ..., Practice
[Diagram: File is compressed either with zip into File.zip or with a self-index into File.sfi. The query "Does ACACAC occur in File?" is answered "Yes" on File and on File.sfi directly; File.zip must first be unzipped.]
Results / Theory
• Entropy-optimal self-index (Ferragina & Manzini & Mäkinen & Navarro, ACM TALG 2007)
• Implicit compression boosting (Mäkinen & Navarro, SPIRE 2007):
  • Entropy-optimal self-index for a dynamic set of sequences.
Results / Practice
• Implementation of a compressed suffix tree (Välimäki et al., WEA 2007):
  • Takes 1/10 of the space of a normal suffix tree.
• Space-efficient document retrieval (Välimäki & Mäkinen, CPM 2007)
Implicit compression boosting
Mäkinen & Navarro, SPIRE 2007:
[Figure: BWT, wavelet tree; space ~nH_k]
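The wavelet tree named on this slide can be sketched as follows. This is a minimal pointer-based Python version that answers rank queries (how many times a symbol occurs in a prefix) over a sequence such as a BWT. The class name and layout are hypothetical, and the bitvectors are stored as plain lists with prefix sums; reaching space close to nH_k relies on compressed bitvectors, which this sketch does not attempt.

```python
class WaveletTree:
    """Pointer-based wavelet tree supporting rank(c, i) on a sequence."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = None
        if len(self.alphabet) <= 1:
            return  # leaf: every symbol routed here is the same
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # Bit 0: symbol belongs to the left half of the alphabet, bit 1: right half.
        self.bits = [0 if c in left_syms else 1 for c in seq]
        # prefix[i] = number of 1-bits in bits[:i]
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        self.left = WaveletTree(
            [c for c in seq if c in left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if c not in self.alphabet:
            return 0
        node = self
        while node.bits is not None:
            mid = len(node.alphabet) // 2
            ones = node.prefix[i]
            if c in node.alphabet[:mid]:
                i -= ones          # keep only the 0-bits seen so far
                node = node.left
            else:
                i = ones           # keep only the 1-bits seen so far
                node = node.right
        return i


wt = WaveletTree("T$CCAAACG")  # BWT of "ACACACGT$"
print(wt.rank("C", 4))         # occurrences of C in "T$CC"
```

A rank query visits one node per tree level, so it needs about log |Σ| bitvector lookups on a balanced tree.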
Document retrieval
Field                            Problem              Solution
Information Retrieval            Document Retrieval   Inverted Index
Combinatorial Pattern Matching   Text Indexing        Suffix tree
[Diagram labels: [Sad07 & VM07], [PST06], [Mut02]; "practice: space limits"; "theory: time limits"]
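To make the document retrieval problem concrete, here is a naive Python sketch that lists the documents containing a pattern as a substring, using a sorted list of all suffixes of all documents. It only shows what a text index answers that a word-based inverted index does not. The function names are hypothetical, the index stores every suffix explicitly (quadratic space), and it is not the method of any of the cited papers.

```python
from bisect import bisect_left


def build_suffix_index(docs):
    """Sorted list of (suffix, doc_id) pairs over all documents.

    Quadratic space; for illustration only.
    """
    return sorted(
        (doc[i:], doc_id)
        for doc_id, doc in enumerate(docs)
        for i in range(len(doc))
    )


def list_documents(index, pattern):
    """Return the sorted ids of documents that contain pattern."""
    # All suffixes starting with pattern form one contiguous run in sorted order.
    k = bisect_left(index, (pattern, -1))
    found = set()
    while k < len(index) and index[k][0].startswith(pattern):
        found.add(index[k][1])
        k += 1
    return sorted(found)


docs = ["ACACAC", "GATTACA", "TTTT"]
index = build_suffix_index(docs)
print(list_documents(index, "AC"))  # ids of documents containing "AC"
```

The loop runs once per occurrence of the pattern, even though the answer has at most one entry per document; avoiding that gap is the kind of time limit the theory side of the slide refers to.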
Compressed suffix tree
• We offer the first implementation of a compressed suffix tree.
• Construction time is O(n log n log |Σ|).
• Uses no more than O(n log |Σ|) bits of extra working space.
• Each operation is supported in time ranging from O(1) to O(log n log |Σ|), depending on the operation.
10 times less space!
30 times slower, but...
Recommended reading
Ferragina & Manzini & Mäkinen & Navarro: Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms, Vol. 3, Issue 2, Article 20, May 2007.
Mäkinen & Navarro: Implicit Compression Boosting with Applications to Self-Indexing. To appear in Proc. 14th Symposium on String Processing and Information Retrieval (SPIRE 2007), Santiago, Chile, October 29-31, 2007.
Navarro & Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys, Vol. 39, No. 1, Article 2, 2007.
Välimäki & Gerlach & Dixit & Mäkinen: Compressed Suffix Tree - A Basis for Genome-scale Sequence Analysis. Bioinformatics, 23:629-630, 2007.
Välimäki & Gerlach & Dixit & Mäkinen: Engineering a Compressed Suffix Tree Implementation. In Proc. 6th Workshop on Experimental Algorithms (WEA 2007), Springer-Verlag LNCS 4525, pp. 217-228, Rome, Italy, June 6-8, 2007.