Ranking Ida Mele
Introduction
• The set of software components for the management of large data sets consists of:
  • MG4J,
  • Fastutil,
  • the DSI Utilities,
  • Sux4J,
  • WebGraph,
  • the LAW software.
• These software components have been developed by the DSI of the University of Milan.
Fastutil
• Fastutil 6 is free software, developed in Java.
• Technical requirement:
  • Java >= 6
• Useful links:
  • http://fastutil.di.unimi.it/
  • http://fastutil.di.unimi.it/docs/
Fastutil
• Fastutil extends Java Collections, and it provides:
  • Type-specific maps, sets, and lists;
  • Priority queues with a small memory footprint and fast access and insertion;
  • 64-bit arrays, sets, and lists;
  • Fast I/O classes for text and binary files.
Fastutil
• Advantages in using Fastutil:
  • The classes of Fastutil are implemented to work efficiently on huge collections of data.
  • Fastutil provides a new set of classes to deal with collections whose size exceeds 2^31.
Fastutil
• Advantages in using Fastutil:
  • There are additional features (e.g., bidirectional iterators) that are not available in the standard classes.
  • Classes can be plugged into existing code, because they implement their standard counterparts (e.g., Map for maps).
Fastutil: Big Arrays
• BigArrays: a class that provides static methods and objects for working with big arrays.
• Big arrays are arrays-of-arrays. For example, a big array of integers has type int[][].
• Methods handle these arrays-of-arrays as if they were one-dimensional arrays with 64-bit indices.
• The length of a big array is bounded by Long.MAX_VALUE rather than Integer.MAX_VALUE.
Fastutil: Big Arrays
• Given a big array a, the arrays a[0], a[1], …, a[n] are called segments. Each one has length SEGMENT_SIZE (the last segment can be shorter).
• Each index i is associated with a segment and a displacement into the segment.
• The methods segment and displacement compute the segment and the displacement associated with a given index.
• The method index receives a segment and a displacement and returns the corresponding index.
• The methods get and set return and set the value of a given element of the big array.
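The index arithmetic described above can be sketched in plain Java. This is a minimal illustration and not Fastutil's source code: the class name BigArraySketch and the helper newBigArray are hypothetical, and the segment size is deliberately tiny (16 elements) so that the example runs with almost no memory.

```java
// A sketch of big-array index arithmetic; NOT Fastutil's actual code.
final class BigArraySketch {
    // Hypothetical, deliberately small segment size: 2^4 = 16 elements.
    static final int SEGMENT_SHIFT = 4;
    static final int SEGMENT_SIZE = 1 << SEGMENT_SHIFT;
    static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    // Segment that contains the element with the given 64-bit index.
    static int segment(long index) {
        return (int) (index >>> SEGMENT_SHIFT);
    }

    // Position of the element inside its segment.
    static int displacement(long index) {
        return (int) (index & SEGMENT_MASK);
    }

    // Inverse mapping: from (segment, displacement) back to the index.
    static long index(int segment, int displacement) {
        return ((long) segment << SEGMENT_SHIFT) + displacement;
    }

    // Hypothetical helper: allocates an array-of-arrays of the given length;
    // only the last segment may be shorter than SEGMENT_SIZE.
    static long[][] newBigArray(long length) {
        int segments = (int) ((length + SEGMENT_MASK) >>> SEGMENT_SHIFT);
        long[][] a = new long[segments][];
        for (int s = 0; s < segments; s++) {
            long remaining = length - index(s, 0);
            a[s] = new long[(int) Math.min(SEGMENT_SIZE, remaining)];
        }
        return a;
    }

    static long get(long[][] a, long index) {
        return a[segment(index)][displacement(index)];
    }

    static void set(long[][] a, long index, long value) {
        a[segment(index)][displacement(index)] = value;
    }

    public static void main(String[] args) {
        long[][] a = newBigArray(40); // segments of length 16, 16, 8
        set(a, 37, 99L);
        // prints "2 5 99"
        System.out.println(segment(37) + " " + displacement(37) + " " + get(a, 37));
    }
}
```

Using a power of two as segment size turns the division and the remainder into a shift and a mask, which is why the sketch stores a shift rather than a size.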
Fastutil Big Arrays - example
• We want to scan the big array a.
• First solution:
    for( int s = 0; s < a.length; s++ ) {
        final int[] t = a[ s ];
        for( int d = 0; d < t.length; d++ ) {
            // do something with t[ d ]
        }
    }
Fastutil Big Arrays - example
• Second solution:
    for( int s = a.length; s-- != 0; ) {
        final int[] t = a[ s ];
        for( int d = t.length; d-- != 0; ) {
            // do something with t[ d ]
        }
    }
Fastutil Big Arrays - example
• Third solution:
    for( int s = a.length; s-- != 0; ) {
        final long[] t = a[ s ];
        for( int d = t.length; d-- != 0; ) t[ d ] = index( s, d );
    }
• We can use the index method, which returns the index associated with a segment and a displacement.
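The scans above can be tried on a small array-of-arrays without the library. The following is only a sketch: the class BigArrayScan, its local index method, and the 8-element segment size are assumptions made for the example, standing in for the static methods of BigArrays.

```java
// A sketch that runs the scans on a tiny big array; NOT Fastutil code.
final class BigArrayScan {
    // Hypothetical segment size for the example: 2^3 = 8 elements.
    static final int SEGMENT_SHIFT = 3;

    // Stand-in for the index method: (segment, displacement) -> 64-bit index.
    static long index(int segment, int displacement) {
        return ((long) segment << SEGMENT_SHIFT) + displacement;
    }

    // Backward scan (third solution): each cell receives its own index.
    static void fillWithIndices(long[][] a) {
        for (int s = a.length; s-- != 0; ) {
            final long[] t = a[s];
            for (int d = t.length; d-- != 0; ) t[d] = index(s, d);
        }
    }

    // Forward scan (first solution): here the "something" is a sum.
    static long sum(long[][] a) {
        long total = 0;
        for (int s = 0; s < a.length; s++) {
            final long[] t = a[s];
            for (int d = 0; d < t.length; d++) total += t[d];
        }
        return total;
    }

    public static void main(String[] args) {
        // 20 elements: two full segments and a shorter last one.
        long[][] a = { new long[8], new long[8], new long[4] };
        fillWithIndices(a);
        System.out.println(sum(a)); // 0 + 1 + ... + 19 = 190
    }
}
```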
Fastutil: Big data structures
• Fastutil also provides classes for other data structures:
  • BigList: a list with 64-bit indices. The instances of this class implement the same semantics as a traditional List.
  • HashBigSet: the instances of this class use a hash table to represent a big set. The number of elements in the set is limited only by the amount of core memory.
Dsiutils
• The DSI utilities are a mishmash of classes.
• Free software.
• Developed in Java.
• Useful links:
  • http://dsiutils.di.unimi.it/
  • http://dsiutils.di.unimi.it/docs/
Dsiutils: MutableString
• In large-scale text indexing we want a mutable string that, once frozen, can be used in the same optimized way as an immutable string.
• In Java we have String and StringBuffer, which can be used for immutable and mutable strings, respectively.
• The solution is MutableString.
• MutableString is not synchronized, so it avoids the synchronization overhead of StringBuffer.
Dsiutils: packages
• Some important packages:
  • it.unimi.dsi.bits contains the main classes for manipulating bits. Example: the class BitVectors provides static methods and objects that do useful things with bit vectors.
  • it.unimi.dsi.compression provides word-based compression/decompression classes.
  • it.unimi.dsi.util offers implementations of BloomFilters, PrefixMaps, StringMaps, BinaryTries and others.
WebGraph
• WebGraph is a framework for graph compression.
• It exploits modern compression techniques to manage very large graphs.
• Useful links:
  • http://webgraph.di.unimi.it/
  • http://webgraph.di.unimi.it/docs/
WebGraph
• WebGraph provides:
  • ζ-codes, which are suitable for storing web graphs.
  • Algorithms for compressing the graph that exploit gap compression as well as ζ-codes. The parameters provide different tradeoffs between access speed and compression ratio.
  • Algorithms to access compressed graphs without decompressing them entirely. Lazy techniques delay the decompression until it is necessary.
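Gap compression can be illustrated with a few lines of Java. This is a simplified sketch, not WebGraph's format: the class GapCodingSketch and its methods are hypothetical, the first successor is stored as is, and the later step of writing each gap with a ζ-code is not shown. The idea is that a sorted successor list turns into small numbers, which short codes then store in few bits.

```java
import java.util.Arrays;

// A simplified sketch of gap compression; NOT the actual BVGraph format.
final class GapCodingSketch {
    // Replaces every successor but the first with its distance from the
    // previous one, minus 1 (the list is strictly increasing, so gaps >= 0).
    static int[] toGaps(int[] successors) {
        int[] gaps = new int[successors.length];
        for (int i = 0; i < successors.length; i++)
            gaps[i] = i == 0 ? successors[0] : successors[i] - successors[i - 1] - 1;
        return gaps;
    }

    // Rebuilds the successor list from its gaps.
    static int[] fromGaps(int[] gaps) {
        int[] successors = new int[gaps.length];
        for (int i = 0; i < gaps.length; i++)
            successors[i] = i == 0 ? gaps[0] : successors[i - 1] + gaps[i] + 1;
        return successors;
    }

    public static void main(String[] args) {
        int[] successors = { 1, 3, 4, 5, 6, 7, 8, 9 };
        int[] gaps = toGaps(successors);
        System.out.println(Arrays.toString(gaps));           // [1, 1, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(fromGaps(gaps))); // [1, 3, 4, 5, 6, 7, 8, 9]
    }
}
```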
WebGraph: classes
• Some important classes:
  • ImmutableGraph is an abstract class representing an immutable graph.
  • BVGraph stores and accesses web graphs in a compressed form.
  • ASCIIGraph is used to store the graph in a human-readable ASCII format.
WebGraph: classes
• Some important classes:
  • ArcLabelledImmutableGraph is an abstract implementation of a graph with labelled arcs.
  • Transform provides static methods that return transformed versions of an immutable graph. For example, its transpose method creates the transpose graph.
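To make the notion of transpose graph concrete, here is a small sketch on plain successor arrays. It does not use WebGraph: the class TransposeSketch and the int[][] representation are assumptions of the example. In the transpose, every arc u → v becomes v → u, so the successor lists of the result are the predecessor lists of the input.

```java
import java.util.Arrays;

// A sketch of graph transposition on adjacency arrays; NOT WebGraph's Transform.
final class TransposeSketch {
    // successors[u] lists the nodes v such that there is an arc u -> v.
    static int[][] transpose(int[][] successors) {
        int n = successors.length;
        // First pass: count the incoming arcs of every node.
        int[] inDegree = new int[n];
        for (int[] list : successors)
            for (int v : list) inDegree[v]++;
        int[][] result = new int[n][];
        for (int v = 0; v < n; v++) result[v] = new int[inDegree[v]];
        // Second pass: u is visited in increasing order, so every list
        // of the result comes out sorted.
        int[] filled = new int[n];
        for (int u = 0; u < n; u++)
            for (int v : successors[u]) result[v][filled[v]++] = u;
        return result;
    }

    public static void main(String[] args) {
        int[][] graph = { { 1, 2 }, { 2 }, { 0 } }; // arcs 0->1, 0->2, 1->2, 2->0
        System.out.println(Arrays.deepToString(transpose(graph))); // [[2], [0], [0, 1]]
    }
}
```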
LAW
• Java software developed by the Laboratory for Web Algorithms.
• It is free and contains several implementations of the PageRank algorithm.
• Useful links:
  • http://law.di.unimi.it/software.php
  • http://law.di.unimi.it/software/docs/index.html
LAW: PageRank
• PageRank, in the package it.unimi.dsi.law.rank, is an abstract class that defines methods and attributes for the PageRank algorithm.
• Provided features:
  • we can set the preference vector;
  • we can set the damping factor;
  • we can program stopping criteria;
  • step-by-step execution;
  • reusability.
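The role of the damping factor and of the stopping criterion can be seen in a compact power-iteration sketch. This is not the LAW implementation: the class PageRankSketch and its parameter names are hypothetical, the preference vector is assumed to be uniform, the rank of nodes without successors is assumed to be spread uniformly over all nodes, and the iteration stops when the largest change of a score falls below a threshold.

```java
// A sketch of PageRank by power iteration; NOT the LAW implementation.
final class PageRankSketch {
    // successors[u] lists the nodes reached by the arcs leaving u;
    // alpha is the damping factor.
    static double[] pageRank(int[][] successors, double alpha,
                             double threshold, int maxIterations) {
        int n = successors.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n); // start from the uniform vector
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] next = new double[n];
            double dangling = 0; // total rank of the nodes without successors
            for (int u = 0; u < n; u++) {
                if (successors[u].length == 0) {
                    dangling += rank[u];
                } else {
                    double share = rank[u] / successors[u].length;
                    for (int v : successors[u]) next[v] += share;
                }
            }
            double delta = 0; // largest change of a score in this iteration
            for (int v = 0; v < n; v++) {
                next[v] = alpha * (next[v] + dangling / n) + (1 - alpha) / n;
                delta = Math.max(delta, Math.abs(next[v] - rank[v]));
            }
            rank = next;
            if (delta < threshold) break; // stopping criterion
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] graph = { { 1 }, { 2 }, { 0, 1 } }; // a small hypothetical graph
        for (double score : pageRank(graph, 0.85, 1e-9, 100))
            System.out.println(score);
    }
}
```

With these choices the scores always sum to 1, so they can be read as a probability distribution over the nodes.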
Exercise
• Download the files:
  • law-1.4.jar and webgraph-3.0.1.jar
  • example
  • Text2ASCII.class and PrintRanks.class
  available at: http://www.dis.uniroma1.it/~mele/teaching_20122013.html
• Add law-1.4.jar and webgraph-3.0.1.jar to the directory containing all jar files (e.g., lib_mg4j).
• Update the file set-classpath.sh, and set the classpath:
    source set-classpath.sh
Build the graph: step1
• Create the file in the ASCIIGraph format:
    java Text2ASCII example
• Output:
  • example.graph-txt: the first line contains the number of nodes, say n. The following n lines contain the lists of out-neighbours of the nodes. In particular, the i-th line contains the successors of node i, sorted in increasing order and separated by a space.
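The layout of the .graph-txt file can be checked with a short parser. This is a sketch and not part of WebGraph or of Text2ASCII: the class AsciiGraphSketch is hypothetical, the input string is a made-up three-node graph, and an empty line is assumed to denote a node without successors.

```java
import java.util.Arrays;

// A sketch of a reader for the ASCIIGraph text layout; NOT WebGraph code.
final class AsciiGraphSketch {
    // The first line holds the number of nodes n; each of the next n lines
    // holds the space-separated successors of one node.
    static int[][] parse(String text) {
        String[] lines = text.split("\n", -1);
        int n = Integer.parseInt(lines[0].trim());
        int[][] successors = new int[n][];
        for (int i = 0; i < n; i++) {
            // Assumption: a missing or empty line means "no successors".
            String line = i + 1 < lines.length ? lines[i + 1].trim() : "";
            if (line.isEmpty()) {
                successors[i] = new int[0];
                continue;
            }
            String[] tokens = line.split("\\s+");
            successors[i] = new int[tokens.length];
            for (int j = 0; j < tokens.length; j++)
                successors[i][j] = Integer.parseInt(tokens[j]);
        }
        return successors;
    }

    // Total number of arcs, i.e. the sum of the lengths of the lists.
    static int countArcs(int[][] successors) {
        int arcs = 0;
        for (int[] list : successors) arcs += list.length;
        return arcs;
    }

    public static void main(String[] args) {
        // Hypothetical graph: node 0 -> 1 2, node 1 -> 2, node 2 -> nothing.
        int[][] graph = parse("3\n1 2\n2\n\n");
        System.out.println(Arrays.deepToString(graph)); // [[1, 2], [2], []]
        System.out.println(countArcs(graph));           // 3
    }
}
```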
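Build the graph: step1
• more example.graph-txt
• The first line is the number of nodes (10). Each of the following 10 lines is the list of successors of one node, for node ids 0 to 9.
• The successor lists, in file order (the line breaks of the original listing are lost): 1 8 9 4 7 9 1 3 4 5 6 7 8 9 1 4 5 6 9 1 2 1 1 2 3 4 5 5 9 0 1 3 4 6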
Build the graph: step2
• We can use the main method of the BVGraph class to load and compress an ImmutableGraph.
• The compressed graph is described by:
  • basename.graph: the graph file. It contains the successor lists, one for each node. Each list is a sequence of natural numbers that are coded as sequences of bits in an efficient way.
  • basename.offsets: the offset file. It stores the offset for each node of the graph.
  • basename.properties: the file with properties and statistics.
Build the graph: step2
• Step 2: conversion from the ASCIIGraph to the BVGraph:
    java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph example example
• Output:
  • example.graph
  • example.offsets
  • example.properties
Build the graph: step2
• more example.properties
    #BVGraph properties
    #Wed Nov 21 12:48:44 CET 2012
    compratio=1,89
    bitsforblocks=22
    …
    version=0
    …
    nodes=10
    …
    arcs=34
    …
Compute PageRank
• To compute the PageRank we can use the implementations:
  • PowerMethod
  • GaussSeidel
  • Jacobi
• The output consists of 2 files:
  • basename.ranks: a binary file with the results of the computation.
  • basename.properties: a text file with general information.
Compute PageRank: step1
• We use the main method of the class PageRankPowerMethod by issuing the following command:
    java it.unimi.dsi.law.rank.PageRankPowerMethod example examplePR
• Output:
  • examplePR.ranks
  • examplePR.properties
Compute PageRank: step1
• more examplePR.properties
    rank.alpha = 0.85
    rank.stronglyPreferential = false
    method.numberOfIterations = 12
    method.norm.type = INFTY
    method.norm.value = 8.396275630317973E-7
    graph.nodes = 10
    graph.fileName = example
Compute PageRank: step2
• The .ranks file is a binary file with the scores of the nodes.
• We can print these scores by using the class PrintRanks:
    java PrintRanks examplePR.ranks > ranks
• Output:
  • ranks. This file has n lines, one for each node. The i-th line contains the score of node number i.
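What a tool like PrintRanks has to do can be sketched as follows. The source of PrintRanks is not shown here, so this is only a guess at its behaviour: the class RanksFileSketch is hypothetical, and the sketch assumes that the .ranks file is a plain sequence of 8-byte big-endian doubles, one per node, with no header.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// A sketch of decoding a file of doubles; NOT the PrintRanks class.
final class RanksFileSketch {
    // Assumption: the bytes are consecutive big-endian 8-byte doubles.
    static double[] decode(byte[] bytes) {
        DoubleBuffer buffer = ByteBuffer.wrap(bytes).asDoubleBuffer(); // big-endian by default
        double[] ranks = new double[buffer.remaining()];
        buffer.get(ranks);
        return ranks;
    }

    // Inverse operation, used here only to build a small demo input.
    static byte[] encode(double[] ranks) {
        ByteBuffer buffer = ByteBuffer.allocate(ranks.length * 8);
        for (double r : ranks) buffer.putDouble(r);
        return buffer.array();
    }

    public static void main(String[] args) throws IOException {
        // With a file name as argument, decode that file;
        // otherwise decode a two-value demo array.
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : encode(new double[] { 0.25, 0.75 });
        // One score per line: line i holds the score of node i.
        for (double score : decode(bytes)) System.out.println(score);
    }
}
```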
Compute PageRank: step2
• more ranks (the node ids are annotations; the file contains only the PageRank values, one per line)
    Node id    PageRank value
    0          0.0515659940361598
    1          0.20197850631669495
    2          0.07982657817906964
    3          0.07587785830476211
    4          0.14600457683651308
    5          0.08608501191896127
    6          0.07294688611466064
    7          0.0931194920828582
    8          0.05050241152172527
    9          0.14209268468859523
Homework
• Repeat the exercise with the graphs:
  • WikiIT
  • WikiPT
  available at: http://www.dis.uniroma1.it/~mele/teaching_20122013.html
• Create a new graph by using synthetic or real data, and repeat the exercise with this new graph.