
Ranking

Ranking. Ida Mele. Introduction. The set of software components for managing large data sets consists of MG4J, Fastutil, the DSI Utilities, Sux4J, WebGraph, and the LAW software. These components were developed by the DSI of the University of Milan.


Presentation Transcript


  1. Ranking Ida Mele

  2. Introduction
     • The set of software components for managing large data sets consists of:
       • MG4J,
       • Fastutil,
       • the DSI Utilities,
       • Sux4J,
       • WebGraph,
       • the LAW software.
     • These software components were developed by the DSI of the University of Milan.

  3. Fastutil
     • Fastutil 6 is free software written in Java.
     • Technical requirement:
       • Java >= 6
     • Useful links:
       • http://fastutil.di.unimi.it/
       • http://fastutil.di.unimi.it/docs/

  4. Fastutil
     • Fastutil extends the Java Collections Framework and provides:
       • Type-specific maps, sets, and lists;
       • Priority queues with a small memory footprint and fast access and insertion;
       • 64-bit arrays, sets, and lists;
       • Fast I/O classes for text and binary files.

  5. Fastutil
     • Advantages of using Fastutil:
       • The Fastutil classes are implemented to work efficiently on huge collections of data.
       • Fastutil provides a new set of classes to deal with collections whose size exceeds 2^31.

  6. Fastutil
     • Advantages of using Fastutil:
       • There are additional features (e.g., bidirectional iterators) that are not available in the standard classes.
       • Classes can be plugged into existing code, because they implement their standard counterparts (e.g., Map for maps).

  7. Fastutil: Big Arrays
     • BigArrays: a class that provides static methods and objects for working with big arrays.
     • Big arrays are arrays-of-arrays. For example, a big array of integers has type int[][].
     • The methods handle these arrays-of-arrays as if they were one-dimensional arrays with 64-bit indices.
     • The length of a big array is bounded by Long.MAX_VALUE rather than Integer.MAX_VALUE.

  8. Fastutil: Big Arrays
     • Given a big array a, the arrays a[0], a[1], …, a[n] are called segments. Each one has length SEGMENT_SIZE (the last segment can be shorter).
     • Each index i is associated with a segment and a displacement into the segment.
     • The methods segment and displacement compute the segment and the displacement associated with a given index.
     • The method index receives a segment and a displacement and returns the corresponding index.
     • The methods get and set read and write the value of a given element of the big array.
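
The segment/displacement arithmetic described on this slide can be sketched with the plain JDK. This is a minimal illustration, not fastutil's code: the class name BigArraySketch, the helper newBigArray, and the deliberately tiny SEGMENT_SHIFT of 4 (16 elements per segment, so the demo fits in memory) are assumptions made for the example.

```java
/** Plain-JDK sketch of big-array index arithmetic; hypothetical, not fastutil's implementation. */
class BigArraySketch {
    // Assumption: tiny segments of 2^4 = 16 elements, chosen only for the demo.
    static final int SEGMENT_SHIFT = 4;
    static final int SEGMENT_SIZE = 1 << SEGMENT_SHIFT;
    static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    /** Segment that holds the element with the given 64-bit index. */
    static int segment(long index) {
        return (int) (index >>> SEGMENT_SHIFT);
    }

    /** Position of the element inside its segment. */
    static int displacement(long index) {
        return (int) (index & SEGMENT_MASK);
    }

    /** Inverse of segment/displacement: rebuilds the 64-bit index. */
    static long index(int segment, int displacement) {
        return ((long) segment << SEGMENT_SHIFT) + displacement;
    }

    /** Allocates an array-of-arrays; only the last segment may be shorter. */
    static int[][] newBigArray(long length) {
        int segments = (int) ((length + SEGMENT_MASK) >>> SEGMENT_SHIFT);
        int[][] a = new int[segments][];
        for (int s = 0; s < segments; s++) {
            long remaining = length - ((long) s << SEGMENT_SHIFT);
            a[s] = new int[(int) Math.min(SEGMENT_SIZE, remaining)];
        }
        return a;
    }

    /** Reads the element at a 64-bit index. */
    static int get(int[][] a, long index) {
        return a[segment(index)][displacement(index)];
    }

    /** Writes the element at a 64-bit index. */
    static void set(int[][] a, long index, int value) {
        a[segment(index)][displacement(index)] = value;
    }

    public static void main(String[] args) {
        int[][] a = newBigArray(40); // segments of length 16, 16 and 8
        set(a, 37, 99);
        // prints "2 5 99": index 37 lives in segment 2 at displacement 5
        System.out.println(segment(37) + " " + displacement(37) + " " + get(a, 37));
    }
}
```

With a power-of-two segment size, both lookups reduce to a shift and a mask, which is why accessing an element through a 64-bit index stays cheap.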

  9. Fastutil Big Arrays - example
     • We want to scan the big array a (here an int[][]).
     • First solution: visit segments and displacements in increasing order.
         for( int s = 0; s < a.length; s++ ) {
             final int[] t = a[ s ];
             for( int d = 0; d < t.length; d++ ) {
                 // do something with t[ d ]
             }
         }

  10. Fastutil Big Arrays - example
      • Second solution: visit segments and displacements in decreasing order.
          for( int s = a.length; s-- != 0; ) {
              final int[] t = a[ s ];
              for( int d = t.length; d-- != 0; ) {
                  // do something with t[ d ]
              }
          }

  11. Fastutil Big Arrays - example
      • Third solution: we can use the index method, which returns the index associated with a segment and a displacement. Here a is a big array of longs (long[][]), and every element is set to its own index.
          for( int s = a.length; s-- != 0; ) {
              final long[] t = a[ s ];
              for( int d = t.length; d-- != 0; )
                  t[ d ] = index( s, d );
          }
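
The scanning loops above leave a and index undefined, so here is a self-contained version that can be compiled and run. It is a sketch under stated assumptions: the class BigArrayScanSketch, the hand-written index helper, and the 8-element segments stand in for the fastutil equivalents and are not the library's code.

```java
/** Plain-JDK sketch of backward and forward scans over a long[][] big array. */
class BigArrayScanSketch {
    // Assumption: 2^3 = 8 elements per segment, chosen only to keep the demo small.
    static final int SEGMENT_SHIFT = 3;

    /** Hypothetical stand-in for the index method: segment and displacement to 64-bit index. */
    static long index(int segment, int displacement) {
        return ((long) segment << SEGMENT_SHIFT) + displacement;
    }

    /** Backward scan (third solution): stores in every cell its own index. */
    static void fillWithIndices(long[][] a) {
        for (int s = a.length; s-- != 0; ) {
            final long[] t = a[s];
            for (int d = t.length; d-- != 0; ) t[d] = index(s, d);
        }
    }

    /** Forward scan (first solution): sums all the elements. */
    static long sum(long[][] a) {
        long total = 0;
        for (int s = 0; s < a.length; s++) {
            final long[] t = a[s];
            for (int d = 0; d < t.length; d++) total += t[d];
        }
        return total;
    }

    public static void main(String[] args) {
        // 20 elements: two full segments and a shorter last one.
        long[][] a = { new long[8], new long[8], new long[4] };
        fillWithIndices(a);
        System.out.println(sum(a)); // prints 190, the sum of 0..19
    }
}
```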

  12. Fastutil: Big data structures
      • Fastutil also provides classes for other big data structures:
        • BigList: a list with 64-bit indices. Instances of this class implement the same semantics as a traditional List.
        • HashBigSet: instances of this class use a hash table to represent a big set. The number of elements in the set is limited only by the amount of core memory.

  13. Dsiutils
      • The DSI utilities are a mishmash of classes.
      • Free software.
      • Developed in Java.
      • Useful links:
        • http://dsiutils.di.unimi.it/
        • http://dsiutils.di.unimi.it/docs/

  14. Dsiutils: MutableString
      • In large-scale text indexing we want a mutable string that, once frozen, can be used in the same optimized way as an immutable string.
      • In Java we have String and StringBuffer, which are used for immutable and mutable strings respectively.
      • The solution is MutableString.
      • MutableString is not synchronized, so it avoids the synchronization overhead of StringBuffer.

  15. Dsiutils: packages
      • Some important packages:
        • it.unimi.dsi.bits contains the main classes for manipulating bits. Example: the class BitVectors provides static methods and objects that do useful things with bit vectors.
        • it.unimi.dsi.compression provides word-based compression/decompression classes.
        • it.unimi.dsi.util offers implementations of BloomFilters, PrefixMaps, StringMaps, BinaryTries, and others.

  16. WebGraph
      • WebGraph is a framework for graph compression.
      • It exploits modern compression techniques to manage very large graphs.
      • Useful links:
        • http://webgraph.di.unimi.it/
        • http://webgraph.di.unimi.it/docs/

  17. WebGraph
      • WebGraph provides:
        • ζ-codes, which are suitable for storing web graphs.
        • Algorithms for compressing graphs that exploit gap compression as well as ζ-codes. The parameters provide different tradeoffs between access speed and compression ratio.
        • Algorithms for accessing compressed graphs without decompressing them. Lazy techniques delay decompression until it is necessary.
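
To give an idea of what gap compression means, the following sketch replaces a sorted successor list with its first element followed by the differences between consecutive elements. The small gaps are what a variable-length code such as a ζ-code can then store in few bits. This is a deliberately simplified illustration with hypothetical names (GapCodingSketch, toGaps, fromGaps); it is not the format actually written by WebGraph.

```java
import java.util.Arrays;

/** Simplified gap coding of a sorted successor list; not WebGraph's actual format. */
class GapCodingSketch {
    /** Keeps the first successor and replaces every other one with the gap from its predecessor. */
    static int[] toGaps(int[] successors) {
        int[] gaps = new int[successors.length];
        for (int i = 0; i < successors.length; i++) {
            gaps[i] = i == 0 ? successors[0] : successors[i] - successors[i - 1];
        }
        return gaps;
    }

    /** Rebuilds the successor list by accumulating the gaps. */
    static int[] fromGaps(int[] gaps) {
        int[] successors = new int[gaps.length];
        int current = 0;
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i];
            successors[i] = current;
        }
        return successors;
    }

    public static void main(String[] args) {
        int[] successors = { 1, 3, 4, 5, 6, 7, 8, 9 };
        int[] gaps = toGaps(successors);
        System.out.println(Arrays.toString(gaps));           // prints [1, 2, 1, 1, 1, 1, 1, 1]
        System.out.println(Arrays.toString(fromGaps(gaps))); // prints [1, 3, 4, 5, 6, 7, 8, 9]
    }
}
```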

  18. WebGraph: classes
      • Some important classes:
        • ImmutableGraph is an abstract class representing an immutable graph.
        • BVGraph stores and accesses web graphs in a compressed form.
        • ASCIIGraph stores the graph in a human-readable ASCII format.

  19. WebGraph: classes
      • Some important classes:
        • ArcLabelledImmutableGraph is an abstract implementation of a graph with labelled arcs.
        • Transform returns transformed versions of an immutable graph. For example, we can use the transpose method of this class to create the transpose graph.
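
As a reminder of what the transpose graph is, this plain-JDK sketch reverses every arc of a graph stored as an array of successor lists. The class TransposeSketch and its int[][] representation are assumptions made for the example; the WebGraph method works on ImmutableGraph instances instead.

```java
import java.util.Arrays;

/** Plain-JDK sketch of graph transposition on successor lists; not WebGraph's Transform class. */
class TransposeSketch {
    /** Returns the graph in which every arc x -> y of the input becomes y -> x. */
    static int[][] transpose(int[][] successors) {
        int n = successors.length;
        // First pass: count the predecessors of every node to size the output lists.
        int[] inDegree = new int[n];
        for (int[] list : successors) {
            for (int y : list) inDegree[y]++;
        }
        int[][] result = new int[n][];
        for (int y = 0; y < n; y++) result[y] = new int[inDegree[y]];
        // Second pass: x grows, so every output list ends up sorted in increasing order.
        int[] filled = new int[n];
        for (int x = 0; x < n; x++) {
            for (int y : successors[x]) result[y][filled[y]++] = x;
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] graph = { { 1, 2 }, { 2 }, { 0 } }; // arcs 0->1, 0->2, 1->2, 2->0
        System.out.println(Arrays.deepToString(transpose(graph))); // prints [[2], [0], [0, 1]]
    }
}
```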

  20. LAW
      • Java software developed by the Laboratory for Web Algorithms.
      • It is free and contains several implementations of the PageRank algorithm.
      • Useful links:
        • http://law.di.unimi.it/software.php
        • http://law.di.unimi.it/software/docs/index.html

  21. LAW: Pagerank
      • PageRank, in the package it.unimi.dsi.law.rank, is an abstract class that defines methods and attributes for the PageRank algorithm.
      • Provided features:
        • we can set the preference vector;
        • we can set the damping factor;
        • we can program stopping criteria;
        • step-by-step execution;
        • reusability.
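
To make the roles of the damping factor and of the stopping criterion concrete, here is a minimal power-method sketch written with the plain JDK. It is not the LAW implementation: the class PageRankSketch, the int[][] graph representation, the uniform preference vector, and the uniform redistribution of the rank of nodes without successors are all assumptions made for the example.

```java
import java.util.Arrays;

/** Minimal power-method PageRank on successor lists; a sketch, not the LAW code. */
class PageRankSketch {
    /**
     * Iterates until the sup-norm of the change falls below threshold,
     * or until maxIterations iterations have been performed.
     */
    static double[] pageRank(int[][] successors, double alpha, double threshold, int maxIterations) {
        int n = successors.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from the uniform distribution
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] next = new double[n];
            double dangling = 0; // total rank held by nodes without successors
            for (int x = 0; x < n; x++) {
                if (successors[x].length == 0) {
                    dangling += rank[x];
                } else {
                    double share = rank[x] / successors[x].length;
                    for (int y : successors[x]) next[y] += share;
                }
            }
            double delta = 0;
            for (int y = 0; y < n; y++) {
                // Assumption: dangling rank and the preference vector are both uniform.
                next[y] = alpha * (next[y] + dangling / n) + (1 - alpha) / n;
                delta = Math.max(delta, Math.abs(next[y] - rank[y]));
            }
            rank = next;
            if (delta < threshold) break; // stopping criterion on the sup-norm
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] graph = { { 1 }, {}, { 0, 1 } }; // node 1 has no successors
        System.out.println(Arrays.toString(pageRank(graph, 0.85, 1e-6, 100)));
    }
}
```

Because the scores are renormalised implicitly at every step, they always sum to 1, which is a quick sanity check on any output.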

  22. Exercise
      • Download the files:
        • law-1.4.jar and webgraph-3.0.1.jar
        • example
        • Text2ASCII.class and PrintRanks.class
        available at: http://www.dis.uniroma1.it/~mele/teaching_20122013.html
      • Add law-1.4.jar and webgraph-3.0.1.jar to the directory containing all the jar files (e.g., lib_mg4j).
      • Update the file set-classpath.sh, and set the classpath:
          source set-classpath.sh

  23. Build the graph: step1
      • Create the file in the ASCIIGraph format:
          java Text2ASCII example
      • Output:
        • example.graph-txt: the first line contains the number of nodes, say n. The following n lines contain the lists of out-neighbours of the nodes. In particular, the i-th line contains the successors of node i, sorted in increasing order and separated by spaces.

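  24. Build the graph: step1
      • more example.graph-txt
      • [Figure: contents of example.graph-txt. The first line is the number of nodes (10); each of the following lines is the list of successors of one node, with node ids 0 to 9. The line breaks between the lists were lost in this transcript; the values, in file order, are: 1 8 9 4 7 9 1 3 4 5 6 7 8 9 1 4 5 6 9 1 2 1 1 2 3 4 5 5 9 0 1 3 4 6]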
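
The format described above (number of nodes on the first line, then one sorted successor list per node) is simple enough to parse by hand. The sketch below does so with the plain JDK on a small made-up graph; the class AsciiGraphSketch is hypothetical and is not how WebGraph's ASCIIGraph class reads the file.

```java
import java.util.Arrays;

/** Plain-JDK sketch of a parser for the .graph-txt format; not WebGraph's ASCIIGraph. */
class AsciiGraphSketch {
    /** Parses the node count followed by one successor list per line. */
    static int[][] parse(String text) {
        String[] lines = text.split("\n", -1);
        int n = Integer.parseInt(lines[0].trim());
        int[][] successors = new int[n][];
        for (int i = 0; i < n; i++) {
            // Line i + 1 holds the successors of node i; a missing or empty line means no successors.
            String line = i + 1 < lines.length ? lines[i + 1].trim() : "";
            if (line.isEmpty()) {
                successors[i] = new int[0];
                continue;
            }
            String[] tokens = line.split("\\s+");
            successors[i] = new int[tokens.length];
            for (int j = 0; j < tokens.length; j++) {
                successors[i][j] = Integer.parseInt(tokens[j]);
            }
        }
        return successors;
    }

    public static void main(String[] args) {
        // Made-up 3-node graph: node 0 -> 1, 2; node 1 has no successors; node 2 -> 0.
        String text = "3\n1 2\n\n0\n";
        System.out.println(Arrays.deepToString(parse(text))); // prints [[1, 2], [], [0]]
    }
}
```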

  25. Build the graph: step2
      • We can use the main method of the BVGraph class to load and compress an ImmutableGraph.
      • The compressed graph is described by:
        • basename.graph: the graph file. It contains the successor lists, one for each node. Each list is a sequence of natural numbers that are coded as a sequence of bits in an efficient way.
        • basename.offsets: the offset file. It stores the offset for each node of the graph.
        • basename.properties: the file with properties and statistics.

  26. Build the graph: step2
      • Step 2: conversion from the ASCIIGraph to the BVGraph:
          java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph example example
      • Output:
        • example.graph
        • example.offsets
        • example.properties

  27. Build the graph: step2
      • more example.properties
          #BVGraph properties
          #Wed Nov 21 12:48:44 CET 2012
          compratio=1,89
          bitsforblocks=22
          …
          version=0
          …
          nodes=10
          …
          arcs=34
          …

  28. Compute Pagerank
      • To compute the PageRank we can use the implementations:
        • PowerMethod
        • GaussSeidel
        • Jacobi
      • The output consists of 2 files:
        • basename.ranks: a binary file with the results of the computation.
        • basename.properties: a text file with general information.

  29. Compute Pagerank: step1
      • We use the main method of the class PageRankPowerMethod by issuing the following command:
          java it.unimi.dsi.law.rank.PageRankPowerMethod example examplePR
      • Output:
        • examplePR.ranks
        • examplePR.properties

  30. Compute Pagerank: step1
      • more examplePR.properties
          rank.alpha = 0.85
          rank.stronglyPreferential = false
          method.numberOfIterations = 12
          method.norm.type = INFTY
          method.norm.value = 8.396275630317973E-7
          graph.nodes = 10
          graph.fileName = example

  31. Compute Pagerank: step2
      • The .ranks file is a binary file with the scores of the nodes.
      • We can print these scores by using the class PrintRanks:
          java PrintRanks examplePR.ranks > ranks
      • Output:
        • ranks: this file has n lines, one for each node. The i-th line contains the score of node i.
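
The source of PrintRanks is not shown in the slides. The following sketch shows one way such a tool could work, under the assumption (not confirmed by the slides) that the .ranks file is a plain sequence of 8-byte big-endian doubles, which is the layout produced by java.io.DataOutput. The class name RanksReaderSketch is hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch of a ranks printer; assumes the file holds consecutive big-endian doubles. */
class RanksReaderSketch {
    /** Decodes consecutive 8-byte big-endian doubles (ByteBuffer's default byte order). */
    static double[] decode(byte[] bytes) {
        DoubleBuffer buffer = ByteBuffer.wrap(bytes).asDoubleBuffer();
        double[] ranks = new double[buffer.remaining()];
        buffer.get(ranks);
        return ranks;
    }

    public static void main(String[] args) throws IOException {
        // With a file name, decode that file; otherwise decode two in-memory sample values.
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : ByteBuffer.allocate(16).putDouble(0.25).putDouble(0.75).array();
        // One score per line: the i-th line is the score of node i.
        for (double rank : decode(bytes)) System.out.println(rank);
    }
}
```

If the printed values are not probabilities that sum to roughly 1, the layout assumption does not hold for the file at hand.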

  32. Compute Pagerank: step2
      • more ranks
          Node id   PageRank value
          0         0.0515659940361598
          1         0.20197850631669495
          2         0.07982657817906964
          3         0.07587785830476211
          4         0.14600457683651308
          5         0.08608501191896127
          6         0.07294688611466064
          7         0.0931194920828582
          8         0.05050241152172527
          9         0.14209268468859523
      • The file itself contains only the PageRank values, one per line; the node ids are shown for reference.

  33. Homework
      • Repeat the exercise with the graphs:
        • WikiIT
        • WikiPT
        available at: http://www.dis.uniroma1.it/~mele/teaching_20122013.html
      • Create a new graph by using synthetic or real data, and repeat the exercise with this new graph.
