JJE: INEX XML Competition. Bryan Clevenger, James Reed, Jon McElroy
Introduction • Deal with the large size of the internet by using better categorization techniques • Goal: optimize search time by grouping pages into clusters • Wikipedia is the data source
Problem • Take the Wikipedia data and create a clustering algorithm that groups related documents together. • Clustering reduces the search space when looking for related information.
Solution • If documents share several of the same links, they likely contain similar data (a simple overlap measure is sketched below). • Focused on the link data set. • Link data: 39484 2039 4952 1029 39 1920 10233 30197
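As a minimal illustration of the "shared links imply similar data" idea, the sketch below scores link-set overlap with a Jaccard measure. The choice of Jaccard, the class name LinkSimilarity, and the second document's link ids are our own assumptions for the example, not part of the original project.

import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: Jaccard overlap between the outgoing-link sets of two
// documents. A score near 1 means the documents point to mostly the same pages,
// which the slides take as evidence of similar content.
public class LinkSimilarity {

    static double linkSimilarity(Set<Integer> linksA, Set<Integer> linksB) {
        Set<Integer> union = new HashSet<>(linksA);
        union.addAll(linksB);
        if (union.isEmpty()) return 0.0;          // neither document has links
        Set<Integer> common = new HashSet<>(linksA);
        common.retainAll(linksB);                 // shared link targets
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        // Link ids taken from the sample line on the slide.
        Set<Integer> docA = Set.of(39484, 2039, 4952, 1029, 39, 1920, 10233, 30197);
        Set<Integer> docB = Set.of(2039, 4952, 39, 555);   // made-up second document
        System.out.println(linkSimilarity(docA, docB));    // prints 0.333...
    }
}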
Overall solution • Determine sub-communities in the link graph using Max-Flow/Min-Cut community discovery • Heuristics are used to find relevant seed nodes
Max Flow – Min Cut • Edge capacity – similar to an edge weight; represents the "amount" of information that can be pushed along the edge. • Flow – the total amount that can be pushed from one node to another, summing the bottleneck (minimum) capacity of each augmenting path.
Max Flow – Min Cut (cont.) • The max flow between two nodes in the same cluster should be larger than the max flow between two nodes in separate clusters (a sketch of computing this follows).
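Below is a minimal sketch of computing max flow over the adjacency structure described on the parsing slide, using the BFS-based Edmonds-Karp variant of Ford-Fulkerson [2][5]. The class and method names are ours and the project's actual code may differ. The nodes still reachable from the source in the residual graph afterwards form the source side of a min cut, which is the set a max-flow/min-cut community-discovery step would keep as the seed's community.

import java.util.*;

// Sketch of BFS-based Ford-Fulkerson (Edmonds-Karp) over the slides' graph layout:
// Map<Integer, Map<Integer, Integer>> = document id -> (linked document id -> capacity).
// Class and method names are illustrative, not the project's actual code.
public class MaxFlowMinCut {

    // Build a residual-capacity map, push flow from s to t until no augmenting
    // path remains, and return the final residual map.
    static Map<Integer, Map<Integer, Integer>> runMaxFlow(
            Map<Integer, Map<Integer, Integer>> cap, int s, int t) {
        Map<Integer, Map<Integer, Integer>> res = new HashMap<>();
        for (var e : cap.entrySet()) {
            res.computeIfAbsent(e.getKey(), k -> new HashMap<>()).putAll(e.getValue());
            for (int v : e.getValue().keySet()) {
                // Ensure a reverse entry exists so flow can be "undone" later.
                res.computeIfAbsent(v, k -> new HashMap<>()).putIfAbsent(e.getKey(), 0);
            }
        }
        while (true) {
            // BFS for a shortest augmenting path with positive residual capacity.
            Map<Integer, Integer> parent = new HashMap<>();
            Deque<Integer> queue = new ArrayDeque<>(List.of(s));
            parent.put(s, s);
            while (!queue.isEmpty() && !parent.containsKey(t)) {
                int u = queue.poll();
                for (var e : res.getOrDefault(u, Map.of()).entrySet()) {
                    if (e.getValue() > 0 && !parent.containsKey(e.getKey())) {
                        parent.put(e.getKey(), u);
                        queue.add(e.getKey());
                    }
                }
            }
            if (!parent.containsKey(t)) return res;      // no augmenting path left: done
            // Find the bottleneck capacity along the path, then push that much flow.
            int bottleneck = Integer.MAX_VALUE;
            for (int v = t; v != s; v = parent.get(v)) {
                bottleneck = Math.min(bottleneck, res.get(parent.get(v)).get(v));
            }
            for (int v = t; v != s; v = parent.get(v)) {
                int u = parent.get(v);
                res.get(u).merge(v, -bottleneck, Integer::sum);   // forward edge loses capacity
                res.get(v).merge(u, bottleneck, Integer::sum);    // reverse edge gains capacity
            }
        }
    }

    // Nodes still reachable from s in the residual graph = source side of a min cut,
    // i.e. the candidate community around the seed s.
    static Set<Integer> sourceSideOfMinCut(Map<Integer, Map<Integer, Integer>> res, int s) {
        Set<Integer> seen = new HashSet<>(List.of(s));
        Deque<Integer> queue = new ArrayDeque<>(List.of(s));
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (var e : res.getOrDefault(u, Map.of()).entrySet()) {
                if (e.getValue() > 0 && seen.add(e.getKey())) queue.add(e.getKey());
            }
        }
        return seen;
    }
}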
Implementation (Parsing) • Links are parsed into a graph (sketched below). • Graph: HashMap&lt;Integer, HashMap&lt;Integer, Integer&gt;&gt; • Maps a document id to a HashMap of link ids to capacities. • A Links structure was also created: Links[0] = 3244,2645,791 Links[1] = 10293,432,2,1230 ... Links[max] = 1012
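The sketch below shows one plausible way to build these structures, assuming each line of the link file is "docId linkId linkId ..." and that every link starts with a capacity of 1. The file layout, the default capacity, and the class name LinkParser are assumptions; the slides only describe the resulting data structures.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Illustrative parser for the link data set. Assumes the first integer on a line
// is the source document id and the rest are the documents it links to.
public class LinkParser {

    // Graph: document id -> (linked document id -> capacity)
    static HashMap<Integer, HashMap<Integer, Integer>> graph = new HashMap<>();
    // Links: per-document list of outgoing link ids, like Links[0], Links[1], ...
    static HashMap<Integer, List<Integer>> links = new HashMap<>();

    static void parse(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                if (tokens.length < 2) continue;               // skip empty or malformed lines
                int doc = Integer.parseInt(tokens[0]);
                HashMap<Integer, Integer> adj = graph.computeIfAbsent(doc, k -> new HashMap<>());
                List<Integer> out = links.computeIfAbsent(doc, k -> new ArrayList<>());
                for (int i = 1; i < tokens.length; i++) {
                    int target = Integer.parseInt(tokens[i]);
                    adj.merge(target, 1, Integer::sum);        // repeated links add capacity
                    out.add(target);
                }
            }
        }
    }
}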
Implementation (Initialization of Community Seeds) • Using the Links structure, a fixed percentage of the nodes with the highest link counts are chosen as seeds (sketched below).
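A small sketch of that seed-selection heuristic: rank documents by how many links they have and keep the top fraction. The slides do not give the percentage used, so the fraction parameter and the class name SeedSelection are placeholders.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative seed selection over the Links structure from the parsing sketch.
public class SeedSelection {

    static List<Integer> chooseSeeds(Map<Integer, List<Integer>> links, double fraction) {
        List<Integer> docs = new ArrayList<>(links.keySet());
        // Sort by descending number of outgoing links.
        docs.sort(Comparator.comparingInt((Integer d) -> links.get(d).size()).reversed());
        int keep = Math.max(1, (int) (docs.size() * fraction));
        return docs.subList(0, keep);
    }
}

Each chosen seed would then serve as the source node for one max-flow/min-cut community extraction.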
Implementation (Finding Communities) • The idea, and why it didn't work • Robots
Implementation (Visualization) • Walrus is an interactive 3D visualization tool for large directed graphs. • Parsing of Walrus input and output. • Clusters were grouped by color.
Results • The INEX link data consists of 54,000 nodes and 15 million links • The average running time to retrieve one cluster on a 2.0 GHz dual-core Dell Pentium laptop was 5.9 hours • Cluster sizes fall between 2,000 and 2,500 nodes
Results • Visual images of the resulting clusters
Conclusion • It worked... kinda. • Looks great! • See pretty pictures.
References
[1] INEX 2009 Mining Track. http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp, October 2009.
[2] The Standard Maximum Flow Problem. http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=maxFlow, November 2009.
[3] Walrus - Graph Visualization Tool. http://www.caida.org/tools/visualization/walrus, December 2009.
[4] Mark C. Chu-Carroll. Maximum Flow and Minimum Cut. http://scienceblogs.com/goodmath/2007/08/maximum_flow_and_minimum_cut_1.php, December 2009.
[5] Ford-Fulkerson Algorithm. http://en.wikipedia.org/wiki/Ford-Fulkerson_algorithm, October 2009.
[6] Max-flow Min-cut Theorem. http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem, November 2009.
Questions? • O really?