INEX 2009 XML Mining Track

James Reed, Jonathan McElroy, Brian Clevenger

Introduction
• INEX is the Initiative for the Evaluation of XML Retrieval
• The clustering task draws on Information Retrieval, Data Mining, Machine Learning, and XML
• Goal: measure how well clustering methods work for retrieval over large document collections, and measure performance specifically for XML IR

Problem
• Task: test the cluster hypothesis of Jardine and van Rijsbergen, which states: “documents that cluster together have a similar relevance to a given query.”
• If true, only a small fraction of the clusters needs to be searched, increasing the throughput of an IR system
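To make the payoff of the hypothesis concrete, here is a minimal sketch that computes what fraction of clusters contains at least one relevant document. The function name `clusters_to_search`, the cluster labels, and the toy data are all invented for illustration; none of it comes from the track's evaluation tooling.

```python
def clusters_to_search(assignments, relevant):
    """Fraction of clusters holding at least one relevant document.

    assignments: dict mapping doc_id -> cluster_id (hypothetical clustering)
    relevant:    set of doc_ids judged relevant to one query
    """
    all_clusters = set(assignments.values())
    hit_clusters = {assignments[d] for d in relevant if d in assignments}
    return len(hit_clusters) / len(all_clusters)


# Toy data, invented for illustration: 8 documents in 4 clusters.
assignments = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c", 7: "d", 8: "d"}

# If the hypothesis holds, relevant documents sit together in few clusters.
clustered = clusters_to_search(assignments, {1, 2})
# If it fails, they are spread over every cluster.
scattered = clusters_to_search(assignments, {1, 3, 5, 7})
print(clustered, scattered)
```

The smaller the fraction, the fewer clusters a retrieval system would have to open for that query.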

Data
• Source: Wikipedia
• 60 gigabytes, about 2.7 million documents in XML format
• Metadata is provided both for the complete collection and for subsets

Data Files
• Tags and trees:
  • <document ID> <tag ID 1>:<frequency> ... <tag ID n>:<frequency>
  • <document ID> <tree ID> <tree ID> <length of the string> <depth-first traversal>
  • 14052 0 0 15 1 2 3 -1 4 -1 5 -1 -1 6 7 -1 8 -1 -1
• Links:
  • <document ID> <linked doc ID> ... <linked doc ID>
• Entities:
  • <document ID> <feature ID 1>:<frequency> ... <feature ID n>:<frequency>
• Bag-of-Words (BOW...Wow!):
  • BOW File:
    • <document ID> <term ID 1>:<frequency> ... <term ID n>:<frequency>
  • Term Index File:
    • 1472,bracelet
    • 547,depend
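The line formats above could be read along the following lines. This is a sketch under stated assumptions: frequencies are taken to be integers, the fourth field of a tree line is taken to be the traversal length, and -1 in the traversal is taken to mean "go back up one level". The function names and the sample BOW line are hypothetical; only the tree line and the two term index entries come from the slides.

```python
def parse_sparse_line(line):
    """Parse '<document ID> <ID 1>:<frequency> ... <ID n>:<frequency>'.

    Fits the tag, entity, and BOW files. Assumes integer frequencies.
    """
    fields = line.split()
    doc_id = int(fields[0])
    counts = {}
    for pair in fields[1:]:
        key, freq = pair.split(":")
        counts[int(key)] = int(freq)
    return doc_id, counts


def parse_term_index(lines):
    """Parse '<term ID>,<term>' lines into a dict of term ID -> term."""
    index = {}
    for line in lines:
        term_id, term = line.strip().split(",", 1)
        index[int(term_id)] = term
    return index


def decode_tree(line):
    """Turn a tree line into (doc_id, list of (parent_tag, child_tag) edges).

    Assumption: field 4 is the traversal length and -1 means backtrack.
    """
    fields = [int(x) for x in line.split()]
    doc_id, length = fields[0], fields[3]
    traversal = fields[4:4 + length]
    stack, edges = [], []
    for tag in traversal:
        if tag == -1:
            stack.pop()              # leave the current node
        else:
            if stack:
                edges.append((stack[-1], tag))
            stack.append(tag)        # descend into the new node
    return doc_id, edges


# Tree line and term index entries from the slide; the BOW line is invented.
tree_doc, tree_edges = decode_tree(
    "14052 0 0 15 1 2 3 -1 4 -1 5 -1 -1 6 7 -1 8 -1 -1")
bow_doc, bow_counts = parse_sparse_line("14052 1472:3 547:1")
terms = parse_term_index(["1472,bracelet", "547,depend"])
print(tree_doc, tree_edges)
print({terms[t]: n for t, n in bow_counts.items()})
```

Under these assumptions the sample tree has tag 1 at the root with children 2 and 6, each of which has its own children.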

Solution: A Two-Pronged Approach
• First Prong:
  • Analyze links to discover maximum-flow communities
  • Using the Ford-Fulkerson algorithm
• Second Prong:
  • Use information from BOW and Entities to develop similarity measures between documents within clusters
  • Attempt to refine the clusters and develop better ones
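Both prongs can be sketched in a few lines. The first function runs Ford-Fulkerson with breadth-first augmenting paths (the Edmonds-Karp variant) and returns the source side of the minimum cut as a candidate community. The second uses cosine similarity over BOW vectors, which is one possible similarity measure chosen here for illustration; the slides do not say which measure was used. The toy link graph, the unit capacities, the choice of source and sink, and the BOW counts are all assumptions.

```python
import math
from collections import defaultdict, deque


def max_flow_min_cut(edges, source, sink):
    """Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp variant).

    edges: iterable of (u, v, capacity).
    Returns (flow value, set of nodes on the source side of the min cut).
    """
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += 0          # make sure the reverse edge exists

    def reachable():
        """BFS over edges with spare capacity; returns a parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if sink not in parent:       # no augmenting path is left
            return flow, set(parent)
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term ID -> frequency dicts."""
    dot = sum(freq * b.get(term, 0) for term, freq in a.items())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy link graph, invented: two triangles joined by the single link 3-4.
links = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
# Treat each link as undirected with unit capacity (an assumption).
edges = [(u, v, 1) for u, v in links] + [(v, u, 1) for u, v in links]
flow, community = max_flow_min_cut(edges, source=1, sink=6)

# Invented BOW vectors for the documents in the community.
bow = {1: {1472: 2, 547: 1}, 2: {1472: 1, 547: 1}, 3: {547: 3}}
sim_12 = cosine_similarity(bow[1], bow[2])
sim_13 = cosine_similarity(bow[1], bow[3])
print(flow, sorted(community), sim_12, sim_13)
```

In this toy graph the only link between the two triangles is the cut, so the community around document 1 is the first triangle. Pairwise similarities inside it could then be used to decide whether a document belongs in the cluster or should be moved.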
