Challenges with XML Challenges with Semi-Structured collections

Challenges with XML Challenges with Semi-Structured collections Ludovic Denoyer University of Paris 6 Bridging the gap between research communities

Outline • Motivations • XML Mining Challenge • Graph Labelling/WebSpam Challenge • Conclusion and future work

General Idea • The two challenges were proposed to attract researchers from different domains: • Mainly Machine Learning and Information Retrieval • Show IR researchers that ML methods are able to solve some of their problems • Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms

General Idea • Find generic tasks that correspond to: • New real IR applications • New generic ML problems • To work together… • To pool efforts… • To solve these tasks faster… • To compare the approaches…

Open questions in ML

Open questions in IR

Motivations XML Mining Challenge

Motivations WebSpam Challenge XML Mining Challenge

Motivations

Challenges • XML Mining Challenge • « Bridging the gap between Machine Learning and Information Retrieval » • Graph Labelling Challenge • Application to WebSpam detection

Outline • Motivations • XML Mining Challenge • WebSpam Challenge • Conclusion and future work

XML Mining Challenge • Launched in 2005 • PASCAL (Network of excellence in ML) • DELOS (Network of excellence in Digital Libraries) • Organized as an INEX Track • INEX: Initiative for the Evaluation of XML IR • More than 50 different institutes involved • One event each year at INEX (December) • Biggest INEX Track (after ad-hoc retrieval) • We are currently launching the 4th XML Mining track

XML Mining Challenge • ML Goal • Classification of large collections of structures • IR Goal • Classification of semi-structured collections • Using both structure and content

Underlying idea • Using structure and content information
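
One way to picture "structure and content" together is a feature map that counts tag paths (structure), words (content), and words qualified by the path they occur under (both). The sketch below is only an illustration under that assumption: the function name `structure_content_features`, the feature naming scheme, and the sample movie document are hypothetical and are not taken from any challenge submission.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_content_features(xml_text):
    """Return a bag of features mixing structure and content.

    Hypothetical feature scheme (an assumption, not a challenge baseline):
      path=/a/b   -> how often the tag path /a/b occurs (structure)
      word=w      -> how often the word w occurs anywhere (content)
      /a/b:w      -> how often w occurs directly under path /a/b (both)
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, parent_path):
        path = f"{parent_path}/{node.tag}"
        features[f"path={path}"] += 1
        # Text owned by this node: its leading text plus the tails of its children.
        own_text = node.text or ""
        for child in node:
            visit(child, path)
            own_text += " " + (child.tail or "")
        for word in own_text.lower().split():
            features[f"word={word}"] += 1
            features[f"{path}:{word}"] += 1

    visit(root, "")
    return features


# Hypothetical document, loosely in the spirit of a movie collection.
doc = "<movie><title>Alien</title><cast><actor>Sigourney Weaver</actor></cast></movie>"
feats = structure_content_features(doc)
print(sorted(feats))
```

The resulting counts could be fed to any vector-based classifier; a structure-only or content-only model would simply keep one family of features.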

Collections • Different collections have been used: • 2005 • Artificial collection • Movie collection • 2006 • Scientific articles • Wikipedia XML based collection • 2007 • Wikipedia XML based collection • 96,000 documents in XML • 21 categories

Submitted papers

Large variety of models • Different existing ML methods have been applied: • Self-Organizing Map • SVM • (Graph) Neural Network • CRF • Incremental models • … • Some new models have been developed

Short Typology • See Report on the XML Mining track – SIGIR Forum

Results - 2007 • Classification

XML Structure Mapping task • Proposed in 2006 • ML task: Structured output classification • Learning to transform trees • IR application: Dealing with heterogeneous collections • Learning to transform heterogeneous documents to a mediated schema

XML Structure Mapping • A generic ML model able to solve this task has many potential applications: • Conversion between file formats • Automatic translation • Natural Language Processing • …

Conclusion • Existing structured input models (kernels,…) have been tested on this task • New specific models have been developed • Difficult to know which model is the best • Need to wait one more year • The challenge has attracted researchers from different communities • Each year, ML researchers come to INEX and: • Discover a new domain • Present advanced ML models to other researchers • The collections are freely available and have been downloaded a hundred times • …some articles are starting to appear in different conferences…

WebSpam Challenge • PASCAL « Graph Labelling Challenge » • Organized by: • Ricardo BAEZA-YATES (Yahoo! Research Barcelona) • Carlos CASTILLO (Yahoo! Research Barcelona) • Brian DAVISON (Lehigh University, USA) • Ludovic DENOYER (University Paris 6, France) • Patrick GALLINARI (University Paris 6, France) • The Web Spam Challenge 2007 was supported by PASCAL • The Web Spam Challenge 2007 was also supported by the DELIS EU - FET research project

WebSpam Challenge • Three Events: • AirWeb workshop 2007 (WWW’07) • May 2007 • Web-oriented part • GraphLab workshop 2007 – PKDD/ECML • September 2007 • ML-oriented part • AirWeb workshop 2008 (WWW’08?)

WebSpam Challenge • IR (Web) Task: • Detection of web spam • Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”

Example of spam

WebSpam Challenge • ML task: • Graph labelling • Classification of interdependent variables

Collection • A collection of interconnected Web pages • 77 million pages • About 11,000 hosts • Manually labeled as spam or normal (host level) • Blind evaluation of models

Participants

Participants • Why such an increase in ML participants during GraphLab?

GraphLab workshop at ECML/PKDD 2007 • The collection has been fully preprocessed by the organizers • Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page • The contingency matrix has been built • One small collection with 9,000 nodes • One large collection with 400,000 nodes • 10% for train/20% for validation/70% for test • You can easily apply your « relational » models to this corpus without knowing anything about text processing
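
As a rough illustration of what working with such a preprocessed corpus can look like, the sketch below parses one line in SVMLight format (`<label> <index>:<value> ... # comment`) and cuts a list of node ids into 10% / 20% / 70% parts. Several things here are assumptions rather than facts about the challenge data: the `+1`/`-1` label encoding, the sample line, the random shuffling with a fixed seed, and the helper names `parse_svmlight_line` and `split_nodes`. The official splits were fixed by the organizers and may have been built differently.

```python
import random


def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ... # comment' into (label, {index: value}).

    Assumes plain classification lines with integer labels (no 'qid:' tokens).
    """
    payload = line.split("#", 1)[0].strip()  # drop the optional trailing comment
    tokens = payload.split()
    label = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def split_nodes(node_ids, seed=0):
    """Shuffle node ids and cut them into 10% train, 20% validation, 70% test."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) // 10
    n_valid = len(ids) * 2 // 10
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test


# Hypothetical line: label +1, two non-zero word features.
label, features = parse_svmlight_line("+1 3:0.5 7:1.0 # hypothetical host")
train, valid, test = split_nodes(range(9000))
print(label, features, len(train), len(valid), len(test))
```

With 9,000 nodes, as in the small collection, this cut gives 900 training nodes, 1,800 validation nodes, and 6,300 test nodes.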

Results • Small collection (9,000 nodes)

Results • Large collection (400,000 nodes)

Conclusion on WebSpam • Different pure ML methods used « as is » • Semi-supervised methods • Stacked Learning • … • Very good performance of ML models (equivalent to « hand-made » Web models)
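
To give a flavour of how stacked learning can exploit the graph, the sketch below shows only the feature-augmentation step in one common formulation: each node's feature vector is extended with the average base-classifier score of its neighbours, and a second classifier would then be trained on the extended vectors. This is an assumption-laden illustration, not a description of any participant's system: links are treated as undirected, every node appearing in an edge is assumed to have a score, and the names `neighbor_average` and `stack_features` are hypothetical.

```python
def neighbor_average(scores, edges):
    """Average the base scores of each node's neighbours.

    scores: dict mapping node -> base classifier score (e.g. spam probability)
    edges:  iterable of (u, v) pairs, treated as undirected links (assumption)
    """
    neighbors = {node: set() for node in scores}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    averaged = {}
    for node, linked in neighbors.items():
        if linked:
            averaged[node] = sum(scores[n] for n in linked) / len(linked)
        else:
            averaged[node] = scores[node]  # isolated node: fall back to its own score
    return averaged


def stack_features(features, scores, edges):
    """Append the neighbour-average score to every node's feature vector."""
    averaged = neighbor_average(scores, edges)
    return {node: list(vector) + [averaged[node]] for node, vector in features.items()}


# Hypothetical toy graph: host "a" links to "b" and "c"; "d" is isolated.
base_scores = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.3}
links = [("a", "b"), ("a", "c")]
vectors = {"a": [1.0], "b": [0.0], "c": [2.0], "d": [5.0]}
stacked = stack_features(vectors, base_scores, links)
print(stacked)
```

The design choice is that the second-level model never needs to know about the graph itself: neighbourhood information reaches it as one extra numeric feature.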

Conclusion on WebSpam • Development of an ML benchmark for graph labelling • WebSpam also raises interesting ML problems that could be integrated into the challenge • Learning with few examples • Large-scale problems • Adversarial Machine Learning • …

Conclusion • The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems • It is possible to bring together researchers from different communities • ML researchers dislike cleaning real collections • you have to preprocess the collections • ML researchers dislike large collections • but this is changing…

Future work • XML Mining will continue this year • See http://xmlmining.lip6.fr • The corpus will be preprocessed? • WebSpam challenge will also continue • See http://webspam.lip6.fr • We will see after WWW’08 whether we propose another GraphLab workshop (see http://graphlab.lip6.fr) • Note that a new, larger corpus has been developed in 2008

Thank you for your attention • (Thank you to the participants of the different challenges who are in the room)
