
Challenges with XML / Challenges with Semi-Structured collections




Presentation Transcript


  1. Challenges with XML / Challenges with Semi-Structured collections Ludovic Denoyer University of Paris 6 Bridging the gap between research communities

  2. Outline • Motivations • XML Mining Challenge • Graph Labelling/WebSpam Challenge • Conclusion and future work

  3. General Idea • The two challenges have been proposed to try to attract researchers from different domains: • Mainly Machine Learning and Information Retrieval • Show IR researchers that ML methods are able to solve some of their problems • Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms

  4. General Idea • Find generic tasks that correspond to: • IR: new real applications • ML: new generic problems • To work together… • To mutualize efforts… • To solve these tasks faster… • To compare the approaches…

  5. Open questions in ML

  6. Open questions in IR

  7. Motivations XML Mining Challenge

  8. Motivations WebSpam Challenge XML Mining Challenge

  9. Motivations

  10. Challenges • XML Mining Challenge • « Bridging the gap between Machine Learning and Information Retrieval » • Graph Labelling Challenge • Application to WebSpam detection

  11. Outline • Motivations • XML Mining Challenge • WebSpam Challenge • Conclusion and future work

  12. XML Mining Challenge • Launched in 2005 • PASCAL (Network of Excellence in ML) • DELOS (Network of Excellence in Digital Libraries) • Organized as an INEX track • INEX: Initiative for the Evaluation of XML IR • More than 50 different institutes involved • One event each year at INEX (December) • Biggest INEX track (after ad-hoc retrieval) • We are currently launching the 4th XML Mining track

  13. XML Mining Challenge • ML Goal • Classification of large collections of structures • IR Goal • Classification of semi-structured collections • Using both structure and content

  14. Underlying idea • Using structure and content information

  15. Collections • Different collections have been used: • 2005 • Artificial collection • Movie collection • 2006 • Scientific articles • Wikipedia XML based collection • 2007 • Wikipedia XML based collection • 96,000 documents in XML • 21 categories

  16. Submitted papers

  17. Large variety of models • Different existing ML methods have been applied: • Self-Organizing Maps • SVM • (Graph) Neural Networks • CRF • Incremental models • … • Some new models have been developed

  18. Short Typology • See the report on the XML Mining track – SIGIR Forum

  19. Results - 2007 • Classification

  20. XML Structure Mapping task • Proposed in 2006 • ML task: structured output classification • Learning to transform trees • IR application: dealing with heterogeneous collections • Learning to transform heterogeneous documents into a mediated schema

  21. XML Structure Mapping • A generic ML model able to solve this task has a lot of potential applications: • Conversion between file formats • Automatic translation • Natural Language Processing • …

  22. Conclusion • Existing structured-input models (kernels, …) have been tested on this task • New specific models have been developed • Difficult to know which model is the best • Need to wait one more year • The challenge has attracted researchers from different communities • Each year, ML researchers come to INEX and: • Discover a new domain • Present advanced ML models to other researchers • The collections are freely available and have been downloaded a hundred times • …some articles are starting to appear in different conferences…

  23. WebSpam Challenge • PASCAL « Graph Labelling Challenge » • Organized by: • Ricardo BAEZA-YATES (Yahoo! Research Barcelona) • Carlos CASTILLO (Yahoo! Research Barcelona) • Brian DAVISON (Lehigh University, USA) • Ludovic DENOYER (University Paris 6, France) • Patrick GALLINARI (University Paris 6, France) • The Web Spam Challenge 2007 was supported by PASCAL and by the DELIS EU-FET research project

  24. WebSpam Challenge • Three events: • AirWeb workshop 2007 (WWW’07) • May 2007 • Web-oriented part • GraphLab workshop 2007 – ECML/PKDD • September 2007 • ML-oriented part • AirWeb workshop 2008 (WWW’08?)

  25. WebSpam Challenge • IR (Web) task: • Detection of web spam • Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”

  26. Example of spam

  27. WebSpam Challenge • ML learning task: • Graph labelling • Classification of inter-dependent variables
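The graph-labelling task can be illustrated with a minimal sketch. The following is a generic majority-vote label-propagation baseline, not any participant's actual method; all names (`adj`, `seed_labels`) are hypothetical:

```python
def propagate_labels(adj, seed_labels, n_iters=10):
    """Iterative majority-vote label propagation over a graph.

    adj: adjacency dict {node: [neighbor, ...]}.
    seed_labels: {node: label} for the manually labeled nodes.
    Unlabeled nodes start as None and repeatedly adopt the
    majority label of their already-labeled neighbors.
    """
    labels = {node: seed_labels.get(node) for node in adj}
    for _ in range(n_iters):
        changed = False
        for node, neighbors in adj.items():
            if node in seed_labels:
                continue  # keep the hand-labeled seeds fixed
            votes = [labels[m] for m in neighbors if labels.get(m) is not None]
            if votes:
                new_label = max(set(votes), key=votes.count)
                if new_label != labels[node]:
                    labels[node] = new_label
                    changed = True
        if not changed:
            break  # converged: no node changed its label
    return labels
```

On a chain graph with one labeled spam seed, the label spreads to every connected node; real systems would of course combine such propagation with content features.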

  28. Collection • A collection of interconnected Web pages • 77 million pages • About 11,000 hosts • Manually labeled as spam or normal (host level) • Blind evaluation of models

  29. Participants

  30. Participants • Why such an increase in ML participants during GraphLab?

  31. GraphLab workshop at ECML/PKDD 2007 • The collection has been fully preprocessed by the organizers • Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page • The contingency matrix has been built • One small collection with 9,000 nodes • One large collection with 400,000 nodes • 10% for train / 20% for validation / 70% for test • You can easily apply your « relational » models on this corpus without knowing anything about text processing
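As a sketch of how one might consume such preprocessed data, here is a minimal parser for plain SVMLight-format lines plus a 10/20/70 split. This is an illustration only: the function names are hypothetical, and the parser handles only bare `label index:value` lines (no comments or `qid` fields):

```python
import random

def parse_svmlight_line(line):
    """Parse one SVMLight-format line '<label> <idx>:<val> ...'
    into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    for token in parts[1:]:
        idx, val = token.split(":")
        features[int(idx)] = float(val)
    return label, features

def split_nodes(node_ids, seed=0):
    """10% train / 20% validation / 70% test, as in the GraphLab corpus."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n = len(ids)
    n_train, n_valid = n // 10, n // 5
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test
```

Libraries such as scikit-learn provide a full SVMLight loader (`load_svmlight_file`) for real use; the point here is only that the format is trivial to consume without any text-processing knowledge.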

  32. Results • Small collection (9,000 nodes)

  33. Results • Large collection (400,000 nodes)

  34. Conclusion on WebSpam • Different pure ML methods used « as is » • Semi-supervised methods • Stacked learning • … • Very good performance from ML models (equivalent to hand-made Web models)
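The stacked-learning idea mentioned above can be sketched as follows: a base classifier's spam scores are averaged over each node's neighbors and appended as an extra feature, after which a second-stage classifier is retrained on the augmented vectors. This is only an illustration of the technique, with hypothetical names (`adj`, `base_preds`), not any participant's actual system:

```python
def stacked_features(adj, base_preds, feats):
    """Augment each node's feature dict with the mean base-classifier
    prediction of its neighbors (the 'stacked' relational feature).

    adj: adjacency dict {node: [neighbor, ...]}.
    base_preds: {node: spam score in [0, 1]} from a first-stage classifier.
    feats: {node: {feature_name: value}} original content features.
    """
    augmented = {}
    for node, feature_vec in feats.items():
        neighbors = adj.get(node, [])
        if neighbors:
            mean_pred = sum(base_preds[m] for m in neighbors) / len(neighbors)
        else:
            mean_pred = 0.0  # isolated node: no relational signal
        feature_vec = dict(feature_vec)  # copy; don't mutate the input
        feature_vec["neighbor_spam_score"] = mean_pred
        augmented[node] = feature_vec
    return augmented
```

The appeal of this scheme is that any off-the-shelf classifier can be reused for both stages; the graph structure enters only through the one extra feature.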

  35. Conclusion on WebSpam • Development of a ML benchmark for graph labelling • WebSpam also proposes interesting ML challenges that could be integrated in the challenge • Learning with few examples • Large-scale problems • Adversarial Machine Learning • …

  36. Conclusion • The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems • It is possible to mix researchers from different communities • ML researchers dislike cleaning real collections • you have to preprocess the collections • ML researchers dislike large collections • but it is moving…

  37. Future work • XML Mining will continue this year • See http://xmlmining.lip6.fr • Will the corpus be preprocessed? • The WebSpam challenge will also continue • See http://webspam.lip6.fr • We will see after WWW’08 whether we propose another GraphLab workshop (see http://graphlab.lip6.fr) • Note that a new, larger corpus has been developed in 2008

  38. Thank you for your attention • (Thank you to the participants of the different challenges who are in the room)
