A Wordification Approach to Relational Data Mining: Early Results Matic Perovšek, Anže Vavpetič, Nada Lavrač Jožef Stefan Institute, Slovenia
Overview • Introduction • Methodology • Experimental results • Conclusion
Introduction • Relational data mining algorithms aim to induce models and/or relational patterns from multiple tables • Individual-centered relational databases can be transformed to a single-table form – propositionalization
Motivation • Wordification inspired by text mining techniques • Large number of simple, easy-to-understand features • Greater scalability, handling large datasets • Can be used as a preprocessing step to propositional learners, as well as to declarative modeling / constraint solving (De Raedt et al., today’s invited talk)
Methodology • Transformation from relational database to a textual corpus • TF-IDF weight calculation
Transformation from relational database to a textual corpus • One individual of the initial relational database -> one text document • Features -> the words of this document • Words constructed as a combination:
Transformation from relational database to a textual corpus • For each individual, the words generated for the main table are concatenated with words generated from the secondary (BK) tables
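The transformation described on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `wordify` function, the `table_attribute_value` word format, and the toy movie record are all assumptions made for the example, since the slides do not show the exact word construction.

```python
def wordify(main_row, secondary_rows, main_table="movie"):
    """Turn one individual into a list of words (one text document).

    ASSUMPTION: each word is built as table_attribute_value; the slides
    do not spell out the exact combination, so this format is hypothetical.
    """
    # Words generated from the main table row of this individual.
    words = [f"{main_table}_{attr}_{value}" for attr, value in main_row.items()]
    # Concatenate words generated from the secondary (BK) tables.
    for table, rows in secondary_rows.items():
        for row in rows:
            words.extend(f"{table}_{attr}_{value}" for attr, value in row.items())
    return words


# Hypothetical toy individual: one movie with linked actors and a director.
movie = {"genre": "drama", "decade": "1990s"}
background = {
    "actor": [{"gender": "f"}, {"gender": "m"}],
    "director": [{"genre": "drama"}],
}

document = wordify(movie, background)
print(" ".join(document))
```

Each individual yields exactly one document, so a database with N individuals in the main table becomes a corpus of N documents.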
TF-IDF weights • No explicit use of existential variables in our features, TF-IDF instead • The weight of a word gives a strong indication of how relevant the feature is for the given individual. • The TF-IDF weights can then be used either to filter out words with low importance or directly by a propositional learner.
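As a rough illustration of the weighting step, the sketch below computes TF-IDF over a tiny corpus of word lists. It assumes one common textbook variant (term frequency normalized by document length, multiplied by the natural log of N divided by document frequency); the slides do not say which variant was used, and the `tf_idf` function and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: weight} dict per document.

    ASSUMPTION: weight = (count / document length) * ln(N / df),
    one common TF-IDF variant; the slides do not fix the formula.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append(
            {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return weights


# Hypothetical two-document corpus of wordified individuals.
corpus = [
    ["movie_genre_drama", "actor_gender_f", "actor_gender_m"],
    ["movie_genre_comedy", "actor_gender_m", "actor_gender_m"],
]
weights = tf_idf(corpus)

# Filtering: keep only words whose weight exceeds a chosen threshold.
kept = [{w for w, x in doc_weights.items() if x > 0.0} for doc_weights in weights]
```

A word that occurs in every document, such as `actor_gender_m` here, gets weight 0 under this variant and is dropped by the filter, while words specific to one individual keep a positive weight.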
Experimental results • Slovenian traffic accidents database • IMDB database • Top 250 and bottom 100 movies • Movies, actors, movie genres, directors, director genres • Applied the wordification methodology • Performed association rule learning
Conclusion • Novel propositionalization technique called Wordification • Greater scalability • Easy-to-understand features • Further work: • Test on larger databases • Experimental comparison with other propositionalization techniques • Combine with propositionalization-like approach to mining heterogeneous information networks (Grčar et al. 2012), applicable to CLP in data preprocessing Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched Heterogeneous Information Networks, Computer Journal 2012