A Wordification Approach to Relational Data Mining: Early Results

A Wordification Approach to Relational Data Mining: Early Results • Matic Perovšek, Anže Vavpetič, Nada Lavrač • Jožef Stefan Institute, Slovenia

Overview • Introduction • Methodology • Experimental results • Conclusion

Introduction • Relational data mining algorithms aim to induce models and/or relational patterns from multiple tables • Individual-centered relational databases can be transformed to a single-table form – propositionalization

Motivation • Wordification inspired by text mining techniques • Large number of simple, easy-to-understand features • Greater scalability, handling large datasets • Can be used as a preprocessing step to propositional learners, as well as to declarative modeling / constraint solving (De Raedt et al., today’s invited talk)

Methodology • Transformation from relational database to a textual corpus • TF-IDF weight calculation

Transformation from relational database to a textual corpus • One individual of the initial relational database -> one text document • Features -> the words of this document • Words constructed as a combination:

Transformation from relational database to a textual corpus • For each individual, the words generated for the main table are concatenated with words generated from the secondary (BK) tables
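
The transformation described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy tables, the helper names `make_words` and `wordify`, and the `table_attribute_value` word format are all hypothetical choices made for this example.

```python
from collections import defaultdict

def make_words(table_name, row, skip=()):
    """Turn one row into words of the assumed form table_attribute_value."""
    return [f"{table_name}_{attr}_{value}"
            for attr, value in row.items() if attr not in skip]

def wordify(main_name, main_rows, secondary, key="id"):
    """Build one document (a list of words) per individual of the main table.

    `secondary` maps a table name to (foreign_key, rows). Rows whose foreign
    key equals an individual's key contribute their words to its document.
    """
    # Collect the words of all secondary (BK) rows, grouped by individual.
    linked = defaultdict(list)
    for table_name, (fk, rows) in secondary.items():
        for row in rows:
            linked[row[fk]].extend(make_words(table_name, row, skip=(fk,)))

    # Concatenate main-table words with the linked secondary-table words.
    documents = {}
    for row in main_rows:
        words = make_words(main_name, row, skip=(key,))
        documents[row[key]] = words + linked[row[key]]
    return documents

# Hypothetical toy data: a main table and one secondary table.
movies = [{"id": 1, "decade": "1990s"}, {"id": 2, "decade": "2000s"}]
genres = [
    {"movie_id": 1, "genre": "drama"},
    {"movie_id": 1, "genre": "crime"},
    {"movie_id": 2, "genre": "horror"},
]
docs = wordify("movie", movies, {"moviegenre": ("movie_id", genres)})
```

Each entry of `docs` is the "text document" of one individual, with key columns left out so that identifiers do not become features.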

Example

TF-IDF weights • No explicit use of existential variables in our features; TF-IDF weights are used instead • The weight of a word gives a strong indication of how relevant the feature is for the given individual • The TF-IDF weights can then be used either to filter out words of low importance or directly by a propositional learner
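
As a rough illustration of the weighting step, the sketch below computes TF-IDF weights for a small corpus of word lists and then filters out low-weight words. The slides do not state which TF-IDF variant is used, so the common form tf(w, d) * log(N / df(w)) is assumed here; the function names, the toy corpus, and the threshold are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: dict individual -> list of words.

    Returns dict individual -> {word: weight}, using the assumed variant
    (count / document length) * log(number of documents / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document

    weights = {}
    for doc_id, words in documents.items():
        counts = Counter(words)
        total = len(words)
        weights[doc_id] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights

def filter_words(weights, threshold):
    """Keep only the words whose weight reaches the threshold."""
    return {doc_id: {w: x for w, x in ws.items() if x >= threshold}
            for doc_id, ws in weights.items()}

# Hypothetical toy corpus: one word shared by both individuals, one distinct.
corpus = {
    1: ["movie_color_yes", "genre_drama"],
    2: ["movie_color_yes", "genre_horror"],
}
weights = tfidf(corpus)
kept = filter_words(weights, threshold=0.1)
```

A word that occurs in every document gets weight zero under this variant and is removed by the filter, while a word specific to one individual keeps a positive weight.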

Experimental results • Slovenian traffic accidents database • IMDB database • Top 250 and bottom 100 movies • Movies, actors, movie genres, directors, director genres • Applied the wordification methodology • Performed association rule learning
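
To make the last step concrete, the sketch below shows how the support and confidence of an association rule could be computed once each individual is represented as a set of words. The slides do not say which association rule learner was applied, so this only illustrates the two standard measures; the transactions are hypothetical toy data and not taken from the IMDB or traffic accidents experiments.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical toy transactions: one set of words per individual.
transactions = [
    {"genre_drama", "class_top"},
    {"genre_drama", "class_top"},
    {"genre_horror", "class_bottom"},
    {"genre_drama", "class_bottom"},
]

rule_support = support({"genre_drama", "class_top"}, transactions)
rule_confidence = confidence({"genre_drama"}, {"class_top"}, transactions)
```

A rule learner would enumerate candidate itemsets and keep the rules whose support and confidence exceed user-chosen thresholds.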

Experimental results

Conclusion • Novel propositionalization technique called Wordification • Greater scalability • Easy-to-understand features • Further work: • Test on larger databases • Experimental comparison with other propositionalization techniques • Combine with propositionalization-like approach to mining heterogeneous information networks (Grčar et al. 2012), applicable to CLP in data preprocessing • Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched Heterogeneous Information Networks, Computer Journal, 2012
