An Empirical Investigation of Learning from the Semantic Web
Pete Edwards, Gunnar AAstrand Grimnes, Alun Preece
Computing Science Department, University of Aberdeen
{pedwards,ggrimnes,apreece}@csd.abdn.ac.uk
Semantic Web Mining Workshop @ ECML 2002

Motivation
• The Semantic Web should:
  • Facilitate learning from the Web.
  • Facilitate reuse of learning outcomes.
Hypothesis: Learning from semantically marked-up data should outperform learning from plain text.

Methods
• Compare the performance of learning from plain text with learning from semantic meta-data.
• Use traditional ML algorithms as a baseline approach:
  • Naïve Bayes
  • K-Nearest Neighbour
• Explore the application of more knowledge-intensive approaches, such as ILP.

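As an illustration of the baseline approach, the sketch below shows a minimal multinomial Naïve Bayes classifier over bag-of-words instances, with add-one smoothing. It is not the implementation used in the experiments; the function names, the class labels and the four toy training instances are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (tokens, label) pairs; returns the counts the classifier needs."""
    class_counts = Counter()            # number of training instances per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data (not taken from ITTalks or ResearchIndex).
TRAIN = [
    (["agent", "mobile", "kqml"], "agents"),
    (["agent", "bdi", "negotiation"], "agents"),
    (["learning", "case", "based"], "ml"),
    (["learning", "bayes", "classifier"], "ml"),
]
model = train_naive_bayes(TRAIN)
predicted = classify(model, ["mobile", "agent"])
```

The same `classify` function can be run on tokens drawn from plain text or from meta-data, which is the comparison the method calls for.
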
Datasets
• The Semantic Web is still in its infancy, so available datasets are limited.
• We need a dataset with instances represented both in plain text and in some semantic markup language.
• Forced to use artificial datasets.
• No ontological support.

ITTalks (http://ittalks.org)
• ITTalks is a real Semantic Web application.
• Information about seminars at universities in the US.
• The plain HTML and DAML+OIL versions of each talk have slightly different, but largely overlapping, content.
• The data carries no classification, so we labelled it by personal preference.

ITTalks example
<rdf:RDF>
  <rdf:Description about="http://www.ittalks.org/jsp/Controller.jsp?action=ViewTalk&as=HTML&talkid=20010620141011">
    <Talk rdf:parseType="Resource">
      <Title>PROBABILISTIC OPTIMIZATION TECHNIQUES FOR MULTICAST KEY MANAGEMENT … </Title>
      <Abstract>Multicast is a key technology to support large group communications over the Internet… </Abstract>
      <BeginTime>
        <time:Year>2001</time:Year>
        <time:Month>06</time:Month>
        <time:Day>20</time:Day>
        ...
      </BeginTime>
      ...
      <Audience>General Public</Audience>
      <DomainName>umbc</DomainName>
      <Location rdf:parseType="Resource">
        <Institution>UMBC</Institution>
      </Location>
      <Speaker rdf:parseType="Resource">
        <Name>Ali Selcuk</Name>
        <Organization>UMBC</Organization>
        <Email>aselcu1@csee.umbc.edu</Email>
      </Speaker>
    </Talk>
  </rdf:Description>
</rdf:RDF>

ResearchIndex (http://citeseer.nj.nec.com)
• ResearchIndex is a digital library of scientific literature.
• Articles from 17 different subject areas within Computing Science.
• Full text of each article and its BibTeX entry are provided.
• BibTeX converted to RDF.
• The full text is typically 6000 words.
• The BibTeX is typically 10 RDF statements.

BibTeX to RDF mapping
@inproceedings{ davies94agentk,
  author = "W. H. E. Davies and P. Edwards",
  title = "Agent-K: An Integration of AOP and KQML",
  booktitle = "Proceedings of the CIKM'94 Workshop on Intelligent Agents",
  address = "Gaithersburg, MD, USA",
  editor = "T. Finin and Y. Labrou",
  year = "1994",
  url = "citeseer.nj.nec.com/15298.html"
}
<inproceedings rdf:about="davies94agentk">
  <author>W. H. E. Davies and P. Edwards</author>
  <title>Agent-K: An Integration of AOP and KQML</title>
  <booktitle>Proceedings of the CIKM'94 Workshop on Intelligent Agents</booktitle>
  <address>Gaithersburg, MD, USA</address>
  <editor>T. Finin and Y. Labrou</editor>
  <year>1994</year>
  <url>citeseer.nj.nec.com/15298.html</url>
</inproceedings>

Knowledge Sparse Learning: Representation
• For each algorithm we use 3 instance representations:
  1. Conventional plain text
  2. Meta-data as plain text
  3. Meta-data tags to feature mapping

Method 3: Meta-data tags to feature mapping
Meta-data instance:
<xml>
  <rdf>
    <talk id='mlsemweb1'>
      <title>An Empirical Investigation of Learning from the Semantic Web</title>
      <speaker>
        <name>Gunnar AAstrand Grimnes</name>
        <url>http://www.csd.abdn.ac.uk/~ggrimnes</url>
      </speaker>
      ...
Feature tags: talk, title, speaker, name, url ...
Instance representation: {}, {empirical, investigation, learning, semantic, web}, {}, {gunnar, aastrand, grimnes}, {csd, abdn, ggrimnes} ...

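The tags-to-feature mapping can be sketched in a few lines of Python: every feature tag becomes one feature, and its value is the set of words found directly inside that tag. This is only a sketch under assumptions. The stop list is a guess chosen so that the example above is reproduced, the function names are hypothetical, and the input is a completed version of the elided example with the outer `<xml>` wrapper left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed stop list: picked so the slide's example output is reproduced.
STOP_WORDS = {"an", "of", "from", "the", "http", "www", "ac", "uk"}

def tokenize(text):
    """Lower-case the text, split on non-alphanumerics and drop stop words."""
    words = re.findall(r"[a-z0-9]+", (text or "").lower())
    return {w for w in words if w not in STOP_WORDS}

def tags_to_features(xml_text, feature_tags):
    """Map each feature tag to the set of words in the direct text of its elements."""
    root = ET.fromstring(xml_text)
    features = {tag: set() for tag in feature_tags}
    for element in root.iter():
        if element.tag in features:
            features[element.tag] |= tokenize(element.text)
    return features

# Completed version of the elided example instance (closing tags added).
EXAMPLE = """
<rdf>
  <talk id='mlsemweb1'>
    <title>An Empirical Investigation of Learning from the Semantic Web</title>
    <speaker>
      <name>Gunnar AAstrand Grimnes</name>
      <url>http://www.csd.abdn.ac.uk/~ggrimnes</url>
    </speaker>
  </talk>
</rdf>
"""
features = tags_to_features(EXAMPLE, ["talk", "title", "speaker", "name", "url"])
```

Container tags such as `talk` and `speaker` hold only whitespace as direct text, so they map to empty sets, matching the `{}` entries in the instance representation.
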
Knowledge Sparse Learning: Results
[Charts: ITTalks, ResearchIndex]
• ITTalks:
  • Meta 2 performs poorly, caused by redundant features.
  • Text & Meta 1 perform very similarly, because the plain-text and meta-data instances in this dataset are almost identical.
• ResearchIndex:
  • KNN performs better on the full-text instances, as it is better at dealing with large numbers of features.

Knowledge Intensive Learning: Representation
• Ignore the plain-text representations.
• RDF maps to a first-order logic (Prolog) representation.
• Use the ILP algorithm Progol 4.4 to learn Prolog rules for class descriptions.
• Solve binary classification problems.

RDF to Prolog mapping
url( davies94agentk, 'citeseer.nj.nec.com/15298.html' ).
editor( davies94agentk, 'T. Finin' ).
editor( davies94agentk, 'Y. Labrou' ).
titleword( davies94agentk, 'agent' ).
titleword( davies94agentk, 'integration' ).
titleword( davies94agentk, 'aop' ).
titleword( davies94agentk, 'kqml' ).
author( davies94agentk, 'W. Davies' ).
author( davies94agentk, 'P. Edwards' ).
address( davies94agentk, 'Gaithersburg, MD, USA' ).
year( davies94agentk, '1994' ).
type( davies94agentk, '#inproceedings' ).
booktitleword( davies94agentk, 'proceedings' ).
booktitleword( davies94agentk, 'cikm94' ).
booktitleword( davies94agentk, 'workshop' ).
booktitleword( davies94agentk, 'intelligent' ).
booktitleword( davies94agentk, 'agents' ).
<inproceedings rdf:about="davies94agentk">
  <author>W. H. E. Davies and P. Edwards</author>
  <title>Agent-K: An Integration of AOP and KQML</title>
  <booktitle>Proceedings of the CIKM'94 Workshop on Intelligent Agents</booktitle>
  <address>Gaithersburg, MD, USA</address>
  <editor>T. Finin and Y. Labrou</editor>
  <year>1994</year>
  <url>citeseer.nj.nec.com/15298.html</url>
</inproceedings>

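The mapping rules themselves are not stated on the slide, but they can be guessed from the example: titles and book titles are split into word-level facts, person lists are split on "and" and reduced to first initial plus surname, and all other fields become a single fact. The Python sketch below encodes that reading. The stop list, the rule that one-letter tokens are dropped, and all function names are assumptions.

```python
import re

# Assumed stop list, inferred from which words are missing in the example facts.
STOP_WORDS = {"an", "and", "of", "the", "on"}

def words(text):
    """Lower-case, drop apostrophes, split on non-alphanumerics, filter short and stop words."""
    cleaned = text.lower().replace("'", "")
    return [w for w in re.findall(r"[a-z0-9]+", cleaned)
            if len(w) > 1 and w not in STOP_WORDS]

def short_name(name):
    """Keep the first initial and the surname, e.g. 'W. H. E. Davies' becomes 'W. Davies'."""
    parts = name.split()
    return f"{parts[0]} {parts[-1]}"

def quote(value):
    """Quote a value as a Prolog atom, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def record_to_facts(key, entry_type, fields):
    """Turn one bibliographic record into a list of Prolog fact strings."""
    facts = [f"type({key}, {quote('#' + entry_type)})."]
    for field, value in fields.items():
        if field in ("title", "booktitle"):
            facts += [f"{field}word({key}, {quote(w)})." for w in words(value)]
        elif field in ("author", "editor"):
            facts += [f"{field}({key}, {quote(short_name(n))})."
                      for n in value.split(" and ")]
        else:
            facts.append(f"{field}({key}, {quote(value)}).")
    return facts

RECORD = {
    "author": "W. H. E. Davies and P. Edwards",
    "title": "Agent-K: An Integration of AOP and KQML",
    "booktitle": "Proceedings of the CIKM'94 Workshop on Intelligent Agents",
    "address": "Gaithersburg, MD, USA",
    "editor": "T. Finin and Y. Labrou",
    "year": "1994",
    "url": "citeseer.nj.nec.com/15298.html",
}
facts = record_to_facts("davies94agentk", "inproceedings", RECORD)
```

Under these assumptions the output contains the same 17 facts as the example, up to spacing and order.
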
Knowledge Intensive Learning: Results
Agents experiment (155 clauses):
inClass(A) :- author(A,'A. Rao').
inClass(A) :- author(A,'D. Lambrinos').
inClass(A) :- titleword(A,agent), titleword(A,mobile).
inClass(A) :- type(A,'http://www.csd.abdn.ac.uk/~ggrimnes/exp/#misc'), textword(A,agent), titleword(A,agent).
inClass(A) :- year(A,1999), titleword(A,agents).
inClass(A) :- titleword(A,bdi).
Machine Learning (259 clauses):
inClass(A) :- publisher(A,'Morgan Kaufmann'), booktitleword(A,learning).
inClass(A) :- titleword(A,based), titleword(A,case).
Theory (279 clauses):
inClass(A) :- volume(A,18).

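Each learned theory reads as a disjunction of clauses, where every clause is a conjunction of conditions on one article. The Python sketch below shows how such rules could be applied to an instance held as a set of (predicate, value) facts. It covers only a subset of the Agents rules shown above, the data structures and function name are hypothetical, and the second test instance is invented for illustration.

```python
# A subset of the Agents rules: each inner list is one clause (a conjunction).
AGENT_RULES = [
    [("author", "A. Rao")],
    [("author", "D. Lambrinos")],
    [("titleword", "agent"), ("titleword", "mobile")],
    [("year", "1999"), ("titleword", "agents")],
    [("titleword", "bdi")],
]

def in_class(facts, rules):
    """True if at least one clause has all of its conditions present in the facts."""
    return any(all(condition in facts for condition in clause) for clause in rules)

# Facts for davies94agentk, taken from the RDF to Prolog mapping example.
AGENT_K = {
    ("author", "W. Davies"), ("author", "P. Edwards"),
    ("titleword", "agent"), ("titleword", "integration"),
    ("titleword", "aop"), ("titleword", "kqml"),
    ("year", "1994"),
}

# Hypothetical instance, invented to show a clause firing.
MOBILE_PAPER = {("titleword", "mobile"), ("titleword", "agent"), ("year", "2000")}

agent_k_covered = in_class(AGENT_K, AGENT_RULES)
mobile_covered = in_class(MOBILE_PAPER, AGENT_RULES)
```

Because the rules are plain data, the same theory can be inspected, edited or passed to another program, which is what makes these learning outcomes reusable.
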
Future work: Learning Personal Profiles
Gunnar’s profile, based on 200 manually rated articles from the ResearchIndex dataset:
inClass(A) :- titleword(A,image).
inClass(A) :- type(A,'http://www.csd.abdn.ac.uk/~ggrimnes/exp/#misc'), textword(A,learning).
inClass(A) :- booktitleword(A,mining).
inClass(A) :- author(A,'N. Jennings').
inClass(A) :- titleword(A,indexing).
inClass(A) :- pages(A,143).

Conclusion
• In terms of accuracy, learning from the Semantic Web was not superior.
• Learning from RDF requires fewer resources.
• Datasets have no ontological support.
• Learning outcomes from the Semantic Web can be real, reusable knowledge.