Text mining tool for ontology engineering based on use of product taxonomy and web directory
Jan Nemrava and Vojtech Svatek
Department of Information and Knowledge Engineering, VSE Praha
Current state
• IE and ontology learning are frequently discussed issues in the field of the Semantic Web.
• Semi-automatic and automatic methods for ontology-based information extraction are needed.
• The Web is a great source of unstructured text.
DATESO 2005
The task is …
• Collect specific words (verbs, in our case) that usually occur together with a particular product category, as support for ontology designers.
• Target small, specialized ontologies, each concerning one product category and describing its frequent relations in common text.
• Make use of fulltext search engines and the DMOZ directory for retrieving information.
• Also use the UNSPSC (United Nations Standard Products and Services Code) product catalogue.
Web directories are rarely valid taxonomies
• It is easy to see that subheadings are often not specializations of headings.
• Some of them are not even concepts (names of entities) but properties that implicitly restrict the extension of a preceding concept in the hierarchy.
• Consider for example .../Industries/Construction and Maintenance/Materials and Supplies/Masonry_and_Stone/Natural Stone/International Sources/Mexico.
Proposal of method …
• Obtain so-called "indicator verbs" that characterize a particular term (a product category, in our case) in UNSPSC.
• Particular terms will then be generalized, which may allow mining verbs that are indicative of the upper levels of these terms.
• Join the UNSPSC taxonomy and its list of products with the content of company websites to gain valuable information about verbs that usually occur in one sentence with some product category from the taxonomy.
• Use hand-classified web directories containing relevant web sites.
Task sequence decomposition
• Manually select a UNSPSC product and the corresponding product category from the DMOZ Business branch:
  • Search in directory heading names
  • Search in web site descriptions
  • Use fulltext search
1) Input: URL of a DMOZ directory containing companies that manufacture the desired product. Output: list of company URLs.
2) Input: URL of a company website. Output: list of web pages containing the target term.
3) Input: web page containing the term. Output: file with extracted sentences containing the term.
4) Input: sentence with the term. Output: tagged sentences.
5) Input: verbs. Output: lemmatized, grouped and saved verbs.
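Steps 3 and 5 of the sequence above can be illustrated with a minimal sketch. It assumes the page text is already available as a plain string and that step 4 (tagging) has produced (word, tag) pairs whose verb tags start with "VB" (Penn Treebank style). The function names, the regex sentence splitter, and the tiny lemma table are hypothetical stand-ins, not the tools used in the original work.

```python
import re
from collections import Counter

def sentences_with_term(text, term):
    """Step 3 (sketch): keep only sentences that mention the target term."""
    # Naive split after '.', '!' or '?' followed by whitespace (an assumption;
    # a real sentence splitter would handle abbreviations and markup).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"lifts": "lift", "lifted": "lift", "moves": "move", "is": "be"}

def group_verbs(tagged_tokens):
    """Step 5 (sketch): lemmatize verbs and count occurrences per lemma."""
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag.startswith("VB"):  # assumed Penn Treebank-style verb tags
            lowered = word.lower()
            counts[LEMMAS.get(lowered, lowered)] += 1
    return counts
```

The resulting counter per product category is the kind of input that later frequency comparisons between categories would need.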
Experiment
• Handling equipment branch: UNSPSC product with corresponding DMOZ category
• The goal is to find verbs that are:
  • common to most products
  • characteristic of one branch of products
  • specific to a small group of products, or even to only one product
• 7 product categories; 303 verbs collected, occurring 7300 times on web sites.
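One simple way to separate the three groups of verbs named in the goal is to count in how many product categories each verb occurs. The sketch below is only an illustration: the slides do not say how the groups were delimited, so the 0.8 threshold, the label names, and the function name are assumptions.

```python
def classify_verbs(verbs_by_category, common_share=0.8):
    """Label each verb by the share of product categories it occurs in."""
    total = len(verbs_by_category)
    spread = {}
    for verbs in verbs_by_category.values():
        for verb in set(verbs):  # count each verb once per category
            spread[verb] = spread.get(verb, 0) + 1
    labels = {}
    for verb, n in spread.items():
        if n == 1:
            labels[verb] = "specific"   # seen in a single category only
        elif n / total >= common_share:
            labels[verb] = "common"     # seen in (almost) all categories
        else:
            labels[verb] = "branch"     # seen in some categories
    return labels
```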
Experiments
• Some verbs are obviously entirely neutral and do not characterize the products at all (be, have, provide, use).
• Some are connected with manufacturing (design, require, offer, make, contact, manufacture, develop, supply).
• Others describe activities of manipulating material (handle, lift, install, move).
Normalization
• F_ij = f_ij * (V_tj / V)
• Croft's normalization moderates the effect of high-frequency verbs: cf = K + (1 - K) * f_ij / m_ij
• TF/IDF: w_ij = f_ij * log2(N / n)
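The three formulas can be written down directly as small functions. This is a sketch only: the slide does not define its symbols, so the parameters are passed through exactly as named there. Reading f_ij as the raw frequency of verb i in category j, m_ij as the maximum frequency it is compared against, N as the number of categories and n as the number of categories containing the verb is an assumption, as is the default K = 0.5.

```python
import math

def scaled_frequency(f_ij, v_tj, v):
    """F_ij = f_ij * (V_tj / V), with symbols as on the slide."""
    return f_ij * (v_tj / v)

def croft(f_ij, m_ij, k=0.5):
    """cf = K + (1 - K) * f_ij / m_ij; K = 0.5 is an assumed default."""
    return k + (1 - k) * f_ij / m_ij

def tf_idf(f_ij, n_total, n_with):
    """w_ij = f_ij * log2(N / n)."""
    return f_ij * math.log2(n_total / n_with)
```

For 0 <= f_ij <= m_ij, `croft` returns a value between K and 1, which is how it dampens the weight of very frequent verbs.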
Remaining problems …
• Automate the assignment of UNSPSC categories to DMOZ categories.
• Some UNSPSC categories have no appropriate DMOZ category, leading to few or no web sites.
• Some categories are less informative.
Thank you!