Automatic term categorization by extracting knowledge from the Web
Leonardo Rigutini, Ernesto Di Iorio, Marco Ernandes and Marco Maggini
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena
{rigutini,diiorio,ernandes,maggini}@dii.unisi.it
Text mining
• To provide semantic information to entities extracted from text documents
• Use of thesauri, gazetteers and domain-specific lexicons
• Problems in the maintenance of these resources:
  • A large amount of human effort is required to track changes and to add new lexical entities
Term categorization
• A key task in the text mining research area
• A lexicon or a more articulated structure (an ontology) can be automatically populated by associating each unknown lexical entity with one or more semantic categories
• The goal of term categorization is to label lexical entities with a set of semantic themes (disciplines, domains)
Lexicons
• Domain-specific lexicons have been used in several tasks
• Word-sense disambiguation: the semantic categories of the terms surrounding the target word help to disambiguate it
• Query expansion: adding semantic information to the query makes it more specific and focused, increasing the precision of the answer
Lexicons
• Cross-lingual text categorization:
  • Ontologies are used to replace particular entities with their semantic category, thus reducing the temporal and geographic dependency of the content of documents. Entities like proper names or brand names depend on the country and the time in which the document was produced. Replacing them with their semantic category (politician or singer, computer or shoe manufacturer) improves the categorization of text documents.
Automatic term categorization
• Several attempts to face the problem of automatically expanding ontologies and thesauri have been proposed in the literature
• F. Cerbah¹ proposed two possible approaches to the problem:
  • Exogenous, where the sense of a term is inferred from the context in which it appears
  • Endogenous, where the sense of a term relies only on statistical information extracted from the sequence of characters constituting the entity

¹ F. Cerbah, "Exogenous and endogenous approaches to semantic categorization of unknown technical terms", in Proceedings of the 18th International Conference on Computational Linguistics (COLING)
Automatic term categorization
• Sebastiani et al. proposed an exogenous approach that faced the problem as the dual of text categorization²
  • the Reuters corpus provided the knowledge base to classify terms
  • they tried to replicate the WordNet Domains ontology, selecting only the terms appearing in the Reuters corpus
• Their approach showed low F1 values:
  • high precision but a very small recall (~ 0.4 %)

² H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani, R. Zanoli, "Expanding domain-specific lexicons by term categorization", in Proceedings of the 2003 ACM Symposium on Applied Computing (SAC03)
The proposed system
• We propose a system to automatically categorize entities that exploits the Web to build an enriched representation of each entity, the Entity Context Lexicon (ECL)
  • the ECL is the list of all the words appearing in the context of the entity
  • for each word, some statistics are stored: term frequency, snippet frequency, etc.
  • basically, an ECL is a "bag-of-words" representation of the words appearing in the context of the entity
• The idea is that "entities of the same semantic category should appear in similar contexts"
System description
• The system for term classification is composed of two modules:
  • the training module is used to train the classifier from a set of labeled examples
  • the entity classification module is applied to predict the appropriate category for a given input entity
• Both modules exploit the Web to build the ECL representation of the entities
• They are composed of sub-modules:
  • two ECL generators, which build the ECLs
  • the classifier, which is trained to classify the unknown ECLs
The ECL generator
• We chose to use the Web as the knowledge base to build the ECLs
• The snippets returned by a search engine when the entity is submitted as a query report the contexts in which the query terms appear
• The ECL of an entity e is simply the set of the context terms extracted from the snippets
The ECL generator
• Given an entity e:
  • it is submitted as a query to a search engine
  • the top-scored S snippets are collected
  • the terms in the snippets are used to build the ECL:
    • for each word, the term frequency and the snippet frequency are stored
• In order to avoid the inclusion of non-significant terms, a stop-word list or a feature selection technique can be used (a minimal sketch of this step follows)
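The ECL construction described above can be summarized with a short Python sketch. The `search_snippets` callable is a hypothetical stand-in for a search-engine client (the slides do not specify an API), and the stop-word list is only illustrative:

```python
from collections import Counter

# A very small stop-word list; in practice a full list or a feature
# selection step would be used, as noted on the slide above.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}

def build_ecl(entity, search_snippets, s=10):
    """Build the Entity Context Lexicon (ECL) of an entity.

    `search_snippets` is a placeholder for whatever client submits the
    entity as a query to a search engine and returns the top-scored
    snippets as plain strings (the experiments used Google with S = 10).
    """
    snippets = search_snippets(entity, max_results=s)
    term_freq = Counter()     # how many times each context word occurs overall
    snippet_freq = Counter()  # in how many snippets each context word appears
    for snippet in snippets:
        words = [w.lower().strip(".,;:!?()\"'") for w in snippet.split()]
        words = [w for w in words if w and w not in STOP_WORDS and w != entity.lower()]
        term_freq.update(words)
        snippet_freq.update(set(words))
    return {"tf": term_freq, "sf": snippet_freq}
```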
The classifier
• Each entity e is characterized by the corresponding ECLe:
  • thus a set of labeled ECLs can be used to train an automatic classifier
  • then the trained classifier can be used to label the unlabeled ECLs
• The most common classifier models can be used:
  • SVM, Naive Bayes, Complement Naive Bayes and profile-based classifiers (e.g. Rocchio)
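As a concrete illustration of this step, the sketch below trains a standard classifier on labeled ECLs with scikit-learn; the slides do not prescribe a toolkit, so the pipeline (a DictVectorizer feeding a Complement Naive Bayes model) is only one reasonable choice:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

def train_ecl_classifier(labeled_ecls):
    """Train a classifier from (ecl, category) pairs, where ecl["tf"] maps word -> count."""
    features = [dict(ecl["tf"]) for ecl, _ in labeled_ecls]
    labels = [category for _, category in labeled_ecls]
    model = make_pipeline(DictVectorizer(), ComplementNB())
    model.fit(features, labels)
    return model

# An unlabeled ECL is then labeled with:
#   model.predict([dict(unknown_ecl["tf"])])
```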
The CCL classifier
• Following the idea that similar entities appear in similar contexts, we exploited a new type of profile-based classifier:
  • a profile for each class is built by merging the training ECLs associated with that class
  • a weight is computed for each term in the profile using a weighting function W
  • the obtained lexicon is called the Class Context Lexicon (CCL)
  • a similarity function is used to measure the similarity of an unlabeled ECL with each CCL
• When an unlabeled ECL is passed to the classifier:
  • it is assigned to the class reporting the highest similarity score (see the sketch below)
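A minimal sketch of the CCL construction and of the classification step, assuming the ECL dictionaries produced above; the weighting and similarity functions are passed in as parameters and are sketched after the next slide:

```python
from collections import Counter, defaultdict

def build_ccls(labeled_ecls, weight_fn):
    """Merge the training ECLs of each class into a Class Context Lexicon (CCL)."""
    class_tf = defaultdict(Counter)  # merged term frequencies per class
    class_sf = defaultdict(Counter)  # merged snippet frequencies per class
    for ecl, category in labeled_ecls:
        class_tf[category].update(ecl["tf"])
        class_sf[category].update(ecl["sf"])
    # weight_fn turns the merged counts of one class into a term -> weight profile
    return {c: weight_fn(class_tf, class_sf, c) for c in class_tf}

def classify_ecl(ecl, ccls, similarity_fn):
    """Assign an unlabeled ECL to the class whose CCL gives the highest similarity score."""
    return max(ccls, key=lambda c: similarity_fn(ecl["tf"], ccls[c]))
```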
The CCL classifier
• Weighting functions:
  • tf
  • tf-idf
  • snippet-frequency inverse class frequency (sficf), which gives a word a high score if it is frequent in a class and infrequent in the remaining classes
• Similarity functions:
  • Euclidean similarity
  • Cosine similarity
  • Gravity similarity
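The slide lists the weighting and similarity functions by name but their exact formulas are not reported here (the gravity similarity formula in particular is missing). The sketch below therefore gives only one plausible reading of sficf, plus a standard cosine similarity, both stated as assumptions:

```python
import math

def sficf(class_tf, class_sf, target_class):
    """A plausible sficf weighting: the snippet frequency of a word in the target
    class, discounted by the number of classes the word appears in.
    (Assumption: the exact formula is not given on the slide.)"""
    num_classes = len(class_sf)
    weights = {}
    for word, sf in class_sf[target_class].items():
        classes_with_word = sum(1 for c in class_sf if word in class_sf[c])
        weights[word] = sf * math.log(1.0 + num_classes / classes_with_word)
    return weights

def cosine_similarity(ecl_tf, ccl_weights):
    """Cosine similarity between an ECL (raw term frequencies) and a CCL profile."""
    dot = sum(freq * ccl_weights.get(word, 0.0) for word, freq in ecl_tf.items())
    norm_ecl = math.sqrt(sum(f * f for f in ecl_tf.values()))
    norm_ccl = math.sqrt(sum(w * w for w in ccl_weights.values()))
    return dot / (norm_ecl * norm_ccl) if norm_ecl and norm_ccl else 0.0
```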
Experimental results
• We selected 8 categories:
  • soccer, music, location, computer, politics, food, philosophy, medicine
• For each of them we collected predefined gazetteers from the Web and sampled 200 entities per class
• We performed tests varying the size of the learning set L_M, where M indicates the number of learning entities per class
• We used Google as the search engine and set the number of snippets used to build each ECL to 10 (S = 10)
Experimental results
• We tested all the classifiers listed previously:
  • SVM, NB, CNB and CCL, using the F1 values to measure the performance of the system
• First, we tested the CCL classifier by combining the weighting functions and the similarity functions listed previously
• We selected the CCL configuration reporting the best performance and then compared it with the SVM, NB and CNB classifiers
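For completeness, a small sketch of the evaluation step; the slides report F1 values but do not state the averaging scheme, so macro-averaging over the 8 categories is assumed here:

```python
from sklearn.metrics import f1_score

def evaluate_f1(model, test_ecls, test_labels):
    """Macro-averaged F1 of a trained vectorizer + classifier pipeline on held-out ECLs."""
    predictions = model.predict([dict(ecl["tf"]) for ecl in test_ecls])
    return f1_score(test_labels, predictions, average="macro")
```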
Performance of the CCL classifier
• We selected the CCL-sficf-gravity configuration as the CCL classifier reporting the best performance
Overall performance
• The CNB classifier showed the best performance, although the CCL model results are comparable
Conclusions
• We propose a system for Web-based term categorization oriented to automatic thesaurus construction
• The idea is that "terms from the same semantic category should appear in very similar contexts", i.e. contexts that contain approximately the same words
  • the system builds an Entity Context Lexicon (ECL) for each entity using the Web as the knowledge base
  • this enriched representation is used to train an automatic classifier
• We tested the most common classifier models (SVM, Naive Bayes and Complement Naive Bayes)
• Moreover, we propose a profile-based classifier called CCL that builds the class profiles by merging the learning ECLs
Conclusions
• The experimental results show that the CNB classifier reports the best performance
• However, the CCL classifier results are very promising and comparable with the CNB ones
• Additional tests are planned to consider a multi-label classification task and to verify the robustness of the system in "out-of-topic" cases