Learning ontologies from the web for microtext processing. Boris A. Galitsky, Gábor Dobrocsi, and Josep Lluis de la Rosa. University of Girona, Spain.
Why ontologies are needed for microtext
What is a scalable way to automatically build taxonomies of entities to improve search relevance? Taxonomy construction starts from seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (syntactic generalization) is applied: it forms commonalities between various search results for existing entities on the web. Taxonomy and syntactic generalization are applied to relevance improvement in search and to text similarity assessment in a commercial setting; evaluation results show a substantial contribution from both sources.
Entities need to make sense together. Both human and automated agents have difficulty processing texts when the required ontologies are missing:
• An automated customer service rep. Q: Can you reactivate my card, which I am trying to use in Nepal? A: We value you as a customer… We will cancel your card… A new card will be mailed to your California address…
• A child with a severe form of autism. Q: Can you give your candy to my daughter, who is hungry now and is about to cry? A: No, my mom told me not to feed babies. Its wrapper is nice and blue. I need to wash my hands before I eat it…
Knowing how entities are connected would improve search results
The condition “active paddling” is ignored or misinterpreted, although Google knows that it is a valid combination (‘paddling’ can be ‘active’).
• In the query “white water rafting in Oregon with active paddling with kids”, ‘active’ is meaningless without ‘paddling’.
• So if the system cannot find answers with ‘active paddling’, it should try ‘paddling’ alone, but never ‘active’ without ‘paddling’ (a minimal sketch of this relaxation rule follows).
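The relaxation rule can be made concrete. Below is a minimal Python sketch, not the authors' implementation: the `dependent_pairs` list, the toy corpus, and all names are invented for illustration; a real system would derive modifier–head dependencies from the taxonomy.

```python
# A sketch of query relaxation that respects modifier dependencies: if no
# results match the full query, drop a modifier together with its dependence
# on its head -- never search for the modifier without the head.

def relax_query(query_terms, dependent_pairs, search):
    """Try the full query first; on failure, drop modifiers but keep their heads."""
    results = search(query_terms)
    if results:
        return results
    # Drop each modifier (e.g. 'active') while keeping its head ('paddling').
    for modifier, head in dependent_pairs:
        if modifier in query_terms and head in query_terms:
            reduced = [t for t in query_terms if t != modifier]
            results = search(reduced)
            if results:
                return results
    return []

# Usage with a toy substring-match search over a one-document corpus:
corpus = ["white water rafting in Oregon with paddling for kids"]
def toy_search(terms):
    return [d for d in corpus if all(t in d for t in terms)]

hits = relax_query(
    ["rafting", "active", "paddling", "kids"],
    dependent_pairs=[("active", "paddling")],
    search=toy_search,
)
print(hits)  # the document is found once 'active' is dropped (but 'paddling' kept)
```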
Difficulty in building taxonomies
Building, tuning and managing taxonomies and ontologies is rather costly, since many manual operations are required; an automated or semi-automated approach is therefore required for practical applications. A number of studies have proposed automated building of taxonomies based on linguistic resources and/or statistical machine learning (Kerschberg et al 2003, Liu & Birnbaum 2008, Kozareva et al 2009). However, most of these approaches have not found practical applications due to:
• insufficient accuracy of the resultant search,
• limited expressiveness of representations of real users' queries,
• the high cost of manual construction of linguistic resources and their limited adjustability.
We propose an automated taxonomy-building mechanism:
• It is based on an initial set of key entities (a seed) for a given vertical knowledge domain.
• This seed is then automatically extended by mining web documents that include the meaning of the current taxonomy node.
• The node is further extended by entities that result from inductive learning of commonalities between these documents.
• These commonalities are extracted using the operation of syntactic generalization, which finds the common parts of the syntactic parse trees of a set of documents obtained for the current taxonomy node.
Using default logic to handle ambiguity in microtext
• Build an extension of the default theory for each meaning of an ambiguous expression.
• Provide multiple answers, one per extension, as the result of default reasoning (see the sketch below).
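As a very rough illustration of this idea (the default-logic machinery itself is abstracted away, and all names below are hypothetical), one extension can be built per candidate sense of an ambiguous term, each extension supporting its own answer:

```python
# Hypothetical sketch: each sense of an ambiguous term induces one extension
# of the default theory; default reasoning then returns one answer per extension.

def answers_per_extension(ambiguous_term, senses, answer_for):
    """Build one extension per sense and return the answer each extension supports."""
    return {sense: answer_for(ambiguous_term, sense) for sense in senses}

print(answers_per_extension(
    "card",
    ["credit card", "greeting card"],
    lambda term, sense: f"answer assuming '{term}' means '{sense}'",
))
```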
A simplified step 1 of ontology learning
Currently available path: tax – deduct
1) Get search results for the currently available expressions.
2) Select attributes based on their linguistic occurrence.
3) Find common attributes (commonalities between search results), such as ‘overlook’.
4) Extend the taxonomy path by adding the newly acquired attribute: tax – deduct – overlook (a toy sketch of this step follows).
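A toy sketch of this step, assuming search-result snippets have already been fetched for the seed 'tax deduct'; the snippets and stop-word list are invented, and plain word overlap stands in for the linguistic-occurrence selection described above:

```python
# Find candidate attributes as words that co-occur with the seed across
# several search results; words shared by more than one result are the
# commonalities (here, 'overlook').

from collections import Counter

snippets = [
    "Do not overlook tax deductions for your home office.",
    "Taxpayers often overlook deductions they could legally claim.",
    "Commonly overlooked tax deductions include charitable mileage.",
]

seed = {"tax", "deduct"}
stop = {"do", "not", "for", "your", "they", "could", "the", "include", "often"}

def candidate_attributes(texts, seed, stop):
    """Count non-seed, non-stop words across results; keep those shared by 2+ results."""
    counts = Counter()
    for text in texts:
        words = {w.strip(".,").lower() for w in text.split()}
        counts.update(w for w in words
                      if w not in stop and not any(s in w for s in seed))
    return [w for w, n in counts.items() if n > 1]

print(candidate_attributes(snippets, seed, stop))  # -> ['overlook']
```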
Step 2 of ontology learning (more details)
Currently available taxonomy path: tax – deduct – overlook
1) Get search results.
2) Select attributes based on their linguistic occurrence (modifiers of entities from the current taxonomy path).
3) Find common expressions between search results via syntactic generalization, such as ‘PRP-mortgage’.
4) Extend the taxonomy path by adding newly acquired attributes: tax – deduct – overlook – mortgage, tax – deduct – overlook – no_itemize, …
Step 3 of ontology learning
Currently available taxonomy path: tax – deduct – overlook – mortgage
1) Get search results.
2) Perform syntactic generalization, finding common maximal parse sub-trees, excluding the current taxonomy path.
3) If nothing is in common any more, this node is a taxonomy leaf (stop growing the current path).
Possible learning results: a taxonomy fragment. A sketch of the whole loop follows.
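Putting steps 1–3 together, the procedure is a recursive loop. The sketch below is an assumption-laden outline, not the authors' code: `search_web` and `generalize` are placeholder hooks for web search and syntactic generalization.

```python
# Grow a taxonomy path until generalization finds nothing in common (a leaf).

def grow_taxonomy(path, search_web, generalize, max_depth=4):
    """Recursively extend a taxonomy path; return leaves as a dict of paths."""
    if len(path) >= max_depth:
        return {tuple(path): []}
    results = search_web(" ".join(path))           # step 1: get search results
    new_attrs = generalize(results, exclude=path)  # steps 2-3: common sub-trees
    if not new_attrs:                              # nothing in common: a leaf
        return {tuple(path): []}
    tree = {}
    for attr in new_attrs:                         # extend the path per attribute
        tree.update(grow_taxonomy(path + [attr], search_web, generalize, max_depth))
    return tree

# Toy demo with a canned generalizer returning a fixed expansion per path:
canned = {("tax", "deduct"): ["overlook"],
          ("tax", "deduct", "overlook"): ["mortgage", "no_itemize"]}
tree = grow_taxonomy(["tax", "deduct"],
                     search_web=lambda q: q,  # results unused by the canned stub
                     generalize=lambda res, exclude: canned.get(tuple(exclude), []))
print(sorted(tree))  # leaf paths: ...overlook-mortgage and ...overlook-no_itemize
```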
If a keyword occurs both in a query and in the closest taxonomy path, it HAS TO BE in the answer
Query: “can I deduct tax on mortgage escrow account”
Closest taxonomy path: tax – deduct – overlook – mortgage – escrow_account
Then these keywords/multiwords have to be in the answer: {deduct, tax, mortgage escrow_account}. Answers missing any of them are wrong (see the sketch below).
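A minimal sketch of this acceptance rule, assuming the closest taxonomy path has already been found and that multiwords are underscore-joined into single tokens; tokenization is deliberately naive:

```python
# Accept an answer only if it contains every keyword shared by the query
# and its closest taxonomy path.

def acceptable(answer, query_terms, taxonomy_path):
    """All keywords common to the query and the path must occur in the answer."""
    required = set(query_terms) & set(taxonomy_path)
    answer_words = set(answer.lower().split())
    return required <= answer_words

query = ["deduct", "tax", "mortgage", "escrow_account"]
path = ["tax", "deduct", "overlook", "mortgage", "escrow_account"]
print(acceptable("you can deduct tax on a mortgage escrow_account", query, path))  # True
print(acceptable("sell your mortgage and move abroad", query, path))               # False
```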
Improving the precision of text similarity: articles, blogs, tweets, images and videos
We verify whether an image belongs in a given context based on its caption, using syntactic generalization to assess relevance.
Generalizing two sentences and its application
Search relevance is improved by checking the syntactic similarity between the query and sentences in search hits; this similarity is measured via generalization. Syntactic similarity matters when a search query contains keywords that form a phrase, a domain-specific expression, or an idiom, such as “shot to shot time” or “high number of shots in a short amount of time”. Search results can then be re-sorted by the obtained similarity score (a re-sorting sketch follows). Based on generalization, we can also distinguish meaningful (informative) from meaningless (uninformative) opinions, having collected the respective datasets: a meaningless sentence should not be shown as a search result even if it matches the query, while a meaningful one should.
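The re-sorting itself is straightforward once a similarity score is available. In the sketch below, `word_overlap` is a toy stand-in for the parse-tree generalization score:

```python
# Re-sort search hits by similarity to the query, highest score first.

def rerank(query, hits, score_similarity):
    """Order search hits by syntactic similarity to the query."""
    return sorted(hits, key=lambda hit: score_similarity(query, hit), reverse=True)

# Toy stand-in: shared-word count instead of parse-tree generalization.
def word_overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

hits = ["shot to shot time of this camera", "time of my life", "shot in the dark"]
print(rerank("shot to shot time", hits, word_overlap))  # camera hit ranks first
```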
Generalizing sentences & phrases
Deriving a meaning by generalization:
• Noun phrase: [[JJ-* NN-zoom NN-*], [JJ-digital NN-camera]] — about ZOOM and DIGITAL CAMERA.
• Verb phrase: [[VBP-* ADJP-* NN-zoom NN-camera], [VB-* NN-zoom IN-* NN-camera]] — to do something with ZOOM … CAMERA.
• Prepositional phrase: [[IN-* NN-camera], [IN-for NN-*]] — with/for/to/in CAMERA, FOR something.
Example:
VP [VB-use DT-the JJ-digital NN-zoom IN-of DT-this NN-camera IN-for VBG-filming NNS-insects] + VP [VB-get JJ-short NN-focus NN-zoom NN-lens IN-for JJ-digital NN-camera] = [VB-* JJ-* NN-zoom NN-* IN-for NNS-*]
score = score(NN) + score(PREP) + 3*score(<POS*>)
Meaning: “do something with some kind of ZOOM something FOR something else”.
Generalization proceeds from words to phrases to sentences to paragraphs:
• Obtain parse trees; group sub-trees by phrase type.
• Extend the list of phrases by paraphrasing (semantically equivalent expressions).
• For every phrase type, for each pair of tree lists, perform pair-wise generalization; for a pair of trees, perform alignment; for a pair of words (nodes), generalize them.
• Remove more general trees (when less general ones exist) from the resultant list.
Syntactic generalization helps with microtext when ontology use is limited. A sketch of the node-level operation follows.
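A sketch of node-level generalization, assuming nodes are (POS, lemma) pairs; the weights mirror the spirit of the score formula above but their values are illustrative, and the simple position-wise pairing stands in for real alignment:

```python
# Generalize two POS-tagged phrases node by node; exact nouns and prepositions
# score higher than generalized-away wildcards, echoing score(NN) + score(PREP)
# + score(<POS*>). Weight values are illustrative assumptions.

WEIGHTS = {"NN": 1.0, "IN": 0.8, "POS*": 0.2}  # noun, preposition, wildcard

def generalize_node(a, b):
    """Generalize two (pos, lemma) nodes; '*' marks a generalized-away lemma."""
    pos_a, lemma_a = a
    pos_b, lemma_b = b
    if pos_a != pos_b:
        return None                      # nothing in common
    lemma = lemma_a if lemma_a == lemma_b else "*"
    return (pos_a, lemma)

def score(nodes):
    """Sum weights: exact nouns/prepositions count more than wildcards."""
    total = 0.0
    for pos, lemma in nodes:
        if lemma == "*":
            total += WEIGHTS["POS*"]
        elif pos == "NN":
            total += WEIGHTS["NN"]
        elif pos == "IN":
            total += WEIGHTS["IN"]
    return total

vp1 = [("VB", "use"), ("NN", "zoom"), ("IN", "for"), ("NNS", "insect")]
vp2 = [("VB", "get"), ("NN", "zoom"), ("IN", "for"), ("NNS", "camera")]
common = [g for g in map(generalize_node, vp1, vp2) if g]
print(common, score(common))  # [('VB','*'), ('NN','zoom'), ('IN','for'), ('NNS','*')] 2.2
```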
Learning similarity between syntactic trees: the generalization algorithm
1) Obtain a parse tree for each sentence. For each word (tree node) we have its lemma, part of speech and word form, as well as arcs to other nodes.
2) Split the sentences into sub-trees that are phrases of each type: verb, noun, prepositional and others; these sub-trees overlap. Sub-trees are coded so that information about their occurrence in the full tree is retained.
3) Group all sub-trees by phrase type.
4) Extend the list of phrases by adding equivalence transformations.
5) Generalize each pair of sub-trees of the two sentences for each phrase type: for each pair, yield the alignment, then generalize each node of that alignment; for the obtained set of trees (the generalization results), calculate the score.
6) For each pair of phrase sub-trees, select the set of generalizations with the highest score (the least general).
7) Form, for each phrase type, the set of generalizations whose elements are the sets of generalizations for that type.
8) Filter the list of generalization results: for each phrase type, exclude the more general elements from the lists of generalizations for a given pair of phrases (a sketch of this filtering step follows).
* Also applied to generalization of semantic role expressions.
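The final filtering step (8) can be illustrated as follows, under the simplifying assumption that a generalization result is a set of (POS, lemma) nodes and "more general" is approximated by proper subset inclusion (the real procedure compares parse sub-trees):

```python
# Keep only the least general generalization results: drop any result that is
# a proper subset (hence more general) of another result.

def filter_least_general(results):
    """Drop any result strictly subsumed by another result."""
    kept = []
    for r in results:
        if not any(r < other for other in results):  # r strictly subsumed?
            kept.append(r)
    return kept

results = [
    frozenset({("NN", "zoom")}),                 # more general -> dropped
    frozenset({("NN", "zoom"), ("IN", "for")}),  # less general -> kept
    frozenset({("NN", "camera")}),               # unrelated    -> kept
]
print(filter_least_general(results))
```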
Evaluation
• Classification of short texts.
• The hybrid approach improves text similarity/relevance assessment.
• Ordering of search results based on generalization, taxonomy, and a conventional search engine.
Related work
• Mapping to first-order logic representations with a general prover, without using rich acquired knowledge sources.
• Semantic entailment (de Salvo Braz et al 2005).
• Semantic role labeling: for each verb in a sentence, the goal is to identify all constituents that fill a semantic role and to determine their roles, such as Agent, Patient or Instrument (Punyakanok et al 2005).
• A generic semantic inference framework that operates directly on syntactic trees; new trees are inferred by applying entailment rules, which provide a unified representation for varying types of inferences (Bar-Haim et al 2005).
• A generic paraphrase-based approach for a specific case such as relation extraction, to obtain a generic configuration for relations between objects from text (Romano et al 2006).
Conclusions
• Ontologies are a more sensitive way to match keywords in microtext than bag-of-words and TF*IDF.
• Since microtext includes abbreviations and acronyms, and we do not ‘know’ all mappings, semantic analysis should be tolerant to the omission of some entities and still understand “what this text fragment is about”.
• Since we are unable to filter out noise “statistically”, as most NLP environments do, we have to rely on ontologies.
• Syntactic generalization takes the bag-of-words and pattern-matching classes of approaches to the next level, allowing unknown words to be treated systematically as long as their part-of-speech information is available from the context.