
Pablo J. Garces, Jose A. Olivas, and Francisco P. Romero

Concept-Matching IR Systems Versus Word-Matching IR Systems: Considering Fuzzy Interrelations for Indexing Web Pages. Pablo J. Garces, Jose A. Olivas, and Francisco P. Romero. Journal of the American Society for Information Science and Technology, 2006. Presented by Yi-ling Lin.


Presentation Transcript


  1. Concept-Matching IR Systems Versus Word-Matching IR Systems: Considering Fuzzy Interrelations for Indexing Web Pages • Pablo J. Garces, Jose A. Olivas, and Francisco P. Romero • Journal of the American Society for Information Science and Technology, 2006 • Presented by Yi-ling Lin

  2. Agenda • Introduction • The FIS-CRM Model • Concept-Matching IR System • Example of Comparison of a Word-Based Search and a Concept-Based Search • Conclusion

  3. Introduction • Web retrieval systems aim to help users retrieve what they really need. • What about the quality of the results? • Do they really match the user query? • Words with several meanings, e.g., “stack” • General words, e.g., “fish” • Language problems • Other complex problems, e.g., “high frequency” and CPU

  4. Introduction • Problem: considering only lexicographical aspects when matching Web pages and queries, while ignoring semantic aspects • Soft computing techniques • Definition of models for representing documents • Latent semantic indexing • Fuzzy logic to define models for representing documents • The key to the system is the Fuzzy Interrelations and Synonymy-Based Concept Representation Model (FIS-CRM), which is used to extract and represent the concepts contained in both the Web pages and the user query.

  5. The FIS-CRM Model • A fuzzy model • An extension of the vector space model (VSM) • Weight vectors • Similarity measurements • Two types of interrelations among terms • The synonymy interrelation (fuzzy synonymy dictionary) • The generality interrelation (fuzzy ontologies) • The fundamental basis: to share the occurrences of a contained “concept” among the fuzzy synonyms that represent the same concept, and to give a weight to the words that represent a more general concept than the contained word does.

  6. The FIS-CRM Model: Fuzzy Synonymy Interrelation • The synonymy degree is measured with the Jaccard coefficient combined with a term sense disambiguation method. • The method assumes that the set of synonyms of every sense of each word is available (stored in a synonymy dictionary). • The synonymy interrelation is defined between two term-sense pairs: the pair (term A, sense S1) and the pair (term B, sense S2). • This means of measuring the synonymy degree leads to a symmetrical interrelation.
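
The slide's formula is not reproduced in the transcript, so the following is only a minimal sketch of a Jaccard-based synonymy degree between two term-sense pairs. The synonym sets and the names `jaccard`, `synonyms`, and `synonymy_degree` are hypothetical and chosen for illustration; the paper's exact definition may differ.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; returns 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical synonymy dictionary: (term, sense number) -> set of synonyms.
# The entries are made up for illustration only.
synonyms = {
    ("stack", 1): {"stack", "pile", "heap", "mound"},
    ("pile", 1): {"pile", "stack", "heap", "batch"},
}

def synonymy_degree(pair_a, pair_b):
    """Synonymy degree between two term-sense pairs (assumed: plain Jaccard)."""
    return jaccard(synonyms[pair_a], synonyms[pair_b])

sd_ab = synonymy_degree(("stack", 1), ("pile", 1))
sd_ba = synonymy_degree(("pile", 1), ("stack", 1))
print(sd_ab, sd_ba)  # both directions give the same value
```

Because intersection and union are both commutative, the degree is the same in either direction, which matches the symmetry noted on the slide.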

  7. The FIS-CRM Model: Fuzzy Synonymy Interrelation (cont’) • The problem of the transitive property • Two strong words that are synonymous but have different sets of synonyms => the synonymy degree between them will be less than 1. • The extended dictionary • Stores entries for each pair of recognized term-sense synonyms • The aim of this calculation is to disambiguate the correct sense in which both terms are synonymous. • When the term-sense “Ai” is a synonym of only one of the meanings of “B,” the MAX function provides the synonymy degree of the correct sense of “B” that makes these words synonyms.
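
The MAX-based disambiguation can be sketched as follows. This is an illustration under assumptions: the per-sense synonym sets are invented, plain Jaccard stands in for the synonymy degree, and `best_sense` is a hypothetical helper name.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; returns 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical dictionary: word -> list of synonym sets, one per sense.
senses = {
    "stack": [
        {"stack", "pile", "heap"},            # sense 0: a heap of objects
        {"stack", "lifo", "push-down list"},  # sense 1: the data structure
    ],
    "pile": [
        {"pile", "stack", "heap"},
    ],
}

def best_sense(term_sense_synonyms, word):
    """Return (degree, sense index) of the sense of `word` that is closest to
    the given term-sense. Taking the maximum over all senses of `word` is the
    disambiguation step; `word` is assumed to have at least one sense."""
    scored = [(jaccard(term_sense_synonyms, s), i) for i, s in enumerate(senses[word])]
    return max(scored, key=lambda pair: pair[0])

degree, sense_index = best_sense(senses["pile"][0], "stack")
print(degree, sense_index)
```

Here “pile” overlaps fully with the first sense of “stack” and only marginally with the second, so the maximum selects the first sense.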

  8. The FIS-CRM Model: Fuzzy Generality Interrelation • E.g., narrower_than / broader_than • The approach of Widyantoro and Yen (2001) is based on the idea that two words are related if they often co-occur in different documents. • It gives a fuzzy measurement of the broadness of term “A” with respect to another term “B” (and implicitly of the narrowness of “B” in relation to “A”). • These interrelations are managed as a single interrelation called generalization.
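
To make the co-occurrence idea concrete, here is a deliberately simplified sketch. It is not the Widyantoro and Yen formula: it assumes that the broadness of A with respect to B can be approximated by the fraction of documents containing B that also contain A. The toy collection and the name `broadness` are hypothetical.

```python
# Toy thematic collection: each document is reduced to its set of terms.
docs = [
    {"fish", "salmon", "river"},
    {"fish", "trout"},
    {"fish", "salmon"},
    {"fish", "market"},
]

def broadness(a, b, collection):
    """Assumed approximation: share of the documents containing `b` that also
    contain `a`. A value near 1 suggests that `a` is broader than `b`."""
    with_b = [d for d in collection if b in d]
    if not with_b:
        return 0.0
    return sum(1 for d in with_b if a in d) / len(with_b)

fish_over_salmon = broadness("fish", "salmon", docs)
salmon_over_fish = broadness("salmon", "fish", docs)
print(fish_over_salmon, salmon_over_fish)
```

The measure is asymmetric: every document mentioning “salmon” also mentions “fish”, but not the other way around, so “fish” comes out as the broader term in this toy data.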

  9. The FIS-CRM Model: Implementing the Extended Dictionary and the Fuzzy Ontologies • The generality degrees are stored in thematic ontologies. A thematic ontology is obtained for each of the thematic collections. Entry format: <term1, term2, GD(term1, term2)> • The extended dictionary stores the set of synonyms of each term-sense and the synonymy degree between each term-sense and its synonyms. Entry format: <term1, N, term2, SD(term1, term2)>, where N is the number that identifies a sense of term1. • If a term-sense has m synonyms, it will have m entries in the dictionary.
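
The two entry formats map naturally onto small record types. The sketch below is one possible in-memory layout, not the authors' implementation; the class names, the sample degrees, and the assumption that term1 is the broader term in a generality entry are all illustrative.

```python
from collections import defaultdict
from typing import NamedTuple

class SynonymyEntry(NamedTuple):
    """<term1, N, term2, SD(term1, term2)>"""
    term1: str
    sense: int      # N: identifies a sense of term1
    term2: str
    degree: float   # synonymy degree SD

class GeneralityEntry(NamedTuple):
    """<term1, term2, GD(term1, term2)>; which term is the broader one is an assumption here."""
    term1: str
    term2: str
    degree: float   # generality degree GD

# Hypothetical contents. A term-sense with m synonyms contributes m entries.
extended_dictionary = [
    SynonymyEntry("stack", 1, "pile", 0.8),
    SynonymyEntry("stack", 1, "heap", 0.7),
    SynonymyEntry("stack", 2, "push-down list", 0.9),
]
programming_ontology = [
    GeneralityEntry("data structure", "stack", 0.6),
]

# Index the flat entries by (term, sense) for fast lookup.
by_term_sense = defaultdict(dict)
for entry in extended_dictionary:
    by_term_sense[(entry.term1, entry.sense)][entry.term2] = entry.degree

print(by_term_sense[("stack", 1)])
```

Keeping the entries flat mirrors the tuple format on the slide, while the derived index answers the common question "which synonyms does this term-sense have, and how strongly?".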

  10. The FIS-CRM Model: The Concept of “Concept” • Fuzzy concepts can be managed in terms of semantic areas. • Every word has a semantic area. The semantic area of a weak word is the union of the semantic areas of each of its senses. • If a single word (t1) is related to another, more general word (t2) by means of the generality interrelation, the semantic area (SA1) of t1 will be included in the semantic area (SA2) of t2. • SA1 is included in SA2 with a membership degree equal to the generality degree between both terms.

  11. The FIS-CRM Model: The Concept of “Concept” (cont’) • A concept obtained from the occurrences of various synonyms is managed as a fuzzy set. • Membership degree: the degree to which each of the words that form the concept belongs to the concept itself • Notation: m words co-occur in a document or in a query; wi is the weight of the term ti in the document.

  12. The FIS-CRM Model: The Construction of FIS-CRM Vectors • When obtaining the vector weights, the goal is to share the number of occurrences of each concept among the words of the set of synonyms whose semantic area is more representative of the semantic area of that concept. • Construction steps: • Representing documents and queries by their base weight vectors (based on the occurrences of the contained words) • Weight readjusting (obtaining FIS-CRM vectors based on concept occurrences)
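
The two construction steps can be sketched roughly as below. The readjustment rule is an assumption made for illustration (each term's occurrences are split among the term and its synonyms in proportion to their synonymy degrees, keeping the total constant); the paper's actual readjustment formulas are not given in the transcript. The names `base_vector` and `readjust` and the sample degrees are hypothetical.

```python
from collections import Counter

def base_vector(tokens):
    """Step 1: base weight vector built from raw term occurrences."""
    return Counter(tokens)

def readjust(base, synonym_degrees):
    """Step 2 (assumed rule): share each term's occurrences among the term
    itself (degree 1.0) and its synonyms, proportionally to the degrees.
    `synonym_degrees` maps term -> {synonym: synonymy degree}."""
    adjusted = {}
    for term, count in base.items():
        shares = {term: 1.0, **synonym_degrees.get(term, {})}
        total = sum(shares.values())
        for word, degree in shares.items():
            adjusted[word] = adjusted.get(word, 0.0) + count * degree / total
    return adjusted

# Hypothetical document and synonymy degrees.
tokens = ["stack", "stack", "memory"]
degrees = {"stack": {"pile": 0.5, "heap": 0.5}}

base = base_vector(tokens)
fis_crm_like = readjust(base, degrees)
print(fis_crm_like)
```

After readjustment the vector also carries weight for “pile” and “heap”, although neither word occurs in the document. This is the property that lets a term-based index respond to concept-level queries.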

  13. Concept-Matching IR System: Offline Subsystem • The offline subsystem is in charge of the following functions: • Web crawler process: crawling all the accessible Web pages on the Internet and constructing the IR index • Representing and indexing Web pages using FIS-CRM (Fig. 3)

  14. Concept-Matching IR System: Online Subsystem • The online subsystem undertakes the following functions: • Entering and preprocessing the user query • Matching process • Representing the retrieved snippets by using the FIS-CRM model • Grouping the retrieved snippets according to the concepts contained

  15. Concept-Matching IR System: Online Subsystem (cont’) • Three components of the online subsystem: query input component, search engine, and clustering component.

  16. Concept-Matching IR System: Online Subsystem (cont’) • Query input component: to build the query vector. • If the query contains weak terms and the user does not select any ontology, the system will offer the user the set of synonyms of every sense of every weak term in the query. • If we consider that each term in the query has a membership degree to each set of synonyms to which it belongs, we can obtain a fuzzy measure of the compatibility of each of the vectors with the query: the compatibility degree of a vector v to a query Q of n terms.

  17. Concept-Matching IR System: Online Subsystem (cont’) • Search engine: to retrieve the links of the Web pages that match the user query. => The snippets of the Web pages whose matching value is higher than zero and that satisfy the logical expression defined in the query. • Clustering component: to organize the retrieved snippets hierarchically into groups. => A set of overlapping groups of snippets that are conceptually related. The resulting groups are also labeled with the words that represent the concepts inside the groups of documents.
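
As a rough illustration of the search engine step, the sketch below ranks documents by cosine similarity, a standard vector space matching function, and keeps only those with a matching value above zero. The use of cosine similarity, the vectors, and the `search` helper are assumptions for the example; the logical-expression check mentioned on the slide is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, index):
    """Return (doc_id, score) pairs with score > 0, best match first."""
    hits = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: h[1], reverse=True)

# The same hypothetical page indexed two ways: by raw word counts, and by a
# concept-style vector in which "stack" shares weight with its synonyms.
word_index = {"page1": {"stack": 2.0, "memory": 1.0}}
concept_index = {"page1": {"stack": 1.0, "pile": 0.5, "heap": 0.5, "memory": 1.0}}

query = {"pile": 1.0}
word_hits = search(query, word_index)
concept_hits = search(query, concept_index)
print(word_hits, concept_hits)
```

The matching function is identical in both cases. Only the vectors differ, so the query “pile” misses the page under word-based indexing and finds it under concept-style indexing.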

  18. Example of Comparison of a Word-Based Search and a Concept-Based Search

  19. Example of Comparison of a Word-Based Search and a Concept-Based Search (cont’) • The set of synonyms of the different senses of the word stack could be the following three (referred to in the three documents, respectively): • If the user disambiguates the sense of stack in the query by selecting a “programming” ontology.

  20. Conclusion • FIS-CRM uses standard term-based vectors to represent the concepts contained in queries and documents, and the proposed search system implements a standard term-based matching method. • The concept of “concept” is managed from a fuzzy point of view, which allows us to manage concept occurrences by means of term occurrences, without using a concept index or concept repository. • Any search engine could manage FIS-CRM vectors without needing to adapt its matching mechanism, while retaining its efficiency. • Future approaches will consider the possibility of defining labeled fuzzy sets related to some user-specified attributes, providing an approximation to this problem.
