Knowledge-Based Information Retrieval: Concepts and Techniques

CS533 Information Retrieval Dr. Michal Cutler Lecture #10 February 24, 1999

This class • Review of models covered so far • Knowledge in Artificial Intelligence • Knowledge based information retrieval

Models covered so far • Pure Boolean • Fuzzy Boolean models • Vector space model • Latent semantic indexing • Probabilistic models

Knowledge and Artificial Intelligence • Early AI systems were based on general problem solving and search techniques • They had limited usefulness because they lacked knowledge about the domain of the problems they dealt with • Knowledge based systems provided better solutions

Knowledge Representation • Knowledge representation and reasoning with knowledge became a major research area in AI • Many techniques such as semantic nets, frames, and production rules were used for representing knowledge

Knowledge Representation • It became evident that successful natural language understanding requires the representation of a large amount of “common sense” knowledge • Building knowledge based systems is time consuming and requires maintenance when the knowledge is dynamic

Knowledge Based IR • Knowledge based information retrieval attempts to identify the occurrence of high level concepts in a document • We will discuss a specific knowledge based information retrieval system • Lecture is based on on-line information on the definition and usage of topic sets

RUBRIC/TOPIC • Technique similar to frame based methods in natural language understanding • Concepts and their relationships represent the knowledge needed for retrieval • Evidential reasoning provides the link between a document and its concepts

What Are Topics? • A topic groups information related to a concept or a subject area. • Topics enable the encapsulation of knowledge • When you add topics to a Verity search application, users can select those topics when they define their queries.

Topic Organization • Topics organize search criteria in a format similar to an outline. • Operators and modifiers join related groups of search criteria. • Topics may be independent units, or units with relationships to other topics in a hierarchical structure.

Weight Assignments • Weights assigned to search criteria ensure that the most relevant documents receive the highest scores. • A weight is between 0.01 and 1.00. • A weight of 1.00 indicates that the search criterion is of great interest. • The document score is determined by accumulating the evidence for the search criteria and their weights.

Knowledge Bases • The knowledge base for a Topic application can consist of topic sets. • Knowledge bases offer the ability to find information without having to compose sophisticated queries. • Typically, the administrator of Topic sets up the knowledge base and maintains its contents.

Topic Sets • The subject of a topic is identified by its name. • In the example below, the topic is performing-arts. • This topic is composed of its name, performing-arts, and its evidence topics, ballet, drama, dance, opera, symphony, and mime.

The performing arts topic
performing-arts ACCRUE
    WORD ballet
    WORD drama
    WORD dance
    WORD opera
    WORD symphony
    NOT-WORD mime
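A topic like this one is a small tree: a named parent with an operator, and evidence topics as leaves. The following is a minimal sketch of one way to hold that tree in memory. The `Topic` class, its field names, and the `outline` helper are hypothetical names chosen for illustration; they are not Verity's file format or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    """One node of a topic tree (hypothetical structure, not Verity's format)."""
    name: str
    operator: str = "WORD"        # "WORD" for evidence topics, else e.g. "ACCRUE"
    negated: bool = False         # True when a NOT-style modifier is applied
    children: List["Topic"] = field(default_factory=list)


def outline(topic: Topic, depth: int = 0) -> List[str]:
    """Render the tree as indented outline lines, parent before children."""
    indent = "    " * depth
    if topic.operator == "WORD":
        prefix = "NOT-" if topic.negated else ""
        return [f"{indent}{prefix}WORD {topic.name}"]
    lines = [f"{indent}{topic.name} {topic.operator}"]
    for child in topic.children:
        lines.extend(outline(child, depth + 1))
    return lines


# The performing-arts topic from the slide, with mime carrying the NOT modifier.
performing_arts = Topic(
    name="performing-arts",
    operator="ACCRUE",
    children=[
        Topic("ballet"), Topic("drama"), Topic("dance"),
        Topic("opera"), Topic("symphony"), Topic("mime", negated=True),
    ],
)

print("\n".join(outline(performing_arts)))
```

Keeping the operator on the parent and the modifier on the child mirrors the slide, where ACCRUE belongs to performing-arts and NOT belongs to mime.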

Operators and modifiers • Operators represent logic to be applied to evidence topics. • This logic qualifies the kinds of documents needed. • Modifiers apply further logic to evidence topics.

Adding a topic • A modifier can specify that documents containing an evidence topic not be included in the result. • More general and less general topics may be added. • The topic film is added to the performing-arts structure to form the top-level topic, art.

Topic hierarchy • Sophisticated topics are composed of top-level topics, subtopics, and evidence topics. • A topic set consists of several top-level topics. • Note that subtopics and evidence topics can be used by multiple top-level topics.

The performing arts topic
art ACCRUE
    performing-arts ACCRUE
        WORD ballet
        WORD drama
        WORD dance
        WORD opera
        WORD symphony
        NOT-WORD mime
    film ACCRUE
        motion-pictures OR
            WORD film
            WORD movie
        art-films OR
            ...

The liberal-arts topic
liberal-arts ACCRUE
    literature ACCRUE
    philosophy ACCRUE
    language ACCRUE
    history ACCRUE
    art ACCRUE
        performing-arts ACCRUE
        film ACCRUE
        visual-arts ACCRUE
        video OR

Document Scoring • If an evidence topic is present (absent), its score is 1.00 (0.00). • If an evidence topic is weighted, • the scores of the evidence topics are multiplied by the weights, • then the resulting products are combined as specified by the operator of the parent topic.

Document Scoring • If this parent topic is, in turn, the child of another topic which is being searched, • its score is multiplied by its assigned weight, and • the resulting product is combined with the products of its siblings in a manner specified by the operator assigned to the parent topic.

Document Scoring • This process continues until the top-level topic is reached. • For ACCRUE, the result is the highest product of each child's weight and score, with a small amount added to the score for each child that is present in the document.

Document Scoring • For AND (OR), the lowest (highest) product of each child's weight and score is taken. • If a child uses a proximity operator (PHRASE, SENTENCE, or PARAGRAPH) or a relational operator, the child receives a score of 1.00 if the topic is present and 0.00 if it is not.
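The scoring rules above are recursive, so they can be sketched as a short function. This is an illustration under stated assumptions, not Verity's implementation: topics are written as plain tuples, the weights in the sample tree are made up, and the ACCRUE increment (`accrue_bonus`, applied for each present child beyond the first) is a guess, since the slides say only that "a little" is added. The NOT modifier and the proximity and relational operators are left out.

```python
def score(topic, doc_words, accrue_bonus=0.05):
    """Score a topic tree against the set of words found in a document.

    A leaf is ("WORD", term). An inner node is (operator, [(weight, child), ...]).
    accrue_bonus is a hypothetical value; the slides do not state the amount.
    """
    op, body = topic
    if op == "WORD":
        return 1.0 if body in doc_words else 0.0

    # Multiply each child's score by its weight, then combine per the operator.
    products = [weight * score(child, doc_words, accrue_bonus)
                for weight, child in body]
    if op == "AND":
        return min(products)
    if op == "OR":
        return max(products)
    if op == "ACCRUE":
        present = sum(1 for p in products if p > 0)
        # Assumption: only children beyond the first add to the score.
        extra = accrue_bonus * max(present - 1, 0)
        return min(1.0, max(products) + extra)
    raise ValueError(f"unsupported operator: {op}")


# Sample tree with made-up weights, loosely following the art topic.
performing_arts = ("ACCRUE", [
    (0.8, ("WORD", "ballet")),
    (0.6, ("WORD", "opera")),
    (0.4, ("WORD", "symphony")),
])
film = ("OR", [(1.0, ("WORD", "film")), (1.0, ("WORD", "movie"))])
art = ("ACCRUE", [(0.9, performing_arts), (0.5, film)])

doc_words = {"opera", "symphony", "review"}
print(round(score(performing_arts, doc_words), 3))
print(round(score(art, doc_words), 3))
```

For this document, performing-arts accrues evidence from opera and symphony, while film contributes nothing, so the art score comes entirely from the weighted performing-arts child.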

The Boeing topic
BOEINGCO OR
    0.5 boeing-comps OR
    0.5 boeing-label AND
        WORD Boeing
        WORD Company
    0.5 boeing-people ACCRUE

The Boeing topic
0.5 boeing-comps OR
    0.5 boeing-comp-service PHRASE
        WORD Boeing
        WORD computer
        WORD services
    0.5 boeing-aerospace SENTENCE
        WORD Boeing
        WORD aerospace
        WORD electronics
    0.5 boeing-defense PARAGRAPH
        WORD Boeing
        WORD defense

The Boeing topic
0.5 boeing-people ACCRUE
    0.8 paul-binder PHRASE
        WORD Paul
        WORD Binder
    0.5 arthur-hitsman PHRASE
        WORD Arthur
        WORD Hitsman
    0.3 ted-johnson PHRASE
        WORD Ted
        WORD Johnson

Example document • Evidence topics boeing, computer, and services appear in the same phrase • Evidence topics boeing and defense appear in the same paragraph • Evidence topics boeing and company appear in the document • Evidence topics ted and johnson appear in the same phrase
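Proximity conditions like the ones listed for the example document yield a score of 1.00 or 0.00 and nothing in between. The sketch below shows that all-or-nothing check for SENTENCE and PARAGRAPH only. The function name `in_same_unit`, the naive splitting rules (sentences end at `.`, `!`, or `?`; paragraphs are separated by blank lines), and the sample text are all assumptions for illustration. PHRASE is not covered because the slides do not define its matching rule.

```python
import re


def in_same_unit(text, words, unit):
    """Return 1.0 if all words occur together in one sentence or paragraph, else 0.0."""
    if unit == "PARAGRAPH":
        units = re.split(r"\n\s*\n", text)           # blank line ends a paragraph
    elif unit == "SENTENCE":
        units = re.split(r"(?<=[.!?])\s+", text)     # naive sentence boundary
    else:
        raise ValueError(f"unsupported unit: {unit}")

    wanted = {w.lower() for w in words}
    for chunk in units:
        tokens = set(re.findall(r"[a-z0-9]+", chunk.lower()))
        if wanted <= tokens:                         # every word is in this chunk
            return 1.0
    return 0.0


# Made-up text, not the document used in the lecture.
text = (
    "Boeing expanded its computer services unit. Analysts were surprised.\n\n"
    "The defense budget grew. Boeing expects new contracts."
)

print(in_same_unit(text, ["Boeing", "computer", "services"], "SENTENCE"))
print(in_same_unit(text, ["Boeing", "defense"], "PARAGRAPH"))
print(in_same_unit(text, ["Boeing", "defense"], "SENTENCE"))
```

In the sample text, boeing and defense share a paragraph but not a sentence, so the same pair of words passes the PARAGRAPH check and fails the SENTENCE check.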

Final scores

The scores • boeing-comps, which uses the OR operator, has a score of 0.50. • boeing-people, which uses the ACCRUE operator, has a score of 0.30. • BOEINGCO, which uses OR, compares the products of each child's weight and score, and takes the highest product • The document is scored as 0.50.
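The reported scores can be reproduced by hand from the Boeing topic tree and the example document. The arithmetic below is a sketch that rests on three assumptions not stated on the slides: unweighted WORD children carry a weight of 1.00, any evidence the example document does not mention is absent, and the small ACCRUE increment is ignored because the slide reports boeing-people as exactly 0.30.

```python
# Child scores for the example document: 1.0 if present, 0.0 if absent.
comp_service = 1.0   # boeing, computer, services in the same phrase
aerospace = 0.0      # assumption: not mentioned, so treated as absent
defense = 1.0        # boeing and defense in the same paragraph

# boeing-comps uses OR: highest product of weight and score.
boeing_comps = max(0.5 * comp_service, 0.5 * aerospace, 0.5 * defense)

# boeing-label uses AND over WORD Boeing and WORD Company: lowest product.
# Assumption: unweighted children have weight 1.0.
boeing_label = min(1.0 * 1.0, 1.0 * 1.0)

# boeing-people uses ACCRUE: highest product; only ted-johnson is present.
paul_binder, arthur_hitsman, ted_johnson = 0.0, 0.0, 1.0
boeing_people = max(0.8 * paul_binder, 0.5 * arthur_hitsman, 0.3 * ted_johnson)

# BOEINGCO uses OR over its three children, each weighted 0.5.
boeingco = max(0.5 * boeing_comps, 0.5 * boeing_label, 0.5 * boeing_people)

print(boeing_comps, boeing_people, boeingco)
```

Under these assumptions the three products compared at the top level are 0.25, 0.50, and 0.15, so the document score of 0.50 comes from boeing-label.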

The Query Language • Evidence Operators • Proximity Operators • Relational Operators • Concept Operators • Boolean Operators • Score Operators • Modifiers

Evidence Operators

Proximity Operators

Relational Operators • Relational operators search document fields (such as AUTHOR). • They perform a filtering function by selecting documents that contain specified field values. • Documents retrieved using relational operators are not relevance-ranked.
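Filtering without ranking can be sketched in a few lines. The record layout, the sample documents, and the `filter_by_field` helper below are hypothetical and stand in for whatever field storage a real system uses; only the AUTHOR field name comes from the slide.

```python
def filter_by_field(docs, field_name, predicate):
    """Keep documents whose field satisfies the predicate; order is unchanged."""
    return [d for d in docs
            if field_name in d and predicate(d[field_name])]


# Made-up records for illustration.
docs = [
    {"id": 1, "AUTHOR": "Smith", "title": "Aerospace outlook"},
    {"id": 2, "AUTHOR": "Jones", "title": "Opera season review"},
    {"id": 3, "AUTHOR": "Smith", "title": "Defense contracts"},
    {"id": 4, "title": "Unsigned memo"},
]

# Exact match on the field value.
by_smith = filter_by_field(docs, "AUTHOR", lambda value: value == "Smith")
print([d["id"] for d in by_smith])

# A string comparison: field value starts with a given prefix.
starts_with_j = filter_by_field(docs, "AUTHOR", lambda value: value.startswith("J"))
print([d["id"] for d in starts_with_j])
```

The result is a plain subset in the original order: a document either passes the filter or it does not, which is why no relevance score is attached.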

Relational Operators • Also string comparison operators

Concept Operators • Used with scores

Boolean Operators

Score Operators

Modifiers

Advantages • Overcomes the need for vocabulary overlap between query and document • By using phrases and adjacency operators, can deal with polysemy • Enables the specification of very precise queries • Top retrieved documents have high precision

Disadvantages • Defining and fine tuning a topic set requires substantial work • The knowledge base must be well maintained
