CS533 Information Retrieval

CS533 Information Retrieval • Dr. Michal Cutler • Lecture #16 • March 30, 2000

Webor • Web-based search tool for Organization Retrieval • Based on the vector space model

Webor • Consists of two components • Indexing engine • Search engine

Language • The indexing engine is a C++ program • The search engine is a Common Gateway Interface (CGI) program written in C++

Indexing engine • Robot-based • Builds 25 inverted index files • Builds a web page index with: • doc id, title, URL, • number of unique terms, • number of “parents”, • parent doc ids

Search engine • Reads inverted lists for each indexed query term, and • computes similarity • Returns a list of web pages • Ranks web pages based on Cosine similarity

Webor • Goal: use the structural information in web page collections to improve retrieval

HTML tags • In the current implementation many tags are ignored. Webor checks only whether a term: • appears in the title, or • appears in a header, or • is emphasized (underline, italics or bold), or • appears inside a list item • All other appearances are “plain”

Main Idea • Intuitively, terms in titles, headers, emphasized text, or list items are important • Storing this occurrence information in the index and • assigning importance values to each kind of appearance may improve retrieval

Anchor descriptions • Authors include in an anchor tag a description of the target document, in addition to its URL • These descriptions reflect how the linking authors perceive the contents of the document

Anchor descriptions • May provide good synonyms and related terms • Make it possible to retrieve documents that could not be retrieved otherwise • Using an optimal importance value for anchor terms may improve the ranking

The six classes • Plain Text, • Title, • H1-H2, • H3-H6, • Strong, and • Anchor • The new version (V2) uses a different set of six classes

The anchor class • Includes terms in the anchor tag of hyperlinks to the document • This means that the document is augmented by descriptions that occur in other documents

The tags and the classes (V1)

Modified set of tags (V2)

Indexing engine
  Read a list of URL seeds and store in HP index
  Read a list of seed domains
  while unread URLs do
    Fetch a home page, save title and doc ID in HP index

Indexing engine
    while more “child links” in HP do
      Extract anchor terms for “child” HP
      Extract URL and, if new, store in HP index

Indexing engine
      Add “parent” to “child” HP in the HP index
    Build keyword index
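The crawl loop on the indexing engine slides can be sketched roughly as follows. This is a minimal Python sketch, not Webor's C++ code: the names `build_hp_index` and `fetch`, the dictionary layout of an HP record, and the prefix test used for the seed domains are all assumptions made for illustration. Building the keyword index is left out, and a small in-memory "web" stands in for real HTTP fetches.

```python
from collections import deque

def build_hp_index(seeds, seed_domains, fetch):
    """Crawl from the seeds; fetch(url) returns (title, [(child_url, anchor_text), ...]) or None."""
    hp = {}            # url -> HP record (hypothetical layout)
    queue = deque()    # URLs stored in the HP index but not yet read

    def register(url):
        # Store a URL in the HP index the first time it is seen.
        if url not in hp:
            hp[url] = {"id": len(hp), "title": None,
                       "parents": [], "anchor_terms": []}
            queue.append(url)
        return hp[url]

    for url in seeds:
        register(url)

    while queue:                                   # while unread URLs
        url = queue.popleft()
        page = fetch(url)
        if page is None:                           # page could not be fetched
            continue
        title, links = page
        hp[url]["title"] = title
        for child_url, anchor_text in links:       # while more "child links"
            # Assumed domain filter: keep only URLs under a seed domain.
            if not any(child_url.startswith(d) for d in seed_domains):
                continue
            child = register(child_url)
            child["anchor_terms"].extend(anchor_text.lower().split())
            if hp[url]["id"] not in child["parents"]:
                child["parents"].append(hp[url]["id"])
    return hp

# Tiny in-memory "web" used instead of real HTTP requests.
web = {
    "http://a/": ("Home", [("http://a/cs", "computer science"),
                           ("http://x/", "elsewhere")]),
    "http://a/cs": ("CS Dept", [("http://a/", "home page")]),
}
hp = build_hp_index(["http://a/"], ["http://a/"], web.get)
```

Note that the anchor terms are stored with the child page's record, which is how a document ends up augmented by descriptions written in other documents.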

Build keyword index
  Extract strings between HTML tags and assign to one of 6 classes
  while not empty string do
    Extract a token
    Transform token to a keyword and update index file

Class assignment (V1) • Assign keywords within <TITLE> and </TITLE> to Title class • Then, assign keywords within <H1> and </H1> or <H2> and </H2> to H1-H2 class • Then assign to H3-H6, and then to Strong class • Remaining terms - Plain class
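The priority order on this slide can be expressed as a first-match lookup. The sketch below is illustrative only: the function name `assign_class` is hypothetical, and the exact tags in each set (in particular which tags count as Strong) are assumptions, since the V1 tag table is not reproduced in this text. The Anchor class is not handled here because anchor terms come from the parents' pages rather than from tags inside the document.

```python
# Priority order from the slide: Title, then H1-H2, then H3-H6, then Strong, else Plain.
# The tag sets are assumptions made for this sketch.
CLASS_PRIORITY = [
    ("Title", {"title"}),
    ("H1-H2", {"h1", "h2"}),
    ("H3-H6", {"h3", "h4", "h5", "h6"}),
    ("Strong", {"strong", "b", "em", "i", "u"}),
]

def assign_class(open_tags):
    """Return the V1 class for text enclosed by the given open tags."""
    tags = {t.lower() for t in open_tags}
    for name, members in CLASS_PRIORITY:
        if tags & members:        # first matching class wins
            return name
    return "Plain"
```

For example, text that is both bold and inside an H1 header would be counted once, in the H1-H2 class.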

A token in Webor • Sequence of non-blank characters • Truncate a trailing “’s” • Discard one-character tokens

What is a token? • If the last character is not alphabetic, discard that character (“numbers,” becomes “numbers”) • Discard tokens that contain non-alphabetic characters • Discard a token if a capital letter occurs in the middle of a non-capitalized word

Keywords in Webor • Stem token with Porter’s stemmer • Make token lower case • Add to index if not stop word
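The token and keyword rules on the last three slides can be sketched as a small pipeline. Several points are assumptions: the order in which the token rules are applied, the reading of "non-capitalized word" as "not written entirely in capitals", the tiny stop list, and the `stem` parameter, which defaults to a do-nothing placeholder because Porter's stemmer is not part of the Python standard library.

```python
STOP_WORDS = {"the", "of", "and", "a", "in", "is"}   # tiny illustrative stop list

def to_token(raw):
    """Apply the token rules to one blank-delimited string; None means discard."""
    tok = raw
    if tok and not tok[-1].isalpha():        # "numbers," -> "numbers"
        tok = tok[:-1]
    for suffix in ("'s", "\u2019s"):         # truncate a trailing 's
        if tok.endswith(suffix):
            tok = tok[:-len(suffix)]
    if len(tok) <= 1:                        # one-character token
        return None
    if not tok.isalpha():                    # e.g. OS/2
        return None
    if not tok.isupper() and any(c.isupper() for c in tok[1:]):
        return None                          # e.g. EngiNet
    return tok

def to_keyword(tok, stem=lambda w: w):
    """Stem, lower-case, and drop stop words (slide order); stem is a placeholder."""
    kw = stem(tok).lower()
    return None if kw in STOP_WORDS else kw

def keywords(text, stem=lambda w: w):
    """Return the keywords produced by a string of text."""
    out = []
    for raw in text.split():
        tok = to_token(raw)
        if tok is None:
            continue
        kw = to_keyword(tok, stem)
        if kw is not None:
            out.append(kw)
    return out
```

A real stemmer can be passed in as `stem`; with the placeholder, `keywords` only illustrates the filtering rules.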

The keyword index (V1) • Binary search trees • Reside in main memory • At end copied to disk • Only about 1000 web pages can be indexed • Eventually, all index files and HP files must be merged

Calculating Cosine similarity • The idf of a term is only known at the end of the merge • To compute the “length” of the document vector the weight tf*idf of each document term must be used • Now the “length” formula can be computed and saved in the HP index
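As a rough illustration of why the document "length" has to wait for the merge, the sketch below computes it in a second pass once every df is known. The idf formula used here, ln(N/df), is an assumption (the slide does not give one), and the index layout and the name `document_lengths` are hypothetical.

```python
import math

def document_lengths(index, num_docs):
    """index: keyword -> {doc_id: tf}. Returns doc_id -> length of its tf*idf vector."""
    sums = {}
    for postings in index.values():
        # df is only final after all index files have been merged.
        idf = math.log(num_docs / len(postings))    # assumed idf formula
        for doc_id, tf in postings.items():
            sums[doc_id] = sums.get(doc_id, 0.0) + (tf * idf) ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in sums.items()}

example_index = {"web": {1: 2, 2: 1}, "music": {1: 3}}
lengths = document_lengths(example_index, num_docs=4)
```

The resulting lengths would then be saved with each page's record so that the search engine does not have to recompute them per query.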

Limitations • Cannot search for phrases • Numbers are discarded • EngiNet, OS/2 are discarded • All terms lower case

Problems with indexing engine in V1 • Efficiency (V1) • Names (unless all capital) are stemmed (V1) • Hyphens (V1) • Typos • URLs (V1)

Efficiency • Binary search trees can be skewed • Writing 25 files can be slow because of seek times • V2 uses B-trees. The B-tree is partially stored in main memory

Typing errors • “engineeringwith” is an index term • “engineer-ing” becomes “engineer-” • “engeneer” is stemmed to “engen”

URLs • Some home pages are duplicated under non-identical URLs (fixed in V2) • Pages on servers that are down are not indexed (fixed in V2)

The files • HP (home page) index: one record for each web page • 25 keyword index files: A, B, …, Y • Each file is a sequence of keyword records

The web page records • URL address, web page id, • web page title, • number of unique keywords, • anchor terms used by “parents”, • total number of “parents”, • list of “parent” IDs

Keyword record • The keyword (and ID) • df: number of documents containing the keyword • An inverted list

The inverted list • A sequence of (ID, TFV) where • TFV is a vector of size 6 that contains the term frequencies for the 6 classes
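A keyword record with its inverted list might be modelled as below. This is only a sketch of the data layout described on the slides: the class name, field names, and the choice of a dictionary keyed by document ID are assumptions, not Webor's on-disk format.

```python
from dataclasses import dataclass, field

NUM_CLASSES = 6   # one term-frequency slot per class

@dataclass
class KeywordRecord:
    keyword: str
    keyword_id: int
    # Inverted list: doc_id -> TFV, a list of 6 per-class term frequencies.
    postings: dict = field(default_factory=dict)

    @property
    def df(self):
        """Number of documents that contain the keyword."""
        return len(self.postings)

    def add_occurrence(self, doc_id, class_index):
        """Count one occurrence of the keyword in the given class of a document."""
        tfv = self.postings.setdefault(doc_id, [0] * NUM_CLASSES)
        tfv[class_index] += 1

record = KeywordRecord("retriev", keyword_id=7)
record.add_occurrence(doc_id=3, class_index=0)
record.add_occurrence(doc_id=3, class_index=1)
record.add_occurrence(doc_id=5, class_index=0)
```

Keeping the six counts separate, rather than one total tf, is what lets different importance values be tried later without re-indexing.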

The search engine
  Read the HP index file
  Read query string
  while string not empty do

The search engine
    get token
    convert to keyword
    if in index get keyword record
  Compute similarity
  Display in nonincreasing order of similarity
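The search loop can be sketched as a term-at-a-time accumulation over the inverted lists, followed by a sort. The sketch assumes that term weights and document lengths have already been computed, that each indexed query term has weight 1, and that ties are broken by document ID; the name `rank` and the dictionary layouts are hypothetical.

```python
import math

def rank(query_keywords, weights, doc_len):
    """weights: keyword -> {doc_id: term weight}; doc_len: doc_id -> vector length."""
    # Query terms that are not in the index are ignored.
    terms = sorted(k for k in set(query_keywords) if k in weights)
    if not terms:
        return []
    q_len = math.sqrt(len(terms))            # each query term weighted 1 (assumption)
    scores = {}
    for k in terms:                           # read one inverted list per query term
        for doc_id, w in weights[k].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w
    ranked = [(d, s / (doc_len[d] * q_len)) for d, s in scores.items()]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))   # nonincreasing similarity
    return ranked

example_weights = {"neural": {1: 3.0, 2: 4.0}, "network": {1: 4.0}}
example_len = {1: 5.0, 2: 4.0}
results = rank(["neural", "network", "zzz"], example_weights, example_len)
```

Only documents that appear in at least one inverted list receive a score, so pages sharing no term with the query are never touched.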

The weight of a term • Webor uses the six Class Importance Values (CIV) obtained from our experiments • The weight of a term is computed by

Normal • In this case results are identical to ignoring HTML tags and the parent anchors
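The weight formula and the "Normal" CIV are not reproduced in this text, so the sketch below assumes one plausible form: the dot product of CIV and TFV, multiplied by idf. Under that assumption, a CIV of 1 for every class except Anchor (and 0 for Anchor) gives exactly the plain tf*idf weight. The function name, the class order (taken from the V1 class list), and `NORMAL_CIV` are all assumptions.

```python
# Assumed class order (V1 class list): Plain, Title, H1-H2, H3-H6, Strong, Anchor
NORMAL_CIV = (1, 1, 1, 1, 1, 0)    # assumption: count every in-page class once, ignore anchors

def term_weight(tfv, civ, idf):
    """Assumed weight: (CIV . TFV) * idf."""
    return sum(c * f for c, f in zip(civ, tfv)) * idf

tfv = [3, 1, 0, 0, 2, 4]           # per-class term frequencies of one term in one page
plain_weight = term_weight(tfv, NORMAL_CIV, idf=2.0)
boosted_weight = term_weight(tfv, (1, 2, 1, 1, 8, 8), idf=1.0)
```

With `NORMAL_CIV` the four anchor occurrences contribute nothing and the six in-page occurrences are simply summed, which is the sense in which the result matches ignoring HTML tags and parent anchors.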

Creating a test bed • Web pages: a snapshot of the Binghamton University site in Dec. 1996 (about 4,600 pages; after removing duplicates, about 3,000 pages) • Queries: 20 queries were created • For each query, the relevant documents were identified manually

The 20 queries • web-based retrieval • concert and music • neural network • intramural sports • master thesis in geology • cognitive science • prerequisite of algorithm • campus dining • handicap student help • career development • promotion guideline • non-matriculated admissions • grievance committee • student associations • laboratory in electrical engineering • research centers • anthropology chairman • engineering program • computer workshop • papers in philosophy computer and cognitive system

A Genetic Algorithm for finding the optimal CIV • The initial population has 30 CIVs • 25 are randomly generated (values in the range [1, 15]) • 5 are “good” CIVs from manual screening • Each new generation of CIVs is produced by executing crossover, mutation, and reproduction

The Genetic Algorithm • Crossover • done for each consecutive pair of CIVs, with probability 0.75 • a single random cut for each selected pair • Example (cut after the second component): old pair (1, 4, 2, 1, 2, 1), (2, 3, 1, 2, 5, 1); new pair (2, 3, 2, 1, 2, 1), (1, 4, 1, 2, 5, 1)
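The crossover step can be sketched as follows. The `crossover` function reproduces the slide's example when the cut falls after the second component; `crossover_generation` assumes that "consecutive pairs" means positions (0, 1), (2, 3), and so on, and that the cut is chosen uniformly among the interior positions. Both function names are hypothetical.

```python
import random

def crossover(a, b, cut):
    """Single-point crossover: exchange the components before position cut."""
    return b[:cut] + a[cut:], a[:cut] + b[cut:]

def crossover_generation(population, rng, p=0.75):
    """Apply crossover to each consecutive pair with probability p."""
    out = list(population)
    for i in range(0, len(out) - 1, 2):
        if rng.random() < p:
            cut = rng.randint(1, len(out[i]) - 1)     # one random interior cut
            out[i], out[i + 1] = crossover(out[i], out[i + 1], cut)
    return out

old_pair = ((1, 4, 2, 1, 2, 1), (2, 3, 1, 2, 5, 1))
new_pair = crossover(old_pair[0], old_pair[1], cut=2)
# new_pair == ((2, 3, 2, 1, 2, 1), (1, 4, 1, 2, 5, 1))
```

Crossover only recombines existing values: at every position the pair still holds the same two numbers, just possibly exchanged.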

The Genetic Algorithm • Mutation • performed on each CIV with probability 0.1. • When mutation is performed, each CIV component is either decreased or increased by one with equal probability, subject to range conditions of each component. Example: If a component is already 15, then it cannot be increased.
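A minimal sketch of the mutation step is shown below. It assumes every component shares the range [1, 15] and that a move blocked by the range limit leaves the component unchanged; the slide only says that a component at 15 cannot be increased, so that boundary behaviour is a guess. The function names are hypothetical.

```python
import random

LOW, HIGH = 1, 15     # assumed range for every component

def mutate(civ, rng):
    """Move each component down or up by one (equal probability), clamped to the range."""
    return tuple(min(HIGH, max(LOW, c + rng.choice((-1, 1)))) for c in civ)

def mutate_generation(population, rng, p=0.1):
    """Mutate each CIV with probability p; the others are passed through unchanged."""
    return [mutate(civ, rng) if rng.random() < p else civ for civ in population]

mutated = mutate((1, 15, 8, 8, 8, 8), random.Random(0))
```

Because every step is of size one, mutation explores only the immediate neighbourhood of a CIV; larger jumps have to come from crossover.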

The Genetic Algorithm • The fitness function • A CIV has an initial fitness of • 0 when the 11-point average precision is less than 0.22. • (11-point average precision - 0.22), otherwise. • The final fitness is its initial fitness divided by the sum of the initial fitnesses of all the CIVs in the current generation. • each fitness is between 0 and 1 • the sum of all fitness values is 1
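The two-stage fitness computation can be written directly from the slide. The only assumption in the sketch is the handling of a generation in which every CIV falls below the 0.22 threshold, a case the slide does not cover; here all fitness values are set to 0. The function name is hypothetical.

```python
THRESHOLD = 0.22

def final_fitness(avg_precisions):
    """Map each CIV's 11-point average precision to its normalized fitness."""
    initial = [max(0.0, p - THRESHOLD) for p in avg_precisions]
    total = sum(initial)
    if total == 0:
        return [0.0] * len(initial)       # case not covered on the slide (assumption)
    return [f / total for f in initial]

fitness = final_fitness([0.20, 0.25, 0.28])
```

Here the initial fitnesses are 0, 0.03 and 0.06, so the final fitnesses are 0, 1/3 and 2/3 (up to floating-point rounding).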

The Genetic Algorithm • Reproduction • Wheel of fortune scheme to select the parent population. • The scheme selects fit CIVs with high probability and unfit CIVs with low probability. • The same CIV may be selected more than once.
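Wheel-of-fortune (roulette-wheel) selection can be sketched as below: each CIV occupies a slice of the wheel proportional to its fitness, and the wheel is spun once per slot in the new parent population. The function name and the use of a cumulative-sum lookup are implementation choices made for this sketch, not details taken from Webor.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def select_parents(population, fitness, rng):
    """Draw len(population) parents, each with probability proportional to fitness."""
    cumulative = list(accumulate(fitness))     # right edge of each wheel slice
    total = cumulative[-1]
    chosen = []
    for _ in population:
        spin = rng.random() * total            # a point on the wheel, in [0, total)
        idx = min(bisect_right(cumulative, spin), len(population) - 1)
        chosen.append(population[idx])         # the same CIV may be drawn again
    return chosen

pool = ["A", "B", "C"]
draws = []
rng = random.Random(42)
for _ in range(100):
    draws.extend(select_parents(pool, [0.0, 0.25, 0.75], rng))
```

A CIV with fitness 0 has an empty slice and is never selected, while the fittest CIV typically appears several times in the parent population.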

The Genetic Algorithm • Termination • The algorithm terminates after 25 generations and the best CIV obtained is reported as the optimal CIV • The 11-point average precision achieved by the optimal CIV is reported as the performance of the CIV
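For reference, the 11-point average precision of a single ranked list can be computed as sketched below, using the standard interpolated definition: the precision at recall level r is the highest precision observed at any recall of at least r, averaged over r = 0.0, 0.1, ..., 1.0. Whether Webor's evaluation used exactly this interpolation, and how the per-query values were combined, is not stated on the slides, so treat both as assumptions. The function name is hypothetical.

```python
def eleven_point_average_precision(ranking, relevant):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []                    # (recall, precision) after each relevant hit
    hits = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level - 1e-9), default=0.0)
    return total / 11

score = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

In this example the precision is 1.0 up to recall 0.5 and 2/3 beyond it, so the score is (6 + 5 * 2/3) / 11.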

Experimental Results
Classes: title, header, list, strong, anchor, plain

Queries   Opt. CIV   Normal   New     Improvement
1st 10    281881     0.182    0.254   39.6%
2nd 10    271881     0.172    0.255   48.3%
all       251881     0.177    0.254   43.5%

Conclusions • Anchor and strong are most important • Header is also important • Title is only slightly more important than list and plain

Future work • Webor has the potential to substantially improve retrieval effectiveness • The results are still preliminary. We need to: • Expand the set of queries in the test bed • Use other web page collections
