
CS533 Information Retrieval

Dr. Michal Cutler. Lecture #16, March 30, 2000. Webor is a web-based search tool for organization retrieval, based on the vector space model. It consists of two components, an indexing engine and a search engine. The indexing engine is a C++ program.



  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #16 March 30, 2000

  2. Webor • Web-based search tool for Organization Retrieval • Based on the vector space model

  3. Webor • Consists of two components • Indexing engine • Search engine

  4. Language • The indexing engine is a C++ program • The search engine is a Common Gateway Interface (CGI) program written in C++

  5. Indexing engine • Robot-based • Builds 25 inverted index files • Builds a webpage index with: • doc id, title, URL, • number of unique terms, • number of “parents”, • parent doc ids

  6. Search engine • Reads inverted lists for each indexed query term, and • computes similarity • Returns a list of web pages • Ranks web pages based on Cosine similarity

  7. Webor • Research question: how can the structural information in web page collections be used to improve retrieval?

  8. HTML tags • In the current implementation many tags are ignored. Webor checks only whether a term appears in the: • title or • headers or • is emphasized (underlined, italic or bold) or • inside a list item • All other appearances are “plain”

  9. Main Idea • Intuitively, terms in title, header, emphasized, or lists are important • Storing occurrence information in the index and • assigning importance values to their appearance may improve retrieval

  10. Anchor descriptions • In an anchor tag, authors include a description of the target document in addition to its URL. • These descriptions reflect the authors' perception of the document's contents

  11. Anchor descriptions • May provide good synonyms and related terms • Make it possible to retrieve documents that could not be retrieved otherwise. • Using an optimal importance value for anchor terms may improve the ranking

  12. The six classes • Plain Text, • Title, • H1-H2, • H3-H6, • Strong, and • Anchor. • The new version uses a different set of 6 classes

  13. The anchor class • Includes terms in the anchor tag of hyperlinks to the document • This means that the document is augmented by descriptions that occur in other documents

  14. The tags and the classes (V1)

  15. Modified set of tags (V2)

  16. Indexing engine
      Read a list of URL seeds and store them in the HP index
      Read a list of seed domains
      while unread URLs remain do
          Fetch a home page; save its title and doc ID in the HP index

  17. Indexing engine
      while more “child links” in HP do
          Extract anchor terms for the “child” HP
          Extract the URL and, if new, store it in the HP index

  18. Indexing engine
      Add “parent” to the “child” HP in the HP index
      Build keyword index
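The robot loop on slides 16 to 18 can be sketched roughly as follows. This is a minimal illustration, not Webor's C++ code: the `fetch` callback, the `new_record` layout, and the in-memory `FAKE_SITE` are hypothetical stand-ins, and the seed-domain filter and keyword indexing step are left out.

```python
from collections import deque

def new_record(doc_id):
    # Hypothetical record layout, loosely following the web page records on the slides.
    return {"id": doc_id, "title": None, "anchor_terms": [], "parents": []}

def crawl(seed_urls, fetch):
    """fetch(url) is assumed to return (title, [(child_url, anchor_text), ...])."""
    hp_index = {}
    unread = deque()
    for url in seed_urls:
        if url not in hp_index:
            hp_index[url] = new_record(len(hp_index))
            unread.append(url)
    while unread:                                  # while unread URLs remain
        url = unread.popleft()
        title, child_links = fetch(url)
        page = hp_index[url]
        page["title"] = title
        for child_url, anchor_text in child_links:
            if child_url not in hp_index:          # new URL: store it in the HP index
                hp_index[child_url] = new_record(len(hp_index))
                unread.append(child_url)
            child = hp_index[child_url]
            child["anchor_terms"].extend(anchor_text.lower().split())
            child["parents"].append(page["id"])    # add "parent" to the "child" record
    return hp_index

# Tiny in-memory "site" used instead of real HTTP requests.
FAKE_SITE = {
    "/": ("Home", [("/music", "Music Department")]),
    "/music": ("Music", [("/", "home page")]),
}
index = crawl(["/"], FAKE_SITE.__getitem__)
```

Passing the fetcher in as a parameter keeps the sketch runnable without network access; a real robot would download and parse each page there.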

  19. Build keyword index
      Extract strings between HTML tags and assign each to one of the 6 classes
      while string is not empty do
          Extract a token
          Transform the token to a keyword and update the index file

  20. Class assignment (V1) • Assign keywords within <TITLE> and </TITLE> to the Title class • Then, assign keywords within <H1> and </H1> or <H2> and </H2> to the H1-H2 class • Then assign to the H3-H6 class, and then to the Strong class • Remaining terms go to the Plain class
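The priority order above can be sketched as a first-match rule list. The tag sets are assumptions: the slides name the tags only for Title and H1-H2, so the members of the H3-H6 and Strong sets below are a guess, and `assign_class` is a hypothetical helper name.

```python
# Checked in order; the first matching class wins (Title > H1-H2 > H3-H6 > Strong).
CLASS_RULES = [
    ("Title", {"title"}),
    ("H1-H2", {"h1", "h2"}),
    ("H3-H6", {"h3", "h4", "h5", "h6"}),
    ("Strong", {"strong", "b", "em", "i", "u"}),  # assumed tag set
]

def assign_class(enclosing_tags):
    """Return the V1 class for text enclosed by the given HTML tags."""
    tags = {t.lower() for t in enclosing_tags}
    for class_name, members in CLASS_RULES:
        if tags & members:
            return class_name
    return "Plain"
```

For example, bold text inside an `<H1>` header is counted once, in the H1-H2 class, because that rule is checked before Strong.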

  21. A token in Webor • A sequence of non-blank characters • Truncate a trailing “’s” • Discard one-character tokens

  22. What is a token? • If the last character is not alphabetic, discard that character • “numbers,” becomes “numbers” • Discard tokens that contain non-alphabetic characters • Discard a token if a capital letter occurs in the middle of a word that is not all capitals

  23. Keywords in Webor • Stem token with Porter’s stemmer • Make token lower case • Add to index if not stop word
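The token and keyword rules on slides 21 to 23 can be sketched as below. The order in which the rules are applied is an assumption, the stop-word list is a tiny illustrative one, and the stemmer is passed in as a parameter (identity by default) rather than implementing Porter's algorithm.

```python
STOP_WORDS = {"the", "of", "and", "a", "in"}  # illustrative only, not Webor's list

def clean_token(raw):
    """Apply the token rules; return the cleaned token, or None if it is discarded."""
    tok = raw
    if tok.endswith(("'s", "\u2019s")):       # truncate a trailing 's
        tok = tok[:-2]
    if tok and not tok[-1].isalpha():         # "numbers," -> "numbers"
        tok = tok[:-1]
    if len(tok) <= 1:                         # one-character tokens are discarded
        return None
    if not tok.isalpha():                     # e.g. "OS/2"
        return None
    if any(c.isupper() for c in tok[1:]) and not tok.isupper():
        return None                           # e.g. "EngiNet"; "IBM" is kept
    return tok

def keywords(text, stem=lambda t: t):
    """Stem, lower-case, and drop stop words, in the order given on the slide."""
    result = []
    for raw in text.split():
        tok = clean_token(raw)
        if tok is None:
            continue
        kw = stem(tok).lower()
        if kw not in STOP_WORDS:
            result.append(kw)
    return result
```

With the identity stemmer, `keywords("The numbers, of EngiNet and OS/2 IBM Webor's a")` keeps only `numbers`, `ibm`, and `webor`.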

  24. The keyword index (V1) • Binary search trees • Reside in main memory • At end copied to disk • Only about 1000 web pages can be indexed • Eventually, all index files and HP files must be merged

  25. Calculating Cosine similarity • The idf of a term is only known at the end of the merge • To compute the “length” of the document vector the weight tf*idf of each document term must be used • Now the “length” formula can be computed and saved in the HP index
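A small sketch of why the document "length" has to wait for the merged index: it depends on idf, which depends on the final document frequencies. The idf formula `log(N / df)` and the helper names are assumptions; the slides do not state which idf variant Webor uses.

```python
import math

def idf_table(docs):
    """docs maps doc id -> {term: tf}. idf = log(N / df) is an assumed variant."""
    df = {}
    for tf in docs.values():
        for term in tf:
            df[term] = df.get(term, 0) + 1
    return {term: math.log(len(docs) / d) for term, d in df.items()}

def length(tf, idf):
    """Euclidean length of the tf*idf vector; needs the final idf values."""
    return math.sqrt(sum((f * idf.get(t, 0.0)) ** 2 for t, f in tf.items()))

def cosine(query_tf, doc_tf, idf):
    dot = sum(qf * doc_tf[t] * idf.get(t, 0.0) ** 2
              for t, qf in query_tf.items() if t in doc_tf)
    denom = length(query_tf, idf) * length(doc_tf, idf)
    return dot / denom if denom else 0.0

def rank(query_tf, docs, idf):
    """Doc ids in nonincreasing order of cosine similarity."""
    return sorted(docs, key=lambda d: -cosine(query_tf, docs[d], idf))

docs = {1: {"music": 2, "concert": 1}, 2: {"music": 1, "campus": 3}, 3: {"campus": 1}}
idf = idf_table(docs)
ranking = rank({"concert": 1, "music": 1}, docs, idf)
```

In an indexer the per-document `length` would be computed once after the merge and stored, as the slide describes for the HP index.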

  26. Limitations • Cannot search for phrases • Numbers are discarded • EngiNet, OS/2 are discarded • All terms lower case

  27. Problems with indexing engine in V1 • Efficiency (V1) • Names (unless all capital) are stemmed (V1) • Hyphens (V1) • Typos • URLs (V1)

  28. Efficiency • Binary search trees can be skewed • Writing 25 files can be slow because of seek times • V2 uses B-trees. The B-tree is partially stored in main memory

  29. Typing errors • “engineeringwith” becomes an index term • “engineer-ing” becomes “engineer-” • “engeneer” is stemmed to “engen”

  30. URLs • Some home pages are indexed more than once under non-identical URLs (fixed in V2) • Pages on servers that are down are not indexed (fixed in V2)

  31. The files • HP (homepage) index • record for each webpage • 25 keyword index files • A, B,…,Y • Each file • a sequence of keyword records

  32. The web page records • url address, webpage id, • webpage title, • number unique keywords, • anchor terms used by “parents”, • total number of “parents”, • list of “parent” IDs

  33. Keyword record • The keyword (and ID) • df - no. documents with keyword • An inverted list

  34. The inverted list • A sequence of (ID, TFV) where • TFV is a vector of size 6 that contains the term frequencies for the 6 classes
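The keyword record and its inverted list of (ID, TFV) pairs could be modelled as below. This is an in-memory sketch with hypothetical names; the class order follows the V1 list on slide 12, and the on-disk layout of the 25 index files is not shown.

```python
from dataclasses import dataclass, field

# Class order is an assumption taken from the V1 list of six classes.
CLASSES = ("Plain", "Title", "H1-H2", "H3-H6", "Strong", "Anchor")

@dataclass
class KeywordRecord:
    keyword: str
    postings: dict = field(default_factory=dict)  # doc id -> TFV (six counts)

    @property
    def df(self):
        """Number of documents that contain the keyword."""
        return len(self.postings)

    def add(self, doc_id, class_name):
        """Count one occurrence of the keyword in the given class of a document."""
        tfv = self.postings.setdefault(doc_id, [0] * len(CLASSES))
        tfv[CLASSES.index(class_name)] += 1

rec = KeywordRecord("music")
rec.add(7, "Title")
rec.add(7, "Plain")
rec.add(9, "Anchor")
```

Here `rec.df` is 2, and document 7 has the TFV `[1, 1, 0, 0, 0, 0]`: one plain occurrence and one in the title.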

  35. The search engine
      Read the HP index file
      Read the query string
      while string is not empty do

  36. The search engine
          Get a token
          Convert it to a keyword
          If the keyword is in the index, get its keyword record
      Compute similarity
      Display results in nonincreasing order of similarity

  37. The weight of a term • Webor uses the 6 Class Importance Values computed by our experiments (CIV) • The weight of a term is computed by

  38. Normal • In this case results are identical to ignoring HTML tags and the parent anchors
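The weight formula itself is not reproduced in this transcript. The sketch below assumes, and this is only an assumption, that the class-weighted term frequency is the dot product of the CIV and the TFV, multiplied by idf. Under that assumption, a CIV of 1 for every in-document class and 0 for the anchor class reduces to ordinary tf*idf, which matches the "Normal" case described above. The class order and the names `term_weight` and `NORMAL_CIV` are hypothetical.

```python
# Assumed class order: Plain, Title, H1-H2, H3-H6, Strong, Anchor.
NORMAL_CIV = (1, 1, 1, 1, 1, 0)   # ignore tag classes and parent anchors

def term_weight(civ, tfv, idf):
    """Assumed form: (CIV . TFV) * idf. Not taken from the slide."""
    return sum(c * f for c, f in zip(civ, tfv)) * idf

tfv = (3, 1, 0, 0, 2, 4)          # 6 occurrences in the page, 4 in parent anchors
plain_tf = sum(tfv[:5])           # tf when HTML structure is ignored
```

With `idf = 2.0`, `term_weight(NORMAL_CIV, tfv, 2.0)` equals `plain_tf * 2.0`, while a CIV that weights the anchor class would also count the four anchor occurrences.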

  39. Creating a test bed • Web pages: a snapshot of the Binghamton University site in Dec. 1996 (about 4,600 pages; after removing duplicates, about 3,000 pages). • Queries: 20 queries were created. • For each query, the documents relevant to the query were identified manually.

  40. The 20 queries • web-based retrieval • concert and music • neural network • intramural sports • master thesis in geology • cognitive science • prerequisite of algorithm • campus dining • handicap student help • career development • promotion guideline • non-matriculated admissions • grievance committee • student associations • laboratory in electrical engineering • research centers • anthropology chairman • engineering program • computer workshop • papers in philosophy computer and cognitive system

  41. A Genetic Algorithm for finding the optimal CIV. • The initial population has 30 CIVs. • 25 are randomly generated (range [1, 15]) • 5 are “good” CIVs from manual screening. • Each new generation of CIVs is produced by executing: crossover, mutation, and reproduction.

  42. The Genetic Algorithm • Crossover • done for each consecutive pair of CIVs, with probability 0.75. • a single random cut for each selected pair • Example, with the cut after the second component: old pair (1, 4 | 2, 1, 2, 1) and (2, 3 | 1, 2, 5, 1); new pair (2, 3 | 2, 1, 2, 1) and (1, 4 | 1, 2, 5, 1)
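A minimal sketch of single-cut crossover as described above. Pairing the CIVs as (0, 1), (2, 3), and so on is an assumed reading of "consecutive pair", and the function names are hypothetical.

```python
import random

def cut_and_swap(a, b, cut):
    """Exchange the components before the cut point between two CIVs."""
    return b[:cut] + a[cut:], a[:cut] + b[cut:]

def crossover(population, rng, p_cross=0.75):
    """Cross each consecutive pair with probability p_cross at one random cut."""
    result = []
    for i in range(0, len(population) - 1, 2):
        a, b = population[i], population[i + 1]
        if rng.random() < p_cross:
            cut = rng.randint(1, len(a) - 1)   # cut strictly inside the vector
            a, b = cut_and_swap(a, b, cut)
        result.extend([a, b])
    if len(population) % 2:                    # an unpaired last CIV is kept as is
        result.append(population[-1])
    return result

old_pair = ((1, 4, 2, 1, 2, 1), (2, 3, 1, 2, 5, 1))
new_pair = cut_and_swap(*old_pair, cut=2)
```

With `cut=2`, `new_pair` reproduces the slide's example: `(2, 3, 2, 1, 2, 1)` and `(1, 4, 1, 2, 5, 1)`.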

  43. The Genetic Algorithm • Mutation • performed on each CIV with probability 0.1. • When mutation is performed, each CIV component is either decreased or increased by one with equal probability, subject to range conditions of each component. Example: If a component is already 15, then it cannot be increased.
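The mutation step can be sketched as follows. How Webor enforces the range condition is not spelled out beyond the example, so this sketch assumes a step that would leave [1, 15] simply leaves the component unchanged; `mutate` is a hypothetical name.

```python
import random

def mutate(civ, rng, p_mut=0.1, low=1, high=15):
    """With probability p_mut, move every component by +1 or -1 within [low, high]."""
    if rng.random() >= p_mut:
        return civ                             # this CIV is not mutated
    mutated = []
    for value in civ:
        step = rng.choice((-1, 1))             # decrease or increase, equally likely
        # Assumption: a step outside the range is dropped (value stays at the bound).
        mutated.append(min(high, max(low, value + step)))
    return tuple(mutated)
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable when comparing CIV populations.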

  44. The Genetic Algorithm • The fitness function • A CIV has an initial fitness of • 0 when the 11-point average precision is less than 0.22. • (11-point average precision - 0.22), otherwise. • The final fitness is its initial fitness divided by the sum of the initial fitnesses of all the CIVs in the current generation. • each fitness is between 0 and 1 • the sum of all fitness values is 1
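The two-stage fitness above can be sketched in a few lines. Returning all zeros when no CIV exceeds the threshold is an assumption, since the slide does not cover that case.

```python
def fitnesses(avg_precisions, threshold=0.22):
    """Initial fitness max(0, precision - threshold), normalised to sum to 1."""
    initial = [max(0.0, p - threshold) for p in avg_precisions]
    total = sum(initial)
    if total == 0:
        return [0.0] * len(initial)   # assumed handling; not specified on the slide
    return [f / total for f in initial]

generation_fitness = fitnesses([0.20, 0.25, 0.31])
```

For the precisions 0.20, 0.25, and 0.31 the initial fitnesses are 0, 0.03, and 0.09, so the final values are approximately 0, 0.25, and 0.75.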

  45. The Genetic Algorithm • Reproduction • Wheel of fortune scheme to select the parent population. • The scheme selects fit CIVs with high probability and unfit CIVs with low probability. • The same CIV may be selected more than once.
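Fitness-proportional ("wheel of fortune") selection can be sketched with Python's `random.choices`, which samples with replacement according to the given weights. Keeping the parent population the same size as the current one is an assumption.

```python
import random

def select_parents(population, fitness, rng, k=None):
    """Sample k CIVs with replacement, each with probability equal to its fitness."""
    k = len(population) if k is None else k
    return rng.choices(population, weights=fitness, k=k)

population = [(1, 1, 1, 1, 1, 1), (2, 3, 1, 2, 5, 1), (2, 8, 1, 8, 8, 1)]
parents = select_parents(population, [0.0, 0.25, 0.75], random.Random(1), k=1000)
```

Because sampling is with replacement, a fit CIV usually appears several times among the parents, and a CIV with fitness 0 is never chosen.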

  46. The Genetic Algorithm • Termination • The algorithm terminates after 25 generations and the best CIV obtained is reported as the optimal CIV. • The 11-point average precision by the optimal CIV is reported as the performance of the CIV.
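The 11-point average precision used as the performance measure can be sketched as below, using the standard interpolated definition (the highest precision at any recall at or above each of the levels 0.0, 0.1, ..., 1.0). Whether Webor's evaluation used exactly this interpolation is an assumption.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    points = []                       # (recall, precision) at each relevant hit
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

score = eleven_point_average_precision([1, 2, 3, 4], {1, 3})
```

In this example the relevant documents sit at ranks 1 and 3, so precision is 1.0 up to recall 0.5 and 2/3 beyond it, giving a score of 28/33.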

  47. Experimental Results
      Classes: title, header, list, strong, anchor, plain
      Queries   Opt. CIV   Normal   New     Improvement
      1st 10    281881     0.182    0.254   39.6%
      2nd 10    271881     0.172    0.255   48.3%
      all       251881     0.177    0.254   43.5%
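The improvement column is consistent with a relative gain of the new score over the normal one, which the short check below reproduces. Reading the column this way is an inference from the numbers, and `improvement_percent` is a hypothetical helper.

```python
def improvement_percent(normal, new):
    """Relative gain of `new` over `normal`, in percent."""
    return (new - normal) / normal * 100

# (normal, new) score pairs from the results slide.
rows = {"1st 10": (0.182, 0.254), "2nd 10": (0.172, 0.255), "all": (0.177, 0.254)}
gains = {name: round(improvement_percent(*pair), 1) for name, pair in rows.items()}
```

Rounded to one decimal place, the three gains come out at 39.6, 48.3, and 43.5, matching the slide.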

  48. Conclusions • Anchor and strong are most important • Header is also important • Title is only slightly more important than list and plain

  49. Future work • Webor has the potential to substantially improve the retrieval effectiveness. • Results are too preliminary. Need to • Expand the set of queries in the test bed • Use other Web page collections
