
A Search Engine That Learns

This presentation explores using a genetic algorithm to optimize web page rankings, both by adjusting search engine weighting factors and by suggesting changes to specific web pages. It covers preliminary results, future work on a Levenshtein distance fitness function, and optimization suggestions.



Presentation Transcript


  1. A Search Engine That Learns • Jeff Elser – jelser@cs.montana.edu • John Paxton – paxton@cs.montana.edu • Montana State University - Bozeman

  2. Presentation Outline • Problem • Background Information • Approach • Preliminary Results • Future Work • Summary • Questions

  3. I. Problem • RightNow software use • Spidering and searching • Website optimization • Page by page is tedious and time consuming • Dual ownership should allow perfect optimization • Solutions • Search engine adjustments • Suggesting specific web page changes

  4. II. Background – Search Engine • Spidering • Indexing • Weighting factors

  5. II. Background – Genetic Algorithms • Goldberg’s Simple GA • Mutation • Crossover • Elitism • Non-overlapping populations • Several fitness functions • Individual 1: Fitness = 2 • Individual 2: Fitness = 4

  6. III. Approach • Architecture • Training data • Testing controls (website source) • GA specifics • Fitness functions

  7. A. Architecture

  8. B. Training Data • Website source • 20,000 newsgroup articles from the UCI Knowledge Discovery in Databases Archive • Hand-formatted HTML • Chosen for word count and structure

  9. C. Testing Controls • Webmaster provides training data • List of important keywords • Associated ranked pages • Tedious, but trivial compared to optimizing all pages

  10. D. GA Specifics • Random initial population • Population size 1000 • Used GAlib’s built-in random number generator • Genome • 16 real numbers corresponding to the 16 weighting factors • Range 0.0 – 1000.0

  11. D. GA Specifics • GA executes for 10000 generations • Elitism is turned on • Mutation probability = 0.01 • Crossover probability = 0.6
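The parameters on the two GA slides can be put together in a small sketch. This is plain Python, not GAlib, and it is only one way to read the slides: the selection scheme (tournament), the crossover operator (one-point), and the mutation operator (uniform redraw) are assumptions, as are all function names. It also assumes a lower score is better, so the function passed in is called `cost`.

```python
import random

GENOME_LENGTH = 16                # one real number per weighting factor (from the slides)
GENE_MIN, GENE_MAX = 0.0, 1000.0  # gene range from the slides
P_MUTATION = 0.01
P_CROSSOVER = 0.6


def random_genome(rng):
    return [rng.uniform(GENE_MIN, GENE_MAX) for _ in range(GENOME_LENGTH)]


def tournament(population, scores, rng, k=2):
    # Assumption: tournament selection; the slides do not name a selection scheme.
    contenders = rng.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: scores[i])]


def crossover(a, b, rng):
    # Assumption: one-point crossover, applied with probability P_CROSSOVER.
    if rng.random() < P_CROSSOVER:
        point = rng.randrange(1, GENOME_LENGTH)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]


def mutate(genome, rng):
    # Assumption: a mutated gene is redrawn uniformly from the allowed range.
    return [rng.uniform(GENE_MIN, GENE_MAX) if rng.random() < P_MUTATION else g
            for g in genome]


def evolve(cost, population_size=1000, generations=10000, seed=None):
    """Minimize `cost` over 16-gene genomes; returns (best_genome, best_cost)."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        scores = [cost(g) for g in population]
        best = min(range(population_size), key=lambda i: scores[i])
        # Elitism: the best individual is copied unchanged into the next generation.
        next_population = [population[best][:]]
        # Non-overlapping populations: every other slot is filled by new offspring.
        while len(next_population) < population_size:
            child_a, child_b = crossover(tournament(population, scores, rng),
                                         tournament(population, scores, rng), rng)
            next_population.append(mutate(child_a, rng))
            if len(next_population) < population_size:
                next_population.append(mutate(child_b, rng))
        population = next_population
    scores = [cost(g) for g in population]
    best = min(range(population_size), key=lambda i: scores[i])
    return population[best], scores[best]
```

With the slide defaults a pure-Python run is slow, so a trial call such as `evolve(sum, population_size=20, generations=30, seed=1)` with a toy cost is a more practical way to try the loop.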

  12. D. Fitness Function 1 • ∑D • D = |(actual ranking) – (desired ranking)| • +1 to avoid division by 0
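The slide gives the pieces of Fitness Function 1 but not the assembled formula. A minimal sketch under one reading: the "+1" sits in the denominator of a reciprocal, so the score is 1 / (1 + ∑D). That placement, the function names, and the dictionary representation of rankings are all assumptions, and every desired page is assumed to appear in the actual results.

```python
def total_rank_distance(actual, desired):
    """Sum of |actual rank - desired rank| over the pages in `desired`.

    Both arguments map a page URL to its 1-based rank for one keyword.
    Assumption: every desired page is present in `actual`.
    """
    return sum(abs(actual[page] - rank) for page, rank in desired.items())


def fitness_1(actual, desired):
    # Assumption: the "+1" keeps the reciprocal defined when the distance is 0,
    # so a perfect ranking scores 1.0 and worse rankings approach 0.
    return 1.0 / (1 + total_rank_distance(actual, desired))
```

For example, swapping the second and third of three pages gives a total distance of 2 and, under this reading, a fitness of 1/3.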

  13. D. Fitness Function 2 • +100 penalty for pages that don’t appear • -10 reward for pages with a perfect fit
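The slide lists only the two adjustments, so the following is a sketch under stated assumptions: the rank distance from Fitness Function 1 is still summed, the adjustments are applied per desired page, and a lower total is better. The constant and function names are hypothetical.

```python
MISSING_PENALTY = 100   # added for a desired page absent from the results
PERFECT_REWARD = 10     # subtracted when a page lands exactly on its desired rank


def fitness_2(actual, desired):
    """Cost of one result list; lower is better under this reading.

    Both arguments map a page URL to its 1-based rank for one keyword.
    """
    cost = 0
    for page, wanted_rank in desired.items():
        if page not in actual:
            cost += MISSING_PENALTY
            continue
        distance = abs(actual[page] - wanted_rank)
        # Assumption: the rank distance from Fitness Function 1 is still summed.
        cost += distance if distance else -PERFECT_REWARD
    return cost
```

With three desired pages, one perfect, one off by a single position, and one missing, this yields -10 + 1 + 100 = 91.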

  14. IV. Preliminary Results • 12 tests using fitness function #2 • 1 realistic set of desired rankings • 11 random sets • 4 tests obtained perfect rankings • 4 improved rankings, but did not achieve optimal • 4 tests showed no improvement

  15. IV. Preliminary Results • [Figure; labels: Htdig default weights, Fitness Function #2]

  16. IV. Preliminary Results • [Figure; labels: Fitness Function #2, Htdig default weights]

  17. V. Future Work – Fitness Function 3: Levenshtein Distance • D = string 1; A = string 2 • Construct an m × n matrix M, where m = |D| + 1 and n = |A| + 1 • M[i,0] = i and M[0,j] = j • For each remaining cell (characters indexed from 1): if D[i] == A[j] then cost = 0; if D[i] != A[j] then cost = 1 • M[i,j] = MIN {a, b, c}, where a = M[i-1,j] + 1, b = M[i,j-1] + 1, c = M[i-1,j-1] + cost • Distance = M[|D|,|A|], the bottom-right cell • Example with D = FROM (rows) and A = FARM (columns), distance 2:

             F  A  R  M
          0  1  2  3  4
       F  1  0  1  2  3
       R  2  1  1  1  2
       O  3  2  2  2  2
       M  4  3  3  3  2
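The recurrence on this slide translates directly into code. The sketch below follows the slide's matrix construction; the function names are illustrative, and Python's 0-based string indexing is why the character comparison uses `i - 1` and `j - 1`.

```python
def levenshtein_matrix(d, a):
    """Build the (len(d) + 1) x (len(a) + 1) edit-distance matrix."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                      # first column: delete i characters of d
    for j in range(cols):
        m[0][j] = j                      # first row: insert j characters of a
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # deletion
                          m[i][j - 1] + 1,         # insertion
                          m[i - 1][j - 1] + cost)  # match or substitution
    return m


def levenshtein(d, a):
    """The distance is the bottom-right cell of the matrix."""
    return levenshtein_matrix(d, a)[-1][-1]
```

Calling `levenshtein("FROM", "FARM")` reproduces the slide's example, with the matrix rows matching the table for FROM against FARM.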

  18. V. Future Work – Fitness Function 3: Levenshtein Distance • Reduce the URL comparison to string comparison • Experiment further using LD as a fitness function • Sigmoid weighting function to increase the importance of the front of the string
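One possible reading of these bullets, offered only as a sketch: treat each ranked list of URLs as a sequence of symbols, and scale every edit cost by a logistic weight that is high at the front of the list and low at the back. The slides do not specify the weighting, so the midpoint, the steepness, the choice to weight by position, and both function names are assumptions.

```python
import math


def front_weight(position, length, steepness=1.0):
    """Logistic weight: near 1 at the front of the list, near 0 at the back."""
    x = steepness * (position - (length - 1) / 2)
    if x > 700:                # avoid overflow in exp for very long lists
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def weighted_ranking_distance(desired, actual, steepness=1.0):
    """Edit distance between two ranked URL lists with position-weighted costs."""
    n = len(desired)
    rows, cols = len(desired) + 1, len(actual) + 1

    def w(pos):
        return front_weight(pos, n, steepness)

    m = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + w(i - 1)
    for j in range(1, cols):
        m[0][j] = m[0][j - 1] + w(j - 1)
    for i in range(1, rows):
        for j in range(1, cols):
            same = desired[i - 1] == actual[j - 1]
            m[i][j] = min(
                m[i - 1][j] + w(i - 1),                         # deletion
                m[i][j - 1] + w(j - 1),                         # insertion
                m[i - 1][j - 1] + (0.0 if same else w(i - 1)),  # substitution
            )
    return m[-1][-1]
```

Under this weighting, a wrong page in first place costs more than a wrong page in last place, which is the effect the sigmoid bullet asks for.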

  19. V. Future Work • Create more extensive test sets • dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org

  20. V. Future Work

  21. V. Future Work

  22. V. Future Work • For pages that still do not rank properly, create optimization suggestions • Use custom meta tags to properly rank outliers • Use implicit user feedback to find the desired rankings

  23. VI. Summary • Proof of concept • Testing on real world websites will strengthen results and open other areas of study.

  24. VII. Questions • Thanks for attending • Any questions?
