This presentation explores the use of genetic algorithms to optimize webpage ranking via search engine adjustments and web page suggestions. Preliminary results, future work on a Levenshtein Distance fitness function, and optimization suggestions are discussed.
A Search Engine That Learns • Jeff Elser – jelser@cs.montana.edu • John Paxton – paxton@cs.montana.edu • Montana State University – Bozeman
Presentation Outline • Problem • Background Information • Approach • Preliminary Results • Future Work • Summary • Questions
I. Problem • RightNow software use • Spidering and searching • Website optimization • Optimizing page by page is tedious and time-consuming • Dual ownership should allow perfect optimization • Solutions • Search engine adjustments • Suggesting specific web page changes
II. Background – Search Engine • Spidering • Indexing • Weighting factors
II. Background – Genetic Algorithms • Goldberg’s Simple GA • Mutation • Crossover • Elitism • Non-overlapping populations • Several fitness functions • Example individuals: Individual 1, fitness = 2; Individual 2, fitness = 4
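For orientation, here is a minimal sketch of one generation of a Goldberg-style simple GA over real-valued genomes, assuming roulette-wheel selection and one-point crossover. It is illustrative only, not the GAlib code the project actually uses, and the function name evolve() is hypothetical.

```python
import random

def evolve(pop, fitness, p_cross=0.6, p_mut=0.01, lo=0.0, hi=1000.0):
    """One generation of a simple GA over real-valued list genomes
    (roulette selection, one-point crossover, per-gene mutation, elitism)."""
    scores = [fitness(g) for g in pop]
    elite = max(zip(scores, pop), key=lambda sg: sg[0])[1][:]   # elitism: keep best

    def select():
        # roulette-wheel selection proportional to fitness (assumes scores > 0)
        pick, acc = random.uniform(0, sum(scores)), 0.0
        for s, g in zip(scores, pop):
            acc += s
            if acc >= pick:
                return g
        return pop[-1]

    nxt = [elite]                                # non-overlapping population, seeded with elite
    while len(nxt) < len(pop):
        a, b = select()[:], select()[:]
        if random.random() < p_cross:            # one-point crossover
            cut = random.randrange(1, len(a))
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        for child in (a, b):
            for i in range(len(child)):          # per-gene mutation
                if random.random() < p_mut:
                    child[i] = random.uniform(lo, hi)
            nxt.append(child)
    return nxt[:len(pop)]
```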
III. Approach • Architecture • Training data • Testing controls (website source) • GA specifics • Fitness functions
B. Training Data • Website source • 20,000 newsgroup articles from the UCI Knowledge Discovery in Databases Archive • Hand-formatted as HTML • Chosen for word count and structure
C. Testing Controls • Webmaster provides training data • List of important keywords • Associated ranked pages • Tedious, but trivial compared to optimizing all pages
D. GA Specifics • Random initial population • Population size 1000 • Used GAlib’s built-in random number generator • Genome: 16 real numbers corresponding to the 16 weighting factors • Range 0.0 – 1000.0
D. GA Specifics • GA executes for 10000 generations • Elitism is turned on • Mutation probability = 0.01 • Crossover probability = 0.6
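As an illustration of how these settings fit together, here is a hedged sketch of the population setup, assuming list-of-float genomes. The names, the placeholder fitness, and the commented driver loop are hypothetical and stand in for the project's actual GAlib configuration.

```python
import random

# Parameter values are the ones reported on these slides; everything else is illustrative.
N_WEIGHTS      = 16          # one gene per weighting factor
POP_SIZE       = 1000
GENERATIONS    = 10000
P_MUT, P_CROSS = 0.01, 0.6
LO, HI         = 0.0, 1000.0

def random_genome():
    """A genome is 16 real-valued weights, each drawn from [0.0, 1000.0]."""
    return [random.uniform(LO, HI) for _ in range(N_WEIGHTS)]

population = [random_genome() for _ in range(POP_SIZE)]

# Each of the 10000 generations would then apply selection, crossover (p = 0.6),
# mutation (p = 0.01), and elitism, e.g. with the evolve() sketch shown in the
# GA background section:
# for _ in range(GENERATIONS):
#     population = evolve(population, fitness, P_CROSS, P_MUT, LO, HI)
```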
E. Fitness Function 1 • Fitness = 1 / (∑D + 1) • D = |(actual ranking) – (desired ranking)| for each page • +1 avoids division by 0
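A minimal sketch of how fitness function 1 could be computed, assuming the fitness is the reciprocal of the summed rank error (implied by the "+1 to avoid division by 0") and that rankings are given as URL-to-position maps; that representation is an assumption, not something the slides specify.

```python
def fitness_1(actual_ranking, desired_ranking):
    """Fitness function 1: reciprocal of the summed rank error.

    Both arguments map page URL -> rank position for one keyword query;
    assumes every desired page appears in the results (function 2 relaxes this).
    """
    total = sum(abs(actual_ranking[url] - desired_ranking[url])
                for url in desired_ranking)
    return 1.0 / (total + 1)          # +1 avoids division by zero
```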
E. Fitness Function 2 • +100 penalty for pages that don’t appear in the results • –10 reward for pages with a perfect fit
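A sketch of the corresponding error term for fitness function 2, assuming the penalty and reward are simply added to fitness function 1's rank-error sum; exactly how the slide combines them is not stated, so this combination is an assumption.

```python
def error_2(actual_ranking, desired_ranking):
    """Error term for fitness function 2 (lower is better)."""
    total = 0
    for url, want in desired_ranking.items():
        if url not in actual_ranking:
            total += 100              # penalty: desired page missing from the results
        elif actual_ranking[url] == want:
            total -= 10               # reward: page ranked exactly where desired
        else:
            total += abs(actual_ranking[url] - want)
    return total
```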
IV. Preliminary Results • 12 tests using fitness function #2 • 1 realistic set of desired rankings • 11 random sets • 4 tests obtained perfect rankings • 4 improved rankings, but did not achieve optimal • 4 tests showed no improvement
IV. Preliminary Results • [Charts: page rankings produced by htdig’s default weights vs. the weights evolved with Fitness Function #2]
V. Future Work – Fitness Function 3: Levenshtein Distance
• D = string 1; A = string 2
• Construct an m×n matrix M, where m = |D| + 1 and n = |A| + 1
• M[i,0] = i and M[0,j] = j
• For each remaining cell: cost = 0 if D[i] == A[j], else cost = 1; M[i,j] = MIN(M[i-1,j] + 1, M[i,j-1] + 1, M[i-1,j-1] + cost)
• Distance = M[|D|, |A|]
• Example (D = "FROM", A = "FARM", distance = 2):
        F   A   R   M
    0   1   2   3   4
F   1   0   1   2   3
R   2   1   1   1   2
O   3   2   2   2   2
M   4   3   3   3   2
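The recurrence above transcribes directly into a short, runnable sketch (illustrative only; the presentation does not show code):

```python
def levenshtein(d, a):
    """Edit distance between strings d and a, following the matrix recurrence
    on this slide. levenshtein("FROM", "FARM") == 2."""
    m, n = len(d) + 1, len(a) + 1
    M = [[0] * n for _ in range(m)]
    for i in range(m):
        M[i][0] = i                              # deleting i characters from d
    for j in range(n):
        M[0][j] = j                              # inserting j characters of a
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            M[i][j] = min(M[i - 1][j] + 1,       # deletion
                          M[i][j - 1] + 1,       # insertion
                          M[i - 1][j - 1] + cost)  # substitution
    return M[m - 1][n - 1]
```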
V. Future Work – Fitness Function 3: Levenshtein Distance • Reduce the comparison of ranked URL lists to a string comparison • Experiment further using LD as a fitness function • Sigmoid weighting function to increase the importance of the front of the string
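One possible shape for the sigmoid idea, shown only as a sketch: weight each position so that mismatches near the front of the ranking string contribute more. The steepness, the midpoint, and how the weight would enter the Levenshtein cost are all assumptions.

```python
import math

def front_weight(pos, length, steepness=1.0):
    """Sigmoid-style weight in (0, 1): near 1 at the front of the string,
    falling toward 0 at the end (midpoint at length / 2)."""
    return 1.0 / (1.0 + math.exp(steepness * (pos - length / 2.0)))

# e.g., scale the per-cell cost in the Levenshtein recurrence by
# front_weight(j, len(a)) so early mismatches are penalized more.
```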
V. Future Work • Create more extensive test sets • dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org
V. Future Work • For pages that still do not rank properly, create optimization suggestions • Use custom meta tags to properly rank outliers • Use implicit user feedback to find the desired rankings
VI. Summary • Proof of concept • Testing on real-world websites will strengthen the results and open other areas of study.
VII. Questions • Thanks for attending • Any questions?