A Search Engine That Learns Jeff Elser – jelser@cs.montana.edu John Paxton – paxton@cs.montana.edu Montana State University - Bozeman
Presentation Outline • Problem • Background Information • Approach • Preliminary Results • Future Work • Summary • Questions
I. Problem • RightNow software use • Spidering and searching • Website optimization • Page by page is tedious and time consuming • Dual ownership should allow perfect optimization • Solutions • Search engine adjustments • Suggesting specific web page changes
II. Background – Search Engine • Spidering • Indexing • Weighting factors
II. Background – Genetic Algorithms • Goldberg’s Simple GA • Mutation • Crossover • Elitism • Non-overlapping populations • Several fitness functions • [Figure: two example individuals, Individual 1 with fitness = 2 and Individual 2 with fitness = 4]
III. Approach • Architecture • Training data • Testing controls (website source) • GA specifics • Fitness functions
B. Training Data • Website source • 20000 newsgroup articles from the UCI Knowledge Discovery in Databases Archive • Hand-formatted HTML • Chosen for word count and structure
C. Testing Controls • Webmaster provides training data • List of important keywords • Associated ranked pages • Tedious, but trivial compared to optimizing all pages
D. GA Specifics • Random initial population • Population size 1000 • Used GAlib’s built-in random number generator • Genome • 16 real numbers corresponding to the 16 weighting factors • Range 0.0 – 1000.0
D. GA Specifics • GA executes for 10000 generations • Elitism is turned on • Mutation probability = 0.01 • Crossover probability = 0.6
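The GA settings on these two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' GAlib code: the binary tournament selection, single-point crossover, and uniform-reset mutation are assumed operators, `score` is a hypothetical callable where lower is better, and all function names are made up for this sketch.

```python
import random

GENOME_LENGTH = 16                 # one gene per weighting factor
GENE_MIN, GENE_MAX = 0.0, 1000.0   # gene range from the slides
P_MUTATION = 0.01
P_CROSSOVER = 0.6


def random_genome(rng):
    """Draw 16 real-valued genes uniformly from the allowed range."""
    return [rng.uniform(GENE_MIN, GENE_MAX) for _ in range(GENOME_LENGTH)]


def select(population, scores, rng):
    """Binary tournament selection (an assumption); lower score wins."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if scores[a] <= scores[b] else population[b]


def crossover(mom, dad, rng):
    """Single-point crossover, applied with probability P_CROSSOVER."""
    if rng.random() < P_CROSSOVER:
        cut = rng.randrange(1, GENOME_LENGTH)
        return mom[:cut] + dad[cut:]
    return mom[:]


def mutate(genome, rng):
    """Reset each gene to a fresh random value with probability P_MUTATION."""
    return [rng.uniform(GENE_MIN, GENE_MAX) if rng.random() < P_MUTATION else g
            for g in genome]


def evolve(score, pop_size=1000, generations=10000, seed=0):
    """Run a simple GA and return the best genome found."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [score(g) for g in population]
        best = population[scores.index(min(scores))]
        # Non-overlapping populations: every generation is rebuilt from
        # scratch. Elitism: the best individual is copied over unchanged.
        children = [best[:]]
        while len(children) < pop_size:
            child = crossover(select(population, scores, rng),
                              select(population, scores, rng), rng)
            children.append(mutate(child, rng))
        population = children
    scores = [score(g) for g in population]
    return population[scores.index(min(scores))]
```

In the setting described here, `score` would run the search engine with the 16 candidate weights and compare the resulting rankings with the desired ones. For a quick trial, a much smaller `pop_size` and `generations` than the values on the slides keeps the run time short.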
D. Fitness Function 1 • ∑D • D = |(actual ranking) – (desired ranking)| • +1 to avoid division by 0
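As a rough illustration of fitness function 1, the sketch below sums the absolute rank differences for one keyword. The dictionary layout (page name mapped to rank), the function names, and the reading of "+1 to avoid division by 0" as a reciprocal 1 / (∑D + 1) are all assumptions; the slide does not spell out the exact formula. The sketch also assumes every desired page appears in the actual results.

```python
def ranking_distance(actual, desired):
    """Sum of D = |actual ranking - desired ranking| over the desired pages.

    Both arguments map a page name to its rank (1 = top result).
    Every desired page is assumed to be present in `actual`.
    """
    return sum(abs(actual[page] - rank) for page, rank in desired.items())


def fitness_1(actual, desired):
    """Assumed form: reciprocal of the distance, with +1 to avoid dividing by 0."""
    return 1.0 / (ranking_distance(actual, desired) + 1)


desired = {"a.html": 1, "b.html": 2, "c.html": 3}
actual = {"a.html": 2, "b.html": 1, "c.html": 3}
print(ranking_distance(actual, desired))  # 2
print(fitness_1(desired, desired))        # 1.0 for a perfect ranking
```

Under this reading, higher values are better and a perfect ranking scores exactly 1.0.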
D. Fitness Function 2 • +100 penalty for pages that don’t appear • -10 reward for pages with a perfect fit
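One possible reading of fitness function 2 is sketched below. It assumes the score is minimized, that "perfect fit" means a page sits exactly at its desired rank, and that all other pages contribute their rank difference as in fitness function 1. The names `fitness_2`, `MISSING_PENALTY`, and `PERFECT_REWARD` are hypothetical.

```python
MISSING_PENALTY = 100   # added when a desired page is absent from the results
PERFECT_REWARD = 10     # subtracted when a page sits exactly at its desired rank


def fitness_2(actual, desired):
    """Score one keyword's ranking; lower is better (assumed convention).

    Both arguments map a page name to its rank (1 = top result).
    """
    score = 0
    for page, desired_rank in desired.items():
        if page not in actual:
            score += MISSING_PENALTY
        elif actual[page] == desired_rank:
            score -= PERFECT_REWARD
        else:
            score += abs(actual[page] - desired_rank)
    return score


desired = {"a.html": 1, "b.html": 2, "c.html": 3}
actual = {"a.html": 1, "b.html": 3}   # c.html does not appear at all
print(fitness_2(actual, desired))     # -10 + 1 + 100 = 91
```

The large gap between the penalty and the reward means that, in this sketch, getting a missing page into the results always matters more than fine-tuning the order of pages that already appear.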
IV. Preliminary Results • 12 tests using fitness function #2 • 1 realistic set of desired rankings • 11 random sets • 4 tests obtained perfect rankings • 4 improved rankings, but did not achieve optimal • 4 tests showed no improvement
IV. Preliminary Results • [Figure comparing Htdig default weights and fitness function #2]
IV. Preliminary Results • [Figure comparing fitness function #2 and Htdig default weights]
V. Future Work – Fitness Function 3: Levenshtein Distance • D = string 1; A = string 2 • Construct an m × n matrix M where m = |D|+1 and n = |A|+1 • M[i,0] = i and M[0,j] = j • For each remaining cell: if D[i] == A[j] then cost = 0, otherwise cost = 1; M[i,j] = MIN {a, b, c} where a = M[i-1,j] + 1, b = M[i,j-1] + 1, c = M[i-1,j-1] + cost • Distance = M[|D|,|A|], the bottom-right cell • Example with D = FROM (rows) and A = FARM (columns), distance = 2:
        F  A  R  M
     0  1  2  3  4
  F  1  0  1  2  3
  R  2  1  1  1  2
  O  3  2  2  2  2
  M  4  3  3  3  2
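The matrix construction on this slide can be written as a short Python function. This is a minimal sketch of the standard dynamic-programming formulation; the function name `levenshtein` is chosen here for illustration. Python strings are 0-indexed, so character i of D in the slide's notation is `d[i - 1]` in the code.

```python
def levenshtein(d, a):
    """Return the Levenshtein (edit) distance between strings d and a."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    # First column and first row: distance from the empty prefix.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # deletion
                          m[i][j - 1] + 1,         # insertion
                          m[i - 1][j - 1] + cost)  # match or substitution
    return m[rows - 1][cols - 1]


print(levenshtein("FROM", "FARM"))  # 2, matching the slide's example
```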
V. Future Work – Fitness Function 3: Levenshtein Distance • Reduce the URL comparison to a string comparison • Experiment further using LD as a fitness function • Use a sigmoid weighting function to increase the importance of the front of the string
V. Future Work • Create more extensive test sets • dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org
V. Future Work • For pages that still do not rank properly, create optimization suggestions • Use custom meta tags to properly rank outliers • Use implicit user feedback to find the desired rankings
VI. Summary • Proof of concept • Testing on real world websites will strengthen results and open other areas of study.
VII. Questions • Thanks for attending • Any questions?