A Search Engine That Learns

A Search Engine That Learns • Jeff Elser – jelser@cs.montana.edu • John Paxton – paxton@cs.montana.edu • Montana State University - Bozeman

Presentation Outline • Problem • Background Information • Approach • Preliminary Results • Future Work • Summary • Questions

I. Problem • RightNow software use • Spidering and searching • Website optimization • Page by page is tedious and time consuming • Dual ownership should allow perfect optimization • Solutions • Search engine adjustments • Suggesting specific web page changes

II. Background – Search Engine • Spidering • Indexing • Weighting factors

II. Background – Genetic Algorithms • Goldberg’s Simple GA • Mutation • Crossover • Elitism • Non-overlapping populations • Several fitness functions • Example: Individual 1 Fitness = 2; Individual 2 Fitness = 4

III. Approach • Architecture • Training data • Testing controls (website source) • GA specifics • Fitness functions

A. Architecture

B. Training Data • Website source • 20,000 newsgroup articles from the UCI Knowledge Discovery in Databases Archive • Hand-formatted HTML • Chosen for word count and structure

C. Testing Controls • Webmaster provides training data • List of important keywords • Associated ranked pages • Tedious, but trivial compared to optimizing all pages

D. GA Specifics • Random initial population • Population size 1000 • Used GAlib’s built-in random number generator • Genome • 16 real numbers corresponding to the 16 weighting factors • Range 0.0 – 1000.0

D. GA Specifics • GA executes for 10000 generations • Elitism is turned on • Mutation probability = 0.01 • Crossover probability = 0.6
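The settings on the two GA slides can be combined into a minimal Python sketch. This is not the authors' GAlib code: the binary tournament selection, single-point crossover, re-draw mutation, the convention that a lower fitness is better, and the names `run_ga` and `toy_fitness` are all assumptions made for illustration.

```python
import random

GENOME_LEN = 16            # one gene per weighting factor
LOW, HIGH = 0.0, 1000.0    # gene range from the slides
P_MUT, P_CROSS = 0.01, 0.6 # mutation and crossover probabilities


def random_genome(rng):
    return [rng.uniform(LOW, HIGH) for _ in range(GENOME_LEN)]


def tournament(pop, scores, rng):
    # Assumed selection scheme: binary tournament, lower score wins.
    a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
    return pop[a] if scores[a] <= scores[b] else pop[b]


def run_ga(fitness, pop_size=1000, generations=10000, seed=0):
    rng = random.Random(seed)
    pop = [random_genome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(g) for g in pop]
        best = pop[scores.index(min(scores))]
        next_pop = [best[:]]  # elitism: carry the best individual forward
        while len(next_pop) < pop_size:
            p1 = tournament(pop, scores, rng)
            p2 = tournament(pop, scores, rng)
            child = p1[:]
            if rng.random() < P_CROSS:
                cut = rng.randrange(1, GENOME_LEN)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Mutation: each gene is re-drawn with probability P_MUT.
            child = [rng.uniform(LOW, HIGH) if rng.random() < P_MUT else x
                     for x in child]
            next_pop.append(child)
        pop = next_pop  # non-overlapping: the new generation replaces the old
    scores = [fitness(g) for g in pop]
    return pop[scores.index(min(scores))]


def toy_fitness(genome):
    # Hypothetical stand-in for a ranking-based fitness: distance from 500.
    return sum(abs(x - 500.0) for x in genome)
```

A call such as `run_ga(toy_fitness, pop_size=30, generations=40)` finishes quickly; the defaults mirror the slide values (population 1000, 10000 generations) and take far longer. In the actual system the fitness call would involve running a search with the candidate weights, which this sketch does not model.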

D. Fitness Function 1 • ∑D • D = |(actual ranking) – (desired ranking)| • +1 to avoid division by 0
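As a rough illustration of the sum on this slide, the sketch below adds up |actual ranking - desired ranking| over the pages the webmaster listed. The dictionary layout (URL mapped to a 1-based rank), the assumption that every desired page appears in the actual results, and the name `fitness_1` are hypothetical. The slide does not say where the "+1" is applied, so it is left out here.

```python
def fitness_1(actual, desired):
    """Sum of |actual rank - desired rank| over the desired pages."""
    return sum(abs(actual[url] - rank) for url, rank in desired.items())


desired = {"a.html": 1, "b.html": 2, "c.html": 3}
actual = {"a.html": 2, "b.html": 1, "c.html": 3}
print(fitness_1(actual, desired))  # 2
```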

D. Fitness Function 2 • +100 penalty for pages that don’t appear • -10 reward for pages with a perfect fit
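One possible reading of these two adjustments, sketched in Python: keep the rank-difference sum, add 100 for each desired page missing from the results, and subtract 10 for each page found at exactly its desired rank. How the penalty and reward combine with the base sum is an assumption, as are the dictionary layout and the name `fitness_2`. Lower scores are better under this reading.

```python
MISSING_PENALTY = 100  # page does not appear in the results
PERFECT_REWARD = 10    # page appears at exactly the desired rank


def fitness_2(actual, desired):
    score = 0
    for url, want in desired.items():
        got = actual.get(url)
        if got is None:
            score += MISSING_PENALTY
        elif got == want:
            score -= PERFECT_REWARD
        else:
            score += abs(got - want)
    return score


desired = {"a.html": 1, "b.html": 2, "c.html": 3}
actual = {"a.html": 1, "c.html": 2}  # b.html is missing
print(fitness_2(actual, desired))  # -10 + 100 + 1 = 91
```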

IV. Preliminary Results • 12 tests using fitness function #2 • 1 realistic set of desired rankings • 11 random sets • 4 tests obtained perfect rankings • 4 improved rankings, but did not achieve optimal • 4 tests showed no improvement

IV. Preliminary Results • [Two figures comparing Htdig default weights with Fitness Function #2]

V. Future Work – Fitness Function 3: Levenshtein Distance • D = string 1; A = string 2 • Construct an m×n matrix M where m = |D|+1 and n = |A|+1 • M[i,0] = i and M[0,j] = j • For each remaining cell M[i,j] (strings indexed from 1): cost = 0 if D[i] == A[j], otherwise cost = 1; M[i,j] = MIN {a, b, c} where a = M[i-1,j] + 1, b = M[i,j-1] + 1, c = M[i-1,j-1] + cost • Distance = M[|D|,|A|], the bottom-right cell

Example with D = FROM (rows) and A = FARM (columns); the distance is 2:

         F  A  R  M
      0  1  2  3  4
   F  1  0  1  2  3
   R  2  1  1  1  2
   O  3  2  2  2  2
   M  4  3  3  3  2
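The matrix procedure on this slide translates fairly directly into Python. The sketch below is a plain dynamic-programming version, not the authors' code; the function name `levenshtein` is chosen here for illustration, and Python's 0-based indexing means character `i` of the slide is `d[i - 1]` in the code.

```python
def levenshtein(d, a):
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i  # first column: delete i characters of d
    for j in range(cols):
        m[0][j] = j  # first row: insert j characters of a
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # deletion
                          m[i][j - 1] + 1,         # insertion
                          m[i - 1][j - 1] + cost)  # match or substitution
    return m[rows - 1][cols - 1]  # bottom-right cell


print(levenshtein("FROM", "FARM"))  # 2
```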

V. Future Work – Fitness Function 3: Levenshtein Distance • Reduce the URL comparison to string comparison • Experiment further using LD as a fitness function • Sigmoid weighting function to increase the importance of the front of the string
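The slides do not define the sigmoid weighting, so the following is only a guess at its general shape. It treats each URL as a single symbol, so a ranked result list becomes a string of symbols, and it uses a decreasing sigmoid so that mismatches near the front cost more than mismatches near the tail. To keep the weighting visible it uses a simple position-by-position mismatch count rather than full Levenshtein distance. The names `position_weight` and `weighted_mismatch`, and the `midpoint` and `steepness` parameters, are hypothetical.

```python
import math


def position_weight(i, midpoint=5.0, steepness=1.0):
    # Decreasing sigmoid: near 1 at the front of the list, near 0 at the tail.
    return 1.0 / (1.0 + math.exp(steepness * (i - midpoint)))


def weighted_mismatch(actual_urls, desired_urls):
    # Each URL is one symbol; compare the two lists position by position.
    total = 0.0
    for i, want in enumerate(desired_urls):
        got = actual_urls[i] if i < len(actual_urls) else None
        if got != want:
            total += position_weight(i)
    return total


desired = ["p%d.html" % k for k in range(10)]
front_swap = [desired[1], desired[0]] + desired[2:]
back_swap = desired[:8] + [desired[9], desired[8]]
# Swapping the top two results costs more than swapping the last two.
print(weighted_mismatch(front_swap, desired) > weighted_mismatch(back_swap, desired))
```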

V. Future Work • Create more extensive test sets • dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org

V. Future Work • For pages that still do not rank properly, create optimization suggestions • Use custom meta tags to properly rank outliers • Use implicit user feedback to find the desired rankings

VI. Summary • Proof of concept • Testing on real-world websites will strengthen the results and open other areas of study

VII. Questions • Thanks for attending • Any questions?
