Using your Users’ Taxonomy: Improving Search with Analytics John Ferrara, Information Architect, Vanguard
Our story – Chapter 1 • In fall of 2007, we were transitioning to a new search engine technology • Information architects participated in product selection and visioning • IA was less involved once implementation started (“It’s beautiful!” “Ahh!” “Ooo!”)
Chapter 2: First signs of a problem • Project manager noticed some searches didn’t work that well • Asked for help evaluating the quality of results • I tried a few searches and agreed that they seemed to be underperforming
Chapter 3: Bringing it to the developers • Told the development team that results seemed off, but they were skeptical. (Information Architect & Dev Team Meeting: “Search seems to have a few problems…” “Where’s the proof?” “You can’t tell for sure.” “Nah.”)
Stage 1 – Blind fury (Information Architect & Dev Team Meeting: “DO NOT QUESTION THE INFORMATION ARCHITECT!!”)
Stage 2 – Getting over yourself Wait, they have a point…
Unsound method for evaluation • The tested searches came from our formal taxonomy • Users might not describe things the same way • Users might not be interested in the same things • All anecdotal, no metrics • The handful of searches I tried didn’t do well • Thousands of different searches are submitted each day • Provided no basis for comparison • By what standard do we measure good or bad performance? • How will we know when it’s good enough?
Chapter 4: Recognizing an opportunity • We have: • The most popular searches (our users’ own taxonomy) • The legacy search engine in production • The new search engine running in dev • Excel • All we need is a method
Developed 2 testing methods • Relevancy: How reliably the search engine returns the best matches first. • Quick & easy • Limited insight • Precision: The proportion of the results clustered at the top of the list that are relevant. • Longer & more difficult • Robust insight Both use the users’ taxonomy
Relevancy test, step 1 • Go to the most common queries report • Skip any phrase where: • There’s more than one best target • There is no relevant content on the site • You’re not sure what the user is trying to find • Keep the rest • Try to get enough that the results will be quantitatively significant
For example… • There’s more than one best target: “Registrar” could either refer to the University registrar or the Law School registrar, which have different pages. Neither one is more plausible than the other. • There is no relevant content on the site: “Football” has a single clear best target, but it’s hosted on a separate site that’s not indexed in the search engine. This is a problem, but it’s not the fault of the engine. • You’re not sure what the user is trying to find: “Parking” is a very common search, but it’s vague. It could refer to student parking, event parking, parking registration, visitor parking, or parking tickets.
Apparent intention (awfully important) • Your judgment of the user’s intention impacts results (example phrase: “campus map”). • Actual intention: • What the user really had in mind • Can’t get this from search logs • Apparent intention: • How a reasonable person would interpret a search phrase • Search should be held to the human standard, but cannot be expected to do any better • When in doubt, skip it (there’s no shortage of search phrases). You only want to keep phrases where you’re very confident of the user’s intended meaning.
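Here is a minimal sketch of how this narrowing step might be captured outside Excel, assuming the most common queries report has been exported to a CSV with a `phrase` and a `count` column (hypothetical file and column names); the keep/skip judgment itself is still made by a person.

```python
import csv

# Hypothetical export of the "most common queries" report: one phrase,count pair per row.
REPORT_FILE = "top_queries.csv"

# Manual judgments keyed by phrase; anything listed here is skipped, with the reason recorded.
skip_reasons = {
    "registrar": "more than one best target",
    "football": "no relevant content indexed on the site",
    "parking": "apparent intention unclear",
}

kept = []
with open(REPORT_FILE, newline="") as f:
    for row in csv.DictReader(f):
        phrase = row["phrase"].strip().lower()
        if phrase in skip_reasons:
            continue  # excluded from the relevancy test
        kept.append(phrase)

print(f"{len(kept)} phrases kept for the relevancy test")
```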
Relevancy test, step 2 • Put the narrowed list of search phrases into a spreadsheet • Add the title of the best target • Add the URL of the best target
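For illustration only, the step-2 worksheet could just as easily be written out by a small script; the phrases, titles, and URLs below are placeholders rather than real test data.

```python
import csv

# Each row of the step-2 worksheet: the user's phrase plus its single best target.
worksheet = [
    # (search phrase, best-target title, best-target URL) - placeholder values
    ("campus map", "Campus Map", "https://www.example.edu/map"),
    ("academic calendar", "Academic Calendar", "https://www.example.edu/calendar"),
]

with open("relevancy_worksheet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["phrase", "best_target_title", "best_target_url"])
    writer.writerows(worksheet)
```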
Relevancy test, step 3 • Search for the users’ phrases • For some phrases the best target comes back right at the top of the results (e.g., ranked #1) • Not all phrases may work that well: in one example the best target didn’t appear until #17 • Record each target’s distance from the top of the list
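A small sketch of the rank lookup behind this step, assuming each search returns an ordered list of result URLs (how you actually query the engine is up to you):

```python
def best_target_rank(result_urls, best_target_url, not_found_rank=None):
    """Return the 1-based position of the best target in the results,
    or not_found_rank if it never appears."""
    for position, url in enumerate(result_urls, start=1):
        if url.rstrip("/") == best_target_url.rstrip("/"):
            return position
    return not_found_rank

# Example: the best target shows up 17th in a list of result URLs.
results = [f"https://www.example.edu/page{i}" for i in range(1, 30)]
print(best_target_rank(results, "https://www.example.edu/page17"))  # -> 17
```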
Relevancy test, step 4 • Go to the results tab • Mean: Average distance from the top • Median: Less sensitive to outliers, but not useful once at least half are ranked #1 • Count - Below 1st: How often is the best target something other than 1st? • Count – Below 5th: How often is the best target outside the critical area? • Count – Below 10th: How often is the best target beyond the first page? For all numbers, the lower the better
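The results-tab figures can be reproduced with a few lines of code; the ranks below are illustrative values, and lower is better for every number.

```python
from statistics import mean, median

# Recorded best-target positions for each kept phrase (illustrative values).
ranks = [1, 1, 2, 1, 17, 4, 1, 9, 1, 3]

summary = {
    "mean": mean(ranks),
    "median": median(ranks),
    "below 1st": sum(r > 1 for r in ranks),    # best target is not the first result
    "below 5th": sum(r > 5 for r in ranks),    # best target is outside the critical area
    "below 10th": sum(r > 10 for r in ranks),  # best target is beyond the first page
}
print(summary)
```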
Shortcomings of relevancy testing • Has to skip some phrasings • Looking for the “best target” ignores the quality of other results • Tells a narrow story of search performance Precision testing closes these gaps.
What is precision? • Precision = number of relevant results ÷ total number of results • In other words, how many of the results that the search engine returns are of good quality? • Users don’t look at all of the results, so we limit the test to the top few.
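As a worked example of the formula: if 3 of the top 5 results are judged relevant, precision at that depth is 3 ÷ 5 = 0.6. The same arithmetic as a tiny sketch (the cutoff of 5 reflects the method described later, not the formula itself):

```python
def precision_at_k(relevance_flags, k=5):
    """Fraction of the top-k results judged relevant (True/False flags in rank order)."""
    top = relevance_flags[:k]
    return sum(top) / len(top)

print(precision_at_k([True, False, True, True, False]))  # -> 0.6
```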
Precision test, step 1 • Again, work from the users’ taxonomy • This time we don’t eliminate any phrasings • Transpose the phrases directly to the spreadsheet
Precision test, step 2 • Search for the users’ phrases
Evaluate relevance on a scale • R – Relevant: Based on the information the user provided, the page’s ranking is completely relevant. • N – Near: The page is not a perfect match, but it’s clearly reasonable for it to be ranked highly. • M – Misplaced: You can see why the search engine returned it, but it should not be ranked highly. • I – Irrelevant: The result has no apparent relationship to the user’s search. (Each rating was illustrated with an example search result.)
Use a mnemonic R – Relevant N – Near M – Misplaced I – Irrelevant R – Ralph N – Nader M – Makes I – Igloos Ralph Nader image by Don LaVange Igloo image by NOAA
Precision test, step 3 • Record the ratings of the top 5 results from each search in the spreadsheet
Calculating precision • Precision depends upon what you count as permissible • Our method specifies three parallel standards: • Strict – Only counts completely relevant results • Loose – Counts relevant and near results • Permissive – Counts relevant, near, and misplaced results
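A sketch of how the three standards could be computed from the R/N/M/I ratings of the top results; the rating strings are illustrative, and the cutoffs mirror the strict, loose, and permissive definitions above.

```python
# R/N/M/I ratings for the top 5 results of each tested phrase (illustrative values).
ratings_by_phrase = {
    "campus map": "RRNMI",
    "parking": "NMIIR",
}

STANDARDS = {
    "strict": set("R"),        # completely relevant only
    "loose": set("RN"),        # relevant + near
    "permissive": set("RNM"),  # relevant + near + misplaced
}

def precision_scores(ratings):
    """Precision of one phrase's top results under each standard."""
    return {name: sum(r in allowed for r in ratings) / len(ratings)
            for name, allowed in STANDARDS.items()}

for phrase, ratings in ratings_by_phrase.items():
    print(phrase, precision_scores(ratings))
```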
Precision test, step 4 • Go to the results tab For these numbers, the higher the better
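Assuming per-phrase scores like the ones sketched above, the results-tab figures would simply be their averages under each standard:

```python
from statistics import mean

# Per-phrase precision under each standard (values match the earlier sketch).
per_phrase = [
    {"strict": 0.4, "loose": 0.6, "permissive": 0.8},
    {"strict": 0.2, "loose": 0.4, "permissive": 0.6},
]

overall = {standard: mean(scores[standard] for scores in per_phrase)
           for standard in ("strict", "loose", "permissive")}
print(overall)  # higher is better for every figure
```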
Chapter 5: Bringing back the data • The case for change was more compelling because people could see the data and trust it. (Information Architect & Dev Team Meeting: “Now I see the problem.” “We need to fix this.” “Ah!”)
Evaluating the evaluations • Relevancy testing: Quick & easy • Provides actionable metrics • Has to skip some phrasings • Focused on a “best target”, ignores the quality of other results • Tells a narrow story of search performance • Precision testing: Longer & more difficult • Provides actionable metrics • Doesn’t skip any phrasings • Factors in results that are close enough, making it more realistic • Tells a robust story of search performance
Questions? I’m all ears!