A few months back, I was asked to evaluate my company’s current search solution against another search engine to determine whether it would be worthwhile to implement a new one. I’ve done package / tool evaluations in the past, but this one felt a bit different: I needed to integrate a fairly standard requirements-based evaluation with a measure of the quality of the search results themselves, which is not easily expressed as concrete requirements.
So I set about the task and asked the SearchCoP for suggestions on how to evaluate the search results in a meaningful and supportable way. I received several useful responses, including suggestions from Avi Rappaport on a methodology for identifying a good representative set of search terms to use in an evaluation.
With my own experiences and those of the SearchCoP in hand, I came up with a process that I thought I would share here.
Two Components to the Evaluation
I split the assessment into two distinct parts. The first was a traditional “requirements-based” assessment, which let me capture support for a number of functional and architectural needs I could identify. Some examples of such requirements were:
- The ability to support multiple file systems;
- The ability to control the web crawler (independent of robots.txt or robots tags embedded in pages);
- The power and flexibility of the administration interface; etc.
The second part of the assessment was to measure the quality of the search results.
I’ll provide more details below for each part of the assessment, but the key for this assessment was to have a (somewhat) quantitative way to measure the overall effectiveness and power of the search engines. It might even be possible to quantitatively combine the measures of these two components, though I did not do so in this case.
For the first part, I used a simplified quality function deployment (QFD) matrix: I identified the various requirements to consider and assigned each a weight (level of importance). Based on some previous experience, I forced the weights to be either a 10 (very important – probably “mandatory” in a semantic sense), a 5 (desirable but not absolutely necessary) or a 1 (nice to have); I believe this provides a better spread in the final outcome.
Then I reviewed the search engines against those requirements and assigned each engine a “score”, again measured as a 10 (met out of the box), a 5 (met with some level of configuration), a 1 (met with some customization – i.e., probably some type of scripting or similar rather than configuration through an admin UI) or a 0 (does not and cannot meet the requirement).
The overall “score” for an engine was then measured as the sum of the product of the score and weight for each requirement.
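As a sketch of that arithmetic (the requirements, weights and scores below are invented for illustration, not taken from the actual evaluation):

```python
# Hypothetical requirements for one engine, as (weight, score) pairs.
# Weights: 10 = mandatory, 5 = desirable, 1 = nice to have.
# Scores: 10 = out of the box, 5 = configuration, 1 = customization, 0 = not met.
requirements = [
    (10, 10),  # multiple file systems: met out of the box
    (10, 5),   # crawler control: met with configuration
    (5, 1),    # admin UI flexibility: met with scripting
    (1, 0),    # a nice-to-have feature: not met
]

# Sum of (weight * score) over all requirements.
engine_score = sum(weight * score for weight, score in requirements)
print(engine_score)  # → 155
```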
This simplistic approach can give some areas of requirements too much weight in total. Because each requirement carries its own weight, an area you happen to specify in great detail can dominate simply because of that detail. For example, if you have 50 requirements and 30 of them fall in one area (say you listed 30 different file formats to support, each as a separate requirement), then a significant percentage of your overall score hinges on that one area. Sometimes that is OK, but often it is not.
In order to work around this, I took the following approach:
- Grouped requirements into a set of categories;
- The categories should reflect natural cohesiveness of the requirements but should also be defined in a way that each category is roughly equal in importance to other categories;
- Compute the total possible score for each category (which in my case was 10 * (total-weight-of-requirements-in-category));
- Compute the relative score of that category for a search engine by summing the product of that engine’s score and the weight of the requirements for that category; the relative score is that engine’s score divided by the total possible score for that category.
- Now average the relative scores across the categories and multiply by 100 (to get a number between 0 and 100)
This approach gives you a score for each engine between 0 and 100 and also gives each category a roughly equal effect on the total score.
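A sketch of that category-normalized calculation, again with invented categories, weights and scores:

```python
# Hypothetical data: each category maps to its (weight, score) pairs.
categories = {
    "crawling": [(10, 10), (5, 5)],
    "admin":    [(10, 5), (1, 1)],
    "security": [(5, 10), (5, 0)],
}

def normalized_total(categories):
    relative_scores = []
    for reqs in categories.values():
        possible = 10 * sum(w for w, _ in reqs)  # best possible in this category
        achieved = sum(w * s for w, s in reqs)   # engine's actual weighted score
        relative_scores.append(achieved / possible)
    # Average the per-category ratios and scale to 0-100, so each
    # category contributes equally regardless of how many requirements it holds.
    return 100 * sum(relative_scores) / len(relative_scores)

print(round(normalized_total(categories), 1))  # → 59.9
```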
If you are looking for some insights on categories of requirements you might want to include in your evaluation, I provide some of my thoughts in a subsequent post.
Search Results Quality
To measure the quality of search results, I took Avi’s insights from the SearchCoP and identified a set of specific searches that I wanted to measure. I identified the candidate searches by looking at the log files for the existing search solution on the site and pulling out a few searches that fell into each category Avi identified. The categories included:
- Simple queries
- Complex queries
- Common queries
- Spelling, typing and vocabulary errors
- Forced matching edge cases, including:
- Many matches
- Few matches
- No matches
Going into this, I assumed I did not necessarily know the “right” targets for these searches, so I enlisted some volunteers from a group of knowledgeable employees (content managers on the web site) to complete a survey I put together. The survey included a section where each participant executed each search against each search engine; the survey provided a direct link for each search, so participants did not have to go to a search screen and type in the terms themselves, which kept things simpler. Participants were then asked to score the quality of the results for each search engine on a scale of 1-5.
The survey also included some other questions about presentation of results, performance, etc. (even though we did not customize search result templates or tweak anything in the searches, we wanted to get a general sense of usability) and also included a section where users could define and rate their own searches.
The results from the survey were then analyzed to get an overall measure of result quality across this candidate set of searches for each search engine – essentially by aggregating the individual searches into average scores or similar.
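As an illustration of that aggregation (the engines, queries and ratings below are made up), one reasonable choice is to average per query first, so a query that happened to get more responses does not dominate the overall score:

```python
# Hypothetical survey data: ratings[engine][query] = list of 1-5 scores
# from the participants.
ratings = {
    "engine_a": {"simple query": [4, 5, 4], "no-match query": [2, 3, 2]},
    "engine_b": {"simple query": [3, 3, 4], "no-match query": [4, 4, 5]},
}

def average_quality(engine_ratings):
    # Average within each query, then across queries.
    per_query = [sum(scores) / len(scores) for scores in engine_ratings.values()]
    return sum(per_query) / len(per_query)

for engine, r in ratings.items():
    print(engine, round(average_quality(r), 2))
# → engine_a 3.33
# → engine_b 3.83
```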
Outcome of the Assessment
With the engines we were looking at, one was better on the administration / architectural requirements and the other was better on the search results – which makes for an interesting decision, I think.
The key takeaway for me from this process is that it is at least quantitative – one can argue over the set of requirements to include, or the weight of any particular requirement or the score of an engine on a particular requirement. However, the discussion can be held at that level instead of a more qualitative level (AKA “gut feel”).
Additionally, for search engines, taking a two-part approach ensures that each of these very important factors is included and reflected in the final outcome.
Issues with this Approach
In my own execution of this approach, I know there were some issues (though I believe the general methodology is sound), including, in no particular order:
- I defined the set of requirements myself (ideally, I would have liked input from others, but I’ve basically been a one-man show, and I don’t think others would have had much input or the time to provide it).
- I defined the weights for requirements (see above).
- I assigned the score for the requirements (again, see above).
- I did not have hands-on with each engine under consideration and had to lean a lot on documentation, demos and discussions with vendors.
- All summed up, I think the exact scores could be questioned but, given that I was the only resource, it worked reasonably well.
As for the survey / search results evaluation:
- I would have liked a larger population of participants, including people who did not know the site
- I would have liked a larger population of queries to be included, but I felt the number already was pretty large (about 40 pre-defined ones and ability for 10 more user-defined)
- I did not mask which engine produced which results. As Walter Underwood mentions (he referenced this post from the SearchCoP thread), that can cause some significant issues with reliability of measures.