In my last few posts, I have commented on the lack of standard measures for enterprise search (which, among other things, makes it hard to compare one solution to another) and suggested some criteria that standard measures should meet.
In this post, I am going to propose a few basic measures that I think meet those criteria and that any enterprise search solution should be able to provide. The labels themselves are not critical, but the meaning behind them is, I think, very important.
First, and most important, is a search. A search is a single action in which a user retrieves a set of results from the search engine. Different user experiences may “count” these events differently.
When a user starts the process (in my experience, typically with a search term typed into a box on a web page somewhere), that is a single search.
If that user navigates to a second page of results, that is another search. Navigating to a third page counts as yet another search, etc.
Applying a filter (if the user interface supports such) counts as yet another search.
Re-sorting results counts as yet another search.
In a browser-based experience, even a simple page refresh counts as another search (though if the interface caches results, a refresh might not actually retrieve a new result set from the search engine, so this one can be a bit “squishy”).
In a user experience with an infinite scroll, the act of a user scrolling to the bottom of one ‘chunk’ of results and thus triggering the interface to retrieve the next ‘chunk’ also counts as yet another search (this is effectively equivalent to paging through results, except it doesn’t require an explicit paging action by the user).
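To make this counting rule concrete, here is a minimal sketch in Python. The event log and its field names (“action”, “term”, etc.) are my own invention for illustration – any real implementation will have its own schema – but the idea is simply that every action that retrieves a result set from the engine increments the search count.

```python
# Minimal sketch: counting searches from a hypothetical event log.
# The field names are illustrative, not a standard.

# Each of these actions retrieves a result set, so each counts as one search.
SEARCH_ACTIONS = {"new_query", "next_page", "filter", "sort", "refresh", "scroll_chunk"}

def count_searches(events):
    """Count every event that retrieves a result set from the search engine."""
    return sum(1 for e in events if e["action"] in SEARCH_ACTIONS)

events = [
    {"user": "u1", "action": "new_query", "term": "travel policy"},
    {"user": "u1", "action": "filter",    "term": "travel policy"},
    {"user": "u1", "action": "next_page", "term": "travel policy"},
    {"user": "u1", "action": "click",     "term": "travel policy"},
]

print(count_searches(events))  # 3 -- the click does not retrieve results
```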
The second basic measure is the click. A click is counted any time a user clicks on any result in the experience.
Depending on the implementation, differentiating the type of thing a user clicks on (an organic result or a ‘best bet’, etc.) can be useful – but I don’t consider that differentiation critical at the high level.
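Continuing the same hypothetical log, clicks are even simpler to count, and a result_type field (again, an assumed name, not a standard) allows the optional differentiation mentioned above:

```python
from collections import Counter

def count_clicks(events):
    """Total clicks, optionally broken down by the type of result clicked."""
    clicks = [e for e in events if e["action"] == "click"]
    by_type = Counter(e.get("result_type", "organic") for e in clicks)
    return len(clicks), by_type

# e.g. (5, Counter({"organic": 4, "best_bet": 1}))
```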
One thing to note here, which I know is a gap: there are some scenarios where a user does not need to click on anything in the search results. The user might meet their information need simply by seeing the results.
This could be because they just wanted to know whether anything was returned at all. It could be because the information they need is visible right on the results screen (the classic example being a search experience that shows people profiles, where the display includes some pertinent piece of information like a phone number). In a sophisticated search experience that offers “answers” to questions, the answer might be displayed right on the results screen. I have been puzzled for a while about how to measure this scenario (what the search literature sometimes calls “good abandonment”). Other than some mechanism on the interface that lets the user explicitly acknowledge that they met their need (“Was this answer useful?”), I’m not sure what the solution is. I am very interested if others have solved this puzzle.
The third basic measure is the search session. This is closely related to the search measure, but I do think it is important to differentiate the two.
A search session is a series of actions a user takes that, together, constitute an attempt to satisfy a specific information need.
As defined, though, a search session is not really deterministically measurable. There is no meaningful way (unless you can read the user’s mind) to know when they are “done”.
One possibility is to equate a search session to a visit – I found a good definition for this in the Web analytics article on Wikipedia:
“A visit or session is defined as a series of page requests or, in the case of tags, image requests from the same uniquely identified client.”
In the current solution I am working with, however, we have defined a search session as a series of actions taken in sequence during which the user does not change their search term. The user might navigate through a series of result pages, re-sort them, apply multiple filters, click on one or more results, etc., but none of these starts another search session.
The rationale for this is that, based on anecdotal discussions with users, users tend to think of an effort using a single search term as a notional “search”. If the user fails with that term, they try another, but that is a different “search”.
Obviously, this is not truly accurate in all situations. If we could meaningfully detect (at scale, meaning across all of our activity) when a changed search term is really a restatement of the same information need versus a completely different information need, we could do something more accurate, but we are not there yet.
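Under this working definition, sessionizing the log is straightforward: walk the events in time order and start a new session whenever the (user, search term) pair changes. A minimal sketch, again assuming the illustrative event format from earlier:

```python
from itertools import groupby

def split_sessions(events):
    """Split an event stream into search sessions.

    Assumes events are sorted by user and then by time. A new session
    starts whenever the (user, term) pair changes -- our working
    definition, not a universal standard.
    """
    return [list(group)
            for _, group in groupby(events, key=lambda e: (e["user"], e["term"]))]
```

Note that groupby only merges consecutive runs, so if a user changes their term and later returns to it, that correctly counts as a new session.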
The last basic measure I propose is the first click.
A first click is counted the first time a user clicks on a result within a search session. If a user clicks on multiple things within a search session, they are all counted as clicks, but only the first one counts as a first click.
If the user starts a new search session (which, in the current solution I work with, means they have changed their search term) and then clicks on some result, that counts as another first click.
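Given the sessions from the sketch above, first clicks fall out naturally: each session contributes at most one.

```python
def count_first_clicks(sessions):
    """Each session with at least one click contributes exactly one first click."""
    return sum(1 for session in sessions
               if any(e["action"] == "click" for e in session))
```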
That is the set of basic measures that I think could be useful to establish as a standard.
Next steps – I hope to engage with others working in this domain to refine these measures and tighten them up (especially the search session). I hope to make some contacts through the Enterprise Search Engine Professionals group on LinkedIn and perhaps other communities. If you are interested, please let me know!
In my next post, I will share definitions of some important metrics I use that are derived from the basic measures above, and provide some examples of each.
In my last post, I wondered about the lack of meaningful standards for evaluating enterprise search implementations.
I did get some excellent comments on the post and also some very useful commentary from a LinkedIn discussion about this topic – I would recommend you read through that discussion. Udo Kruschwitz and Charlie Hull both provided links to some very good resources.
In this post, I thought I would describe what I think to be some important attributes of any standard measures that could be adopted. Here I will be addressing the specific actions to measure – in a subsequent post I will write about how these can be used to actually evaluate a solution.
Measurable
To state the obvious, we need metrics that are measurable and objective. Ideally, they should directly reflect user interaction with the search solution.
Measures that depend on subjective evaluation, or that gather feedback from users through means other than their direct use of the tool, can be very useful but introduce problems of interpretation differences and sustainability.
For example, a feedback function built into the interface (“Are these results useful?” or even a more specific “Is this specific result useful for you here?”) can provide excellent insight, but such functions are used so little that the data is not useful overall.
Surveys of users inevitably run into the problem of faulty or biased memory. In my experience, users have such a negative perception of enterprise search that individual negative experiences will overwhelm positive ones when you ask users to recall and assess their experience a day or a week after their usage.
Common / Useful to compare implementations
Another important consideration is that a standard for evaluating enterprise search should include aspects of search that are common across the broad variety of solutions you might see.
In addition, they should lend themselves to comparing different solutions in a useful way.
Some implementations might be web-based (in my experience, by far the most common way to make enterprise search available). Some might be based on a desktop application or mobile app. Some implementations might depend only on users entering search terms to start a search session; some might support searching based only on search terms (no filtering or refining at all). Some might provide “search as you type” (showing results immediately based on part of what the user has entered). There are many variations to consider here.
I would want to have measures that allow me to compare one solution to another – “Is this one better than that one?” “Are there specific user needs where this solution is better than that one?”
Likely to be insightful
Another obvious aspect is that we want to include measures that are likely to be insightful.
Useful in what way, though?
My first thought is that a measure must tell us whether the solution is useful for the users – does it meet the users’ needs? (With search, I would simplify this to “does it provide the information the user needs efficiently?”, but there are likely a lot of other ways to define “useful” even within a search experience.)
Operationalizable
I would want all measures I use to be consistently available, without anyone having to actively “take a measurement” at a given time.
As mentioned above, measures that directly reflect what happens in the user experience are what I would be looking for. I would add that the measures should be captured directly from the user experience – data written to a search log file somewhere, or captured via some other means.
This provides a data set that can be reviewed and used at basically any time and which (other than maintaining the system that captures the measurements) doesn’t require any effort to capture and maintain – the users use the search solution and their activities are recorded.
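As an illustration of what “captured directly from the user experience” might look like, here is one possible shape for a log record, sketched in Python. The fields are assumptions on my part, not a proposed standard:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SearchEvent:
    """One record in a hypothetical search log, written as the user acts."""
    timestamp: datetime    # when the action occurred
    user_id: str           # uniquely identified client
    action: str            # e.g. "new_query", "next_page", "filter", "click"
    term: str              # the search term in effect for this action
    result_type: str = ""  # for clicks: "organic", "best_bet", etc.
```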
Usable for overall and when broken down by dimensions
Finally, I would expect measures to support analysis at a broad scale and also to support drilling into details using the same measures.
Examples of “broad scale” applicability: How good is this search solution overall? How good is my search solution in comparison to the overall industry average? How good are search solutions supporting the needs of users in the XYZ industry? How good are search solutions at supporting “known item” searching in comparison with “exploratory searching”?
Examples of drilling in: Within my user base, how successful are my users by department? How useful is the search solution in different topic areas of content? How good are results for individual, specific search criteria?
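As a sketch of how one measure can serve both levels, here is a derived measure (the share of sessions containing at least one click – one possible proxy for success, certainly not the only one) computed overall and broken down by an arbitrary dimension such as department. The session format follows the earlier sketches, and the “department” field is hypothetical:

```python
from collections import defaultdict

def clicked_session_rate(sessions, dimension=lambda s: "overall"):
    """Share of sessions containing at least one click, per dimension value.

    `dimension` maps a session to a group key (e.g. the user's department);
    the default collapses everything into a single overall figure.
    """
    totals = defaultdict(int)
    clicked = defaultdict(int)
    for session in sessions:
        key = dimension(session)
        totals[key] += 1
        if any(e["action"] == "click" for e in session):
            clicked[key] += 1
    return {key: clicked[key] / totals[key] for key in totals}

# Overall:     clicked_session_rate(sessions)
# Drilled in:  clicked_session_rate(sessions, lambda s: s[0]["department"])
```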
Others?
I’m sure I am missing a lot of potential criteria here – what would you add? Remove? Edit?