Lee Romero

On Content, Collaboration and Findability
February 28th, 2021

Standard Measures for Enterprise Search – A proposal for a universal KPI

Having introduced some basic, standard, definitions in my previous post, in this one I am going to propose some standard measures derived from those that enable comparisons across solutions. These also are extremely useful for individual solutions where you, as an enterprise search manager, might want to have tools at hand to proactively improve your users’ experience.

A quick recap of what I defined before:

  • Search: A single action a user takes that retrieves a set of results. Initiating a searching effort, applying a sort to results, paging, or applying a filter would typically each increment this metric.
  • Click: A user clicking on a result presented to them.
  • Search Session: A sequence of actions (clicks or searches) that are taken in order without changing the search term (more generally, the criteria of the search).
  • First Click: The first click within a search session.

Lost Clicks

The first derived measure is one I call “lost clicks”. This measures the raw number of search sessions that resulted in no click:

    \[\mbox{lost} \mbox{ clicks} = (\mbox{search sessions} - \mbox{first clicks})\]

This is a useful measure that tells you how many times, in total, users initiated a session but found nothing of interest to click on.

You can also think of this as an indicator that measures the number of total failed search sessions.

One more point I’ll make on this is that, because it is a raw number (not a ratio or percentage), it is not useful as a key performance indicator (KPI).
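As a minimal sketch of how the basic measures roll up into lost clicks, assume a hypothetical event log in which every event already carries a session identifier and a type of either "search" or "click". The field layout, function name and sample data are illustrative assumptions, not taken from any particular search product:

```python
from collections import Counter

# Hypothetical event log: (session_id, event_type). Session ids are assumed
# to have been assigned already (for example, by change of search term).
events = [
    ("s1", "search"), ("s1", "click"), ("s1", "click"),
    ("s2", "search"), ("s2", "search"),   # paged through results, never clicked
    ("s3", "search"), ("s3", "click"),
    ("s4", "search"),                     # no click
]

def basic_measures(events):
    """Count searches, clicks, search sessions, first clicks and lost clicks."""
    counts = Counter(event_type for _, event_type in events)
    sessions = {session_id for session_id, _ in events}
    # A session contributes exactly one first click if it has any click at all.
    sessions_with_click = {sid for sid, etype in events if etype == "click"}
    first_clicks = len(sessions_with_click)
    return {
        "searches": counts["search"],
        "clicks": counts["click"],
        "search_sessions": len(sessions),
        "first_clicks": first_clicks,
        "lost_clicks": len(sessions) - first_clicks,
    }

measures = basic_measures(events)
print(measures)
```

In this made-up log, four sessions produce two first clicks, so two sessions count as lost clicks.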

Abandonment rate

Now, finally, to my proposal for a standard measure of the quality of a search solution – a measure that, I think, can be usefully applied to all enterprise search solutions, can be used to drive improvement within a solution, and can be used to compare across such solutions.

That measure is “abandonment rate”, which I define as the percent of sessions that are ‘failed sessions’:

    \[\mbox{abandonment  rate} = {\mbox{lost clicks} \over \mbox{search sessions}}\]

which, after a bit of simplifying, I normally write as:

    \[\mbox{abandonment  rate} = 1 - ({\mbox{first clicks} \over \mbox{search sessions}})\]
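
Written out, the simplifying step substitutes the definition of lost clicks and divides through by search sessions:

    \[{\mbox{lost clicks} \over \mbox{search sessions}} = {\mbox{search sessions} - \mbox{first clicks} \over \mbox{search sessions}} = 1 - {\mbox{first clicks} \over \mbox{search sessions}}\]

As a purely hypothetical illustration of the arithmetic, 1,000 search sessions with 700 first clicks would mean 300 lost clicks and an abandonment rate of 1 - 700/1000 = 0.30, or 30%.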

This measure has some important advantages over a simpler click-rate model (e.g., [success rate] = [click] / [search]). For one thing, it avoids some simple problems that can be caused by a few anomalous users; for a second, it avoids the ‘trap’ of assuming a click is a success.

Anomalous usage patterns

There are two anomalous patterns I see every once in a while:

  1. A single dedicated user (or a small number of such users) might page through dozens or hundreds of pages of results (I actually have seen this before!) – generating a LOT of search actions – and yet click on nothing or just a result or two.
    • If every other user found something interesting to click on and did so on the first page of results, the click rate is still artificially lowered by these “extra” searches.
  2. Inversely, users who are in a ‘research mode’ of usage (not a known item search) will click on a lot of results (I have also seen instances where a single user clicks on 100s of results all in the same search session).
    • Even if no other user found anything interesting to click on, the click rate is still artificially raised by these “extra” clicks.

By using only the first click and also the search session as the denominator, these scenarios don’t come into play (note that because I am recommending still capturing the simpler ‘search’ and the simpler ‘click’ metrics, you can still do some interesting analyses with these!).
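To make the two anomalous patterns concrete, here is a small sketch comparing the simple click rate with the abandonment rate. The session counts are invented for illustration and the function names are my own assumptions:

```python
def click_rate(sessions):
    """Simple success rate: total clicks divided by total searches."""
    searches = sum(s["searches"] for s in sessions)
    clicks = sum(s["clicks"] for s in sessions)
    return clicks / searches

def abandonment_rate(sessions):
    """1 - (first clicks / search sessions); any click gives a session its first click."""
    first_clicks = sum(1 for s in sessions if s["clicks"] > 0)
    return 1 - first_clicks / len(sessions)

# Pattern 1 (hypothetical): 99 sessions with one search and one click each,
# plus one user who pages through 200 pages of results and clicks nothing.
scenario_1 = [{"searches": 1, "clicks": 1} for _ in range(99)]
scenario_1.append({"searches": 200, "clicks": 0})

# Pattern 2 (hypothetical): 99 sessions with no click at all,
# plus one researcher who clicks 150 results in a single session.
scenario_2 = [{"searches": 1, "clicks": 0} for _ in range(99)]
scenario_2.append({"searches": 1, "clicks": 150})

for name, scenario in (("pattern 1", scenario_1), ("pattern 2", scenario_2)):
    print(name, round(click_rate(scenario), 3), round(abandonment_rate(scenario), 3))
```

In the first scenario the click rate drops to about 33% even though 99 of 100 sessions ended in a click, while the abandonment rate is 1%. In the second, the click rate is 150% even though 99 of 100 sessions found nothing, while the abandonment rate is 99%.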

Bad Success and Good Abandonment

The second advantage I mentioned above is more of a philosophical one. The success rate measure, as defined, builds in the assumption that a click means the user succeeded. That is a strong claim.

I find focusing on abandonment to be a more honest view: your metrics don’t build in an assumption that a click is likely a success but, instead, treat a failure to find anything of interest to click on as a clearer indication of likely failure.

What do I mean?

When I consider the ideas of “success” and “failure” in a search solution, I always have to remind myself that each has a good side and a bad side.

  • Good success – Good success is a click on a result that was actually useful and what the user needs to do their job. This, ultimately, is what you want to get to – however, because there is no way for a search solution to (at scale) know if any given result is “good” or “useful”, this is impossible to really measure.
  • Bad abandonment – This is the flip side – this is how I think of the experience where a user has a search session where they find nothing useful at all. Again, this is the clear definition of failure.

However, there are other possibilities to consider!

  • Bad success – This is when a user finds something that appears to be useful or what they need and they click on it, but it turns out to be something entirely different and not useful at all.
    • A classic example of bad success I have seen is in regard to my firm’s branding library (named ‘Brand Space’). For whatever reason, many intranet managers like to create image libraries in their sites and name them ‘Brand Space’ (I think this is because they think of this image library as their own instance of ‘Brand Space’). They then leave that image library exposed in search (we train them not to do so, but sometimes they don’t listen) and if an end user initiates a search session looking for Brand Space, they find the image library in results, click on it, and are likely disappointed (I imagine such a user thinking, “What is this useless web page?”)
    • A different way to think of this is in regard to the perspective of someone who is responsible for a particular type of content (let’s say benefits information for your company) – they may think they know what users *should* access when they search in particular ways and clicking on anything else is an instance of ‘bad success’. I get this but, as the manager of the search solution, I am not in the position of defining what users *should* click on – I cannot read their minds to understand intent.
  • Good abandonment – This is when a user finds the information they need right on the search results screen. Technically, such a session would count as ‘abandoned’ even though the user got what they needed.
    • This is exactly the scenario I mentioned in the definition of a ‘click’ in my last post where I would like to define how to measure this but have never been able to figure out a way to do so.

Getting back to my description of how measuring and tracking abandonment rate is better than a success rate – my assumption has been that good abandonment and bad success will always exist for your users. However, good abandonment is likely a much smaller percentage of sessions than bad success and, more importantly, it is much easier to “improve” your search by increasing bad success than by decreasing good abandonment.

Conclusion

There is my proposal for a measure to be used to assess search solutions for the quality of the user experience – abandonment rate.

It is not perfect and it is still just an indicator but I have found it incredibly useful to actually drive action for improvement. I’ll share more on this in my next post.

February 7th, 2021

Standard Measures for enterprise search

In my last few posts, I have commented on the lack of standard measures to use for enterprise search (which, among other things, makes it hard to compare one solution with another) and suggested some criteria for what standard measures to use.

In this post, I am going to propose a few basic measures that I think meet the criteria and that any enterprise search solution should be able to provide. The labels are not critical for these, but the meaning of them is, I think, very important.

Search

First, and most important, is a search. A search is a single action in which a user retrieves a set of results from the search engine. Different user experiences may “count” these events differently.

When a user starts the process (in my experience, typically with a search term typed into a box on a web page somewhere), that is a single search.

If that user navigates to a second page of results, that is another search. Navigating to a third page counts as yet another search, etc.

Applying a filter (if the user interface supports such) counts as yet another search.

Re-sorting results counts as yet another search.

In a browser-based experience, even a user simply doing a page refresh counts as another search (though I will also say that in this case, if the interface uses some kind of caching of results, this might not actually truly retrieve a new set of results from the search engine, so this one could be a bit “squishy”).

In a user experience with an infinite scroll, the act of a user scrolling to the bottom of one ‘chunk’ of results and thus triggering the interface to retrieve the next ‘chunk’ also counts as yet another search (this is effectively equivalent to paging through results except it doesn’t require any explicit paging action by the user).

Click

The second basic measure is the click. A click is counted any time a user clicks on any result in the experience.

Depending on the implementation, differentiating the type of thing a user clicks on (an organic result or a ‘best bet’, etc.) can be useful – but I don’t consider that differentiation critical at the high level.

One thing to note here that I know is a gap – there are some scenarios where a user does not need to click on anything in the search results. The user might meet their information need simply by seeing the search results.

This could be because they just wanted to know if anything was returned at all. It could be because the information they need is visible right on the results screen (the classic example of this would be a search experience that shows people profiles and the display shows some pertinent piece of information like a phone number). In a sophisticated search experience that offers “answers” to questions, the answer might be displayed right on the results screen. I have been puzzled about how to measure this scenario for a while. Other than some mechanism on the interface that allows a user to take some action to acknowledge that they met their need (“Was this answer useful?”), I’m not sure what the solution is. Very interested if others have solved this puzzle.

Search Session

A third important metric is the search session. This is closely related to the search metric, but I do think that it is important to differentiate.

A search session is a series of actions a user takes that, together, constitute an attempt to satisfy a specific information need.

This definition, though, is really not deterministically measurable. There is no meaningful way (unless you can read the user’s mind) to know when they are “done”.

One possibility is to equate a search session to a visit – I find a good definition for this on Wikipedia in the Web analytics article:

A visit or session is defined as a series of page requests or, in the case of tags, image requests from the same uniquely identified client.

In the current solution I am working with, however, we have defined a search session to be a series of actions taken in sequence where the user does not change their search term. The user might navigate through a series of pages of results, reorder them, apply multiple filters, click on one or more results, etc., but, none of these count as another search session.

The rationale for this is that, based on anecdotal discussions with users, users tend to think of an effort using a single search term as a notional “search”. If the user fails with that term, they try another, but that is a different “search”.

Obviously, this is not truly accurate in all situations – if we could meaningfully detect (at scale, meaning across all of our activity) when changing the search term is really a restatement of the same information need vs. a completely different information need, we could do something more accurate, but we are not there, yet.
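The term-change definition of a search session can be sketched as follows. This is a minimal illustration, assuming a hypothetical list of one user's events in time order; the event fields and the exact-match comparison of terms are assumptions, not a description of the actual implementation:

```python
def assign_sessions(events):
    """Yield (session_number, event) pairs for one user's time-ordered events.

    A new session starts whenever a search event carries a different term
    than the current session's term. Paging, filtering, re-sorting and
    clicks leave the term unchanged, so they stay in the current session.
    """
    session = 0
    current_term = None
    for event in events:
        if event["type"] == "search" and event["term"] != current_term:
            session += 1
            current_term = event["term"]
        yield session, event

# Hypothetical events for a single user, already in time order.
user_events = [
    {"type": "search", "term": "expense policy"},
    {"type": "search", "term": "expense policy"},   # page 2 of results
    {"type": "click"},
    {"type": "search", "term": "travel expenses"},  # term changed: new session
    {"type": "search", "term": "travel expenses"},  # filter applied
]

session_numbers = [number for number, _ in assign_sessions(user_events)]
print(session_numbers)
```

One consequence of this rule is that a user who returns to an earlier term after trying a different one starts a third session rather than resuming the first.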

First Click

The last basic measure I propose is the first click.

A first click is counted the first time a user clicks on a result within a search session. If a user clicks on multiple things within a search session, they are all still counted as clicks, but not as first clicks.

If the user starts a new search session (which, in the current solution I work with, means they have changed their search term), then, if they click on some result, that is another first click.
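A short sketch of the distinction between clicks and first clicks, assuming hypothetical events that have already been tagged with a session identifier (the function name and data are illustrative only):

```python
def flag_first_clicks(events):
    """Return (session_id, is_first_click) for every click, in order.

    `events` is a time-ordered list of (session_id, event_type) pairs.
    Every click is still a click; only the first one in each session is
    additionally flagged as a first click.
    """
    sessions_with_click = set()
    flagged = []
    for session_id, event_type in events:
        if event_type != "click":
            continue
        flagged.append((session_id, session_id not in sessions_with_click))
        sessions_with_click.add(session_id)
    return flagged

# Hypothetical, already-sessionized events.
events = [
    ("s1", "search"), ("s1", "click"), ("s1", "click"),
    ("s2", "search"), ("s2", "click"),
]
flags = flag_first_clicks(events)
clicks = len(flags)
first_clicks = sum(1 for _, is_first in flags if is_first)
print(clicks, first_clicks)
```

Here the three clicks across two sessions yield two first clicks.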

Conclusion and what’s next

That is the set of basic measures that I think could be useful to establish as a standard.

Next steps – I hope to engage with others working in this domain to refine these and tighten them up (especially a search session). I hope to make some contacts through the Enterprise Search Engine Professionals group on LinkedIn and perhaps other communities for this. If you are interested, please let me know!

In my next post, I will be sharing definitions of some important metrics derived from the basic measures above that I use and provide some examples of each.

January 31st, 2021

Criteria for Standard Measures of Enterprise Search

In my last post, I wondered about the lack of meaningful standards for evaluating enterprise search implementations.

I did get some excellent comments on the post and also some very useful commentary from a LinkedIn discussion about this topic – I would recommend you read through that discussion. Udo Kruschwitz and Charlie Hull both provided links to some very good resources.

In this post, I thought I would describe what I think to be some important attributes of any standard measures that could be adopted. Here I will be addressing the specific actions to measure – in a subsequent post I will write about how these can be used to actually evaluate a solution.

Measurable

To state the obvious, we need to have metrics that are measurable and objective. Ideally, metrics that directly reflect user interaction with the search solution.

Measures that depend on subjective evaluation or get feedback from users through means other than their direct use of the tool can be very useful but introduce problems in terms of interpretation differences and sustainability.

For example, a feedback function built into the interface (“Are these results useful?” or even a more specific, “Is this specific result useful for you here?”) can provide excellent insight but is used so little that the data is not useful overall.

Surveys of users inevitably fall into the problem of faulty or biased memory – in my experience, users have such a negative perception of enterprise search that individual negative experiences will overwhelm positive experiences with the search when you ask them to recall and assess their experience a day or week after their usage.

Common / Useful to compare implementations

Another important consideration is that a standard for evaluating enterprise search should include aspects of search that are common across the broad variety of solutions you might see.

In addition, they should lend themselves to comparing different solutions in a useful way.

Some implementations might be web-based (in my experience, this is by far the most common way to make enterprise search available). Some might be based on a desktop application or mobile app. Some implementations might depend only on users entering search terms to start a search session; some might only support searching based on search terms (no filtering or refining at all). Some implementations might provide a “search as you type” experience (showing results immediately based on part of what the user has entered). There are many variations to consider here.

I would want to have measures that allow me to compare one solution to another – “Is this one better than that one?” “Are there specific user needs where this solution is better than that one?”

Likely to be insightful

Another obvious aspect is that we want to include measures that are likely to be useful.

Useful in what way, though?

My first thought is that it must measure if the solution is useful for the users – does it meet the users’ needs? (With search, I would simplify this to “does it provide the information the user needs efficiently?” but there are likely a lot of other ways to define “useful” even within a search experience.)

Operationalizable

I would want all measures I use to be consistently available (no need to “take a measurement” at a given time) and also to not depend on someone actively having to “take a measurement”.

As mentioned above, measures that directly reflect what happens in the user experience are what I would be looking for. In this case, I would add in that the measures should be taken directly from the user experience – data captured into a search log file somewhere or captured via some other means.

This provides a data set that can be reviewed and used at basically any time and which (other than maintaining the system capturing the measurements) doesn’t require any effort to capture and maintain – the users use the search solution and their activities are captured.

Usable for overall and when broken down by dimensions

Finally, I would expect that measures would support analysis at broad scales and also support the ability to drill into details using the same measures.

Examples of “broad scale” applicability: How good is this search solution overall? How good is my search solution in comparison to the overall industry average? How good are search solutions supporting the needs of users in the XYZ industry? How good are search solutions at supporting “known item” searching in comparison with “exploratory searching”?

Examples of drilling in: Within my user base, how successful are my users by department? How useful is the search solution in different topic areas of content? How good are results for individual, specific search criteria?

Others?

I’m sure I am missing a lot of potential criteria here – What would you add? Remove? Edit?

January 18th, 2021

Evaluating enterprise search – standards?

Over the past several years of working very closely with the enterprise search solution at Deloitte, I have tried to look “outside” as best as I can in order to understand what others in the industry are doing to evaluate their solutions in order to understand where ours ‘fits’.

I’ve attended a number of conferences and webcasts and read papers (many, I’ll admit, that are highlighted by Martin White on Twitter. I can’t recommend a follow of Martin enough!)

One thing I have never found is any common way to evaluate or talk about enterprise search solutions. I have seen several people (including Martin) comment on the relatively little research on enterprise search (as opposed to internet search, which has a lot of research behind it), and I am sure a significant reason for that is that there is no common way to evaluate the solutions.

If we could compare in a systematic way, we could start to understand how to do things like:

  • Identify common use cases that are visible in user behavior (via metrics)
  • Compare how ‘good’ different solutions are at meeting the core need (an employee needs to access some resource to do their job)
  • Compare different industries’ approaches to information seeking (again, as identified by user behavior via metrics) – for example, do users search differently in industrial companies vs. professional services companies vs. research companies?

Why do we not have a common set of definitions?

One possibility is certainly that I have still not read up enough on the topic – perhaps there is a common set of definitions – if so, feel free to share.

Another possibility is that this is a result of dependency on the metrics that are implemented within the search solutions enterprises are using. I have found that these are useful but they don’t come with a lot of detail or clarity of definition. And, more specifically, they don’t seem common across products. That said, I have relatively limited exposure to multiple search solutions – Again, I would be interested in insights from those who have (perhaps any consultants working in this space?)

And, one more possible driver behind a lack of commonality is the proprietary nature of most implementations. I try to speak externally as frequently as I can, but I am always hesitant (and have been coached) to not be too detailed on the implementation.

I do plan to put up a small series here, though, with some of the more elemental components of our metrics implementation for comparison with anyone who cares to share.

More soon!

November 21st, 2020

Back on board

After ignoring my blog here for several years, I finally am back on board – corrected some system errors that were happening and plan to soon be writing again.

Since my last article here (9 years ago!!), I have continued to be busy working for Deloitte. I also have changed roles a few times – the last several years being the business owner of our enterprise search and also leading a virtual team we call the “Search Optimization Center” here.

I’ll be writing about that work and other things soon!

October 10th, 2011

Language change over time in your search log

This is a second post in a series I have planned about the language found throughout your search log – all the way into the “long tail” and how it might or might not be feasible to understand it all.

My previous post, “80-20: The lie in your search log?”, highlighted how the slope of the “short head” of your search terms may not be as steep as anecdotes would say.  That is, there can be a lot less commonality within a particular time range among even the most common terms in your search log than you might expect.

After writing that post, I began to wonder about the overall re-use of terms over periods of time.

In other words:

Even while commonality of re-using terms within a month is relatively low, how much commonality do we see in our users’ language (i.e., search terms) from month to month?

To answer this, I needed to take the entire set of terms for a month and compare them with the entire set from the next month and determine the overlap and then compare the second month’s set of terms to a third month’s, and so on.  Logically not a hard problem but quite a challenge in practice due to the volume of data I was manipulating (large only in the face of the tools I have to manipulate it).

So I pulled together every single term used over a period of about 18 months and broke them into the set used for each of those months and performed the comparison.

Before getting into the details, here is some context about the search solution I’m writing about:

  • The average number of searches performed each month was almost 123,000.
  • The average number of distinct terms during this period was just under 53,000.
  • This results in an average of about 2.3 searches for each distinct term.

My expectation was that comparing the entire set of terms from one month to the next would show a relatively high percentage of overlap.  What I found was not what I expected.

If you look at the unique terms and their overlap, surprisingly, the average overlap between months was a shockingly low 13.2%.  In other words, over 86% of the terms in any given month were not used at all in the previous month.

If you look at the total searches performed and the percent of searches performed with terms from the prior month, this goes up to an average of 36.2% – reflecting that the terms that are re-used in a subsequent month are among the most common terms overall.

Month to Month Re-Use of Search Terms

As you can see, the amount of commonality from month-to-month among the terms used is very low.
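The two re-use figures above can be computed along these lines. This is a sketch under stated assumptions: each month's log is a plain list with one term per search, terms are compared exactly as logged (no normalization of case or spacing), and the function name and sample terms are made up for illustration:

```python
from collections import Counter

def month_to_month_reuse(previous_terms, current_terms):
    """Compare one month's search terms with the prior month's.

    Both arguments are lists with one entry per search performed.
    Returns a pair:
      - share of the current month's distinct terms also used last month
      - share of the current month's searches that used such a term
    """
    previous_distinct = set(previous_terms)
    current_counts = Counter(current_terms)
    reused = [term for term in current_counts if term in previous_distinct]
    distinct_share = len(reused) / len(current_counts)
    search_share = sum(current_counts[term] for term in reused) / len(current_terms)
    return distinct_share, search_share

# Tiny hypothetical logs, one term per search.
january = ["payroll", "payroll", "holiday calendar", "expense report"]
february = ["payroll", "payroll", "payroll", "w-2 form", "benefits enrollment", "parking"]

distinct_share, search_share = month_to_month_reuse(january, february)
print(distinct_share, search_share)
```

In this toy data only one of February's four distinct terms appeared in January (25%), yet it accounts for half of February's searches. That is the same pattern as above, where the search-weighted figure is higher than the distinct-term figure.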

What can you draw from this observation?

In a brief discussion about this with noted search analytics expert Lou Rosenfeld, his reaction was that this represented a significant amount of change in the information needs of the users of the system – significant enough to be surprising.

Another conclusion I draw from this is that it provides another reason why it is very hard to meaningfully improve search across the language of your users.  Based on my previous post on the flatness of the curve of term use within a month, we know that we need to look at a pretty significant percentage of distinct terms each month to account for a decent percentage of all searches – 12% of distinct terms to account for only 50% of searches.  In our search solution, that 12% doesn’t seem that large until you realize it still represents about 6,000 distinct terms.

Coupling that with the observation from the analysis here means that even if you review those terms for a given month, you will likely need to review a significant percentage of brand new terms the next month, and so on.  Not an easy task.

Having established just how challenging this can be, my next few posts will provide some ideas for grappling with the challenges.

In the meantime, if you have any insight on similar statistics from your solution (or statistics about the shape of the search log curve I previously wrote about), please feel free to share here, on the SearchCoP on Yahoo! groups or on the Enterprise Search Engine Professionals group on LinkedIn – I would very much like to compare numbers to see if we can identify meaningful generalizations from different solutions.

September 23rd, 2011

The Findability Gap by Lou Rosenfeld

Lou Rosenfeld has just published a great presentation I would highly recommend for anyone working in the search space:  The Findability Gap.

It provides a great picture of the overall landscape of the problem (it’s not just search, after all!).

I especially liked slide 4 – a very telling illustration of the challenge we face in intelligently making information available to our users.

Re: Slide 24 – As I’ve written about before, I would say that the 80/20 rule is more than just “not quite accurate”.  But that’s mincing words.

Overall, a highly recommended read.

June 14th, 2011

KMers.org Chat on the Importance of Search in your KM Solution

Last week, I moderated a discussion for the weekly KMers.org Twitter chat about “The Importance of Search in your KM Solution”.

My intent was to try to get an understanding about how important search is relative to other components of a KM solution (connecting people, collecting and managing content, etc.).

It was a good discussion with about a dozen or so people taking part (that I could tell).

You can read through the transcript of the session here.   Let me know what you think on the topic!

During the discussion, a great question came up about measuring the success of your search solution (thanks to Ed Dale) which I thought deserved its own discussion, so I have submitted a suggestion for a new topic for an upcoming KMers.org chat.

Please visit the suggestion here and vote for it!

November 16th, 2010

Taxonomy Boot Camp 2010

Yesterday, I delivered my presentation at Taxonomy Boot Camp 2010 on “Enterprise Taxonomy: Six Components of a vision”.  You can find the presentation on my site here and also on the Taxonomy Boot Camp site here (the latter requires a login you will need to get from the conference).

Some of the most interesting topics for me this week have been about semantic (web) technologies and also some details on the implementation of taxonomy in SharePoint 2010.  Good stuff.

In addition, I’ve had the opportunity to meet and re-meet many people who work in the taxonomy space and also in search, so it’s been a very revitalizing experience.

I also (finally) picked up a copy of the Accidental Taxonomist by Heather Hedden.  I am really looking forward to reading it.

November 13th, 2010

80-20: The lie in your search log?

Recently, I have been trying to better understand the language in use by our users in the search solution we use, and in order to do that, I have been trying to determine what tools and techniques one might use to do that. This is the first post in a planned series about this effort.

I have many goals in pursuing this.  The primary goal has been to be able to identify trends from the whole set of language in use by users (and not just the short head).  This goal supports the underlying business desire of identifying content gaps or (more generally) where the variety of content available in certain categories does not match with the variety expected by users (i.e., how do we know when we need to target the creation and publication of specific content?).

Many approaches to this do focus on the short head – typically the top N terms, where N might be 50 or 100 or even 500 (some number that’s manageable).  I am interested in identifying ways to understand the language through the whole long tail as well.

As I have dug into this, I realized an important aspect of this problem is to understand how much commonality there is to the language in use by users and also how much the language in use by users changes over time – and this question leads directly to the topic at hand here.

Search Term Usage

Chart 1

There is an anecdote I have heard many times about the short head of your search log that “80 percent of your searches are accounted for by the top 20% most commonly-used terms”.  I now question this and wonder what others have seen.

I have worked closely with several different search solutions in my career and the three I have worked most closely with (and have most detailed insight on) do not come even close to the above assertion.  Chart 1 shows the usage curve for one of these.  The X axis is the percent of distinct terms (ordered by use) and the Y axis shows the percent of all searches accounted for by all terms up to X.

From this chart, you can see that it takes approximately 55% of distinct terms to account for 80% of all searches – that is a lot of terms!
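A point on this kind of usage curve can be read off a log with a few lines of code. The sketch below assumes a hypothetical log that is simply a list with one term per search; the function name and the sample data are illustrative assumptions:

```python
from collections import Counter

def share_of_terms_needed(terms, target_share=0.8):
    """Fraction of distinct terms (most-used first) needed to cover target_share of searches."""
    counts = sorted(Counter(terms).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered >= target_share * total:
            return rank / len(counts)
    return 1.0

# Hypothetical log: 10 searches spread across 5 distinct terms.
log = ["a"] * 4 + ["b"] * 2 + ["c"] * 2 + ["d", "e"]
needed_for_80 = share_of_terms_needed(log)
needed_for_50 = share_of_terms_needed(log, target_share=0.5)
print(needed_for_80, needed_for_50)
```

In this toy log, three of the five distinct terms (60%) are needed to reach 80% of searches. Under the “80/20” anecdote the first figure would be about 20%.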

This curve shows the usage for one month – I wondered about how similar this would be for other months and found (for this particular search solution) that the curves for every month were basically the exact same!

Wondering if this was an anomaly, I looked at a second search solution I have close access to, to see whether it might show signs of the “80/20” rule.  Chart 2 adds the curve for this second solution (it’s the blue curve – the higher of the two).

Chart 2

In this case, you will find that the curve is “higher” – it reaches 80% of searches at about 37% of distinct terms.  However, it is still pretty far from the “80/20” rule!

After looking at this data in more detail, I have realized why I have always been troubled at the idea of paying close attention to only the so-called “short head” – doing so leaves out an incredible amount of data!

In trying to understand the details of why, even though neither is close to adhering to the “80/20” rule, the usage curves are so different, I realize that there are some important distinctions between the two search solutions:

  1. The first solution is from a knowledge repository – a place where users primarily go in order to do research; the second is for a firm intranet – much more focused on news and HR type of information.
  2. The first solution provides “search as you type” functionality (showing a drop-down of actual search results as the user types), while the second provides auto-complete (showing a drop-down of possible terms to use).  The auto-complete may be encouraging users to adopt more commonality.

I’m not sure how (or really if) these factor into the shape of these curves.

In understanding this a bit better, I hypothesize two things:  1) the shape of this curve is stable over time for any given search solution, and 2) the shape of this curve tells you something important about how you can manage your search solution.  I am planning to dig more to answer hypothesis #1.

Questions for you:

  • Have you looked at term usage in your search solution?
  • Can you share your own usage charts like the above for your search solution and describe some important aspects of your solution?  Insight on more solutions might help answer my hypothesis #2.
  • Any ideas on what the shape of the curve might tell you?

I will be writing more on these search term usage curves in my next post as I dig more into the time-stability of these curves.