Lee Romero

On Content, Collaboration and Findability
November 13th, 2010

80-20: The lie in your search log?

Recently, I have been trying to better understand the language our users employ in our search solution, and to determine what tools and techniques one might use to do that. This is the first post in a planned series about this effort.

I have many goals in pursuing this.  The primary goal has been to be able to identify trends from the whole set of language in use by users (and not just the short head).  This goal supports the underlying business desire of identifying content gaps or (more generally) where the variety of content available in certain categories does not match the variety expected by users (i.e., how do we know when we need to target the creation and publication of specific content?).

Many approaches to this do focus on the short head – typically the top N terms, where N might be 50 or 100 or even 500 (some number that’s manageable).  I am interested in identifying ways to understand the language through the whole long tail as well.

As I have dug into this, I realized an important aspect of this problem is to understand how much commonality there is to the language in use by users and also how much the language in use by users changes over time – and this question leads directly to the topic at hand here.

Chart 1: Search Term Usage

There is an anecdote I have heard many times about the short head of your search log that “80 percent of your searches are accounted for by the top 20% most commonly-used terms”.  I now question this and wonder what others have seen.

I have worked closely with several different search solutions in my career and the three I have worked most closely with (and have most detailed insight on) do not come even close to the above assertion.  Chart 1 shows the usage curve for one of these.  The X axis is the percent of distinct terms (ordered by use) and the Y axis shows the percent of all searches accounted for by all terms up to X.

From this chart, you can see that it takes approximately 55% of distinct terms to account for 80% of all searches – that is a lot of terms!
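The curve described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling behind the charts: it assumes the search log is available as a plain list of query strings, and it treats each whole query (trimmed and lower-cased) as one "term", which is a simplifying assumption. The function names `usage_curve` and `percent_terms_for_share` are made up for this example.

```python
from collections import Counter


def usage_curve(queries):
    """Return (xs, ys) for the term usage curve.

    xs[i] is the percent of distinct terms covered by the i+1 most-used terms;
    ys[i] is the percent of all searches those terms account for.
    Assumption: each query string, trimmed and lower-cased, is one term.
    """
    normalized = [q.strip().lower() for q in queries]
    counts = sorted(Counter(normalized).values(), reverse=True)
    total = sum(counts)
    distinct = len(counts)
    xs, ys = [], []
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        xs.append(100.0 * rank / distinct)
        ys.append(100.0 * running / total)
    return xs, ys


def percent_terms_for_share(queries, share=80.0):
    """Smallest percent of distinct terms whose searches reach `share` percent."""
    xs, ys = usage_curve(queries)
    if not xs:
        raise ValueError("no queries to analyze")
    for x, y in zip(xs, ys):
        if y >= share:
            return x
    return 100.0


# Hypothetical toy log: 10 searches over 4 distinct terms.
sample = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(percent_terms_for_share(sample, 80.0))
```

Under the "80/20" rule, `percent_terms_for_share(log, 80.0)` would come out near 20; for the first solution discussed here, the chart puts it at roughly 55.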

This curve shows the usage for one month – I wondered how similar this would be for other months and found (for this particular search solution) that the curves for every month were essentially identical!
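One way to put a number on "how similar" two monthly curves are is to sample both at the same percentages of distinct terms and take the largest vertical gap between them. The sketch below does that; the choice of this measure, the 10%-step grid, and the names `share_covered` and `max_curve_gap` are all assumptions made for illustration, not the method used to produce the charts in this post.

```python
from collections import Counter


def share_covered(queries, pct_terms):
    """Percent of all searches accounted for by the top `pct_terms` percent
    of distinct terms (an integer 0-100; the term count is rounded down)."""
    counts = sorted(Counter(queries).values(), reverse=True)
    if not counts:
        raise ValueError("no queries to analyze")
    top_n = (len(counts) * pct_terms) // 100
    return 100.0 * sum(counts[:top_n]) / sum(counts)


def max_curve_gap(month_a, month_b, grid=range(10, 101, 10)):
    """Largest absolute difference, in percentage points of searches,
    between two usage curves sampled at the grid percentages."""
    return max(
        abs(share_covered(month_a, p) - share_covered(month_b, p))
        for p in grid
    )


# Hypothetical toy logs: same shape with different terms, then a flat log.
october = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
november = ["x"] * 6 + ["y"] * 2 + ["z", "w"]
flat = ["a", "b", "c", "d"]
print(max_curve_gap(october, november))
print(max_curve_gap(october, flat))
```

A gap near zero means the two months have the same shape even if the actual terms differ, which is the sense of "same" used here. Real logs with thousands of distinct terms would warrant a finer grid.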

Wondering if this was an anomaly, I looked at a second search solution I have close access to, to see whether it might show signs of the “80/20” rule.  Chart 2 adds the curve for this second solution (it’s the blue curve – the higher of the two).

Chart 2

In this case, you will find that the curve is “higher” – it reaches 80% of searches at about 37% of distinct terms.  However, it is still pretty far from the “80/20” rule!

After looking at this data in more detail, I have realized why I have always been troubled at the idea of paying close attention to only the so-called “short head” – doing so leaves out an incredible amount of data!

In trying to understand why the two usage curves are so different, even though neither comes close to the “80/20” rule, I realized that there are some important distinctions between the two search solutions:

  1. The first solution is from a knowledge repository – a place where users primarily go in order to do research; the second is for a firm intranet – much more focused on news and HR type of information.
  2. The first solution provides “search as you type” functionality (showing a drop-down of actual search results as the user types), while the second provides auto-complete (showing a drop-down of possible terms to use).  The auto-complete may be encouraging users to adopt more commonality.

I’m not sure how (or really if) these factor into the shape of these curves.

In understanding this a bit better, I hypothesize two things:  1) the shape of this curve is stable over time for any given search solution, and 2) the shape of this curve tells you something important about how you can manage your search solution.  I am planning to dig more to answer hypothesis #1.

Questions for you:

  • Have you looked at term usage in your search solution?
  • Can you share your own usage charts like the above for your search solution and describe some important aspects of your solution?  Insight on more solutions might help answer my hypothesis #2.
  • Any ideas on what the shape of the curve might tell you?

I will be writing more on these search term usage curves in my next post as I dig more into the time-stability of these curves.

8 Responses to “80-20: The lie in your search log?”

  1. I’m not a search expert, but the curves you show are interesting from another perspective too. They seem to have a constant slope after the initial “short head.” Normally, these curves get continually shallower as they reach their terminus. I would expect the last 1% of searches to account for far fewer results than the 51st percent. But here, it looks like a constant slope. Maybe it’s a small-screen effect. If it is real, then you are operating in a different regime. While the short head is going to be important for a large chunk of search, the last 5% is also going to be important.

    p.s. What does the term “language” mean in this context? It appears to be a term of art that the uninitiated (me) may not understand.

  2. Hi Jack – You are exactly right – another interesting phenomenon of looking at term usage with these curves is that when the data reaches the point where there is one use of each search term, the curve becomes completely linear. From the point at which that happens until you reach the ‘end’ of the curve is a constant slope.

    I have another post in the works that looks more closely at this – the question being, at what percentage of distinct terms does your search usage become completely “one use terms” and is that percentage stable for a given search solution and (if it is stable), does that percentage tell you something useful?

    I don’t mean anything particularly technical by the word “language” – only “the set of terms and words used by users of your search solution”. I’m not thinking of something like, say, English versus French.

  3. [...] As part of the some of the early work in faceted taxonomies I did, I spent some time at MIT working on a research project that compared results when we queried a system that was based on a search engine technology alone,  and when we queried one where the query could be enhanced by adding taxonomy terms. For this experiment, we had the advantage of using a system, that was the brainchild of Wendi Pohs,  in which we had 2 search engines  using the same technology processing similar documents that were made available to a user interface which had a simple search box like Google. One engine processed news feeds .  These feeds were added quickly with no intervention—directly loaded into a search engine.   What our research found was that search engines without a taxonomy, left unattended, flatlined. The recall never improved over 75-80%.   Lee Romero, who is a keen observer of search, has recently done an excellent blog post observing thi… [...]

  4. My search terms per month follow similar curves; it takes 50% or more of distinct terms to achieve 80% of the queries. Very different from external ecommerce search.

  5. I’m wondering if you think a report like this is something every search analytics tool should have?

    I’m asking because we have a Search Analytics service (see http://sematext.com/search-analytics/index.html ) but don’t have such a report. And your point 2) above (“the shape of this curve tells you something important about how you can manage your search solution”) sounds intriguing, although I also don’t yet know what that “something important” is.

  6. Lee Romero Says:
    August 2nd, 2011 at 6:52 am

    Otis – I do think this kind of report is useful, but currently I’m still not entirely sure what one might do with the insight.

    I have a few data points myself (i.e., I understand what this curve looks like for a couple of search solutions) but I have not yet been able to determine if there is really an underlying interpretation of the shape.

    If there were a means to understand what this curve looks like for a good-sized set of solutions and then be able to compare traits of those search solutions, I think it could be more insightful.

    I also wish I knew what the “something important” is – I’ve tried engendering interest in “comparing notes” across solutions via the SearchCoP community on Yahoo! groups without any luck so far.

  7. [...] Slide 24 – As I’ve written about before, I would say that the 80/20 rule is more than just “not quite accurate”.  But [...]

  8. [...] previous post, “80-20: The lie in your search log?”, highlighted how the slope of “short head” of your search terms may not be as [...]
