Yesterday, I delivered my presentation at Taxonomy Boot Camp 2010 on “Enterprise Taxonomy: Six Components of a vision”. You can find the presentation on my site here and also on the Taxonomy Boot Camp site here (the latter requires a login you will need to get from the conference).
Some of the most interesting topics for me this week have been about semantic (web) technologies and also some details on the implementation of taxonomy in SharePoint 2010. Good stuff.
In addition, I’ve had the opportunity to meet and re-meet many people who work in the taxonomy space and also in search, so it’s been a very revitalizing experience.
Recently, I have been trying to better understand the language our users use in our search solution, and to work out which tools and techniques might help with that. This is the first post in a planned series about this effort.
I have many goals in pursuing this. The primary goal has been to be able to identify trends from the whole set of language in use by users (and not just the short head). This goal supports the underlying business desire of identifying content gaps or (more generally) where the variety of content available in certain categories does not match the variety expected by users (i.e., how do we know when we need to target the creation and publication of specific content?).
Many approaches to this do focus on the short head – typically the top N terms, where N might be 50 or 100 or even 500 (some number that’s manageable). I am interested in identifying ways to understand the language through the whole long tail as well.
As I have dug into this, I realized an important aspect of this problem is to understand how much commonality there is to the language in use by users and also how much the language in use by users changes over time – and this question leads directly to the topic at hand here.
There is an anecdote I have heard many times about the short head of your search log that “80 percent of your searches are accounted for by the top 20% most commonly-used terms”. I now question this and wonder what others have seen.
I have worked closely with several different search solutions in my career and the three I have worked most closely with (and have the most detailed insight into) do not come even close to the above assertion. Chart 1 shows the usage curve for one of these. The X axis is the percent of distinct terms (ordered from most used to least used) and the Y axis shows the percent of all searches accounted for by all terms up to X.
From this chart, you can see that it takes approximately 55% of distinct terms to account for 80% of all searches – that is a lot of terms!
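For anyone who wants to draw the same kind of curve from their own logs, here is a minimal Python sketch of the calculation described above. It assumes the log has already been reduced to a plain list of search terms, one entry per search; the function names and the toy log are hypothetical, and real logs would need normalization (case, whitespace, and so on) first.

```python
from collections import Counter


def usage_curve(searches):
    """Return (x, y) points for a search term usage curve.

    x = percent of distinct terms, ordered from most used to least used
    y = percent of all searches accounted for by those terms
    """
    counts = sorted(Counter(searches).values(), reverse=True)
    total = sum(counts)
    distinct = len(counts)
    points = []
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        points.append((100.0 * rank / distinct, 100.0 * running / total))
    return points


def percent_of_terms_for(points, target_percent=80.0):
    """Smallest percent of distinct terms whose searches reach target_percent."""
    for x, y in points:
        if y >= target_percent:
            return x
    return 100.0


# Hypothetical toy log: 10 searches spread over 5 distinct terms.
log = ["vpn"] * 5 + ["payroll"] * 2 + ["holiday", "badge", "parking"]
curve = usage_curve(log)
print(percent_of_terms_for(curve))
```

In this toy log, three of the five distinct terms (60%) are needed to cover 80% of the searches, which is the same kind of reading taken from the chart.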
This curve shows the usage for one month – I wondered how similar this would be for other months and found (for this particular search solution) that the curves for every month were essentially identical!
Wondering if this was an anomaly, I looked at a second search solution I have close access to, to see whether it might show signs of the “80/20” rule. Chart 2 adds the curve for this second solution (it’s the blue curve – the higher of the two).
In this case, you will find that the curve is “higher” – it reaches 80% of searches at about 37% of distinct terms. However, it is still pretty far from the “80/20” rule!
After looking at this data in more detail, I have realized why I have always been troubled by the idea of paying close attention to only the so-called “short head” – doing so leaves out an incredible amount of data!
In trying to understand why the two usage curves are so different, even though neither comes close to the “80/20” rule, I realized that there are some important distinctions between the two search solutions:
I’m not sure how (or really if) these factor into the shape of these curves.
In understanding this a bit better, I hypothesize two things: 1) the shape of this curve is stable over time for any given search solution, and 2) the shape of this curve tells you something important about how you can manage your search solution. I am planning to dig more to answer hypothesis #1.
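One rough way to put a number on hypothesis #1 is to read each month’s curve at the same set of X values and look at the largest gap between the two. The Python sketch below does that. The gap measure, the grid of X values, the function names and the toy data are all my own assumptions for illustration, not an established method, and the curve is treated as a step function rather than interpolated.

```python
import math
from collections import Counter


def searches_covered(searches, percent_of_terms):
    """Percent of all searches accounted for by the top
    `percent_of_terms` percent of distinct terms (rounded up to whole terms)."""
    counts = sorted(Counter(searches).values(), reverse=True)
    top_n = max(1, math.ceil(len(counts) * percent_of_terms / 100))
    return 100.0 * sum(counts[:top_n]) / sum(counts)


def max_curve_gap(month_a, month_b, grid=(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)):
    """Largest vertical distance between two usage curves on a shared grid."""
    return max(
        abs(searches_covered(month_a, p) - searches_covered(month_b, p))
        for p in grid
    )


# Hypothetical toy logs. January and February use different terms
# but have the same distribution; March is far more concentrated.
january = ["vpn"] * 5 + ["payroll"] * 2 + ["holiday", "badge", "parking"]
february = ["expenses"] * 5 + ["vpn"] * 2 + ["benefits", "badge", "laptop"]
march = ["vpn"] * 8 + ["payroll", "badge"]

print(max_curve_gap(january, february))
print(max_curve_gap(january, march))
```

Note that the curve only depends on how often terms are used, not on which terms they are, so two months can have completely different vocabularies and still produce the same shape. A small gap between months would be consistent with hypothesis #1; how small is “small enough” is a judgment call.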
Questions for you:
I will be writing more on these search term usage curves in my next post as I dig more into the time-stability of these curves.