## Language change over time in your search log

This is the second post in a planned series about the language found throughout your search log, all the way into the “long tail”, and whether it is feasible to understand it all.

My previous post, “80-20: The lie in your search log?”, highlighted how the slope of the “short head” of your search terms may not be as steep as anecdotes suggest. That is, even the most common terms in your search log can show much less commonality within a particular time range than you might expect.

After writing that post, I began to wonder about the overall re-use of terms over periods of time.

In other words:

Even though re-use of terms within a month is relatively low, how much commonality do we see in our users’ language (i.e., search terms) from month to month?

To answer this, I needed to take the entire set of terms for one month, compare it with the entire set from the next month to determine the overlap, then compare the second month’s set to a third month’s, and so on. Logically this is not a hard problem, but it was quite a challenge in practice because of the volume of data I was manipulating (large only relative to the tools I have to manipulate it).

So I pulled together every single term used over a period of about 18 months, broke them into the set used in each of those months, and performed the comparison.
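For readers who want to try the same comparison on their own logs, here is a minimal sketch in Python. It is not the tooling I used, and it rests on several assumptions: the log is available as `(month, term)` pairs, months are `YYYY-MM` strings so they sort chronologically, terms are normalized by trimming and lower-casing, and overlap is measured as the share of the later month's distinct terms that also appeared in the earlier month. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def terms_by_month(log_rows):
    """Group search terms into one set of distinct terms per month.

    log_rows: iterable of (month, term) pairs, e.g. ("2009-01", "benefits").
    Normalizing with strip() and lower() is an assumption, not a given.
    """
    monthly = defaultdict(set)
    for month, term in log_rows:
        monthly[month].add(term.strip().lower())
    return dict(monthly)


def monthly_overlap(monthly):
    """For each consecutive pair of months, return the share of the later
    month's distinct terms that were also used in the earlier month."""
    months = sorted(monthly)  # assumes "YYYY-MM" keys
    result = {}
    for prev, curr in zip(months, months[1:]):
        current_terms = monthly[curr]
        if not current_terms:
            continue  # avoid dividing by zero for an empty month
        shared = current_terms & monthly[prev]
        result[curr] = len(shared) / len(current_terms)
    return result


# Hypothetical sample data, only to show the shape of the input.
sample_log = [
    ("2009-01", "benefits"), ("2009-01", "holiday schedule"),
    ("2009-01", "expense report"), ("2009-01", "Benefits"),
    ("2009-02", "benefits"), ("2009-02", "travel policy"),
    ("2009-02", "org chart"), ("2009-02", "parking"),
    ("2009-03", "parking"), ("2009-03", "benefits"),
]

overlap = monthly_overlap(terms_by_month(sample_log))
for month, share in overlap.items():
    print(f"{month}: {share:.1%} of distinct terms were also used the month before")
```

Holding one set per month keeps the comparison simple, but it does assume the distinct terms for two adjacent months fit in memory at once.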

Before getting into the results, here is some context about the search solution I’m writing about:

- The average number of searches performed each month was almost 123,000.
- The average number of distinct terms during this period was just under 53,000.
- This results in an average of about 2.3 searches for each distinct term.

My expectation was that comparing the entire set of terms from one month to the next would show a relatively high percentage of overlap. What I found was not what I expected.

If you look at the unique terms and their overlap, the average overlap between months was a shockingly low 13.2%. In other words, **over 86% of the terms in any given month were not used at all in the previous month**.

If you look instead at the total searches performed and the percentage of them that used terms from the prior month, the figure goes up to an average of 36.2%, reflecting that the terms re-used in a subsequent month are among the most common terms overall.

As you can see, the amount of **commonality from month-to-month among the terms used is very low**.
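The difference between the two percentages comes down to what gets counted: distinct terms in the first case, individual searches in the second. Here is a small Python sketch of the search-weighted version, assuming each month's log has already been reduced to a count of searches per term. The function name and the sample counts are hypothetical and are not taken from my data.

```python
from collections import Counter


def search_share_from_prior_month(prev_counts, curr_counts):
    """Share of the current month's searches whose term was also
    searched in the previous month (weighted by search volume)."""
    total = sum(curr_counts.values())
    if total == 0:
        return 0.0
    reused = sum(n for term, n in curr_counts.items() if term in prev_counts)
    return reused / total


# Hypothetical per-term search counts for two adjacent months.
january = Counter({"benefits": 40, "holiday schedule": 5, "expense report": 3})
february = Counter({
    "benefits": 30, "expense report": 5, "travel policy": 2,
    "org chart": 1, "parking": 1, "badge": 1,
})

# Distinct-term view: only 2 of February's 6 terms were seen in January.
distinct_share = len(set(february) & set(january)) / len(february)

# Search-weighted view: those 2 terms account for 35 of February's 40 searches.
weighted_share = search_share_from_prior_month(january, february)

print(f"distinct-term overlap:   {distinct_share:.1%}")
print(f"search-weighted overlap: {weighted_share:.1%}")
```

In this made-up example the weighted figure is far higher than the distinct-term figure, because the few repeated terms are also the heavily used ones. That is the same direction of effect as the 13.2% versus 36.2% gap, although the sample numbers here are deliberately exaggerated.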

What can you draw from this observation?

In a brief discussion about this with noted search analytics expert Lou Rosenfeld, **his reaction was that this represented a significant amount of change in the information needs of the users of the system** – significant enough to be surprising.

Another conclusion I draw from this is that it provides **another reason why it is very hard to meaningfully improve search across the language of your users**. Based on my previous post on the flatness of the curve of term use within a month, we know that we need to look at a pretty significant percentage of distinct terms each month to account for a decent percentage of all searches: 12% of distinct terms to account for only 50% of searches. In our search solution, that 12% doesn’t seem that large until you realize it still represents about 6,000 distinct terms.

Coupling that with the observation from the analysis here means that **even if you review those terms for a given month, you will likely need to review a significant percentage of brand new terms the next month**, and so on. Not an easy task.
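If you want to check the equivalent of the 12% figure for your own solution, the calculation could look something like the Python sketch below. It assumes you have a mapping from each distinct term to its number of searches for the month; the function name, the `target_share` parameter, and the sample counts are all hypothetical.

```python
def share_of_terms_needed(term_counts, target_share=0.5):
    """Fraction of distinct terms, taken from most to least searched,
    needed to cover at least target_share of all searches."""
    counts = sorted(term_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    running = 0
    for rank, n in enumerate(counts, start=1):
        running += n
        if running >= target_share * total:
            return rank / len(counts)
    return 1.0


# Hypothetical month: 4 distinct terms, 10 searches in total.
month_counts = {"benefits": 5, "expense report": 3, "parking": 1, "badge": 1}

needed = share_of_terms_needed(month_counts, target_share=0.5)
print(f"{needed:.0%} of distinct terms cover half of all searches")
```

Multiplying the result by the number of distinct terms gives the size of the review list for the month, which is where a figure like "about 6,000 terms" comes from.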

Having established just how challenging this can be, my next few posts will provide some ideas for grappling with the challenges.

In the meantime, if you have any insight on similar statistics from your solution (or statistics about the shape of the search log curve I previously wrote about), please feel free to share here, on the SearchCoP on Yahoo! Groups, or on the Enterprise Search Engine Professionals group on LinkedIn. I would very much like to compare numbers to see if we can identify meaningful generalizations from different solutions.