## Language change over time in your search log

This is the second post in a series I have planned about the language found throughout your search log – all the way into the “long tail” – and whether it is feasible to understand it all.

My previous post, “80-20: The lie in your search log?“, highlighted how the slope of the “short head” of your search terms may not be as steep as anecdotes would suggest. That is, there can be a lot less commonality within a particular time range, even among the most common terms in your search log, than you might expect.

After writing that post, I began to wonder about the overall re-use of terms over periods of time.

In other words:

Even while commonality of re-using terms within a month is relatively low, how much commonality do we see in our users’ language (i.e., search terms) from month to month?

To answer this, I needed to take the entire set of terms for one month, compare it with the entire set from the next month to determine the overlap, then compare the second month’s set of terms to a third month’s, and so on. Logically not a hard problem, but quite a challenge in practice due to the volume of data I was manipulating (large only relative to the tools I have to manipulate it).

So I pulled together every single term used over a period of about 18 months, broke them into the set used in each of those months, and performed the comparison.
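The comparison itself is straightforward set arithmetic. A minimal Python sketch (the term sets here are made-up placeholders – the real sets held tens of thousands of distinct terms each):

```python
def month_over_month_overlap(monthly_terms):
    """Given a list of sets of distinct terms (one set per month, in
    chronological order), return the percentage of each month's terms
    that also appeared in the previous month."""
    overlaps = []
    for prev, curr in zip(monthly_terms, monthly_terms[1:]):
        shared = curr & prev  # terms re-used from the prior month
        overlaps.append(100.0 * len(shared) / len(curr))
    return overlaps

# Toy example with hypothetical intranet queries:
months = [
    {"vpn", "payroll", "holiday schedule"},
    {"vpn", "expense report", "holiday schedule", "benefits"},
    {"vpn", "benefits", "parking"},
]
month_over_month_overlap(months)  # -> [50.0, 66.666...]
```

The practical challenge was never the logic – it was doing this over 18 months of raw log data with ordinary desktop tools.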

Before getting into the results, a few details for context about the search solution I’m writing about here:

- The average number of searches performed each month was almost 123,000.
- The average number of distinct terms during this period was just under 53,000.
- This results in an average of about 2.3 searches for each distinct term.

My expectation was that comparing the entire set of terms from one month to the next would show a relatively high percentage of overlap. What I found was not what I expected.

If you look at the unique terms and their overlap, the average overlap between months was a shockingly low 13.2%. In other words, **over 86% of the terms in any given month were not used at all in the previous month**.

If you look at the total searches performed and the percentage of searches performed with terms from the prior month, this goes up to an average of 36.2% – reflecting that the terms re-used in a subsequent month are among the most common terms overall.
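The difference between the two figures comes from weighting by search counts instead of counting each distinct term once. A sketch, assuming each month’s log has been reduced to a term-to-count mapping (the query strings and counts below are invented for illustration):

```python
from collections import Counter

def weighted_overlap(prev_counts, curr_counts):
    """Percent of this month's *searches* (not distinct terms) that
    used a term which also appeared in the previous month."""
    reused = sum(n for term, n in curr_counts.items() if term in prev_counts)
    return 100.0 * reused / sum(curr_counts.values())

# Hypothetical counts: the popular head terms recur, the long tail churns.
jan = Counter({"vpn": 40, "payroll": 25, "widget x9 error": 1})
feb = Counter({"vpn": 50, "payroll": 20, "new hire forms": 5, "widget z2": 1})
weighted_overlap(jan, feb)  # high here, because "vpn" and "payroll" dominate
```

Because the re-used terms skew toward the head of the curve, the search-weighted overlap will always sit well above the distinct-term overlap.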

As you can see, the amount of **commonality from month-to-month among the terms used is very low**.

What can you draw from this observation?

In a brief discussion about this with noted search analytics expert Lou Rosenfeld, **his reaction was that this represented a significant amount of change in the information needs of the users of the system** – significant enough to be surprising.

Another conclusion I draw from this is that it provides **another reason why it is very hard to meaningfully improve search across the language of your users**. Based on my previous post on the flatness of the curve of term use within a month, we know that we need to look at a pretty significant percentage of distinct terms each month to account for a decent percentage of all searches – 12% of distinct terms to account for only 50% of searches. In our search solution, that 12% doesn’t seem that large until you realize it still represents about 6,000 distinct terms.
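That 12% figure falls out of a simple cumulative walk down the frequency-sorted term list. A sketch of the computation (the toy counts are illustrative, not from my log):

```python
from collections import Counter

def terms_needed_for_coverage(counts, target=0.5):
    """Count how many of the most-frequent distinct terms it takes to
    cover `target` (e.g. 0.5 = 50%) of all searches performed."""
    total = sum(counts.values())
    covered = 0
    for i, (term, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return i
    return len(counts)

# On a flat curve this number stays large relative to the vocabulary:
counts = Counter({"vpn": 5, "payroll": 3, "parking": 1, "benefits": 1})
terms_needed_for_coverage(counts, 0.5)  # -> 1 of 4 distinct terms here
```

With roughly 53,000 distinct terms per month in my data, the answer comes back around 6,000 – the 12% mentioned above.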

Coupling that with the observation from the analysis here means that **even if you review those terms for a given month, you will likely need to review a significant percentage of brand new terms the next month**, and so on. Not an easy task.

Having established just how challenging this can be, my next few posts will provide some ideas for grappling with the challenges.

In the meantime, if you have any insight on similar statistics from your solution (or statistics about the shape of the search log curve I previously wrote about), please feel free to share here, on the SearchCoP on Yahoo! groups, or on the Enterprise Search Engine Professionals group on LinkedIn – I would very much like to compare numbers to see if we can identify meaningful generalizations across different solutions.

October 11th, 2011 at 1:00 pm

Lee – I wonder how different your results would be if you were to normalize queries, say using http://code.google.com/p/google-refine/ . Or maybe you already did that?

October 11th, 2011 at 1:23 pm

Otis – I did not do that with this data set but I did apply quite a few normalizations to the data in my prior post (the 80-20 … post).

What I found was that it did not make that huge of a difference. I was pretty aggressive with the normalization – removing stop words, stemming, ignoring word order, case, many kinds of punctuation, etc. – and I found that it “raised” the curve by about 1% on the Y axis (i.e., count of searches) over the span from about 1% to about 20% on the X axis (i.e., how deep into the list of terms you go) – after that, the lines just grew closer together.
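For what it’s worth, the normalization was roughly along these lines (a simplified Python sketch – the stop word list is illustrative, and a real pipeline would also stem, e.g. with a Porter stemmer):

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "for", "to", "in", "and", "how"}

def normalize(query):
    """Aggressive normalization: lowercase, strip punctuation, drop stop
    words, and ignore word order by sorting the remaining words."""
    query = query.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in query.split() if w not in STOP_WORDS]
    return " ".join(sorted(words))

normalize("How to reset the VPN?")  # -> "reset vpn"
normalize("VPN reset")              # -> "reset vpn" (same normalized term)
```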

It’s hard to compare the two because of the difference in what the curves mean, but I would guess it might push overlap from about 13% to maybe the upper teens? I’ll give it a try for a few months and report back here, though.

Thanks!

October 11th, 2011 at 2:38 pm

The specific domain of your data/site and the use case stories of your user base are going to be the primary factors in this sort of metric.

If you run a news site, or an ecommerce portal selling new products, then numbers like these shouldn’t surprise anyone.

If you host an encyclopedia, or a database of medical ailments+symptoms, or a parts catalog for antique repair, or a site on stamp collecting, etc… the results are all going to be massively different.

October 12th, 2011 at 7:15 am

Thanks, Hoss – I intuitively agree with your points, but I have found no data or reports actually supporting that assertion. That’s part of why I’ve been posting on this topic – I think there is insight to be gained in proving these assumed truths true (or finding out that they’re not, or more precisely identifying the factors that influence these metrics).

Are you aware of any journal articles (or even blog posts like mine) that support your points? I’d be very interested!