Lee Romero

On Content, Collaboration and Findability

Archive for January, 2009

What is a Search Analyst?

Tuesday, January 27th, 2009

Having written about what I consider to be the principles of enterprise search, about people search in the enterprise, about search analytics and several other topics related to search in some detail, I thought I would share some insights on a role I have called search analyst – the person(s) who are responsible for the care and feeding of an enterprise search solution. The purpose of this post is to share some thoughts and experiences and help others who might be facing a problem similar to what my team faced several years back – we had a search solution in place that no one was maintaining and we needed to figure out what to do to improve it.

Regarding the name of the role – when this role first came into being in my company, I did not know what to call the role, exactly, but we started using the term search analyst because it related to the domain (search) and reflected the fact that the role was detailed (analytical) but was not a technical job like a developer. Subsequently, I’ve heard the term used by others so it seems to be fairly common terminology now – it’s possible that by now I’ve muddled the timeline enough in my head that I had heard the term prior to using it but just don’t recall that!

What does a Search Analyst do?

What does a search analyst do for you? The short answer is that a search analyst is the point person for improving the quality of results in your search solution. The longer answer is that a search analyst needs to:

  • Review data related to your search solution and understand its implications
  • Formulate new questions to continually improve upon the insights gained from the data
  • Formulate action plans from insights gained from monitoring that data in order to continually improve your search solution – this requires that the search analyst understand your search solution at a deep enough level to be able to translate analytic insights into specific changes or actions
  • Follow through on those action plans and work with others as necessary to effect the necessary changes

Measuring Success as a Search Analyst

In order to define success for a search analyst, you need to set some specific objectives for the search analyst(s). Ultimately, given the job description, they translate to measuring how the search analyst has been successful in improving search, but here are some specific suggestions about how you might measure that:

  • Execute a regular survey of users of your search (perhaps annually?) – this can be a very direct way of measuring increased quality, though ensuring you get good coverage of the target audience (and reflect appropriate demographics) may be a challenge. We have used this and results do reflect increases in satisfaction.
  • Provide ability to rate search results – a more direct way than a survey to measure satisfaction with search, though implementing it and integrating it with the search experience in a way that invites users to provide feedback can be a challenge.
  • Measure overall increase in search usage – This requires no direct work with users of your search, but it raises the question of whether increasing search usage is really a measure of quality.
  • Measure increase in search usage relative to visits to your site (assuming your search solution is integrated with your intranet, for example) – I mentioned this in my post on advanced metrics as a metric to monitor. I think this can be more useful than just measuring increases in usage; however, it might also reflect changes (good or bad) in navigation as much as changes in search.
  • Measure overall coverage of search (total potential targets) – How much content does your search solution make available as potential search results? By itself, increases in this do not equate to an improvement in search but if combined with other metrics that more directly measure quality of results, increases in coverage do translate to a user being more likely to get what they need from search. In other words, if you can assure users that they can gain direct access to more potential results in search while also ensuring that the quality of results returned is at least as good as before, that’s a good thing. On the other hand, if adding in new content pollutes the experience with many less-relevant search results, you are not doing anyone any favors by including them.
  • Measure number of specific enhancements / changes made to improve the quality of results – especially for highly sought content. Assuming you track the specific changes made, a measure of effectiveness could be to track how many changes a search analyst has made over a given time period. Did the search analyst effect 5 changes in a month? 50? Again, the number itself doesn’t directly reflect improvements (some of those changes could have been deleterious to search quality) but it can be an indicator of value.

Time Commitment for a Search Analyst

Another common question I’ve received is what percentage of time should a search analyst expect to spend on this type of work? Some organizations may have large enough search needs to warrant multiple full-time people on this task but we are not such an organization and I suspect many other organizations will be in the same situation. So you might have someone who splits their time among several roles and this is just one of them.

I don’t have a full answer to the question because, ultimately, it will depend on the value your organization does place on search. My experience has been that in an organization of approximately 5-6,000 users (employees) covering a total corpus of about a million items spread across several dozen sites / applications / repositories, spending about .25 FTE on search analyst tasks seems to provide for steady improvements and progress.

Spending less than that (down to about .1 FTE), I’ve found, results in a “steady state” – no real improvements but at least the solution does not seem to degrade. Obviously, spending more than that could result in better improvements but I find that dependence on others (content owners, application owners, etc.) can be a limiting factor in effectiveness – full organizational support for the efforts of the search analyst (giving the search analyst a voice in prioritization of work) can help alleviate that. (A search analyst with a software development background may find this less of an issue as, depending on your organization, you may find yourself less tied to development resources than you would otherwise be, though this also likely raises your own FTE commitment.)

The above description is worded as if your organization has a single person focused on search analyst responsibilities. It might also be useful to spread the responsibility among multiple people. One reason would be if your enterprise’s search solution is large enough to warrant a team of people instead of a single person. A second would be that it can be useful to have different search analysts focused (perhaps part time still for each of them) on different content areas. In this second situation, you will want to be careful about how “territorial” search analysts are, especially in the face of significant new content sources (you want to ensure that someone takes on whatever responsibility there might be for that content in regards to ensuring good findability).

What Skills does a Search Analyst Need

So far I’ve provided a description of the role of a search analyst, suggestions for objectives you can assign to a search analyst and at least an idea of the time commitment you might expect to have an effective search analyst. But, if you were looking to staff such a position, what kinds of skills should you look for? Here are my thoughts:

  • First, I would expect that a search analyst is a capable business analyst. I would expect that anyone who I would consider a capable search analyst would be able to also work with business users to elicit, structure and document requirements in general. I would also expect a search analyst to be able to understand and document business processes in general. Some other insights on a business analyst’s skills can be found here and here.
  • I would also expect that a search analyst should be naturally curious and know how to ask the right questions, especially given the exploratory nature of dealing with a lot of analytical data (as seen in my recent posts about search analytics).
  • A search analyst must be very capable of analyzing data sets. Specifically, I would expect a search analyst to be very proficient in using spreadsheets to view large data collections – filtering, sorting, formulae, pivot tables, etc. – in order to understand the data they’re looking at. Depending on your search solution, I would also expect a search analyst to be proficient with building SQL queries; ideally they would use reports built in a reporting system (and so not have to directly manipulate data using SQL) but I find that the ad hoc / exploratory nature of looking at data makes that hard.
  • I would expect a search analyst to have an understanding of taxonomy in general and, specifically, to understand your organization’s taxonomy and its management processes. This is important because the taxonomy needs to be an input into their analysis of search data and also (as highlighted in the potential actions taken from insights from search analytics), many insights can be gained from a search analyst that can influence your taxonomy.
  • I would also look for a search analyst to understand information architecture and how it influences navigation on your organization’s web sites. As with the taxonomy, I find that the search analyst will often discover insights that can influence your navigation.
  • I would expect a search analyst to have some understanding of basic web technologies, most especially HTML and the use of meta tags within it. Also, XML is important (perhaps more so, depending on your search engine). Some understanding of JavaScript (at least in so far as how / if your engine deals with it) can be useful.
  • I would expect that a search analyst should be able to quickly learn details of computer systems – specifically, how to manage and administer your search solution. I would not be hung up on whether your search analyst already knows the specific engine you might be using but that can obviously be useful.
  • This is not a skill, but another important piece of knowledge your search analyst should have is a good understanding of your major content sources and content types. In general, what kinds of things should be expected to be found in what places? What formats? What kinds of processes are behind their maintenance?
  • This is also not a skill per se, but it is important for your search analyst to be connected to content managers and application teams. The connection might be relatively tight (working in a group with them) or loose (association via a community of practitioners in your organization). The reasons for this suggestion include:
    • The ability to easily have two way communication with content managers enables your search analyst to provide continuous education to content managers about the importance of their impact on findability (education about good content tagging, how content will show in search, etc.) and also enables content managers to reach out to a search analyst when they are trying to proactively improve search of their content (something which does not seem to be as likely as I’d like to see within an enterprise setting!).
    • The ability to communicate with development teams can help in similar ways: The search analyst can use that as a way to continually reinforce the need for developers to consider findability when applications are deployed. Also, connectivity with development teams can provide insights to the search analyst so that they can proactively inject themselves in the testing of the applications (or hopefully even in the requirements definition process!) to ensure findability is actually considered.
  • Given that last recommendation, it is also important that a search analyst be able to communicate effectively and also be comfortable in teaching others (formally or informally). I find that education of others about findability is a constant need for a search analyst.

If your search needs warrant more than one person focused on improving your enterprise search solution, as much overlap in the above as feasible is good, though you may have team members specializing in some skills while others focus on other areas.

Organizational location of search analyst

Another important issue to address is where in your overall organization should the search analyst responsibility rest? I don’t have a good answer for this question and am interested in others’ opinions. My own experiences:

  • Originally, we had this responsibility falling on the heads of our search engine engineers. Despite their best efforts, this was destined to not be effective because their focus was primarily on the engine and they didn’t have enough background in things like the content sources, applications or repositories to include, connectivity to content managers or application developers. They primarily just ensured that the engine was running and would make changes reactively when someone contacted them about an issue.
  • We moved this responsibility into our knowledge management group – I was a trigger for this move as I could see that no one else in the organization was going to “step up”.
  • Due to subsequent organizational changes, this responsibility then fell into the IT group.
  • At this point, I would suggest that the best fit in our organization was within the KM group.
    • A search analyst is not a technical resource (developer or system admin, for example) though the job is very similar to business analysts that your IT group might have on staff.
    • The real issue I have found with having this responsibility fall into the IT organization is that within many organizations, IT is an organization that is responsive to the business and not an organization that drives business processes or decisions. Much of what the search analyst needs to accomplish will result in IT driving its own priorities, which can present challenges – the voice of the search analyst is not listened to within IT because it’s not coming “from the business”.
    • Also, it can be a challenge for an IT group to position a search analyst within it in order to support success. The internal organization of IT groups will vary so widely I can’t make any specific suggestions here, but I do believe that if your search analyst is located within your IT group, a search analyst could be closely aligned to a group focused on either architecture or business intelligence and be successful.
  • If your organization is structured to have a specific group with primary responsibility for your web properties (internal or external), that group would also be a potential candidate for positioning this responsibility. If that group primarily focuses externally, you would likely find that a search analyst really plays more of an SEO role than being able to focus on your enterprise search solution.

Enough about my own insights – What does anyone else have to share about how you perceive this role?   Where does it fit in your organization?  What are your objectives for this role?

Search Analytics – Search Results Usage

Monday, January 26th, 2009

In my previous two posts, I’ve written about some basic search analytics and then some more advanced analysis you can also apply. In this post, I’ll write about the types of analysis you can and should be doing on data captured about the usage of search results from your search solution. This is largely a topic that could be in the “advanced” analytics topic but for our search solution, it is not built into the search solution and has been implemented only in the last year through some custom work, so it feels different enough (to me) and also has enough details within it that I decided to break it out.


When I first started working on our search solution and dug into the reports and data we had available about search behavior, I found we had things like:

  • Top searches per reporting period
  • Top indexes used and the top templates used
  • Searches per hour (or day) for the reporting period (primarily useful to know how much hardware your solution needs)
  • Breakdowns of searches by “type”: “successful” searches, “not found” searches, “error” searches, “redirected” searches, etc.
  • A breakdown of which page of results a user (allegedly) found the desired item

and much more. However, I was frustrated by this because it did not give me a very complete picture. We could see the searches people were using – at least the top searches – but we could not get any indication of “success” or what people found useful in search, even. The closest we got from the reports was the last item listed above, which in a typical report might look something like:

Search Results Pages

  • 95% of hits found on results page 1
  • 4% of hits found on results page 2
  • 1% of hits found on results page 3
  • 0% of hits found on results page 4
  • Users performed searches up to results page 21

However, all this really reflects is the percentage of each page number visited by a searcher – so 95% of users never go beyond page 1 and the engine assumes that means they found what they wanted there. That’s a very bad assumption, obviously.

A Solution to Capture Search Results Usage

I wanted to be able to understand what people were actually clicking on (if anything) when they performed a search! I ended up solving this with a very simple solution (simple once I thought of it). I believe this emulates what Google (and probably many other search engines) do. I built a simple servlet that takes a number of parameters, including a URL (encoded) and the various pieces of data about a search result target and stores an event in a database from those parameters and then forwards the user to the desired URL. Then the search results page was updated to provide the URL for that servlet in the search results instead of the direct URL to the target. That’s been in place for a while now and the data is extremely useful!

By way of explanation, the following are the data elements being captured for each “click” on a search result:

  • URL of the target
  • search criteria used for the search
  • Location of the result (which page of results, which result number)
  • The relevance of the result
  • The index that contained the result and whether it was in the ‘best bets’ section
  • The date / time of the click

This data provides for a lot of insight on behavior. You can guess what someone might be looking for based on understanding the searches they are performing but you can come a lot closer to understanding what they’re really looking for by understanding what they actually accessed. Of course, it’s important to remember that this does not really necessarily equate to the user finding what they are looking for, but may only indicate which result looks most attractive to them, so there is still some uncertainty in understanding this.

While I ended up having to do some custom development to achieve this, some search engines will capture this type of data, so you might have access to all of this without any special effort on your part!

Also – I assume that it would be possible to capture a lot of this using a standard web analytics tool as well – I had several discussions with our web analytics vendor about this but had some resource constraints that kept it from getting implemented and also it seemed it would depend in part on the target of the click being instrumented in the right way (having JavaScript in it to capture the event). So any page that did not have that (say a web application whose template could not be modified) or any document (something like a PDF, etc) would likely not be captured correctly.

Understanding Search Usage

Given the type of data described above, here are some of the questions and actions you can take as a search analyst:

  • You know the most common searches being performed (reported by your search engine) – what are the most common searches for search result clicks?
    • If you do not end up with basically the same list, that would indicate a problem, for sure!
    • Action: Understanding any significant differences, though, would be very useful – perhaps there is key content missing in your search (so users don’t have anything useful to click on).
  • For common searches (really, for whatever subset you want to examine but I’m assuming you have a limited amount of time so I would generally recommend focusing on the most common searches), what are the most commonly clicked on results (by URL)?
    • Do these match your expectations? Are there URLs you would expect to see but don’t?
    • Action: As mentioned in the basic analytics article, you can identify items that perhaps are not showing properly in search that should and work on getting them included (or improved if your content is having an identity issue).
  • Independent of the search terms used, what are the most commonly accessed URLs from search?
    • For each of the most commonly used URLs, what keywords do users use to find them?
    • Does the most common URL clicked on change over time? Seasonally? As mentioned in the basic analytics article, you can use this insight to more proactively help users through updates to your navigation.
    • Action: Items that are common targets from search might present navigation challenges for your users. Investigate that.
    • Action: Items that are common targets but which have a very broad spectrum of keywords that lead a user to it might indicate a landing page that could be split out into more refined targets. That being said, it is very possible that users prefer the common landing page and following the navigation from there instead of diving deeper into the site directly from search. Some usability testing would be appropriate for this type of change.
  • A very important metric: What is the percentage of “fall outs” (my own term – is there a common one)? Meaning, what percentage of searches that are performed do not result in the user selecting any result? For me, this statistic provides one of the best pieces of insight you can automatically gather on the quality of results.
    • More specifically, measure the percentage fall out for specific searches and monitor that. Focus on the most common searches or searches that show up as common over longer durations of time.
    • Action: Searches that have high fall out would definitely indicate poor-performing searches and you should work to identify the content that should be showing and why it doesn’t. Is the content missing? Does it show poorly?
  • What percentage of results come from best bets?
    • Looking at this both as an overall average and also for individual searches or URLs can be useful to track over time.
    • Action: At the high level (overall average) a move down in this percentage over time would indicate that the Best Bets are likely not being maintained.
      • Look for items that are commonly clicked on that are not coming from Best Bets and consider if they should be added!
      • Are the keywords associated with the best bets items kept up to date?
    • Action: Review the best bets and confirm if there are items that should be added. Also, does your search results UI present the best bets in an obvious way?
  • What is the percentage of search results usage that comes from each page of results (how many people really click on an item on page 2, page 3, etc.)?
    • Are there search terms or search targets that show up most commonly not on page 1 of the results?
    • Action: If there are searches where the percentage of results clicked is higher on pages after page 1, you should review what is showing up on the first page. It would seem that the desired target is not showing up on the first page (at least at a higher rate than for other searches).
    • Action: If there are URLs where the percentage of times they are clicked on in pages beyond the first page of results is higher than for other URLs, look at those URLs – why are they not showing up higher in the results?
  • Depending on the structure of the URLs in use within your content, it might also be possible to do some aggregation across URLs to provide insight on search results usage across larger pieces of your site. For example, if you use paths in your URLs you could do aggregation on this data on patterns of the URLs – How many search results are to an item whose URL looks like “http://site.domain.com/path1/path2”.
    • Assuming you can do this with your data, you can then analyze common keywords used to access a whole area instead of focusing on specific URLs
    • If your site is dynamic (using query strings) it might be possible to do some aggregation based on the patterns in the query strings of the URLs instead to achieve the same results.
    • This type of analysis can actually be very useful to find cases where a user is “getting close” to a desired item but they’re not getting the most desirable target because the most desirable target does not show up well in search. (So a user might make their way to the benefits area but might not be directly accessing the particular PDF describing a particular benefit.)
      • Action: You can then identify items for improvement.
    • All of the above detailed questions about URLs can be asked about aggregations of URLs, so keep that in mind.
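
Two of the calculations above, the “fall out” percentage per search and the aggregation of clicks by URL path, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes each search performed carries an identifier that is also recorded with each click (something the data elements listed earlier do not include), and the function names are my own.

```python
from collections import Counter
from urllib.parse import urlsplit


def fall_out_rates(searches, clicked_search_ids):
    """Fraction of searches, per query, that led to no click at all.

    searches: iterable of (search_id, query) pairs, one per search performed.
    clicked_search_ids: iterable of search ids that had at least one click
    (the shared id is an assumption of this sketch).
    """
    clicked = set(clicked_search_ids)
    total = Counter()
    fell_out = Counter()
    for search_id, query in searches:
        total[query] += 1
        if search_id not in clicked:
            fell_out[query] += 1
    return {query: fell_out[query] / total[query] for query in total}


def path_prefix(url, depth=2):
    """Reduce a URL to its scheme, host and first `depth` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(segments[:depth])


def clicks_by_area(click_urls, depth=2):
    """Count search result clicks per site area rather than per URL."""
    return Counter(path_prefix(url, depth) for url in click_urls)
```

For example, `path_prefix("http://site.domain.com/path1/path2/benefits.pdf")` gives `"http://site.domain.com/path1/path2"`, so every document under that path is counted together.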

You can also combine data from this source with data from your web analytics solution to do some additional analysis. If you capture the search usage data in your web analytics tool (as I mention above should be possible), doing this type of analysis should be much easier, too!

  • For URLs commonly clicked on from search results, what percentage of their access is through search?
    • Action: If a page has a high percentage of its access via search, this identifies a navigation issue to address.
    • One case I have not yet worked out is a page that is very commonly accessed from search results (high compared to other results) but for which those accesses represent a low percentage of use of that page – do you care? What action (if any) might be driven from this? It seems like from the perspective of search, it’s important but there does not seem to be a navigational issue (users are getting to the target OK for the most part). Any thoughts?
  • Turning around the above, for commonly accessed pages (as reported by your web analytics tool), what percentage of their access comes via search? In my experience, it’s likely that the percentage via search would be low if the pages themselves are highly used already, but this is good to validate for those pages.
    • Action: As above, a high percentage of accesses via search would seem to indicate a navigation issue.
  • You can also use your web analytics package to get a sense of the “fall outs” mentioned above at a high level – using the path functionality of your web analytics package, what percentage of accesses to your search results page have a “next page” where the user leaves the site? What percentage leads to a page that is known to not be a relevant target? (In our data, I see a large percentage of users return to the home page, for example – it is possible the user clicked on a result that is the home page, but it seems unlikely.)
    • However, you will likely not have any insight about what the searches were that led to this and not know what the variance is across different searches.
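
The “percentage of access through search” comparison can be sketched as follows, assuming you can export click counts per URL from the search usage data and total page views per URL from your web analytics tool for the same period. The function names and the 0.5 threshold are placeholders for illustration, not recommended values.

```python
def search_share(search_clicks, page_views):
    """Share of each page's views that arrived via a search result click.

    search_clicks: {url: clicks from search results} for the period.
    page_views: {url: total views reported by web analytics} for the period.
    Pages with no recorded views are skipped rather than guessed at.
    """
    shares = {}
    for url, clicks in search_clicks.items():
        views = page_views.get(url, 0)
        if views > 0:
            shares[url] = clicks / views
    return shares


def navigation_candidates(shares, threshold=0.5):
    """URLs whose search share meets the (hypothetical) threshold, highest first."""
    flagged = [url for url, share in shares.items() if share >= threshold]
    return sorted(flagged, key=lambda url: shares[url], reverse=True)
```

Pages that come out of `navigation_candidates` are the ones worth checking for a navigation issue; where to set the threshold is a judgment call for your own site.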

Summing Up

Here’s a wrap (for now) on the types of actionable metrics you might consider for your search program. I’ve covered some basic metrics that just about any search engine should be able to support; then some more complex metrics (requiring combining data from other sources or some kind of processing on the data used for the basic metrics) and in this post, I’ve covered some data and analysis that provides a more comprehensive picture of the overall flow of a user through your search solution.

There are a lot more interesting questions I’ve come up with in the time I’ve had access to the data described above and also with the data that I discussed in my previous two posts, but many of them seem a bit academic and I have not been able to identify possible actions to take based on the insights from them.

Please share your thoughts or, if you would, point me to any other resources you might know of in this area!

Search Analytics – Advanced Metrics

Friday, January 23rd, 2009

In my last post, I provided a description of some basic metrics you might want to look into using for your search solution (assuming you’re not already). In this post, I’ll describe a few more metrics that may take a bit more effort to pull together (depending on your search engine).

Combining Search Analytics and Web Analytics

First up – there is quite a lot of insight to be gained from combining your search analytics data with your web analytics data. It is even possible to capture almost all of your search analytics in your web analytics solution which makes this combination easier, though that can take work. For your external site, it’s also very likely that your web analytics solution will provide insight on the searches that lead people to your site.

A first useful piece of analysis you can perform is to review your top N searches, perform the same searches yourself and review the resulting top target’s usage as reported in your web analytics tool.

  • Are the top targets the most used content for that topic?
  • Assuming you can manipulate relevancy at an individual target level, you might bump up the relevancy for items that are commonly used but which show below other items in the search results (or you might at least review the titles and tags for the more-commonly-used items and see if they can be improved).
  • Are there targets you would expect to see for those top searches that your web analytics tool reports as highly utilized but which don’t even show in the search results for the searches? Perhaps you have a coverage issue and those targets are not even being indexed.
  • It might be possible to integrate data from your web analytics solution reflecting usage directly into your search to provide a boost in relevance for items in search that reflects usage.
  • [Update 26 Jan 2009] One item I forgot to include here originally is to use your web analytics tool to track the page someone is on when they perform a search (assuming you provide persistently available access to your search tool – say in a persistently available search box on your site). Knowing this can help tune your navigational experience. Pages that commonly lead users to use search would seem like pages that do not provide good access to the information users expect and they fall back to using search. (Of course, it might be that leading the user to search is part of the point of the page so keep that in mind.)
  • [Update 26 Jan 2009] Another metric to monitor – measure the ratio of searches performed each reporting period (week) to the number of visits for that same time period. This will give you a sense of how much the search is used (in relation to navigation). I find that the absolute number is not as useful as tracking this over time and that monitoring changes in this value can give you indicators of general issues with navigation (if the ratio goes up) or search (if the ratio goes down). Does anyone know of any benchmarks in this area? I do not but am interested in understanding if there’s a generally-accepted range for this that is judged “acceptable”. In the case of our solution, when I first started tracking this, it was just under 0.2 and has seen a pretty steady increase over the years to a pretty steady value of about 0.33 now.
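
The search-to-visit ratio in the last item is simple arithmetic, but tracking it by period is what makes it useful. Here is a small sketch, assuming weekly search counts come from the search engine reports and weekly visit counts from the web analytics tool; the week labels and function names are illustrative.

```python
def search_to_visit_ratio(weekly_searches, weekly_visits):
    """Searches per visit for each week that has visit data."""
    return {
        week: weekly_searches[week] / weekly_visits[week]
        for week in weekly_searches
        if weekly_visits.get(week)  # skip weeks with missing or zero visits
    }


def ratio_changes(ratios):
    """Week-over-week change in the ratio, in chronological (sorted key) order."""
    weeks = sorted(ratios)
    return {
        current: ratios[current] - ratios[previous]
        for previous, current in zip(weeks, weeks[1:])
    }
```

A sustained positive change would point toward navigation problems and a sustained negative change toward search problems, as described above.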

A second step would be to review your web analytics report for the most highly used content on your site. For the most highly utilized targets, determine what are the obvious searches that should expose those targets and then try those searches out and see where the highly used targets fall in the results.

  • Do they show as good results? If not, ensure that the targets are actually included in your search and review the content, titles and tags. You might need to also tweak synonyms to ensure good coverage.
  • You should also review the most highly used content as reported by your web analytics tool against your “best bets” (if you use them). Does the most popularly accessed content show up in best bets?

Another fruitful area to explore is to consider what people actually use from search results after they’ve done a search (do they click on the first item, second? what is the most common target for a given keyword? Etc.). I’ll post about this separately.

I’m sure there are other areas that could be explored here – please share if you have some ideas.

Categorizing your searches

When I first got involved in supporting a search solution, I spent some time understanding the reports I got from my search engine. We had our engine configured to provide reports on a weekly basis and the reports provided the top 100 searches for the week. All very interesting and as we started out, we tried to understand (given limited time to invest) how best to use the insight from just these 100 searches each week.

  • Should we review the results from each of those 100 searches and try to make sure they looked good? That seemed like a very time intensive process.
  • Should we define a cut off (say the top 20)? Should we define a cutoff in terms of usage (any search that was performed more than N times)?
  • What if one of these top searches was repeated? How often should we re-review those?
  • How to recognize when a new search has appeared that’s worth paying attention to?

We quickly realized that there was no really good, sustainable answer and this was compounded by the fact that the engine reported two searches as different searches if there was *any* difference between two searches (even something as simple as case difference, even though the engine itself does not consider case when doing a search – go figure).

In order to see the forest for the trees, we decided what would be desirable is to categorize the searches – associate individual searches with a larger grouping that allows us to focus at a higher level. The question was how best to do this?

Soon after trying to work out how to do this, I attended Enterprise Search Summit West 2007 and attended a session titled “Taxonomize Your Search Logs” by Marilyn Chartrand from Kaiser Permanente. She spoke about exactly this topic, and, more specifically, the value of doing this as a way to understand search behavior better, to be able to talk to stakeholders in ways that make more sense to them, and more.

Marilyn’s approach was to have a database (she showed it to me and I think it was actually in a taxonomy tool but I don’t recall the details – sorry!) where she maintained a mapping from individual search terms to the taxonomy values.

Since then, I’ve been working on the same type of structure and have made good headway. Further, I’ve also found a way to capture every single search (not just the top N) into a SQL database so that it’s possible to view the “long tail” and categorize that as well. I still don’t have a good automated solution for auto-categorizing the terms, but the level of re-use from one reporting period to the next is high enough that dumping in a new period’s data requires categorization of only part of the new data. [Updated 26 Jan 2009 to add the following] Part of the challenge is that you will likely want to apply many of the same textual conversions to your database of captured searches that are applied by your search engine – synonyms, stemming, lemmatization, etc. These conversions can help simplify the categorization of the captured searches.
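To make the idea concrete, here is a small Python sketch of a term-to-category mapping. It is only an illustration under stated assumptions: the two dictionaries (`exact` for whole searches, `keywords` for single trigger words) are hypothetical stand-ins for whatever database or taxonomy tool you maintain, and the normalization only handles case and whitespace, not synonyms, stemming or lemmatization.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional


def normalize(search: str) -> str:
    """Lowercase and collapse whitespace so trivially different searches match."""
    return re.sub(r"\s+", " ", search.strip().lower())


def categorize(
    search: str, exact: Dict[str, str], keywords: Dict[str, str]
) -> Optional[str]:
    """Return the category for a search, or None if nothing matches.

    `exact` maps whole normalized searches to a category; `keywords` maps a
    single word to a category. Both are hypothetical, hand-maintained tables.
    """
    key = normalize(search)
    if key in exact:
        return exact[key]
    for word in key.split():
        if word in keywords:
            return keywords[word]
    return None


def category_counts(
    searches: Iterable[str], exact: Dict[str, str], keywords: Dict[str, str]
) -> Counter:
    """Count searches per category; unmatched ones are grouped for later review."""
    counts: Counter = Counter()
    for search in searches:
        counts[categorize(search, exact, keywords) or "(uncategorized)"] += 1
    return counts
```

The "(uncategorized)" bucket is the work queue for each new reporting period: anything that lands there needs a new mapping entry, and everything else is re-use from earlier periods.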

Anyway – the types of questions this enables you to answer and why it can be useful include:

  • What are the most-used categories of content for your search users?
    • How does this correlate with usage (as reported in your web analytics solution) for that same category?
    • If they don’t correlate well, you may have a navigational issue to address (perhaps raising the prominence of a category that is heavily searched but hard to reach through navigation, or lowering the prominence of one that’s overly visible).
    • Review the freshness of content in those categories and work with content owners to ensure that content is kept up to date. I’ve found it very useful to be able to talk with content owners in terms like “Did you know that searches for your content constitute 20% of all searches?” If nothing else, it helps them understand the value of their content and why they should care about how well it shows up in search results! Motivate them to keep it up to date!
  • Assuming you categorize your searches based on your taxonomy, this can also feed back into your taxonomy management process as well! Perhaps you can identify taxonomic terms that should be retired or collapsed or split using insights from predominance of use in search.
  • Within the categorization of search terms, can you correlate the words used to identify the most common “secondary” words in the searches? An example – GroupWise is a product made and sold by my employer. It is also a common search target. So a lot of searches will include the word groupwise in them (I use that as a way to pseudo-automatically categorize searches, assigning a category by the presence of a single keyword). Most of those searches, though, include other words. What are the most common words (other than groupwise) among searches that are assigned to the GroupWise category?
    • This insight can help you tune your navigation – common secondary words represent content that a user should have access to when they are looking at a main page (assuming one exists) for that particular category. If the most common secondary word for GroupWise were documentation, say, providing direct access to product documentation would be appropriate.
    • You can also use that insight to feed back into your taxonomy (specifically, you might be able to find ways to identify new sub-terms in your taxonomy).
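The “secondary” word question above boils down to counting the other words in every search that contains the category keyword. A minimal Python sketch, assuming plain whitespace-separated searches and a single keyword per category (the function name `secondary_words` is mine):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def secondary_words(
    searches: Iterable[str], keyword: str, top: int = 10
) -> List[Tuple[str, int]]:
    """Most common words that appear alongside `keyword` in the same search."""
    keyword = keyword.lower()
    counts: Counter = Counter()
    for search in searches:
        words = search.lower().split()
        if keyword in words:
            # Count every other word in the search, skipping the keyword itself.
            counts.update(word for word in words if word != keyword)
    return counts.most_common(top)
```

Calling `secondary_words(all_searches, "groupwise")` on a full capture of searches would give the ranked list to compare against what the category’s main page links to.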

Analytics on the search terms / words

Another useful type of analysis you can perform on search data is to look at simple metrics of the searches. Louis Rosenfeld identified several of these – I’m including those here and a few additional thoughts.

  • How many words, on average, are in a search? What is the standard deviation? This insight can help you understand how complex your users’ searches are. I don’t know what a benchmark is, but I find in our search solution, it averages just over 2 words / search. This indicates to me that the average search is very simple, so expectations are high on the search engine’s ability to take those 2 words and provide a good result.
    • You can also monitor this over time and try to understand if it changes much and, if so, analyze what has changed.
  • While not directly actionable, another good view of this data is to build a chart of the # of searches performed for each count of words. The chart below shows this for a long period of use on our engine. You can see that searches with more than 10 words are vanishingly rare. After the jump from 1 word to 2 words, it’s almost a steady decline, though there are some anomalies in the data where certain longer lengths jump up from the previous count (for example, 25 word searches are more than twice as common as 24 word searches). The absolute numbers of these are very small, though, so I don’t think it indicates much about those particular lengths.
Chart of Searches per Word Count

  • You can also look at the absolute length of the search terms (effectively, the number of characters). This is useful to review against your search UI (primarily, the ever-present search box you have on your site, right?). Your search box should be large enough to ensure that a high percentage (90+%) of searches will be visible in the box without scrolling.
    • I did this analysis and found that our search UI did exactly that.
    • I also generated a chart like the one above where the X axis was the length of the search and found some obvious anomalies in our search – you can see them in the chart below.
    • I tried to understand the unexpected spike in searches of length 3 and 4 compared to the more regular curve and found that it was caused by a high level of usage of (corporate-specific) acronyms in our search! This insight led me to realize that we needed to expand our synonyms in search to provide more coverage for those acronyms, which were commonly the acronyms for internal application names.
Chart of Search Length to number of searches
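The numbers behind both charts can be produced with a few lines of Python. This is a sketch under simple assumptions: words are split on whitespace, search length is the raw character count, and `box_chars` (how many characters your search box shows without scrolling) is something you measure yourself for your own UI.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Iterable, Tuple


def word_count_stats(searches: Iterable[str]) -> Tuple[float, float, Counter]:
    """Average words per search, standard deviation, and searches per word count.

    Raises statistics.StatisticsError if `searches` is empty.
    """
    counts = [len(search.split()) for search in searches]
    return mean(counts), pstdev(counts), Counter(counts)


def length_histogram(searches: Iterable[str]) -> Counter:
    """Number of searches for each search length in characters."""
    return Counter(len(search) for search in searches)


def share_fitting_box(searches: Iterable[str], box_chars: int) -> float:
    """Fraction of searches that fit in a box showing `box_chars` characters."""
    searches = list(searches)
    if not searches:
        raise ValueError("no searches to measure")
    fits = sum(1 for search in searches if len(search) <= box_chars)
    return fits / len(searches)
```

Spikes in `length_histogram` at short lengths are the kind of anomaly that pointed me toward acronyms; `share_fitting_box` gives the percentage to hold against the 90+% target.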

Network Analysis of Search Words

Another interesting view of your search data is hinted at by the discussion above of “secondary” search words – words that are used in conjunction with other words. I have not yet managed to complete this view (lack of time and, frankly, the volume of data is a bit daunting with the tools I’ve tried).

The idea is to parse your searches into their constituent words and then build a network between the words, where each word is a node and the links between the words represent the strength of the connection between the words – where “strength” is the number of times those two words appear in the same searches.
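The counting step itself is small, even if visualizing the result is not. Here is a Python sketch of just the edge weights, assuming whitespace-separated words and treating a word repeated within one search as a single occurrence; it is an illustration of the idea, not something I have run against our full data.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


def cooccurrence(searches: Iterable[str]) -> Counter:
    """Count how many searches each pair of words appears in together.

    Keys are (word_a, word_b) tuples in alphabetical order, so each pair is
    stored once regardless of the order the user typed the words in.
    """
    edges: Counter = Counter()
    for search in searches:
        words = sorted(set(search.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges
```

The resulting counter is an edge list with weights, which is the usual input format for graph visualization tools. Note that a 25-word search alone contributes up to 300 pairs, which gives a feel for why the data volume grows quickly.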

Having this available as a visual tool to explore words in search seems like it would be valuable as a way to understand their relationships and could give good insight on the overall information needs of your searchers.

The cost (in my own time if nothing else) of taking the data and manipulating it into a format that could then be exposed this way, however, has been high enough to keep me from doing it without some more concrete ideas for what actionable steps I could take from the insight gained. I’m just not confident enough to think that this would expose anything much more than “the most common words used tend to be used together most commonly”.

Closing thoughts

I’m missing a lot of interesting additional types of analyses above – feel free to share your thoughts and ideas.

In my next post, I’ll explore in some more detail the insights to be gained from analyzing what people are using in search results (not just what people are searching for).

Search Analytics – Basic Metrics

Tuesday, January 20th, 2009

In my first few posts (about a year ago now), I covered what I call the three principles of enterprise search – coverage, identity, and relevance. I have posted on enterprise search topics a few times in the meantime and wanted to return to the topic with some thoughts to share on search analytics and provide some ideas for actionable metrics related to search.

I’m planning 3 posts in this series – this first one will cover some of what I think of as the “basic” metrics, a second post on some more advanced ideas and a third post focusing more on metrics related to the usage of search results (instead of just the searching behavior itself).

Before getting into the details, I also wanted to say that I’ve found a lot of inspiration from the writings and speaking of Louis Rosenfeld and also Avi Rappoport and strongly recommend you look into their writings. A specific webinar to share with you, provided by Louis, is “Site Search Analytics for a Better User Experience”, which Louis presented in a Search CoP webcast last spring. Good stuff!

Now onto some basic metrics I’ve found useful. Most of these are pretty obvious, but I guess it’s good to start at the start.

  • Total searches for a given time period – This is the most basic measure – how much is search even used? This can be useful to help you understand if people are using the search more or less over time.
    • In terms of actionable steps, if you pay attention to this metric over time, it can tell you, at a high level, whether users are finding navigation to be useful or not. Increasing search usage can point to the need to improve navigation, perhaps through a better navigational taxonomy, so look at whether highly-sought content has clear navigation and labeling.
  • Total distinct search terms for a given time period – Of all of the searches you are measuring with the first metric, how many are unique combinations of search criteria (note: criteria may include both user-entered keywords and also something like categories or taxonomy values selected from pick lists if your search supports that)? If you take the ratio of total searches to distinct searches, you can determine the average number of times any one search term is used.
    • In terms of taking action on this, there is not much new to this metric compared to total searches, but the value I find is that it seems to be a bit more stable from period to period.
    • Monitoring the ratio over time is interesting (in my experience, ours tends to run about 1.87 searches / distinct search and variations seem small over time). Not sure what a benchmark should be. Anyone? Understanding and comparing to benchmarks probably would provide some more distinct tasks.
  • Total distinct words for a given time period and average words per search – take the previous metric and pull apart individual search terms (or user-selected taxonomic values) and get down to the individual words.
    • This view of the data helps you understand the variety of words in use throughout search. Often, I find that understanding the most common individual words is more useful than the top searches.
    • In terms of action, again, not much new here other than comparing to the total searches as another way to understand search usage.
    • I’m also interested in whatever benchmarks anyone else knows of in this area – again, I think comparing to benchmarks could be very useful. Just to share from my end, here are what I see (looking at these values week by week over a fairly long period):
      • Average words per search: 2.02. Maximum (of weekly averages) was 2.16 and minimum (of weekly averages) was 1.84. So pretty stable. So, on average, most searches use two words.
      • Average uses of each word (during any given week): 4.95. Maximum (of weekly averages) was 5.69 and minimum (of weekly averages) was 2.93. So a much wider variance than we see in words per search.
  • (The most obvious?) Top N searches for a given time period – I typically look at weekly data and, for this metric, I most commonly look at the top 100 searches and focus on about the top 20. Actions to take:
    • Ensure that common searches return decent results. If it does not show good results, what’s causing it to show up as a common search (it would seem that users are unlikely to find what they need)? If it does show what appear to be good results, does this expose specific issues with navigation (as opposed to the general issues observable from the metrics listed above)?
    • If a search shows up that hasn’t been in the top of the list, does that represent something new in your users’ work that they need access to? Perhaps some type of seasonal (annual or maybe monthly) change?
  • Trending of all of the above – More useful than any of the above metrics as single snapshots for a given time period (which is what it seems like many engines will provide out of the box) is the ability to view trends over longer periods. Not just the ability to view the above metrics over longer periods but the ability to see what the metrics were, say, last week and compare those to the week before, and the week before that, etc.
    • I’ve mentioned a few of these, but comparing how the trend is changing of how many searches are performed each week (or month or quarter) is much more useful than just knowing that data point during any given time period.
    • One of the challenges I’ve had with any of the “Top N” type metrics (searches, words, etc.) is the ability to easily compare and contrast the top searches week to week – being able to compare in an easily-comprehended manner what searches have been popular each week (or month) over, say, a few month (quarter) period helps you know if any particular common search is likely a single spike (and likely not worth spending time on improving results for) or an indication of a real trend (and thus very worthwhile to act on). I have ended up doing a good bit of manual work with data to get this insight – anyone know of tools that make it easier?
  • Top Searches over time – another type of metric I’ve spent time trying to tweak is to understand what makes a “top search over an extended period of time”. This is similar to understanding and reviewing trends over time but with a twist.
    • Let’s say that you gather weekly reports and you have access to the data week by week over a longer period of time (let’s say a year).
    • The question is – over a longer time period, what are the searches you should pay attention to and actively work to improve? What is a “top search”?
    • A first answer is to simply count the total searches over that year and whichever searches were most commonly used are the ones to pay attention to.
    • What I’ve found is that using that definition can lead to anomalous situations like a search that is very popular for one week (but otherwise perhaps doesn’t appear at all) could appear to be a “top search” simply because it was so popular that one week.
      • To address this, what I do is to impose a minimum threshold on the # of reporting periods (weeks in my case) in which a search needs to be a top search in order for it to be considered a top search for the longer time period. The ratio I use is normally 25% – so a term needs to be a top search in at least 25% of the weeks under review to be considered at all. Within that subset of popular searches, you can then count the total searches.
      • Alternately, if you can, massage your data to include the total searches (over the longer time period) and total reporting periods in which the search occurs as two distinct columns and you can sort / filter the data as you wish.
      • The important thing is to recognize that if you’re looking to actively work on improving specific searches, you need to focus your (limited, I’m sure!) time on those searches that warrant your time, not find yourself spending time on a search that only appears as a popular search in one reporting period.
    • On the other hand, a search that might not be a top N search any given week could, if you look at usage over time, be stable enough in its use that over the course of a longer period it would be a top search.
      • This is the inverse of the first issue. In this case, the key issue is that you will need access over longer periods of time to all of the search terms for each reporting period – not just the top searches. Depending on your engine, this data may or may not be available.
  • Another important dimension you should pay attention to when interpreting behavior is seasonality. You should compare your data to the same period a year ago (or quarter ago or maybe month ago, depending on your situation) to see if there are terms that are popular only at particular times.
    • An example on our intranet is that each year, in the week before and the week of the “Take your Kids to Work” program, searches on ‘kids to work’ go through the roof and then disappear again for another year. Also, at the end of each year, you see searches on “holidays” go way up (users looking for information on what dates are company holidays and also about holiday policy).
    • This insight can help you anticipate information needs that are cyclical, which could mean ensuring that new content for the new cycle (say we had a new site for the Kids to Work program each year, though I’m not sure if we do) shows well for searches that users will use to find it.
    • It also helps you understand what might be useful temporary navigation to provide to users for this type of situation. Having a link from your intranet home page to your holiday policies might not be useful all of the time but if you know that people are looking for that in late November and December, placing a link to the policies for that period can help your users find the information they need.
  • Another area of metrics you need to pay attention to is “not found” searches and error searches.
    • What percentage of searches result in not found searches for your reporting periods? How is that changing? If it’s going up, you seem to have a problem. If it’s stable, is it higher than it should be?
    • What are the searches that users are most commonly doing that are resulting in no results being found? Focus on those and work to determine whether it’s a content issue (not having the right content) or perhaps a tagging issue (the users are not using expected words to find the content).
    • The action you take will depend on the percentage of not found results and also on the value of losing users on those not found.
      • On an e-commerce site, each potential customer you lose because they couldn’t find what they were looking for represents hard dollars lost.
      • On an intranet, it is harder to directly tie a cost to the not found search but if your percentage is high, you need to address it (improving coverage or tagging or whatever is necessary).
      • A relatively low “not found” percentage might not indicate a good situation – it might also simply reflect a very large corpus of content being included in which just about any words a user might use will get some kind of result even if it’s not a useful result. More about that in my next post.
        • I’m not sure what a benchmark is for high or low percentage of not found, exactly. Does anyone know of any resource that might provide that?
        • On our intranet search, this metric has been very stable at around 7-8% over a fairly extended time period. That is not high enough to warrant general concern, though I do look for whether there are any common searches in this and there actually does not seem to be – individual “not found” results are almost always related to obvious misspellings and our engine provides spelling correction suggestions so it’s likely that when a user gets this, they click on the (automatically provided) link to see results with the corrected spelling and they (likely) no longer get the “no results” result.
      • Customizing your search results page for not found searches can be useful, and providing alternate searches (based on the user’s search criteria) is very useful, though it might be a very challenging effort.
    • What types of things might trigger an “error search” will depend on your engine. Some engines may be very good at handling errors and controlling resources so as to effectively never return an error unless the engine is totally offline (in which case, it’s not too likely you’ll capture metrics on searches). Also, whether these are reported on in a way that you can act on will depend on your engine. If so, I think of these as very similar to “not found” searches. You should understand their percentage (and whether it’s going up, down or is stable), what are the keywords that trigger errors, etc. Modify your engine configuration, content or results display as possible to deal with this.
      • An example: With the engine we use, the engine tries to ensure that single searches do not cause performance issues so if a search would return too many results (what is considered “too many” is configurable but it is ultimately limited), it triggers an “error” result being returned to the user. I was able to find the searches that trigger this response and ensure that (hand-picked) items show up in the search results page for any common search that triggers an error.
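Several of the metrics in the list above can be computed from nothing more than per-week counts of each search term. The Python sketch below shows the basic per-week numbers and the “top search over time” rule with its minimum-weeks threshold. It assumes you can get every search term per week (not just the top N) as a `Counter`; the function names and the returned dictionary keys are hypothetical, and the 25% default simply mirrors the ratio I mentioned.

```python
from collections import Counter
from typing import Dict, List, Tuple


def weekly_basics(counts: Counter, not_found: int = 0) -> Dict[str, float]:
    """Basic metrics for one week, given {search term: times searched}.

    `not_found` is the number of searches that returned no results, taken
    from wherever your engine reports it (an assumption about your data).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no searches recorded for this week")
    distinct = len(counts)
    words = sum(len(term.split()) * n for term, n in counts.items())
    return {
        "total": total,
        "distinct": distinct,
        "searches_per_distinct": total / distinct,
        "words_per_search": words / total,
        "not_found_rate": not_found / total,
    }


def persistent_top_searches(
    weeks: List[Counter], top_n: int = 100, min_share: float = 0.25
) -> List[Tuple[str, int, int]]:
    """Searches in the weekly top N for at least `min_share` of the weeks.

    Returns (term, total searches over all weeks, weeks in the top N),
    sorted by total searches, highest first. Ties at the top-N cutoff are
    broken by Counter insertion order, which is arbitrary for this purpose.
    """
    weeks_in_top: Counter = Counter()
    totals: Counter = Counter()
    for week in weeks:
        totals.update(week)
        for term, _ in week.most_common(top_n):
            weeks_in_top[term] += 1
    needed = min_share * len(weeks)
    rows = [
        (term, totals[term], weeks_in_top[term])
        for term in weeks_in_top
        if weeks_in_top[term] >= needed
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Returning both the total and the number of weeks as separate columns matches the alternative described above: you can loosen or tighten the threshold later by filtering the rows instead of recomputing them. Catching the inverse case (a steady search that never makes a weekly top N) would need a different filter, for example on the number of weeks in which the term appears at all.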

That’s all of the topics I have for “basic metrics”. Next up, some ideas (along with actions to take from them) on more complex search metrics. Hopefully, you find my recommendations for specific actions you can take on each metric useful (as they do tend to make the posts longer, I realize!).

Enterprise Taxonomy – A Business Process for Managing A Taxonomy

Thursday, January 15th, 2009

Now that I’ve posted quite a bit on the technical side of an enterprise taxonomy, I thought I’d share a bit on the business process side of how we have managed our taxonomy.

I spoke about this topic at the 2007 Taxonomy Boot Camp. (As an aside, I tried to find if the presentation I used is available on the site but I couldn’t find it – if someone knows of an online archive, please let me know and I can provide a link from here.) The session I delivered was titled, “The Process and Politics of Implementing a Corporate Taxonomy” and focused on the overall process we have implemented.

What follows is an overview of the larger process we used to establish the taxonomy and a description of the smaller process used to maintain it and I’ll close with some of my own thoughts on what it is that triggers changes in a taxonomy.

Getting Started

When we first started trying to formalize a taxonomy, one of the first steps we took was to do an organizational mapping to identify participants in the process. We focused on the following:

  • Groups that had significant investments in web content publication
  • Groups that had significant interest and investment in knowledge capture and sharing
  • Groups that have influence on the corporate culture

We felt that this organizational mapping was important because it would help increase buy-in to the taxonomy from those who have most vested interest in it and also (with help from that last group) would help increase larger scale adoption of the language. Once we felt that we had identified the groups that met these criteria, we engaged with the executives for the groups to help us identify one or more people who could be included in our Taxonomy Review Board.

The rest of the “getting started” process included content audits and analyses to identify terminology used to describe the content, definition of the structure of the taxonomy we wanted to use, organization of the terminology into this structure and then working with the Taxonomy Review Board to confirm the end result as a first version of the (evolving) taxonomy.

We also laid out the objectives we had for the overall process – which you can find in my post on the vision we have developed for our taxonomy. The really pertinent items were that the taxonomy be actively managed and that the management process be transparent.

The People

Now that the taxonomy had been established, we needed to identify the people and process we would use for maintaining and enhancing the taxonomy.

The people who are involved include:

  • The taxonomy manager – a single person responsible for responding to requests for changes, proactively identifying proposed changes within the taxonomy and handling the “administrative” side of the process. If it’s helpful, I’ve found that this responsibility probably takes about 10% of a single person’s time (though that obviously reflects the size of our organization and volume of content, etc., and can vary at different times). This is my role within the process.
  • A core team – a group of about 3 people (one of whom is the taxonomy manager) who do a first-level check of change requests to weed out requests that are obviously (at least in the minds of the core team) not worth moving further in the review process. Time commitment for this group is probably on the scale of a few hours a month.
  • The above-mentioned Taxonomy Review Board (TRB) – A cross-organizational group that reviews proposed changes and either aligns with them or proposes counter-proposals. This group currently has about 15-20 members. Time commitment for this group is minimal – normally, the proposals for change have been considered and detailed enough by the time this group sees them that their involvement is to receive emails with change proposals and either align (so no reply necessary) or write a counter proposal.

This organization has helped to keep the taxonomy managed, while also keeping overall enterprise expense to manage it fairly small.

The Process

Now, I am, at heart, a software engineer. Why is this pertinent? Early on in my career, I came to appreciate the need for and value of change control (or, as I prefer to think of it, change management or change visibility – I’ve always thought “control” seemed a bit stronger than you could really achieve) and that has seeped into our process.

At its heart, our process is similar to a software development team’s change control board (CCB) process:

  1. All changes, upon identification, are captured in the same bug-tracking system used for our engineering and IT systems (an implementation of Bugzilla). Just like with software, all changes are treated as either enhancements (extending beyond what we have now) or defects (a problem or mistake that was not anticipated) and so they follow the same lifecycle (I generically use the word “bug” below to mean the specific documented request in our tracking system for a change, regardless of whether the change is an enhancement or defect).
  2. Once a change is documented as a bug (I’ll write a bit more below about the sources of changes to the taxonomy), it is assigned to the taxonomy manager for resolution.
  3. The taxonomy manager then needs to do a few things:
    1. Ensure that the bug contains all of the necessary details and any obvious questions are answered. An example of this would be the specific guidelines we have for one of our classifications – I shared these with the TaxoCoP and Patrick Lambe blogged about them as well. In this case, the taxonomy manager is on the hook to ensure a request adheres to these guidelines.
    2. Describe the impact of the change on the rest of the taxonomy (if any).
  4. The change is then reviewed by the core team – this review is typically virtual via email exchange but can be a meeting convened by the taxonomy manager.
    1. If the core team aligns with the change (perhaps after some continued evolution of it), it moves forward for a review by the full Taxonomy Review Board.
    2. If the core team rejects the change, it is canceled. The taxonomy manager communicates that back to the requester (if the trigger for the change was a particular person or group).
  5. When a change is put to the Taxonomy Review Board, which is a virtual team (and which is geographically very distributed), it is communicated by email to the TRB.
    1. At this point in the process, we want to ensure efficiency in process so we do not use a “request for comment” type of approach.
    2. Instead, the change is detailed for the TRB and the TRB members are given two options: 1) align with the change as stated or 2) provide a counter proposal. This helps keep focused and helps to avoid potentially lengthy discussions on the change at this point.
    3. To further accelerate decision making and reduce time on the part of the TRB members, each request is also positioned as a time-boxed proposal: You have until <this date> to provide a counter proposal or else you are assumed to align to the change. In other words, no reply from a member equates to alignment.
    4. Another implication of this is that by the point a change reaches the TRB it is almost inevitably going to go ahead in some form (perhaps changed by counter-proposals from the TRB). Outright cancellation at this stage seems possible but so far has not happened in practice.
  6. Upon achieving alignment within the TRB on the final proposal, the change is executed in the taxonomy and the request closed. The change is communicated back to the requester to close the loop and (especially for significant changes) the change may also be communicated to our larger content manager community.
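
The time-boxed rule in step 5 (no reply by the deadline counts as alignment) can be sketched in a few lines of Python. This is only an illustration of the rule as described above; the function name, the status labels and the shape of the `replies` mapping are assumptions made for the example, not part of any real tooling.

```python
from datetime import date

def resolve_proposal(members, replies, deadline, today):
    """Return (status, counter_proposals) for a time-boxed TRB proposal.

    replies maps a member name to "align" or to the text of a counter proposal.
    Members who have not replied count as aligned once the deadline has passed.
    """
    # Any counter proposal sends the change back for revision.
    counters = {m: r for m, r in replies.items() if r != "align"}
    if counters:
        return "needs_revision", counters
    # Before the deadline, silence just means "still waiting".
    silent = [m for m in members if m not in replies]
    if silent and today <= deadline:
        return "waiting", {}
    # After the deadline, silence is treated as alignment.
    return "aligned", {}
```

The design point is that the only way to slow a change down is to put a concrete alternative on the table; silence never blocks.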

Issues with the Process and Framework

While it has worked effectively, we still face a number of issues with this process. These include:

  • The need to keep on top of organizational changes – specifically, with regard to membership in the Taxonomy Review Board. A member’s role within the enterprise can change to the point where they may not be in the best position to represent a group of interests. In addition, some of the organizational changes we’ve seen can result in an “unbalanced” TRB.
  • Which brings us to the second issue – organizational coverage. Currently, we have a TRB that overly represents our marketing organization and is missing representation from some groups that should be represented.
  • Lastly, support of this process from within our IT organization is a concern. I see this in a couple of different ways:
    • Organizationally, the taxonomy manager falls within IT but the responsibility to continue managing the taxonomy is not perceived as a priority (and there’s a question as to whether it should even organizationally be within IT);
    • In terms of adoption, it has been a challenge to educate the IT organization about the value and use of the taxonomy. An example would be integration with a business intelligence solution to ensure consistency in language and, more specifically, to be able to effectively integrate insights about content (which does use the taxonomy) with more transactional-based “data”.

Identifying the Need for Change

What triggers a change in the taxonomy?

As I (re-)gather my thoughts on this topic, one lingering question came back to me about the overall process. The question is external to the process (which takes the approach of “a change comes from somewhere and we’re not going to worry about where it comes from but once it’s been identified, we’ll wedge it into this process”) but I am interested in understanding what other taxonomists might actively do in maintaining a taxonomy. In other words, how much change do you experience that comes from others compared to your own recommendations or insights?

Here’s a list of triggers that have resulted in changes in the taxonomy:

  • We provide content publishers with a mechanism to request a change to one facet (“Item Type”) at the point where they are submitting a piece of content. I consider this to be a purely tactical, reactive change and, given the above process, it suffers from the problem that a content publisher cannot sit at their computer waiting for the business process to complete before they submit their content. So even if a new value is adopted, they will need to publish their content with a temporary value and remember to come back and change it after the fact.
  • I have engaged with content owners several times who were planning to publish a set of content and worked proactively with them to understand their content and ensure that the taxonomy provides good coverage. It’s lucky (though perhaps it shouldn’t be!) when this can happen and I manage to ensure the taxonomy changes are in place before they need to publish content.
  • When a new repository is being migrated or merged into a system using the taxonomy, there will likely be a number of changes in the taxonomy, including adoption of whole new classifications and introduction of new values. Also, this almost inevitably requires a good mapping from local system values to the taxonomy values where there is (near) overlap.
  • Most proactively on my part, I have also used analytics from a number of sources to help refine the taxonomy, including:
    • Reviewing search query logs to understand the language being used by people looking for content
    • Reviewing the “free text” fields (e.g., title, description, etc.) within content management systems to look for terms that are commonly used that might warrant explicit use in a constrained classification.
    • Reviewing the volume of content when split along various dimensions of the existing taxonomy – looking for opportunities to merge (values are under-utilized), split (values are over-utilized) or perhaps retire (values are not utilized)
  • Adoption of new terminology by groups responsible for that part of the taxonomy. A common example is the terminology used to describe our various solution offerings – these will, at times, be changed unilaterally by our marketing organization and we then need to understand how that translates to the existing taxonomy and to content tagged with that taxonomy.
  • Lastly, given that another part of the vision of the taxonomy is to use systems of record where possible, a number of changes are triggered outside of the taxonomy and simply synchronized in from the source system. This approach assumes (true in all cases as far as I am aware) that the source systems provide their own management process on values and these changes do not require any review through the above taxonomy management process.
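
The volume review mentioned in the analytics bullet above (merge under-utilized values, split over-utilized ones, retire unused ones) lends itself to a small script. The following is a minimal sketch, assuming you can export the list of values and the tags actually applied to content; the function name and the `low` / `high` thresholds are hypothetical and would need tuning against real data.

```python
from collections import Counter

def review_value_usage(all_values, tagged_values, low=0.02, high=0.40):
    """Flag values as retire / merge / split candidates by share of tagged content.

    all_values: every value defined in one classification.
    tagged_values: one entry per tag applied to a piece of content.
    low / high: illustrative thresholds, not recommendations.
    """
    counts = Counter(tagged_values)
    total = sum(counts.values())
    flags = {}
    for value in all_values:
        n = counts.get(value, 0)
        share = n / total if total else 0.0
        if n == 0:
            flags[value] = "retire?"   # not utilized
        elif share < low:
            flags[value] = "merge?"    # under-utilized
        elif share > high:
            flags[value] = "split?"    # over-utilized
    return flags
```

The output is only a list of candidates; each flagged value would still go through the review process described above before anything changes.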

Enterprise Taxonomy – An XML schema for Publishing a Taxonomy

Wednesday, January 14th, 2009

In my continuing dive into the structure of our taxonomy (which I hope is of use or interest to you to understand and possibly adapt to your own needs), I’ve so far provided an outline of the application solution and then a high level outline of the data model we’re using.

One of the important features of our solution is that our taxonomy system provides the ability for other systems to consume the taxonomy via an XML document. I’ll explore that a bit here.

Accessing the XML

Access to the XML document for the taxonomy is through a very simple means: a standard HTTP GET. The query string in the request can specify various parameters on the URL – effectively, a very simple web service. The types of parameters supported include:

  • Identifying which classification is desired (default is to return all)
  • Specifying the statuses of values to include (default will return all)
  • Specifying the language to include (default returns English)
  • Specifying the level of detail of interest (default returns the briefest format)
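
To make the idea concrete, here is a rough sketch of how a consuming application might build such a request. The host name and the parameter names (`classification`, `status`, `lang`, `detail`) are invented for this example; the real names are not given in this post.

```python
from urllib.parse import urlencode

BASE_URL = "http://taxonomy.example.com/export"  # hypothetical endpoint

def taxonomy_url(classification=None, statuses=None, lang=None, detail=None):
    """Build a GET URL; any parameter left out falls back to the service default."""
    params = {}
    if classification is not None:
        params["classification"] = classification
    if statuses:
        params["status"] = ",".join(statuses)
    if lang is not None:
        params["lang"] = lang
    if detail is not None:
        params["detail"] = detail
    query = urlencode(params)
    return f"{BASE_URL}?{query}" if query else BASE_URL
```

Calling `taxonomy_url()` with no arguments yields the bare endpoint, which corresponds to the defaults listed above: all classifications, all statuses, English, briefest format.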

With regard to the language – one of the business rules followed in our web sites is that you provide content in the user’s selected language when available and return English when the user’s language is not available (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details in German and, say, 99% of the values in German but if there are values that are not translated, those are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it does present a possibly confusing same-page mix of non-English and English).
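
The per-value fallback rule can be expressed in a few lines. This is a sketch only: the function name and the dictionary of translations are assumptions for illustration, and it relies on the stated rule that English is always available.

```python
def localized_name(names, lang, default="en"):
    """Return (text, language actually used) for one value.

    names maps a language code to the translated name. English is assumed
    to always be present, as the business rule requires.
    """
    if lang in names:
        return names[lang], lang
    return names[default], default
```

Returning the language actually used alongside the text lets a user interface mark or style the untranslated values, which may soften the mixed-language effect noted above.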

Document structure

The returned XML document looks like the following. I’m not using any formal XML schema syntax – instead showing the elements and how they relate to each other with a brief description of the elements that I don’t think are self-explanatory.

  • taxonomy
    • classification – has an attribute id (the ID of the classification)
      • name – has an attribute lang (the language code describing the language of the name element)
      • description – has an attribute lang (the language code describing the language of the description element)
      • status
      • createDate
      • updateDate
      • sourceSystem
      • comments
      • hasValues (a Y/N indicating if a consuming application should expect to find values in the values element)
      • constrained (a Y/N indicating if a consuming application should enforce the rule that values for this classification must come from the list of values provided)
      • multiValued (a Y/N indicating if a consuming application should allow multiple values be assigned for any given content piece)
      • dataType
      • changeHistory – an element with a sequence of elements, one for each auditable event in this item’s life history
      • aliases – has attribute count (the number of alias elements included)
        • alias – a structured element providing details on an alias
      • levels – has an attribute count (the number of levels included)
        • level – a structured element providing details on the level (omitted here)
      • values – has an attribute count (the number of values included)
        • value – has an attribute id (the ID of the value in the taxonomy system)
          • name – has an attribute lang (the language code describing the language of the name element)
          • description – has an attribute lang (the language code describing the language of the description element)
          • status
          • createDate
          • updateDate
          • sourceSystemId
          • levelRef – attribute id (identifies the specific level [in the levels element above] with which this value is associated)
          • aliases – attribute count (the number of aliases for this value)
            • alias – a structured element providing details on an alias
          • changeHistory – Same as for classification
          • values – recursive structure reflecting hierarchy within a classification’s set of values
            • value (etc.)

And that’s the schema. Looks complicated, but it’s really pretty simple, I think. The advantage of this has been that consuming applications do not need to directly access the database containing this (which would be pretty simple in principle) and so can be insulated from changes in the underlying structure of the database as we need to make them.

Providing access via an HTTP GET keeps the technical cost minimal for consuming applications (they need to be able to read from an HTTP socket and then parse XML, both pretty standard functions in modern languages / libraries).
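
As an illustration of how little a consumer needs, the sketch below parses a small document and walks the recursive values structure using only Python’s standard library. The sample XML is made up for this example to follow the element outline above (brief format); it is not real output from the taxonomy system.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the outline above (brief level of detail).
SAMPLE = """
<taxonomy>
  <classification id="10">
    <name lang="en">Geography</name>
    <values count="1">
      <value id="100">
        <name lang="en">EMEA</name>
        <status>Active</status>
        <values count="1">
          <value id="101">
            <name lang="en">France</name>
            <status>Active</status>
          </value>
        </values>
      </value>
    </values>
  </classification>
</taxonomy>
"""

def walk_values(values_elem, depth=0):
    """Yield (depth, id, name) for each value, following nested values elements."""
    if values_elem is None:
        return
    for value in values_elem.findall("value"):
        yield depth, value.get("id"), value.findtext("name")
        yield from walk_values(value.find("values"), depth + 1)

def list_taxonomy(xml_text):
    """Return [(classification name, [(depth, value id, value name), ...]), ...]."""
    root = ET.fromstring(xml_text.strip())
    return [
        (c.findtext("name"), list(walk_values(c.find("values"))))
        for c in root.findall("classification")
    ]
```

In a real consumer the text passed to `list_taxonomy` would come from the HTTP response rather than from a string constant.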

One last comment – in regard to the level of detail parameter mentioned above – the “brief” level includes only the names, descriptions and statuses of the classifications, levels and values. The “detailed” level includes all details except the changeHistory elements. The “complete” level includes everything in “detailed” plus the changeHistory elements. The “complete” format is probably not very useful for consumers as most will not care about the life history of elements (though that is of interest and value within the taxonomy).

Relationship to other Schemas

Just to connect the dots – I know of other XML schemas that we could conceivably have used to publish this document.  With help from the Taxonomy community of practice, I found the following while researching for a schema to use (I especially want to say thanks to Leonard Will, Mike Taylor, Marcel van Mackelenbergh and Bob Bater for their insights):

At the time we were designing (defining) a schema to use, we knew we wanted to keep it as simple as possible and (right or wrong) as close to the underlying model as we could, which made sense within our business environment. It wasn’t clear at the time which of the above might provide the most likely path forward (in terms of standard adoption) so we “rolled our own”. And, another factor was that the schemas seemed far more general than our needs warranted; for example, the broader-than / narrower-than type relations were implicit in our structure and specifying those explicitly seemed confusing. (To be honest, all of which could be interpreted as “we weren’t educated enough to understand the options and took the simpler-at-the-time approach of rolling our own”.)

I am still not as familiar as I would like to be with the above, so I still would not be able to say which would be most appropriate, but the SKOS schema, now in draft from the W3C, seems like a potential solution that would fit our needs and could eventually become a broader standard. Does anyone have any insights as to where this is moving?

Enterprise Taxonomy – The Structure in Detail

Tuesday, January 13th, 2009

In my previous post, I started describing the structure of the taxonomy we are using in some detail; originally, the following was part of my last post but it got a bit too long so I’ve split it. In this post, I’ll explore the structure in yet more detail – getting closer to a data model.

If you are going through a similar process that we’ve been through and you want to organize your taxonomy in a database, this might provide you with enough detail to get moving.

One note on terminology – much of what we have used is not what I would consider “standard” among taxonomists but was derived during a period when we had numerous systems we were trying to pull together, each of which used one of many different terms – categories, attributes, metadata, fields, tags, etc. I was charged at this point (which was before we started digging into the details of defining an enterprise taxonomy) with trying to define some terms that we could all use so that we could at least understand each other. A taxonomy for taxonomies, I guess.


Classification

The primary construct in the taxonomy is called a “Classification”. A better term for this I now know would be “Facet” as that’s what they are. The intent is that a Classification is a specific set of values (perhaps explicitly defined or perhaps defined by a set of guidelines or business rules) with which pieces of content can be associated (they can be tagged with values from the classification).

In our schema, a Classification itself has a number of elements:

  • Name – The preferred name for the Classification. Typically used as the label for fields on, for example, data entry forms of various sorts.
  • Definition – A concise definition of the Classification. Forcing the explicit definition of this helps reduce fuzzy thinking and gets people to clearly differentiate when a new Classification is needed versus using an existing one. This can be displayed in other systems that allow users to associate classification values with content as a kind of “mini-help”.
  • Life History (create date, modification date, audit trail) – We maintain the create date (actually, date added to the taxonomy) and a modification date so we know what happened when to the Classification. More detail is provided below on the audit trail.
  • Source System – Each classification might be sourced from another system. An example is a product listing – these are not maintained in the taxonomy but in their own systems and the taxonomy simply uses that list. Another example (where we do not have automation) is language (where we reference ISO standards as the master even though the values are still manually maintained in our taxonomy database).
  • Comments – A text field to hold comments for use within the taxonomy. Notes about issues, etc. Not intended for end users as the Definition is.
  • Data Type – The type of values expected for this Classification. Most commonly, just Strings, but we do define (for example) Creation Date and Expiration Date as classifications with data type of Date.
  • Value Indicators – The taxonomy provides indicators to help other systems know what to do with the Classification – Should assignment be constrained to just the values provided by the taxonomy? Should other systems allow content pieces to be associated with multiple values of a classification?
  • Synonyms – We provide for the Classification itself to have synonyms (these are synonyms for the Name of the classification). This can be used when (despite best attempts to the contrary) people want to continue to use different terms for the same classification. An example might be that one system (and its user group) might want to refer to a “Region” whereas another might use the term “Market” or “Area”.
  • Status – We provide a status indicator on pretty much everything within the taxonomy (Classifications, individual values, etc). The usage is consistent and breaks down into:
    • “Active” – the value can be assigned to new/modified content; should be displayed in any type of search UI (say as a pick list) if appropriate; and should be displayed if a user views the taxonomic tagging of an item.
    • “Inactive” – the value should not be able to be assigned to new content or be newly assigned to existing content; it should be displayed in search UIs (if appropriate) and should be displayed if a user views the taxonomic tagging of an item. Basically, it was valid at one point and still has value on content already tagged with it but we do not use it any more.
    • “Deleted” – We don’t delete values physically, but mark them “Deleted”. The value can not be assigned when creating or editing content, it should not be displayed in any search UI and it should not be displayed if a user views the taxonomic tagging of a piece of content. Basically, the value is no longer in the taxonomy (though some systems may still have the value associated with content in some ways).
    • “Proposed” – The first status for most items. The value would only be in the Taxonomy system itself and would not propagate to other systems. Indicates that it’s being considered for adding but has not yet been approved.
  • A set of Classification Levels – Some classifications have an internal structure, described below in the “Level” section.
  • Localizations of Classification – There may be non-English translations of the name and description of a classification in the taxonomy database (see below for more about multiple languages).
  • A set of Classification Values – Most classifications have a set of explicit values that can be associated with a piece of content. The values might be a flat list or might be hierarchical. The taxonomy database supports both. Currently, we do not support any type of many-to-many relationship or relationships across Classifications – just a simple one-to-many within a Classification which is a value / sub-value relationship (some Classifications provide more explicit constraints on the intended meaning of the relationship). Also, we do not have a construct that allows for an explicit (in the taxonomy database) meaning for any given relationship (specifically, narrower-than, broader-than, etc.). It’s implicit in the structure of the values.
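
The status rules in the list above amount to a small lookup table that a consuming system could apply. The sketch below is one possible encoding; the names are hypothetical, and the row for “Proposed” is an inference from the statement that proposed values do not propagate outside the taxonomy system.

```python
STATUS_RULES = {
    # status: (assignable to content, shown in search UIs, shown on tagged items)
    "Active": (True, True, True),
    "Inactive": (False, True, True),
    "Deleted": (False, False, False),
    # Assumption: proposed values never reach consuming systems at all.
    "Proposed": (False, False, False),
}

def can_assign(status):
    """True if the value may be assigned to new or modified content."""
    return STATUS_RULES[status][0]

def show_in_search(status):
    """True if the value may appear in a search UI such as a pick list."""
    return STATUS_RULES[status][1]

def show_on_item(status):
    """True if the value is displayed when a user views an item's tagging."""
    return STATUS_RULES[status][2]
```

Keeping the rules in one table makes it easier for every consuming system to treat the statuses the same way.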

Given the definition of a Classification as above, the terminology we use is that the taxonomy is, itself, the set of all Classifications we have defined and which can be used to tag content. As with Classification itself, this is not, I think, consistent with standard usage (the hierarchical structure within any one Classification would be considered a taxonomy) but adopting this definition at least got us organizationally out of the confusion of how we have a taxonomy when all of the values are not in a single, strict hierarchy.


Value

A Value is a single (usually textual, though might be dates or numbers) term which can be associated with a piece of content. Values are grouped into Classifications. A value association to a piece of content is what connects that piece of content to the taxonomy.

Like a Classification, a Value has a structure, which is only used when the Classification provides explicit values:

  • ID – the unique identifier within the taxonomy that identifies the value. Most systems using the taxonomy will store this ID as the association (and not the textual value). This allows for the Value to have its textual representation changed without having to revisit any content (say a product name changes or a country’s name changes)
  • Structure details – What classification this value is associated with and which value in this Classification (if any) is the parent of this value. Also, some values have a designated “Level” (see below for more on that).
  • Value – the textual representation of this value. The string users will see and interpret as the “value”.
  • Definition – the definition of this value. As with the classifications, forcing this to be clearly defined provides a good “buffer” against people requesting values to be added that are duplicative or not generally useful. I’m surprised by how often asking a requestor for a clear definition (and how it’s different from another value that seems similar) stops them in their tracks.
  • Life History – same as the Classifications
  • Source System ID – For Classifications whose values come from another system, we maintain the source system’s ID so we can associate it back to the source system for updates. This can also be used by systems that pull from the taxonomy and also might happen (for other business reasons) to pull data from the same source systems and allows those systems to cross between the two sets of values.
  • Status – Same as for Classifications
  • Synonyms – Same as for Classifications but applied to the individual values. Synonyms for values are much more common than synonyms for classifications. Systems using the synonyms can potentially do many different things with synonyms (displaying them while a content manager is associating values with content, supporting search on them, etc.)
  • Localization of Value and Definition – Non-English translations of the value and definition. See below for more details.
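
A compact way to picture the Value structure, and why storing the ID rather than the text pays off, is the following sketch. The field names loosely follow the list above but are assumptions for this example, as are the sample value and the `content_tags` mapping; this is not the actual database schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Value:
    id: int
    classification_id: int
    name: str
    definition: str = ""
    parent_id: Optional[int] = None         # parent value within the same classification
    status: str = "Proposed"
    source_system_id: Optional[str] = None  # set when the value is synchronized from a source system
    synonyms: list = field(default_factory=list)

# Hypothetical data: a product value, and content tagged by ID rather than by name.
values = {7: Value(id=7, classification_id=1, name="Widget Pro", status="Active")}
content_tags = {"doc-42": [7]}

def tag_names(content_id):
    """Resolve the stored value IDs to their current display names."""
    return [values[value_id].name for value_id in content_tags[content_id]]
```

If the product is later renamed, only `values[7].name` changes; `content_tags` is untouched and `tag_names("doc-42")` picks up the new name on the next lookup.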


Level

Within a single Classification, we have adopted a mechanism we refer to as a “Level” in order to have a structure within the Classification when it’s meaningful to have different Values grouped into semantically different sets. I think of this as the means by which we support a structure of Classifications.

A good example is Geography. We have a single classification for Geography which contains all necessary values for tagging content for geographic relevance (or irrelevance in some cases). However, each Value within that Classification might represent a different type of Geography. Some values are regions of the world (“North America” or “EMEA”); some values are Countries (“France” or “Japan”); and some might be areas within a country of use (“Midwest United States”).

A Level is a hierarchy of terms within a Classification and any given Value can be assigned to a Level.

The value of this is that systems using the taxonomy can provide user interfaces that group similar values (a nested, tree-style interface, say) while we do not need to have multiple Classifications with relationships across the Classifications to support this.
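
As a small illustration of how a consuming system might use Levels to group values for display, consider the sketch below. The level names (“Region”, “Country”, “Area”) are invented labels for the three kinds of Geography values mentioned above.

```python
from collections import defaultdict

# (value name, level name) pairs; the level names are illustrative only.
geography = [
    ("North America", "Region"),
    ("EMEA", "Region"),
    ("France", "Country"),
    ("Japan", "Country"),
    ("Midwest United States", "Area"),
]

def group_by_level(values):
    """Group value names under the level each one is assigned to."""
    groups = defaultdict(list)
    for name, level in values:
        groups[level].append(name)
    return dict(groups)
```

All of these values still live in the single Geography classification; the grouping is derived from the level assignment alone.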

Multiple Languages

In order to support multiple languages on our web sites, we have provided a means to localize the entire taxonomy. Because localized content is a critical component of our customer-facing site, we provide a structure so that all text that can be used outside of the taxonomy (primarily things like the names and definitions of Classifications, the name and definition for Values, Level names, and even synonyms of each of these) can be localized.

Systems that pull from the taxonomy can then use the available localized terms in their displays (falling back to English if a particular term is not available in a specific language). This could be used in field labels on forms or navigation labels in a browsing interface, menu items, etc.

Audit Events

As I mentioned in my post on a vision for an enterprise taxonomy, the taxonomy should provide transparency and allow interested users to examine the history of changes within the taxonomy. This is accomplished by maintaining a history of audit events which can be associated with any of the entities within the taxonomy (classifications, values, levels, etc). Each event is pretty simple:

  • Event type – the type of event that occurred (addition of a new entity, modification of an entity, etc.)
  • Event description – a longer (description) field describing the event. For entities added / modified manually (as opposed to changes via feed from another system) this comment will almost always include a reference to the bug (in our bug database) that describes the change more fully.
  • Date / time of event – When the event occurred
  • User who triggered the event – Who triggered the event
  • Associated entity (the value, classification, level, etc. that changed) – what was changed.

With the above, when a user views the taxonomy, they can see the full lifecycle of any given entity in the taxonomy.

The processes that pull taxonomy values from source systems also populate events, so we are gathering these for automated and manually maintained values.

All together, this helps provide interested users with some confidence in what’s changing and why it’s changing. In addition, it provides the ability (not exercised) to measure “turbulence” in the taxonomy – amount of change over time, etc.
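
To show what measuring “turbulence” could look like, here is a minimal sketch that models an audit event with the fields listed above and counts events per month. The class and field names are assumptions for the example, and counting events per month is just one possible definition of turbulence.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AuditEvent:
    event_type: str        # e.g. "add" or "modify" (labels are illustrative)
    description: str       # free text, often referencing a bug number
    occurred_at: datetime
    user: str
    entity_type: str       # "classification", "value", "level", ...
    entity_id: int

def turbulence_by_month(events):
    """Count audit events per (year, month) as a rough measure of change."""
    return Counter((e.occurred_at.year, e.occurred_at.month) for e in events)
```

The same events could just as easily be counted per classification or per user, depending on which kind of change you want to watch.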

Up next, I’ll describe the XML schema we use for publishing from the taxonomy.

Enterprise Taxonomy – The Structure

Monday, January 12th, 2009

(Editor’s note – I started this several weeks ago and managed to get myself busy with a lot of other things in the meantime and am finally getting back to it now. Apologies for the lengthy pause in the discussion.)

In my last post, I described the vision we developed for our taxonomy and provided a little bit of insight on how it’s managed. I thought some might find it interesting to understand the structure within the taxonomy at a deeper level.

When we initiated our taxonomy effort, we started (as I think most do) by collecting a lot of the language used throughout our enterprise in a big spreadsheet. We went through the language and organized it into a variety of facets and for many of those facets, we organized the values into a hierarchy. We managed the taxonomy in a spreadsheet for a while with some success but there were problems (of course):

  1. It was not possible to actually do any meaningful integration from a spreadsheet into any systems (to use the taxonomy);
  2. It was always a challenge to ensure people had access to the most recent view of the taxonomy;
  3. It was hard to meaningfully integrate the taxonomy with source systems that provide many of our labels in the taxonomy (to pull in values from those source systems).

Given this challenge and a developer resource and some good insights about what the taxonomy needed to do, we have created a relatively simple application that has enabled the taxonomy to be much more visible and also much more directly integrated with other systems. Note: It’s very likely that a commercial product would provide what we’ve done and a lot more, but when we set out on this it was not feasible to spend “hard” money on this, so we spent “soft” money in the form of a developer’s time. Perhaps not the best strategy but it’s been successful for our needs so far.

Given the above challenges we had with the “spreadsheet approach”, my primary interest was to solve the problems of access, display and integration and I was not interested in a system that provided a UI for maintaining the taxonomy (that was also supported by the fact that I’ve strived to have most of the taxonomy sourced from business systems and that the management of the other values has primarily been a one-person job and that person was familiar with databases and could update directly).

So, the taxonomy system comprises the following components:

  1. A SQL database (built in MySQL to be specific);
  2. A web application that provides a view of what is in the database – basically a mirror of the database structure which is described below;
  3. A set of processes that run on schedules to pull data from source systems into the taxonomy;
  4. An XML output following a formal(ish) specification to allow other systems to pull values from the taxonomy.

In my next post (possibly later today, even), I’ll provide more details on the structure – closer to a data model for the bits and pieces that comprise the entire taxonomy.