Having written about what I consider to be the principles of enterprise search, about people search in the enterprise, about search analytics, and about several other search-related topics in some detail, I thought I would share some insights on a role I have called search analyst – the person (or people) responsible for the care and feeding of an enterprise search solution. The purpose of this post is to share some thoughts and experiences that might help others facing a problem similar to the one my team faced several years back: we had a search solution in place that no one was maintaining, and we needed to figure out what to do to improve it.
Regarding the name of the role – when this role first came into being in my company, I did not know exactly what to call it, but we started using the term search analyst because it related to the domain (search) and reflected the fact that the role was detailed (analytical) but not a technical job like a developer. I’ve since heard the term used by others, so it seems to be fairly common terminology now – though it’s possible I actually heard the term before we started using it and have simply muddled the timeline in my head!
What does a search analyst do for you? The short answer is that a search analyst is the point person for improving the quality of results in your search solution. The longer answer is that a search analyst needs to:
In order to define success for a search analyst, you need to set some specific objectives. Ultimately, given the job description, these translate to measuring how successful the search analyst has been in improving search, but here are some specific suggestions for how you might measure that:
Another common question I’ve received is: what percentage of their time should a search analyst expect to spend on this type of work? Some organizations may have search needs large enough to warrant multiple full-time people on this task, but we are not such an organization, and I suspect many others are in the same situation. So you might have someone who splits their time among several roles, of which this is just one.
I don’t have a full answer to the question because, ultimately, it will depend on the value your organization places on search. My experience has been that in an organization of approximately 5,000-6,000 users (employees), covering a total corpus of about a million items spread across several dozen sites / applications / repositories, spending about 0.25 FTE on search analyst tasks seems to provide for steady improvements and progress.
Spending less than that (down to about 0.1 FTE), I’ve found, results in a “steady state” – no real improvements, but at least the solution does not seem to degrade. Obviously, spending more could produce better improvements, but I find that dependence on others (content owners, application owners, etc.) can be a limiting factor in effectiveness – full organizational support for the search analyst’s efforts (giving them a voice in the prioritization of work) can help alleviate that. (A search analyst with a software development background may find this less of an issue since, depending on your organization, they may be less tied to development resources than they would otherwise be, though this also likely raises the FTE commitment.)
The above description is worded as if your organization has a single person focused on search analyst responsibilities. It might also be useful to spread the responsibility among multiple people. One reason would be that your enterprise’s search solution is large enough to warrant a team of people instead of a single person. A second would be that it can be useful to have different search analysts focused (perhaps still part time for each of them) on different content areas. In this second situation, you will want to be careful about how “territorial” the search analysts are, especially in the face of significant new content sources (you want to ensure that someone takes responsibility for the findability of that content).
So far I’ve provided a description of the role of a search analyst, suggestions for objectives you can assign to a search analyst and at least an idea of the time commitment you might expect to have an effective search analyst. But, if you were looking to staff such a position, what kinds of skills should you look for? Here are my thoughts:
If your search needs warrant more than one person focused on improving your enterprise search solution, as much overlap in the above as feasible is good, though you may have team members specializing in some skills while others focus on other areas.
Another important issue to address is where in your overall organization should the search analyst responsibility rest? I don’t have a good answer for this question and am interested in others’ opinions. My own experiences:
Enough about my own insights – what does anyone else have to share about how you perceive this role? Where does it fit in your organization? What are your objectives for this role?
In my previous two posts, I’ve written about some basic search analytics and then some more advanced analysis you can also apply. In this post, I’ll write about the types of analysis you can (and should) be doing on data captured about the usage of search results from your search solution. This could largely have been folded into the “advanced” analytics topic, but for our search solution this capability is not built in and was implemented only in the last year through some custom work, so it feels different enough (to me), and has enough detail of its own, that I decided to break it out.
When I first started working on our search solution and dug into the reports and data we had available about search behavior, I found we had things like:
and much more. However, I was frustrated by this because it did not give me a very complete picture. We could see the searches people were using – at least the top searches – but we could get no indication of “success”, or even of what people found useful in search. The closest we got from the reports was the last item listed above, which in a typical report might look something like:
[Table: “Search Results Pages” – for each page number of results, the percentage of searches in which a user viewed that page]
However, all this really reflects is the percentage of searchers who visited each page number of the results – so 95% of users never go beyond page 1, and the engine assumes that means they found what they wanted there. That’s a very bad assumption, obviously.
I wanted to be able to understand what people were actually clicking on (if anything) when they performed a search! I ended up solving this with a very simple solution (simple once I thought of it, anyway), which I believe emulates what Google (and probably many other search engines) do. I built a simple servlet that takes a number of parameters, including an (encoded) URL and various pieces of data about a search result target, stores an event built from those parameters in a database, and then forwards the user to the desired URL. The search results page was then updated to point results at that servlet instead of directly at the target URL. That’s been in place for a while now, and the data is extremely useful!
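For anyone curious, here is a minimal sketch of what such a servlet might look like. The class name, parameter names, and the logging placeholder are illustrative, not our exact implementation:

```java
import java.io.IOException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Records a search-result click, then forwards the user to the real target.
public class SearchClickServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // The encoded target URL, plus data about the search and the result.
        String target = URLDecoder.decode(req.getParameter("url"),
                StandardCharsets.UTF_8.name());
        String query = req.getParameter("q");      // the search terms used
        String position = req.getParameter("pos"); // rank of the clicked result
        String page = req.getParameter("page");    // results page number

        storeClickEvent(query, target, position, page);

        // Send the user on to the result they actually wanted.
        resp.sendRedirect(target);
    }

    private void storeClickEvent(String query, String target,
                                 String position, String page) {
        // In a real implementation this would be a database insert (JDBC, etc.);
        // logging stands in for that here.
        getServletContext().log(String.format(
                "click: q=%s pos=%s page=%s url=%s", query, position, page, target));
    }
}
```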
By way of explanation, the following are the data elements being captured for each “click” on a search result:
This data provides a lot of insight into behavior. You can guess what someone might be looking for from the searches they perform, but you can come much closer to understanding what they’re really looking for from what they actually accessed. Of course, it’s important to remember that a click does not necessarily mean the user found what they were looking for – it may only indicate which result looked most attractive to them – so there is still some uncertainty in interpreting this.
While I ended up having to do some custom development to achieve this, some search engines will capture this type of data, so you might have access to all of this without any special effort on your part!
Also – I assume it would be possible to capture a lot of this using a standard web analytics tool. I had several discussions with our web analytics vendor about this, but resource constraints kept it from being implemented, and it also seemed to depend in part on the target of the click being instrumented in the right way (having JavaScript in it to capture the event). So any page without that (say, a web application whose template could not be modified) or any document (a PDF, for example) would likely not be captured correctly.
Given the type of data described above, here are some of the questions and actions you can take as a search analyst:
You can also combine data from this source with data from your web analytics solution to do some additional analysis. If you capture the search usage data in your web analytics tool (as I mention above should be possible), doing this type of analysis should be much easier, too!
Here’s a wrap (for now) on the types of actionable metrics you might consider for your search program. I’ve covered some basic metrics that just about any search engine should be able to support; some more complex metrics (requiring combining data from other sources, or some kind of processing of the data used for the basic metrics); and, in this post, some data and analysis that provide a more comprehensive picture of the overall flow of a user through your search solution.
There are a lot more interesting questions I’ve come up with in the time I’ve had access to the data described above and also with the data that I discussed in my previous two posts, but many of them seem a bit academic and I have not been able to identify possible actions to take based on the insights from them.
Please share your thoughts or, if you would, point me to any other resources you might know of in this area!
In my last post, I provided a description of some basic metrics you might want to look into using for your search solution (assuming you’re not already). In this post, I’ll describe a few more metrics that may take a bit more effort to pull together (depending on your search engine).
First up – there is quite a lot of insight to be gained from combining your search analytics data with your web analytics data. It is even possible to capture almost all of your search analytics in your web analytics solution, which makes this combination easier, though that can take work. For your external site, it’s also very likely that your web analytics solution will provide insight on the searches that lead people to your site.
A first useful piece of analysis you can perform is to review your top N searches, perform the same searches yourself and review the resulting top target’s usage as reported in your web analytics tool.
A second step would be to review your web analytics report for the most highly used content on your site. For the most highly utilized targets, determine the obvious searches that should expose those targets, then try those searches out and see where the highly used targets fall in the results.
Another fruitful area to explore is what people actually use from search results after they’ve done a search (do they click on the first item? The second? What is the most common target for a given keyword? And so on). I’ll post about this separately.
I’m sure there are other areas that could be explored here – please share if you have some ideas.
When I first got involved in supporting a search solution, I spent some time understanding the reports I got from my search engine. We had our engine configured to provide reports on a weekly basis, and the reports provided the top 100 searches for the week. All very interesting, and as we started out, we tried to understand (given limited time to invest) how best to use the insight from just these 100 searches each week.
We quickly realized that there was no really good, sustainable answer, and this was compounded by the fact that the engine reported two searches as different if there was *any* difference between them (even something as simple as a case difference – even though the engine itself ignores case when searching – go figure).
In order to see the forest for the trees, we decided it would be desirable to categorize the searches – to associate individual searches with a larger grouping that lets us focus at a higher level. The question was how best to do this.
Soon after trying to work out how to do this, I attended Enterprise Search Summit West 2007 and attended a session titled “Taxonomize Your Search Logs” by Marilyn Chartrand from Kaiser Permanente. She spoke about exactly this topic, and, more specifically, the value of doing this as a way to understand search behavior better, to be able to talk to stakeholders in ways that make more sense to them, and more.
Marilyn’s approach was to have a database (she showed it to me and I think it was actually in a taxonomy tool but I don’t recall the details – sorry!) where she maintained a mapping from individual search terms to the taxonomy values.
Since then, I’ve been working on the same type of structure and have made good headway. Further, I’ve also managed to capture every single search (not just the top N) into a SQL database, so it’s possible to view the “long tail” and categorize that as well. I still don’t have a good automated solution for anything like auto-categorizing the terms, but the level of re-use from one reporting period to the next is high enough that loading a new period’s data requires categorizing only part of it. [Updated 26 Jan 2009 to add the following] Part of the challenge is that you will likely want to apply many of the same textual conversions to your database of captured searches that are applied by your search engine – synonyms, stemming, lemmatization, etc. These conversions can help simplify the categorization of the captured searches.
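As a rough sketch of what those conversions might look like in code – the synonym map and the crude plural-stripping are deliberately simplistic placeholders for whatever your engine actually does:

```java
import java.util.Locale;
import java.util.Map;

// Normalizes a raw search string roughly the way the engine would, so that
// trivially different searches fold into the same bucket for categorization.
public final class SearchNormalizer {

    // Placeholder synonym map; a real one would mirror the engine's synonym list.
    private static final Map<String, String> SYNONYMS = Map.of(
            "hr", "human resources",
            "vacation", "time off");

    public static String normalize(String rawSearch) {
        // Case and whitespace differences should never count as distinct searches.
        String s = rawSearch.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        StringBuilder out = new StringBuilder();
        for (String word : s.split(" ")) {
            String mapped = SYNONYMS.getOrDefault(word, word);
            // Crude stand-in for stemming: strip a trailing plural 's'
            // (a real engine would use a proper stemmer or lemmatizer).
            if (mapped.endsWith("s") && mapped.length() > 3) {
                mapped = mapped.substring(0, mapped.length() - 1);
            }
            if (out.length() > 0) out.append(' ');
            out.append(mapped);
        }
        return out.toString();
    }
}
```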
Anyway – the types of questions this enables you to answer and why it can be useful include:
Another useful type of analysis you can perform on search data is to look at simple metrics of the searches themselves. Louis Rosenfeld identified several of these – I’m including those here along with a few additional thoughts.
[Chart: number of searches per word count]
[Chart: search length vs. number of searches]
Another interesting view of your search data is hinted at by the discussion above of “secondary” search words – words that are used in conjunction with other words. I have not yet managed to complete this view (lack of time and, frankly, the volume of data is a bit daunting with the tools I’ve tried).
The idea is to parse your searches into their constituent words and then build a network among the words, where each word is a node and the links between words represent the strength of the connection between them – “strength” being the number of times those two words appear in the same search.
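A minimal sketch of the counting step, which is the straightforward part (the class and names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Builds word co-occurrence counts from a list of search strings: each word is
// a node, and the count for a word pair is the weight of the edge between them.
public final class CooccurrenceCounter {

    public static Map<String, Integer> countPairs(List<String> searches) {
        Map<String, Integer> edgeWeights = new HashMap<>();
        for (String search : searches) {
            String[] words = search.toLowerCase().trim().split("\\s+");
            // Count each unordered pair of distinct words once per search.
            for (int i = 0; i < words.length; i++) {
                for (int j = i + 1; j < words.length; j++) {
                    if (words[i].equals(words[j])) continue;
                    // Canonical key so "a|b" and "b|a" are the same edge.
                    String key = words[i].compareTo(words[j]) < 0
                            ? words[i] + "|" + words[j]
                            : words[j] + "|" + words[i];
                    edgeWeights.merge(key, 1, Integer::sum);
                }
            }
        }
        return edgeWeights;
    }

    public static void main(String[] args) {
        List<String> searches = List.of(
                "expense report", "expense report form", "travel expense");
        // Prints each word pair and how often the two words appeared together.
        new TreeMap<>(countPairs(searches)).forEach(
                (pair, weight) -> System.out.println(pair + " -> " + weight));
    }
}
```

The output of something like this would then feed whatever graph visualization tool you have at hand.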
Having this available as a visual tool to explore words in search seems like it would be valuable as a way to understand their relationships and could give good insight on the overall information needs of your searchers.
The cost (in my own time, if nothing else) of taking the data and manipulating it into a format that could be explored this way, however, has been high enough to keep me from doing it without some more concrete ideas for what actionable steps I could take from the insight gained. I’m just not confident this would expose anything much more than “the most common words tend to be used together most commonly”.
I’m sure I’m missing a lot of interesting additional types of analysis above – feel free to share your thoughts and ideas.
In my next post, I’ll explore in some more detail the insights to be gained from analyzing what people are using in search results (not just what people are searching for).
In my first few posts (about a year ago now), I covered what I call the three principles of enterprise search – coverage, identity, and relevance. I have posted on enterprise search topics a few times in the meantime and wanted to return to the topic with some thoughts to share on search analytics and provide some ideas for actionable metrics related to search.
I’m planning three posts in this series – this first one will cover some of what I think of as the “basic” metrics, a second post will cover some more advanced ideas, and a third will focus on metrics related to the usage of search results (instead of just the searching behavior itself).
Before getting into the details, I also wanted to say that I’ve found a lot of inspiration in the writing and speaking of Louis Rosenfeld and Avi Rappoport, and I strongly recommend you look into their work. A specific webinar to share with you is “Site Search Analytics for a Better User Experience”, which Louis presented in a Search CoP webcast last spring. Good stuff!
Now onto some basic metrics I’ve found useful. Most of these are pretty obvious, but I guess it’s good to start at the start.
That’s all of the topics I have for “basic metrics”. Next up, some ideas (along with actions to take from them) on more complex search metrics. Hopefully, you find my recommendations for specific actions you can take on each metric useful (as they do tend to make the posts longer, I realize!).
Now that I’ve posted quite a bit on the technical side of an enterprise taxonomy, I thought I’d share a bit on the business process side of how we have managed our taxonomy.
I spoke about this topic at the 2007 Taxonomy Boot Camp. (As an aside, I tried to find if the presentation I used is available on the site but I couldn’t find it – if someone knows of an online archive, please let me know and I can provide a link from here.) The session I delivered was titled, “The Process and Politics of Implementing a Corporate Taxonomy” and focused on the overall process we have implemented.
What follows is an overview of the larger process we used to establish the taxonomy and a description of the smaller process used to maintain it, and I’ll close with some of my own thoughts on what it is that triggers changes in a taxonomy.
When we first started trying to formalize a taxonomy, one of the first steps we took was to do an organizational mapping to identify participants in the process. We focused on the following:
We felt that this organizational mapping was important because it would help increase buy-in to the taxonomy from those who have the most vested interest in it and also (with help from that last group) would help increase larger-scale adoption of the language. Once we felt we had identified the groups that met these criteria, we engaged with the groups’ executives to help us identify one or more people who could be included in our Taxonomy Review Board.
The rest of the “getting started” process included content audits and analyses to identify terminology used to describe the content, definition of the structure of the taxonomy we wanted to use, organization of the terminology into this structure and then working with the Taxonomy Review Board to confirm the end result as a first version of the (evolving) taxonomy.
We also laid out the objectives we had for the overall process – which you can find in my post on the vision we have developed for our taxonomy. The really pertinent objectives were that the taxonomy be actively managed and that the management process be transparent.
Now that the taxonomy had been established, we needed to identify the people and process we would use for maintaining and enhancing the taxonomy.
The people who are involved include:
This organization has helped to keep the taxonomy managed, while also keeping overall enterprise expense to manage it fairly small.
Now, I am, at heart, a software engineer. Why is this pertinent? Early in my career, I came to appreciate the need for and value of change control (or, as I prefer to think of it, change management or change visibility – I’ve always thought “control” implies more than you can really achieve), and that has seeped into our process.
At its heart, our process is similar to a software development team’s change control board (CCB) process:
While it has worked effectively, we still face a number of issues with this process. These include:
What triggers a change in the taxonomy?
As I (re-)gather my thoughts on this topic, one lingering question about the overall process comes back to me. The question is external to the process (which takes the approach of “a change comes from somewhere; we’re not going to worry about where it comes from, but once it’s been identified, we’ll fit it into this process”), but I am interested in understanding what other taxonomists might actively do in maintaining a taxonomy. In other words, how much change do you experience that comes from others, compared to your own recommendations or insights?
Here’s a list of triggers that have resulted in changes in the taxonomy:
In my continuing dive into the structure of our taxonomy – which, hopefully, might be of use or interest to you to understand and possibly adapt to your own needs – I’ve so far provided an outline of the application solution and then a high-level outline of the data model we’re using.
One of the important features of our solution is that our taxonomy system provides the ability for other systems to consume the taxonomy via an XML document. I’ll explore that a bit here.
Access to the XML document for the taxonomy is through a very simple means: a standard HTTP GET. The query string in the request can specify various parameters on the URL – effectively, a very simple web service. The types of parameters supported include:
With regard to the language parameter – one of the business rules followed on our web sites is that you provide content in the user’s selected language when available and return English when the user’s language is not available (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details and, say, 99% of the values in German, but any values that have not been translated are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it does present a possibly confusing same-page mix of non-English and English).
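By way of illustration, a consuming application needs very little code to pull the document – something like the following, assuming a hypothetical endpoint URL and parameter names (ours differ):

```java
import java.io.InputStream;
import java.net.URL;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

// Fetches the taxonomy as XML over a plain HTTP GET and parses it with
// standard JAXP; no special client libraries are required.
public final class TaxonomyClient {

    public static Document fetchTaxonomy(String lang, String detail) throws Exception {
        // Hypothetical endpoint and parameter names, for illustration only.
        URL url = new URL("http://intranet.example.com/taxonomy/export"
                + "?lang=" + lang + "&detail=" + detail);
        try (InputStream in = url.openStream()) {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // Request German values at the "brief" level of detail; untranslated
        // values come back in English per the fallback rule described above.
        Document taxonomy = fetchTaxonomy("de", "brief");
        System.out.println("Root element: "
                + taxonomy.getDocumentElement().getNodeName());
    }
}
```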
The returned XML document looks like the following. I’m not using any formal XML schema syntax – instead, I’m showing the elements, how they relate to each other, and a brief description of the elements that I don’t think are self-explanatory.
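To give a flavor of the result, here is a heavily abbreviated, purely illustrative example document – the element names approximate the structure described in these posts but are not our exact schema:

```xml
<!-- Illustrative only: element names approximate the structure described,
     not the exact schema. -->
<taxonomy lang="en" detail="brief">
  <classification id="geography" status="active">
    <name>Geography</name>
    <description>Geographic relevance of content</description>
    <level id="region">
      <name>Region</name>
      <value id="emea" status="active">
        <name>EMEA</name>
        <synonym>Europe, Middle East and Africa</synonym>
      </value>
    </level>
  </classification>
</taxonomy>
```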
And that’s the schema. Looks complicated, but it’s really pretty simple, I think. The advantage of this has been that consuming applications do not need to directly access the database containing this (which would be pretty simple in principle) and so can be insulated from changes in the underlying structure of the database as we need to make them.
Providing access via an HTTP GET keeps the technical cost minimal for consuming applications (they need to be able to read from an HTTP socket and then parse XML, both pretty standard functions in modern languages / libraries).
One last comment – in regard to the level-of-detail parameter mentioned above – the “brief” level includes only the names, descriptions, and statuses of the classifications, levels, and values. The “detailed” level includes all details except the changeHistory elements. The “complete” level includes everything, changeHistory included. The “complete” format is probably not very useful for consumers, as most will not care about the life history of elements (though that is of interest and value within the taxonomy).
Just to connect the dots – I know of other XML schemas that we could conceivably have used to publish this document. With help from the Taxonomy community of practice, I found the following while researching for a schema to use (I especially want to say thanks to Leonard Will, Mike Taylor, Marcel van Mackelenbergh and Bob Bater for their insights):
At the time we were designing (defining) a schema to use, we knew we wanted to keep it as simple as possible and (right or wrong) as close to the underlying model as we could, which made sense within our business environment. It wasn’t clear at the time which of the above might provide the most likely path forward (in terms of standard adoption), so we “rolled our own”. Another factor was that the schemas seemed far more general than our needs warranted; for example, the broader-than / narrower-than relations were implicit in our structure, and specifying them explicitly seemed confusing. (To be honest, all of this could be interpreted as “we weren’t educated enough to understand the options and took the simpler-at-the-time approach of rolling our own”.)
I am still not as familiar as I would like to be with the above, so I still could not say which would be most appropriate, but the SKOS schema, now in draft from the W3C, seems like a potential solution that would fit our needs and could eventually become a broader standard. Does anyone have any insight into where this is moving?
In my previous post, I started describing the structure of the taxonomy we are using in some detail; originally, the following was part of my last post but it got a bit too long so I’ve split it. In this post, I’ll explore the structure in yet more detail – getting closer to a data model.
If you are going through a similar process that we’ve been through and you want to organize your taxonomy in a database, this might provide you with enough detail to get moving.
One note on terminology – much of what we have used is not what I would consider “standard” among taxonomists but was derived during a period when we had numerous systems we were trying to pull together, each of which used one of many different terms – categories, attributes, metadata, fields, tags, etc. I was charged at that point (which was before we started digging into the details of defining an enterprise taxonomy) with trying to define some terms we could all use so that we could at least understand each other. A taxonomy for taxonomies, I guess.
The primary construct in the taxonomy is called a “Classification”. A better term, I now know, would be “Facet”, as that’s what they are. The intent is that a Classification is a specific set of values (perhaps explicitly defined, or perhaps defined by a set of guidelines or business rules) with which pieces of content can be associated (they can be tagged with values from the classification).
In our schema, a Classification itself has a number of elements:
Given the definition of a Classification above, the terminology we use is that the taxonomy is, itself, the set of all Classifications we have defined and which can be used to tag content. As with “Classification” itself, this is not, I think, consistent with standard usage (strictly, the hierarchical structure within any one Classification would be considered a taxonomy), but adopting this definition at least got us, organizationally, out of the confusion of how we can have a taxonomy when all of the values are not in a single, strict hierarchy.
A Value is a single term (usually textual, though it might be a date or a number) which can be associated with a piece of content. Values are grouped into Classifications. A value association with a piece of content is what connects that content to the taxonomy.
Like a Classification, a Value has a structure, which is only used when the Classification provides explicit values:
Within a single Classification, we have adopted a mechanism we refer to as a “Level” in order to have a structure within the Classification when it’s meaningful to have different Values grouped into semantically different sets. I think of this as the means by which we support a structure of Classifications.
A good example is Geography. We have a single classification for Geography which contains all the values necessary for tagging content for geographic relevance (or irrelevance, in some cases). However, each Value within that Classification might represent a different type of geography: some values are regions of the world (“North America” or “EMEA”); some are countries (“France” or “Japan”); and some might be areas within a country (“Midwest United States”).
A Level is a hierarchy of terms within a Classification and any given Value can be assigned to a Level.
The value of this is that systems using the taxonomy can provide user interfaces that group similar values (a nested, tree-style interface, say) without our needing multiple Classifications, and relationships across them, to support this.
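For those who think better in code, here is a bare-bones sketch of how the three constructs relate. The field names are illustrative; our actual model carries the additional elements described above:

```java
import java.util.ArrayList;
import java.util.List;

// Bare-bones sketch of the taxonomy constructs described above.
public class TaxonomySketch {

    // A Classification (really a facet): a named set of values with which
    // content can be tagged.
    static class Classification {
        String name;
        String definition;
        List<Level> levels = new ArrayList<>();
        List<Value> values = new ArrayList<>();
    }

    // A Level groups semantically similar values within a Classification
    // (e.g. "Region" vs. "Country" within Geography) and can nest.
    static class Level {
        String name;
        Level parent; // null for a top-level Level
    }

    // A Value is a single term that can be associated with content.
    static class Value {
        String term;
        Level level; // optional assignment to a Level
    }
}
```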
In order to support multiple languages on our web sites, we have provided a means to localize the entire taxonomy. Because localized content is a critical component of our customer-facing site, we provide a structure so that all text that can be used outside of the taxonomy (primarily things like the names and definitions of Classifications, the name and definition for Values, Level names, and even synonyms of each of these) can be localized.
Systems that pull from the taxonomy can then use the available localized terms in their displays (falling back to English if a particular term is not available in a specific language). This could be used in field labels on forms or navigation labels in a browsing interface, menu items, etc.
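The fallback rule itself is trivial to implement – a sketch, assuming the localized strings for a term are kept in a per-language map (the names here are illustrative):

```java
import java.util.Map;

// Returns the localized text for a taxonomy term, falling back to English
// when no translation exists for the requested language.
public final class LocalizedText {
    private final Map<String, String> textByLanguage; // e.g. "en" -> "Geography"

    public LocalizedText(Map<String, String> textByLanguage) {
        this.textByLanguage = textByLanguage;
    }

    public String forLanguage(String lang) {
        // English is required to always be present, so it is a safe default.
        return textByLanguage.getOrDefault(lang, textByLanguage.get("en"));
    }
}
```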
As I mentioned in my post on a vision for an enterprise taxonomy, the taxonomy should provide transparency and allow interested users to examine the history of changes within the taxonomy. This is accomplished by maintaining a history of audit events which can be associated with any of the entities within the taxonomy (classifications, values, levels, etc). Each event is pretty simple:
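Here is a rough sketch of the sort of fields such an event carries (the names are my shorthand for this post, not our exact implementation):

```java
import java.time.Instant;

// Illustrative sketch of an audit event attached to a taxonomy entity.
public class AuditEvent {
    String entityId;   // the classification, level, or value affected
    String action;     // e.g. "created", "renamed", "retired"
    String actor;      // who (or which automated process) made the change
    String comment;    // free-text rationale for the change
    Instant timestamp; // when the change occurred
}
```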
With the above, when a user views the taxonomy, they can see the full lifecycle of any given entity in the taxonomy.
The processes that pull taxonomy values from source systems also populate events, so we are gathering these for automated and manually maintained values.
Altogether, this helps provide interested users with some confidence in what’s changing and why. In addition, it provides the ability (not yet exercised) to measure “turbulence” in the taxonomy – the amount of change over time, etc.
Up next, I’ll describe the XML schema we use for publishing from the taxonomy.
(Editor’s note – I started this several weeks ago and managed to get myself busy with a lot of other things in the meantime and am finally getting back to it now. Apologies for the lengthy pause in the discussion.)
In my last post, I described the vision we developed for our taxonomy and provided a little bit of insight on how it’s managed. I thought some might find it interesting to understand the structure within the taxonomy at a deeper level.
When we initiated our taxonomy effort, we started (as I think most do) by collecting a lot of the language used throughout our enterprise in a big spreadsheet. We went through the language and organized it into a variety of facets and for many of those facets, we organized the values into a hierarchy. We managed the taxonomy in a spreadsheet for a while with some success but there were problems (of course):
Given these challenges, a developer resource, and some good insights about what the taxonomy needed to do, we created a relatively simple application that has made the taxonomy much more visible and also much more directly integrated with other systems. Note: it’s very likely that a commercial product would provide what we’ve done and a lot more, but when we set out on this, it was not feasible to spend “hard” money, so we spent “soft” money in the form of a developer’s time. Perhaps not the best strategy, but it has been successful for our needs so far.
Given the above challenges with the “spreadsheet approach”, my primary interest was to solve the problems of access, display, and integration; I was not interested in a system that provided a UI for maintaining the taxonomy. (That decision was also supported by the fact that I’ve strived to have most of the taxonomy sourced from business systems, and that managing the other values has primarily been a one-person job – and that person was familiar with databases and could update them directly.)
So, the taxonomy system comprises the following components:
In my next post (possibly later today, even), I’ll provide more details on the structure – closer to a data model for the bits and pieces that comprise the entire taxonomy.