Having written in some detail about what I consider to be the principles of enterprise search, about people search in the enterprise, about search analytics, and about several other search-related topics, I thought I would share some insights on a role I have called search analyst – the person (or people) responsible for the care and feeding of an enterprise search solution. The purpose of this post is to share some thoughts and experiences that might help others facing a problem similar to the one my team faced several years back: we had a search solution in place that no one was maintaining, and we needed to figure out what to do to improve it.
Regarding the name of the role: when it first came into being in my company, I did not know exactly what to call it, but we started using the term search analyst because it related to the domain (search) and reflected the fact that the role was detailed (analytical) but was not a technical job like a developer. Subsequently, I've heard the term used by others, so it seems to be fairly common terminology now. It's possible that by now I've muddled the timeline enough in my head that I had actually heard the term before using it and just don't recall it!
What does a search analyst do for you? The short answer is that a search analyst is the point person for improving the quality of results in your search solution. The longer answer is that a search analyst needs to:
In order to define success for a search analyst, you need to set some specific objectives. Ultimately, given the job description, these objectives come down to measuring how successful the search analyst has been in improving search, but here are some specific suggestions for how you might measure that:
Another common question I've received is: what percentage of their time should a search analyst expect to spend on this type of work? Some organizations may have search needs large enough to warrant multiple full-time people on this task, but we are not such an organization, and I suspect many others are in the same situation. So you might have someone who splits their time among several roles, with this being just one of them.
I don't have a full answer to that question because, ultimately, it will depend on the value your organization places on search. My experience has been that in an organization of approximately 5,000-6,000 users (employees), with a total corpus of about a million items spread across several dozen sites, applications, and repositories, spending about 0.25 FTE on search analyst tasks provides for steady improvements and progress.
Spending less than that (down to about 0.1 FTE), I've found, results in a "steady state" – no real improvements, but at least the solution does not seem to degrade. Obviously, spending more than that could result in greater improvements, but I find that dependence on others (content owners, application owners, etc.) can be a limiting factor in effectiveness; full organizational support for the efforts of the search analyst (giving the search analyst a voice in the prioritization of work) can help alleviate that. (A search analyst with a software development background may find this less of an issue since, depending on your organization, you may be less tied to development resources than you would otherwise be, though this also likely raises your own FTE commitment.)
The above description is worded as if your organization has a single person focused on search analyst responsibilities. It can also be useful to spread the responsibility among multiple people. One reason would be that your enterprise's search solution is large enough to warrant a team instead of a single person. A second is that it can be useful to have different search analysts (perhaps each still part time) focused on different content areas. In that second situation, you will want to be careful about how "territorial" the search analysts become, especially in the face of significant new content sources – you want to ensure that someone takes on responsibility for the findability of that new content.
So far I've provided a description of the role of a search analyst, suggestions for objectives you can assign to a search analyst, and at least an idea of the time commitment you might expect for an effective search analyst. But if you were looking to staff such a position, what kinds of skills should you look for? Here are my thoughts:
If your search needs warrant more than one person focused on improving your enterprise search solution, as much overlap in the above as feasible is good, though you may have team members specializing in some skills while others focus on other areas.
Another important issue to address is where in your overall organization should the search analyst responsibility rest? I don’t have a good answer for this question and am interested in others’ opinions. My own experiences:
Enough about my own insights – what does anyone else have to share about how you perceive this role? Where does it fit in your organization? What are your objectives for this role?
In my previous two posts, I've written about some basic search analytics and then about some more advanced analysis you can also apply. In this post, I'll write about the types of analysis you can and should be doing on data captured about the usage of search results from your search solution. This could largely fall under the "advanced" analytics topic, but for our search solution it is not built into the engine and has been implemented only in the last year through some custom work, so it feels different enough (to me), and has enough detail within it, that I decided to break it out.
When I first started working on our search solution and dug into the reports and data we had available about search behavior, I found we had things like:
and much more. However, I was frustrated because this did not give me a very complete picture. We could see the searches people were using – at least the top searches – but we could not get any indication of "success" or even of what people found useful in search. The closest we got from the reports was the last item listed above, which in a typical report might look something like:
Search Results Pages
However, all this really reflects is how often searchers visited each page number of results – so 95% of users never go beyond page 1, and the engine assumes that means they found what they wanted there. That's a very bad assumption, obviously.
I wanted to be able to understand what people were actually clicking on (if anything) when they performed a search! I ended up addressing this with a very simple solution (simple once I thought of it), which I believe emulates what Google (and probably many other search engines) does. I built a simple servlet that takes a number of parameters, including an (encoded) URL and various pieces of data about a search result target; it stores an event in a database from those parameters and then forwards the user to the desired URL. The search results page was then updated to point each result at that servlet instead of directly at the target. That's been in place for a while now and the data is extremely useful!
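For those who want a more concrete picture, here is a minimal sketch of that kind of capture-then-redirect servlet. It is not my actual implementation – the parameter names and the logClick() persistence method are illustrative assumptions – but it shows the pattern:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal sketch of a click-tracking redirect servlet. The parameter names
// and the logClick() persistence call are illustrative assumptions, not the
// actual implementation described above.
public class SearchClickServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // The link on the results page passes the target URL (URL-encoded) plus
        // details about the result; the servlet container decodes parameters for us.
        String targetUrl = req.getParameter("url");   // the clicked result's real URL
        String query     = req.getParameter("q");     // the search the user performed
        String rank      = req.getParameter("rank");  // position of the result in the list
        String page      = req.getParameter("page");  // results page number

        logClick(query, targetUrl, rank, page);       // store the click event

        // Send the user on to the result they actually wanted.
        resp.sendRedirect(targetUrl);
    }

    private void logClick(String query, String url, String rank, String page) {
        // Persist the event (a simple JDBC INSERT into a clicks table is enough);
        // omitted here to keep the sketch short.
    }
}
```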
By way of explanation, the following are the data elements being captured for each “click” on a search result:
This data provides a lot of insight into behavior. You can guess what someone might be looking for by understanding the searches they perform, but you can come a lot closer to understanding what they're really looking for by seeing what they actually accessed. Of course, it's important to remember that a click does not necessarily mean the user found what they were looking for – it may only indicate which result looked most attractive to them – so there is still some uncertainty in interpreting this.
While I ended up having to do some custom development to achieve this, some search engines will capture this type of data, so you might have access to all of this without any special effort on your part!
Also, I assume it would be possible to capture a lot of this using a standard web analytics tool. I had several discussions with our web analytics vendor about this, but resource constraints kept it from being implemented, and it also seemed to depend in part on the target of the click being instrumented in the right way (having JavaScript in it to capture the event). Any page without that instrumentation (say, a web application whose template could not be modified) or any document (a PDF, for example) would likely not be captured correctly.
Given the type of data described above, here are some of the questions and actions you can take as a search analyst:
You can also combine data from this source with data from your web analytics solution to do some additional analysis. If you capture the search usage data in your web analytics tool (as I mention above should be possible), doing this type of analysis should be much easier, too!
Here's a wrap (for now) on the types of actionable metrics you might consider for your search program. I've covered some basic metrics that just about any search engine should be able to support; some more complex metrics (requiring combining data from other sources or some kind of processing of the data used for the basic metrics); and, in this post, some data and analysis that provide a more comprehensive picture of the overall flow of a user through your search solution.
There are a lot more interesting questions I’ve come up with in the time I’ve had access to the data described above and also with the data that I discussed in my previous two posts, but many of them seem a bit academic and I have not been able to identify possible actions to take based on the insights from them.
Please share your thoughts or, if you would, point me to any other resources you might know of in this area!
In my last post, I provided a description of some basic metrics you might want to look into using for your search solution (assuming you’re not already). In this post, I’ll describe a few more metrics that may take a bit more effort to pull together (depending on your search engine).
First up – there is quite a lot of insight to be gained from combining your search analytics data with your web analytics data. It is even possible to capture almost all of your search analytics in your web analytics solution, which makes this combination easier, though that can take some work. For your external site, it's also very likely that your web analytics solution will provide insight into the searches that lead people to your site.
A first useful piece of analysis is to review your top N searches, perform the same searches yourself, and review the usage of the resulting top targets as reported in your web analytics tool.
A second step is to review your web analytics report for the most highly used content on your site. For the most highly utilized targets, determine the obvious searches that should expose them, then try those searches and see where the targets fall in the results.
Another fruitful area to explore is what people actually use from search results after they've done a search (do they click on the first item? the second? what is the most common target for a given keyword? etc.). I'll post about this separately.
I’m sure there are other areas that could be explored here – please share if you have some ideas.
When I first got involved in supporting a search solution, I spent some time understanding the reports I got from my search engine. We had our engine configured to provide reports on a weekly basis, and the reports provided the top 100 searches for the week. All very interesting, and as we started out, we tried to understand (given limited time to invest) how best to use the insight from just these 100 searches each week.
We quickly realized that there was no really good, sustainable answer, and this was compounded by the fact that the engine reported two searches as different if there was *any* difference between them (even something as simple as a difference in case, even though the engine itself ignores case when searching – go figure).
In order to see the forest for the trees, we decided it would be desirable to categorize the searches – to associate individual searches with a larger grouping that allows us to focus at a higher level. The question was how best to do this.
Soon after trying to work out how to do this, I attended Enterprise Search Summit West 2007, including a session titled "Taxonomize Your Search Logs" by Marilyn Chartrand from Kaiser Permanente. She spoke about exactly this topic – and, more specifically, about the value of doing it as a way to understand search behavior better, to be able to talk to stakeholders in ways that make more sense to them, and more.
Marilyn’s approach was to have a database (she showed it to me and I think it was actually in a taxonomy tool but I don’t recall the details – sorry!) where she maintained a mapping from individual search terms to the taxonomy values.
Since then, I've started working on the same type of structure and have made good headway. Further, I've also managed to capture every single search (not just the top N) in a SQL database, so it's possible to view the "long tail" and categorize that as well. I still don't have a good automated solution for anything like auto-categorizing the terms, but the level of re-use from one reporting period to the next is high enough that loading a new period's data requires categorizing only part of it. [Updated 26 Jan 2009 to add the following] Part of the challenge is that you will likely want to apply to your database of captured searches many of the same textual conversions that are applied by your search engine – synonyms, stemming, lemmatization, etc. These conversions can help simplify the categorization of the captured searches.
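To illustrate the kind of conversions I mean, here is a rough sketch of a normalizer you might run over captured searches before categorizing them. The synonym map and the crude suffix stripping are stand-ins for whatever your engine actually does; they are not a real stemmer or my actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of normalizing captured search terms before categorizing them.
// The synonym entries and the suffix stripping below are illustrative stand-ins
// for whatever conversions (synonyms, stemming, lemmatization) your engine applies.
public class SearchTermNormalizer {

    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("hr", "human resources");   // example synonym mapping (assumed)
        SYNONYMS.put("paystub", "pay stub");
    }

    public static String normalize(String rawSearch) {
        // Case and whitespace differences should not create "different" searches.
        String s = rawSearch.trim().toLowerCase().replaceAll("\\s+", " ");

        // Apply synonym substitutions and stemming word by word.
        StringBuilder out = new StringBuilder();
        for (String word : s.split(" ")) {
            String mapped = SYNONYMS.getOrDefault(word, word);
            out.append(stem(mapped)).append(' ');
        }
        return out.toString().trim();
    }

    // Extremely naive stemmer, standing in for real stemming/lemmatization.
    private static String stem(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }
}
```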
Anyway – the types of questions this enables you to answer and why it can be useful include:
Another useful type of analysis you can perform on search data is to look at simple metrics of the searches. Louis Rosenfeld identified several of these – I’m including those here and a few additional thoughts.
Chart of Searches per Word Count
Chart of Search Length to Number of Searches
Another interesting view of your search data is hinted at by the discussion above of "secondary" search words – words that are used in conjunction with other words. I have not yet managed to complete this view (for lack of time and, frankly, because the volume of data is a bit daunting with the tools I've tried).
The idea is to parse your searches into their constituent words and then build a network between the words, where each word is a node and the links between words represent the strength of the connection – "strength" being the number of times those two words appear in the same searches.
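A minimal sketch of building the underlying co-occurrence counts might look like this (the visualization itself is left to whatever graphing tool you prefer; the class and method names are just illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of building the word co-occurrence counts behind the network idea:
// each word is a node, and the count for a word pair is the edge weight.
public class SearchWordNetwork {

    // Key is "wordA|wordB" with the two words in alphabetical order.
    private final Map<String, Integer> edgeWeights = new HashMap<>();

    public void addSearch(String search) {
        String[] words = search.toLowerCase().trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
                if (words[i].equals(words[j])) continue;
                String key = words[i].compareTo(words[j]) < 0
                        ? words[i] + "|" + words[j]
                        : words[j] + "|" + words[i];
                edgeWeights.merge(key, 1, Integer::sum);   // increment the pair's count
            }
        }
    }

    public void addAll(List<String> searches) {
        searches.forEach(this::addSearch);
    }

    public Map<String, Integer> getEdgeWeights() {
        return edgeWeights;
    }
}
```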
Having this available as a visual tool to explore words in search seems like it would be valuable as a way to understand their relationships and could give good insight on the overall information needs of your searchers.
The cost (in my own time, if nothing else) of taking the data and manipulating it into a format that could be explored this way, however, has been high enough to keep me from doing it without some more concrete ideas for what actionable steps I could take from the insight gained. I'm just not confident that it would expose much more than "the most common words tend to be used together most commonly".
I’m missing a lot of interesting additional types of analyses above – feel free to share your thoughts and ideas.
In my next post, I’ll explore in some more detail the insights to be gained from analyzing what people are using in search results (not just what people are searching for).
In my first few posts (about a year ago now), I covered what I call the three principles of enterprise search – coverage, identity, and relevance. I have posted on enterprise search topics a few times in the meantime and wanted to return to the topic with some thoughts to share on search analytics and provide some ideas for actionable metrics related to search.
I'm planning three posts in this series – this first one will cover some of what I think of as the "basic" metrics, the second some more advanced ideas, and the third will focus on metrics related to the usage of search results (instead of just the searching behavior itself).
Before getting into the details, I also wanted to say that I've found a lot of inspiration in the writing and speaking of Louis Rosenfeld and Avi Rappoport, and I strongly recommend you look into their work. A specific webinar to share, presented by Louis in a Search CoP webcast last spring, is "Site Search Analytics for a Better User Experience". Good stuff!
Now onto some basic metrics I’ve found useful. Most of these are pretty obvious, but I guess it’s good to start at the start.
That's all of the topics I have for "basic metrics". Next up: some ideas (along with actions to take from them) on more complex search metrics. Hopefully you find my recommendations for specific actions to take on each metric useful (they do tend to make the posts longer, I realize!).
In an exchange in comments on Stephen Arnold’s blog, Stephen states the line that is the title of this post:
“the future is search enabled applications, not enterprise search”
I’m somewhat familiar with Stephen (I’ve seen him speak at a couple of conferences and also have followed his writing on his blog for some time), but I had actually not seen this declaration in the past (though Stephen says he’s accused of saying it too much).
In any event, I find this an interesting claim, and I think I would agree with the sentiment, but I also think it depends on how you look at it. As I wrote previously in trying to lay out what I think enterprise search is, its key aspects are that it's available to all members of the enterprise and that it covers all relevant content.
Down in the details, whether access to the enterprise search comes through embedding it in numerous locations or in one location does not, I believe, matter. In fact, as I wrote previously, embedding access at multiple points is probably ideal – let workers access search within the environment in which they work, regardless of what tool(s) they normally use to do their job.
On the other hand, if the expectation is that you can embed search in single applications and expect that search only within that application is sufficient, I do not think that is now or will in the future be sufficient. The information needs for any organization are diverse enough that no one application can realistically handle all of them – email, document management, CRM, support knowledge bases, intranets, policies, etc.
Thoughts?
In my last post, I described the goals I have tried to achieve with my proof of concept people search function. Here I will describe the design and implementation of this proof of concept.
Given those goals, here's the general outline of the design for this solution:
Initially, the web application directly queried the various source systems when generating a profile for a worker. That is not scalable and also limits the amount of processing you can do, so I designed a simple SQL database (implemented in MySQL) to contain the data. This database is essentially a data mart of worker data. The primary tables are:
With the implementation of this database, I also implemented a synchronization tool that updates the data in the tables from the source systems for the various types of activities.
By automatically pulling data from these source systems (which workers use in their regular day-to-day work), you remove the need for the workers to maintain data.
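As a rough sketch of what that synchronization might look like, here is a simple JDBC upsert loop. The worker_activity table, its columns, and the connection details are hypothetical names for illustration, not my actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch of the synchronization idea: pull activity records from a source
// system and upsert them into the worker data mart. Table and column names
// (worker_activity, etc.) are hypothetical, not the actual schema.
public class ActivitySync {

    public void sync(List<Activity> activitiesFromSource) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/people_search", "user", "password")) {
            String sql = "INSERT INTO worker_activity (worker_id, source, title, url, activity_date) "
                       + "VALUES (?, ?, ?, ?, ?) "
                       + "ON DUPLICATE KEY UPDATE title = VALUES(title), activity_date = VALUES(activity_date)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (Activity a : activitiesFromSource) {
                    ps.setString(1, a.workerId);   // who performed the activity
                    ps.setString(2, a.source);     // e.g., "wiki", "mailing list", "project"
                    ps.setString(3, a.title);
                    ps.setString(4, a.url);
                    ps.setDate(5, a.date);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }

    // Simple holder for one activity pulled from a source system.
    public static class Activity {
        String workerId, source, title, url;
        java.sql.Date date;
    }
}
```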
Now, how should the profile page for a worker be presented?
Initially, I put together a design that did two things: 1) provided a typical employee-directory-style layout of a worker's administrative details, and 2) provided a list of all of the activities for that worker, grouped by activity source. In other words, you would see a list of all of the Wiki articles edited by the worker, a list of mailing list memberships, community memberships, project team memberships, task assignments, etc. Each activity source's list was displayed separately (in a simple bulleted list). (Before this went into production, I always assumed I would ask for some design help from our electronic marketing group to give it a more professional look, but I thought the bulleted lists worked perfectly well functionally.)
This proved simple and effective, and it enabled the profile page to provide direct links to those activities that are addressable via a link (for example, the profile page could link directly to a Wiki article I've edited, to each discussion post, and so on).
However, this approach suffered from at least two problems: 1) it lacked an immediately obvious visual presentation of a worker's attributes, and 2) it exposed every detailed activity of a worker to anyone who viewed the profile. (When I demoed this to people, some had the immediate reaction of, "Wow – anyone can see all of these details? I'm not sure I like that!" – a reaction that surprised me given that all of the details are generally visible to anyone who wants to look, but go figure.)
After looking for alternatives, I found that the keywords for a worker, combined with their weights, provided good input for a tag cloud – which is what I ended up using as the default presentation of a worker's keywords (visible to everyone). This helps to highlight what someone is "about" and presents a generally attractive visualization of the data; and if the default view of a worker displays this tag cloud (along with the worker's administrative data) without showing all of the activity details, it alleviates the concern mentioned above.
I have found the tag cloud to be the trigger that pulls people into this tool – it helps satisfy my goal #5 because, for most people who have looked at this, seeing words they expect to see in their own tag cloud provides immediate validation.
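As a simple illustration of how the weighted keywords can drive the cloud, here is a sketch that linearly scales each keyword's weight to a font size. My actual implementation may differ, and the size range is arbitrary:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of turning keyword weights into tag-cloud font sizes by linearly
// scaling each weight between a minimum and maximum pixel size. Illustrative only.
public class TagCloudSizer {

    public static Map<String, Integer> toFontSizes(Map<String, Double> keywordWeights,
                                                   int minPx, int maxPx) {
        double min = keywordWeights.values().stream().min(Double::compare).orElse(0.0);
        double max = keywordWeights.values().stream().max(Double::compare).orElse(1.0);
        double range = (max - min) == 0 ? 1.0 : (max - min);

        Map<String, Integer> sizes = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : keywordWeights.entrySet()) {
            double scaled = (e.getValue() - min) / range;          // 0.0 .. 1.0
            sizes.put(e.getKey(), (int) Math.round(minPx + scaled * (maxPx - minPx)));
        }
        return sizes;
    }
}
```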
Here’s a shot of what part of my profile page looks like (partially obscured):
I wanted to keep the initial proof of concept simple in order to try to test different ways of using the data from the activity sources. With that in mind, here are some details on how I’ve done this so far:
Some additional functions I have layered on top of the basic profile / search mechanism that I believe will make this a valuable solution:
The proof of concept has been very interesting to work through and has presented me with some (subjective) proof of the value of this approach, as simple as it is. That being said, there are some issues and additional areas I hope are explored in the future:
I have previously described what I termed the various generations of solutions to the common challenge of workers connecting with or finding co-workers within an enterprise. My most recent post described the fourth-generation solution, which enables users to search and connect using much more than simple administrative terms (name, email, address, etc.).
Over my next couple of posts, I will provide a write-up of a proof of concept implementation I've assembled that meets a lot of this need with what I believe to be relatively minimal investment.
The following represent the goals I've set for myself in this proof of concept:
Also, I wanted to say that part of the inspiration for this proof of concept came from a session presented by Trent Parkhill that I attended at Enterprise Search Summit 2007. In his session, he described a mechanism where submissions to a company's repository would be tagged with the names of the participants in the project that produced the document as a deliverable. Then, when users searched for content, a secondary search produced a list of people associated with the terms and/or documents found by the user's search. I've turned that around somewhat and treated the people as being tagged by the keywords of the items they produce.
In my next post, I will describe the overall design of my proof of concept.
I just read through Kas Thomas’ post In search of a standard search syntax, and have to agree this would be useful for users of search engines.
However, I would go even further and suggest that the search industry (enterprise search as well as internet search engines) would also benefit from defining and adopting a standard response syntax for results (or at least a response syntax that could be provided as an option). Obviously, a straightforward HTML presentation is desirable for most users: when they interact with an engine through their browser, they want to view the results in their browser.
However, an ability to request results from an arbitrary engine in a standard format would be a great step forward – it would vastly simplify aggregation of results for federated search and more generally it could present the ability to programmatically interact with multiple engines for a variety of other purposes.
I know of one attempt that seems to drive toward this – OpenSearch (associated with A9, Amazon's search engine) – a set of elements that can be used as extensions to the RSS format. Are there others? How widely known (and adopted) is OpenSearch as a format?
Or, in other words: "How do you apply the application standards that improve findability to applications built by third-party providers who do not follow your standards?"
I've previously written about the standards I've put together for (web-based) applications that help ensure good findability for the content and data within those applications. These standards are generally relatively easy to apply to custom applications (though it can still be challenging to get involved with the design and development of those applications at the right time to keep the time investment minimal, as I've also previously written about).
However, it can be particularly challenging to apply these standards to third-party applications – for example, your CRM application, your learning management system, or your HR system. Applying the existing standards could take a couple of different forms:
The rest of this post will discuss a solution for option #3 above – how you can implement a different solution. Note that some search engines provide pre-built functionality to enable search within many of the more common third-party systems – those are great and useful, but what I will present here is a solution that can be implemented independent of the search engine (as long as the engine has a crawler-based indexing function) and that requires relatively minimal investment.
So, you have a third-party application and, for whatever reason, it does not adhere to your application standards for findability. Perhaps it fails the coverage principle and it's not possible to adequately find the useful content without getting many, many useless items; or perhaps it's the identity principle and, while you can find all of the desirable targets, they have redundant titles; or it might even be that the application fails the relevance principle: you can index the high-value targets and they show up with good names in results, but they do not show up as relevant for the keywords you would expect. Likely, it's a combination of all three of these issues.
The core idea in this solution is that you will need a helper application that creates what I call “shadow pages” of the high value targets you want to include in your enterprise search.
Note: I adopted the term "shadow page" from some informal discussions with co-workers on this topic. I am aware that others use this term in similar (though not identical) ways, and also that some search engines address what they call shadow domains and discourage their inclusion in search results. If there is a preferred term for the idea described here, please let me know!
What is a shadow page? For my purposes here, I define a shadow page as:
To make this solution work, there are a couple of minimal assumptions about the application. A caveat: while I consider these relatively simple assumptions, it is very likely that some applications will still not be able to meet them and so cannot be exposed via your enterprise search with this type of solution.
Given the description of a shadow page and the assumptions about what is necessary to support it, it is probably obvious how they are used and how they are constructed, but here’s a description:
First – you would use the query that gives you a list of targets (item #2 from the assumptions) in your source application to generate an index page, which you give your indexer as a starting point. This index page has one link for each desirable target's shadow page. It also carries a "robots" <meta> tag of "noindex,follow" to ensure that the index page itself is not included as a potential search target.
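As an illustration only, the index page generator could be a small servlet along these lines. The fetchTargetIds() method stands in for whatever query your source application exposes; none of the names here are prescribed by the approach:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Collections;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of a helper servlet that emits the index page of links to shadow pages.
// fetchTargetIds() stands in for whatever query (report URL, API call, database
// view) the source application exposes per assumption #2.
public class ShadowIndexServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><head>");
        // The index page itself should never appear as a search result.
        out.println("<meta name=\"robots\" content=\"noindex,follow\">");
        out.println("</head><body>");
        for (String id : fetchTargetIds()) {
            // One link per desirable target's shadow page.
            out.println("<a href=\"shadow?id=" + id + "\">" + id + "</a><br>");
        }
        out.println("</body></html>");
    }

    private List<String> fetchTargetIds() {
        // Query the third-party application for the identifiers of all desirable
        // targets; the implementation depends entirely on the application.
        return Collections.emptyList();
    }
}
```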
Second – the shadow page for each target (which the crawler reaches thanks to the index page) is dynamically built from a query of the application given the identity of the desirable search target (item #3 from the assumptions). The business rules defining how the desirable target should behave in search help define the necessary query, but at minimum the query needs to return some of the following data: the name of the target, a description or summary of the target, some keywords that describe the target, and a value that helps construct the true URL of the actual target (per assumption #1, there must be a way to directly address each target).
The shadow page would be built something like the following:
The overall effect is that the search engine indexes the shadow page, which has been constructed to ensure good adherence to the principles of enterprise search; to a searcher it behaves like a good search target, but when the user clicks on it from a search result, they end up looking at the actual desired target. The only clue the user might have is that the URL of the target shown in the search results is not what they end up seeing in their browser's address bar.
The following provides a simple example of the source (in HTML – sorry for those who might not be able to read it) for a shadow page; the parts that change from page to page are the placeholder values (title of target, description of target, and so on):
<html>
  <head>
    <TITLE>title of target</TITLE>
    <meta name="robots" content="index, nofollow">
    <meta name="keywords" content="keywords for target">
    <meta name="description" content="description of target">
    <script type="text/javascript">
      document.location.href="URL of actual target";
    </script>
  </head>
  <body>
    <div style="display:none;">
      <h1>title of target</h1>
      description of target and keywords of target
    </div>
  </body>
</html>
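If you wanted to generate these shadow pages dynamically rather than as static files, a small helper servlet along the following lines could do it. This is a sketch, not a reference implementation – lookupTarget() and the Target fields are assumptions standing in for the application query described above:

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of the companion servlet that renders one shadow page per target.
// lookupTarget() stands in for the per-target query of the source application
// (assumption #3); the Target fields mirror the data elements listed above.
public class ShadowPageServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        Target t = lookupTarget(req.getParameter("id"));
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><head>");
        out.println("<title>" + t.title + "</title>");
        out.println("<meta name=\"robots\" content=\"index,nofollow\">");
        out.println("<meta name=\"keywords\" content=\"" + t.keywords + "\">");
        out.println("<meta name=\"description\" content=\"" + t.description + "\">");
        // Redirect real users to the actual target in the source application.
        out.println("<script type=\"text/javascript\">document.location.href=\"" + t.url + "\";</script>");
        out.println("</head><body><div style=\"display:none;\">");
        out.println("<h1>" + t.title + "</h1>" + t.description + " " + t.keywords);
        out.println("</div></body></html>");
    }

    private Target lookupTarget(String id) {
        // Query the application for this target's name, description, keywords,
        // and directly addressable URL; implementation depends on the application.
        return new Target();
    }

    static class Target {
        String title, description, keywords, url;
    }
}
```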
A few things that are immediately obvious advantages of this approach:
There are also a number of issues that I need to highlight with this approach – unfortunately, it’s not perfect!
There you have it – a solution for exposing the high-value targets in your enterprise applications that is independent of your search engine and gives you (the search administrator) a good level of control over how content appears to the engine, while ensuring that what is included adheres closely to my principles of enterprise search.
I’ve previously written about the three principles of enterprise search and also about the specific business process challenges I’ve run into again and again with web applications in terms of findability.
Here, I will provide some insights on the specific standards I’ve established to improve findability, primarily within web applications.
As you might expect, these standards map closely to the three principles of enterprise search, so that's how I will discuss them.
When an application is being specified, the application team must ensure that they discuss the following questions with business users: What are the business objects within this application, and which of those should be visible through enterprise search?
The first question is pretty standard and likely forms the basis for any UML or entity-relationship diagram produced as part of the application's design process. The second is often not asked, but it forms the basis for what will eventually be the specific targets shown in results through the enterprise search.
Given the identification of which objects should be visible in search results, you can then easily start to plan how they might show up and how the search engine will encounter them – whether the application might best provide a dynamic index page of links to the entities, support a standard crawl, or perhaps even allow a direct index of the database(s) behind the application.
Basically, the standard here is that the application must provide a means to ensure that a search engine can find all of the objects that need to be visible and also to ensure that the search engine does not include things that it should not.
Some specific things that are included here:
With the standard for Coverage defined, we can be comfortable that the right things will show up in search and the wrong things will not. How useful will they be as search results, though? If a searcher sees an item in a results list, will they be able to tell that it's what they're looking for? We need to ensure that the application addresses the identity principle.
The standard here is that the pages (ASP pages, JSP files, etc) that comprise the desirable targets for search must be designed to address the identity principle – specifically:
Relevance
Now we know that the search includes what it should, and we know that when those items show up in search they will be identifiable for what they are. How do we ensure that the items show up for the searches for which they are relevant, though?
The standards to address the relevance issue are:
For a good review of the <meta> tags in HTML pages, you can look at: