Lee Romero

On Content, Collaboration and Findability

Language change over time in your search log

Monday, October 10th, 2011

This is the second post in a planned series about the language found throughout your search log – all the way into the “long tail” – and whether it is feasible to understand it all.

My previous post, “80-20: The lie in your search log?”, highlighted how the slope of the “short head” of your search terms may not be as steep as anecdotes would suggest.  That is, within a given time range, there can be a lot less commonality among even the most common terms in your search log than you might expect.

After writing that post, I began to wonder about the overall re-use of terms over periods of time.

In other words:

Even though re-use of terms within a single month is relatively low, how much commonality do we see in our users’ language (i.e., search terms) from month to month?

To answer this, I needed to take the entire set of terms for one month, compare it with the entire set from the next month to determine the overlap, then compare the second month’s set of terms to a third month’s, and so on.  Logically not a hard problem, but quite a challenge in practice due to the volume of data I was manipulating (large only relative to the tools I have for manipulating it).

So I pulled together every single term used over a period of about 18 months and broke them into the set used for each of those months and performed the comparison.
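For the curious, here is a minimal sketch of that month-to-month comparison, assuming the raw log has already been reduced to one row per (month, term) with a search count – the file name and column names are illustrative, not the actual data:

```python
import csv
from collections import Counter, defaultdict

# Load a pre-aggregated search log: one row per (month, term) with a count.
monthly_counts = defaultdict(Counter)
with open("search_terms_by_month.csv", newline="") as f:
    for row in csv.DictReader(f):
        monthly_counts[row["month"]][row["term"].strip().lower()] += int(row["count"])

months = sorted(monthly_counts)
for prev, curr in zip(months, months[1:]):
    shared = set(monthly_counts[prev]) & set(monthly_counts[curr])

    # Percent of this month's distinct terms that also appeared last month.
    distinct_overlap = 100 * len(shared) / len(monthly_counts[curr])

    # Percent of this month's total searches that used a prior-month term.
    total = sum(monthly_counts[curr].values())
    reused = sum(monthly_counts[curr][t] for t in shared)
    search_overlap = 100 * reused / total

    print(f"{prev} -> {curr}: {distinct_overlap:.1f}% of distinct terms re-used, "
          f"{search_overlap:.1f}% of searches used a prior-month term")
```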

Before getting into the results, a few details for context about the search solution I’m writing about here:

  • The average number of searches performed each month was almost 123,000.
  • The average number of distinct terms during this period was just under 53,000.
  • This results in an average of about 2.3 searches for each distinct term.

My expectation was that comparing the entire set of terms from one month to the next would show a relatively high percentage of overlap.  What I found was not what I expected.

If you look at the unique terms and their overlap, the average overlap between months was a shockingly low 13.2%.  In other words, over 86% of the terms in any given month were not used at all in the previous month.

Month to Month Re-Use of Search Terms (distinct terms)

If you look at the total searches performed and the percent of searches performed with terms from the prior month, this goes up to an average of 36.2% – reflecting that the terms that are re-used in a subsequent month are among the most common terms overall.

Month to Month Re-Use of Search Terms (percent of all searches)

As you can see, the amount of commonality from month-to-month among the terms used is very low.

What can you draw from this observation?

In a brief discussion about this with noted search analytics expert Lou Rosenfeld, his reaction was that this represented a significant amount of change in the information needs of the users of the system – significant enough to be surprising.

Another conclusion I draw from this is that it provides another reason why it is very hard to meaningfully improve search across the language of your users.  Based on my previous post on the flatness of the curve of term use within a month, we know that we need to look at a pretty significant percentage of distinct terms each month to account for a decent percentage of all searches – 12% of distinct terms to account for only 50% of searches.  In our search solution, that 12% doesn’t seem that large until you realize it still represents about 6,000 distinct terms.

Coupling that with the observation from the analysis here means that even if you review those terms for a given month, you will likely need to review a significant percentage of brand new terms the next month, and so on.  Not an easy task.

Having established just how challenging this can be, my next few posts will provide some ideas for grappling with the challenges.

In the meantime, if you have any insight on similar statistics from your solution (or statistics about the shape of the search log curve I previously wrote about), please feel free to share here, on the SearchCoP on Yahoo! groups or on the Enterprise Search Engine Professionals group on LinkedIn – I would very much like to compare numbers to see if we can identify meaningful generalizations across different solutions.

The Findability Gap by Lou Rosenfeld

Friday, September 23rd, 2011

Lou Rosenfeld has just published a great presentation I would highly recommend for anyone working in the search space:  The Findability Gap.

It provides a great picture of the overall landscape of the problem (it’s not just search, after all!).

I especially liked slide 4 – a very telling illustration of the challenge we face in intelligently making information available to our users.

Re: Slide 24 – As I’ve written about before, I would say that the 80/20 rule is more than just “not quite accurate”.  But that’s mincing words.

Overall, a highly recommended read.

80-20: The lie in your search log?

Saturday, November 13th, 2010

Recently, I have been trying to better understand the language in use by the users of our search solution, and to do that I have been looking at what tools and techniques might help. This is the first post in a planned series about this effort.

I have many goals in pursuing this.  The primary goal has been to be able to identify trends from the whole set of language in use by users (and not just the short head).  This goal supports the underlying business desire of identifying content gaps or (more generally) where the variety of content available in certain categories does not match the variety expected by users (i.e., how do we know when we need to target the creation and publication of specific content?).

Many approaches to this do focus on the short head – typically the top N terms, where N might be 50 or 100 or even 500 (some number that’s manageable).  I am interested in identifying ways to understand the language through the whole long tail as well.

As I have dug into this, I realized an important aspect of this problem is to understand how much commonality there is to the language in use by users and also how much the language in use by users changes over time – and this question leads directly to the topic at hand here.

Search Term Usage

Chart 1

There is an anecdote I have heard many times about the short head of your search log that “80 percent of your searches are accounted for by the top 20% most commonly-used terms“.  I now question this and wonder what others have seen.

I have worked closely with several different search solutions in my career and the three I have worked most closely with (and have most detailed insight on) do not come even close to the above assertion.  Chart 1 shows the usage curve for one of these.  The X axis is the percent of distinct terms (ordered by use) and the Y axis shows the percent of all searches accounted for by all terms up to X.

From this chart, you can see that it takes approximately 55% of distinct terms to account for 80% of all searches – that is a lot of terms!
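For reference, the curve itself is easy to compute from a single month of raw search terms; a minimal sketch (the input is simply a list with one entry per search performed):

```python
from collections import Counter

def usage_curve(search_terms):
    """Return (pct_of_distinct_terms, pct_of_searches_covered) points,
    with distinct terms ordered from most used to least used."""
    counts = Counter(t.strip().lower() for t in search_terms)
    total_searches = sum(counts.values())
    points, covered = [], 0
    for i, (term, n) in enumerate(counts.most_common(), start=1):
        covered += n
        points.append((100 * i / len(counts), 100 * covered / total_searches))
    return points

# e.g., what fraction of distinct terms is needed to cover 80% of searches?
# curve = usage_curve(terms_for_one_month)
# pct_terms_needed = next(x for x, y in curve if y >= 80)
```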

This curve shows the usage for one month – I wondered how similar this would be for other months and found (for this particular search solution) that the curves for every month were essentially identical!

Wondering if this was an anomaly, I looked at a second search solution I have close access to, to see if it might show signs of the “80/20” rule.  Chart 2 adds the curve for this second solution (it’s the blue curve – the higher of the two).

Chart 2

In this case, you will find that the curve is “higher” – it reaches 80% of searches at about 37% of distinct terms.  However, it is still pretty far from the “80/20” rule!

After looking at this data in more detail, I have realized why I have always been troubled at the idea of paying close attention to only the so-called “short head” – doing so leaves out an incredible amount of data!

In trying to understand why the usage curves are so different, even though neither comes close to adhering to the “80/20” rule, I realized that there are some important distinctions between the two search solutions:

  1. The first solution is from a knowledge repository – a place where users primarily go in order to do research; the second is for a firm intranet – much more focused on news and HR type of information.
  2. The first solution provides “search as you type” functionality (showing a drop-down of actual search results as the user types), while the second provides auto-complete (showing a drop-down of possible terms to use).  The auto-complete may be encouraging users to adopt more commonality.

I’m not sure how (or really if) these factor into the shape of these curves.

In understanding this a bit better, I hypothesize two things:  1) the shape of this curve is stable over time for any given search solution, and 2) the shape of this curve tells you something important about how you can manage your search solution.  I am planning to dig more to answer hypothesis #1.

Questions for you:

  • Have you looked at term usage in your search solution?
  • Can you share your own usage charts like the above for your search solution and describe some important aspects of your solution?  Insight on more solutions might help answer my hypothesis #2.
  • Any ideas on what the shape of the curve might tell you?

I will be writing more on these search term usage curves in my next post as I dig more into the time-stability of these curves.

Best Bet Governance

Monday, February 22nd, 2010

My first post back after too-long a period of time off.  I wanted to jump back in and share some concrete thoughts on best bet governance.

I’ve previously written about best bets and how I thought, while not perfect, they were an important part of a search solution.  In that post, I also described the process we had adopted for managing best bets, which was a relatively indirect means supported by the search engine we used for the search solution.

Since moving employers, I now have responsibility for a local search solution as well as input on an enterprise search solution where neither of the search engines supports a similar model.  Instead, both support the (more typical?) model where you identify particular search terms that you feel need to have a best bet and you then need to identify a specific target (perhaps multiple targets) for those search terms.

This model offers some advantages such as specificity in the results and the ability to actively determine what search terms have a best bet that will show.

This model also offers some disadvantages, the primary one (in my mind) being that they must be managed – you must have a means to identify which terms should have best bets and which targets those terms should show as a best bet.  This implies some kind of manual management, which, in resource-constrained environments, can be a challenge.  As noted in my previous article, others have provided insight about how they have implemented and how they manage best bets.

Now having responsibility for a search solution requiring manual management of best bets, we’ve faced the same questions of governance and management and I thought I would share the governance model we’ve adopted.  I did review many of the previous writings on this to help shape these, so thanks to those who have written before on the topic!

Our governance model is largely based on trying to provide a framework for consistency and usability of our best bets.  We need some way to ensure we do not spend inordinate time on managing requests while also ensuring that we can identify new, valuable search terms and targets for best bets.

Without further ado, here is an overview of the governance we are using:

  • We will accept best bet requests from all users, though most requests come from site publishers on our portal.  Most of our best bets have web sites as targets, though about 30% have individual pieces of published content (documents) as targets.  As managers of the search solution, my team will also identify best bets when appropriate.
  • When we receive a request for a new best bet, we review the request against the following criteria:
    • No more than five targets can be identified for any one search term, though we prefer to keep it to one or two targets.
      • Any request for a best bet that would result in more than 2 targets for the search term forces a review of usage of the targets (usage is measured by our web analytics solution for both sites and published content).
      • The overall usage of the targets will identify if one or more targets should be dropped.
    • For a given target, no more than 20 individual search terms can be identified.  Typically, we try to keep this to fewer than 5 when possible.
    • If a target is identified as a best bet target that has not had a best bet search term associated with it previously, we confirm that it is either a highly used piece of content or that it is a significant new piece that is highly known or publicized (or may soon be by way of some type of marketing).
    • We also review the search terms identified for the best bet.  We will not use search terms with little to no usage during the previous 3 months (a check that can be automated against the search log – see the sketch after this list).
    • We will not set up a best bet search term that matches the title of the target.  The relevancy algorithm for our search engine heavily weights titles, so this is not necessary.
    • We do prefer that the best bet search terms do have a logical connection to the title or summary of the target.  This ensures that a user will understand the connection between their search terms and a resulting best bet.  This is not a hard requirement, but a preference.  We do allow for spelling variants, synonyms, pluralized forms, etc.
    • We prefer terms that use words from our global taxonomy.
  • Our governance (management process, really) for managing best bets includes:
    • Our search analyst reviews the usage of each best bet term.
      • If usage over an extended time is too low to warrant the best bet term, it is removed.
    • We also plan to use path analysis (pending some enhancements needed as this is written) to determine if, for specific terms, the best bet selections are used preferentially.  If that is found to not be the case, our intent is that the best bet target is removed.
    • We have integrated best bet management into both our site life cycle process and our content life cycle process
      • With the first, when we are retiring a site or changing the URL of a site we know to remove or update the best bet target
      • With the second, as content is retired, the best bets are removed
      • In each of these cases, we also evaluate the terms to see if there could be other good targets to use.
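Several of the criteria above depend on usage numbers from the search log and our analytics; the term-usage check in particular is easy to automate. Here is a minimal sketch, assuming a monthly export of term counts (the data shape and threshold are illustrative, not our actual tooling):

```python
from collections import Counter

def recent_term_usage(monthly_counts, months_back=3):
    """monthly_counts: dict of 'YYYY-MM' -> Counter mapping term -> searches.
    Returns combined usage over the most recent months_back months."""
    recent = Counter()
    for month in sorted(monthly_counts)[-months_back:]:
        recent.update(monthly_counts[month])
    return recent

def screen_best_bet_terms(requested_terms, monthly_counts, min_searches=10):
    """Split a best bet request into terms with enough recent usage and
    terms we would push back on (little or no usage in the last 3 months)."""
    usage = recent_term_usage(monthly_counts, months_back=3)
    keep, reject = [], []
    for term in requested_terms:
        (keep if usage[term.strip().lower()] >= min_searches else reject).append(term)
    return keep, reject
```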

The one interesting experience we’ve had so far with this governance model is that we get a lot of push back from site publishers who want to provide a lengthy laundry list of terms for their site, even when 75% of that list is never used (at least not within the twelve-month window we will sometimes check).  They seem convinced that there is value in setting up best bets for terms even when you can show that there is none.  We are currently making changes in the way we manage best bets and also in how we can use these desired terms to enhance the organic results directly.  More on that later.

There you have our current governance model.  Not too fancy or complicated and still not ideal, but it’s working for us and we recognize that it’s a work in progress.

Now that I have the “monkey off my back” in terms of getting a new post published, I plan to re-start regular writing.  Check back soon for more on search, content management and taxonomy!

Enterprise Search Best Bets – a good enough practice?

Tuesday, February 3rd, 2009

Last summer, I read the article by Kas Thomas from CMS Watch titled “Best Bets – a Worst Practice” with some interest. I found his thesis to be provocative and posted a note to the SearchCoP community asking for others’ insights on the use of Best Bets. I received a number of responses taking some issue with Kas’ concept of what best bets are, and also some responses describing different means of managing best bets (hopefully without requiring the “serious amounts of human intervention” described by Kas).

In this post, I’ll provide a summary of sorts and describe some of the ways people manage best bets, as well as the way we have managed them.

Kas’ thesis is that best bets are not a good practice because they are largely a hack layered on top of a search engine and require significant manual intervention. Further, if your search engine isn’t already providing access to appropriate “best bets” for queries, you should get yourself a new search engine.

Are Best Bets Worth the Investment?

Some of the most interesting comments from the thread of discussion on the SearchCoP include the following (I’ll try to provide as cohesive a picture of the sentiment as I can, but will only provide parts of the discussion – if I have portrayed anyone’s intent incorrectly, that’s my fault and not the original author’s):

From Tim W:

“Search analytics are not used to determine BB … BB are links commonly used, enterprise resources that the search engine may not always rank highly because for a number of reasons. For example, lack of metadata, lack of links to the resource and content that does not reflect how people might look for the document. Perhaps it is an application and not a document at all.”

From Walter U:

“…manual Best Bets are expensive and error-prone. I consider them a last resort.”

From Jon T:

“Best Bets are not just about pushing certain results to the top. It is also about providing confidence in the results to users.

If you separate out Best Bets from the automatic results, it will show a user that these have been manually singled out as great content – a sign that some quality review has been applied.”

From Avi R:

“Best Bets can be hard to manage, because they require resources.

If no one keeps checking on them, they become stale, full of old content and bad links.

Best Bets are also incredibly useful.

They’re good for linking to content that can’t be indexed, and may even be on another site entirely. They’re good for dealing with … all the sorts of things that are obvious to humans but don’t fit the search paradigm.”

So, lots of differing opinions on best bets and their utility, I guess.

A few more pieces of background for you to consider: Walter U has posted on his blog (Most Casual Observer) a great piece titled “Good to Great Search” that discusses best bets (among other things); and, Dennis Deacon posted an article titled, “Enterprise Search Engine Best Bets – Pros & Cons” (which was also referenced in Kas Thomas’ post). Good reading on both – go take a look at them!

My own opinion – I believe that best bets are an important piece of search and agree with Jon T’s comment above that their presence (and, hopefully, quality!) give users some confidence that there is some human intelligence going into the presentation of the search results as a whole. I also have to agree with Kas’s argument that search engines should be able to consistently place the “right” item at the top of results, but I do not believe any search engine is really able to today – there are still many issues to deal with (see details in my posts on coverage, identity, and relevance for my own insights on some of the major issues).

That being said, I also agree that you need to manage best bets in a way that does not cost your organization more than their value – or to manage them in a way that the value is realized in multiple ways.

Contrary to what Tim W says, and as I have written about in my posts on search analytics (especially in the use of search results usage), I do believe you can use search analytics to inform your best bets but they do not provide a complete solution by any means.

Managing Best Bets

From here on out, I’ll describe some of the ways best bets can be managed – the first few are a summary of what people shared on the SearchCoP community, and then I’ll provide some more detail on how we have managed them. The emphasis (bolding) is my own, to highlight what I think are important points of differentiation.

From Tim W:

“We have a company Intranet index; kind of a phone book for web sites (A B C D…Z). It’s been around for a long time. If you want your web site listed in the company index, it must be registered in our “Content Tracker” application. Basically, the Content Tracker allows content owners to register their web site name, URL, add a description, metadata and an expiration date. This simple database table drives the Intranet index. The content owner must update their record once per year or it expires out of the index.

This database was never intended for Enterprise Search but it has proven to be a great source for Best Bets. We point our ODBC Database Fetch (Autonomy crawler) at the SQL database for the Content Tracker and we got instant, user-driven, high quality Best Bets.

Instead of managing 150+ Best Bets myself, we now have around 800 user-managed Best Bets. They expire out of the search engine if the content owner doesn’t update their record once per year. It has proven very effective for web content. In effect, we’ve turned over management of Best Bets to the collective wisdom of the employees.”

From Jim S:

“We have added an enterprise/business group best bet key word/phrase meta data.

All documents that are best bet are hosted through our WCM and have a keyword meta tag added to indicate they are a best bet. This list is limited and managed through a steering team and search administrator. We primarily only do best bets for popular searches. Employee can suggest a best bet – both the term and the associated link(s). It is collaborative/wiki like but still moderated and in the end approved or rejected by a team. There is probably less than 1 best bet suggestion a month.

If a document is removed or deleted the meta data tag also is removed and the best bet disappears automatically.

Our WCM also has a required review date for all content. The date is adjustable so that content will be deactivated at a specific date if the date is not extended. This is great for posting information that has a short life as well as requiring content owners to interact with the content at least every 30 Months (maximum) to verify that the content is still relevant to the audience. The Content is not removed from the system, rather it’s deactivated (unpublished) so it no longer accessible and the dynamic links and search index automatically remove the invalid references. The content owner can reactivate it by setting the review date into the future.

If an external link (not one in our WCM) is classified as a best bet then a WCM redirect page is created that stores the best bet meta tag. Of course it has a review/expiration so the link doesn’t go on forever and our link testing can flag if the link is no longer responding. If the document is in the DMS it would rarely be deleted. In normal cases it would be archived and a archive note would be placed to indicate the change. Thus no broken links.

Good content engineering on the front end will help automate the maintenance on the back end to keep the quality in search high.”

The first process is external to the content and doesn’t require modifying the content (assuming I’m understanding Tim’s description correctly). There are obvious pros and cons to this approach.

By contrast, the second process embeds the “best bet” attribution in the content (perhaps more accurately in the content management system around the content) and also embeds the content in a larger management process – again, some obvious pros and cons to the approach.

Managing Best Bets at Novell

Now for a description of our process – the process and tools in place in our solution are similar to the description provided by Tim W. I spoke about this topic at the Enterprise Search Summit West in November 2007, so you might be able to find the presentation for it there (though I could not just now in a few minutes of searching).

With the search engine we use, the results displayed in best bets are actually just a secondary search performed whenever a user performs any search – the engine searches the standard corpus (whatever context the user has chosen, which would normally default to “everything”) and separately searches a specific index that includes all content that is a potential best bet.

The top 5 (a number that’s configurable) results that match the user’s search from the best bets index are displayed above the regular results and are designated “best bets”.
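Conceptually, the engine is doing something like the following on every query (the `search()` function here is a hypothetical stand-in, not our engine’s actual API):

```python
MAX_BEST_BETS = 5  # the configurable limit mentioned above

def run_search(search, query, context="everything"):
    """Run the user's query twice: once against the chosen corpus and once
    against the much smaller best bets index, displaying best bets on top.
    `search(index, query, limit)` is a hypothetical client function."""
    best_bets = search(index="best-bets", query=query, limit=MAX_BEST_BETS)
    regular = search(index=context, query=query, limit=20)
    return {"best_bets": best_bets, "results": regular}
```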

How do items get into the best bets index, then? Similar to what Tim W describes, on our intranet, we have an “A-Z index” – in our case, it’s a web page that provides a list of all of the resources that have been identified as “important” at some point in the past by a user. (The A-Z index does provide category pages that provide subsets of links, but the main A-Z index includes all items so the sub-pages are not really relevant here.)

So the simple answer to, “How do items get into the best bets index?” is, “They are added to the A-Z index!” The longer answer is that users (any user) can request an item be added to the A-Z index and there is then a simple review process to get it into the A-Z index. We have defined some specific criteria for entries added to the A-Z, several of which are related to ensuring quality search results for the new item, so when a request is submitted, it is reviewed against these criteria and only added if it meets all of the criteria. Typically, findability is not something considered by the submitter, so there will be a cycle with the submitter to improve the findability of the item being added (normally, this would include improving the title of the item, adding keywords and a good description).

Once an item is added to the A-Z index, it is a potential best bet. The search engine indexes the items in the A-Z through a web crawler that is configured to start with the A-Z index page and goes just one link away from that (i.e., it only indexes items directly linked to from the A-Z index).

In this process, there is no way to directly map specific searches (keywords) to specific results showing up in best bets. The best bets will show up in the results for a given search based on normally calculated relevance for the search. However, the best bet population numbers only about 800 items instead of the roughly half million items that might show up in the regular results – as long as the targets in the A-Z index have good titles and are tagged with the proper keywords and description, they will normally show up in best bets results for those words.

Some advantages of this approach:

  • This approach works with our search engine and takes advantage of a long-standing “solution” our users are used to (the A-Z index has long been part of our intranet and many users turn to the A-Z index whenever they need to find anything, so its importance is well-ingrained in the company).
  • Given that the items in the A-Z index have been identified at some point in the past as “important”, we can arguably say that everything that should possibly be a best bet is included.
  • We have a point in a process to enforce some findability requirements (when a new item is added).
  • The items included can be any web resource, regardless of where it is (no need to be on our web site or in our CM system)
  • This approach provides a somewhat automated way to keep the A-Z index cleaned up – the search engine identifies broken links as it indexes content and, by monitoring those for the best bets index, we know when content included in the A-Z has been removed.
  • Because this approach depends on the “organic” results from the engine (just on a specially-selected subset of content), we do not have to directly manage keyword-to-result mapping – we delegate that to the content owner (by way of assigning appropriate keywords in the content).

Some disadvantages of this approach

  • The tool we use to manage the A-Z index content is a database, but it is not integrated with our content management system. Most specifically, it does not take advantage of automated expiration (or notification about expiration).
  • As a follow-on from the above point, there is no systematically enforced review cycle on individual items to ensure they are still relevant.
  • Because this approach depends on the organic results from the engine, we can not directly map keywords to specific results. (Both a good and bad thing, I guess!)
  • Because the index is generated using a web crawl (and not, for example, by indexing a database directly), some targets (especially web applications) still end up not showing particularly well, because it might not be possible to have the home page of the application modified to include better keywords or descriptions, or (in the face of our single sign-on solution) a complex set of redirects sometimes results in the crawler not indexing the “right” target.

People Search and Enterprise Search

Tuesday, October 14th, 2008

This post is the first of a brief series of posts I plan to write about the integration of “people search” (employee directory) with your enterprise search solution. In a sense, this treats “people” as just another piece of content within your search, though they represent a very valuable type of content.

This post will be an introduction and will describe both a first and second generation solution to this problem. In subsequent posts, I plan to describe an approach that takes this a step further (simplifying things for your users, among other things) and then some research that I believe shows a lot of promise and which you might be able to take advantage of within your own enterprise search solution.

Why People Search?

Finding contact information for your co-workers is such a common need that people have, forever, maintained phone lists – commonly just as word processing documents or spreadsheets – and also org charts, probably in a presentation file format of some type. I think of this approach as a first generation solution to the people search problem.

Its challenges are numerous, including:

  1. The maintenance of the document is fraught with the typical issues of maintaining any document (versioning, availability, etc.)
  2. In even a moderately large organization, the phone list may need to be updated by several people throughout the organization to keep it current.
  3. Search within this kind of phone list is limited – you have to make sure you have the latest version, open it up and use your word processor’s search function, or (I remember this well myself) keep a printout of the latest version of the phone list next to your workspace so you can look through it when you need to contact someone.

As computer technology has evolved and companies implemented corporate directories for authentication purposes (Active Directory, LDAP, eDirectory, etc.), it has become common to maintain your phone book as a purely online system based on your corporate directory. What does such a solution look like and what are its challenges?

A “Second Generation” Solution

I think it’s quite common now that companies will have an online (available via their intranet) employee directory that you can search using some (local, specific to the directory) search tools. Obvious things like doing fielded searches on name, title, phone number, etc. My current employer has sold a product named eGuide for quite some time that provides exactly this type of capability.

eGuide is basically a web interface for exposing parts of your corporate Directory for search and also for viewing the org chart of a company (as reflected in the Directory).

We have had this implemented on our intranet for many years now. It has been (and continues to be) one of the more commonly used applications on our intranet.

The problems with this second generation solution, though, triggered me to try to provide a better solution a few years ago using our enterprise search. What are the problems with this approach? Here are the issues that triggered a different (better?) solution:

  1. First and foremost, with nothing more than the employee finder as a separate place to search, you immediately force searchers to decide where to search before they even start. Many users expect that the “enterprise” search includes anything they can navigate to as a potential target, so when they search on a person’s name and don’t see it in the result set, they immediately think either A) why does the search not include individual people’s information, or B) this search engine is so bad that, even though it must include people information, it can’t even show the result at a high enough relevance to get it on the first page!
    1. Despite my statement to the contrary above, I am aware that Jakob Nielsen does actually advocate the presence of both a “people search” box and a more general search box because people are aware of the distinction between searching for content and search for people. We do still have both search boxes on our intranet, though, in a sense, the people search box is redundant.
  2. Secondly, the corporate directory commonly is a purely fielded search – you have to select which field(s) you want to search in and then you are restricted to searching just those fields.
    1. In other words, you, as a searcher, need to know in which field a particular string (or partial string) might appear. For many fields, this might not be an issue – generally, first and last name are clear (though not always), as are email, phone number, etc. – but the challenge is that a user has to decide in which field to look.
  3. Third, related to the previous point, directory searches are generally simplistic searches based on string matching or partial string matching. With a full search engine, you introduce the possibility of taking advantage of synonyms (especially useful on first names), doing spelling corrections, etc.
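To make that last contrast concrete, here is a toy sketch of the difference between a fielded directory lookup and a full-text style match across all fields with first-name synonyms (the nickname map and record layout are made up for illustration):

```python
NICKNAMES = {"bob": {"robert"}, "bill": {"william"}, "peggy": {"margaret"}}

def expand(word):
    word = word.lower()
    return {word} | NICKNAMES.get(word, set())

def fielded_search(people, field, value):
    """Second-generation style: the searcher must pick the right field."""
    return [p for p in people if value.lower() in p.get(field, "").lower()]

def fulltext_search(people, query):
    """Search-engine style: every query word (or a known variant of it)
    must appear somewhere in the record, regardless of field."""
    hits = []
    for p in people:
        blob = " ".join(str(v).lower() for v in p.values())
        if all(any(v in blob for v in expand(w)) for w in query.split()):
            hits.append(p)
    return hits

people = [{"first": "Robert", "last": "Smith", "title": "Search Analyst"}]
print(fielded_search(people, "first", "Bob"))   # [] - wrong field value, no synonyms
print(fulltext_search(people, "bob smith"))     # finds the record
```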

So there’s a brief description of what I would characterize as a first generation solution and a second generation solution along with highlights of some issues with each.

Up next, I’ll describe the next step forward in the solution to this issue – integrating people into your enterprise search solution.

People know where to find that, though!

Monday, October 13th, 2008

The title of this post – “People know where to find that, though!” is a very common phrase I hear as the search analyst and the primary search advocate at my company. Another version would be, “Why would someone expect to find that in our enterprise search?”

Why do I hear this so often? I assume that many organizations, like my own, have many custom web applications available on their intranet and even their public site. It is because of that prevalence, combined with a lack of communication between the Business and the Application team, that I hear these phrases so often.

I have (unfortunately!) lost count of the number of times a new web-based application goes into production without anyone even considering the findability of the application and its content (data) within the context of our enterprise search.

Typically, the conversation seems to go something like this:

  • Business: “We need an application that does X, Y and Z and is available on our web site.”
  • Application team: “OK – let’s get the requirements laid out and build the site. You need it to do X, Y and Z. So we will build a web application that has page archetypes A, B and C.”
  • Application team then builds the application, probably building in some kind of local search function – so that someone can find data once they are within the application.
  • The Business accepts the usability of the application and it goes into production

What did we completely miss in this discussion? Well, no one in the above process (unfortunately) has explicitly asked the question, “Does the content (data) in this site need to be exposed via our enterprise search?” Nor has anyone even asked the more basic question, “Should someone be able to find this application [the “home page” of the application in the context of a web application] via the enterprise search?”

  • Typically, the Business makes the (reasonable) assumption that goes something like, “Hey – I can find this application and navigate through its content via a web browser, so it will naturally work well with our enterprise search and I will easily be able to find it, right?!”
  • On the other hand, the Application Team has likely made 2 assumptions: 1) the Business did not explicitly ask for any kind of visibility in the enterprise search solution, so they don’t expect that, and 2) they’ve (likely) provided a local search function, so that would be completely sufficient as a search.

I’ve seen this scenario play out many, many times in just the last few years here. What often happens next depends on the application but includes many of the following symptoms:

  • The page archetypes designed by the Application Team will have the same (static) <title> tag in every instance of the page, regardless of the data displayed (generally, the data would be different based on query string parameters).
    • The effect? A web-crawler-based search engine (which we use) likely uses the <title> tag as an identifier for content and every instance of each page type has the same title, resulting in a whole lot of pretty useless (undifferentiated) search results. Yuck.
  • The page archetypes have either no other metadata – keywords, description, content date, author, etc. – or metadata that is the same on every page.
    • The effect? The crawler has no differentiation based on <title> tags and no additional hints from metadata. That is, lousy relevance.
  • The application has a variety of navigation or data manipulation capabilities (say, sorting data) based on standard HTML links.
    • The effect? The crawler happily follows all of the links – possibly indexing the same data many, many times, simply sorted on different columns.
    • Another effect? The dreaded calendar effect – the crawler will basically never stop finding new links because there’s always another page.
    • In either case, we see poor coverage of the content.

The overall effect is likely that the application does not work well with the enterprise search, or possibly that the application does not hold up to the pressure of the crawler hitting its pages much faster than anticipated (so I end up having to configure the crawler to avoid the application), ending with yet another set of content that’s basically invisible in search.

Bringing this back around to the title – the response I often get when inquiring about a newly released application is something like, “People will know how to find that content – it’s in this application! Why would this need to be in the enterprise search?”

When I then ask, “Well, how do people know that they even need to navigate to or look in this application?” I’ll get a (virtual) shuffling of feet and shoulder shrugs.

All because of a perpetual lack of asking a few basic questions during the requirements gathering stage of a project or (another way to look at it) a lack of standards or policies with “teeth” about the design and development of web applications. The unfortunate thing is that, in my experience, if you ask the questions early, it typically takes on the order of a few hours of a developer’s time to make the application work at least reasonably well with any crawler-based search engine. Unfortunately, because I often don’t find out about an application until after it’s in production, it then becomes a significant obstacle to get changes like this made.

I’ll write more in a future post about the standards I have worked to establish (which are making some headway into adoption, finally!) to avoid this.

Edit: I’ve now posted the standards as mentioned above – you can find them in my post Standards to Improve Findability in Enterprise Applications.

Evaluating and Selecting a Search Engine

Tuesday, September 30th, 2008

A few months back, I was asked to evaluate my company’s current search solution against another search engine to try to determine if it would be worthwhile to implement a new solution. I’ve done package / tool evaluations in the past, but I felt that there was something a bit different about this one in that I needed to somehow integrate a fairly standard requirements-based evaluation with a measure of the quality of the search results themselves, which is not easily expressed as concrete requirements.

So I set about the task and asked the SearchCoP for suggestions about how to do an evaluation of the search results in a meaningful and supportable way. I received several useful responses, including some suggestions from Avi Rappaport about a methodology for identifying a good representative set of search terms to use in an evaluation.

With my own experiences and those of the SearchCoP in hand, I came up with a process that I thought I would share here.

Two Components to the Evaluation

I split the assessment into two distinct parts. The first was a traditional “requirements-based” assessment which allowed me to reflect support for a number of functional or architectural needs I could identify. Some examples of such requirements were:

  • The ability to support multiple file systems;
  • The ability to control the web crawler (independent of robots.txt or robots tags embedded in pages)
  • The power and flexibility of the administration interface, etc.

The second part of the assessment was to measure the quality of the search results.

I’ll provide more details below for each part of the assessment, but the key thing for this assessment was to have a (somewhat) quantitative way to measure the overall effectiveness and power of the search engines. It might be possible to quantitatively combine the measures of these two components, though I did not do so in this case.

Requirements Assessment

For the first part, I used a simplified quality functional deployment matrix – I identified the various requirements to consider and assigned each a weight (level of importance); based on some previous experiences, I forced the weights to be either a 10 (very important – probably “mandatory” in a semantic sense), a 5 (desirable but not absolutely necessary) or a 1 (nice to have) – this provides a better spread in the final outcome, I believe.

Then I reviewed the search engines against those requirements and assigned each search engine a “score” which, again, was measured as a 10 (met out of the box), a 5 (met with some level of configuration), a 1 (met with some customization – i.e., probably some type of scripting or similar, but not configuration through an admin UI) and a 0 (does not meet and can not meet).

The overall “score” for an engine was then measured as the sum of the product of the score and weight for each requirement.

This simplistic approach can have the effect of giving too much weight to certain areas of requirements in total. Because each requirement is given a weight, if there are areas of requirements that have a lot of detail in your particular case, you can give that area too much overall weight simply because of the amount of detail. In other words, if you have a total of, say, 50 requirements and 30 of them are in one area (say you have specified 30 different file formats you need to support – each as a different requirement), then a significant percentage of your overall score will be contingent on that area. In some cases, that is OK but in many, it is not.

In order to work around this, I took the following approach:

  • Grouped requirements into a set of categories;
  • The categories should reflect natural cohesiveness of the requirements but should also be defined in a way that each category is roughly equal in importance to other categories;
  • Compute the total possible score for each category (which in my case was 10 * (total weight of the requirements in the category));
  • Compute the points a search engine earns in a category by summing the product of the engine’s score and the weight for each requirement in the category; the relative score is those points divided by the total possible score for the category;
  • Now sum the relative scores across categories, divide by the number of categories, and (to get a number between 0 and 100) multiply by 100

This approach gives you a score for each engine between 0 and 100 and also gives each category a roughly equal effect on the total score.
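A minimal sketch of that scoring and normalization, with made-up requirements (the 10/5/1 weights and 10/5/1/0 scores follow the scheme described above; dividing by the number of categories keeps the result on a 0–100 scale):

```python
from collections import defaultdict

def engine_score(requirements):
    """requirements: list of dicts with 'category', 'weight' (10/5/1) and
    'score' (10/5/1/0) for how well this engine meets each requirement.
    Returns a 0-100 score in which every category counts equally."""
    possible = defaultdict(int)   # maximum achievable points per category
    achieved = defaultdict(int)   # points this engine actually earned
    for r in requirements:
        possible[r["category"]] += 10 * r["weight"]
        achieved[r["category"]] += r["score"] * r["weight"]
    relative = [achieved[c] / possible[c] for c in possible]
    return 100 * sum(relative) / len(relative)

# Two categories, each contributing equally regardless of requirement count:
reqs = [
    {"category": "crawling", "weight": 10, "score": 10},
    {"category": "crawling", "weight": 5,  "score": 5},
    {"category": "admin",    "weight": 10, "score": 5},
]
print(engine_score(reqs))  # crawling 125/150, admin 50/100 -> about 66.7
```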

If you are looking for some insights on categories of requirements you might want to include in your evaluation, I provide some of my thoughts in a subsequent post.

Search Results Quality

To measure the quality of search results, I took Avi’s insights from the SearchCoP and identified a set of specific searches that I wanted to measure. I identified the candidate searches by looking at the log files for the existing search solution on the site and pulling out a few searches that fell into each category Avi identified. The categories included:

  • Simple queries
  • Complex queries
  • Common queries
  • Spelling, typing and vocabulary errors
  • Edge cases in matching, including:
    • Many matches
    • Few matches
    • No matches

Going into this, I assumed I did not necessarily know the “right” targets for these searches, so I enlisted some volunteers among a group of knowledgeable employees (content managers on the web site) who could complete a survey I put together. The survey included a section where the participant had to execute each search against each search engine (the survey provided a link to do the search – so the participants did not have to actually go to a search screen somewhere and enter the terms and search – this was important to keep it somewhat simpler). The participants were then asked to score the quality of the results for each search engine (on a scale of 1-5).

The survey also included some other questions about presentation of results, performance, etc. (even though we did not customize search result templates or tweak anything in the searches, we wanted to get a general sense of usability) and also included a section where users could define and rate their own searches.

The results from the survey were then analyzed to get an overall measure of quality of results across this candidate set of searches for each search engine – basically doing some aggregation of the different searches into average scores or similar.
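The aggregation itself is straightforward; a sketch, assuming the survey responses can be exported as simple records (the field names are illustrative):

```python
from collections import defaultdict
from statistics import mean

def aggregate_survey(responses):
    """responses: iterable of dicts like
    {"engine": "A", "query": "expense report", "rating": 4}
    using the 1-5 rating scale from the survey. Returns the mean rating
    per engine and per (engine, query) pair."""
    by_engine = defaultdict(list)
    by_pair = defaultdict(list)
    for r in responses:
        by_engine[r["engine"]].append(r["rating"])
        by_pair[(r["engine"], r["query"])].append(r["rating"])
    return ({e: mean(v) for e, v in by_engine.items()},
            {k: mean(v) for k, v in by_pair.items()})
```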

Outcome of the Assessment

With the engines we were looking at, the results were that one was better on the administration / architectural requirements and the other was better on the search results – which makes for an interesting decision, I think.

The key takeaway for me from this process is that it is at least quantitative – one can argue over the set of requirements to include, or the weight of any particular requirement or the score of an engine on a particular requirement. However, the discussion can be held at that level instead of a more qualitative level (AKA “gut feel”).

Additionally, for search engines, taking a two-part approach ensures that each of these very important factors is included and reflected in the final outcome.

Issues with this Approach

In the case of my own execution of this approach, I know there are some issues (though the general methodology is sound, I believe), including (in no particular order):

  • I defined the set of requirements (ideally, I would have liked to have input from others but I’ve basically been a one-man show and I don’t think others would have had a lot of input or time to provide that input).
  • I defined the weights for requirements (see above).
  • I assigned the score for the requirements (again, see above).
  • I did not have hands-on with each engine under consideration and had to lean a lot on documentation, demos and discussions with vendors.
  • All summed up – I think the exact scores could be questioned but, given that I was the only resource, it worked reasonably well.

As for the survey / search results evaluation:

  • I would have liked a larger population of participants, including people who did not know the site
  • I would have liked a larger population of queries to be included, but I felt the number was already pretty large (about 40 pre-defined plus the ability to add 10 more user-defined)
  • I did not mask which engine produced which results. As Walter Underwood mentions (he referenced this post from the SearchCoP thread), that can cause some significant issues with reliability of measures.

The 3 Principles of Enterprise Search (part 3): Relevance

Thursday, January 10th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous posts have discussed the principles of Coverage and Identity.

Here, I will cover the principle of Relevance.

So, in your efforts to improve search for your users, you have addressed the principle of Coverage and you have thousands of potential search candidates in your enterprise search tool. You have addressed the principle of Identity and all of those search results display well in a search results page, clearly identifying what they are so a searcher can confidently know what an item is. Now for the hardest of the three principles to address: Relevance.

The principle of Relevance is all about search result candidates showing up high in the results for appropriate search terms.  Relating this back to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Relevance addresses the situation where X is there, and may even be listed as X, but it is on the second (or even farther down) page of results.

This principle is in some ways both the hardest and simplest to address. It is hard because it practically requires that you anticipate every searcher’s expectations and that you can practically read their minds (no mean feat!). It’s simple (at least given a search engine) because relevance is also a primary focus for the search engine itself – many search engines differentiate themselves from competitors based on how well the engine can estimate relevance for content objects based on a searcher’s criteria; so your search engine is likely going to help you a lot with regard to relevance.

However, there are still a lot of issues to consider and areas you need to address to help your search engine as well as your users.

  1. One of the first things you should consider is the set of keywords associated with your content. There are several different ways search engines will encounter keywords:
    1. First and foremost, the content of your search items present a set of keywords to most search engines; this is going to be the content visible in a web page or the words in the body of documents.
    2. The keywords accessible in the form of “keywords” <meta> tags in HTML pages, or “keywords” fields in the File Properties of documents in various formats.
    3. The keywords might even be terms in a database that is related in some way to the content that your search engine can use. This is very common for tightly constrained environments that integrate both a content management (or collaboration) environment with a search experience. If the tool controls both the content and the search, it can take advantage of a lot of “insight” that might not be directly available to an enterprise search solution.
    4. Some search engines will even use the text of links pointing to a content item as keywords describing the item.  So content managers can influence the relevance of content they don’t manage themselves by how they refer to it.
    5. Lastly, you also need to understand how your search engine will use and interpret these various sources of keywords and focus on those that provide the most impact. Some search engines might ignore the “keywords” <meta> tag for example, so you may not need to be concerned with that at all.
  2. One detail to highlight with regard to the content of your search items is that, just like the navigation challenges discussed under Coverage, if you have web sites that depend on JavaScript to display content, then that content will likely be invisible to your search engine, so it will not contribute to the keywords users can use to find the pages.  I see this issue becoming more of a problem in the future as applications are built that take advantage of AJAX to present dynamic user interfaces.
  3. Once you have a strategy for how you will present keywords to your search engine, you need to determine how best to manage the set of keywords that will be most useful to your content managers and to the users of your search tool.  A principal tool for this is a taxonomy that helps inform your audience about preferred terms.  I’ll write more about taxonomies in the future – for now, you should know that a very effective way to improve search is to simply constrain the terms used to tag content to a well-managed set.
    1. A taxonomy can also be used to provide guided navigation or constrained search pick lists.  Instead of a simple keyword box for search, you can offer your users lists of values to select.  The utility of this will depend on your users’ needs, and you need to ensure you pay attention to usability.
  4. Related to taxonomies, you should also consider how best to manage synonyms.  This will likely require some work with your taxonomy (to associate synonyms with “preferred” terms); it may require you to manage synonyms for your search engine (to define the mappings between synonyms used by the engine – hopefully, these synonym rings are pulled from your taxonomy!); and you might need to institute some means to tag your content with both the preferred terms and the synonyms (especially if you are exposing your content to search engines other than your own – i.e., your content is exposed to internet search engines).
  5. A third issue related to relevance is the security of content; I relate this to relevance in the sense that if a user does not have access to a particular piece of content exposed by your search, that content effectively has zero relevance for that user.  Many web tools (especially collaboration applications) provide users with very powerful management tools to control visibility of content – including such details as differentiating between who can know a piece of content exists and who can download that content.  However, interpreting granular security controls on content is a very hard problem for an enterprise search tool to solve efficiently.  In my experience, the most common “solution” for this type of problem is to not index such secure areas for inclusion in your enterprise search, but to ensure that the tool provides a “local search” and that your enterprise search experience points users to this local search function when appropriate.
  6. Lastly for now, another area you should consider in terms of relevance is to monitor your search engine’s log files. Ultimately, I think this effort will transform into one of:
    1. Input to help you manage your taxonomy (by discovering the terms your search users are actually using and understanding how they differ from your taxonomy)
    2. Identification of holes in your content by understanding “not found” results (helping to identify and then solve Coverage issues)
    3. Identification of relevancy issues by understanding when some terms require more page scrolling than others.
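On that last set of points, even a simple pass over the query log gets you started; a minimal sketch, assuming the log can be exported as (query, result_count) rows and that you have a set of preferred terms from your taxonomy:

```python
from collections import Counter

def mine_search_log(rows, taxonomy_terms):
    """rows: iterable of (query, result_count) tuples from the search log.
    taxonomy_terms: set of preferred terms (lower-cased) from the taxonomy.
    Returns the most common zero-result queries (possible coverage gaps)
    and common queries that match no preferred term (candidates for new
    preferred terms or synonyms)."""
    not_found = Counter()
    off_taxonomy = Counter()
    for query, result_count in rows:
        q = query.strip().lower()
        if result_count == 0:
            not_found[q] += 1
        if q not in taxonomy_terms:
            off_taxonomy[q] += 1
    return not_found.most_common(50), off_taxonomy.most_common(50)
```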

Summary

To look back at the 3 Principles: 1) you need to make sure your search engine will find and index the necessary content; 2) you need to make sure your content will properly be identified in search results; and, 3) you need to ensure that your content will show as highly relevant for searches your users expect to show that content.

Addressing most of these issues does not require any magic or rocket science – just an awareness of the issues and the time and resources (these latter two being scarce for many!) to work on resolving them.

What have I missed? What else do others have to share?

The 3 Principles of Enterprise Search (part 2): Identity

Tuesday, January 8th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous post discussed Coverage. Here, I will cover the principle of Identity.

What is Identity in a search solution?

The principle of Identity relates to the way a search engine will display search results. Search engines will have a number of tools available to display items in search results, up to and including providing sophisticated templating mechanisms where you can provide your own logic for how an item is displayed. The Identity of a search result is the way in which it is presented in a search results page.

With regard to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Identity addresses the situation where X is there but it’s called Z.

At the heart of most of this is the idea of a “title” for an item. Most search engines will use the text in the <title> tag of an HTML page as the name of the item in search results. Similarly, most search engines will use the Document Title field of documents that support such a field (all OpenOffice and Microsoft Office formats, PDF files, etc).

Issues with Identity

The specific challenges I see in regard to this principle include:

  1. Web pages or documents with no title – This is a problem that is much more common with Documents than with web pages. Most web pages will have at least some kind of title, but not many users are even aware of the “Title” File Property or, if they are, most of those users don’t bother to ensure it is filled in. If you expose a lot of documents in your search results, this single problem can be a search killer because the search engine either ends up displaying the URL or generating a title.
    • Some search engines will use heuristics to identify a title to use in search results if the identified title field is empty; for example, a search engine might use the URL of the item as the title (often, the path names and/or file name in the URL can be as informative as a title), or the search engine might generate a title from header information in the document (looking for a particular style of text, etc.).
  2. Web pages or documents with useless or misleading titles – This issue probably plagues Documents and web pages equally. Some classic examples are caused by users creating documents by starting with a template, which was nicely titled “Business Paper Template”, for example, but then never changing the title. You end up with dozens of items titled “Business Paper Template”. Or users start with a completely unrelated document and edit it into what they’re working on; a user finds an item titled “Q4 Financial Report” but the item is actually a design document deliverable for a client consulting engagement.
  3. Web pages or documents with redundant titles – This problem shows up somewhat in documents (the “Business Paper Template” example), but it is far more common in dynamic web sites. The application developers start with a common template (say, a .jsp page) and leave in the static <title> tag for every single page. You end up with hundreds or thousands of items titled something like “Business Expense Reports” (or whatever). Nothing will cause disenchantment in a searcher like that! “OK – which one of these 10 items all titled the same am I supposed to pick??”
  4. Another important piece of Identity for search results is the snippet (or summary) that is often displayed along with the title. Some search engines will dynamically calculate this based on the user’s query (this is the most useful way to generate this); some engines will use a static description of the item – in this case, commonly the “description” <meta> tag may be used (or other common <meta> tags). You need to understand how your engine generates this and ensure the content being searched will show well. Depending on the quality of titles, this snippet can either be critically important (if titles are poor and the snippet is dynamically generated) or might be less critical if you ensure your content has good, clear, distinct titles.
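The first three issues lend themselves to a simple automated audit of whatever title source your engine uses; here is a minimal sketch, assuming you can export (url, title) pairs from a crawl or content inventory:

```python
from collections import Counter

def audit_titles(items, boilerplate=("template", "untitled", "new page")):
    """items: list of (url, title) pairs from a crawl or content inventory.
    Flags missing titles, titles shared by multiple items, and titles that
    look like leftover boilerplate."""
    counts = Counter((title or "").strip().lower() for _, title in items)
    report = {"missing": [], "duplicate": [], "boilerplate": []}
    for url, title in items:
        t = (title or "").strip().lower()
        if not t:
            report["missing"].append(url)
        elif counts[t] > 1:
            report["duplicate"].append((url, title))
        if any(word in t for word in boilerplate):
            report["boilerplate"].append((url, title))
    return report
```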

Addressing the issues

As with the principle of Coverage, identifying the specific cases where you have problems with Identity is commonly “half the battle”. Depending on the particular issue, fixing it is normally pretty straightforward (if laborious). For Documents with no (or misleading or useless) title, work with the content owner to educate them about the importance of filling in the File Properties. Ideally, a workflow in the content management system would include a review for this, though we do not have that in place at Novell.

We do have a content management community of practice, though, through which I have shared this kind of insight many times and I continue to educate anyone who is or becomes a content manager.

For web applications, I’ve used the same approach as I describe in addressing issues with the Coverage principle – education of development teams and review of applications (hopefully before they are deployed).

Now, on to the third Principle of Enterprise Search – the principle that turns a potential search result candidate that identifies itself well into a good search result candidate – the principle of Relevance.