Lee Romero

On Content, Collaboration and Findability

Archive for the ‘Search’ Category

People Search and Enterprise Search, Part 3 – The Fourth Generation

Monday, October 20th, 2008

So we get to the exciting conclusion of my essays on the inclusion of employees in enterprise search. If you’ve read this far, you know how I have characterized the first and second generation solutions and also provided a description of a third generation solution (which included some details on how we implemented it).

Here I will describe what I think of as a fourth generation solution to people finding within the enterprise. As I mentioned in the description of the third generation solution, one major omission at this point is that the only way you can find people is through administrative information – things like their name, address, phone number, user ID, email, etc.

This is useful when you have an idea of the person you’re looking for or at least the organization in which they might work. What do you do when you don’t know the person and may not even know the organization in which they work? You might know the particular skills or competencies they have but that may be it. This is particularly problematic in larger organizations or organizations that are physically very distributed.

The core idea with this type of solution is to provide the ability to find and work with people based on aspects beyond the administrative – the skills of the people, their interests, perhaps the network of people with which they interact, and more. While this might be a simplification, I think of this as expertise location, though that, perhaps, most cleanly fits into the first use case described below.

Some common use cases for this type of capability include:

  • Peer-to-peer connections – an employee is trying to solve a particular problem and they suspect someone in the company may have some skills that would enable them to solve the problem more quickly. Searching using those skills as keywords would enable them to directly contact relevant employees.
  • Resource planning – a consulting organization needs to staff a particular project and needs to find specific people with a particular skill set.
  • Skill assessment – an organization needs to be able to ascertain the overall competency of their employees in particular skill sets to identify potential training programs to make available.

This capability is something that has often been discussed and requested at my current employer, but which no one has really been willing to sponsor. That being said, I know there are several vendors with solutions in this space, including (at least – please share if you know of others):

  • Connectbeam – A company I first found out about at KM World 2007. They had some interesting technology on display that combines expertise location with the ability to visualize and explore social networks based on that expertise. Their product could digest content from a number of systems to automatically discern expertise.
  • ActiveNet – A product from Tacit Software, which (at a high level) is similar to Connectbeam. An interesting twist to this product is that it leaves the individuals whose expertise is managed in the system in control of how visible they are to others. In the discussions I’ve had with this company about the product, I’ve always had the impression that, in part, this provides a kind of virtual mailing list functionality where you can contact others (those with the necessary expertise) by sending an email without knowing who it’s going to. Those who receive it can either act on it or not and, as the sender, you only know who replies.
  • Another product about which I only know a bit is from a company named Trampoline Systems. I heard about them as I was doing some research on how to tune a prototype system of my own and understand that their Sonar platform provides similar functionality.
  • [Edit: Added this on 03 November, 2008] I have also found that Recommind provides expertise location functionality – you can read more about it here.
  • [Edit: Added this on 03 November, 2008] I also understand that the Inquira search product provides expertise location, though it’s not entirely clear to me from what I can find about this tool how it does this.

A common aspect of these products is that they attempt to automate (and perhaps succeed in automating) the process of expertise discovery. I’ve seen systems where an employee has to maintain their own skill set, and the problem with these is that the business process to maintain the data never seems to really embed itself into a company – inevitably, the data gets out of date and is ill-maintained, and so the system does not work.

I cannot vouch for the accuracy of these systems, but I firmly believe that if people search in the enterprise is going to meet the promise of enabling people to find each other and connect based on of-the-moment needs (skills, interests, areas of work, etc.), it will rely on this type of capability – automatically discovering those aspects of a worker from their work products, their project teams, their work assignments, etc.
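To make the idea a bit more concrete, here is a toy sketch of automatic expertise discovery (my own illustration with made-up data – not how Connectbeam, ActiveNet or any of the other products above actually work): score the words in the documents each person has authored with a simple TF-IDF calculation and treat the top-scoring terms as candidate expertise.

```python
import math
import re
from collections import Counter

# Hypothetical input: documents each worker has authored.
DOCS_BY_AUTHOR = {
    "jsmith": ["LDAP schema design for the identity vault deployment",
               "Troubleshooting eDirectory replication across sites"],
    "mjones": ["Quarterly budget forecast and travel expense policy update"],
}

def tokenize(text):
    return re.findall(r"[a-z]{3,}", text.lower())

def expertise_terms(docs_by_author, top_n=5):
    """Score each author's vocabulary with TF-IDF and keep the top terms."""
    # Document frequency across the whole collection.
    all_docs = [tokenize(d) for docs in docs_by_author.values() for d in docs]
    df = Counter()
    for tokens in all_docs:
        df.update(set(tokens))
    n_docs = len(all_docs)

    profile = {}
    for author, docs in docs_by_author.items():
        tf = Counter()
        for d in docs:
            tf.update(tokenize(d))
        # Terms that are frequent for this author but rare overall score highest.
        scored = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        profile[author] = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return profile

print(expertise_terms(DOCS_BY_AUTHOR))
```

A real system would obviously feed on far richer signals (project assignments, authored deliverables, email and collaboration activity) and would need to handle the privacy concerns that products like ActiveNet make a central feature.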

I imagine that in the not too distant future, as we see more “web 2.0” functionality merge into the enterprise, this type of capability will become expected and welcomed – it will be exciting to see how people work together then.

This brings to a close my discussion of the various types of people search within the enterprise. I hope you’ve found this of interest. Please feel free to let me know if you think I have any omissions or misstatements in here – I’m happy to correct and/or fill in.

I plan another few posts that discuss a proof of concept I have put together based around the ideas of this fourth generation solution – look for those soon!

People Search and Enterprise Search, Part 2 – A third generation solution

Wednesday, October 15th, 2008

In my last post, I wrote about what I termed the first generation and second generation solutions to people search in the enterprise. This time, I will describe what I call a “third generation” solution to the problem, one that integrates people search with your enterprise search solution.

This is the stage of people search in use within my current employer’s enterprise.

What is the third generation?

What I refer to as a third generation solution for people search is one where an employee’s profile (their directory entry, i.e., the set of information about a particular employee) becomes a viable and useful target within your enterprise search solution. That is, when a user performs a search using the pervasive “search box” (you do have one, right?), they should be able to expect to find their fellow workers in the results (obviously, depending on the particular terms used to do the search) along with any content that matches that.

You remove the need for a searcher to know they need to look in another place (another application, i.e., the company’s yellow pages) and, instead, reinforce the primacy of that single search experience that brings everything together that a worker needs to do their job.

You also offer the full power of your enterprise search engine:

  • Full text search – no need to specifically search within a field, though most engines will also let you offer fielded search as an option;
  • The power of the search engine to work on multi-word searches to boost relevancy – so a search on just a last name might include a worker’s profile in the search results, but one that includes both a first and last name (or user ID or location or other keywords that might appear in the worker’s profile) likely ensures that the person shows on the first page of results amidst other content that matches;
  • The power of synonyms – so you can define synonyms for names in your engine and get matches for “Rob Smith” when a user searches on “Robert Smith” or “Bob Smith”;
  • Spelling corrections – Your engine likely has this functionality, so it can automatically offer corrections even if someone misspells a name.
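As an illustration of how profiles can become ordinary search targets (a minimal sketch with invented field names, not our actual implementation), one approach is to render each directory entry as a small, crawlable HTML page with a distinct <title> and some descriptive metadata, so the enterprise search crawler indexes employees just like any other web content:

```python
from html import escape
from pathlib import Path

# Hypothetical directory export; in practice this would come from your corporate directory.
EMPLOYEES = [
    {"uid": "jsmith", "name": "Jane Smith", "title": "Network Engineer",
     "dept": "IT Infrastructure", "phone": "+1 555 0100", "email": "jsmith@example.com"},
]

PAGE = """<html><head>
<title>{name} - {title} - Employee Profile</title>
<meta name="description" content="{name}, {title}, {dept}">
<meta name="keywords" content="{name}, {uid}, {dept}">
</head><body>
<h1>{name}</h1>
<p>{title}, {dept}<br>Phone: {phone}<br>Email: {email}</p>
</body></html>"""

def publish_profiles(employees, out_dir="profiles"):
    """Write one crawlable page per employee with a unique, descriptive title."""
    Path(out_dir).mkdir(exist_ok=True)
    for e in employees:
        safe = {k: escape(str(v)) for k, v in e.items()}
        Path(out_dir, f"{e['uid']}.html").write_text(PAGE.format(**safe))

publish_profiles(EMPLOYEES)
```

The point of the sketch is simply that once a profile exists as a well-titled page, everything the engine already does (full text search, synonyms, spelling correction) applies to people for free.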

Below, you will find a discussion of the implementation process we used and the problems we encountered. It might be of use to you if you attempt this type of thing.

Before getting to that, though, I would like to discuss what I believe to be the remaining issue with a third generation solution in order to set up my follow-up post on this topic, which will describe additional ideas for solving the “people finder” problem within an enterprise.

The primary issue with the current solution we have (or any similar solution based strictly on information from a corporate directory) is that the profile of a worker consists only of administrative information. That is, you can find someone based on their name, title, department, address, email, etc., etc., etc., but you can not do anything useful to find someone based on much more useful attributes – what they actually do, what their skills or competencies are or what their interests might be. More on this topic in my next post!

The implementation of our third generation solution (read on for the gory details)

Read on from here for some insights on the challenges we faced in our implementation of this solution. It gets pretty detailed from here on out, so you’ve been warned!


People Search and Enterprise Search

Tuesday, October 14th, 2008

This post is the first of a brief series of posts I plan to write about the integration of “people search” (employee directory) with your enterprise search solution. In a sense, this treats “people” as just another piece of content within your search, though they represent a very valuable type of content.

This post will be an introduction and describe both a first and second generation solution to this problem. In subsequent posts, I plan to describe a solution that takes this solution forward one step (simplifying things for your users among other things) and then into some research that I believe shows a lot of promise and which you might be able to take advantage of within your own enterprise search solution.

Why People Search?

Finding contact information for your co-workers is such a common need that people have, forever, maintained phone lists – commonly just as word processing documents or spreadsheets – and also org charts, probably in a presentation file format of some type. I think of this approach as a first generation solution to the people search problem.

Its challenges are numerous, including:

  1. The maintenance of the document is fraught with the typical issues of maintaining any document (versioning, availability, etc.)
  2. In even a moderately large organization, the phone list may need to be updated by several people throughout the organization to keep it current.
  3. Search within this kind of phone list is limited – you can ensure you always have the latest version and then open it up and use your word processor’s search function or (I remember this well, myself) always keep a printout of the latest version of the phone list next to your workspace so you can look through it when you need to contact someone.

As computer technology has evolved and companies implemented corporate directories for authentication purposes (Active Directory, LDAP, eDirectory, etc.), it has become common to maintain your phone book as a purely online system based on your corporate directory. What does such a solution look like and what are its challenges?

A “Second Generation” Solution

I think it’s quite common now that companies will have an online (available via their intranet) employee directory that you can search using some (local, specific to the directory) search tools – obvious things like fielded searches on name, title, phone number, etc. My current employer has sold a product named eGuide for quite some time that provides exactly this type of capability.

eGuide is basically a web interface for exposing parts of your corporate Directory for search and also for viewing the org chart of a company (as reflected in the Directory).

We have had this implemented on our intranet for many years now. It has been (and continues to be) one of the more commonly used applications on our intranet.

The problems with this second generation solution, though, triggered me to try to provide a better solution a few years ago using our enterprise search. What are the problems with this approach? Here are the issues that triggered a different (better?) solution:

  1. First and foremost, with nothing more than the employee finder as a separate place to search, you immediately force a searcher to make a decision before they do their search as to where they want to search. Many users might expect that the “enterprise” search actually does include anything they can navigate to as potential targets, so when they search on a person’s name and don’t see it in the result set, they immediately think either A) why does the search not include individual people’s information, or B) this search engine is so bad that, even though it must include people information, it can’t even show the result at a high enough relevance to get it on the first page!
    1. Despite my statement to the contrary above, I am aware that Jakob Nielsen does actually advocate the presence of both a “people search” box and a more general search box because people are aware of the distinction between searching for content and searching for people. We do still have both search boxes on our intranet, though, in a sense, the people search box is redundant.
  2. Secondly, the corporate directory commonly is a purely fielded search – you have to select which field(s) you want to search in and then you are restricted to searching just those fields.
    1. In other words, you as a searcher, need to know in which field a particular string (or partial string) might appear. For many fields, this might not be an issue – generally, first and last name are clear (though not always), email, phone number, etc., but the challenge is that a user has to decide in which field they want to look.
  3. Third, related to the previous point, directory searches are generally simplistic searches based on string matching or partial string matching. With a full search engine, you introduce the possibility of taking advantage of synonyms (especially useful on first names), doing spelling corrections, etc.
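To make the difference concrete, here is a toy sketch (invented data, no real directory involved) contrasting a fielded lookup, where the searcher must pick the field, with a full-text match across the whole profile:

```python
PROFILES = [
    {"first": "Robert", "last": "Smith", "uid": "rsmith",
     "title": "Account Manager", "phone": "555-0142"},
]

def fielded_search(profiles, field, value):
    """Second-generation style: the searcher must know which field to query."""
    return [p for p in profiles if value.lower() in p[field].lower()]

def full_text_search(profiles, terms):
    """Third-generation style: match the terms anywhere in the profile."""
    results = []
    for p in profiles:
        blob = " ".join(p.values()).lower()
        if all(t.lower() in blob for t in terms.split()):
            results.append(p)
    return results

print(fielded_search(PROFILES, "last", "smith"))     # works, but only if you pick "last"
print(full_text_search(PROFILES, "smith account"))   # matches across name and title
```

A real engine adds synonyms, stemming and relevance ranking on top of this, but even the naive version shows why removing the "which field?" decision helps the searcher.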

So there’s a brief description of what I would characterize as a first generation solution and a second generation solution along with highlights of some issues with each.

Up next, I’ll describe the next step forward in the solution to this issue – integrating people into your enterprise search solution.

People know where to find that, though!

Monday, October 13th, 2008

The title of this post – “People know where to find that, though!” is a very common phrase I hear as the search analyst and the primary search advocate at my company. Another version would be, “Why would someone expect to find that in our enterprise search?”

Why do I hear this so often? I assume that many organizations, like my own, have many custom web applications available on their intranet and even their public site. It is because of that prevalence, combined with a lack of communication between the Business and the Application team, that I hear these phrases so often.

I have (unfortunately!) lost count of the number of times a new web-based application goes into production without anyone even considering the findability of the application and its content (data) within the context of our enterprise search.

Typically, the conversation seems to go something like this:

  • Business: “We need an application that does X, Y and Z and is available on our web site.”
  • Application team: “OK – let’s get the requirements laid out and build the site. You need it to do X, Y and Z. So we will build a web application that has page archetypes A, B and C.”
  • Application team then builds the application, probably building in some kind of local search function – so that someone can find data once they are within the application.
  • The Business accepts the usability of the application and it goes into production.

What did we completely miss in this discussion? Well, no one in the above process (unfortunately) has explicitly asked the question, “Does the content (data) in this site need to be exposed via our enterprise search?” Nor has anyone even asked the more basic question, “Should someone be able to find this application [the “home page” of the application in the context of a web application] via the enterprise search?”

  • Typically, the Business makes the (reasonable) assumption that goes something like, “Hey – I can find this application and navigate through its content via a web browser, so it will naturally work well with our enterprise search and I will easily be able to find it, right?!”
  • On the other hand, the Application Team has likely made 2 assumptions: 1) the Business did not explicitly ask for any kind of visibility in the enterprise search solution, so they don’t expect that, and 2) they’ve (likely) provided a local search function, so that would be completely sufficient as a search.

I’ve seen this scenario play out many, many times in just the last few years here. What often happens next depends on the application but includes many of the following symptoms:

  • The page archetypes designed by the Application Team will have the same (static) <title> tag in every instance of the page, regardless of the data displayed (generally, the data would be different based on query string parameters).
    • The effect? A web-crawler-based search engine (which we use) likely uses the <title> tag as an identifier for content and every instance of each page type has the same title, resulting in a whole lot of pretty useless (undifferentiated) search results. Yuck.
  • The page archetypes have either no other metadata or redundant metadata – keywords, description, content-date, author, etc.
    • The effect? The crawler has no differentiation based on <title> tags and no additional hints from metadata. That is, lousy relevance.
  • The application has a variety of navigation or data manipulation capabilities (say, sorting data) based on standard HTML links.
    • The effect? The crawler happily follows all of the links – possibly indexing the same data (redundantly) many, many times, simply sorted on different columns.
    • Another effect? The dreaded calendar effect – the crawler will basically never stop finding new links because there’s always another page.
    • In either case, we see poor coverage of the content.

The overall effect is likely that the application does not work well with the enterprise search, or possibly that the application does not hold up to the pressure of the crawler hitting its pages much faster than anticipated (so I end up having to configure the crawler to avoid the application), ending with yet another set of content that’s basically invisible in search.

Bringing this back around to the title – the response I often get when inquiring about a newly released application is something like, “People will know how to find that content – it’s in this application! Why would this need to be in the enterprise search?”

When I then ask, “Well, how do people know that they even need to navigate to or look in this application?” I’ll get a (virtual) shuffling of feet and shoulder shrugs.

All because of a perpetual lack of asking a few basic questions during the requirements gathering stage of a project or (another way to look at it) a lack of standards or policies with “teeth” about the design and development of web applications. The unfortunate thing is that, in my experience, if you ask the questions early, it’s typically on the scale of a few hours of a developer’s time to make the application work at least reasonably well with any crawler-based search engine. Unfortunately, because I often don’t find out about an application until after it’s in production, it then becomes a significant obstacle to get changes like this made.
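To give a sense of how small those changes usually are (a rough sketch with hypothetical page data, not tied to any particular framework), the page archetype mostly just needs to emit a per-record <title> and description, and sortable listing views need a robots hint so a crawler does not index every permutation:

```python
def render_head(record, sorted_view=False):
    """Build a <head> that gives a crawler a distinct title and some metadata per record."""
    robots = '<meta name="robots" content="noindex,follow">' if sorted_view else ""
    return f"""<head>
<title>Expense report {record['id']} - {record['name']}</title>
<meta name="description" content="Expense report {record['id']} for {record['name']}, filed {record['date']}">
{robots}
</head>"""

# One distinct, descriptive <title> per record instead of the same static <title> everywhere.
print(render_head({"id": "4711", "name": "Jane Smith", "date": "2008-09-30"}))

# A sortable listing view gets a noindex hint so the crawler does not index every column ordering.
print(render_head({"id": "all", "name": "All reports", "date": "n/a"}, sorted_view=True))
```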

I’ll write more in a future post about the standards I have worked to establish (which are making some headway into adoption, finally!) to avoid this.

Edit: I’ve now posted the standards as mentioned above – you can find them in my post Standards to Improve Findability in Enterprise Applications.

What is Enterprise Search?

Thursday, October 9th, 2008

Having written previously about my own principles of enterprise search and then some ideas on how to select a search engine, I thought it might be time to back up a bit and write about what I think of as “enterprise search”. Perhaps a bit basic or unnecessary but it gives some context to future posts.

The Enterprise in Enterprise Search

For me, the factors of a search solution that make it an enterprise solution include the following:

The user interface to access the solution is available to all employees of the company.

This has the following implications:

  • Given today’s technologies, this probably means that it’s a web-based interface to access the search.
    • More generally, the interface needs to be easily made available across the enterprise. In any somewhat-large organization, that means something either available online or easily installed or accessed from a user’s workspace.
  • I would also suggest that the search interface should be easily accessible from an employee’s standard workspace or a common starting point for employees.
    • One easy way to achieve this is to make access to an enterprise search solution part of the general intranet experience – especially on an intranet that shares a standard look-and-feel (and so, hopefully, a standard template). This is the ubiquitous “search box”.
    • Alternatively, if users commonly use a specific application (say a CRM application or a collaboration tool), integrating the enterprise search into that is a better solution.
    • Lastly, it might be necessary to make access to the search solution “many-headed”. Meaning, it might be best to make it available through a number of means, including through a standard intranet search, a specialized client-based application and embedded in other, user-specific tools.
  • Given the likely broad range of users who will use it, the search interface should be subject to very thorough usability design and testing.
  • Adopting some of the standard conventions of a search experience is a good idea.

The content available through the solution covers all (relevant) content available to employees

This has the following implications:

  • If your enterprise has a significant volume of web content, your enterprise search should index all of those web pages – either via a web crawling approach or via indexing the file system containing the files (if it’s all static).
  • If your enterprise has a significant volume of content (data) in enterprise applications (CRM solution, HR system, etc.), you should have a strategy to determine which (if any) of the content from those systems would be included, how it will be included and how it will be presented in search results (potentially combined with content from many other systems in the same results page)
  • If your enterprise has custom web applications (and what organization does not), you should expect to provide a set of standards for design and development of web applications to ensure good findability from them and also expect to have to monitor compliance with those.
  • If your enterprise has significant content in collaboration tools (and who doesn’t – at least email!), you should have a strategy for including or not including that content. This could be very broad-ranging – email, SharePoint (and similar applications from companies like Interwoven, Open Text, Vignette, Novell, etc.), shared file systems, IM logs, and so on. At the very least, you need to consider the cost and value of including these types of content.
  • If you have content repositories available to employees (a document management system (or systems!) or a records management system), again, you should consider the cost and value of including content from these in your enterprise search.
  • While it is very useful to have a separate search for finding employees in a corporate directory, I believe that an enterprise search solution should include employees as a distinct “content type” and include them in the standard search results page as well when relevant (e.g., searching on employee names, etc.)
  • Another major question regarding the content of your enterprise search is security. If you include all of that content in your search, how will you manage the security of the items? The two major options are early binding (building ACLs into the search) or late binding (checking security at search time). If you are not familiar with these, I would recommend you do a bit of internet searching on the topics as it’s very important to your solution. I’ve found some interesting articles on this topic.
    • In my mind, it’s also feasible to “punt” on security in a sense and work to ensure that your enterprise search solution includes everything that is generally accessible to your employee population but does not include anything with specific access control on it.
    • Achieving the effect of getting a user “close to” the content (ensuring some level of “information scent” shows up) while leaving it to the user to make the final step (through any application-specific access control) seems to work well.
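For illustration only, here is a highly simplified sketch of what late binding amounts to (not how any particular engine implements it): trimming the raw result list against the source system's access check at query time.

```python
def user_can_read(user, doc):
    """Stand-in for the source repository's access check (ACL lookup, API call, etc.)."""
    return user in doc.get("allowed", []) or not doc.get("allowed")

def late_bound_results(raw_results, user, page_size=10):
    """Trim results the user cannot read before building the results page.

    This is why late binding can be expensive: every candidate may need a
    security check against the source system at search time.
    """
    visible = []
    for doc in raw_results:
        if user_can_read(user, doc):
            visible.append(doc)
        if len(visible) == page_size:
            break
    return visible

raw = [{"title": "Public FAQ"}, {"title": "HR salary plan", "allowed": ["hradmin"]}]
print(late_bound_results(raw, user="jsmith"))   # the HR document is trimmed out
```

Early binding moves that check to index time by storing the ACLs alongside the documents, trading freshness of the security decision for query-time performance.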

The Search in Enterprise Search

The other half of your enterprise search solution will be the search engine itself. There are plenty of options available (many!) with a variety of strengths and weaknesses. I think if you plan to implement a truly enterprise search, the above list of content-based considerations should get you thinking of all of the places where you may have content “hiding” in your organization.

From that list, you should have a good sense of the volume of content and the complexity of sources your search will need to deal with.

Combining that with a careful requirements definition process and evaluation of alternatives should lead to a successful selection of a tool.

Once you have a tool, you “just” need to apply the proper amount of elbow grease to get it to index all of the content you wish and present it in a sensible way to your users! No big deal, right?

Categories of Search Requirements

Wednesday, October 1st, 2008

I was recently asked by a former co-worker (Ray Sims) for some suggestions around requirements that he might use as the basis for an evaluation of search engines. Having just gone through such an evaluation myself, and also having posted here about the general methodology I used for the evaluation, I thought I’d follow that up with some thoughts on requirements.

If you find yourself needing to evaluate a search engine, these might be of value – at least in giving you some areas to further detail.

I normally think of requirements for search in two very broad categories – those that are basically about helping the user doing the search (End User Search Requirements) and those that are directed more at the person (or people) responsible for administering / maintaining the search experience (Administrator Requirements).

End User Search Requirements

  • Search result page customization – Is it straightforward to provide your own UI on top of the search results (for integration into your web site)?
  • Search result integration with other UIs (outside of a web experience) – Specifically, it’s possible you might want to use search results in a non-web-based application – can the engine do that? (If you can provide result pages in different formats, a simple way to do this is to provide an XML result format that an application can pull in via a URL.)
  • Search result summaries for items – Specifically, these should be dynamic. The snippet shown in the results should show something relevant to what the searcher searched on – not just a static piece of text (like a metadata description field). This, by itself, can greatly enhance the perceived quality of results because it makes it easier for a user to make a determination on the quality of an item right from the search results – no need to look at the item (even a highlighted version of it). A simple sketch of this follows the list.
  • Highlighting – it should be possible to see a highlighted version of a result (i.e., search terms are highlighted in the display of the document)
  • “Best Bets” (or key match or whatever) – Some don’t like these, but I think it’s important to have some ability to “hand pick” (or nearly hand pick) some results for some terms – also, I think it’s very desirable to be able to say “If a user searches on X, show this item as the top result” regardless of where that item would organically show in the result (or it might not even be really indexable)
  • Relevancy calculation “soundness” – This basically means that the engine generates a good measure of relevancy for searches and encompasses most of what differentiates engines. You should understand at a general level what affects the relevancy as computed by the engine. (For many search engines, this is the “magic dust” they bring to the table – so they may not be willing to expose too much about how they do this, but you should ask.)
  • Stemming – The engine should support stemming – if a user searches on “run”, it should automatically match the use of words that share the same stem – “runs”, “running”, “ran”, etc.
  • Synonyms – Similar to stemming, the engine should support synonyms – if I search on “shoe”, it might be useful to include content that matches “boot” or “slipper”, etc.
  • Concept understanding (entity extraction) – Can the engine determine the entities in a piece of content even when the content is not explicitly defined? A piece of content might be about “Product X”, say, but it may never even explicitly mention “Product X”. Some search engines will claim to do this type of analysis.
  • Performance – Obviously, good performance is important and you should understand how it scales. Do you expect a few thousand searches a week? Tens of thousands? Hundreds of thousands? You need to understand your needs and ensure that the engine will meet them.
  • Customization of error / not found presentation – Can you define what happens when no results are found or some type of system error occurs? It can be useful to be able to define a specific behavior when an engine would otherwise return no results (a behavior that might even be outside of the engine itself).
  • Related queries – It might be desirable to have something like, “Users who searched on X also commonly searched on Y”
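As a sketch of the dynamic summaries item above (a naive illustration – real engines are far more sophisticated about sentence selection and term weighting), a snippet generator picks the window of the document that best covers the query terms and highlights them:

```python
import re

def dynamic_snippet(text, query, window=120):
    """Return the slice of the document that best covers the query terms, with highlighting."""
    terms = [t for t in query.lower().split() if t]
    words = text.split()
    best_start, best_hits = 0, -1
    # Slide a 20-word window over the document and keep the one with the most query-term hits.
    for start in range(len(words)):
        chunk = " ".join(words[start:start + 20]).lower()
        hits = sum(chunk.count(t) for t in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    snippet = " ".join(words[best_start:best_start + 20])[:window]
    for t in terms:
        snippet = re.sub(f"(?i)({re.escape(t)})", r"<b>\1</b>", snippet)
    return snippet + "..."

doc = ("The expense policy was updated in September. Travel expenses must be "
       "submitted within 30 days and approved by the employee's manager.")
print(dynamic_snippet(doc, "travel expenses"))
```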

Administrator Requirements

  • Indexing of web content – Most times, it’s important to be able to index web content – commonly through a crawler, especially if it’s dynamic content.
  • Indexing of repositories – You should understand your repository architecture and which repositories will need to be indexed and how the engines will do so. Some engines provide special hooks to index different major vendors (Open Text, SharePoint, Documentum, etc.) These types of tools are often not crawlable using a general web spider / crawling approach.
  • File System indexes – Many companies still have significant content accessible on good old file servers – understand what types of file systems can be indexed and the protocols that the search engine supports (Samba, NFS, etc.)
  • Security of search results – Often, you might want to provide a single search experience that users can use to search any content to which they can navigate, even if that content is in its own repository which follows its own (proprietary) mechanism to secure documents.
    • This is something we have not tackled, but some engines do so. You typically have two approaches – “early binding”, when the security is basically rolled into the index and “late binding” which does security checking as users do searching.
    • Most vendors do the former because it can be very expensive to do a security check on every document that might show up in search results.
    • The primary advantage of late binding is that the security decisions are always current. With early binding, if you refresh your index weekly on, say, Saturday and someone grants me access to a document on Monday, I still won’t see it in search until after the next refresh; conversely, people can continue to see items in search results that they no longer have access to.
  • Index scheduling / control – Regardless of the type of index, you should be able to control the schedule of indexing or how fast the indexer might hit your web sites / repositories / file systems. Also, it can be very useful to have different parts of the site refreshed at different rates. You might want press releases refreshed (or at least checked) hourly, while product documentation might only need to be refreshed weekly or monthly.
  • Relevancy control – It should be possible to administratively modify the relevancy for items – up or down. Ideally, this should be based on attributes of the content such as: the server it’s on, the path on the server, the date range of the content, presence of particular meta data, etc.
  • Synonyms – It should be possible to define business-specific synonyms. Some insight from Avi Rappoport (via the SearchCoP) is that you should be careful in the use of generic synonyms – they may cause more problems than they fix (so if an engine provides synonym support, you might want to know if you get some default synonyms and how you might disable them).
  • Automation / integration – It is nice if the search engine can integrate with or somehow provide support for automatic maintenance of some aspects of its configuration. For example, synonyms – you might already have a means to manage those (say, in your taxonomy tool!) and having to manually administer them as a separate work process would probably lead to long-term maintainability issues; in that case, some type of import mechanism would help. Or, as another example, have your relevancy adjustments integrated with your web analytics (so that content that is more popular based on usage goes up in relevancy).
  • Performance (again) – How much content do you expect to index? How fast can that content be indexed by the engine? Does the engine do full re-indexing? Incremental? Real-time?
  • Reporting – You need to have good reporting.
    • Obvious stuff like most common searches (grouped by different spans like day, hour, week, month, etc., and also for time periods you can define – meaning, “Show me most common searches for the last six months grouped by week”), most common “no result” searches, common “error” searches, etc.
    • It would be especially useful to be able to do time analysis across these types of dimensions – Most engines don’t provide that from my experience; you can get a dump for a time period and a separate one for another period and you have to manually compare them. Being able to say, “How common has this search been for the last six months in each month?” helps you understand longer-term trends.
    • Also, it can be very useful to see reports where the search terms are somehow grouped. So a search for “email” and a search for “e-mail” (to use a very simple example) would show up together – basically some kind of categorization / standardization of the searches. Doing grouping based purely on the individual searches can make it very hard to “see the forest for the trees”. A sketch of this kind of grouping follows the list.
    • Lastly – reports on what people do with search results can be very useful. OK – fine, “Product X” is a top ten search consistently, but what are people selecting when they search on that? Do they not click on anything? Do they click on the same item 90% of the time? Etc.
    • I’m also planning to post separately on more details around search metrics and analytics.  Keep watch!
  • Last but certainly not least – Architectural “fit” – Make sure you understand how well the engine will fit in your data center: which OS(es) it runs on, hardware compatibility, etc. For some engines where you purchase a closed appliance, this may not be relevant, but you should involve your data center people in understanding this area.
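And as a small sketch of the query grouping mentioned in the reporting item above (the normalization rules here are invented – in practice they would be driven by your own logs and taxonomy), queries can be canonicalized before counting so that trivially different forms roll up together:

```python
import re
from collections import Counter

# Hypothetical normalization rules; a real list would come from your own logs / taxonomy.
CANONICAL = {"e-mail": "email", "emails": "email"}

def normalize(query):
    """Lowercase, collapse whitespace and map variant words to a canonical form."""
    q = re.sub(r"\s+", " ", query.strip().lower())
    return " ".join(CANONICAL.get(word, word) for word in q.split())

def grouped_report(search_log):
    """Count searches after normalization so 'email' and 'E-mail' report as one line."""
    return Counter(normalize(q) for q in search_log).most_common()

log = ["email", "E-mail", "e-mail  setup", "Email setup", "vpn"]
print(grouped_report(log))
# [('email', 2), ('email setup', 2), ('vpn', 1)]
```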

Evaluating and Selecting a Search Engine

Tuesday, September 30th, 2008

A few months back, I was asked to evaluate my company’s current search solution against another search engine to try to determine if it would be worthwhile to implement a new solution. I’ve done package / tool evaluations in the past but I felt that there was something a bit different about this in that I needed to somehow integrate a fairly standard requirements-based evaluation with a measure of the quality of the search results themselves, which are not easily expressed as concrete requirements.

So I set about the task and asked the SearchCoP for suggestions about how to do an evaluation of the search results in a meaningful and supportable way. I received several useful responses, including some suggestions from Avi Rappoport about a methodology for identifying a good representation of search terms to use in an evaluation.

With my own experiences and those of the SearchCoP in hand, I came up with a process that I thought I would share here.

Two Components to the Evaluation

I split the assessment into two distinct parts. The first was a traditional “requirements-based” assessment which allowed me to reflect support for a number of functional or architectural needs I could identify. Some examples of such requirements were:

  • The ability to support multiple file systems;
  • The ability to control the web crawler (independent of robots.txt or robots tags embedded in pages)
  • The power and flexibility of the administration interface, etc.

The second part of the assessment was to measure the quality of the search results.

I’ll provide more details below for each part of the assessment, but the key thing for this assessment was to have a (somewhat) quantitative way to measure the overall picture of the effectiveness and power of the search engines. It might be possible to even quantitatively combine the measures of these two components, though I did not do so in this case.

Requirements Assessment

For the first part, I used a simplified quality function deployment (QFD) matrix – I identified the various requirements to consider and assigned them a weight (level of importance); based on some previous experiences, I forced the weights to be either a 10 (very important – probably “mandatory” in a semantic sense), a 5 (desirable but not absolutely necessary) or a 1 (nice to have) – this provides a better spread in the final outcome, I believe.

Then I reviewed the search engines against those requirements and assigned each search engine a “score” which, again, was measured as a 10 (met out of the box), a 5 (met with some level of configuration), a 1 (met with some customization – i.e., probably some type of scripting or similar, but not configuration through an admin UI) and a 0 (does not meet and can not meet).

The overall “score” for an engine was then measured as the sum of the product of the score and weight for each requirement.

This simplistic approach can have the effect of giving too much weight to certain areas of requirements in total. Because each requirement is given a weight, if there are areas of requirements that have a lot of detail in your particular case, you can give that area too much overall weight simply because of the amount of detail. In other words, if you have a total of, say, 50 requirements and 30 of them are in one area (say you have specified 30 different file formats you need to support – each as a different requirement), then a significant percentage of your overall score will be contingent on that area. In some cases, that is OK but in many, it is not.

In order to work around this, I took the following approach:

  • Grouped requirements into a set of categories;
  • The categories should reflect natural cohesiveness of the requirements but should also be defined in a way that each category is roughly equal in importance to other categories;
  • Compute the total possible score for each category (which in my case was 10 * (total weight of requirements in the category));
  • Compute the relative score of that category for a search engine by summing the product of that engine’s score and the weight of the requirements for that category; the relative score is that engine’s score divided by the total possible score for that category.
  • Now average the relative scores across the categories and (to get a number between 0 and 100) multiply by 100.

This approach gives you a score for each engine between 0 and 100 and also gives each category a roughly equal effect on the total score.
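In code, the category-normalized scoring described above looks something like the following sketch (with made-up requirements and weights):

```python
# Each requirement: (category, weight 10/5/1, engine score 10/5/1/0)
REQUIREMENTS = [
    ("Indexing", 10, 10),
    ("Indexing", 10,  5),
    ("Indexing",  5,  1),
    ("End user", 10, 10),
    ("End user",  5,  5),
    ("Admin",     5, 10),
    ("Admin",     1,  0),
]

def engine_score(requirements):
    """Average each category's weighted score relative to its maximum, then scale to 0-100."""
    categories = {}
    for cat, weight, score in requirements:
        possible, actual = categories.get(cat, (0, 0))
        # Maximum possible is a score of 10 on every requirement in the category.
        categories[cat] = (possible + 10 * weight, actual + weight * score)
    relative = [actual / possible for possible, actual in categories.values()]
    return 100 * sum(relative) / len(relative)

print(round(engine_score(REQUIREMENTS), 1))
```

Because each category is normalized to its own maximum before averaging, a category with thirty detailed requirements counts no more toward the final score than one with three.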

If you are looking for some insights on categories of requirements you might want to include in your evaluation, I provide some of my thoughts in a subsequent post.

Search Results Quality

To measure the quality of search results, I took Avi’s insights from the SearchCoP and identified a set of specific searches that I wanted to measure. I identified the candidate searches by looking at the log files for the existing search solution on the site and pulling out a few searches that fell into each category Avi identified. The categories included:

  • Simple queries
  • Complex queries
  • Common queries
  • Spelling, typing and vocabulary errors
  • Searches that force matching edge cases, including:
    • Many matches
    • Few matches
    • No matches

Going into this, I assumed I did not necessarily know the “right” targets for these searches, so I enlisted some volunteers among a group of knowledgeable employees (content managers on the web site) who could complete a survey I put together. The survey included a section where the participant had to execute each search against each search engine (the survey provided a link to do the search – so the participants did not have to actually go to a search screen somewhere and enter the terms and search – this was important to keep it somewhat simpler). The participants were then asked to score the quality of the results for each search engine (on a scale of 1-5).

The survey also included some other questions about presentation of results, performance, etc. (even though we did not customize search result templates or tweak anything in the searches, we wanted to get a general sense of usability) and also included a section where users could define and rate their own searches.

The results from the survey were then analyzed to get an overall measure of quality of results across this candidate set of searches for each search engine – basically doing some aggregation of the different searches into average scores or similar.

Outcome of the Assessment

With the engines we were looking at, the results were that one was better on the administration / architectural requirements and the other was better on the search results – which makes for an interesting decision, I think.

The key takeaway for me from this process is that it is at least quantitative – one can argue over the set of requirements to include, or the weight of any particular requirement or the score of an engine on a particular requirement. However, the discussion can be held at that level instead of a more qualitative level (AKA “gut feel”).

Additionally, for search engines, taking a two-part approach ensures that each of these very important factors is included and reflected in the final outcome.

Issues with this Approach

In the case of my own execution of this approach, I know there are some issues (the general methodology is sound, I believe), including (in no particular order):

  • I defined the set of requirements (ideally, I would have liked to have input from others but I’ve basically been a one-man show and I don’t think others would have had a lot of input or time to provide that input).
  • I defined the weights for requirements (see above).
  • I assigned the score for the requirements (again, see above).
  • I did not have hands-on with each engine under consideration and had to lean a lot on documentation, demos and discussions with vendors.
  • All summed up – I think the exact scores could be in question but, given that I was the only resource, it worked reasonably well.

As for the survey / search results evaluation:

  • I would have liked a larger population of participants, including people who did not know the site
  • I would have liked a larger population of queries to be included, but I felt the number already was pretty large (about 40 pre-defined ones plus the ability for participants to define 10 more)
  • I did not mask which engine produced which results. As Walter Underwood mentions (he referenced this post from the SearchCoP thread), that can cause some significant issues with reliability of measures.

The 3 Principles of Enterprise Search (part 3): Relevance

Thursday, January 10th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous posts have discussed the principles of Coverage and Identity.

Here, I will cover the principle of Relevance.

So, in your efforts to improve search for your users, you have addressed the principle of Coverage and you have thousands of potential search candidates in your enterprise search tool. You have addressed the principle of Identity and all of those search results display well in a search results page, clearly identifying what they are so a searcher can confidently know what an item is. Now for the hardest of the three principles to address: Relevance.

The principle of Relevance is all about search results candidates showing up high in search results for appropriate search terms. Relating this back to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Relevance addresses the situation where X is there and may even be listed as X, but it is on the second (or even farther down) page of results.

This principle is in some ways both the hardest and the simplest to address. It is hard because it requires that you anticipate every searcher’s expectations and practically read their minds (no mean feat!). It’s simple (at least given a search engine) because relevance is also a primary focus for the search engine itself – many search engines differentiate themselves from competitors based on how well the engine can estimate relevance for content objects based on a searcher’s criteria; so your search engine is likely going to help you a lot with regard to relevance.

However, there are still a lot of issues to consider and areas you need to address to help your search engine as well as your users.

  1. One of the first things you should consider is the set of keywords associated with your content. There are several different ways search engines will encounter keywords:
    1. First and foremost, the content of your search items presents a set of keywords to most search engines; this is going to be the content visible in a web page or the words in the body of documents.
    2. The keywords accessible in the form of “keywords” <meta> tags in HTML pages, or “keywords” fields in the File Properties of documents in various formats.
    3. The keywords might even be terms in a database that is related in some way to the content that your search engine can use. This is very common for tightly constrained environments that integrate both a content management (or collaboration) environment with a search experience. If the tool controls both the content and the search, it can take advantage of a lot of “insight” that might not be directly available to an enterprise search solution.
    4. Some search engines will even use the text of links pointing to a content item as keywords describing the item. So content managers can influence the relevance of content they don’t manage themselves by how they refer to it.
    5. Lastly, you also need to understand how your search engine will use and interpret these various sources of keywords and focus on those that provide the most impact. Some search engines might ignore the “keywords” <meta> tag for example, so you may not need to be concerned with that at all.
  2. One detail to highlight with regard to the content of your search items is that, just like the navigation challenges discussed in the post on Coverage, if you have web sites that depend on JavaScript to display content, then that content will likely be invisible to your search engine, so it will not contribute to the keywords users can use to find the pages. I see this issue becoming more of a problem in the future as applications are built that take advantage of AJAX to present dynamic user interfaces.
  3. Once you have a strategy for how you will present keywords to your search engine, you need to determine how best to manage the set of keywords that will be most useful to your content managers and to the users of your search tool. A principal tool for this is a taxonomy that helps inform your audience about preferred terms. I’ll write more about taxonomies in the future – for now, you should know that a very effective way to improve search is to simply constrain the terms used to tag content to a well-managed set.
    1. A taxonomy can also be used to provide guided navigation or constrained search pick lists. Instead of a simple keyword box for search, you can offer your users lists of values to select from. The utility of this will depend on your users’ needs and you need to ensure you pay attention to usability.
  4. Related to taxonomies, you should also consider how best to manage synonyms. This will likely require some work with your taxonomy (to associate synonyms with “preferred” terms); it may require you to manage synonyms in your search engine (to define the synonym mappings the engine uses – hopefully, these synonym rings are pulled from your taxonomy!); and you might need to institute some means to tag your content with both the preferred terms and the synonyms (especially if you are exposing your content to search engines other than your own – i.e., your content is exposed to internet search engines). A small sketch of this kind of mapping follows the list.
  5. A third issue related to relevancy is the security of content; I relate this to relevance in the sense that if a user does not have access to a particular piece of content exposed by your search, effectively that content has zero relevance for that user. Many web tools (especially collaboration applications) provide users with very powerful management tools to control the visibility of content – including such details as differentiating between who can know a piece of content exists and who can download that content. However, interpreting granular security controls on content is a very hard problem for an enterprise search tool to solve efficiently. In my experience, the most common “solution” for this type of problem is to not index such secure areas for inclusion in your enterprise search but to ensure that the tool provides a “local search” and then ensure your enterprise search experience points users to this local search function when appropriate.
  6. Lastly for now, another area you should consider in terms of relevance is to monitor your search engine’s log files. Ultimately, I think this effort will transform into one of:
    1. Input to help you manage your taxonomy (by discovering the terms your search users are actually using and understanding how they differ from your taxonomy)
    2. Identification of holes in your content by understanding “not found” results (helping to identify and then solve Coverage issues)
    3. Identification of relevancy issues by understanding when some terms require more page scrolling than others.
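Here is the small sketch of synonym handling promised in item 4 above (an invented synonym ring, not a description of any particular engine's synonym support): a query on any variant is expanded to include the preferred taxonomy term and its other variants.

```python
# Hypothetical synonym rings: preferred taxonomy term -> accepted variants.
SYNONYM_RINGS = {
    "identity management": ["idm", "identity mgmt"],
    "laptop": ["notebook"],
}

# Invert the rings so any variant resolves to its preferred term.
PREFERRED = {variant: preferred
             for preferred, variants in SYNONYM_RINGS.items()
             for variant in variants}

def expand_query(query):
    """Return the set of terms to search on: the user's term plus its synonym ring."""
    q = query.strip().lower()
    preferred = PREFERRED.get(q, q)
    return {q, preferred, *SYNONYM_RINGS.get(preferred, [])}

print(sorted(expand_query("IDM")))
# ['identity management', 'identity mgmt', 'idm']
```

The useful part of keeping this data in the taxonomy (rather than scattered in engine configuration) is that the same rings can drive tagging guidance for content managers and synonym expansion for the engine.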

Summary

To look back at the 3 Principles: 1) you need to make sure your search engine will find and index the necessary content; 2) you need to make sure your content will properly be identified in search results; and, 3) you need to ensure that your content will show as highly relevant for searches your users expect to show that content.

Addressing most issues does not require any magic or rocket science – just an awareness of the issues and the time and resources (these latter two being scarce for many!) to work on resolving them.

What have I missed? What else do others have to share?

The 3 Principles of Enterprise Search (part 2): Identity

Tuesday, January 8th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous post discussed Coverage. Here, I will cover the principle of Identity.

What is Identity in a search solution?

The principle of Identity relates to the way a search engine will display search results. Search engines will have a number of tools available to display items in search results, up to and including providing sophisticated templating mechanisms where you can provide your own logic for how an item is displayed. The Identity of a search result is the way in which it is presented in a search results page.

With regard to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Identity addresses the situation where X is there but it’s called Z.

At the heart of most of this is the idea of a “title” for an item. Most search engines will use the text in the <title> tag of an HTML page as the name of the item in search results. Similarly, most search engines will use the Document Title field of documents that support such a field (all OpenOffice and Microsoft Office formats, PDF files, etc).

Issues with Identity

The specific challenges I see in regard to this principle include:

  1. Web pages or documents with no title – This is a problem that is much more common with Documents than with web pages. Most web pages will have at least some kind of title, but not many users are even aware of the “Title” File Property or, if they are, most of those users don’t bother to ensure it is filled in. If you expose a lot of documents in your search results, this single problem can be a search killer because the search engine either ends up displaying the URL or generating a title.
    • Some search engines will use heuristics to identify a title to use in search results if the identified title field is empty; for example, a search engine might use the URL of the item as the title (often, the path names and/or file name in the URL can be as informative as a title); or, the search engine might generate a title from header information in the document (looking for a particular style of text, etc.). A simple sketch of this kind of fallback follows the list.
  2. Web pages or documents with useless or misleading titles – This issue probably plagues Documents and web pages equally. Some classic examples are caused by users creating documents by starting with a template, which was nicely titled “Business Paper Template”, for example, but then the users don’t change the title. You end up with dozens of items titled “Business Paper Template”. Or, users start with a completely unrelated document and edit it into what they’re working on; a user finds an item titled “Q4 Financial Report” but the item is actually a design document deliverable for a client consulting engagement.
  3. Web pages or documents with redundant titles – This problem shows up somewhat in documents (the “Business Paper Template” example) but it is far more common in dynamic web sites. The application developers start with a common template (say a .jsp page) and leave in the static <title> tag for every single page. You end up with hundreds or thousands of items titled something like, “Business Expense Reports” (or whatever). Nothing will cause disenchantment in a searcher like that! “OK – Which one of these 10 items all titled the same am I supposed to pick??”
  4. Another important piece of Identity for search results is the snippet (or summary) that is often displayed along with the title. Some search engines will dynamically calculate this based on the user’s query (this is the most useful way to generate this); some engines will use a static description of the item – in this case, commonly the “description” <meta> tag may be used (or other common <meta> tags). You need to understand how your engine generates this and ensure the content being searched will show well. Depending on the quality of titles, this snippet can either be critically important (if titles are poor and the snippet is dynamically generated) or might be less critical if you ensure your content has good, clear, distinct titles.
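And here is the simple sketch of the title fallback heuristic mentioned in item 1 above (illustrative only – each engine has its own rules): when the declared title is empty or obvious boilerplate, derive something readable from a heading or the URL path.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of titles to treat as "no title at all".
BOILERPLATE_TITLES = {"", "business paper template", "untitled", "document1"}

def fallback_title(url, declared_title, first_heading=None):
    """Prefer the declared title; otherwise fall back to a heading or the URL path."""
    if declared_title and declared_title.strip().lower() not in BOILERPLATE_TITLES:
        return declared_title.strip()
    if first_heading:
        return first_heading.strip()
    # Derive something readable from the path, e.g. /docs/q4_financial_report.pdf
    name = urlparse(url).path.rsplit("/", 1)[-1]
    name = re.sub(r"\.[a-z0-9]+$", "", name, flags=re.I)
    return re.sub(r"[-_]+", " ", name).strip().title() or url

print(fallback_title("http://intranet/docs/q4_financial_report.pdf", "Business Paper Template"))
# Q4 Financial Report
```

Heuristics like this help, but they are a safety net; distinct, meaningful titles maintained at the source remain the real fix.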

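And for the fourth point, a dynamically calculated snippet is, at heart, just a window of document text centered on the first query-term match. A rough, simplified sketch (ignoring markup, stemming and phrase handling):

    def dynamic_snippet(text, query, width=120):
        """Return a window of text around the first query-term match."""
        lowered = text.lower()
        for term in query.lower().split():
            pos = lowered.find(term)
            if pos != -1:
                start = max(0, pos - width // 2)
                end = min(len(text), pos + width // 2)
                prefix = "..." if start > 0 else ""
                suffix = "..." if end < len(text) else ""
                return prefix + text[start:end] + suffix
        return text[:width]  # no match: fall back to the start of the document

    doc = ("This deliverable describes the design of the expense reporting "
           "integration and the approval workflow for the client engagement.")
    print(dynamic_snippet(doc, "approval workflow"))
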
Addressing the issues

As with the principle of Coverage, identifying the specific cases where you have problems with Identity is commonly “half the battle”. Depending on the particular issue, fixing it is normally straightforward (if laborious). For Documents with no title (or a misleading or useless one), work with the content owner to educate them about the importance of filling in the File Properties. Ideally, the content management system would include a workflow step to review this, though we do not have that in place at Novell.

We do have a content management community of practice, though, through which I have shared this kind of insight many times and I continue to educate anyone who is or becomes a content manager.

For web applications, I’ve used the same approach as I describe in addressing issues with the Coverage principle – education of development teams and review of applications (hopefully before they are deployed).

Now, on to the third Principle of Enterprise Search – the principle that turns a potential search results candidate that identifies itself well into a good search results candidate – the principle of Relevance.

The 3 Principles of Enterprise Search (part 1): Coverage

Tuesday, January 8th, 2008

This is the first in a series of posts in which I’ll delve into details of the Content Lifecycle that I’ve previously written about. The first topic will be a series on Search – an aspect of the “Use” stage of that lifecycle.

I’ve been working in the area of enterprise search for about 5 years now for my employer. During that time, I’ve learned a lot about how our own search engine, QuickFinder, works with regard to crawled indexes and (to a lesser extent) file system indexes. I’ve also learned a lot more about the general issues of searching within the enterprise.

A lot of my thinking in terms of a search solution revolves around a constant question I hear from users: “Why doesn’t X show up when I search on the search terms Y?” Users might not always phrase their problems that way, but that’s what most issues ultimately boil down to.

In considering the many issues users face in finding content, I have found there are three principles at play when it comes to findability: Coverage, Identity and Relevance. What are these principles? The principle of Coverage relates to what is included in the index; the principle of Identity relates to how search results are identified; the principle of Relevance relates to when (and where) a content object shows up in search results.

First up – the principle of Coverage.

Issues with Coverage

The principle of Coverage is about the set of targets that a searcher might be able to find using your search tool. A content object first must be found by a search indexer in order to be a potential search results candidate; in other words, one answer to the user’s question above might be, “Because X isn’t in the search index!”. There are many issues I’ve found that inhibit good coverage – that is, issues that keep objects from even being a potential search results candidate (much less a good search results candidate):

  1. Lack of linkage – the simplest issue. Most web-based search engines have a search indexer that operates as a crawler. A crawler must either be given a link to an object directly or find a link to it along a path of links from a page it does start with in order to discover it. A web page that nothing links to will not be indexed – i.e., it will not be a search results candidate. (The first sketch following this list illustrates this.)
  2. JavaScript – a variation of #1 – I have not yet found a crawler-based search indexer that will execute JavaScript. Any navigation that depends on JavaScript to be rendered is effectively navigation that isn’t there as far as the search engine is concerned. JavaScript-driven menus are quite common, so this issue can be a real problem. To a user whose browser executes that JavaScript, it can be hard to understand why a crawler cannot follow the same links – “They’re right there in the menu!”
  3. On our intranet, we have many secure areas and, for the most part, a single sign-on capability provided by iChain. Our search engine is also able to authenticate while it’s crawling. However, at times, the means by which some applications achieve their single sign-on – transparent as it seems to a user in a browser – uses a combination of redirects and required HTTP header values that stymie the search indexer.
  4. Similar to #3 – while probably 98+% of tools on our intranet participate in single sign-on, some do not. This causes challenges for a crawler-based indexing engine: even when the engine can handle authentication, most will support only one set of credentials.
  5. Web applications that depend on a user performing a search (using a “local” search interface) to find content. Because there are no links for a crawler to follow, such an application keeps its own content invisible to an enterprise search engine without some specific consideration.
  6. Like many enterprise search engines, our search engine will allow (or perhaps require) you to define a set of domains that it is permitted to index. Depending on the search engine, it might only look at content in a specific set of domains you identify (x.novell.com and y.novell.com). Some search engines will allow you to specify the domain with wild-cards (*.novell.com). Some will allow you to specify that the crawler can follow links off of a “permitted” domain to a certain depth (I think of this as the “you can go one link away” rule). However, given that an enterprise search engine will not be indexing all domains, it can happen that some content your users need to find is not available via a domain that is being indexed. (The second sketch following this list shows this kind of filtering.)
  7. The opposite problem can occur at times, especially with web applications that implement robots tags poorly – you can end up with a crawler finding many, many content items that are very low quality or even useless as search results, or which are redundant. Some specific examples I’ve encountered with this issue include:
    • Web applications that allow a user to click on a link to edit data or content in the application – the link to the “edit view” is found by the crawler and then the “edit” view of an item becomes a potential search target.
    • Web applications that show a table of items and provide links (commonly in the header row of the table) to sort those lists by values in the different columns. The crawler ends up indexing every possible sort order of the data – not very useful.
    • The dreaded web-based calendar tool – when these have a “next month” link, a crawler can get stuck in an infinite (or nearly so anyway) link loop indexing days or months or years far out into the future.
    • Sites (common for a tool like a Wiki) that provide a “printer-friendly” version of a page. A crawler will find the link to the printer-friendly version and, unless the indexer is told not to index it (via robots tags), it will be included along with the actual article itself.
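
To illustrate the “lack of linkage” issue in the first item, here is a minimal sketch of crawler-style discovery over a hypothetical site, represented simply as a dict of page-to-links. A page that nothing links to is never reached, no matter how good its content is.

    from collections import deque

    # Hypothetical intranet site, represented as page -> outgoing links.
    site = {
        "index.html":    ["products.html", "support.html"],
        "products.html": ["index.html", "support.html"],
        "support.html":  ["index.html"],
        "orphan.html":   ["index.html"],  # links out, but nothing links to it
    }

    def crawl(start):
        """Breadth-first discovery: only pages reachable by links get indexed."""
        seen = {start}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for link in site.get(page, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

    print(sorted(crawl("index.html")))
    # -> ['index.html', 'products.html', 'support.html'] -- orphan.html is never found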

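And for the domain-restriction issue in item 6, the kind of filtering described might be sketched like this. The patterns and the “hops off domain” parameter are illustrative only, not our engine’s actual configuration.

    from fnmatch import fnmatch
    from urllib.parse import urlparse

    # Permitted domains from the example above; some engines also accept
    # wild-card patterns such as "*.novell.com".
    ALLOWED = ["x.novell.com", "y.novell.com", "*.novell.com"]

    def in_scope(url, hops_off_allowed=0, max_hops_off=0):
        """Is this URL fair game for the crawler?"""
        host = urlparse(url).netloc.lower()
        if any(fnmatch(host, pattern) for pattern in ALLOWED):
            return True
        # The "you can go one link away" rule: tolerate a limited number of
        # hops onto domains that are not explicitly permitted.
        return hops_off_allowed <= max_hops_off

    print(in_scope("http://x.novell.com/docs/index.html"))                       # True
    print(in_scope("http://partner.example.com/spec.html", hops_off_allowed=1))  # False
    print(in_scope("http://partner.example.com/spec.html", hops_off_allowed=1,
                   max_hops_off=1))                                              # True
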
Addressing the issues

One important reminder on all of the above: while these represent a variety of issues, it will almost always be possible to index the content. It is primarily a matter of recognizing when content that users are searching for is not being included, and then doing the work to index it. I find that it’s often not the search engine but the need for some minor development work that stalls this. Weighing the cost of the work to make the content visible against the benefit of having the content as potential search targets may not show a compelling reason to do the work.

The most common solution to the issues above is to work with the content managers to identify (or possibly build) a good index page. No magic, really – just the need to recognize the omission (or, in the case of the robots tag issue, the proliferation) of content and then the priority to act on it.
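
For the robots tag issue specifically, the crawler-side check is straightforward: if a page (an “edit” view, a printer-friendly copy, a far-future calendar month) declares itself noindex or nofollow, a well-behaved indexer skips it. A simplified sketch follows; the regular expression assumes a conventional attribute order and is not meant to be a complete HTML parser.

    import re

    def robots_directives(html):
        """Pull the directives out of a <meta name="robots"> tag, if present."""
        match = re.search(
            r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
            html, re.IGNORECASE)
        if not match:
            return set()
        return {d.strip().lower() for d in match.group(1).split(",")}

    # Hypothetical printer-friendly view that should stay out of the index.
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head>...</html>'
    directives = robots_directives(page)
    print("index this page?  ", "noindex" not in directives)   # False: leave it out
    print("follow its links? ", "nofollow" not in directives)  # False: stop crawling here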

Within my own enterprise, to address the issues of titling and robots tags in web applications, I have taken a few specific steps:

  1. First, I have been on an educational campaign with our developers for several years now – most developers quickly understand the value of making <title> tags dynamic in terms of how that interacts with search, and it’s normally such a small amount of change in an application that they will simply “do the right thing” from the beginning once they understand the problem (a small sketch of what “dynamic” means here follows this list);
  2. Second, part of our application development guidelines now formally includes standards around titling within web applications (this is really a detail of the education campaign);
  3. Third, when I can, I try to review web applications prior to their “go live” so I can give the development teams feedback on the findability of items in their application;
  4. Lastly, independent of “verification”, I have also provided a general methodology to our development teams to help them work through a strategy for titling and tagging in their web applications even in my absence.
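
As a small illustration of the first step, “making <title> tags dynamic” just means building the title from the item being displayed rather than hard-coding the template’s title. A framework-agnostic sketch, with hypothetical function and application names:

    # Hypothetical helper; the point is that the title comes from the item
    # being shown, not from a constant baked into the page template.
    def render_page(item_name, body_html, application_name="Business Expense Reports"):
        title = f"{item_name} - {application_name}"
        return (f"<html><head><title>{title}</title></head>"
                f"<body>{body_html}</body></html>")

    print(render_page("Q4 travel policy update", "<p>...</p>"))
    # <title>Q4 travel policy update - Business Expense Reports</title> instead of
    # the same "Business Expense Reports" title on every page of the application.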

In summary – addressing the Coverage principle is critical to ensure that the content your users are looking for is at least included in the index and is a potential search results candidate. In my next posts, I will address the principles that make a potential search results candidate a good search results candidate – the principles of Identity and Relevance.