Lee Romero

On Content, Collaboration and Findability

Archive for October 1st, 2008

Categories of Search Requirements

Wednesday, October 1st, 2008

I was recently asked by a former co-worker (Ray Sims) for some suggestions around requirements that he might use as the basis for an evaluation of search engines. Having just gone through such an evaluation myself, and also having posted here about the general methodology I used for the evaluation, I thought I’d follow that up with some thoughts on requirements.

If you find yourself needing to evaluate a search engine, these might be of value – at least in giving you some areas to further detail.

I normally think of requirements for search in two very broad categories – those that are more basically about helping the user doing the search (End User Search Requirements) and those that are more directed at the people (person) who is responsible for administering / maintaining the search experience (Administrator Requirements).

End User Search Requirements

  • Search result page customization – Is it straightforward to provide your own UI on top of the search results (for integration into your web site)?
  • Search result integration with other UIs (outside of a web experience) – Specifically, it’s possible you might want to use search results in a non-web-based application – can the engine do that? (If you can provide result pages in different formats, a simple way to do this is to provide an XML result format that an application can pull in via a URL.)
  • Search result summaries for items – Specifically, these should be dynamic. The snippet shown in the results should show something relevant to what the searcher searched on – not just a static piece of text (like a metadata description field). This, by itself, can greatly enhance the perceived quality of results because it makes it easier for a user to make a determination on the quality of an item right from the search results – no need to look at the item (even a highlighted version of it).
  • Highlighting – it should be possible to see a highlighted version of a result (i.e., search terms are highlighted in the display of the document)
  • “Best Bets” (or key match or whatever) – Some don’t like these, but I think it’s important to have some ability to “hand pick” (or nearly hand pick) some results for some terms – also, I think it’s very desirable to be able to say “If a user searches on X, show this item as the top result” regardless of where that item would organically show in the result (or it might not even be really indexable)
  • Relevancy calculation “soundness” – This basically means that the engine generates a good measure of relevancy for searches and encompasses most of what differentiates engines. You should understand at a general level what effects the relevancy as computed by the engine. (For many search engines, this is a the “magic dust” they can bring to the table – so they may not be willing to expose too much about how they do this but you should ask.)
  • Stemming – The engine should support stemming – if a user searches on “run”, it should automatically match the use of words that share the same stem – “runs”, “running”, “ran”, etc.
  • Similar to stemming, the engine should support synonyms – if I search on “shoe”, it might be useful to include content that matches “boot” or “slipper”, etc.
  • Concept understanding (entity extraction) – Can the engine determine the entities in a piece of content even when the content is not explicitly defined? A piece of content might be about “Product X”, say, but it may never even explicitly mention “Product X”. Some search engines will claim to do this type of analysis.
  • Performance – Obviously, good performance is important and you should understand how it scales. Do you expect a few thousand searches a week? Tens of thousands? Hundreds of thousands? You need to understand your needs and ensure that the engine will meet them.
  • Customization of error / not found presentation – Can you define what happens when no results found or some type of system error happens – It can be useful to be able to define a specific behavior when an engine would otherwise return no results (a behavior that might be outside of the engine, specifically).
  • Related queries – It might be desirable to have something like, “Users who searched on X also commonly searched on Y”

Administrator Requirements

  • Indexing of web content – Most times, it’s important to be able to index web content – commonly through a crawler, especially if it’s dynamic content.
  • Indexing of repositories – You should understand your repository architecture and which repositories will need to be indexed and how the engines will do so. Some engines provide special hooks to index different major vendors (Open Text, SharePoint, Documentum, etc.) These types of tools are often not crawlable using a general web spider / crawling approach.
  • File System indexes – Many companies still have a significant content accessible on good old file servers – understand what types of file systems can be indexed and the protocol that the search engine supports (Samba, NFS, etc.)
  • Security of search results – Often, you might want to provide a single search experience that users can use to search any content to which they can navigate, even if that content is in its own repository which follows its own (proprietary) mechanism to secure documents.
    • This is something we have not tackled, but some engines do so. You typically have two approaches – “early binding”, when the security is basically rolled into the index and “late binding” which does security checking as users do searching.
    • Most vendors do the former because it can be very expensive to do a security check on every document that might show up in search results.
    • The primary advantage of late binding is that if you refresh your index weekly on, say, Saturday and there’s a document to which I did not have access, if someone provides me access on Monday, I still won’t see it in search (until after the next refresh); conversely, people can see items in search results that they no longer have access to as well.
  • Index scheduling / control – Regardless of the type of index, you should be able to control the schedule of indexing or how fast the indexer might hit your web sites / repositories / file systems. Also, it can be very useful to have different parts of the site refreshed at different rates. You might want press releases refreshed (or at least checked) hourly, while product documentation might only need to be refreshed weekly or monthly.
  • Relevancy control – It should be possible to administratively modify the relevancy for items – up or down. Ideally, this should be based on attributes of the content such as: the server it’s on, the path on the server, the date range of the content, presence of particular meta data, etc.
  • Synonyms – It should be possible to define business-specific synonyms. Some insight from Avi Rappoport (via the SearchCoP), is that you should be careful in the use of generic synonyms – they may cause more problems they fix (so if an engine provides synonym support, you might want to know if you get some default synonyms and how you might disable them).
  • Automation / integration – It is nice if the search engine can integrate or somehow provide support for automatic maintenance of some aspects of its configuration. For example, synonyms – you might already have a means to manage those (say, in your taxonomy tool!) and having to manually administer them as a separate work process would probably lead to long-term maintainability issues. In that case, some type of import mechanism. Or, another example, have your relevancy adjustments integrated with your web analytics (so that more popular content based on usage goes up in relevancy).
  • Performance (again) – How much content do you expect to index? How fast can that content be indexed by the engine? Does the engine do full re-indexing? Incremental? Real-time?
  • Reporting – You need to have good reporting.
    • Obvious stuff like most common searches (grouped by different spans like day, hour, week, month, etc., and also for time periods you can define – meaning, “Show me most common searches for the last six months grouped by week”), most common “no result” searches, common “error” searches, etc.
    • It would be especially useful to be able to do time analysis across these types of dimensions – Most engines don’t provide that from my experience; you can get a dump for a time period and a separate one for another period and you have to manually compare them. Being able to say, “How common has this search been for the last six months in each month?” helps you understand longer-term trends.
    • Also, it can be very useful to see reports where the search terms are somehow grouped. So a search for “email” and a search for “e-mail” (to use a very simple example) would show up together – basically some kind of categorization / standardization of the searches. Doing grouping based purely on the individual searches can make it very hard to “see the forest for the trees”.
    • Lastly – reports on what people do with search results can be very useful. OK – fine, “Product X” is a top ten search consistently, but what are people selecting when they search on that? Do they not click on anything? Do they click on the same item 90% of the time? Etc.
    • I’m also planning to post separately on more details around search metrics and analytics.  Keep watch!
  • Last but certainly not least – Architectural “fit” – Make sure you understand how well the engine will fit in your data center. OS(es) it runs on? Hardware compatibility, etc.  For some engines where you purchase a closed appliance, this may not be relevant but you should involve your data center people in understanding this area.