Lee Romero

On Content, Collaboration and Findability


The 3 Principles of Enterprise Search (part 2): Identity

Tuesday, January 8th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous post discussed Coverage. Here, I will cover the principle of Identity.

What is Identity in a search solution?

The principle of Identity relates to the way a search engine displays search results: the Identity of a search result is the way in which it is presented on a search results page. Search engines offer a number of tools for displaying items in results, up to and including sophisticated templating mechanisms where you can provide your own logic for how an item is displayed.

With regard to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Identity addresses the situation where X is there but it’s called Z.

At the heart of most of this is the idea of a “title” for an item. Most search engines will use the text in the <title> tag of an HTML page as the name of the item in search results. Similarly, most search engines will use the Document Title field of documents that support such a field (all OpenOffice and Microsoft Office formats, PDF files, etc).
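
As a purely illustrative example (not a description of how QuickFinder or any other engine is actually implemented), a minimal Python sketch of pulling the <title> text that would typically become the item’s name in a results list might look like this:

    # Minimal sketch: extract the <title> text a search engine would typically
    # use as the display name of an HTML page in search results.
    from html.parser import HTMLParser

    class TitleExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    parser = TitleExtractor()
    parser.feed("<html><head><title>Q4 Financial Report</title></head><body>...</body></html>")
    print(parser.title)  # Q4 Financial Report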

Issues with Identity

The specific challenges I see in regard to this principle include:

  1. Web pages or documents with no title – This is a problem that is much more common with Documents than with web pages. Most web pages will have at least some kind of title, but few users are even aware of the “Title” file property of documents and, of those who are, most don’t bother to ensure it is filled in. If you expose a lot of documents in your search results, this single problem can be a search killer because the search engine either ends up displaying the URL or generating a title.
    • Some search engines will use heuristics to identify a title to use in search results if the identified title field is empty; for example, a search engine might use the URL of the item as the title (often, the path names and/or file name in the URL can be as informative as a title); or, the search engine might generate a title from header information in the document (looking for a particular style of text, etc.) – a rough sketch of this kind of fallback appears after this list.
  2. Web pages or documents with useless or misleading titles – This issue plagues Documents and web pages about equally. Some classic examples are caused by users creating documents by starting with a template that was nicely titled “Business Paper Template”, for example, and then never changing the title. You end up with dozens of items titled “Business Paper Template”. Or, users start with a completely unrelated document and edit it into what they’re working on; a user finds an item titled “Q4 Financial Report” but the item is actually a design document deliverable for a client consulting engagement.
  3. Web pages or documents with redundant titles – This problem shows up somewhat in documents (the “Business Paper Template” example) but it is far more common on dynamic web sites. The application developers start with a common template (say, a .jsp page) and leave the static <title> tag in place for every single page. You end up with hundreds or thousands of items all titled something like “Business Expense Reports” (or whatever). Nothing will cause disenchantment in a searcher like that! “OK – which one of these 10 items all titled the same am I supposed to pick??”
  4. Another important piece of Identity for search results is the snippet (or summary) that is often displayed along with the title. Some search engines will dynamically generate this from the user’s query (the most useful approach); others will use a static description of the item – commonly the “description” <meta> tag (or other common <meta> tags). You need to understand how your engine generates this and ensure the content being searched will show well. Depending on the quality of titles, this snippet can either be critically important (if titles are poor and the snippet is dynamically generated) or less critical if you ensure your content has good, clear, distinct titles.
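
To illustrate the fallback heuristics mentioned under item 1 above, here is a rough Python sketch of the kind of logic an engine might apply when the title field is empty – an assumption about how such a heuristic could look, not the documented behavior of any particular product:

    # Sketch of a display-title fallback: prefer the indexed title, then the
    # first heading found in the document, then a cleaned-up URL path segment.
    from urllib.parse import urlparse

    def display_title(title, url, first_heading=None):
        if title and title.strip():
            return title.strip()
        if first_heading and first_heading.strip():
            return first_heading.strip()
        # Derive something readable from the last segment of the URL path.
        path = urlparse(url).path.rstrip("/")
        last = path.split("/")[-1] or urlparse(url).netloc
        last = last.rsplit(".", 1)[0]                 # drop a file extension, if any
        return last.replace("-", " ").replace("_", " ") or url

    print(display_title("", "http://x.novell.com/finance/q4-financial-report.html"))
    # -> q4 financial report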

Addressing the issues

As with the principle of Coverage, identifying the specific cases where you have problems with Identity is commonly “half the battle”. Depending on the particular issue, fixing it is normally pretty straightforward (if laborious). For Documents with no (or a misleading or useless) title, work with the content owners to educate them about the importance of filling in the File Properties. Ideally, the content management system would include a review step for this in its workflow, though we do not have that in place at Novell.

We do have a content management community of practice, though, through which I have shared this kind of insight many times and I continue to educate anyone who is or becomes a content manager.
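
One lightweight way to find the problem documents in the first place is a simple audit script. The sketch below is only an illustration of the idea – it assumes the third-party Python packages python-docx and pypdf, and a made-up content root of /content/repository:

    # Rough audit sketch: flag Word documents and PDFs whose Title property is
    # empty so content owners can be asked to fill it in.
    from pathlib import Path
    from docx import Document        # third-party package: python-docx
    from pypdf import PdfReader      # third-party package: pypdf

    def missing_title(path: Path) -> bool:
        suffix = path.suffix.lower()
        if suffix == ".docx":
            title = Document(str(path)).core_properties.title
        elif suffix == ".pdf":
            meta = PdfReader(str(path)).metadata
            title = meta.title if meta else None
        else:
            return False             # other formats not handled in this sketch
        return not (title and title.strip())

    for item in Path("/content/repository").rglob("*"):   # hypothetical content root
        if item.is_file() and missing_title(item):
            print(f"No title set: {item}")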

For web applications, I’ve used the same approach as I describe in addressing issues with the Coverage principle – education of development teams and review of applications (hopefully before they are deployed).

Now, on to the third Principle of Enterprise Search – the principle that makes a potential search results candidate that identifies itself well into a good search results candidate – the principle of Relevance.

The 3 Principles of Enterprise Search (part 1): Coverage

Tuesday, January 8th, 2008

This is the first in a series of posts in which I’ll delve into details of the Content Lifecycle that I’ve previously written about – the first of these will be a series on Search, an aspect of the “Use” stage of that lifecycle.

I’ve been working in the area of enterprise search for about 5 years now for my employer. During that time, I’ve learned a lot about how our own search engine, QuickFinder, works with regard to crawled indexes and (to a lesser extent) file system indexes. I’ve also learned a lot more about the general issues of searching within the enterprise.

A lot of my thinking in terms of a search solution revolves around a constant question I hear from users: “Why doesn’t X show up when I search on the search terms Y?” Users might not always phrase their problems that way, but that’s what most issues ultimately boil down to.

In considering the many issues users face with finding content, I have come to find there are three principles at play when it comes to findability: Coverage, Identity and Relevance. What are each of these principles? The principle of Coverage relates to what is included; the principle of Identity relates to how search results are identified; the principle of Relevance relates to when a content object shows up in search results.

First up – the principle of Coverage.

Issues with Coverage

The principle of Coverage is about the set of targets that a searcher might be able to find using your search tool. A content object first must be found by a search indexer in order to be a potential search results candidate; in other words, one answer to the user’s question above might be, “Because X isn’t in the search index!”. There are many issues I’ve found that inhibit good coverage – that is, issues that keep objects from even being a potential search results candidate (much less a good search results candidate):

  1. Lack of linkage – the simplest issue. Most web-based search engines will have a search indexer that operates as a crawler. A crawler must either be given a link to an object or it must find a link (along a path of links from a page it does start with) in order to discover it. A web page on a site that has no links to it will not be indexed – i.e., it will not be a search results candidate.
  2. JavaScript – a variation of #1 – I have not yet found a crawler-based search indexer that will execute JavaScript. Any navigation on a page that depends on JavaScript to be displayed is effectively navigation that’s not there as far as search engines are concerned. JavaScript is quite common in menuing systems, so this issue can be problematic. To a user whose browser executes that JavaScript, it can be hard to understand why a crawler cannot follow the same links – “They’re right there in the menu!”
  3. On our intranet, we have many secure areas – and, for the most part, we have a single sign-on capability provided by iChain. Our search engine is also able to authenticate while it’s crawling. However, at times, the means by which some applications achieve their single sign-on, while seemingly transparent to a user in a browser, uses a combination of redirects and required HTTP header values that stymies the search indexer.
  4. Similar to #3 – while probably 98+% of tools on our intranet support single sign-on, some do not. This can cause challenges for a crawler-based indexing engine – even an engine that handles authentication will typically support only one set of credentials.
  5. Web applications that depend on a user performing a search (using a “local” search interface) to find content. Such a web application keeps its own content invisible to an enterprise search engine without some specific consideration.
  6. Like many enterprise search engines, our search engine will allow (or perhaps require) you to define a set of domains that it is permitted to index. Depending on the search engine, it might only look at content in a specific set of domains you identify (x.novell.com and y.novell.com). Some search engines will allow you to specify the domain with wild-cards (*.novell.com). Some will allow you to specify that the crawler can follow links off of a “permitted” domain to a certain depth (I think of this as the “you can go one link away” rule). However, given that an enterprise search engine will not be indexing all domains, it can happen that some content your users need to find is not available via a domain that is being indexed (see the crawl-scope sketch after this list).
  7. The opposite problem can occur at times, especially with web applications that exhibit poor implementation of robots tags – you can end up with a crawler finding many, many content items that are of very low quality, useless as search results, or redundant. Some specific examples I’ve encountered with this issue include:
    • Web applications that allow a user to click on a link to edit data or content in the application – the link to the “edit view” is found by the crawler and then the “edit” view of an item becomes a potential search target.
    • Web applications that show a table of items and provide links (commonly in the header row of the table) to sort those lists by values in the different columns. The crawler ends up indexing every possible sort order of the data – not very useful.
    • The dreaded web-based calendar tool – when these have a “next month” link, a crawler can get stuck in an infinite (or nearly so anyway) link loop indexing days or months or years far out into the future.
    • Sites (common for a tool like a Wiki) that provide a “printer-friendly” version of a page. A crawler will find the link to the printer-friendly version and, unless the indexer is told not to index it (via robots tags), it will be included along with the actual article itself.
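
To make items 6 and 7 more concrete, here is a rough sketch of the kind of crawl-scope rules involved – a wildcard domain allow-list plus URL patterns that screen out edit views, sort orders, calendar pages and printer-friendly duplicates. The syntax is purely illustrative Python, not the configuration format of our engine or any other:

    # Illustrative crawl-scope check: a wildcard domain allow-list plus URL
    # patterns that exclude low-value targets (edit views, per-column sort
    # orders, far-future calendar pages, printer-friendly duplicates).
    import fnmatch
    import re
    from urllib.parse import urlparse

    ALLOWED_DOMAINS = ["*.novell.com"]            # e.g. x.novell.com, y.novell.com

    EXCLUDE_PATTERNS = [
        re.compile(r"/edit(\b|/)"),                             # "edit view" links
        re.compile(r"[?&]sort(by|_?order)?=", re.IGNORECASE),   # every possible sort order
        re.compile(r"[?&](month|year)=\d+"),                    # naive guard against calendar loops
        re.compile(r"[?&]print(able)?=(true|1)\b", re.IGNORECASE),  # printer-friendly views
    ]

    def in_crawl_scope(url: str) -> bool:
        host = urlparse(url).netloc.lower()
        if not any(fnmatch.fnmatch(host, allowed) for allowed in ALLOWED_DOMAINS):
            return False
        return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)

    print(in_crawl_scope("http://x.novell.com/wiki/SomePage"))             # True
    print(in_crawl_scope("http://x.novell.com/wiki/SomePage?print=true"))  # False
    print(in_crawl_scope("http://other.example.com/page"))                 # False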

Addressing the issues

One important reminder on all of the above: while these represent a variety of issues, it will almost always be possible to index the content. It’s primarily a matter of recognizing when content that users are searching for is not being included and then doing the work to index it. I find that it’s not necessarily the search engine but the potential need for some minor development work that stalls this at times. Weighing the cost of the work to make the content visible against the benefit of having the content as potential search targets may not show a compelling reason to do the work.

The most common solution to the issues above is to work with the content managers to identify (or possibly build) a good index page. No magic, really – just the need to recognize the omission (or, in the case of the robots tag issue, the proliferation) of content and then the priority to act on it.
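
For the “local search only” applications in particular (issue 5 above), “building a good index page” can be as simple as generating a plain page of links that a crawler can follow. A minimal sketch, assuming the application can enumerate its own items as (title, URL) pairs – the items and URLs below are made up for illustration:

    # Minimal sketch of a generated "index page": a plain HTML page of links
    # that a crawler can follow into content that otherwise sits behind a
    # local search form. The items and URLs here are hypothetical.
    from html import escape

    def render_index_page(items):
        links = "\n".join(
            f'  <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
            for title, url in items
        )
        return (
            "<html><head><title>Site index - all documents</title></head>\n"
            "<body>\n<ul>\n" + links + "\n</ul>\n</body></html>"
        )

    items = [
        ("Q4 Financial Report", "http://x.novell.com/reports/q4"),
        ("Business Expense Policy", "http://x.novell.com/policies/expenses"),
    ]
    print(render_index_page(items))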

Within my own enterprise, to address the issue of robots tags in web applications, I have taken a few specific steps:

  1. First, I have been on an educational campaign with our developers for several years now – most developers quickly understand the value of making <title> tags dynamic in terms of how that interacts with search, and it’s normally such a small amount of change in an application that they will simply “do the right thing” from the beginning once they understand the problem (a small sketch of dynamic titling appears after this list);
  2. Second, part of our application development guidelines now formally includes standards around titling within web applications (this is really a detail of the education campaign);
  3. Third, when I can, I try to ensure I have an opportunity to review web applications prior to their “go live” to be able to provide feedback to the development teams on the findability of items in their application.
  4. Lastly, independent of “verification”, I have also provided a general methodology to our development teams to help them work through a strategy for titling and tagging in their web applications even in my absence.
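
As an illustration of point 1 – dynamic titles – here is a minimal sketch in Python/Flask. The post above mentions .jsp pages; Flask/Jinja2 is used here only as an analogy, and the route and data are made up:

    # Sketch of per-item <title> values instead of one hard-coded title in a
    # shared template. Flask/Jinja2 stands in for the .jsp example above; the
    # route and data are hypothetical.
    from flask import Flask, render_template_string

    app = Flask(__name__)

    PAGE = """
    <html>
      <head>
        <!-- The title comes from the item being displayed, not the template. -->
        <title>{{ report_title }} - Business Expense Reports</title>
      </head>
      <body><h1>{{ report_title }}</h1></body>
    </html>
    """

    REPORTS = {"42": "Q4 2007 Travel Expenses - EMEA Sales"}   # stand-in data store

    @app.route("/reports/<report_id>")
    def show_report(report_id):
        title = REPORTS.get(report_id, "Unknown report")
        return render_template_string(PAGE, report_title=title)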

In summary – addressing the Coverage principle is critical to ensure that the content your users are looking for is at least included in the index and is a potential search results candidate. In my next posts, I will address the principles that make a potential search results candidate a good search results candidate – the principles of Identity and Relevance.