Lee Romero

On Content, Collaboration and Findability

Archive for January, 2008

The 3 Principles of Enterprise Search (part 3): Relevance

Thursday, January 10th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous posts have discussed the principles of Coverage and Identity.

Here, I will cover the principle of Relevance.

So, in your efforts to improve search for your users, you have addressed the principle of Coverage and you have thousands of potential search candidates in your enterprise search tool. You have addressed the principle of Identity and all of those search results display well in a search results page, clearly identifying what they are so a searcher can confidently know what an item is. Now for the hardest of the three principles to address: Relevance.

The principle of Relevance is all about search results candidates showing up high in search results for appropriate search terms. Relating this back to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Relevance addresses the situation where X is there, and may even be listed as X, but it is on the second (or even farther down) page of results.

This principle is in some ways both the hardest and simplest to address. It is hard because it practically requires that you anticipate every searcher’s expectations and that you can practically read their minds (no mean feat!). It’s simple (at least given a search engine) because relevance is also a primary focus for the search engine itself – many search engines differentiate themselves from competitors based on how well the engine can estimate relevance for content objects based on a searcher’s criteria; so your search engine is likely going to help you a lot with regard to relevance.

However, there are still a lot of issues to consider and areas you need to address to help your search engine as well as your users.

  1. One of the first things you should consider is the set of keywords associated with your content. There are several different ways search engines will encounter keywords (the first sketch following this list illustrates a few of these sources):
    1. First and foremost, the content of your search items presents a set of keywords to most search engines; this is going to be the content visible in a web page or the words in the body of documents.
    2. Keywords may also be provided explicitly, in the form of “keywords” <meta> tags in HTML pages or the “keywords” field in the File Properties of documents in various formats.
    3. The keywords might even be terms in a database that is related in some way to the content that your search engine can use. This is very common for tightly constrained environments that integrate both a content management (or collaboration) environment with a search experience. If the tool controls both the content and the search, it can take advantage of a lot of “insight” that might not be directly available to an enterprise search solution.
    4. Some search engines will even use the text of links pointing to a content item as keywords describing the item. So content managers can influence the relevance of content they don’t manage themselves by how they refer to it.
    5. Lastly, you also need to understand how your search engine will use and interpret these various sources of keywords and focus on those that provide the most impact. Some search engines might ignore the “keywords” <meta> tag for example, so you may not need to be concerned with that at all.
  2. One detail to highlight with regard to the content of your search items is that, just like the navigation challenges discussed in the post on Coverage, if you have web sites that depend on JavaScript to display content, then that content will likely be invisible to your search engine, so it will not contribute to the keywords users can use to find the pages. I see this issue becoming more of a problem in the future as applications are built that take advantage of AJAX to present dynamic user interfaces.
  3. Once you have a strategy for how you will present keywords to your search engine, you need to determine how best to manage the set of keywords that will be most useful to your content managers and to the users of your search tool. A principal tool for this is a taxonomy that helps inform your audience about preferred terms. I’ll write more about taxonomies in the future – for now, you should know that a very effective way to improve search is to simply constrain the terms used to tag content to a well-managed set.
    1. A taxonomy can also be used to provide guided navigation or constrained search pick lists. Instead of a simple keyword box for search, you can offer your users lists of values to select. The utility of this will depend on your users’ needs, and you need to ensure you pay attention to usability.
  4. Related to taxonomies, you should also consider how best to manage synonyms. This will likely require some work with your taxonomy (to associate synonyms with “preferred” terms); it may require you to manage synonyms for your search engine (to define the mapping between synonyms used by the engine – hopefully, these synonym rings are pulled from your taxonomy!); and you might need to institute some means to tag your content with both the preferred terms and the synonyms (especially if you are exposing your content to search engines other than your own – i.e., your content is exposed to internet search engines). The second sketch following this list shows one way to normalize tags against such a taxonomy.
  5. A third issue related to relevance is the security of content; I relate this to relevance in the sense that if a user does not have access to a particular piece of content exposed by your search, that content effectively has zero relevance for that user. Many web tools (especially collaboration applications) provide users with very powerful management tools to control visibility of content – down to such details as differentiating between who can know a piece of content exists and who can download it. However, interpreting granular security controls on content is a very hard problem for an enterprise search tool to solve efficiently. In my experience, the most common “solution” for this type of problem is to not index such secure areas for inclusion in your enterprise search, but to ensure that the tool provides a “local search” and then ensure your enterprise search experience points users to this local search function when appropriate.
  6. Lastly for now, another area you should consider in terms of relevance is monitoring your search engine’s log files (the third sketch following this list shows a simple starting point). Ultimately, I think this effort will provide one or more of the following:
    1. Input to help you manage your taxonomy (by discovering the terms your search users are actually using and understanding how they differ from your taxonomy)
    2. Identification of holes in your content by understanding “not found” results (helping to identify and then solve Coverage issues)
    3. Identification of relevancy issues by understanding when searchers must page farther through results for some terms than for others.
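
To make the keyword sources in item 1 concrete, here is a minimal sketch – in Python, using only the standard library – of how an indexer might gather keywords from a page’s body text, its “keywords” <meta> tag, and the anchor text of links. This is illustrative only (no particular engine works exactly this way), and the sample page is made up:

    # Illustrative only: gather three of the keyword sources discussed above.
    from html.parser import HTMLParser

    class KeywordSources(HTMLParser):
        def __init__(self):
            super().__init__()
            self.body_text = []      # words visible on the page
            self.meta_keywords = []  # author-supplied "keywords" <meta> tag
            self.anchor_text = []    # link text (keywords describing the *target* page)
            self._in_anchor = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "keywords":
                self.meta_keywords = [k.strip() for k in attrs.get("content", "").split(",")]
            elif tag == "a":
                self._in_anchor = True

        def handle_endtag(self, tag):
            if tag == "a":
                self._in_anchor = False

        def handle_data(self, data):
            self.body_text.extend(data.split())
            if self._in_anchor:
                self.anchor_text.extend(data.split())

    page = ('<html><head><meta name="keywords" content="expenses, travel policy"></head>'
            '<body>How to file an <a href="/forms">expense report</a></body></html>')
    parser = KeywordSources()
    parser.feed(page)
    print(parser.meta_keywords)  # ['expenses', 'travel policy']
    print(parser.anchor_text)    # ['expense', 'report']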
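
Likewise, for items 3 and 4, here is a minimal sketch of constraining author-supplied tags to a taxonomy’s preferred terms and folding synonyms in along the way. The taxonomy, the synonym rings and all of the terms are invented for illustration:

    # Illustrative only: a flat taxonomy of preferred terms with synonym rings.
    SYNONYM_RINGS = {
        "expense report": {"expense form", "travel claim", "reimbursement form"},
        "laptop": {"notebook", "portable computer"},
    }

    # Invert the rings once so any variant maps to its preferred term.
    PREFERRED = {syn: term for term, ring in SYNONYM_RINGS.items() for syn in ring}

    def normalize_tags(raw_tags):
        """Map author-supplied tags onto the taxonomy's preferred terms,
        dropping anything the taxonomy does not recognize."""
        allowed = set(SYNONYM_RINGS)
        normalized = set()
        for tag in (t.strip().lower() for t in raw_tags):
            if tag in allowed:
                normalized.add(tag)           # already a preferred term
            elif tag in PREFERRED:
                normalized.add(PREFERRED[tag])  # a synonym; fold it in
        return sorted(normalized)

    print(normalize_tags(["Travel Claim", "laptop", "misc"]))
    # ['expense report', 'laptop']

Note how “misc” is simply rejected – constraining tags to a well-managed set means some tags don’t survive, which is exactly the point.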
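
Finally, a simple starting point for the log analysis in item 6. The log format assumed here (query, result count and clicked-result rank, tab-separated) is hypothetical – adapt it to whatever your engine actually records:

    # Illustrative only: mine a hypothetical search log for the three signals above.
    from collections import Counter

    def analyze(log_lines, page_size=10):
        queries = Counter()       # what users actually search for (taxonomy input)
        zero_results = Counter()  # likely Coverage holes
        deep_clicks = Counter()   # clicks beyond page one: likely Relevance problems
        for line in log_lines:
            query, result_count, click_rank = line.rstrip("\n").split("\t")
            queries[query] += 1
            if int(result_count) == 0:
                zero_results[query] += 1
            elif int(click_rank) > page_size:
                deep_clicks[query] += 1
        return (queries.most_common(20),
                zero_results.most_common(20),
                deep_clicks.most_common(20))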

Summary

To look back at the 3 Principles: 1) you need to make sure your search engine will find and index the necessary content; 2) you need to make sure your content will be properly identified in search results; and 3) you need to ensure that your content will show as highly relevant for the searches your users expect to surface that content.

Addressing most of these issues does not require any magic or rocket science – just an awareness of the issues and the time and resources (these latter two being scarce for many!) to work on resolving them.

What have I missed? What else do others have to share?

The 3 Principles of Enterprise Search (part 2): Identity

Tuesday, January 8th, 2008

As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous post discussed Coverage. Here, I will cover the principle of Identity.

What is Identity in a search solution?

The principle of Identity relates to the way a search engine will display search results. Search engines will have a number of tools available to display items in search results, up to and including providing sophisticated templating mechanisms where you can provide your own logic for how an item is displayed. The Identity of a search result is the way in which it is presented in a search results page.

With regard to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Identity addresses the situation where X is there but it’s called Z.

At the heart of most of this is the idea of a “title” for an item. Most search engines will use the text in the <title> tag of an HTML page as the name of the item in search results. Similarly, most search engines will use the Document Title field of documents that support such a field (all OpenOffice and Microsoft Office formats, PDF files, etc.).
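
As a quick illustration of that Document Title field (and of how often it goes unset), here is a minimal sketch that reads the property from a Word document, assuming the third-party python-docx package; the file name is made up:

    from docx import Document  # assumes: pip install python-docx

    doc = Document("business_paper.docx")  # hypothetical file
    # The core "Title" file property is what a search engine would surface
    # as this item's name in results; it is '' if the author never set it.
    print(repr(doc.core_properties.title))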

Issues with Identity

The specific challenges I see in regard to this principle include:

  1. Web pages or documents with no title – This is a problem that is much more common with Documents than with web pages. Most web pages will have at least some kind of title, but not many users are even aware of the “Title” File Property or, if they are, most of those users don’t bother to ensure it is filled in. If you expose a lot of documents in your search results, this single problem can be a search killer because the search engine either ends up displaying the URL or generating a title. (The first sketch following this list shows a simple way to audit an index for missing and redundant titles.)
    • Some search engines will use heuristics to identify a title to use in search results if the identified title field is empty; for example, a search engine might use the URL of the item as the title (often, the path names and/or file name in the URL can be as informative as a title); or, the search engine might generate a title from header information in the document (looking for a particular style of text, etc.).
  2. Web pages or documents with useless or misleading titles – This issue probably plagues Documents and web pages equally. Some classic examples are caused by users creating documents by starting with a template, which was nicely titled “Business Paper Template”, for example, but then never changing the title. You end up with dozens of items titled “Business Paper Template”. Or, users start with a completely unrelated document and edit it into what they’re working on; a user finds an item titled “Q4 Financial Report” but the item is actually a design document deliverable for a client consulting engagement.
  3. Web pages or documents with redundant titles – This problem shows up somewhat in documents (the “Business Paper Template” example) but it is far more common in dynamic web sites. The application developers start with a common template (say, a .jsp page) and leave the static <title> tag in place for every single page. You end up with hundreds or thousands of items all titled something like “Business Expense Reports” (or whatever). Nothing will cause disenchantment in a searcher like that! “OK – Which one of these 10 items all titled the same am I supposed to pick??”
  4. Another important piece of Identity for search results is the snippet (or summary) that is often displayed along with the title. Some search engines will dynamically calculate this based on the user’s query (this is the most useful way to generate it); some engines will use a static description of the item – in this case, commonly the “description” <meta> tag may be used (or other common <meta> tags). You need to understand how your engine generates this and ensure the content being searched will show well. Depending on the quality of titles, this snippet can either be critically important (if titles are poor and the snippet is dynamically generated) or might be less critical if you ensure your content has good, clear, distinct titles. (The second sketch following this list illustrates a dynamically calculated snippet.)
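
As promised above, here is a minimal sketch of auditing a crawled index for the missing-title and redundant-title problems. The input – a mapping of URL to extracted title – and the sample entries are hypothetical:

    # Illustrative only: flag empty titles and titles shared across many items.
    from collections import defaultdict

    def audit_titles(index):
        missing = [url for url, title in index.items() if not title.strip()]
        by_title = defaultdict(list)
        for url, title in index.items():
            if title.strip():
                by_title[title.strip()].append(url)
        redundant = {t: urls for t, urls in by_title.items() if len(urls) > 1}
        return missing, redundant

    index = {
        "/reports/q4.doc": "Business Paper Template",
        "/reports/q3.doc": "Business Paper Template",
        "/apps/expense/view?id=7": "",
    }
    missing, redundant = audit_titles(index)
    print(missing)    # ['/apps/expense/view?id=7']
    print(redundant)  # {'Business Paper Template': ['/reports/q4.doc', '/reports/q3.doc']}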
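
And here is a minimal sketch of a dynamically calculated snippet, per item 4: show a window of text around the first query term found in the body. Real engines do far more (highlighting, multiple windows, stemming), but the idea is the same:

    # Illustrative only: a crude query-based snippet generator.
    def snippet(body, query_terms, width=60):
        low = body.lower()
        for term in query_terms:
            pos = low.find(term.lower())
            if pos != -1:
                start = max(0, pos - width // 2)
                end = min(len(body), pos + len(term) + width // 2)
                return "..." + body[start:end] + "..."
        return body[:width] + "..."  # no term found: fall back to the opening text

    print(snippet("The travel policy requires receipts for all expenses over $25.",
                  ["receipts"]))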

Addressing the issues

As with the principle of Coverage, identifying the specific cases where you have problems with Identity is commonly “half the battle”. Depending on the particular issue, fixing it is normally pretty straightforward (if laborious). For Documents with no (or misleading or useless) title, work with the content owner to educate them about the importance of filling in the File Properties. Ideally, a workflow in the content management system that would include a review for this would be great, though we do not have that in place at Novell.

We do have a content management community of practice, though, through which I have shared this kind of insight many times and I continue to educate anyone who is or becomes a content manager.

For web applications, I’ve used the same approach as I describe in addressing issues with the Coverage principle – education of development teams and review of applications (hopefully before they are deployed).

Now, onto the third Principle of Enterprise Search – the principle that turns a potential search results candidate that identifies itself well into a good search results candidate – the principle of Relevance.

The 3 Principles of Enterprise Search (part 1): Coverage

Tuesday, January 8th, 2008

This is the first in a series of posts in which I’ll delve into details of the Content Lifecycle that I’ve previously written about – the first topics will be a series on Search – an aspect of the “Use” stage of that lifecycle.

I’ve been working in the area of enterprise search for about 5 years now for my employer. During that time, I’ve learned a lot about how our own search engine, QuickFinder, works with regard to crawled indexes and (to a lesser extent) file system indexes. I’ve also learned a lot more about the general issues of searching within the enterprise.

A lot of my thinking in terms of a search solution revolves around a constant question I hear from users: “Why doesn’t X show up when I search on the search terms Y?” Users might not always phrase their problems that way, but that’s what most issues ultimately boil down to.

In considering the many issues users face with finding content, I have come to find there are three principles at play when it comes to findability: Coverage, Identity and Relevance. What are each of these principles? The principle of Coverage relates to what is included; the principle of Identity relates to how search results are identified; the principle of Relevance relates to when a content object shows up in search results.

First up – the principle of Coverage.

Issues with Coverage

The principle of Coverage is about the set of targets that a searcher might be able to find using your search tool. A content object first must be found by a search indexer in order to be a potential search results candidate; in other words, one answer to the user’s question above might be, “Because X isn’t in the search index!”. There are many issues I’ve found that inhibit good coverage – that is, issues that keep objects from even being a potential search results candidate (much less a good search results candidate):

  1. Lack of linkage – the simplest issue. Most web-based search engines will have a search indexer that operates as a crawler. A crawler must either be given a link to an object or it must find a link (along a path of links from a page it starts with) in order to discover it. A web page on a site that has no links to it will not be indexed – i.e., it will not be a search results candidate. (The first sketch following this list shows this discovery process in miniature.)
  2. JavaScript – a variation of #1 – I have not yet found a crawler-based search indexer that will execute JavaScript. Any navigation on a page that depends on JavaScript to be displayed is effectively navigation that’s not there to search engines. JavaScript is quite common in browser menuing systems, so this issue can be problematic. To a user navigating in a browser that executes that JavaScript, it can be hard to understand why a crawler cannot follow the same links – “They’re right there in the menu!”
  3. On our intranet, we have many secure areas – and, for the most part, we have a single sign-on capability provided by iChain. Our search engine is also able to authenticate while it’s crawling. However, at times, the means by which some applications achieve their single sign-on, while it seems transparent to a user in a browser, will use a combination of redirects and required HTTP header values that stymie the search indexer.
  4. Similar to #3 – while probably 98+% of tools on our intranet have single sign-on, some do not support it. This can cause challenges for a crawler-based indexing engine – even with an indexing engine that will handle authentication, most engines will support only one set of credentials.
  5. Web applications that depend on a user performing a search (using a “local” search interface) to find content. Such a web application keeps its own content invisible to an enterprise search engine without some specific consideration.
  6. Like many enterprise search engines, our search engine will allow (or perhaps require) that you define a set of domains that it is permitted to index. Depending on the search engine, it might only look at content in a specific set of domains you identify (x.novell.com and y.novell.com). Some search engines will allow you to specify the domain with wild-cards (*.novell.com). Some will allow you to specify that the crawler can follow links off of a “permitted” domain to a certain depth (I think of this as the “you can go one link away” rule). However, given that an enterprise search engine will not be indexing all domains, it can happen that some content your users need to find is not available via a domain that is being indexed.
  7. The opposite problem can occur at times, especially with web applications that exhibit poor implementation of robots tags – you can end up with a crawler finding many, many content items that are very low quality, useless, or redundant as search results (the second sketch following this list shows a crawler honoring robots directives). Some specific examples I’ve encountered with this issue include:
    • Web applications that allow a user to click on a link to edit data or content in the application – the link to the “edit view” is found by the crawler and then the “edit” view of an item becomes a potential search target.
    • Web applications that show a table of items and provide links (commonly in the header row of the table) to sort those lists by values in the different columns. The crawler ends up indexing every possible sort order of the data – not very useful.
    • The dreaded web-based calendar tool – when these have a “next month” link, a crawler can get stuck in an infinite (or nearly so anyway) link loop indexing days or months or years far out into the future.
    • Sites (common for a tool like a Wiki) that provide a “printer-friendly” version of a page. A crawler will find the link to the printer-friendly version and, unless the indexer is told not to index it (via robots tags), it will be included along with the actual article itself.
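
To make the linkage problem (issue #1) concrete, here is a minimal sketch of the discovery process every crawler performs. Here fetch_links is a hypothetical function returning only the hrefs present in a page’s static HTML – which is exactly why links generated by JavaScript (issue #2) are invisible:

    from collections import deque

    def crawl(seeds, fetch_links):
        """fetch_links(url) -> hrefs found in the page's *static* HTML."""
        seen, queue = set(seeds), deque(seeds)
        while queue:
            url = queue.popleft()
            for link in fetch_links(url):
                if link not in seen:   # discovered along a path of links
                    seen.add(link)
                    queue.append(link)
        return seen  # anything not in this set is invisible to search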
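
And here is a minimal sketch of a crawler honoring the robots <meta> tag, the usual remedy for the edit views, sort orders, calendars and printer-friendly copies above. A page would declare something like <meta name="robots" content="noindex, follow">; the variable names here are made up:

    def robots_directives(meta_content):
        """Parse the content of a robots <meta> tag into a set of directives."""
        return {d.strip().lower() for d in meta_content.split(",")}

    # E.g., a printer-friendly page declaring "noindex, follow":
    directives = robots_directives("noindex, follow")
    should_index = "noindex" not in directives    # False: keep it out of results
    should_follow = "nofollow" not in directives  # True: still follow its links
    print(should_index, should_follow)            # False True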

Addressing the issues

One important reminder on all of the above: while these represent a variety of issues, it will almost always be possible to index the content. It’s primarily a matter of understanding when content that users are searching for is not being included, and doing the work to index it. I find that it’s not necessarily the search engine but the potential for some minor development work that will stall this at times. Weighing the cost of the work to make the content visible against the benefit of having the content as potential search targets may not show a compelling reason to do the work.

The most common solution to the above issues is to work with the content managers to identify (or possibly build) a good index page. No magic, really – just the need to recognize the omission (or, in the case of the robots tag issue, the proliferation) of content and then the priority to act on it.

Within my own enterprise, to address the issues of titling and robots tags in web applications, I have taken a few specific steps:

  1. First, I have been on an educational campaign with our developers for several years now – most developers quickly understand the value of making <title> tags dynamic in terms of how they interact with search, and it’s normally such a small amount of change in an application that they will simply “do the right thing” from the beginning once they understand the problem;
  2. Second, part of our application development guidelines now formally includes standards around titling within web applications (this is really a detail of the education campaign);
  3. Third, when I can, I try to ensure I have an opportunity to review web applications prior to their “go live” to be able to provide feedback to the development teams on the findability of items in their application.
  4. Lastly, independent of “verification”, I have also provided a general methodology to our development teams to help them work through a strategy for titling and tagging in their web applications even in my absence.

In summary – addressing the Coverage principle is critical to ensure that the content your users are looking for is at least included in the index and is a potential search results candidate. In my next posts, I will address the principles that make a potential search results candidate a good search results candidate – the principles of Identity and Relevance.