As I previously wrote, in my work on enterprise search, I have found there to be 3 Principles of Enterprise Search: Coverage, Identity and Relevance. My previous post discussed Coverage. Here, I will cover the principle of Identity.
What is Identity in a search solution?
The principle of Identity relates to the way a search engine will display search results. Search engines will have a number of tools available to display items in search results, up to and including providing sophisticated templating mechanisms where you can provide your own logic for how an item is displayed. The Identity of a search result is the way in which it is presented in a search results page.
With regard to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Identity addresses the situation where X is there but it’s called Z.
At the heart of most of this is the idea of a “title” for an item. Most search engines will use the text in the <title> tag of an HTML page as the name of the item in search results. Similarly, most search engines will use the Document Title field of documents that support such a field (OpenOffice and Microsoft Office formats, PDF files, etc.).
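To make the idea concrete, here is a minimal sketch of how an indexer might pull the name of an item out of an HTML page. It is not how QuickFinder or any particular engine does it; the class `TitleParser` and the function `extract_title` are names made up for this illustration, and the sketch only uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the title displays on one line.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title, or an empty string if the page has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

A page with no `<title>` comes back as an empty string, which is exactly the case where a search engine has to fall back on something less helpful, such as the URL or file name.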
Issues with Identity
The specific challenges I see in regard to this principle include:
Addressing the issues
As with the principle of Coverage, identifying the specific cases where you have problems with Identity is commonly “half the battle”. Depending on the particular issue, the fix is normally pretty straightforward (if laborious). For documents with no title (or a misleading or useless one), work with the content owners to educate them about the importance of filling in the File Properties. Ideally, the content management system would include a workflow step that reviews this, though we do not have that in place at Novell.
We do have a content management community of practice, though, through which I have shared this kind of insight many times and I continue to educate anyone who is or becomes a content manager.
For web applications, I’ve used the same approach as I describe in addressing issues with the Coverage principle – education of development teams and review of applications (hopefully before they are deployed).
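Finding the problem cases can be partly automated. The following is a rough sketch, assuming you can export a list of (URL, title, file name) entries from your index; the function names and the `GENERIC_TITLES` set are hypothetical and would need to be adapted to the placeholder titles that actually show up in your own content.

```python
# Hypothetical placeholder titles; replace with the ones seen in your own index.
GENERIC_TITLES = {"untitled", "untitled document", "document1", "new page", "home"}


def title_problem(title, filename=""):
    """Return a short description of the problem, or None if the title looks usable."""
    cleaned = " ".join((title or "").split())
    if not cleaned:
        return "missing title"
    lowered = cleaned.lower()
    if lowered in GENERIC_TITLES:
        return "generic placeholder title"
    # A title that merely repeats the file name tells a searcher very little.
    stem = filename.rsplit(".", 1)[0].lower()
    if filename and lowered in (filename.lower(), stem):
        return "title repeats the file name"
    return None


def audit(items):
    """items: iterable of (url, title, filename). Returns a list of (url, problem)."""
    report = []
    for url, title, filename in items:
        problem = title_problem(title, filename)
        if problem is not None:
            report.append((url, problem))
    return report
```

A report like this does not fix anything by itself, but it gives you a concrete list to take to the content owners or development teams.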
Now, on to the third Principle of Enterprise Search – the principle that makes a potential search results candidate that identifies itself well into a good search results candidate – the principle of Relevance.
This is the first in a series of posts in which I’ll delve into details of the Content Lifecycle that I’ve previously written about – the first topics will be a series on Search – an aspect of the “Use” stage of that lifecycle.
I’ve been working in the area of enterprise search for about 5 years now for my employer. During that time, I’ve learned a lot about how our own search engine QuickFinder works with regard to crawled indexes and (to a lesser extent) file system indexes. I’ve also learned a lot more about the general issues of searching within the enterprise.
A lot of my thinking in terms of a search solution revolves around a constant question I hear from users: “Why doesn’t X show up when I search on the search terms Y?” Users might not always phrase their problems that way, but that’s what most issues ultimately boil down to.
In considering the many issues users face with finding content, I have come to find there are three principles at play when it comes to findability: Coverage, Identity and Relevance. What is each of these principles? The principle of Coverage relates to what is included; the principle of Identity relates to how search results are identified; the principle of Relevance relates to when a content object shows up in search results.
First up – the principle of Coverage.
Issues with Coverage
The principle of Coverage is about the set of targets that a searcher might be able to find using your search tool. A content object first must be found by a search indexer in order to be a potential search results candidate; in other words, one answer to the user’s question above might be, “Because X isn’t in the search index!”. There are many issues I’ve found that inhibit good coverage – that is, issues that keep objects from even being a potential search results candidate (much less a good search results candidate):
Addressing the issues
One important reminder on all of the above is that, while these represent a variety of issues, it will almost always be possible to index the content. It’s primarily a matter of recognizing when content that users are searching for is not being included, and then doing the work to index it. I find that what stalls this at times is not necessarily the search engine but the need for some minor development work. Weighing the cost of making the content visible against the benefit of having it as a potential search target may not show a compelling reason to do the work.
The most common solution to the above issues is to work with the content managers to identify (or possibly build) a good index page. No magic, really; it just takes recognizing the omission (or, in the case of the robots tag issue, the proliferation) of content and then giving priority to acting on it.
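Recognizing a robots tag problem is something a small script can help with. The sketch below assumes the robots tag in question is the standard robots `<meta>` tag with its `noindex`, `nofollow` and `none` directives; the class and function names are invented for this example, and it says nothing about how QuickFinder itself interprets the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def coverage_flags(html):
    """Summarize whether a well-behaved crawler would index the page and follow its links."""
    directives = robots_directives(html)
    blocked = "none" in directives  # "none" is shorthand for noindex, nofollow
    return {
        "indexable": not (blocked or "noindex" in directives),
        "links_followed": not (blocked or "nofollow" in directives),
    }
```

Run against the pages of a web application, a check like this shows quickly whether content is being kept out of the index (a stray `noindex`) or whether a missing `nofollow` is letting the crawler wander into pages that should not be search targets.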
Within my own enterprise, to address the issue of robots tags in web applications, I have taken a few specific steps:
In summary – addressing the Coverage principle is critical to ensure that the content your users are looking for is at least included in the index and is a potential search results candidate. In my next posts, I will address the principles that make a potential search results candidate a good search results candidate – the principles of Identity and Relevance.