As I wrote previously, my work on enterprise search has led me to three Principles of Enterprise Search: Coverage, Identity and Relevance. My previous post discussed Coverage. Here, I will cover the principle of Identity.
What is Identity in a search solution?
The principle of Identity relates to the way a search engine will display search results. Search engines will have a number of tools available to display items in search results, up to and including providing sophisticated templating mechanisms where you can provide your own logic for how an item is displayed. The Identity of a search result is the way in which it is presented in a search results page.
With regard to the original driving question – “Why doesn’t X show up when I search on the search terms Y?” – the principle of Identity addresses the situation where X is there but it’s called Z.
At the heart of most of this is the idea of a “title” for an item. Most search engines will use the text in the <title> tag of an HTML page as the name of the item in search results. Similarly, most search engines will use the Document Title field of documents that support such a field (all OpenOffice and Microsoft Office formats, PDF files, etc).
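To make the idea of a “title” concrete, here is a minimal sketch of how the text of the <title> tag could be pulled out of an HTML page, using only Python's standard library. It is an illustration under my own assumptions, not how any particular search engine does it; the names `TitleParser` and `extract_title` are made up for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the title displays on one line.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

An empty string coming back from `extract_title` is exactly the “no title” case discussed in the next section.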
Issues with Identity
The specific challenges I see in regard to this principle include:
- Web pages or documents with no title – This is a problem that is much more common with Documents than with web pages. Most web pages will have at least some kind of title, but not many users are even aware of the “Title” File Property or, if they are, most of those users don’t bother to ensure it is filled in. If you expose a lot of documents in your search results, this single problem can be a search killer because the search engine either ends up displaying the URL or generating a title.
- Some search engines will use heuristics to identify a title to use in search results if the designated title field is empty. For example, a search engine might use the URL of the item as the title (often, the path names and/or file name in the URL can be as informative as a title), or it might generate a title from header information in the document (looking for a particular style of text, etc.).
- Web pages or documents with useless or misleading titles – This issue probably equally plagues Documents and web pages. Some classic examples are caused by users creating documents by starting with a template, which was nicely titled “Business Paper Template”, for example, but then the users don’t change the title. You end up with dozens of items titled “Business Paper Template”. Or, users start with a completely unrelated document and edit it into what they’re working on; a user finds an item titled “Q4 Financial Report” but the item is actually a design document deliverable for a client consulting engagement.
- Web pages or documents with redundant titles – This problem shows up somewhat in documents (the “Business Paper Template” example) but it is far more common in dynamic web sites. The application developers start with a common template (say a .jsp page) and leave in the static <title> tag for every single page. You end up with hundreds or thousands of items titled something like “Business Expense Reports” (or whatever). Nothing will cause disenchantment in a searcher like that! “OK – Which one of these 10 items all titled the same am I supposed to pick??”
- Another important piece of Identity for search results is the snippet (or summary) that is often displayed along with the title. Some search engines will calculate this dynamically based on the user’s query, which is the most useful approach; other engines will use a static description of the item, commonly taken from the “description” <meta> tag (or other common <meta> tags). You need to understand how your engine generates the snippet and ensure the content being searched will show well. Depending on the quality of titles, this snippet can either be critically important (if titles are poor and the snippet is dynamically generated) or less critical (if you ensure your content has good, clear, distinct titles).
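The fallback heuristics and the query-based snippet described in the list above can be sketched roughly as follows. This is only a simplified illustration under my own assumptions: real engines use far more sophisticated logic, and the function names, the order of the fallbacks and the `width` parameter are all hypothetical.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def fallback_title(declared_title, first_heading, url):
    """Pick a display title: declared title, then first heading, then URL file name."""
    for candidate in (declared_title, first_heading):
        if candidate and candidate.strip():
            return candidate.strip()
    # No usable title or heading: derive something readable from the URL path.
    path = PurePosixPath(unquote(urlparse(url).path))
    stem = re.sub(r"[-_]+", " ", path.stem).strip()
    return stem or url


def query_snippet(text, query, width=60):
    """Return a window of text around the earliest query term, else the leading text."""
    lowered = text.lower()
    hits = [pos for pos in (lowered.find(term) for term in query.lower().split()) if pos >= 0]
    if not hits:
        # Static fallback: no query term found, so show the start of the text.
        return text[: 2 * width].strip()
    first = min(hits)
    start = max(0, first - width)
    end = min(len(text), first + width)
    return text[start:end].strip()
```

For example, `fallback_title("", None, "https://intranet.example.com/docs/client-design_doc.pdf")` falls all the way through to the file name, which is often still more informative than nothing.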
Addressing the issues
As with the principle of Coverage, identifying the specific cases where you have problems with Identity is commonly “half the battle”. Depending on the particular issue, the fix is normally pretty straightforward (if laborious). For Documents with no (or misleading or useless) title, work with the content owners to educate them about the importance of filling in the File Properties. Ideally, the content management system would have a workflow that includes a review for this, though we do not have that in place at Novell.
We do have a content management community of practice, though, through which I have shared this kind of insight many times and I continue to educate anyone who is or becomes a content manager.
For web applications, I’ve used the same approach as I describe in addressing issues with the Coverage principle – education of development teams and review of applications (hopefully before they are deployed).
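Since finding the problem cases is half the battle, a small audit over an export of indexed items can help. The sketch below assumes you can obtain a list of (URL, title) pairs from your search engine or crawler; how you get that list depends on your engine and is not shown. The `TEMPLATE_TITLES` set and the `audit_titles` function are hypothetical names, and the template titles would need to be filled in for your own environment.

```python
from collections import defaultdict

# Hypothetical list of known template titles; adjust for your own content.
TEMPLATE_TITLES = {"business paper template"}


def audit_titles(items, template_titles=TEMPLATE_TITLES):
    """Group URLs by Identity problem: missing, template or duplicate title.

    items is an iterable of (url, title) pairs; title may be None or empty.
    """
    report = {"missing": [], "template": [], "duplicate": []}
    by_title = defaultdict(list)
    for url, title in items:
        # Normalize case and whitespace so near-identical titles are grouped.
        normalized = " ".join((title or "").split()).lower()
        if not normalized:
            report["missing"].append(url)
        elif normalized in template_titles:
            report["template"].append(url)
        else:
            by_title[normalized].append(url)
    for urls in by_title.values():
        if len(urls) > 1:
            report["duplicate"].extend(urls)
    return report
```

The resulting lists give you something concrete to take to content owners and development teams.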
Now, on to the third Principle of Enterprise Search – the principle that turns a search results candidate that identifies itself well into a good search results candidate – the principle of Relevance.