I’ve previously written about the three principles of enterprise search and also about the specific business process challenges I’ve run into again and again with web applications in terms of findability.
Here, I will provide some insights on the specific standards I’ve established to improve findability, primarily within web applications.
When an application is being specified, the application team must discuss the following questions with business users: What are the business objects within this application, and which of those should be visible through enterprise search?
The first question is pretty standard and likely forms the basis for any UML or entity-relationship diagram produced as part of the application's design. The second question is often not asked, but it determines the specific targets that will eventually show up in enterprise search results.
Once you have identified which objects should be visible in search results, you can start to plan how they might show up and how the search engine will encounter them: whether the application should provide a dynamic index page of links to the entities, support a standard crawl, or perhaps allow a direct index of the database(s) behind the application.
In short, the standard is that the application must provide a means to ensure that a search engine can find all of the objects that need to be visible – and that it does not include things it should not.
Some specific requirements:
- The entities that need to show up in search results should be visible as an individual target, addressable via a unique and stable URL. This ensures that when an item shows up in a set of search results, a searcher will see an entity that looks and behaves like what they want – if they’re looking for a document, they see that document and not a page that links to that document.
- The application should have a strategy for implementing “robots” meta tags: pages that should not be indexed should carry “noindex”. Pages that are purely navigational (and not destinations themselves for search) should be marked “noindex”. Pages that provide navigation to items through various options (filters, sorting, etc.) may need “nofollow” as well, so that a crawler does not get hung up examining multitudes of pages that are all marked “noindex” anyway.
- The application should not be frame-based. This is a more general standard for web applications, but frame-based applications consistently cause problems for crawlers: a crawler will index the individual frames, but those individual frames are not, themselves, useful targets.
- To simplify things for a search engine, an application can provide an index page that links directly to every object that should show up in search; I’ve found this very useful, and it can be much simpler than working through the logic of a robots-tag strategy to ensure good coverage. The index page itself should carry “noindex, follow” robots tags so that it is not indexed (otherwise it might show up as a potential result for many searches if, say, the titles of the items appear in the index page).
- Note that for some applications, the answer to the question above may be that nothing within the application should be found via the enterprise search solution. That might be the case if the application provides its own local search function and there is no value in higher visibility (or if the cost of that higher visibility is too high – say, if the application enforces sophisticated access control that would be hard to translate to an enterprise solution).
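As an illustration of the index-page approach, here is a minimal sketch in Python. The `get`-style input (a list of URL/name pairs) and the function name are hypothetical, not part of any particular framework; the point is the “noindex, follow” robots tag that lets a crawler follow every link without ever showing the index page itself as a result:

```python
def render_crawler_index(entities):
    """Render an HTML page linking every search-visible entity.

    entities: iterable of (url, name) pairs for the objects that
    should appear in enterprise search results.

    The page itself carries "noindex, follow" so the crawler follows
    the links but never surfaces this page as a search result.
    """
    links = "\n".join(
        f'    <li><a href="{url}">{name}</a></li>' for url, name in entities
    )
    return (
        "<html><head>\n"
        '  <meta name="robots" content="noindex, follow"/>\n'
        "  <title>Search index</title>\n"
        "</head><body>\n"
        "  <ul>\n"
        f"{links}\n"
        "  </ul>\n"
        "</body></html>"
    )
```

In practice this page would be generated dynamically from the application’s database so that new entities become crawlable as soon as they are created.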
With the standard for Coverage defined, we can be confident that the right things will show up in search and the wrong things will not. How useful will they be as search results, though? If a searcher sees an item in a results list, will they know it’s what they’re looking for? To answer that, we need to ensure that the application addresses the identity principle.
The standard here is that the pages (ASP pages, JSP files, etc) that comprise the desirable targets for search must be designed to address the identity principle – specifically:
- Each page that shows a search target must dynamically generate a <title> tag that clearly describes what it shows.
- The application should also adopt a standard for how it identifies where the content or data lives (the application name, perhaps) as well as the item-specific name.
- Within our infrastructure, a standard like, “<application name>: <item name>” has worked well.
- In addition, each page that shows a search target must dynamically generate a “description” <meta> tag. This description can be (and in our search is) used as part of the snippet displayed in a search results page, so it can give a searcher important clues before they even click on a target.
- The application team should develop a strategy for what to include in the “description”:
- In many applications, each item of interest will typically have some kind of user-entered text that can be interpreted as a description or which could be combined with some static text to make it so.
- For example, an entity might have a name (used in the <title> tag) and something referred to as the “summary” or “subject” or maybe “description” – simply use that text.
- Alternately, the “description” might be generated as something like, “The help desk ticket <ticket ID> named <ticket name>”, for a page that might be part of a help desk ticket application.
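The identity standards above can be sketched as a small helper. Everything here is illustrative – the application name, function name, and use of Python’s `html.escape` are assumptions for the sketch, not part of any particular stack:

```python
import html

APP_NAME = "HelpDesk"  # assumed application name, per the
                       # "<application name>: <item name>" convention

def identity_tags(item_name, summary):
    """Build the <title> and "description" <meta> tag for a search target.

    item_name: the entity's user-visible name.
    summary:   user-entered text usable as a description snippet.
    """
    title = f"{APP_NAME}: {item_name}"
    return (
        f"<title>{html.escape(title)}</title>\n"
        f'<meta name="description" content="{html.escape(summary)}"/>'
    )
```

A page template (JSP, ASP, etc.) would emit the equivalent markup in its <head>, with the same escaping to keep user-entered text from breaking the HTML.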
Now we know that the search includes what it should and we also know that when those items show in search, they will be identifiable for what they are. How do we ensure that the items show up in search for searches for which they are relevant, though?
The standards to address the relevance issue are:
- Follow the standard above for titles (the words in the <title> tag will normally significantly boost relevancy for searches on those words regardless of your search engine)
- Each page that shows a search target must dynamically generate a “keywords” <meta> tag.
- The application team should devise a strategy for what would be included in the keywords, though some common concepts emerge:
- Any field that a user can assign to the entity is a candidate – for example, if a user can select a Product with which an item is associated, or a geography, an industry, etc., all of those terms are good candidates for inclusion in keywords.
- While redundant, simply using the title of the item in the keywords can be useful (and reinforce the relevance of those words)
- If an application integrates with a taxonomy system (specifically, a thesaurus) any taxonomic tags assigned to an entity should be included.
- In addition, for a thesaurus, if the content will be indexed by internet search engines, directly including synonyms for taxonomic terms in the keywords can sometimes help – you might also configure those synonyms in your own search engine, but you can’t do that with a search engine you don’t control. (Many internet search engines no longer consider the contents of these tags because of keyword spamming, but including them can’t hurt.)
- The application may also generate additional <meta> tags that are specific to its needs. When integrated with a taxonomy that has defined facets, including a <meta> tag with the name for each facet and the assigned values can improve results.
- For example, if the application allows assignment of a product, it can generate a tag like: <meta name="product" content="<selected values>"/>
- Some search engines will allow searching within named fields like this – providing you a combination of a full text search and fielded search ability.
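The relevance standards can likewise be sketched as a helper that emits a “keywords” <meta> tag plus one named <meta> tag per facet. The function name, facet dictionary shape, and comma-separated value format are assumptions for illustration:

```python
import html

def relevance_tags(title_words, facets):
    """Build a "keywords" <meta> tag plus one named <meta> tag per facet.

    title_words: words from the item's title, repeated in the keywords
                 to reinforce their relevance.
    facets:      dict mapping a facet name (e.g. "product") to the list
                 of values assigned to this entity.
    """
    keywords = list(title_words)
    for values in facets.values():
        keywords.extend(values)  # user-assigned field values are keyword candidates
    tags = [
        f'<meta name="keywords" content="{html.escape(", ".join(keywords))}"/>'
    ]
    for name, values in facets.items():
        # One named tag per facet enables fielded search in engines
        # that support searching within named meta fields.
        tags.append(
            f'<meta name="{html.escape(name)}" '
            f'content="{html.escape(", ".join(values))}"/>'
        )
    return "\n".join(tags)
```

For a help desk ticket tagged with a product facet, this yields a keywords tag carrying both the title words and the facet values, plus a separate product tag a fielded search could target.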
For a good review of the <meta> tags in HTML pages, see the HTML specification’s description of the <meta> element.