Lee Romero

On Content, Collaboration and Findability

Enterprise Search and Third-Party Applications

Tuesday, October 28th, 2008

Or, in other words, “How do you apply application standards that improve findability to applications built by third-party providers who do not follow those standards?”

I’ve previously written about the standards I’ve put together for (web-based) applications that help ensure good findability for content / data within that application. These standards are generally relatively easy to apply to custom applications (though it can still be challenging to get involved with the design and development of those applications at the right time to keep the time investment minimal, as I’ve also previously written about).

However, it can be particularly challenging to apply these standards to third-party applications – for example, your CRM application, your learning management system, or your HR system. Applying the existing standards could take a few different forms:

  1. Ideally, when your organization goes through the selection process for such an application, your application standards are explicitly included in the selection criteria and used to ensure you select a solution that will conform to your standards
  2. More commonly, you will assess compliance with the standards (perhaps during selection, but perhaps later, during implementation) and you might need to implement some type of customization within the application to achieve compliance.
  3. Least happily, you assess compliance with the standards during selection or later, but you find you cannot customize the application and you need a different solution.

The rest of this post will discuss a solution for option #3 above – how you can implement a different solution. Note that some search engines will provide pre-built functionality to enable search within many of the more common third-party solutions – those are great and useful, but what I will present here is a solution that can be implemented independent of the search engine (as long as the search engine has a crawler-based indexing function) and which requires relatively minimal investment.

Solving the third-party application conundrum for Enterprise Search

So, you have a third-party application and, for whatever reason, it does not adhere to your application standards for findability. Perhaps it fails the coverage principle and it’s not possible to adequately find the useful content without getting many, many useless items; or perhaps it’s the identity principle and, while you can find all of the desirable targets, they have redundant titles; or it might even be that the application fails the relevance principle: you can index the high value targets and they show up with good names in results, but they do not show up as relevant for the keywords you would expect. Likely, it’s a combination of all three of these issues.

The core idea in this solution is that you will need a helper application that creates what I call “shadow pages” of the high value targets you want to include in your enterprise search.

Note: I adopted the use of the term “shadow page” based on some informal discussions with co-workers on this topic – I am aware that others use this term in similar ways (though I don’t think it means the exact same thing) and also am aware that some search engines address what they call shadow domains and discourage their inclusion in their search results. If there is a preferred term for the idea described here – please let me know!

What is a shadow page? For my purposes here, I define a shadow page as:

  • A page which uniquely corresponds to a single desirable search target;
  • A page that has a distinct, unique URL;
  • A page that has a <title> and description that reflect the search target of which it is a shadow; the title must be distinct and must give a searcher who sees it in a search results page insight into what the item is;
  • A page that has good metadata (keywords or other fields) that describe the target using terminology a searcher would use;
  • A page which contains text (likely hidden) that also reflects all of the above as well to enhance relevance for the words in the title, keywords, etc.;
  • A page which, when accessed, will automatically redirect the user to the target of which it is a shadow.

To make this solution work, there are a few minimal assumptions about the application. A caveat: I recognize that, while I consider these relatively simple assumptions, it is very likely that some applications will still not meet them and so will not be able to be exposed via your enterprise search with this type of solution.

  1. Each desirable search target must be addressable by a unique URL;
  2. It should be possible to define a query which will give you a list of the desirable targets in the application; this query could be an SQL query run against a database or possibly a web services method call that returns a result in XML (or probably other formats, but these are the most common in my experience) – see the sketch after this list;
  3. Given the identity (say, a primary key if you’re using a SQL database of some type) of a desirable search target, you must also be able to query the application for additional information about that search target.
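
To make the second assumption concrete, here is a sketch of what a web services response listing the desirable targets might look like. Everything in it is invented for illustration – the element names, the course data and the URLs will all differ in your application:

<!-- illustrative only: all element names and values here are invented -->
<targets>
  <target id="1234">
    <name>Introduction to Widgets</name>
    <summary>A self-paced course introducing the widget product line.</summary>
    <keywords>widgets, training, introduction</keywords>
    <url>http://lms.example.com/course?id=1234</url>
  </target>
  <target id="1235">
    <name>Advanced Widgets</name>
    <summary>An instructor-led course on advanced widget configuration.</summary>
    <keywords>widgets, training, advanced</keywords>
    <url>http://lms.example.com/course?id=1235</url>
  </target>
</targets>

Note that each target carries both the identity needed for the follow-up query of the third assumption and the unique URL required by the first.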

Building a Shadow Page

Given the description of a shadow page and the assumptions about what is necessary to support it, it is probably obvious how they are used and how they are constructed, but here’s a description:

First – you would use the query that gives you a list of targets (item #2 from the assumptions) from your source application to generate an index page which you can give your indexer as a starting point.  This index page would have one link on it for each desirable target’s shadow page.  This index page would also have “robots” <meta> tags of “noindex,follow” to ensure that the index page itself is not included as a potential target.
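
As a sketch, the generated index page might look like the following; the link URLs and titles are invented for illustration, and “shadow.jsp” is a hypothetical name for the helper page that serves the shadow pages:

<html>
<head>
<title>Shadow page index</title>
<!-- illustrative sketch: noindex,follow keeps this page out of results while letting the crawler reach each shadow page -->
<meta name="robots" content="noindex,follow">
</head>
<body>
<a href="shadow.jsp?id=1234">Introduction to Widgets</a>
<a href="shadow.jsp?id=1235">Advanced Widgets</a>
</body>
</html>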

Second – The shadow page for each target (which the crawler reaches thanks to the index page) is dynamically built from the query of the application given the identity of the desirable search target (item #3 from the assumptions).  The business rules defining how the desirable target should behave in search help define the necessary query, but the query would need to return, at minimum, data like the following: the name of the target, a description or summary of the target, some keywords that describe the target, and a value which will help define the true URL of the actual target (per assumption #1, there must be a way to directly address each target).

The shadow page would be built something like the following:

  • The <title> tag would be the name of the target from the query (perhaps plus an application name to provide context)
  • The “description” <meta> tag would be the description or summary of the target from the query, perhaps plus a bit of static text that provides additional context about the target.  For example, if the target represents a learning activity, the additional static text might indicate that.
  • The “keywords” <meta> tag would include the keywords from the query, plus some static keywords to ensure good coverage.  To follow the previous example, it might be appropriate to include words like “learning”, “training”, “class”, etc. in a target that is a learning activity to ensure that, if the keywords for the specific target do not include those words, searchers can still find the shadow page target in search.
  • The <body> of the page can be built to include all of the above text – from my experience, wrapping the body in a CSS style that visually hides the text keeps the text from actually appearing in a browser.
  • Lastly, the shadow page has a bit of JavaScript in it that redirects a browser to the actual target – this is why you need the target to be addressable via a URL and why the query needs to provide the information necessary to create that URL.  Most engines (I know of none that do) will not execute the JavaScript, so they will not know that the page is really a redirect to the desired target.

The overall effect of this is that the search engine will index the shadow page, which has been constructed to ensure good adherence to the principles of enterprise search.  To a searcher, it will behave like a good search target, but when the user clicks on it from a search result, the user ends up looking at the actual desired target.  The only clue the user might have is that the URL of the target in the search results is not what they end up looking at in their browser’s address bar.

The following provides a simple example of the source (in HTML – apologies to those who might not be able to read it) for a shadow page; the parts that change from page to page appear as placeholder text such as “title of target”:

<html>
<head>
<title>title of target</title>
<meta name="robots" content="index, nofollow">
<meta name="keywords" content="keywords for target">
<meta name="description" content="description of target">
<script type="text/javascript">
document.location.href="URL of actual target";
</script>
</head>
<body>
<div style="display:none;">
<h1>title of target</h1>
description of target and keywords of target
</div>
</body>
</html>

Advantages of this Solution

A few things that are immediately obvious advantages of this approach:

  1. First and foremost, with this approach, you can provide searchers with the ability to find content which otherwise would be locked away and not available via your enterprise search!
  2. You can easily control which targets within the application are available via your enterprise search (potentially much more easily than trying to figure out the right combination of robots tags or inclusion / exclusion settings for your indexer).
  3. You can very tightly control how a target looks to the search engine (including integration with your taxonomy to provide elaborated keywords, synonyms, etc.).

Problems with this Solution

There are also a number of issues that I need to highlight with this approach – unfortunately, it’s not perfect!

  1. The most obvious issue is that this depends on the ability to query for a set of targets against a database or web service of some sort.
    1. Most applications will be technically able to support this, but in many organizations, this could present too great a risk from a data security perspective (the judicious use of database views and proper management of read rights on the database should solve this, however!)
    2. This potentially creates too high a level of dependence between your search solution and the inner workings of the application – an upgrade of the application could change the data schema enough to break this approach.  Again, I think that the use of database views can solve this (by abstracting away the details of the implementation into a single view which can be changed as necessary through any upgrade).
  2. Some applications may simply not offer a “deep linking” ability into high value content – there is no way to uniquely address a content item outside the context of the application.  This solution cannot be applied to such applications.  (My opinion is that such applications are poorly designed, but that’s another matter entirely!)
  3. This solution depends on JavaScript to forward the user from the shadow page to the actual target.  If your user population has a large percentage of people who do not use JavaScript, this solution fails them utterly.
  4. This solution depends on your search engine not following the JavaScript or somehow otherwise determining that the shadow page is a very low quality target (perhaps by examining the styles on the text and determining the text is not visible).  If you have a search engine that is this smart, hopefully you have a way to configure it to ignore this for at least some areas or page types.
  5. Another major issue is that this solution largely circumvents a search engine’s built-in ability to do item-by-item security, since the target the search engine sees is the shadow page.  I think the key here is to not use this solution for content that requires this level of security.

Conclusion

There you have it – a solution for exposing the high value targets from your enterprise applications that is independent of your search engine, gives you (the search administrator) a good level of control over how content appears to your search engine, and ensures that what is included adheres closely to my principles of enterprise search.

Standards to Improve Findability in Enterprise Applications

Thursday, October 23rd, 2008

I’ve previously written about the three principles of enterprise search and also about the specific business process challenges I’ve run into again and again with web applications in terms of findability.

Here, I will provide some insights on the specific standards I’ve established to improve findability, primarily within web applications.

As you might expect, these standards map closely to the three principles of enterprise search and so that’s how I will discuss them.

Coverage

When an application is being specified, the application team must ensure that they discuss the following question with business users – What are the business objects within this application and which of those should be visible through enterprise search?

The first question is pretty standard and likely forms the basis for any kind of UML or entity relationship diagram that would be part of a design process for the application. The second part is often not asked but it forms the basis for what will eventually be the specific targets that will show in search results through the enterprise search.

Given the identification of which objects should be visible in search results, you can then easily start to plan out how they might show up, how the search engine will encounter them, whether the application might best provide a dynamic index page of links to the entities or support a standard crawl or perhaps even a direct index of the database(s) behind the application.

Basically, the standard here is that the application must provide a means to ensure that a search engine can find all of the objects that need to be visible and also to ensure that the search engine does not include things that it should not.

Some specific things that are included here:

  • The entities that need to show up in search results should be visible as an individual target, addressable via a unique and stable URL. This ensures that when an item shows up in a set of search results, a searcher will see an entity that looks and behaves like what they want – if they’re looking for a document, they see that document and not a page that links to that document.
  • The application should have a strategy for the implementation of “robots” <meta> tags – pages that should not be indexed should have a “noindex”. Pages that are navigational (and not destinations themselves for search) should be marked “noindex”. Pages that provide navigation to the items through various options (filters, sorting, etc.) may need “nofollow” as well, so that a crawler does not get hung up looking at multitudes of pages, all of which are marked “noindex” anyway. (See the snippet after this list.)
  • The application should not be frame-based. This is a more general standard for web applications, but frame-based applications consistently cause problems for crawlers, as a crawler will index the individual frames but those individual frames are not, themselves, useful targets.
  • To simplify things for a search engine, an application can simply provide an index page that directly links to all desired objects that should show up in search; I’ve found this to be very useful and it can be much simpler than working through the logic of a strategy for robots tags to ensure good coverage. This index page would be marked “noindex, follow” for its robots tags so that it, itself, is not indexed (otherwise it might show up as a potential result for a lot of searches if, say, the titles of the items are included in this index page).
  • Note that it is possible that for some applications, the answer to the leading question for this may be that nothing within the application is intended to be found via an enterprise search solution. That might be the case if the application provides its own local search function and there is no value in higher visibility (or possibly if the cost of that higher visibility is too high – say in the case that the application provides sophisticated access control which might be hard to translate to an enterprise solution).
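
As a sketch, this robots strategy boils down to emitting one of a few <meta> variants depending on a page’s role; the roles named in the comments are illustrative:

<!-- a navigational page that is not itself a search destination -->
<meta name="robots" content="noindex,follow">
<!-- a sorted/filtered variant of a listing the crawler should not wander into -->
<meta name="robots" content="noindex,nofollow">
<!-- a desirable search target -->
<meta name="robots" content="index,follow">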

Identity

With the standard for Coverage defined, we can be comfortable with knowing that the right things are going to show in search and the wrong things will not show up. How useful will they be as search results, though? If a searcher sees an item in a results list, will they be able to know that it’s what they’re looking for? So we need to ensure that the application addresses the identity principle.

The standard here is that the pages (ASP pages, JSP files, etc) that comprise the desirable targets for search must be designed to address the identity principle – specifically:

  • Each page that shows a search target must dynamically generate a <title> tag that clearly describes what it shows.
  • An application should also adopt a standard for how it identifies where the content / data is (the application name perhaps) as well as the content-specific name.
  • Within our infrastructure, a standard like “<application name>: <item name>” has worked well.
  • In addition, each page that shows a search target must dynamically generate a “description” <meta> tag. This description can be used (and for our search is used) as part of the results snippet displayed in a search results page, so it can provide a searcher important clues before the searcher even clicks on a target.
  • The application team should develop a strategy for what to include in the “description” (see the sketch after this list):
    • In many applications, each item of interest will typically have some kind of user-entered text that can be interpreted as a description or which could be combined with some static text to make it so.
    • For example, an entity might have a name (used in the <title> tag) and something referred to as the “summary” or “subject” or maybe “description” – simply use that text.
    • Alternately, the “description” might be generated as something like, “The help desk ticket <ticket ID> named <ticket name>”, for a page that might be part of a help desk ticket application.
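
Putting these identity standards together, the <head> of a page in a hypothetical help desk ticket application might emit something like the following; the application name, ticket ID and ticket name are invented placeholders:

<head>
<!-- follows the "<application name>: <item name>" pattern; all values are illustrative -->
<title>Help Desk: Printer fails to print on letterhead</title>
<meta name="description" content="The help desk ticket 10456 named Printer fails to print on letterhead">
</head>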

Relevance

Now we know that the search includes what it should and we also know that when those items show in search, they will be identifiable for what they are. How do we ensure that the items show up in search for searches for which they are relevant, though?

The standards to address the relevance issue are:

  • Follow the standard above for titles (the words in the <title> tag will normally significantly boost relevancy for searches on those words regardless of your search engine)
  • Each page that shows a search target must dynamically generate a “keywords” <meta> tag.
  • The application team should devise a strategy for what would be included in the keywords, though some common concepts emerge:
    • Any field that a user can assign to the entity would be a candidate – for example, if a user can select a Product with which an item is associated or a geography, an industry, etc. All of those terms are good candidates for inclusion in keywords
    • While redundant, simply using the title of the item in the keywords can be useful (and reinforce the relevance of those words)
    • If an application integrates with a taxonomy system (specifically, a thesaurus) any taxonomic tags assigned to an entity should be included.
    • In addition, for a thesaurus, if the content will be indexed by internet search engines, directly including synonyms for taxonomic terms in the keywords can sometimes help – you might also include those synonyms directly in your own search engine’s configuration, but you can’t do that with a search engine you don’t control. (Many internet search engines no longer consider the contents of these tags due to spamming, but including them can’t hurt even then.)
  • The application may also generate additional <meta> tags that are specific to its needs. When integrated with a taxonomy that has defined facets, including a <meta> tag with the name of each facet and the assigned values can improve results (see the sketch after this list).
    • For example, if the application allows assignment of a product, it can generate a tag like: <meta name="product" content="<selected values>"/>
    • Some search engines will allow searching within named fields like this – providing you a combination of a full text search and fielded search ability.
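
As a sketch combining these relevance standards, the same hypothetical help desk ticket page might emit the following; the field names and values are invented for illustration:

<!-- title words repeated in keywords to reinforce relevance; product and geography are user-assigned fields -->
<meta name="keywords" content="printer, letterhead, help desk, ticket, support">
<!-- facet-specific tags, for engines that support searching within named fields -->
<meta name="product" content="LaserWriter 9000">
<meta name="geography" content="North America">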

People know where to find that, though!

Monday, October 13th, 2008

The title of this post – “People know where to find that, though!” – is a very common phrase I hear as the search analyst and the primary search advocate at my company. Another version would be, “Why would someone expect to find that in our enterprise search?”

Why do I hear this so often? I assume that many organizations, like my own, have many custom web applications available on their intranet and even their public site. It is because of that prevalence, combined with a lack of communication between the Business and the Application team, that I hear these phrases so often.

I have (unfortunately!) lost count of the number of times a new web-based application goes into production without anyone even considering the findability of the application and its content (data) within the context of our enterprise search.

Typically, the conversation seems to go something like this:

  • Business: “We need an application that does X, Y and Z and is available on our web site.”
  • Application team: “OK – let’s get the requirements laid out and build the site. You need it to do X, Y and Z. So we will build a web application that has page archetypes A, B and C.”
  • Application team then builds the application, probably building in some kind of local search function – so that someone can find data once they are within the application.
  • The Business accepts the usability of the application and it goes into production.

What did we completely miss in this discussion? Well, no one in the above process (unfortunately) has explicitly asked the question, “Does the content (data) in this site need to be exposed via our enterprise search?” Nor has anyone even asked the more basic question, “Should someone be able to find this application [the “home page” of the application in the context of a web application] via the enterprise search?”

  • Typically, the Business makes the (reasonable) assumption that goes something like, “Hey – I can find this application and navigate through its content via a web browser, so it will naturally work well with our enterprise search and I will easily be able to find it, right?!”
  • On the other hand, the Application Team has likely made 2 assumptions: 1) the Business did not explicitly ask for any kind of visibility in the enterprise search solution, so they don’t expect that, and 2) they’ve (likely) provided a local search function, so that would be completely sufficient as a search.

I’ve seen this scenario play out many, many times in just the last few years here. What often happens next depends on the application but includes many of the following symptoms:

  • The page archetypes designed by the Application Team will have the same (static) <title> tag in every instance of the page, regardless of the data displayed (generally, the data would be different based on query string parameters) – see the snippet after this list.
    • The effect? A web-crawler-based search engine (which we use) likely uses the <title> tag as an identifier for content, and every instance of each page type has the same title, resulting in a whole lot of pretty useless (undifferentiated) search results. Yuck.
  • The page archetypes have no other metadata – keywords, description, content-date, author, etc. – or, at best, redundant metadata.
    • The effect? The crawler has no differentiation based on <title>s and no additional hints from metadata. That is, lousy relevance.
  • The application has a variety of navigation or data manipulation capabilities (say, sorting data) based on standard HTML links.
    • The effect? The crawler happily follows all of the links – possibly indexing the same data (redundantly) many, many times, simply sorted on different columns.
    • Another effect? The dreaded calendar effect – the crawler will basically never stop finding new links because there’s always another page.
    • In either case, we see poor coverage of the content.
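
To make the first symptom concrete, here is the kind of static <title> every instance of a page archetype ends up sharing, next to the dynamic alternative good findability calls for; the application and ticket names are invented:

<!-- what the archetype emits for every record: undifferentiated -->
<title>Ticket Viewer</title>
<!-- what good findability calls for: generated per record -->
<title>Ticket Viewer: Printer fails to print on letterhead</title>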

The overall effect is likely that the application does not work well with the enterprise search, or possibly that the application does not hold up to the pressure of the crawler hitting its pages much faster than anticipated (so I end up having to configure the crawler to avoid the application), ending with yet another set of content that’s basically invisible in search.

Bringing this back around to the title – the response I often get when inquiring about a newly released application is something like, “People will know how to find that content – it’s in this application! Why would this need to be in the enterprise search?”

When I then ask, “Well, how do people know that they even need to navigate to or look in this application?” I’ll get a (virtual) shuffling of feet and shoulder shrugs.

All because of a perpetual failure to ask a few basic questions during the requirements gathering stage of a project or (another way to look at it) a lack of standards or policies with “teeth” about the design and development of web applications. The unfortunate thing is that, in my experience, if you ask the questions early, it typically takes on the scale of a few hours of a developer’s time to make the application work at least reasonably well with any crawler-based search engine. Unfortunately, because I often don’t find out about an application until after it’s in production, it then becomes a significant obstacle to get changes like this made.

I’ll write more in a future post about the standards I have worked to establish (which are making some headway into adoption, finally!) to avoid this.

Edit: I’ve now posted the standards as mentioned above – you can find them in my post Standards to Improve Findability in Enterprise Applications.