Or, in other words, “How do you apply the application standards to improve findability to applications built by third-party providers who do not follow your standards?”
I’ve previously written about the standards I’ve put together for (web-based) applications that help ensure good findability for content / data within that application. These standards are generally relatively easy to apply to custom applications (though it can still be challenging to get involved with the design and development of those applications at the right time to keep the time investment minimal, as I’ve also previously written about).
However, it can be particularly challenging to apply these standards to third-party applications such as your CRM application, your learning management system, or your HR system. Applying the existing standards could take a couple of different forms:
The rest of this post will discuss a solution for option #3 above – how you can implement a different solution. Note that some search engines will provide pre-built functionality to enable search within many of the more common third party solutions – those are great and useful, but what I will present here is a solution that can be implemented independent of the search engine (as long as the search engine has a crawler-based indexing function) and which is relatively minimal in investment.
So, you have a third-party application and, for whatever reason, it does not adhere to your application standards for findability. Perhaps it fails the coverage principle and it’s not possible to adequately find the useful content without getting many, many useless items; or perhaps it’s the identity principle and, while you can find all of the desirable targets, they have redundant titles; or it might even be that the application fails the relevance principle and you can index the high value targets and they show up with good names in results but they do not show up as relevant for keywords which you would expect. Likely, it’s a combination of all three of these issues.
The core idea in this solution is that you will need a helper application that creates what I call “shadow pages” of the high value targets you want to include in your enterprise search.
Note: I adopted the use of the term “shadow page” based on some informal discussions with co-workers on this topic – I am aware that others use this term in similar ways (though I don’t think it means the exact same thing) and also am aware that some search engines address what they call shadow domains and discourage their inclusion in their search results. If there is a preferred term for the idea described here – please let me know!
What is a shadow page? For my purposes here, I define a shadow page as:
To make this solution work, there are a couple of minimal assumptions about the application. A caveat: I recognize that, while I consider these relatively simple assumptions, it is very likely that some applications will still not be able to meet them and so cannot be exposed via your enterprise search with this type of solution.
Given the description of a shadow page and the assumptions about what is necessary to support it, it is probably obvious how they are used and how they are constructed, but here’s a description:
First – you would use the query that gives you a list of targets (item #2 from the assumptions) from your source application to generate an index page which you can give your indexer as a starting point. This index page would have one link on it for each desirable target’s shadow page. This index page would also have “robots” <meta> tags of “noindex,follow” to ensure that the index page itself is not included as a potential target.
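The index page described above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function name `build_index_page`, the `shadow_url_template` parameter, and the `search-helper.example.com` host are all hypothetical, and the list of target identifiers is assumed to come from whatever query your application supports for listing targets.

```python
from html import escape
from urllib.parse import quote


def build_index_page(target_ids, shadow_url_template):
    """Return HTML for an index page with one link per shadow page.

    target_ids: identifiers returned by the application's "list of
        targets" query (hypothetical input for this sketch).
    shadow_url_template: a URL containing "{target_id}", pointing at
        the helper application that serves shadow pages (an assumption).
    """
    items = []
    for tid in target_ids:
        # URL-encode the identifier, then HTML-escape the finished URL.
        url = shadow_url_template.format(target_id=quote(str(tid), safe=""))
        safe_url = escape(url, quote=True)
        items.append('    <li><a href="{0}">{0}</a></li>'.format(safe_url))
    links = "\n".join(items)
    return (
        "<html>\n"
        "<head>\n"
        "  <title>Shadow page index</title>\n"
        # noindex: keep this page out of results; follow: crawl its links.
        '  <meta name="robots" content="noindex,follow">\n'
        "</head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{links}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(build_index_page(
        ["1001", "1002"],
        "https://search-helper.example.com/shadow?id={target_id}",
    ))
```

You would then give the URL of this generated page to the crawler as its starting point.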
Second – The shadow page for each target (which the crawler reaches thanks to the index page) is dynamically built from the query of the application given the identity of the desirable search target (item #3 from the assumptions). The business rules defining how the desirable target should behave in search help define the necessary query, but the query would need to contain at minimum some of the following data: the name of the target, a description or summary of the target, some keywords that describe the target, a value which will help define the true URL of the actual target (per assumption #1, there must be a way to directly address each target).
The shadow page would be built something like the following:
The overall effect of this is that the search engine will index the shadow page, which has been constructed to ensure good adherence to the principles of enterprise search, and to a searcher, it will behave like a good search target but when the user clicks on it from a search result, the user ends up looking at the actual desired target. The only clue the user might have is that the URL of the target in the search results is not what they end up looking at in their browser’s address bar.
The following provides a simple example of the source (in HTML – sorry for those who might not be able to read it) for a shadow page (the parts that change from page to page are the title, description, and keywords of the target):
<body>
<div style="display:none;">
<h1>title of target</h1>
description of target and keywords of target
</div>
</body>
</html>
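As a rough sketch of how a helper application might generate such a page from the data returned by the per-target query, consider the following Python. Several things here are assumptions on my part rather than part of the example above: the function name `build_shadow_page`, the contents of the `<head>` (a `<title>` plus description and keywords `<meta>` tags), and especially the use of a `<meta http-equiv="refresh">` tag to send the visitor on to the real target. Whether your crawler follows that kind of redirect is something you would need to verify; a script-based redirect is one alternative.

```python
from html import escape


def build_shadow_page(name, description, keywords, target_url):
    """Return HTML for a shadow page describing one search target.

    All four arguments are assumed to come from the application query
    for a single target; the names are hypothetical.
    """
    safe_name = escape(name)
    safe_desc = escape(description)
    safe_keys = escape(", ".join(keywords))
    safe_url = escape(target_url)
    return (
        "<html>\n"
        "<head>\n"
        f"  <title>{safe_name}</title>\n"
        f'  <meta name="description" content="{safe_desc}">\n'
        f'  <meta name="keywords" content="{safe_keys}">\n'
        # Assumption: send a human visitor straight to the real target.
        f'  <meta http-equiv="refresh" content="0; url={safe_url}">\n'
        "</head>\n"
        "<body>\n"
        '  <div style="display:none;">\n'
        f"    <h1>{safe_name}</h1>\n"
        f"    {safe_desc} {safe_keys}\n"
        "  </div>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(build_shadow_page(
        "Q3 Sales Report",
        "Quarterly sales summary",
        ["sales", "q3"],
        "https://crm.example.com/report?id=7",
    ))
```

Escaping every value before it goes into the page matters here, because names and descriptions pulled from a third-party application may well contain characters such as `<` or `&`.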
A few immediately obvious advantages of this approach:
There are also a number of issues that I need to highlight with this approach – unfortunately, it’s not perfect!
There you have it – a solution to the exposure of your high value targets from your enterprise applications that is independent of your search engine and can provide you (the search administrator) with a good level of control over how content appears to your search engine, while ensuring that what is included adheres closely to my principles of enterprise search.
I’ve previously written about the three principles of enterprise search and also about the specific business process challenges I’ve run into again and again with web applications in terms of findability.
Here, I will provide some insights on the specific standards I’ve established to improve findability, primarily within web applications.
When an application is being specified, the application team must ensure that they discuss the following question with business users – What are the business objects within this application and which of those should be visible through enterprise search?
The first part of that question is pretty standard and likely forms the basis for any kind of UML or entity relationship diagram that would be part of a design process for the application. The second part is often not asked, but it forms the basis for what will eventually be the specific targets that will show in search results through the enterprise search.
Given the identification of which objects should be visible in search results, you can then easily start to plan out how they might show up, how the search engine will encounter them, and whether the application might best provide a dynamic index page of links to the entities, support a standard crawl, or perhaps even allow a direct index of the database(s) behind the application.
Basically, the standard here is that the application must provide a means to ensure that a search engine can find all of the objects that need to be visible and also to ensure that the search engine does not include things that it should not.
Some specific things that are included here:
With the standard for Coverage defined, we can be comfortable with knowing that the right things are going to show in search and the wrong things will not show up. How useful will they be as search results, though? If a searcher sees an item in a results list, will they be able to know that it’s what they’re looking for? So we need to ensure that the application addresses the identity principle.
The standard here is that the pages (ASP pages, JSP files, etc) that comprise the desirable targets for search must be designed to address the identity principle – specifically:
Now we know that the search includes what it should and we also know that when those items show in search, they will be identifiable for what they are. How do we ensure that the items show up in search for searches for which they are relevant, though?
The standards to address the relevance issue are:
For a good review of the <meta> tags in HTML pages, you can look at:
The title of this post – “People know where to find that, though!” is a very common phrase I hear as the search analyst and the primary search advocate at my company. Another version would be, “Why would someone expect to find that in our enterprise search?”
Why do I hear this so often? I assume that many organizations, like my own, have many custom web applications available on their intranet and even their public site. It is because of that prevalence, combined with a lack of communication between the Business and the Application team, that I hear these phrases so often.
I have (unfortunately!) lost count of the number of times a new web-based application goes into production without anyone even considering the findability of the application and its content (data) within the context of our enterprise search.
Typically, the conversation seems to go something like this:
What did we completely miss in this discussion? Well, no one in the above process (unfortunately) has explicitly asked the question, “Does the content (data) in this site need to be exposed via our enterprise search?” Nor has anyone even asked the more basic question, “Should someone be able to find this application [the "home page" of the application in the context of a web application] via the enterprise search?”
I’ve seen this scenario play out many, many times in just the last few years here. What often happens next depends on the application but includes many of the following symptoms:
The overall effect is likely that the application does not work well with the enterprise search, or possibly that the application does not hold up to the pressure of the crawler hitting its pages much faster than anticipated (so I end up having to configure the crawler to avoid the application). The end result is yet another set of content that’s basically invisible in search.
Bringing this back around to the title – the response I often get when inquiring about a newly released application is something like, “People will know how to find that content – it’s in this application! Why would this need to be in the enterprise search?”
When I then ask, “Well, how do people know that they even need to navigate to or look in this application?” I’ll get a (virtual) shuffling of feet and shoulder shrugs.
All because of a perpetual lack of asking a few basic questions during the requirements-gathering stage of a project or (another way to look at it) a lack of standards or policies which have “teeth” about the design and development of web applications. The unfortunate thing is that, in my experience, if you ask the questions early, it typically takes only a few hours of a developer’s time to make the application work at least reasonably well with any crawler-based search engine. Because I often don’t find out about an application until after it’s in production, getting changes like this made then becomes a significant obstacle.
I’ll write more in a future post about the standards I have worked to establish (which are making some headway into adoption, finally!) to avoid this.
Edit: I’ve now posted the standards as mentioned above – you can find them in my post Standards to Improve Findability in Enterprise Applications.