Lee Romero

On Content, Collaboration and Findability

Archive for the ‘Enterprise Search’ Category

Search Analytics – Basic Metrics

Tuesday, January 20th, 2009

In my first few posts (about a year ago now), I covered what I call the three principles of enterprise search – coverage, identity, and relevance. I have posted on enterprise search topics a few times in the meantime and wanted to return to the topic with some thoughts to share on search analytics and provide some ideas for actionable metrics related to search.

I’m planning 3 posts in this series – this first one covers some of what I think of as the “basic” metrics, the second will cover some more advanced ideas, and the third will focus on metrics related to the usage of search results (instead of just the searching behavior itself).

Before getting into the details, I also wanted to say that I’ve found a lot of inspiration from the writings and speaking of Louis Rosenfeld and also Avi Rappoport and strongly recommend you look into their writings. A specific webinar to share with you, provided by Louis, is “Site Search Analytics for a Better User Experience“, which Louis presented in a Search CoP webcast last spring. Good stuff!

Now onto some basic metrics I’ve found useful. Most of these are pretty obvious, but I guess it’s good to start at the start.

  • Total searches for a given time period – This is the most basic measure – how much is search even used? This can be useful to help you understand if people are using the search more or less over time.
    • In terms of actionable steps, if you pay attention to this metric over time, it can tell you, at a high level, whether users are finding navigation to be useful. Increasing search usage can point to the need to improve navigation – perhaps indicating the need for a better navigational taxonomy – so look at whether highly-sought content has clear navigation and labeling.
  • Total distinct search terms for a given time period – Of all of the searches you are measuring with the first metric, how many are unique combinations of search criteria (note: criteria may include both user-entered keywords and also something like categories or taxonomy values selected from pick lists if your search supports that)? If you take the ratio of total searches to distinct searches, you can determine the average number of times any one search term is used.
    • In terms of taking action on this, there is not much new to this metric compared to total searches, but the value I find is that it seems to be a bit more stable from period to period.
    • Monitoring the ratio over time is interesting (in my experience, ours tends to run about 1.87 searches / distinct search and variations seem small over time). Not sure what a benchmark should be. Anyone? Understanding and comparing to benchmarks would probably suggest some more concrete actions to take.
  • Total distinct words for a given time period and average words per search – take the previous metric and pull apart individual search terms (or user-selected taxonomic values) and get down to the individual words.
    • This view of the data helps you understand the variety of words in use throughout search. Often, I find that understanding the most common individual words is more useful than the top searches.
    • In terms of action, again, not much new here other than comparing to the total searches to find ways to understand search usage.
    • I’m also interested in whatever benchmarks anyone else knows of in this area – again, I think comparing to benchmarks could be very useful. Just to share from my end, here are what I see (looking at these values week by week over a fairly long period):
      • Average words per search: 2.02. Maximum (of weekly averages) was 2.16 and minimum (of weekly averages) was 1.84 – pretty stable. So, on average, searches use about two words.
      • Average uses of each word (during any given week): 4.95. Maximum (of weekly averages) was 5.69 and minimum (of weekly averages) was 2.93. So a much wider variance than we see in words per search.
  • (The most obvious?) Top N searches for a given time period – I typically look at weekly data and, for this metric, I most commonly look at the top 100 searches and focus on about the top 20. Actions to take:
    • Ensure that common searches return decent results. If a common search does not show good results, what’s causing it to show up as a common search (it would seem that users are unlikely to find what they need)? If it does show what appear to be good results, does this expose specific issues with navigation (as opposed to the general issues observable from the metrics listed above)?
    • If a search shows up that hasn’t been in the top of the list, does that represent something new in your users’ work that they need access to? Perhaps some type of seasonal (annual or maybe monthly) change?
  • Trending of all of the above – More useful than any of the above metrics as single snapshots for a given time period (which is what it seems like many engines will provide out of the box) is the ability to view trends over longer periods. Not just the ability to view the above metrics over longer periods but the ability to see what the metrics were, say, last week and compare those to the week before, and the week before that, etc.
    • I’ve mentioned a few of these, but watching how the number of searches performed each week (or month or quarter) is trending is much more useful than just knowing that data point for any given time period.
    • One of the challenges I’ve had with any of the “Top N” type metrics (searches, words, etc.) is the ability to easily compare and contrast the top searches week to week. Being able to compare, in an easily-comprehended manner, what searches have been popular each week (or month) over, say, a few-month period helps you know whether any particular common search is likely a single spike (and likely not worth spending time on improving results for) or an indication of a real trend (and thus very worthwhile to act on). I have ended up doing a good bit of manual work with data to get this insight – anyone know of tools that make it easier?
  • Top Searches over time – another type of metric I’ve spent time trying to tweak is to understand what makes a “top search over an extended period of time”. This is similar to understanding and reviewing trends over time but with a twist.
    • Let’s say that you gather weekly reports and you have access to the data week by week over a longer period of time (let’s say a year).
    • The question is – over a longer time period, what are the searches you should pay attention to and actively work to improve? What is a “top search”?
    • A first answer is to simply count the total searches over that year and whichever searches were most commonly used are the ones to pay attention to.
    • What I’ve found is that using that definition can lead to anomalous situations where a search that is very popular for one week (but otherwise perhaps doesn’t appear at all) can appear to be a “top search” simply because it was so popular that one week.
      • To address this, what I do is impose a minimum threshold on the number of reporting periods (weeks, in my case) in which a search must rank highly before it is considered a top search for the longer time period. The ratio I use is normally 25% – so a term needs to be a top search in 25% of the weeks being considered to be considered at all. Within that subset of popular searches, you can then count the total searches (see the sketch following this list).
      • Alternately, if you can, massage your data to include the total searches (over the longer time period) and total reporting periods in which the search occurs as two distinct columns and you can sort / filter the data as you wish.
      • The important thing is to recognize that if you’re looking to actively work on improving specific searches, you need to focus your (limited, I’m sure!) time on those searches that warrant your time, not find yourself spending time on a search that only appears as a popular search in one reporting period.
    • On the other hand, a search that might not be a top N search any given week could, if you look at usage over time, be stable enough in its use that over the course of a longer period it would be a top search.
      • This is the inverse of the first issue. In this case, the key issue is that you will need access over longer periods of time to all of the search terms for each reporting period – not just the top searches. Depending on your engine, this data may or may not be available.
  • Another important dimension you should pay attention to when interpreting behavior is seasonality. You should compare your data to the same period a year ago (or quarter ago or maybe month ago, depending on your situation) to see if there are terms that are popular only at particular times.
    • An example on our intranet is that each year, during the week before and the week of the “Take your Kids to Work” program, searches on ‘kids to work’ go through the roof and then disappear again for another year. Also, at the end of each year, you see searches on “holidays” go way up (users looking for information on what dates are company holidays and also about holiday policy).
    • This insight can help you anticipate information needs that are cyclical, which could mean ensuring that new content for the new cycle (say we had a new site for the Kids to Work program each year, though I’m not sure if we do) shows well for searches that users will use to find it.
    • It also helps you understand what might be useful temporary navigation to provide to users for this type of situation. Having a link from your intranet home page to your holiday policies might not be useful all of the time but if you know that people are looking for that in late November and December, placing a link to the policies for that period can help your users find the information they need.
  • Another area of metrics you need to pay attention to is “not found” searches and error searches.
    • What percentage of searches return no results in your reporting periods? How is that changing? If it’s going up, you likely have a problem. If it’s stable, is it higher than it should be?
    • What are the searches that users are most commonly doing that result in no results being found? Focus on those and work to determine whether it’s a content issue (not having the right content) or perhaps a tagging issue (the users are not using expected words to find the content).
    • The action you take will depend on the percentage of not found results and also on the cost of losing users on those failed searches.
      • On an e-commerce site, each potential customer you lose because they couldn’t find what they were looking for represents hard dollars lost.
      • On an intranet, it is harder to directly tie a cost to the not found search but if your percentage is high, you need to address it (improving coverage or tagging or whatever is necessary).
      • A relatively low “not found” percentage might not indicate a good situation – it might also simply reflect a very large corpus of content being included, in which just about any words a user might use will get some kind of result, even if it’s not a useful result. More about that in my next post.
        • I’m not sure what a benchmark is for high or low percentage of not found, exactly. Does anyone know of any resource that might provide that?
        • On our intranet search, this metric has been very stable at around 7-8% over a fairly extended time period. That is not high enough to warrant general concern, though I do look for common searches among these and there do not seem to be any – individual “not found” results are almost always obvious misspellings, and our engine provides spelling correction suggestions, so when a user gets a “no results” page, they likely click the (automatically provided) link to see results with the corrected spelling and no longer end up with the “no results” result.
      • Customizing your search results page for not found searches can be useful, and providing alternate searches (based on the user’s search criteria) is even more useful, though it might be a very challenging effort.
    • What might trigger an “error search” will depend on your engine (some engines may be very good at handling errors and controlling resources so as to effectively never return an error unless the engine is totally offline – in which case, it’s not too likely you’ll capture metrics on searches). Also, whether these are reported in a way that you can act on will depend on your engine. If they are, I think of these as very similar to “not found” searches. You should understand their percentage (and whether it’s going up, down or is stable), what keywords trigger errors, etc. Modify your engine configuration, content or results display as needed to deal with this.
      • An example: With the engine we use, the engine tries to ensure that single searches do not cause performance issues so if a search would return too many results (what is considered “too many” is configurable but it is ultimately limited), it triggers an “error” result being returned to the user. I was able to find the searches that trigger this response and ensure that (hand-picked) items show up in the search results page for any common search that triggers an error.
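
To make a few of these basic metrics concrete, here is a minimal sketch (in Python – not necessarily what your reporting toolchain would use) of computing them from a search log, along with the 25% threshold logic for long-term top searches. The log format (an iterable of raw search phrases per reporting period) and all names here are hypothetical, not taken from any particular engine.

from collections import Counter

def weekly_metrics(searches):
    """Basic metrics for one reporting period; `searches` is an iterable of
    the raw search phrases logged during that period."""
    phrases = Counter(s.strip().lower() for s in searches)
    words = Counter(w for p in phrases.elements() for w in p.split())
    total = sum(phrases.values())
    return {
        "total_searches": total,
        "distinct_searches": len(phrases),
        "uses_per_distinct_search": total / len(phrases),     # runs ~1.87 for us
        "avg_words_per_search": sum(words.values()) / total,  # runs ~2.02 for us
        "avg_uses_per_word": sum(words.values()) / len(words),
        "top_searches": phrases.most_common(100),
    }

def long_term_top_searches(weekly_top_lists, min_weeks_ratio=0.25):
    """A phrase only qualifies as a long-term top search if it was a top
    search in at least `min_weeks_ratio` of the reporting periods."""
    weeks_present, total_uses = Counter(), Counter()
    for top_list in weekly_top_lists:        # one list of (phrase, count) per week
        for phrase, count in top_list:
            weeks_present[phrase] += 1
            total_uses[phrase] += count
    threshold = min_weeks_ratio * len(weekly_top_lists)
    qualifying = ((p, total_uses[p]) for p, wks in weeks_present.items() if wks >= threshold)
    return sorted(qualifying, key=lambda pc: pc[1], reverse=True)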

That’s all of the topics I have for “basic metrics”. Next up, some ideas (along with actions to take from them) on more complex search metrics. Hopefully, you find my recommendations for specific actions you can take on each metric useful (as they do tend to make the posts longer, I realize!).

The future is search enabled applications, not enterprise search

Wednesday, November 5th, 2008

In an exchange in comments on Stephen Arnold’s blog, Stephen states the line that is the title of this post:

“the future is search enabled applications, not enterprise search”

I’m somewhat familiar with Stephen (I’ve seen him speak at a couple of conferences and also have followed his writing on his blog for some time), but I had actually not seen this declaration in the past (though Stephen says he’s accused of saying it too much).

In any event – I find this an interesting claim and I think I would agree with the sentiment but I also think that it depends on how you look at it.  As I wrote previously in trying to lay out what I thought enterprise search is, I think that the key aspects of an enterprise search are that it’s available to all members of the enterprise and that it covers all relevant content.

Down in the details, I do not believe it matters whether access to the enterprise search comes through embedding it in numerous locations or in just one.  In fact, as I wrote previously, embedding access through multiple points is probably ideal – let workers access it within the environment in which they work, regardless of what tool(s) they normally use to do their job.

On the other hand, if the expectation is that you can embed search in single applications and expect that search only within that application is sufficient, I do not think that is now or will in the future be sufficient.  The information needs for any organization are diverse enough that no one application can realistically handle all of them – email, document management, CRM, support knowledge bases, intranets, policies, etc.

Thoughts?

People Search – A Fourth Generation Proof of Concept – Part 2: The Design

Monday, November 3rd, 2008

In my last post, I described the goals I have tried to achieve with my proof of concept people search function. Here I will describe the design and implementation of this proof of concept.

Designing the Solution

Given the goals above, here’s the general outline of the design for this solution:

  • It would be built as a web application that generates a “profile page” for each worker – it is the set of all such profile pages that comprise the targets for a search engine to index.
  • Combined with a search engine (probably any search engine capable of indexing web pages would be sufficient – I used QuickFinder), it becomes trivial to integrate the search of these profiles into your enterprise search to provide a fourth generation solution to people search.
  • The core tenet of the data used is that I wanted to identify a set of activities for workers. The aggregation of keywords related to those activities is then used to generate a profile for a worker.
  • An activity could potentially be anything that represents an event, action, writing, task, assignment, etc., that is associated with the worker.
  • Some examples of activities might include: edit of a wiki article, assignment of a task in an online workspace, posting of a message in a discussion forum, membership in a project team, publishing a document in a corporate repository, posting an email to a mailing list, and so on.

Initially the web application directly queried the various systems used as sources when generating a profile for a worker. That is not scalable and also limits the amount of processing you can do, so I designed a simple SQL database to contain the data for this (implemented in MySQL). This database is essentially a data mart of worker data. The primary tables are:

  • worker (one row for each worker); this table contains the basic administrative data for a worker (it’s effectively a mirror of the organization’s corporate directory)
  • activity_source (each row describes a single source of activity which a worker might produce)
  • activity (one row for each individual “activity” associated with a worker); an activity must have a “description” – typically the title of an item or the subject of an email, etc.
  • From these tables, a few additional tables are generated by processing the data from the activity table
    • activity_keyword (contains a row for each keyword associated with an activity); a keyword is either any (individual) word from the description of the activity or a piece of metadata associated with the item (for systems which support such);
    • worker_top_keyword (aggregates the individual keywords associated with a worker [by association from activity_keyword through activity to the worker table]) so it’s easy to identify the top keywords for a worker without doing aggregation queries; each keyword in this table is weighted (see the description below of weights); I think of the set of keywords in this table for a worker to be that worker’s “attributes” (a sketch of this aggregation follows this list)
    • worker_connection (aggregates “linkage” between workers based on similarity of their keyword profiles); more on this later.
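
As a rough illustration of the aggregation that fills worker_top_keyword, here is a minimal sketch in Python (the real implementation would more likely be a SQL aggregation query against the tables above). The per-source weights it references are described under “Additional Design Considerations” below.

from collections import defaultdict

def build_worker_top_keywords(activity_keyword_rows, source_weights):
    """activity_keyword_rows: (worker_id, activity_source_id, keyword) tuples,
    already joined from activity_keyword through activity to worker.
    Returns {worker_id: {keyword: weight}} -- the worker's "attributes"."""
    profiles = defaultdict(lambda: defaultdict(float))
    for worker_id, source_id, keyword in activity_keyword_rows:
        # each occurrence adds the (administrator-assigned) weight of its source
        profiles[worker_id][keyword] += source_weights.get(source_id, 1)
    return profiles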

With the implementation of this database, I also implemented a synchronization tool that updates the data in the tables from the source systems for the various types of activities.

By automatically pulling data from these source systems (which workers use in their regular day-to-day work), you remove the need for the workers to maintain data.

  • By simply doing their job and “leaving traces” of that work, workers generate the data necessary for generating this profile. This achieves goal #2.
  • By restricting the set of data sources used to ones which anyone could examine for a worker’s activities (for example, I can view the history of a Wiki article and see who has edited it), I achieve goal #3.
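
Returning to the synchronization tool: here is a minimal sketch of its shape, assuming each source system can be wrapped in an adapter function that yields (worker_id, description) activity rows. The column list is simplified from the schema above, and the adapter interface is my own invention for illustration.

def sync_all_sources(db, adapters):
    """adapters maps an activity source name to a function yielding
    (worker_id, description) rows pulled from that system (wiki edits,
    forum posts, task assignments, ...)."""
    for source_name, fetch_activities in adapters.items():
        for worker_id, description in fetch_activities():
            db.execute(
                "INSERT INTO activity (worker_id, source, description) VALUES (?, ?, ?)",
                (worker_id, source_name, description),
            )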

Now, how should the profile page for a worker be presented?

Initially, I put together a design that did two things: 1) provided a typical employee directory style layout of my administrative details and 2) provided a list of all of the activities for a worker, grouped by activity source. In other words, you would see a list of all of the Wiki articles edited by the worker, a list of mailing list memberships, a list of community memberships, project team memberships, task assignments, etc. Each activity source’s list would be separately displayed (in a simple bulleted list). (Before this would go into production, I always have assumed I would ask for some design help from our electronic marketing group to give it a more professional look, but I thought the bulleted list worked perfectly well functionally.)

This proved simple and effective and also enabled the profile page to provide direct links to those activities that are addressable via a link (for example, it could link directly to a Wiki article I’ve edited, to each discussion post, etc.)

However, this approach suffered from at least two problems: 1) it lacked an immediately obvious visual presentation of a worker’s attributes, and 2) it exposed every detailed activity of a worker to anyone who viewed the profile (I found when I demoed this to people, some had the immediate reaction of, “Wow – anyone can see all of these details? I’m not sure I like that!” – a reaction that surprised me given that any of the details are generally visible to anyone who wants to look, but go figure).

After looking for alternatives, I found that the keywords for a worker (when combined with their weights) provided good input for a tag cloud – which is what I ended up using as the default presentation of a worker’s keywords (visible to everyone). This helps to highlight what someone is “about”, presents a generally attractive visualization of the data, and, if the default view of a worker displays this tag cloud (and the worker’s administrative data) and does not show all of the details, it alleviates the concern mentioned above.

I have found the implementation of the tag cloud to be the trigger that pulls people into this tool – it helps satisfy my goal #5 because, for most people who have looked at this, it provides immediate validation when they see words they expect to see in their own tag cloud.
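
For the curious, here is a minimal sketch of how such a tag cloud can be rendered from a worker’s weighted keywords: take the top keywords by weight and scale each one’s font size linearly between a smallest and largest size. The markup and size range are arbitrary choices for illustration, not what a production design would necessarily use.

def tag_cloud_html(keyword_weights, n=50, min_px=10, max_px=32):
    """keyword_weights: {keyword: weight}; returns simple <span> markup."""
    if not keyword_weights:
        return ""
    top = sorted(keyword_weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
    lo, hi = min(w for _, w in top), max(w for _, w in top)
    spread = (hi - lo) or 1
    return " ".join(
        f'<span style="font-size:{min_px + (max_px - min_px) * (w - lo) / spread:.0f}px">{kw}</span>'
        for kw, w in sorted(top)  # display alphabetically, size by weight
    )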

Here’s a shot of what part of my profile page looks like (partially obscured):

Lee Romero Profile

Additional Design Considerations

I wanted to keep the initial proof of concept simple in order to try to test different ways of using the data from the activity sources. With that in mind, here are some details on how I’ve done this so far:

  • When parsing the text associated with an activity into “keywords”, I took the simplest approach I could: the words from an activity are split into separate words when any non-alphanumeric is found. So a string like “content-management infrastructure” would result in 3 keywords: content, management and infrastructure.
  • I also removed any words that are stop words in our search engine.
  • Each keyword for a worker is assigned a weight. Simplistically, the weight of a keyword is the number of times that keyword shows up in that worker’s stream of activities.
  • However, the tool that maintains the keywords allows an administrator to assign a weight to each activity source – so some sources can be given an artificial boost just by assigning a weight for that activity source higher than 1. The only source whose weight I’ve really toyed with so far is the corporate directory itself – I have given that a weight of 20 instead of 1.
  • The weights for keywords are used in two ways:
    • The top 50 keywords (by weight) for a worker are used in the tag cloud for that worker. The weight is then used to size the words in the tag cloud.
    • When the “keywords” <meta> tag is being computed for a worker’s profile, the keywords are sorted by weight and included until the length of the tag’s content attribute exceeds 250 characters. This means that the top keywords are the ones which will give the worker higher relevance for searches on those words (see the sketch after this list).
  • Because all workers will have, at absolute minimum, the same details in this profile as they would in the corporate directory, and because the keywords from that activity source are given extra weight, those keywords will almost certainly be in the “keywords” <meta> tag for their profile – this helps satisfy my goal #6 by ensuring good relevance when people search on worker’s administrative data (first name, last name, etc.)
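
Pulling the items in this list together, here is a minimal sketch of the parsing, weighting and “keywords” <meta> tag logic. The stop word list is a stand-in for whatever your search engine actually uses, and the weights passed in are assumed to already include any per-source boost.

import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # stand-in list

def parse_keywords(text):
    # "content-management infrastructure" -> ["content", "management", "infrastructure"]
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", text)
            if w and w.lower() not in STOP_WORDS]

def keywords_meta_content(weighted_keywords, limit=250):
    """Sort keywords by weight and append them until the content string
    first exceeds `limit` characters."""
    ordered = sorted(weighted_keywords.items(), key=lambda kv: kv[1], reverse=True)
    content = ""
    for keyword, _weight in ordered:
        content = f"{content} {keyword}".strip()
        if len(content) > limit:
            break
    return content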

Some additional functions I have layered on top of the basic profile / search mechanism that I believe will make this a valuable solution:

  • The keywords in the tag cloud are links to pages that provide details about that keyword. When a user clicks on a keyword in a tag cloud, they are presented with a tag cloud of keywords related to their starting keyword (related by way of people who have the keywords in common). In other words, it provides a set of keywords that have a lot in common with their starting keyword. The “keyword profile” page also provides a list of workers who use the selected keyword (the list is sorted by keyword weight).
  • When you view a worker, you are also presented with a list of workers who are “similar to” the worker you are looking at – the similarity measure is the percentage of overlap between the current worker’s profile (weighted keywords) and the other workers’ profiles (one possible measure is sketched after this list). This provides a way to explore a neighborhood of similar people.
  • In addition to the list of similar workers, a link is provided for each worker which, when clicked, displays a page explaining why the two workers are similar.
  • Almost all of the data sources have a date threshold applied to the data pulled from the source – most of them take data from the last year. This ensures that the data used to build a profile is effectively self-maintaining.
  • Each worker has control over whether others can see all of the details (the individual activities) in their profile. By default, only the tag cloud and administrative data is visible. A worker can opt in to allow others to see their entire profile.
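
Here is a minimal sketch of one plausible reading of that similarity measure – the fraction of one worker’s total keyword weight that also appears in the other worker’s profile. The actual measure could certainly be computed differently.

def profile_similarity(profile_a, profile_b):
    """profile_a, profile_b: {keyword: weight} dicts (from worker_top_keyword).
    Returns the fraction of profile_a's weight that overlaps profile_b."""
    total = sum(profile_a.values())
    if not total:
        return 0.0
    shared = sum(min(w, profile_b.get(kw, 0.0)) for kw, w in profile_a.items())
    return shared / total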

Issues / Future Directions

The proof of concept has been very interesting to work through and has presented me with some (subjective) proof of the value of this approach, as simple as it is. That being said, there are some issues and additional areas I hope are explored in the future:

  • This is a proof of concept built as basically a skunkworks project – I am hoping it will officially get some sponsorship and be launched into production.
  • I would like to see it integrated with additional data sources – currently, it uses 12 data sources but some high value sources that are not included would be our CRM system and our HR system. With the sources currently in use, the profiles that look sufficiently detailed tend to skew toward the people who use those sources. Integrating these is relatively easy – a single SQL query from the source system that provides a list of activities for workers (where the source system can define whatever it wants to represent activities) is all that’s needed. It is this ease of adding in sources that achieves my goal #4.
  • I believe there is still a lot of work to do around tweaking the weights of activity sources to balance out the effects of various sources.
  • I would like to see some exploration of workers directly tagging other workers (to add keywords) or possibly allowing workers to give a thumbs up / thumbs down to individual keywords in a profile for a worker. This would add a powerful way for people to influence their own and others’ profiles.
  • This approach also needs to receive more testing from others to validate its effectiveness. I have had a few dozen people look at it and provide feedback but some more quantitative approach to this would be valuable.
  • I think this profile for a worker could be presented in a FOAF format as well – I’m not sure if that provides additional value, but it is a path to explore.
  • The algorithm for parsing out keywords from the activities could be improved beyond the very simplistic parsing applied now.
  • And, finally, I think that the measurement of similarity between workers could be significantly improved, and the data from the links between workers embedded in this could be used to do some research to find “invisible communities” within the company – a kind of organizational network analysis through data mining.

People Search – A Fourth Generation Proof of Concept – Part 1: The goals

Friday, October 31st, 2008

I have previously described what I termed the various generations of solutions to the common challenge of workers connecting with or finding co-workers within an enterprise.  My most recent post described the fourth generation solution – which enables users to search and connect using much more than simple administrative terms (name, email, address, etc.) for the search.

Over my next couple of posts, I will provide a write-up of a proof of concept implementation I’ve assembled that meets a lot of the need for this with what I believe to be relatively minimal investment.

The following represent the goals I’ve set for myself in this proof of concept:

  1. Demonstrate the usefulness of a people search based on attributes of workers other than purely administrative data – things like their skills, competencies, work, interests, etc.
  2. Demonstrate the feasibility of discerning the skills, competencies, work and/or interests through a means that does not depend on maintenance of data by the worker (which, from my experience, is not long-term maintainable).
    1. More specifically, provide a test bed to explore different algorithms for discovering important keywords for people.
  3. Demonstrate the feasibility of discovering keywords using only data that is generally “publicly visible” within an enterprise.
  4. Provide a path for integrating manually-maintained skills data (if that were to be collected), or any other data (possibly including tags applied by co-workers as seen in IBM’s Dog Ear project).
  5. Provide a compelling user experience that draws people in and gives people a visual presentation of what another person is “about” (what describes them).
  6. Provide a solution that provides, at minimum, the equivalent of a 3rd generation solution (in other words you can find that worker based on their administrative data).

Also, I wanted to say that part of the inspiration for this proof of concept came from a session I attended at Enterprise Search Summit 2007 as presented by Trent Parkhill.  In his session, he described a mechanism where submissions to a company’s repository would be tagged with the names of participants in the project that produced the document as a deliverable.  Then, when users were searching for content, there was a secondary search that produced a list of people associated with the terms and / or documents found by the user’s search.  I’ve kind of turned that around and treated the people as being tagged by the keywords of the items they produce.

In my next post, I will describe the overall design of my proof of concept.

Enterprise Search and Third-Party Applications

Tuesday, October 28th, 2008

Or, in other words, “How do you apply the application standards to improve findability to applications built by third-party providers who do not follow your standards?”

I’ve previously written about the standards I’ve put together for (web-based) applications that help ensure good findability for content / data within that application. These standards are generally relatively easy to apply to custom applications (though it can still be challenging to get involved with the design and development of those applications at the right time to keep the time investment minimal, as I’ve also previously written about).

However, it can be particularly challenging to apply these standards to third-party applications – for example, your CRM application, your learning management system, or your HR system. Applying the existing standards could take a couple of different forms:

  1. Ideally, when your organization goes through the selection process for such an application, your application standards are explicitly included in the selection criteria and used to ensure you select a solution that will conform to your standards.
  2. More commonly, you will assess compliance with the standards (perhaps during selection, but perhaps later, during implementation) and you might need to implement some type of customization within the application to achieve compliance.
  3. Sometimes, though, you identify non-compliance with the standards – during selection or later – but find you can not customize the application, and you need a different solution.

The rest of this post will discuss a solution for option #3 above – how you can implement a different solution. Note that some search engines will provide pre-built functionality to enable search within many of the more common third party solutions – those are great and useful, but what I will present here is a solution that can be implemented independent of the search engine (as long as the search engine has a crawler-based indexing function) and which is relatively minimal in investment.

Solving the third-party application conundrum for Enterprise Search

So, you have a third party application and, for whatever reason, it does not adhere to your application standards for findability. Perhaps it fails the coverage principle and it’s not possible to adequately find the useful content without getting many, many useless items; or perhaps it’s the identity principle and, while you can find all of the desirable targets, they have redundant titles; or it might even be that the application fails the relevance principle and you can index the high value targets and they show up with good names in results but they do not show up as relevant for keywords you would expect. Likely, it’s a combination of all three of these issues.

The core idea in this solution is that you will need a helper application that creates what I call “shadow pages” of the high value targets you want to include in your enterprise search.

Note: I adopted the use of the term “shadow page” based on some informal discussions with co-workers on this topic – I am aware that others use this term in similar ways (though I don’t think it means the exact same thing) and also am aware that some search engines address what they call shadow domains and discourage their inclusion in their search results. If there is a preferred term for the idea described here – please let me know!

What is a shadow page? For my purposes here, I define a shadow page as:

  • A page which uniquely corresponds to a single desirable search target;
  • A page that has a distinct, unique URL;
  • A page that has a <title> and description that reflects the search target of which it is a shadow, and that title is distinct and provides a searcher who sees it in a search results page with insight about what the item is;
  • A page that has good metadata (keywords or other fields) that describe the target using terminology a searcher would use;
  • A page which contains text (likely hidden) that reflects all of the above, to enhance relevance for the words in the title, keywords, etc.;
  • A page which, when accessed, will automatically redirect a user to the page of which the page is a shadow.

To make this solution work, there are a couple of minimal assumptions of the application. A caveat: I recognize that, while I consider these as relatively simple assumptions, it is very likely that some applications will still not be able to meet these and so not be able to be exposed via your enterprise search with this type of solution.

  1. Each desirable search target must be addressable by a unique URL;
  2. It should be possible to define a query which will give you a list of the desirable targets in the application; this query could be an SQL query run against a database or possibly a web services method call that returns a result in XML (or probably other formats, but these are the most common in my experience);
  3. Given the identity (say, a primary key if you’re using a SQL database of some type) of a desirable search target, you must be able to also query the application for additional information about the search target.

Building a Shadow Page

Given the description of a shadow page and the assumptions about what is necessary to support it, it is probably obvious how they are used and how they are constructed, but here’s a description:

First – you would use the query that gives you a list of targets (item #2 from the assumptions) from your source application to generate an index page which you can give your indexer as a starting point.  This index page would have one link on it for each desirable target’s shadow page.  This index page would also have “robots” <meta> tags of “noindex,follow” to ensure that the index page itself is not included as a potential target.

Second – The shadow page for each target (which the crawler reaches thanks to the index page) is dynamically built from the query of the application given the identity of the desirable search target (item #3 from the assumptions).  The business rules defining how the desirable target should behave in search help define the necessary query, but the query would need to contain at minimum some of the following data: the name of the target, a description or summary of the target, some keywords that describe the target, a value which will help define the true URL of the actual target (per assumption #1, there must be a way to directly address each target).

The shadow page would be built something like the following:

  • The <title> tag would be the name of the target from the query (perhaps plus an application name to provide context)
  • The “description” <meta> tag would be the description or summary of the target from the query, perhaps plus a few static keywords that help ensure the presence of additional insight about the target.   For example, if the target represents a learning activity, the additional static text might indicate that.
  • The “keywords” <meta> tag would include the keywords from the query, plus some static keywords to ensure good coverage.  To follow the previous example, it might be appropriate to include words like “learning”, “training”, “class”, etc. in a target that is a learning activity to ensure that, if the keywords for the specific target do not include those words, searchers can still find the shadow page target in search.
  • The <body> of the page can be built to include all of the above text – from my experience, wrapping the body in a CSS style that visually hides the text keeps the text from actually appearing in a browser.
  • Lastly, the shadow page has a bit of JavaScript in it that redirects a browser to the actual target – this is why you need to have the target addressable via a URL and also why the query needs to provide the information necessary to create that URL.  Most engines (I know of none that can) will not execute the JavaScript, and so will not know that the page is really a redirect to the desired target.

The overall effect of this is that the search engine will index the shadow page, which has been constructed to ensure good adherence to the principles of enterprise search, and to a searcher, it will behave like a good search target but when the user clicks on it from a search result, the user ends up looking at the actual desired target.  The only clue the user might have is that the URL of the target in the search results is not what they end up looking at in their browser’s address bar.

The following provides a simple example of the source (in HTML – sorry for those who might not be able to read it) for a shadow page (the parts that change from page to page are in bold):

<html>
<head>
<TITLE>title of target</TITLE>
<meta name="robots" content="index, nofollow">
<meta name="keywords" content="keywords for target">
<meta name="description" content="description of target">
<script type="text/javascript">
document.location.href="URL of actual target";
</script>
</head>
<body>
<div style="display:none;">
<h1>title of target</h1>
description of target and keywords of target
</div>
</body>
</html>
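
To tie the pieces together, here is a minimal sketch (in Python) of generating the index page and the shadow pages from the query results. The fetch of targets and the row fields used here (id, name, description, keywords, url) are hypothetical stand-ins for the application-specific queries from assumptions #2 and #3.

from html import escape

SHADOW_TEMPLATE = """<html>
<head>
<title>{title}</title>
<meta name="robots" content="index, nofollow">
<meta name="keywords" content="{keywords}">
<meta name="description" content="{description}">
<script type="text/javascript">
document.location.href="{url}";
</script>
</head>
<body>
<div style="display:none;">
<h1>{title}</h1>
{description} {keywords}
</div>
</body>
</html>"""

def shadow_page(target):
    return SHADOW_TEMPLATE.format(
        title=escape(target["name"]),
        keywords=escape(target["keywords"]),
        description=escape(target["description"]),
        url=target["url"],
    )

def index_page(targets):
    # the crawler's starting point: noindex so the list itself never shows
    # as a search result, follow so the shadow pages get crawled
    links = "\n".join(f'<li><a href="shadow/{t["id"]}.html">{escape(t["name"])}</a></li>'
                      for t in targets)
    return ('<html><head><meta name="robots" content="noindex, follow"></head>'
            f"<body><ul>\n{links}\n</ul></body></html>")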

Advantages of this Solution

A few things that are immediately obvious advantages of this approach:

  1. First and foremost, with this approach, you can provide searchers with the ability to find content which otherwise would be locked away and not available via your enterprise search!
  2. You can easily control the targets that are available via your enterprise search within the application (potentially much easier than trying to figure out the right combination of robots tags or inclusion / exclusion settings for your indexer).
  3. You can very tightly control how a target looks to the search engine (including integration with your taxonomy to provide elaborated keywords, synonyms, etc)

Problems with this Solution

There are also a number of issues that I need to highlight with this approach – unfortunately, it’s not perfect!

  1. The most obvious issue is that this depends on the ability to query for a set of targets against a database or web service of some sort.
    1. Most applications will be technically able to support this, but in many organizations, this could present too great a risk from a data security perspective (the judicious use of database views and proper management of read rights on the database should solve this, however!)
    2. This potentially creates too high a level of dependence between your search solution and the inner workings of the application – an upgrade of the application could change the data schema enough to break this approach.  Again, I think that the use of database views can solve this (by abstracting away the details of the implementation into a single view which can be changed as necessary through any upgrade).
  2. Some applications may simply not offer a “deep linking” ability into high value content – there is no way to uniquely address a content item without the context of the application.  This solution can not be applied to such applications.  (Though my opinion is that such applications are poorly designed, but that’s another matter entirely!)
  3. This solution depends on JavaScript to forward the user from the shadow page to the actual target.  If your user population has a large percentage of people who do not use JavaScript, this solution fails them utterly.
  4. This solution depends on your search engine not following the JavaScript or somehow otherwise determining that the shadow page is a very low quality target (perhaps by examining the styles on the text and determining the text is not visible).  If you have a search engine that is this smart, hopefully you have a way to configure it to ignore this for at least some areas or page types.
  5. Another major issue is that this solution largely circumvents a search engine’s built in ability to do item-by-item security as the target to the search engine is the shadow page.  I think the key here is to not use this solution for content that requires this level of security.

Conclusion

There you have it – a solution to the exposure of your high value targets from your enterprise applications that is independent of your search engine and can provide you (the search administrator) with a good level of control over how content appears to your search engine, while ensuring that what is included highly adheres to my principles of enterprise search.

Standards to Improve Findability in Enterprise Applications

Thursday, October 23rd, 2008

I’ve previously written about the three principles of enterprise search and also about the specific business process challenges I’ve run into again and again with web applications in terms of findability.

Here, I will provide some insights on the specific standards I’ve established to improve findability, primarily within web applications.

As you might expect, these standards map closely to the three principles of enterprise search and so that’s how I will discuss them.

Coverage

When an application is being specified, the application team must ensure that they discuss the following question with business users – What are the business objects within this application and which of those should be visible through enterprise search?

The first question is pretty standard and likely forms the basis for any kind of UML or entity relationship diagram that would be part of a design process for the application. The second part is often not asked but it forms the basis for what will eventually be the specific targets that will show in search results through the enterprise search.

Given the identification of which objects should be visible in search results, you can then easily start to plan out how they might show up, how the search engine will encounter them, whether the application might best provide a dynamic index page of links to the entities or support a standard crawl or perhaps even a direct index of the database(s) behind the application.

Basically, the standard here is that the application must provide a means to ensure that a search engine can find all of the objects that need to be visible and also to ensure that the search engine does not include things that it should not.

Some specific things that are included here:

  • The entities that need to show up in search results should be visible as an individual target, addressable via a unique and stable URL. This ensures that when an item shows up in a set of search results, a searcher will see an entity that looks and behaves like what they want – if they’re looking for a document, they see that document and not a page that links to that document.
  • The application should have a strategy for the implementation of “robots” meta tags – pages that should not be indexed should have a “noindex”. Pages that are navigational (and not destinations themselves for search) should be marked “noindex”. Pages that provide navigation to the items through various options (filters, sorting, etc) may need to have “nofollow” as well, so that a crawler does not get hung up looking at multitudes of various pages, all of which are marked “noindex” anyway.
  • The application should not be frame-based. This is a more general standard for web applications, but frame-based applications consistently cause problems for crawlers, as a crawler will index the individual frames but those individual frames are not, themselves, useful targets.
  • To simplify things for a search engine, an application can simply provide an index page that directly links to all desired objects that should show up in search; I’ve found this to be very useful and can be much simpler than working through the logic of a strategy for robots tags to ensure good coverage. This index page would be marked “noindex, follow” for its robot tags so that it, itself, is not indexed (otherwise it might show up as a potential result for a lot of searches if, say, the title of the items are included in this index page).
  • Note that it is possible that for some applications, the answer to the leading question for this may be that nothing within the application is intended to be found via an enterprise search solution. That might be the case if the application provides its own local search function and there is no value in higher visibility (or possibly if the cost of that higher visibility is too high – say in the case that the application provides sophisticated access control which might be hard to translate to an enterprise solution).

Identity

With the standard for Coverage defined, we can be comfortable with knowing that the right things are going to show in search and the wrong things will not show up. How useful will they be as search results, though? If a searcher sees an item in a results list, will they be able to know that it’s what they’re looking for? So we need to ensure that the application addresses the identity principle.

The standard here is that the pages (ASP pages, JSP files, etc) that comprise the desirable targets for search must be designed to address the identity principle – specifically:

  • Each page that shows a search target must dynamically generate a <title> tag that clearly describes what it shows.
  • An application should also adopt a standard for how it identifies where the content / data is (the application name perhaps) as well as the content-specific name.
  • Within our infrastructure, a standard like, “<application name>: <item name>” has worked well.
  • In addition, each page that shows a search target must dynamically generate a “description” <meta> tag. This description can (and for our search does) be used as part of the results snippet displayed in a search results page, so it can provide a searcher important clues before the searcher even clicks on a target.
  • The application team should develop a strategy for what to include in the “description”:
    • In many applications, each item of interest will typically have some kind of user-entered text that can be interpreted as a description or which could be combined with some static text to make it so.
    • For example, an entity might have a name (used in the <title> tag) and something referred to as the “summary” or “subject” or maybe “description” – simply use that text.
    • Alternately, the “description” might be generated as something like, “The help desk ticket <ticket ID> named <ticket name>”, for a page that might be part of a help desk ticket application.

Relevance

Now we know that the search includes what it should and we also know that when those items show in search, they will be identifiable for what they are. How do we ensure that the items show up in search for searches for which they are relevant, though?

The standards to address the relevance issue are:

  • Follow the standard above for titles (the words in the <title> tag will normally significantly boost relevancy for searches on those words regardless of your search engine)
  • Each page that shows a search target must dynamically generate a “keywords” <meta> tag.
  • The application team should devise a strategy for what would be included in the keywords, though some common concepts emerge:
    • Any field that a user can assign to the entity would be a candidate – for example, if a user can select a Product with which an item is associated or a geography, an industry, etc. All of those terms are good candidates for inclusion in keywords
    • While redundant, simply using the title of the item in the keywords can be useful (and reinforce the relevance of those words)
    • If an application integrates with a taxonomy system (specifically, a thesaurus) any taxonomic tags assigned to an entity should be included.
    • In addition, for a thesaurus, if the content will be indexed by internet search engines, directly including synonyms for taxonomic terms in the keywords can sometimes help – you might also include those synonyms directly in your own search engine’s configuration but you can’t do that with a search engine you don’t control. (Many internet search engines no longer consider the contents of these tags due to spamming in them but these can’t hurt even then.)
  • The application may also generate additional <meta> tags that are specific to its needs. When integrated with a taxonomy that has defined facets, including a <meta> tag with the name for each facet and the assigned values can improve results.
    • For example, if the application allows assignment of a product, it can generate a tag like: <meta name="product" content="<selected values>"/>
    • Some search engines will allow searching within named fields like this – providing a combination of full text search and fielded search ability. (A sketch pulling these tagging standards together follows this list.)
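
As an illustration of these standards in practice, here is a minimal sketch of a helper that generates the tags called for by the identity and relevance standards above. The entity fields (name, summary, categories, product) and the application name are hypothetical examples, not a prescribed schema.

from html import escape

def findability_head(app_name, item, taxonomy_terms=()):
    title = f"{app_name}: {item['name']}"  # the "<application name>: <item name>" pattern
    keywords = ", ".join([item["name"], *item.get("categories", []), *taxonomy_terms])
    tags = [
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{escape(item["summary"])}">',
        f'<meta name="keywords" content="{escape(keywords)}">',
    ]
    if "product" in item:
        # a facet-specific tag, for engines that support fielded search
        tags.append(f'<meta name="product" content="{escape(item["product"])}">')
    return "\n".join(tags)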


People Search and Enterprise Search, Part 3 – The Fourth Generation

Monday, October 20th, 2008

So we get to the exciting conclusion of my essays on the inclusion of employees in enterprise search. If you’ve read this far, you know how I have characterized the first and second generation solutions and also provided a description of a third generation solution (which included some details on how we implemented it).

Here I will describe what I think of as a fourth generation solution to people finding within the enterprise. As I mentioned in the description of the third generation solution, one major omission still at this point is that the only types of searches with which you can find people is through administrative information – things like their name, address, phone number, user ID, email, etc.

This is useful when you have an idea of the person you’re looking for or at least the organization in which they might work. What do you do when you don’t know the person and may not even know the organization in which they work? You might know the particular skills or competencies they have but that may be it. This is particularly problematic in larger organizations or organizations that are physically very distributed.

The core idea with this type of solution is to provide the ability to find and work with people based on aspects beyond the administrative – the skills of the people, their interests, perhaps the network of people with which they interact, and more. While this might be a simplification, I think of this as expertise location, though that, perhaps, most cleanly fits into the first use case described below.

Some common use cases for this type of capability include:

  • Peer-to-peer connections – an employee is trying to solve a particular problem and they suspect someone in the company may have some skills that would enable them to solve the problem more quickly. Searching using those skills as keywords would enable them to directly contact relevant employees.
  • Resource planning – a consulting organization needs to staff a particular project and needs to find specific people with a particular skill set.
  • Skill assessment – an organization needs to be able to ascertain the overall competency of their employees in particular skill sets to identify potential training programs to make available.

This capability is something that has often been discussed and requested at my current employer, but which no one has really been willing to sponsor. That being said, I know there are several vendors with solutions in this space, including (at least – please share if you know of others):

  • Connectbeam – A company I first found out about at KM World 2007. They had some interesting technology on display that combines expertise location with the ability to visualize and explore social networks based on that expertise. Their product could digest content from a number of systems to automatically discern expertise.
  • ActiveNet – A product from Tacit Software, which (at a high level) is similar to Connectbeam. An interesting twist to this product is that it leaves the individuals whose expertise are managed in the system in control of how visible they are to others. In the discussions I’ve had with this company about the product, I’ve always had the impression that, in part, this provides a kind of virtual mailing list functionality where you can contact others (those with the necessary expertise) by sending an email without knowing who it’s going to. Those who receive it can either act on it or not and, as the sender, you only know who replies.
  • Another product about which I only know a bit is from a company named Trampoline Systems. I heard about them as I was doing some research on how to tune a prototype system of my own and understand that their Sonar platform provides similar functionality.
  • [Edit: Added this on 03 November, 2008] I have also found that Recommind provides expertise location functionality – you can read more about it here.
  • [Edit: Added this on 03 November, 2008] I also understand that the Inquira search product provides expertise location, though it’s not entirely clear to me from what I can find about this tool how it does this.

A common aspect of these is that they attempt to (and perhaps succeed in) automating the process of expertise discovery. I’ve seen systems where an employee has to maintain their own skill set and the problem with these is that the business process to maintain the data does not seem to really embed itself into a company – inevitably, the data gets out of date and is ill-maintained and so the system does not work.

I can not vouch for the accuracy of these systems but I firmly believe that if people search in the enterprise is going to meet the promise of enabling people to find each other and connect based on of-the-moment needs (skills, interests, areas of work, etc), it will be based on this type of capability – automatically discovering those aspects of a worker based on their work products, their project teams, their work assignments, etc.

I imagine that in the not-too-distant future, as we see more “web 2.0” functionality merge into the enterprise, this type of capability will become expected and welcomed – it will be exciting to see how people work together then.

This brings to a close my discussion of the various types of people search within the enterprise. I hope you’ve found this of interest. Please feel free to let me know if you think I have any omissions or misstatements in here – I’m happy to correct and/or fill in.

I plan another few posts that discuss a proof of concept I have put together based around the ideas of this fourth generation solution – look for those soon!

People Search and Enterprise Search, Part 2 – A third generation solution

Wednesday, October 15th, 2008

In my last post, I wrote about what I termed the first and second generation solutions to people search in the enterprise. This time, I will describe what I call a “third generation” solution – one that integrates people search with your enterprise search solution.

This is the stage of people search in use within my current employer’s enterprise.

What is the third generation?

What I refer to as a third generation solution for people search is one where an employee’s profile (their directory entry, i.e., the set of information about a particular employee) becomes a viable and useful target within your enterprise search solution. That is, when a user performs a search using the pervasive “search box” (you do have one, right?), they should be able to expect to find their fellow workers in the results (obviously, depending on the particular terms used in the search) along with any other content that matches.

You remove the need for a searcher to know they need to look in another place (another application, i.e., the company’s yellow pages) and, instead, reinforce the primacy of the single search experience that brings together everything a worker needs to do their job.
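As a sketch of what this can look like in practice, the simplest approach is to flatten each directory entry into a document your engine can ingest – one full-text body plus a few stored fields. The field names below are illustrative assumptions, not the schema of any particular directory or engine:

```python
# A minimal sketch of flattening a directory entry into a document an
# enterprise search engine can ingest. Field names are illustrative
# assumptions - adapt to your directory schema and engine's ingestion API.
def profile_to_search_document(entry):
    """Flatten one directory entry into an indexable document."""
    full_text = " ".join(
        str(entry.get(field, ""))
        for field in ("givenName", "sn", "title", "department",
                      "mail", "telephoneNumber", "l")
    )
    return {
        "id": entry["uid"],  # stable key so re-crawls update, not duplicate
        "title": f'{entry["givenName"]} {entry["sn"]} - {entry.get("title", "")}',
        "body": full_text,   # searched as ordinary full text
        "type": "person",    # lets the results UI badge people distinctly
    }

doc = profile_to_search_document({
    "uid": "jsmith", "givenName": "Jane", "sn": "Smith",
    "title": "Search Analyst", "department": "IT",
    "mail": "jsmith@example.com",
})
print(doc["title"])  # Jane Smith - Search Analyst
```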

You also offer the full power of your enterprise search engine:

  • Full text search – no need to specifically search within a field, though most engines will offer a way to support that as well if you want to offer that as an option;
  • The power of the search engine to work on multi-word searches to boost relevancy – so a search on just a last name might include a worker’s profile in the search results, but one that includes both a first and last name (or user ID or location or other keywords that might appear in the worker’s profile) likely ensures that the person shows up on the first page of results amidst other content that matches;
  • The power of synonyms – so you can define synonyms for names in your engine and get matches for “Rob Smith” when a user searches on “Robert Smith” or “Bob Smith” (see the sketch after this list);
  • Spelling corrections – your engine likely has this functionality too, so it can automatically offer corrections if someone misspells a name.
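To illustrate the synonym point above, here is a minimal sketch of query-time nickname expansion. Most engines support this natively through a synonym dictionary; the nickname table and function here are just illustrative stand-ins:

```python
# A tiny query-time synonym sketch for first names. Real engines usually
# handle this via a synonym dictionary; this table is a stand-in.
NICKNAMES = {
    "robert": {"rob", "bob", "bobby"},
    "william": {"will", "bill", "billy"},
    "elizabeth": {"liz", "beth", "betsy"},
}

# Build the reverse map too, so "bob" also expands to "robert".
EXPANSIONS = {}
for formal, nicks in NICKNAMES.items():
    group = {formal} | nicks
    for name in group:
        EXPANSIONS[name] = group

def expand_query(query):
    """Rewrite each term as an OR-group of its known name variants."""
    parts = []
    for term in query.lower().split():
        variants = EXPANSIONS.get(term, {term})
        parts.append("(" + " OR ".join(sorted(variants)) + ")")
    return " ".join(parts)

print(expand_query("Bob Smith"))
# -> "(bob OR bobby OR rob OR robert) (smith)"
```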

Below, you will find a discussion of the implementation process we used and the problems we encountered. It might be of use to you if you attempt this type of thing.

Before getting to that, though, I would like to discuss what I believe to be the remaining issue with a third generation solution, in order to set up my follow-up post on this topic, which will describe additional ideas for solving the “people finder” problem within an enterprise.

The primary issue with our current solution (or any similar solution based strictly on information from a corporate directory) is that the profile of a worker consists only of administrative information. That is, you can find someone based on their name, title, department, address, email and so on, but you cannot find someone based on far more useful attributes – what they actually do, what their skills or competencies are, or what their interests might be. More on this topic in my next post!

The implementation of our third generation solution (read on for the gory details)

Read on from here for some insights on the challenges we faced in our implementation of this solution. It gets pretty detailed from here on out, so you’ve been warned!


People Search and Enterprise Search

Tuesday, October 14th, 2008

This post is the first of a brief series of posts I plan to write about the integration of “people search” (employee directory) with your enterprise search solution. In a sense, this treats “people” as just another piece of content within your search, though they represent a very valuable type of content.

This post will be an introduction and describe both a first and second generation solution to this problem. In subsequent posts, I plan to describe a solution that takes things forward one step (simplifying the experience for your users, among other things) and then some research that I believe shows a lot of promise and which you might be able to take advantage of within your own enterprise search solution.

Why People Search?

Finding contact information for your co-workers is such a common need that people have, forever, maintained phone lists – commonly just as word processing documents or spreadsheets – and also org charts, probably in a presentation file format of some type. I think of this approach as a first generation solution to the people search problem.

Its challenges are numerous, including:

  1. The maintenance of the document is fraught with the typical issues of maintaining any document (versioning, availability, etc.)
  2. In even a moderately large organization, the phone list may need to be updated by several people throughout the organization to keep it current.
  3. Search within this kind of phone list is limited – you have to make sure you have the latest version, open it up and use your word processor’s search function, or (I remember this well myself) keep a printout of the latest version next to your workspace so you can scan it when you need to contact someone.

As computer technology has evolved and companies implemented corporate directories for authentication purposes (Active Directory, LDAP, eDirectory, etc.), it has become common to maintain your phone book as a purely online system based on your corporate directory. What does such a solution look like and what are its challenges?

A “Second Generation” Solution

I think it’s quite common now that companies will have an online (available via their intranet) employee directory that you can search using some (local, specific to the directory) search tools. Obvious things like doing fielded searches on name, title, phone number, etc. My current employer has sold a product named eGuide for quite some time that provides exactly this type of capability.

eGuide is basically a web interface for exposing parts of your corporate Directory for search and also for viewing the org chart of a company (as reflected in the Directory).

We have had this implemented on our intranet for many years now. It has been (and continues to be) one of the more commonly used applications on our intranet.

The problems with this second generation solution, though, triggered me to try to provide a better solution a few years ago using our enterprise search. Here are the issues that prompted a different (better?) approach:

  1. First and foremost, with nothing more than the employee finder as a separate place to search, you immediately force searchers to decide where to search before they even begin. Many users expect that the “enterprise” search includes anything they can navigate to, so when they search on a person’s name and don’t see it in the result set, they immediately think either A) why does the search not include individual people’s information, or B) this search engine is so bad that, even though it must include people information, it can’t even rank the result highly enough to appear on the first page!
    1. Despite my statement to the contrary above, I am aware that Jakob Nielsen actually advocates the presence of both a “people search” box and a more general search box, because people are aware of the distinction between searching for content and searching for people. We still have both search boxes on our intranet, though, in a sense, the people search box is redundant.
  2. Secondly, the corporate directory is commonly a purely fielded search – you have to select which field(s) you want to search and you are then restricted to searching just those fields.
    1. In other words, you, as a searcher, need to know in which field a particular string (or partial string) might appear. For many fields this might not be an issue – first and last name are generally clear (though not always), as are email and phone number – but the challenge remains that a user has to decide in which field to look.
  3. Third, and related to the previous point, directory searches are generally simplistic string or partial-string matches. With a full search engine, you introduce the possibility of taking advantage of synonyms (especially useful on first names), spelling corrections, etc. (see the sketch after this list).
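To make the contrast concrete, here is a sketch of the kind of fielded, prefix-matching query a corporate directory supports, using the Python ldap3 library. The server address, bind details and schema are assumptions for illustration:

```python
# A sketch of the fielded, prefix-matching search a corporate directory
# offers - contrast with a full-text engine's ranked, field-free search.
# Server address, bind details and schema are illustrative assumptions.
from ldap3 import Server, Connection, ALL

server = Server("ldap.example.com", get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous bind, for illustration

# The searcher must already know WHICH field to match: this filter only
# finds "Rob..." if it appears in givenName or sn, exactly as a prefix.
conn.search(
    search_base="ou=people,dc=example,dc=com",
    search_filter="(|(givenName=Rob*)(sn=Rob*))",
    attributes=["cn", "mail", "telephoneNumber"],
)
for entry in conn.entries:
    print(entry.cn, entry.mail, entry.telephoneNumber)
```

Note there is no ranking, no synonym handling and no spelling correction here – the filter either matches or it doesn’t, which is exactly the limitation described above.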

So there’s a brief description of what I would characterize as a first generation solution and a second generation solution along with highlights of some issues with each.

Up next, I’ll describe the next step forward in the solution to this issue – integrating people into your enterprise search solution.

People know where to find that, though!

Monday, October 13th, 2008

The title of this post – “People know where to find that, though!” is a very common phrase I hear as the search analyst and the primary search advocate at my company. Another version would be, “Why would someone expect to find that in our enterprise search?”

Why do I hear this so often? I assume that many organizations, like my own, have many custom web applications available on their intranet and even their public site. It is because of that prevalence, combined with a lack of communication between the Business and the Application team, that I hear these phrases so often.

I have (unfortunately!) lost count of the number of times a new web-based application goes into production without anyone even considering the findability of the application and its content (data) within the context of our enterprise search.

Typically, the conversation seems to go something like this:

  • Business: “We need an application that does X, Y and Z and is available on our web site.”
  • Application team: “OK – let’s get the requirements laid out and build the site. You need it to do X, Y and Z. So we will build a web application that has page archetypes A, B and C.”
  • Application team then builds the application, probably building in some kind of local search function – so that someone can find data once they are within the application.
  • The Business accepts the usability of the application and it goes into production.

What did we completely miss in this discussion? Well, no one in the above process (unfortunately) has explicitly asked the question, “Does the content (data) in this site need to be exposed via our enterprise search?” Nor has anyone even asked the more basic question, “Should someone be able to find this application [the “home page” of the application in the context of a web application] via the enterprise search?”

  • Typically, the Business makes the (reasonable) assumption that goes something like, “Hey – I can find this application and navigate through its content via a web browser, so it will naturally work well with our enterprise search and I will easily be able to find it, right?!”
  • On the other hand, the Application Team has likely made 2 assumptions: 1) the Business did not explicitly ask for any kind of visibility in the enterprise search solution, so they don’t expect that, and 2) they’ve (likely) provided a local search function, so that would be completely sufficient as a search.

I’ve seen this scenario play out many, many times in just the last few years here. What often happens next depends on the application but includes many of the following symptoms:

  • The page archetypes designed by the Application Team will have the same (static) <title> tag in every instance of the page, regardless of the data displayed (generally, the data would be different based on query string parameters).
    • The effect? A web-crawler-based search engine (which is what we use) likely uses the <title> tag as the identifier for content, and every instance of each page type has the same title – resulting in a whole lot of pretty useless (undifferentiated) search results. Yuck.
  • The page archetypes have either no additional metadata or redundant metadata – keywords, description, content-date, author, etc.
    • The effect? The crawler gets no differentiation from the <title> tags and no additional hints from metadata. That is, lousy relevance.
  • The application has a variety of navigation or data manipulation capabilities (say, sorting data) based on standard HTML links.
    • The effect? The crawler happily follows all of the links – possibly indexing the same data many, many times, simply sorted on different columns.
    • Another effect? The dreaded calendar effect – the crawler will basically never stop finding new links because there’s always another page.
    • In either case, we see poor coverage of the content. (A sketch of the fixes follows this list.)
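The fixes for these symptoms are usually small. As a sketch – Flask stands in here for whatever framework the Application Team uses, and the record fields are assumptions – each page archetype can emit a data-specific <title>, a real description, and a noindex hint on redundant sort variants:

```python
# A sketch of a crawler-friendly page archetype: per-record <title>, basic
# metadata, and a noindex hint on redundant sort variants. Flask and the
# record fields are illustrative stand-ins for your own stack and schema.
from flask import Flask, request, render_template_string

app = Flask(__name__)
RECORDS = {"42": {"name": "Widget Pricing Report", "summary": "Q3 pricing data."}}

PAGE = """<html><head>
  <title>{{ rec.name }} - Example App</title>
  <meta name="description" content="{{ rec.summary }}">
  {% if noindex %}<meta name="robots" content="noindex, follow">{% endif %}
</head><body><h1>{{ rec.name }}</h1><p>{{ rec.summary }}</p></body></html>"""

@app.route("/record/<record_id>")
def record(record_id):
    rec = RECORDS.get(record_id)
    if rec is None:
        return "Not found", 404
    # Sorted views show the same data: hint the crawler to index only the
    # canonical page, avoiding redundant copies and crawler traps.
    noindex = "sort" in request.args
    return render_template_string(PAGE, rec=rec, noindex=noindex)
```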

The overall effect is likely that the application does not work well with the enterprise search, or that the application does not hold up to the pressure of the crawler hitting its pages much faster than anticipated (so I end up having to configure the crawler to avoid the application), ending with yet another set of content that’s basically invisible in search.

Bringing this back around to the title – the response I often get when inquiring about a newly released application is something like, “People will know how to find that content – it’s in this application! Why would this need to be in the enterprise search?”

When I then ask, “Well, how do people know that they even need to navigate to or look in this application?” I’ll get a (virtual) shuffling of feet and shoulder shrugs.

All because of a perpetual failure to ask a few basic questions during the requirements gathering stage of a project or (another way to look at it) a lack of standards or policies with “teeth” about the design and development of web applications. The unfortunate thing is that, in my experience, if you ask the questions early, it typically takes on the order of a few hours of a developer’s time to make the application work at least reasonably well with any crawler-based search engine. Because I often don’t find out about an application until after it’s in production, it then becomes a significant obstacle to get changes like this made.

I’ll write more in a future post about the standards I have worked to establish (which are making some headway into adoption, finally!) to avoid this.

Edit: I’ve now posted the standards as mentioned above – you can find them in my post Standards to Improve Findability in Enterprise Applications.