In my last post, I described the goals I have tried to achieve with my proof of concept people search function. Here I will describe the design and implementation of this proof of concept.
Designing the Solution
Given the goals above, here’s the general outline of the design for this solution:
- It would be built as a web application that generates a “profile page” for each worker – the set of all such profile pages comprises the targets for a search engine to index.
- Combined with a search engine (probably any search engine capable of indexing web pages would be sufficient – I used QuickFinder), it becomes trivial to integrate the search of these profiles into your enterprise search to provide a fourth generation solution to people search.
- The core tenet of the data used is that I wanted to identify a set of activities for workers. The aggregation of keywords related to those activities is used to generate a profile for a worker.
- An activity could potentially be anything that represents an event, action, writing, task, assignment, etc., that is associated with the worker.
- Some examples of activities might include: edit of a wiki article, assignment of a task in an online workspace, posting of a message in a discussion forum, membership in a project team, publishing a document in a corporate repository, posting an email to a mailing list, and so on.
Initially the web application directly queried the various systems used as sources when generating a profile for a worker. That is not scalable and also limits the amount of processing you can do, so I designed a simple SQL database to contain the data for this (implemented in MySQL). This database is essentially a data mart of worker data. The primary tables are:
- worker (one row for each worker); this table contains the basic administrative data for a worker (it’s effectively a mirror of the organization’s corporate directory)
- activity_source (each row describes a single source of activity which a worker might produce)
- activity (one row for each individual “activity” associated with a worker); an activity must have a “description” – typically the title of an item or the subject of an email, etc.
From these tables, a few additional tables are generated by processing the data from the activity table:
- activity_keyword (contains a row for each keyword associated with an activity); a keyword is either any (individual) word from the description of the activity or a piece of metadata associated with the item (for systems which support such);
- worker_top_keyword (aggregates the individual keywords associated with a worker [by association from activity_keyword through activity to the worker table]) so it’s easy to identify the top keywords for a worker without doing aggregation queries; each keyword in this table is weighted (see the description below of weights); I think of the set of keywords in this table for a worker to be that worker’s “attributes”
- worker_connection (aggregates “linkage” between workers based on similarity of their keyword profiles); more on this later.
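To make the shape of this data mart a little more concrete, here is a minimal sketch of the tables. The table names are the ones listed above, but every column name is my own guess for illustration, and the sketch uses Python's built-in sqlite3 module rather than MySQL so that it is self-contained. The `weight` column on `activity_source` assumes a per-source multiplier that defaults to 1.

```python
import sqlite3

# Table names follow the post; all column names are assumed for illustration.
SCHEMA = """
CREATE TABLE worker (
    worker_id     INTEGER PRIMARY KEY,
    first_name    TEXT NOT NULL,
    last_name     TEXT NOT NULL,
    email         TEXT
);
CREATE TABLE activity_source (
    source_id     INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    weight        INTEGER NOT NULL DEFAULT 1   -- assumed per-source multiplier
);
CREATE TABLE activity (
    activity_id   INTEGER PRIMARY KEY,
    worker_id     INTEGER NOT NULL REFERENCES worker(worker_id),
    source_id     INTEGER NOT NULL REFERENCES activity_source(source_id),
    description   TEXT NOT NULL,               -- title, subject line, etc.
    url           TEXT,                        -- link to the item, if it has one
    activity_date TEXT
);
CREATE TABLE activity_keyword (
    activity_id   INTEGER NOT NULL REFERENCES activity(activity_id),
    keyword       TEXT NOT NULL
);
CREATE TABLE worker_top_keyword (
    worker_id     INTEGER NOT NULL REFERENCES worker(worker_id),
    keyword       TEXT NOT NULL,
    weight        INTEGER NOT NULL,
    PRIMARY KEY (worker_id, keyword)
);
CREATE TABLE worker_connection (
    worker_id     INTEGER NOT NULL REFERENCES worker(worker_id),
    other_id      INTEGER NOT NULL REFERENCES worker(worker_id),
    similarity    REAL NOT NULL,
    PRIMARY KEY (worker_id, other_id)
);
"""


def create_data_mart(conn):
    """Create the (hypothetical) data mart tables on an open connection."""
    conn.executescript(SCHEMA)


def rebuild_top_keywords(conn):
    """Regenerate worker_top_keyword by aggregating activity_keyword rows.

    Each keyword occurrence counts once, multiplied by its source's weight.
    """
    conn.execute("DELETE FROM worker_top_keyword")
    conn.execute(
        """
        INSERT INTO worker_top_keyword (worker_id, keyword, weight)
        SELECT a.worker_id, k.keyword, SUM(s.weight)
        FROM activity_keyword AS k
        JOIN activity AS a ON a.activity_id = k.activity_id
        JOIN activity_source AS s ON s.source_id = a.source_id
        GROUP BY a.worker_id, k.keyword
        """
    )
    conn.commit()
```

Precomputing `worker_top_keyword` this way is what lets the profile page read a worker's top keywords with a simple lookup instead of running an aggregation query on every page view.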
With the implementation of this database, I also implemented a synchronization tool that updates the data in the tables from the source systems for the various types of activities.
By automatically pulling data from these source systems (which workers use in their regular day-to-day work), you remove the need for the workers to maintain data.
- By simply doing their job and “leaving traces” of that work, they generate the data necessary for generating this profile. This achieves goal #2.
- By restricting the set of data sources used to ones which anyone could examine for a worker’s activities (for example, I can view the history of a Wiki article and see who has edited it), I achieve goal #3.
Now, how should the profile page for a worker be presented?
Initially, I put together a design that did two things: 1) provided a typical employee directory style layout of the worker’s administrative details and 2) provided a list of all of the activities for a worker, grouped by activity source. In other words, you would see a list of all of the Wiki articles edited by the worker, a list of mailing list memberships, a list of community memberships, project team memberships, task assignments, etc. Each activity source’s list would be separately displayed (in a simple bulleted list). (I have always assumed that, before this went into production, I would ask for some design help from our electronic marketing group to give it a more professional look, but I thought the bulleted list worked perfectly well functionally.)
This proved simple and effective and also enabled the profile page to provide direct links to those activities that are addressable via a link (for example, my profile page could link directly to a Wiki article I’ve edited, to each of my discussion posts, etc.).
However, this approach suffered from at least two problems: 1) it lacked an immediately obvious visual presentation of a worker’s attributes, and 2) it exposed every detailed activity of a worker to anyone who viewed the profile (I found when I demoed this to people, some had the immediate reaction of, “Wow – anyone can see all of these details? I’m not sure I like that!” – a reaction that surprised me given that any of the details are generally visible to anyone who wants to look, but go figure).
After looking for alternatives, I found that the keywords for a worker (when combined with their weights) provided good input for a tag cloud – which is what I ended up using as the default presentation of a worker’s keywords (visible to everyone). This helps to highlight what someone is “about”, presents a generally attractive visualization of the data, and, if the default view of a worker displays this tag cloud (and the worker’s administrative data) and does not show all of the details, it alleviates the concern mentioned above.
I have found the implementation of the tag cloud to be the trigger that pulls people into this tool – it helps satisfy my goal #5 because, for most people who have looked at this, it provides immediate validation when they see words they expect to see in their own tag cloud.
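For readers curious how weights might translate into a tag cloud, here is a small sketch. The linear scaling, the pixel range, and the function name `tag_cloud_sizes` are all assumptions made for illustration, not a description of what the proof of concept actually does. The default cutoff of 50 matches the number of keywords shown in a worker's cloud.

```python
def tag_cloud_sizes(weights, top_n=50, min_px=12, max_px=36):
    """Map the top_n keywords (by weight) to font sizes in pixels.

    weights: dict of keyword -> numeric weight.
    Sizes are scaled linearly between min_px and max_px (an assumed scheme).
    """
    # Highest weight first; ties broken alphabetically for stable output.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    if not top:
        return {}
    lowest = min(w for _, w in top)
    highest = max(w for _, w in top)
    span = highest - lowest
    sizes = {}
    for keyword, weight in top:
        if span == 0:
            # Every keyword has the same weight: use a middle size.
            sizes[keyword] = (min_px + max_px) // 2
        else:
            scaled = min_px + (weight - lowest) * (max_px - min_px) / span
            sizes[keyword] = round(scaled)
    return sizes
```

With heavily skewed weights, a logarithmic scale may give a more readable cloud than the linear one sketched here.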
Here’s a shot of what part of my profile page looks like (partially obscured):
Additional Design Considerations
I wanted to keep the initial proof of concept simple in order to try to test different ways of using the data from the activity sources. With that in mind, here are some details on how I’ve done this so far:
- When parsing the text associated with an activity into “keywords”, I took the simplest approach I could: the text of an activity is split into separate words wherever a non-alphanumeric character is found. So a string like “content-management infrastructure” would result in 3 keywords: content, management and infrastructure.
- I also removed any words that are stop words in our search engine.
- Each keyword for a worker is assigned a weight. Simplistically, the weight of a keyword is the number of times that keyword shows up in that worker’s stream of activities.
- However, the tool that maintains the keywords allows an administrator to assign a weight to each activity source – so some sources can be given an artificial boost just by assigning a weight for that activity source higher than 1. The only source whose weight I’ve really toyed with so far is the corporate directory itself – I have given that a weight of 20 instead of 1.
- The weights for keywords are used in two ways:
- The top 50 keywords (by weight) for a worker are used in the tag cloud for that worker. The weight is then used to size the words in the tag cloud.
- When the “keywords” <meta> tag is being computed for a worker’s profile, the keywords are sorted by weight and the keywords are included until the length of the keywords content attribute is greater than 250 characters. This means that the top keywords are the ones which will give the worker higher relevance for searches on those words.
- Because all workers will have, at absolute minimum, the same details in this profile as they would in the corporate directory, and because the keywords from that activity source are given extra weight, those keywords will almost certainly be in the “keywords” <meta> tag for their profile – this helps satisfy my goal #6 by ensuring good relevance when people search on a worker’s administrative data (first name, last name, etc.).
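The parsing and weighting steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop word list is a placeholder (the real one comes from the search engine), keywords are lowercased and treated as ASCII, the separator in the `<meta>` tag is a comma and space, and the function names are invented here.

```python
import re
from collections import Counter
from html import escape

# Placeholder stop words; the real list would come from the search engine.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def extract_keywords(description, stop_words=STOP_WORDS):
    """Split text on any non-alphanumeric character and drop stop words."""
    words = re.split(r"[^A-Za-z0-9]+", description.lower())
    return [w for w in words if w and w not in stop_words]


def keyword_weights(activities, source_weights):
    """Count keyword occurrences, multiplied by each source's weight.

    activities: iterable of (source_name, description) pairs.
    source_weights: dict of source_name -> weight; missing sources count as 1.
    """
    weights = Counter()
    for source, description in activities:
        boost = source_weights.get(source, 1)
        for keyword in extract_keywords(description):
            weights[keyword] += boost
    return weights


def meta_keywords_tag(weights, limit=250):
    """Build a keywords <meta> tag from the heaviest keywords.

    Keywords are appended in weight order until the content string
    is longer than `limit` characters.
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    content = ""
    for keyword, _ in ranked:
        if len(content) > limit:
            break
        content = keyword if not content else content + ", " + keyword
    return '<meta name="keywords" content="%s">' % escape(content, quote=True)
```

For example, `extract_keywords("content-management infrastructure")` yields the three keywords content, management and infrastructure, and passing `{"directory": 20}` as `source_weights` gives directory-derived keywords the boost described above.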
Some additional functions I have layered on top of the basic profile / search mechanism that I believe will make this a valuable solution:
- The keywords in the tag cloud are links to pages that provide details about that keyword. When a user clicks on a keyword in a tag cloud, they are presented with a tag cloud of keywords related to their starting keyword (related by way of people who have the keywords in common). In other words, it provides a set of keywords that have a lot in common with their starting keyword. The “keyword profile” page also provides a list of workers who use the selected keyword (the list is sorted by keyword weight).
- When you view a worker, you are also presented with a list of workers who are “similar to” the worker you are looking at – the similarity measure is the percentage of the current worker’s profile (weighted keywords) that overlaps with the other worker’s profile. This provides a way to explore a neighborhood of similar people.
- In addition to the list of similar workers, a link is provided for each worker which, when clicked, displays a page explaining why the two workers are similar.
- Almost all of the data sources have a date threshold applied to the data pulled from the source – most of them take data from the last year. This ensures that the data used to build a profile is effectively self-maintaining.
- Each worker has control over whether others can see all of the details (the individual activities) in their profile. By default, only the tag cloud and administrative data are visible. A worker can opt in to allow others to see their entire profile.
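The “similar to” measure in the list above can be read in more than one way. The sketch below shows one possible interpretation, not necessarily the formula the proof of concept uses: for each keyword the two workers share, take the smaller of the two weights, sum those, and express the result as a percentage of the current worker's total weight. The function names are invented for illustration.

```python
def overlap_percent(current, other):
    """Percent of the current worker's keyword weight shared with another.

    current, other: dicts of keyword -> weight.
    """
    total = sum(current.values())
    if total == 0:
        return 0.0
    shared = sum(min(w, other[kw]) for kw, w in current.items() if kw in other)
    return 100.0 * shared / total


def similar_workers(worker_id, profiles, top_n=10):
    """Rank other workers by overlap with the given worker's profile.

    profiles: dict of worker_id -> {keyword: weight}.
    Returns up to top_n (other_id, percent) pairs, best match first.
    """
    current = profiles[worker_id]
    scores = [
        (other_id, overlap_percent(current, profile))
        for other_id, profile in profiles.items()
        if other_id != worker_id
    ]
    scores = [s for s in scores if s[1] > 0]
    scores.sort(key=lambda s: (-s[1], s[0]))
    return scores[:top_n]


def shared_keywords(current, other):
    """List the keywords two workers share, heaviest overlap first.

    This is the kind of data a "why are these two similar?" page could show.
    """
    common = [(kw, min(w, other[kw])) for kw, w in current.items() if kw in other]
    return sorted(common, key=lambda kv: (-kv[1], kv[0]))
```

Note that this measure is asymmetric: a worker with a small profile can overlap almost entirely with a worker who has a large one, while the reverse percentage stays low.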
Issues / Future Directions
The proof of concept has been very interesting to work through and has presented me with some (subjective) proof of the value of this approach, as simple as it is. That being said, there are some issues and additional areas I hope are explored in the future:
- This is a proof of concept built as basically a skunkworks project – I am hoping it will officially get some sponsorship and be launched into production.
- I would like to see it integrated with additional data sources – currently, it uses 12 data sources, but some high-value sources that are not included are our CRM system and our HR system. With the sources currently in use, the profiles that look sufficiently detailed tend to belong to the people who use those sources. Integrating a new source is relatively easy – a single SQL query from the source system that provides a list of activities for workers (where the source system can define whatever it wants to represent activities) is all that’s needed. It is this ease of adding in sources that achieves my goal #4.
- I believe there is still a lot of work to do around tweaking the weights of activity sources to balance out the effects of various sources.
- I would like to see some exploration of workers directly tagging other workers (to add keywords) or possibly allowing workers to give a thumbs up / thumbs down to individual keywords in a profile for a worker. This would add a powerful way for people to influence their own and others’ profiles.
- This approach also needs to receive more testing from others to validate its effectiveness. I have had a few dozen people look at it and provide feedback but some more quantitative approach to this would be valuable.
- I think this profile for a worker could be presented in a FOAF format as well – I’m not sure if that provides additional value, but it is a path to explore.
- The algorithm for parsing out keywords from the activities could be improved beyond the very simplistic parsing applied now.
- And, finally, I think that the measurement of similarity between workers could be significantly improved and the data from the links between workers embedded in this could be used to do some research to find “invisible communities” within the company. This would be a kind of organizational network analysis through data mining, which