Lee Romero

On Content, Collaboration and Findability
January 15th, 2009

Enterprise Taxonomy – A Business Process for Managing A Taxonomy

Now that I’ve posted quite a bit on the technical side of an enterprise taxonomy, I thought I’d share a bit on the business process side of how we have managed our taxonomy.

I spoke about this topic at the 2007 Taxonomy Boot Camp. (As an aside, I tried to find if the presentation I used is available on the site but I couldn’t find it – if someone knows of an online archive, please let me know and I can provide a link from here.) The session I delivered was titled, “The Process and Politics of Implementing a Corporate Taxonomy” and focused on the overall process we have implemented.

What follows is an overview of the larger process we used to establish the taxonomy, followed by a description of the smaller process used to maintain it. I’ll close with some of my own thoughts on what triggers changes in a taxonomy.

Getting Started

When we first started trying to formalize a taxonomy, one of the first steps we took was to do an organizational mapping to identify participants in the process. We focused on the following:

  • Groups that had significant investments in web content publication
  • Groups that had significant interest and investment in knowledge capture and sharing
  • Groups that have influence on the corporate culture

We felt that this organizational mapping was important because it would help increase buy-in to the taxonomy from those with the greatest vested interest in it and also (with help from that last group) would help increase larger scale adoption of the language. Once we felt that we had identified the groups that met these criteria, we engaged with the executives for the groups to help us identify one or more people who could be included in our Taxonomy Review Board.

The rest of the “getting started” process included content audits and analyses to identify terminology used to describe the content, definition of the structure of the taxonomy we wanted to use, organization of the terminology into this structure and then working with the Taxonomy Review Board to confirm the end result as a first version of the (evolving) taxonomy.

We also laid out the objectives we had for the overall process – which you can find in my post on the vision we have developed for our taxonomy. The most pertinent were that the taxonomy be actively managed and that the management process be transparent.

The People

Now that the taxonomy had been established, we needed to identify the people and process we would use for maintaining and enhancing the taxonomy.

The people who are involved include:

  • The taxonomy manager – a single person responsible for responding to requests for changes, proactively identifying proposed changes within the taxonomy and handling the “administrative” side of the process. If it’s helpful, I’ve found that this responsibility probably takes about 10% of a single person’s time (though that obviously reflects the size of our organization and volume of content, etc., and can vary at different times). This is my role within the process.
  • A core team – a group of about 3 people (one of whom is the taxonomy manager) who do a first-level check of change requests to screen out requests that are obviously (at least in the minds of the core team) not worth moving further through the review process. Time commitment for this group is probably on the scale of a few hours a month.
  • The above-mentioned Taxonomy Review Board (TRB) – a cross-organizational group that reviews proposed changes and either aligns with them or offers counter-proposals. This group currently has about 15-20 members. Time commitment for this group is minimal – normally, the proposals for change have been considered and detailed enough by the time this group sees them that their involvement is to receive emails with change proposals and either align (so no reply necessary) or write a counter proposal.

This organization has helped to keep the taxonomy managed, while also keeping overall enterprise expense to manage it fairly small.

The Process

Now, I am, at heart, a software engineer. Why is this pertinent? Early on in my career, I came to appreciate the need for and value of change control (or, as I prefer to think of it, change management or change visibility – I’ve always thought “control” seemed a bit stronger than you could really achieve) and that has seeped into our process.

At its heart, our process is similar to a software development team’s change control board (CCB) process:

  1. All changes, upon identification, are captured in the same bug-tracking system used for our engineering and IT systems (an implementation of Bugzilla). Just like with software, all changes are treated as either enhancements (extending beyond what we have now) or defects (a problem or mistake that was not anticipated) and so they follow the same lifecycle (I generically use the word “bug” below to mean the specific documented request in our tracking system for a change, regardless of whether the change is an enhancement or defect).
  2. Once a change is documented as a bug (I’ll write a bit more below about the sources of changes to the taxonomy), it is assigned to the taxonomy manager for resolution.
  3. The taxonomy manager then needs to do a few things:
    1. Ensure that the bug contains all of the necessary details and any obvious questions are answered. An example of this would be the specific guidelines we have for one of our classifications – I shared these with the TaxoCoP and Patrick Lambe blogged about them as well. In this case, the taxonomy manager is on the hook to ensure a request adheres to these guidelines.
    2. Describe the impact of the change on the rest of the taxonomy (if any).
  4. The change is then reviewed by the core team – this review is typically virtual via email exchange but can be a meeting convened by the taxonomy manager.
    1. If the core team aligns with the change (perhaps after some continued evolution of it), it moves forward for a review by the full Taxonomy Review Board.
    2. If the core team rejects the change, it is canceled. The taxonomy manager communicates that back to the requester (if the trigger for the change was a particular person or group).
  5. When a change is put to the Taxonomy Review Board, which is a virtual team (and which is geographically very distributed), it is communicated by email to the TRB.
    1. At this point in the process, we want to ensure efficiency in process so we do not use a “request for comment” type of approach.
    2. Instead, the change is detailed for the TRB and the TRB members are given two options: 1) align with the change as stated or 2) provide a counter proposal. This helps keep focused and helps to avoid potentially lengthy discussions on the change at this point.
    3. To further accelerate decision making and reduce time on the part of the TRB members, each request is also positioned as a time-boxed proposal: You have until <this date> to provide a counter proposal or else you are assumed to align to the change. In other words, no reply from a member equates to alignment.
    4. Another implication of this is that by the point a change reaches the TRB it is almost inevitably going to go ahead in some form (perhaps changed by counter-proposals from the TRB). Cancellation at this stage is possible in principle but so far has not happened in practice.
  6. Upon achieving alignment within the TRB on the final proposal, the change is executed in the taxonomy and the request closed. The change is communicated back to the requester to close the loop and (especially for significant changes) the change may also be communicated to our larger content manager community.
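
To make the time-boxed review in step 5 concrete, here is a minimal sketch in Python. It is only an illustration of the rule that silence equals alignment, not our actual tooling (the real process runs over email and Bugzilla), and the names ChangeProposal, add_counter_proposal, aligned_members and status are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChangeProposal:
    """A change put to the TRB (hypothetical structure, for illustration only)."""
    bug_id: int
    summary: str
    deadline: date
    counter_proposals: dict = field(default_factory=dict)  # member -> text

    def add_counter_proposal(self, member: str, text: str, on: date) -> bool:
        """Record a counter proposal; ignore it if it arrives after the deadline."""
        if on > self.deadline:
            return False
        self.counter_proposals[member] = text
        return True

    def aligned_members(self, members: list) -> list:
        """Members who have not replied are assumed to align with the change."""
        return [m for m in members if m not in self.counter_proposals]

    def status(self, today: date) -> str:
        """'needs revision' if anyone countered, else 'open' until the deadline passes."""
        if self.counter_proposals:
            return "needs revision"
        return "aligned" if today > self.deadline else "open"
```

Note that there is no "rejected" outcome in the sketch: a proposal either goes ahead as stated or goes ahead after being reworked, which mirrors how the TRB stage has behaved in practice.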

Issues with the Process and Framework

While it has worked effectively, we still face a number of issues with this process. These include:

  • The need to keep on top of organizational changes – specifically, with regard to membership in the Taxonomy Review Board. A member’s role within the enterprise can change to the point where they may not be in the best position to represent a group of interests. In addition, with some organizational changes we’ve seen, it can result in an “unbalanced” TRB.
  • Which brings us to the second issue – organizational coverage. Currently, we have a TRB that overly represents our marketing organization and is missing representation from some groups that should be represented.
  • Lastly, support of this process from within our IT organization is a concern. I see this in a couple of different ways:
    • Organizationally, the taxonomy manager falls within IT but the responsibility to continue managing the taxonomy is not perceived as a priority (and there’s a question as to whether it should even organizationally be within IT);
    • In terms of adoption, it has been a challenge to educate the IT organization about the value and use of the taxonomy. An example would be integration with a business intelligence solution to ensure consistency in language and, more specifically, to be able to effectively integrate insights about content (which does use the taxonomy) with more transactional-based “data”.

Identifying the Need for Change

What triggers a change in the taxonomy?

As I (re-)gather my thoughts on this topic, one lingering question came back to me about the overall process. The question is external to the process (which takes the approach of “a change comes from somewhere and we’re not going to worry about where it comes from but once it’s been identified, we’ll wedge it into this process”) but I am interested in understanding what other taxonomists might actively do in maintaining a taxonomy. In other words, how much change do you experience that comes from others compared to your own recommendations or insights?

Here’s a list of triggers that have resulted in changes in the taxonomy:

  • We provide content publishers with a mechanism to request a change to one facet (“Item Type”) at the point where they are submitting a piece of content. I consider this to be a purely tactical, reactive change and, given the above process, suffers from the problem that a content publisher cannot sit at their computer waiting for the business process to complete before they submit their content. So even if a new value is adopted, they will need to publish their content with a temporary value and remember to come back and change it after the fact.
  • I have engaged with content owners several times who were planning to publish a set of content and worked proactively with them to understand their content and ensure that the taxonomy provides good coverage. It’s lucky (though perhaps it shouldn’t be!) when this can happen and I manage to ensure the taxonomy changes are in place before they need to publish content.
  • When a new repository is being migrated or merged into a system using the taxonomy, there will likely be a number of changes in the taxonomy, including adoption of whole new classifications and introduction of new values. Also, this almost inevitably requires a good mapping from local system values to the taxonomy values where there is (near) overlap.
  • Most proactively on my part, I have also used analytics from a number of sources to help refine the taxonomy, including:
    • Reviewing search query logs to understand the language being used by people looking for content
    • Reviewing the “free text” fields (e.g., title, description, etc.) within content management systems to look for terms that are commonly used that might warrant explicit use in a constrained classification.
    • Reviewing the volume of content when split along various dimensions of the existing taxonomy – looking for opportunities to merge (values are under-utilized), split (values are over-utilized) or perhaps retire (values are not utilized)
  • Adoption of new terminology by groups responsible for that part of the taxonomy. A common example is the terminology used to describe our various solution offerings – these will, at times, be changed unilaterally by our marketing organization and we then need to understand how that translates to the existing taxonomy and to content tagged with that taxonomy.
  • Lastly, given that another part of the vision of the taxonomy is to use systems of record where possible, a number of changes are triggered outside of the taxonomy and simply synchronized in from the source system. This approach assumes (true in all cases as far as I am aware) that the source systems provide their own management process on values and these changes do not require any review through the above taxonomy management process.
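
The content-volume review mentioned in the list above can be sketched in a few lines of Python. This is a rough illustration only: the function name usage_review and the low and high thresholds are assumptions made up for the example, not figures we actually use, and any real review still needs human judgment.

```python
from collections import Counter


def usage_review(tag_assignments, all_values, low=0.05, high=0.40):
    """Flag values that look unused, under-used or over-used.

    tag_assignments: one entry per (content item, value) association.
    all_values: every value currently defined in the classification.
    low / high: hypothetical share-of-content thresholds.
    """
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    report = {"retire": [], "merge": [], "split": []}
    for value in all_values:
        n = counts.get(value, 0)
        share = n / total if total else 0.0
        if n == 0:
            report["retire"].append(value)   # not utilized
        elif share < low:
            report["merge"].append(value)    # under-utilized
        elif share > high:
            report["split"].append(value)    # over-utilized
    return report
```

The output is a list of candidates to examine, each of which would still be logged as a bug and go through the review process described earlier.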

January 14th, 2009

Enterprise Taxonomy – An XML schema for Publishing a Taxonomy

I’m continuing my dive into the structure of our taxonomy, in the hope that it might be of use or interest to you to understand and possibly adapt to your own needs. So far, I’ve provided an outline of the application solution and then a high level outline of the data model we’re using.

One of the important features of our solution is that our taxonomy system provides the ability for other systems to consume the taxonomy via an XML document. I’ll explore that a bit here.

Accessing the XML

Access to the XML document for the taxonomy is through a very simple means: a standard HTTP GET. The query string in the request can specify various parameters on the URL – effectively, a very simple web service. The types of parameters supported include:

  • Identifying which classification is desired (default is to return all)
  • Specifying the statuses of values to include (default will return all)
  • Specifying the language to include (default returns English)
  • Specifying the level of detail of interest (default returns the briefest format)

With regard to the language – one of the business rules followed in our web sites is that you provide content in the user’s selected language when available and return English when the user’s language is not available (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details in German and, say, 99% of the values in German but if there are values that are not translated, those are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it does present a possibly confusing same-page mix of non-English and English).
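
The per-value fallback rule is simple enough to show in a few lines of Python. This is a sketch of the rule only, assuming each value carries a mapping from language code to translated name; the function name localized_name is hypothetical.

```python
def localized_name(names, lang, fallback="en"):
    """Return (language_used, text) for one value.

    names: dict mapping language code -> translated name.
    English is assumed to always be present, as the business rule requires.
    """
    if lang in names:
        return lang, names[lang]
    return fallback, names[fallback]


values = [
    {"en": "Germany", "de": "Deutschland"},
    {"en": "Midwest United States"},  # no German translation yet
]
german_view = [localized_name(v, "de") for v in values]
```

Returning the language actually used alongside the text is what lets a consumer notice (and perhaps style differently) the English values mixed into an otherwise German list.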

Document structure

The returned XML document looks like the following. I’m not using any formal XML schema syntax – instead showing the elements and how they relate to each other with a brief description of the elements that I don’t think are self-explanatory.

  • taxonomy
    • classification – has an attribute id (the ID of the classification)
      • name – has an attribute lang (the language code describing the language of the name element)
      • description – has an attribute lang (the language code describing the language of the description element)
      • status
      • createDate
      • updateDate
      • sourceSystem
      • comments
      • hasValues (a Y/N indicating if a consuming application should expect to find values in the values element)
      • constrained (a Y/N indicating if a consuming application should enforce the rule that values for this classification must come from the list of values provided)
      • multiValued (a Y/N indicating if a consuming application should allow multiple values be assigned for any given content piece)
      • dataType
      • changeHistory – an element with a sequence of elements, one for each auditable event in this item’s life history
      • aliases – has attribute count (the number of alias elements included)
        • alias – a structured element providing details on an alias
      • levels – has an attribute count (the number of levels included)
        • level – a structured element providing details on the level (omitted here)
      • values – has an attribute count (the number of values included)
        • value – has an attribute id (the ID of the value in the taxonomy system)
          • name – has an attribute lang (the language code describing the language of the name element)
          • description – has an attribute lang (the language code describing the language of the description element)
          • status
          • createDate
          • updateDate
          • sourceSystemId
          • levelRef – attribute id (identifies the specific level [in the levels element above] with which this value is associated)
          • aliases – attribute count (the number of aliases for this value)
            • alias – a structured element providing details on an alias
          • changeHistory – Same as for classification
          • values – recursive structure reflecting hierarchy within a classification’s set of values
            • value (etc.)

And that’s the schema. Looks complicated, but it’s really pretty simple, I think. The advantage of this has been that consuming applications do not need to directly access the database containing this (which would be pretty simple in principle) and so can be insulated from changes in the underlying structure of the database as we need to make them.
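
To give a feel for how little a consumer needs, here is a minimal Python sketch that walks the recursive values structure using only the standard library. The sample document is invented for illustration (the IDs and names are not real data) and includes just a few of the elements listed above.

```python
import xml.etree.ElementTree as ET

# Invented sample in the "brief" shape; IDs and names are illustrative only.
SAMPLE = """<taxonomy>
  <classification id="10">
    <name lang="en">Geography</name>
    <status>Active</status>
    <values count="1">
      <value id="100">
        <name lang="en">EMEA</name>
        <status>Active</status>
        <values count="2">
          <value id="110"><name lang="en">France</name><status>Active</status></value>
          <value id="120"><name lang="en">Germany</name><status>Inactive</status></value>
        </values>
      </value>
    </values>
  </classification>
</taxonomy>"""


def walk_values(values_elem, depth=0):
    """Yield (depth, id, name, lang, status) for each value, parents first."""
    if values_elem is None:
        return
    for value in values_elem.findall("value"):
        name = value.find("name")
        yield (depth, value.get("id"), name.text, name.get("lang"),
               value.findtext("status"))
        # Child values live in a nested <values> element under the parent.
        yield from walk_values(value.find("values"), depth + 1)


def flatten(xml_text):
    """Return one row per value, prefixed with its classification ID."""
    root = ET.fromstring(xml_text)
    rows = []
    for classification in root.findall("classification"):
        for row in walk_values(classification.find("values")):
            rows.append((classification.get("id"),) + row)
    return rows
```

The depth in each row is enough to rebuild an indented pick list or tree control on the consuming side.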

Providing access via an HTTP GET keeps the technical cost minimal for consuming applications (they need to be able to read from an HTTP socket and then parse XML, both pretty standard functions in modern languages / libraries).

One last comment – in regard to the level of detail parameter mentioned above – the “brief” level includes the names, descriptions and statuses only of the classifications, levels and values. The “detailed” level includes all details except the changeHistory elements. The “complete” level includes all of the above. The “complete” format is probably not very useful for consumers as most will not care about the life history of elements (though that is of interest and value within the taxonomy).

Relationship to other Schemas

Just to connect the dots – I know of other XML schemas that we could conceivably have used to publish this document.  With help from the Taxonomy community of practice, I found the following while researching for a schema to use (I especially want to say thanks to Leonard Will, Mike Taylor, Marcel van Mackelenbergh and Bob Bater for their insights):

At the time we were designing (defining) a schema to use, we knew we wanted to keep it as simple as possible and (right or wrong) as close to the underlying model as we could, which made sense within our business environment. It wasn’t clear at the time which of the above might provide the most likely path forward (in terms of standard adoption) so we “rolled our own”. And, another factor was that the schemas seemed far more general than our needs warranted; for example, the broader-than / narrower-than type relations were implicit in our structure and specifying those explicitly seemed confusing. (To be honest, all of which could be interpreted as “we weren’t educated enough to understand the options and took the simpler-at-the-time approach of rolling our own”.)

I am still not as familiar as I would like to be with the above, so I still would not be able to say which would be most appropriate, but the SKOS schema, now in draft from the W3C, seems like a potential solution that would fit our needs and could eventually become a broader standard. Does anyone have any insights as to where this is moving?

January 13th, 2009

Enterprise Taxonomy – The Structure in Detail

In my previous post, I started describing the structure of the taxonomy we are using in some detail; originally, the following was part of my last post but it got a bit too long so I’ve split it. In this post, I’ll explore the structure in yet more detail – getting closer to a data model.

If you are going through a similar process that we’ve been through and you want to organize your taxonomy in a database, this might provide you with enough detail to get moving.

One note on terminology – much of what we have used is not what I would consider “standard” among taxonomists but was derived during a period when we had numerous systems we were trying to pull together, each of which used one of many different terms – categories, attributes, metadata, fields, tags, etc. I was charged at this point (which was before we started digging into the details of defining an enterprise taxonomy) with trying to define some terms that we could all use so that we could at least understand each other. A taxonomy for taxonomies, I guess.

Classification

The primary construct in the taxonomy is called a “Classification”. A better term for this I now know would be “Facet” as that’s what they are. The intent is that a Classification is a specific set of values (perhaps explicitly defined or perhaps defined by a set of guidelines or business rules) with which pieces of content can be associated (they can be tagged with values from the classification).

In our schema, a Classification itself has a number of elements:

  • Name – The preferred name for the Classification. Typically used as the label for fields on, for example, data entry forms of various sorts.
  • Definition – A concise definition of the Classification. Forcing the explicit definition of this helps reduce fuzzy thinking and gets people to clearly differentiate when a new Classification is needed versus using an existing one. This can be displayed in other systems that allow users to associate classification values with content as a kind of “mini-help”.
  • Life History (create date, modification date, audit trail) – We maintain the create date (actually, date added to the taxonomy) and a modification date so we know what happened when to the Classification. More detail is provided below on the audit trail.
  • Source System – Each classification might be sourced from another system. An example is a product listing – these are not maintained in the taxonomy but in their own systems and the taxonomy simply uses that list. Another example (where we do not have automation) is language (where we reference ISO standards as the master even though the values are still manually maintained in our taxonomy database).
  • Comments – A text field to hold comments for use within the taxonomy. Notes about issues, etc. Not intended for end users as the Definition is.
  • Data Type – The type of values expected for this Classification. Most commonly, just Strings, but we do define (for example) Creation Date and Expiration Date as classifications with data type of Date.
  • Value Indicators – The taxonomy provides indicators to help other systems know what to do with the Classification – Should assignment be constrained to just the values provided by the taxonomy? Should other systems allow content pieces to be associated with multiple values of a classification?
  • Synonyms – We provide for the Classification itself to have synonyms (these are synonyms for the Name of the classification). This can be used when (despite best attempts to the contrary) people want to continue to use different terms for the same classification. An example might be that one system (and its user group) might want to refer to a “Region” whereas another might use the term “Market” or “Area”.
  • Status – We provide a status indicator on pretty much everything within the taxonomy (Classifications, individual values, etc). The usage is consistent and breaks down into:
    • “Active” – the value can be assigned to new/modified content; should be displayed in any type of search UI (say as a pick list) if appropriate; and should be displayed if a user views the taxonomic tagging of an item.
    • “Inactive” – the value should not be able to be assigned to new content or be newly assigned to existing content; it should be displayed in search UIs (if appropriate) and should be displayed if a user views the taxonomic tagging of an item. Basically, it was valid at one point and still has value on content already tagged with it but we do not use it any more.
    • “Deleted” – We don’t delete values physically, but mark them “Deleted”. The value can not be assigned when creating or editing content, it should not be displayed in any search UI and it should not be displayed if a user views the taxonomic tagging of a piece of content. Basically, the value is no longer in the taxonomy (though some systems may still have the value associated with content in some ways).
    • “Proposed” – The first status for most items. The value would only be in the Taxonomy system itself and would not propagate to other systems. Indicates that it’s being considered for adding but has not yet been approved.
  • A set of Classification Levels – Some classifications have an internal structure, described below in the “Level” section.
  • Localizations of Classification – There may be non-English translations of the name and description of a classification in the taxonomy database (see below for more about multiple languages).
  • A set of Classification Values – Most classifications have a set of explicit values that can be associated with a piece of content. The values might be a flat list or might be hierarchical. The taxonomy database supports both. Currently, we do not support any type of many-to-many relationship or relationships across Classifications – just a simple one-to-many within a Classification which is a value / sub-value relationship (some Classifications provide more explicit constraints on the intended meaning of the relationship). Also, we do not have a construct that allows for an explicit (in the taxonomy database) meaning for any given relationship (specifically, narrower-than, broader-than, etc.). It’s implicit in the structure of the values.
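
The four statuses above boil down to a small table of behaviors that a consuming system needs to honor. Here is one way to express that table in Python. The names Status, Behavior and BEHAVIOR are hypothetical, and the row for “Proposed” is my reading of the fact that proposed items never leave the taxonomy system.

```python
from enum import Enum
from typing import NamedTuple


class Status(Enum):
    PROPOSED = "Proposed"
    ACTIVE = "Active"
    INACTIVE = "Inactive"
    DELETED = "Deleted"


class Behavior(NamedTuple):
    assignable: bool      # may be assigned to new or edited content
    in_search_ui: bool    # shown in search pick lists and similar
    on_item_view: bool    # shown when viewing an item's tagging


BEHAVIOR = {
    Status.ACTIVE:   Behavior(True,  True,  True),
    Status.INACTIVE: Behavior(False, True,  True),
    Status.DELETED:  Behavior(False, False, False),
    # Assumption: proposed items are not propagated, so consumers never show them.
    Status.PROPOSED: Behavior(False, False, False),
}


def assignable_values(values):
    """values: iterable of (text, Status) pairs."""
    return [text for text, status in values if BEHAVIOR[status].assignable]


def search_values(values):
    """Values a search UI should offer, including retired-but-still-tagged ones."""
    return [text for text, status in values if BEHAVIOR[status].in_search_ui]
```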

Given the definition of a Classification as above, the terminology we use is that the taxonomy is, itself, the set of all Classifications we have defined and which can be used to tag content. As with Classification itself, this is not, I think, consistent with standard usage (the hierarchical structure within any one Classification would be considered a taxonomy) but adopting this definition at least got us organizationally out of the confusion of how we have a taxonomy when all of the values are not in a single, strict hierarchy.

Value

A Value is a single (usually textual, though might be dates or numbers) term which can be associated with a piece of content. Values are grouped into Classifications. A value association to a piece of content is what connects that piece of content to the taxonomy.

Like a Classification, a Value has a structure, which is only used when the Classification provides explicit values:

  • ID – the unique identifier within the taxonomy that identifies the value. Most systems using the taxonomy will store this ID as the association (and not the text of the associated value). This allows for the Value to have its textual representation changed without having to revisit any content (say a product name changes or a country’s name changes).
  • Structure details – What classification this value is associated with and which value in this Classification (if any) is the parent of this value. Also, some values have a designated “Level” (see below for more on that).
  • Value – the textual representation of this value. The string users will see and interpret as the “value”.
  • Definition – the definition of this value. As with the classifications, forcing this to be clearly defined provides a good “buffer” against people requesting values to be added that are duplicative or not generally useful. I’m surprised by how often asking a requestor for a clear definition (and how it’s different from another value that seems similar) stops them in their tracks.
  • Life History – same as the Classifications
  • Source System ID – For Classifications whose values come from another system, we maintain the source system’s ID so we can associate it back to the source system for updates. This can also be used by systems that pull from the taxonomy and also might happen (for other business reasons) to pull data from the same source systems and allows those systems to cross between the two sets of values.
  • Status – Same as for Classifications
  • Synonyms – Same as for Classifications but applied to the individual values. Synonyms for values are much more common than synonyms for classifications. Systems using the synonyms can potentially do many different things with synonyms (displaying them while a content manager is associating values with content, supporting search on them, etc.)
  • Localization of Value and Definition – Non-English translations of the value and definition. See below for more details.
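
A short Python sketch shows why storing the ID rather than the text pays off, and how the parent reference gives you the hierarchy. The field names and the sample IDs are illustrative only and do not reflect our actual table layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Value:
    """A cut-down Value record (hypothetical field names)."""
    id: int
    classification_id: int
    text: str
    parent_id: Optional[int] = None


def labels_for(content_id, content_tags, values):
    """Resolve the stored value IDs on a content item to their current text."""
    return [values[value_id].text for value_id in content_tags[content_id]]


def path(value_id, values):
    """Return the value's ancestry as text, from the top-level value down."""
    names = []
    current = values.get(value_id)
    while current is not None:
        names.append(current.text)
        current = (values.get(current.parent_id)
                   if current.parent_id is not None else None)
    return list(reversed(names))


values = {
    100: Value(100, 10, "EMEA"),
    110: Value(110, 10, "France", parent_id=100),
}
content_tags = {"doc-1": [110]}  # content stores IDs, never the text
```

Renaming a value (values[110].text = "French Republic", say) changes what labels_for returns for every tagged item without touching content_tags at all.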

Level

Within a single Classification, we have adopted a mechanism we refer to as a “Level” in order to have a structure within the Classification when it’s meaningful to have different Values grouped into semantically different sets. I think of this as the means by which we support a structure of Classifications.

A good example is Geography. We have a single classification for Geography which contains all necessary values for tagging content for geographic relevance (or irrelevance in some cases). However, each Value within that Classification might represent a different type of Geography. Some values are regions of the world (“North America” or “EMEA”); some values are Countries (“France” or “Japan”); and some might be areas within a country of use (“Midwest United States”).

A Level is a hierarchy of terms within a Classification and any given Value can be assigned to a Level.

The value of this is that systems using the taxonomy can provide user interfaces that group similar values (a nested, tree-style interface, say) while we do not need to have multiple Classifications with relationships across the Classifications to support this.
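
As a rough illustration of what a consuming system can do with Levels, this Python sketch groups the values of a single classification by their assigned level. The level IDs and names are invented for the example, and values without a level are simply collected under a placeholder label.

```python
from collections import defaultdict


def group_by_level(values, level_names):
    """Group value text by level name.

    values: iterable of (text, level_id) pairs; level_id may be None.
    level_names: dict mapping level_id -> level name.
    """
    grouped = defaultdict(list)
    for text, level_id in values:
        grouped[level_names.get(level_id, "Unassigned")].append(text)
    return dict(grouped)


levels = {1: "Region", 2: "Country", 3: "Sub-country area"}
geography = [
    ("North America", 1),
    ("EMEA", 1),
    ("France", 2),
    ("Japan", 2),
    ("Midwest United States", 3),
]
by_level = group_by_level(geography, levels)
```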

Multiple Languages

In order to support multiple languages on our web sites, we have provided a means to localize the entire taxonomy. Because localized content is a critical component of our customer-facing site, we provide a structure so that all text that can be used outside of the taxonomy (primarily things like the names and definitions of Classifications, the name and definition for Values, Level names, and even synonyms of each of these) can be localized.

Systems that pull from the taxonomy can then use the available localized terms in their displays (falling back to English if a particular term is not available in a specific language). This could be used in field labels on forms or navigation labels in a browsing interface, menu items, etc.

Audit Events

As I mentioned in my post on a vision for an enterprise taxonomy, the taxonomy should provide transparency and allow interested users to examine the history of changes within the taxonomy. This is accomplished by maintaining a history of audit events which can be associated with any of the entities within the taxonomy (classifications, values, levels, etc). Each event is pretty simple:

  • Event type – the type of event that occurred (addition of a new entity, modification of an entity, etc.)
  • Event description – a longer text field describing the event. For entities added / modified manually (as opposed to changes via feed from another system) this comment will almost always include a reference to the bug (in our bug database) that describes the change more fully.
  • Date / time of event – When the event occurred
  • User who triggered the event – Who triggered the event
  • Associated entity (the value, classification, level, etc. that changed) – what was changed.

With the above, when a user views the taxonomy, they can see the full lifecycle of any given entity in the taxonomy.
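To make the shape of an event concrete, here is a minimal sketch of the fields listed above and of how a "lifecycle" view could be assembled from them. The class name, field names and sample data are assumptions for illustration; they are not taken from our database.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class AuditEvent:
    event_type: str        # e.g. "added", "modified", "deprecated"
    description: str       # longer free-text description of the event
    occurred_at: datetime  # date / time of the event
    user: str              # who (or which feed) triggered the event
    entity_id: str         # the classification, value, level, etc. that changed


def entity_history(events: List[AuditEvent], entity_id: str) -> List[AuditEvent]:
    """Return the events for one entity, oldest first."""
    return sorted(
        (e for e in events if e.entity_id == entity_id),
        key=lambda e: e.occurred_at,
    )


# Made-up sample events.
events = [
    AuditEvent("modified", "Renamed value", datetime(2008, 6, 2, 9, 30), "editor1", "value:42"),
    AuditEvent("added", "Initial load from source system", datetime(2007, 1, 15, 8, 0), "feed", "value:42"),
    AuditEvent("added", "New classification", datetime(2008, 3, 1, 12, 0), "editor1", "classification:7"),
]

for event in entity_history(events, "value:42"):
    print(event.occurred_at.date(), event.event_type, event.description)
```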

The processes that pull taxonomy values from source systems also populate events, so we are gathering these for automated and manually maintained values.

All together, this helps provide interested users with some confidence in what’s changing and why it’s changing. In addition, it provides the ability (not exercised) to measure “turbulence” in the taxonomy – amount of change over time, etc.
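Since we have not exercised this, the following is only one possible reading of "turbulence": a count of audit events per calendar month. The function name and the sample dates are made up for the sketch.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def turbulence_by_month(event_dates: Iterable[date]) -> Dict[str, int]:
    """Count audit events per calendar month, keyed as 'YYYY-MM'."""
    counts = Counter(d.strftime("%Y-%m") for d in event_dates)
    return dict(sorted(counts.items()))


# Made-up event dates.
sample_dates = [date(2008, 10, 3), date(2008, 10, 17), date(2008, 11, 5)]
print(turbulence_by_month(sample_dates))  # {'2008-10': 2, '2008-11': 1}
```

Other definitions are just as plausible, for example counting only modifications, or dividing by the number of active values to get a rate.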

Up next, I’ll describe the XML schema we use for publishing from the taxonomy.

January 12th, 2009

Enterprise Taxonomy – The Structure

(Editor’s note – I started this several weeks ago and managed to get myself busy with a lot of other things in the meantime and am finally getting back to it now. Apologies for the lengthy pause in the discussion.)

In my last post, I described the vision we developed for our taxonomy and provided a little bit of insight on how it’s managed. I thought some might find it interesting to understand the structure within the taxonomy at a deeper level.

When we initiated our taxonomy effort, we started (as I think most do) by collecting a lot of the language used throughout our enterprise in a big spreadsheet. We went through the language and organized it into a variety of facets and for many of those facets, we organized the values into a hierarchy. We managed the taxonomy in a spreadsheet for a while with some success but there were problems (of course):

  1. It was not possible to actually do any meaningful integration from a spreadsheet into any systems (to use the taxonomy);
  2. It was always a challenge to ensure people had access to the most recent view of the taxonomy;
  3. It was hard to meaningfully integrate the taxonomy with source systems that provide many of our labels in the taxonomy (to pull in values from those source systems).

Given this challenge and a developer resource and some good insights about what the taxonomy needed to do, we have created a relatively simple application that has enabled the taxonomy to be much more visible and also much more directly integrated with other systems. Note: It’s very likely that a commercial product would provide what we’ve done and a lot more, but when we set out on this it was not feasible to spend “hard” money on this, so we spent “soft” money in the form of a developer’s time. Perhaps not the best strategy but it’s been successful for our needs so far.

Given the above challenges we had with the “spreadsheet approach”, my primary interest was to solve the problems of access, display and integration; I was not interested in a system that provided a UI for maintaining the taxonomy. (That choice was also supported by the fact that I’ve strived to have most of the taxonomy sourced from business systems, and that the management of the other values has primarily been a one-person job done by someone who was familiar with databases and could update them directly.)

So, the taxonomy system comprises the following components:

  1. A SQL database (built in MySQL to be specific);
  2. A web application that provides a view of what is in the database – basically a mirror of the database structure which is described below;
  3. A set of processes that run on schedules to pull data from source systems into the taxonomy;
  4. An XML output following a formal(ish) specification to allow other systems to pull values from the taxonomy.

In my next post (possibly later today, even), I’ll provide more details on the structure – closer to a data model for the bits and pieces that comprise the entire taxonomy.

December 8th, 2008

Enterprise Taxonomy – A Vision

In my continuing coverage of a variety of content management and knowledge management topics, I thought it time to share some thoughts and experiences on managing an enterprise taxonomy for a corporation. I am planning a few posts on the topic – starting with a vision for the taxonomy that we developed at the start of our efforts and that has helped to guide us, then moving to covering the management process, some insight on usage of the taxonomy and also a description of what the taxonomy looks like.

One important note – a lot of the initial leg work for the taxonomy was done by Frank Montoya and Meredith Lavine, so credit to them for getting things moving.

When we started out in developing an enterprise taxonomy, the company had nothing in place as any kind of content taxonomy – there was an implicit navigational taxonomy for web sites and there was ad hoc taxonomy in “keywords” type fields in a number of content management systems throughout the company. We knew that to be successful, we needed to have more formality to the taxonomy.

As we set about trying to define what we wanted in the taxonomy, we also realized we needed to ensure we were on a common ground for what we were trying to accomplish – otherwise, it was easy to imagine the taxonomy pulled all over the place, making it hard to achieve meaningful results in the long run. We needed some type of common vision for the taxonomy.

In working with a core group of stakeholders, we came up with the following statements as our vision for the enterprise taxonomy.

The Enterprise Taxonomy will:

  • Be adopted for use in all systems that manage content or documents for those classifications that are defined within the Taxonomy
  • Be used to tag content within those systems in order to ensure consistent language to describe our content
  • Enhance the search experience for users through that tagging
  • Be managed as its own asset, including defining the classifications and the values used within those classifications
  • Use appropriate systems of record when possible to define the set of values used for a particular classification
  • Enable monitoring of changes to the taxonomy values by content managers

One note on this vision – it uses the term “classification” in a number of locations. Within our nomenclature, you can read “classification” as meaning the same thing as a “facet” in a faceted taxonomy.

Some of these are pretty straightforward statements, but I thought I’d share a few thoughts on some of them.

First – part of the vision is that the taxonomy is managed as its own asset – what does that mean? It means:

  • The taxonomy is a piece of content (actually, many pieces of content) subject to the same types of business rules we apply to other content.
  • The taxonomy is subject to workflows for review of changes.
  • The taxonomy is subject to periodic reviews.
  • Changes to the taxonomy can be “staged” in the way other content changes can be staged for review.
  • The taxonomy must be treated as an asset with value.

The vision also notes that it will use systems of record. Our taxonomy is broken into many classifications (facets), several of which overlap with other business entities in the company – product lines, solution, geographies, etc. Whenever possible, we literally (in a system, database sense) integrate the taxonomy to pull data from systems of record for those classifications that have a system of record. This provides many advantages:

  • Commonality across systems for users.
  • Standardized language between content tagging language and language used within business intelligence systems.
  • Changes to classifications that have a system of record can be managed using the appropriate business process in that system of record – the taxonomy review process does not need to include these classifications (we assume that the system of record will ensure appropriate reviews are performed).
  • Ownership of the values for these classifications can be kept closer to the business responsible for them. That is, we have enabled a distributed ownership model.
  • This helps minimize which classifications must be reviewed within the taxonomy itself – keeping the taxonomy much more nimble. The classifications that need to have a review within the taxonomy are those that are pretty much purely about content (item type being an example).
  • Eventually, this will enable deeper integration between business intelligence systems and content management systems through direct linkage of business objects (say, a product) to content tagged with that. This linkage can be done using standard database mechanisms. (Something we have not yet implemented, though.)

Given that the taxonomy is managed as an asset, we also felt that it was important that content managers be able to monitor changes within the taxonomy. This means:

  • Content managers have a means to easily find and review all changes being considered to the taxonomy (for classifications managed within the taxonomy – though many classifications managed in a system of record also provide this).
  • Content managers should be able to comment on any specific proposed change.
  • Content managers should be able to inspect any entity (classification, or value, etc.) within the taxonomy and view a life history of it within the taxonomy – when it was added, changed, deprecated, etc.
  • Content managers should be able to view all classifications and values – including ones that are no longer “active” within the taxonomy (they have been deprecated).

So there’s a start to taxonomy. Up next, I’ll provide some insight on the details of what the taxonomy looks like.

December 5th, 2008

Seth Earley on the Fractal Nature of Knowledge

A few weeks back, I was asked by Stan Garfield via email about how I might go about measuring if “knowledge specialization” is increasing – it was a question originally raised by Arnold Kling and Arnold had the hypothesis that increasing knowledge specialization in organizations was making management of those organizations more difficult.

Seth Earley was included on the email thread as well, and, while I replied (only on email – I didn’t post my reply here, though I could if anyone’s interested), I was sure Seth would have some good insights about how to go about grappling with the question.

Yesterday, Seth posted his reply on his blog, which I think highlights a good point about the initial theory – that even trying to analyze the level of specialization in knowledge is tricky because knowledge is fractal – no matter how detailed a look you take at it, there are always levels of detail below that. To quote Seth:

[Knowledge] “is endlessly complex and classification depends on scale and perspective. It’s not a matter of “there should be more categories… “; there are more. It simply depends on where you look and your perspective.”

In my own reply, I had a vague feeling of unease about the idea of measuring increased knowledge specialization but did not think through what it meant; instead, I tried to come up with ways one might try to discern a hypothetical increase in knowledge specialization. I’m glad to see Seth managed to more concisely crystallize the vague unease I had with the question.

I also really liked Seth’s summarization:

“The bottom line is that economic value is created not by understanding where all the knowledge is and micromanaging activities, but by providing broad constraints on targets, problems to solve, competitive differentiation, values, and resources and then creating the right circumstances that allow teams of people to focus knowledge and expertise on solving problems. Knowledge classifications are part of the tools for communicating value and telling the organization when trial and error has produced something that can be reused and applied to solving other problems.”

November 25th, 2008

Additional Community Metrics

My last several posts have been focused on various aspects of community metrics – primarily those derived from the use of a particular tool (mailing lists) used within our communities. While quite fruitful from an analysis perspective, these are not the only metrics we’ve looked at or reported on. In this post, I’ll provide some insights on other metrics we’ve used in case they might be of interest.

Before going on, though, I also wanted to highlight what I’ve found to be an extremely thorough and useful guide covering KPIs for knowledge management from a far more general perspective than just communities – How to Use KPIs in Knowledge Management by Patrick Lambe. I would highly recommend that anyone interested in measuring and evaluating a knowledge management program (or a community of practice initiative specifically) read this document for an excellent overview for a variety of areas. Go ahead… I’ll wait.

OK – Now that you’ve read a very thorough list, I will also direct you to the blog of Miguel Cornejo Castro, who has published on community metrics. I know I’ve seen his paper on this before, but in digging just now I could not seem to come up with a link to it. Hopefully, someone can provide a pointer.

UPDATE:  Miguel was kind enough to provide the link to the paper I was recalling in my mention above: The Macuarium Set of CoP Measurements.  Thanks, Miguel!

If you can provide pointers to additional papers or writings on metrics, please comment here or on the com-prac list.

With that aside, here are some of the additional metrics we’ve used in the past (when we were reporting regularly on the entire program, it was generally done quarterly to give you an idea of the span we looked at each time we assembled this):

  • Usage of intranet-based web sites – specifically, site visits and hits on a community’s site as tracked by our web analytics solution;
  • Intellectual assets produced – specifically, tracking those produced (or significantly updated) and published via one of our repositories;
  • Number of “anecdotes” captured for community members – that is, the one-off “pats on the back” that community members receive – this attempted to capture some of the softer aspects of community value;
  • Number of knowledge share events held – many communities commonly host virtual events (using one of several different webcasting tools) and we tracked those as well as any in-person events;
  • Attendance at community knowledge share events and playback of recordings of webcasts – an attempt to capture how impactful the events were on members;
  • White papers produced – a specific drill into the intellectual assets;
  • For most of these, we also provided insights on quarter-to-quarter change within communities and for the community of practice program overall to give community sponsors / leaders insight on which direction things were moving;
  • We also looked at our corporate wiki for some insights on a couple levels:
    • Using our community member lists, we knew who was a member of a community, so we could analyze content authoring within the wiki by that same group; this provided insight on how much community members contributed to this knowledge base;
    • Within our corporate wiki, authors have the ability to assign articles to categories; one set of such categories was the communities, so we reported on authoring activity and usage of wiki articles that were assigned a category corresponding to one of the communities of practice; this provided insight on the utility of and interest in knowledge associated with the communities.
  • And, finally, we also reported another “softer” piece of data, which was to allow the communities themselves to highlight specific events, results, or issues for the communities.
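For the quarter-to-quarter change mentioned in the list above, a simple percent change is one way to express the direction things are moving. This is only a sketch with made-up numbers; the function name is hypothetical and our reports were not necessarily computed this way.

```python
from typing import List, Optional


def quarter_over_quarter(values: List[float]) -> List[Optional[float]]:
    """Percent change from each quarter to the next.

    Returns None for a quarter whose previous quarter was 0, since the
    percent change is undefined in that case.
    """
    changes: List[Optional[float]] = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append((current - previous) / previous * 100)
    return changes


# Made-up site visits for three consecutive quarters.
site_visits = [1200, 1500, 1350]
print(quarter_over_quarter(site_visits))
```

The same function can be applied to any of the counts above (events held, attendance, white papers produced), one community at a time or for the program overall.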

This is my last planned post on community metrics for now. I will likely return to the topic in the future. I hope the posts have been interesting and also have provided food for thought for your own community programs or efforts.

November 21st, 2008

Visualizing Knowledge Flow in a Community

In my last post, I described some ideas about how to get a sense of knowledge flow within a community using some basic metrics data you can collect. I thought it might be useful to provide a more active visualization of the data from a sample community. As always, data has been obfuscated a bit here but the underlying numbers are mostly accurate – I believe it provides a more compelling “story” of sorts to see data that at least approximates reality.

I knew that Google had provided its own visualization API which provides quite a lot of ways to visualize data, including a “Motion Chart” – which I’d seen in action before and found a fascinating way to present data. So I set about trying to determine a way to use that type of visualization with the metrics I’ve written about here.

The following is the outcome of a first cut at this (requires Flash):

This visualization shows each of the lists associated with a particular community as a circle (if you hover over a circle, you’ll see a pop-up showing that list’s name – you can click on it to have that persist and play with the “Trails” option as well to see the path persist).

The default options should have “Cumulative Usage” on the Y axis, Members on the X axis, “Active Members” as the color and “Usage” as the size.

An interpretation of what you’re seeing – once you push play, lists will move up the Y axis as their total “knowledge flow” grows over time. They’ll move right and left as their membership grows / shrinks. The size of a circle reflects the “flow” at that time – so a large circle also means the circle will move up the Y axis.
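The “Cumulative Usage” axis is simply a running total of the per-period usage figure for a list. A minimal sketch of that step, with made-up numbers and a hypothetical function name:

```python
from itertools import accumulate
from typing import List


def cumulative_usage(usage_per_period: List[float]) -> List[float]:
    """Running total of a list's usage, one entry per time period."""
    return list(accumulate(usage_per_period))


# Made-up usage values for four consecutive periods of one list.
print(cumulative_usage([500, 900, 700, 650]))  # [500, 1400, 2100, 2750]
```

This is why a large circle (high usage in that period) always coincides with a visible jump up the Y axis.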

It’s interesting to see how a list’s impact changes over time – if you watch the list titled “List 9” (which appears about Sept 05 in the playback), you’ll see it has an initial surge and then its impact just sort of pulsates over the next few years. Its final position is higher up than “List 7” (which is present since the start) but you can see that List 7 does see some impact later in the playback.

You can also modify which values show in which part of this visualization – if you try some other options and can produce something more insightful, please let me know!

I may spend some time looking at the other visualization tools available in the Google Visualization API and see if they might provide value in visualizing other types of metrics we’ve gathered over time. If I find something interesting, I’ll post back here.



November 20th, 2008

Measuring Knowledge Flow within a Community of Practice

In my series on metrics about communities of practice, I’ve covered a pretty broad range of topics, including measuring, understanding and acting on:

In this post, I’ll slightly change gears and present some thoughts on a more research-like use of this data. First, an introduction to what drove this thinking.

“Why do we need to provide navigation to communities? There’s nothing going on in them anyway!”

A few years back as we were considering some changes in the navigational architecture on our intranet, I heard the above statement and it made me scratch my head. What did this person mean – there is nothing going on in communities? There sure seemed to be a lot of activity that I could see!

A quick bit of background: Though I have not discussed much about our community program outside of the mailing lists, every community had other resources that they utilized – one of the most common being a site on our intranet. On top of that, at the time of the discussion mentioned above, communities actually had a top spot in the global navigation on our intranet – which provided the typical menu-style navigation to top resources employees needed. One of the top-level menus was labeled “communities” and as sub-menu items, it included a subset of the most strategic / active communities. This was a very nice and direct way to guide employees to these sites (and through them to the other resources available to community members like the mailing lists I’ve discussed).

Back to the discussion at hand – As we were revisiting the navigational architecture, one of the inputs was usage of the various destinations that made up the global navigation. We have a good web analytics solution in place on our intranet (the same we use on our public site) so we had some good insight on usage and I could not argue the point – the intranet sites for the communities simply did not get much traffic.

As I considered this, a thought occurred to me – what we were missing is that we had two distinct ways of viewing “usage” or “activity” (web site usage and mailing list membership / activity) and we were unable to merge them. An immediate question occurred to me – what if, instead of a mailing list tool, we used an online forum tool of some sort (say, phpBB or something similar)? Wouldn’t that merge together these two factors? The act of posting to a forum or reading forums immediately becomes different web-based activities that we could measure, right?

Given the history of mailing list usage within the company, I was not ready to seriously propose that kind of change, but I did set out to try to answer the question – Can we somehow compare mailing list activity to web site usage to be able to merge together this data?

The rest of this post will discuss how I went about this and present some of the details behind what I found.

The Basic Components

The starting point for my thinking was that the rough analogy to make between web sites and mailing lists is that a single post to a mailing list could be thought of as equivalent to a web page. The argument I would make is that (of course, depending on the software used), for a visitor to read a single post using an online forum tool, they would have to visit the page displaying that post. So our first component is

Pc = the number of posts during a given time period for a community

In reality, many tools will combine together a thread into a single page (or, at least, fewer than one page per comment). If you make an assumption that within a community, there’s likely an average number of posts per thread, we could define a constant representing that ratio. So, define:

Rc = the ratio of posts per thread within a community for a given time period

Note that while I did not discuss it in the context of the review of activity metrics, it’s possible with the activity data we are gathering to identify threads and so we can compute Rc.

Tc = total threads within a community for a given time period

Rc = Pc / Tc
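As a small worked example of these three quantities, assume each post in the activity data carries a thread identifier and a poster. That record layout is an assumption for the sketch, as are the function name and the sample posts.

```python
from typing import List, Tuple

# Each post is (thread_id, poster); this layout is assumed for illustration.
Post = Tuple[str, str]


def thread_stats(posts: List[Post]) -> Tuple[int, int, float]:
    """Return (Pc, Tc, Rc) for one community and one time period."""
    p_c = len(posts)                                  # total posts
    t_c = len({thread_id for thread_id, _ in posts})  # distinct threads
    r_c = p_c / t_c if t_c else 0.0                   # posts per thread
    return p_c, t_c, r_c


# Made-up posts: three threads with 3, 2 and 1 posts.
posts = [
    ("t1", "ann"), ("t1", "bob"), ("t1", "ann"),
    ("t2", "cy"), ("t2", "bob"),
    ("t3", "ann"),
]
print(thread_stats(posts))  # (6, 3, 2.0)
```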

Now, how do we make an estimate of how many page views members would generate if they visited the forum instead of having posts show up in their mailbox? The first (rough, and quite poor) guess would be that every member would read every post. This is not realistic and to get an accurate answer would likely require some analysis directly with community members. That being said, I think, within a constant factor, the number of readers can be approximated by the number of active members within the community (it’s true that any active member can be assumed to have read at least some of the posts – their own). A couple more definitions, then:

Mc = the number of members of a community at a given time

Ac = the number of active members within a community for a given time period

In addition to assuming that active members represent a high percentage of readers, I wanted to reflect the readership (which is likely lower) among non-active members (AKA “lurkers”). We know the number of lurkers for a given time period is:

Lc = the number of lurkers within a community over a given time period = (Mc – Ac)

So we can define a factor representing the readership of these lurkers

PRc = the percent of lurkers who would read posts during a given time period (PR means “passive reader”)

Can we approximate PRc for a community from data we are already capturing? At the (fuzzy) level of this argument, I would think that the percentage of active to total members probably is echoed within the lurker community to estimate the number of lurkers who will read any given post in detail:

PRc ~= Ac / Mc
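Putting Lc and the PRc approximation together gives an estimate of how many lurkers read a given post. The sketch below uses made-up membership numbers and a hypothetical function name.

```python
def passive_readers(m_c: int, a_c: int) -> float:
    """Estimated lurkers who read a post: PRc * Lc, with PRc ~= Ac / Mc."""
    if m_c <= 0:
        return 0.0
    l_c = m_c - a_c   # lurkers: members who are not active
    pr_c = a_c / m_c  # assumed share of lurkers who read
    return pr_c * l_c


# Made-up community: 200 members, 40 of them active.
# PRc = 40 / 200 = 0.2 and Lc = 160, so about 32 passive readers.
print(passive_readers(200, 40))
```

Note the two extremes: if nobody is active, or if everybody is active, the estimate of passive readers is zero.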

The Formula

So, with the basic components defined above, the formula that I have worked out for computing a proxy for web site traffic from mailing lists becomes:

Uc = the “usage” of a community as reflected through its mailing list

= Pc * (Ac + PRc * Lc) / Rc

= Pc * (Ac + Ac / Mc * Lc) / Rc

= Pc * (Ac + Ac / Mc * (Mc – Ac)) / Rc

= (2 * Pc * Ac – Pc * Ac^2 / Mc ) / (Pc / Tc)

= (2 * Ac * Tc – Ac^2 * Tc / Mc)

So with that, we have a formula which can help us relate mailing list activity to web site usage (up to some perhaps over-reaching simplifications, I’ll admit!). All of these factors are measurable from the data we are collecting and so I’ll provide a couple of sample charts in the next section.
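To check the algebra, here is a short sketch that computes Uc both from the first line of the derivation and from the simplified last line, using made-up numbers. The function and parameter names are mine, chosen to mirror the symbols above.

```python
import math


def usage(p_c: int, t_c: int, a_c: int, m_c: int) -> float:
    """Uc from the first line of the derivation: Pc * (Ac + PRc * Lc) / Rc."""
    r_c = p_c / t_c   # posts per thread
    l_c = m_c - a_c   # lurkers
    pr_c = a_c / m_c  # assumed passive-reader share
    return p_c * (a_c + pr_c * l_c) / r_c


def usage_simplified(t_c: int, a_c: int, m_c: int) -> float:
    """Uc from the last line: 2 * Ac * Tc - Ac^2 * Tc / Mc."""
    return 2 * a_c * t_c - a_c ** 2 * t_c / m_c


# Made-up numbers: 300 posts in 100 threads, 40 active of 200 members.
full = usage(300, 100, 40, 200)
short = usage_simplified(100, 40, 200)
print(full, short)  # both come to 7200, up to floating-point rounding
assert math.isclose(full, short)
```

One thing the simplified form makes visible: Pc cancels out, so under these assumptions the usage proxy depends only on the number of threads, the active members and the total membership.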

Some Samples

Here are a few samples of measuring this “usage” over a series of quarters in various communities.

As you will see in the samples, this metric shows a wide variance in values between communities, but relative stability of values within a community.

Small Community Usage Metric


The first sample shows data for a small community. As before, I have obfuscated the data a bit, but you can see a big jump early in the lifecycle and then an extended period of low-level usage. The spike represents the formal “launch” of the community, when a first communication went out to potential members and many people joined. The drop-off to low level usage shown here represents, I believe, a challenge for the community to address and to make the community more vital (of course, it could also be that other ways of observing “usage” of the community might expose that it actually is very vital).

The second sample shows data for a large, stable community – you’ll note that the computed value for “usage” is significantly higher here than in the above sample (in the range of around 30,000-40,000 as opposed to a range of 500-1,000 as the small community stabilized around).

Large Community


How does this relate to the title of this post?

Well, after putting the above together, I realized that if you ignore the Rc factor (which converts the measurement of these “member-posts” into a figure purportedly comparable to web page views), you get a number that represents how much of an impact the flow of content through a mailing list has on its members – indirectly, a measure of how much information or knowledge could be passing through a community’s members.

The end result calculation would look something like:

Kc = the knowledge flow within a community for a given period

= (2 * Pc * Ac – Pc * Ac^2 / Mc )

This concept depends on making the (giant) leap that the “knowledge content” of a post is equivalent across all posts, which is obviously not true. For the intellectual argument, though, one could introduce a factor that could be measured for each post and replace Pc (which has the effect of treating the knowledge content of a post as “1”) with the sum of that evaluation of each post across a community (where each post is scored a 0-1 on a scale representing that post’s “knowledge content”).
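A sketch of both variants, assuming per-post scores were available (which, as noted, they are not in our data). The function names and all numbers are made up for illustration.

```python
from typing import Sequence


def knowledge_flow(p_c: float, a_c: int, m_c: int) -> float:
    """Kc = 2 * Pc * Ac - Pc * Ac^2 / Mc."""
    return 2 * p_c * a_c - p_c * a_c ** 2 / m_c


def weighted_knowledge_flow(scores: Sequence[float], a_c: int, m_c: int) -> float:
    """Same formula, with Pc replaced by the sum of per-post scores in [0, 1]."""
    return knowledge_flow(sum(scores), a_c, m_c)


# Made-up community: 300 posts, 40 active of 200 members.
print(knowledge_flow(300, 40, 200))                   # 21600.0
print(weighted_knowledge_flow([1.0] * 300, 40, 200))  # 21600.0 (every post scored 1)
print(weighted_knowledge_flow([0.5] * 300, 40, 200))  # 10800.0 (every post scored 0.5)
```

Scoring every post as 1 reproduces the unweighted Kc, which is the sense in which Pc treats the knowledge content of each post as “1”.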

I have not done that analysis, however (it would be a very subjective and manually intensive task!), and, within an approximation that’s probably no less accurate than all of the assumptions above (said with appropriate tongue-in-cheek), I would say that one could argue that you could multiply Kc by a constant factor (representing the average knowledge content of a community) and have the same effect.

Further, if you use this calculation primarily to compare a community with itself over time, you will likely find that the constant factor does not change over time and you can simply remove it from the calculation (again, with the qualifier that you can then only compare a community to itself!) and you are left with the above definition of Kc.
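One way to see why the constant factor drops out: if each period's Kc is expressed relative to the first period, multiplying every value by the same constant changes nothing. The sketch below uses made-up Kc values, an arbitrary factor of 0.5, and a hypothetical function name.

```python
from typing import List


def relative_flow(kc_by_period: List[float]) -> List[float]:
    """Express each period's Kc relative to the first period.

    Assumes a non-empty list whose first value is not zero.
    """
    base = kc_by_period[0]
    return [k / base for k in kc_by_period]


# Made-up Kc values for three periods of one community.
kc = [21600.0, 24300.0, 18900.0]
# The same values scaled by a constant "average knowledge content" of 0.5.
scaled = [0.5 * k for k in kc]

print(relative_flow(kc))      # [1.0, 1.125, 0.875]
print(relative_flow(scaled))  # [1.0, 1.125, 0.875]
```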

Validating this Analysis

So far, I’ve provided a fairly complicated description of this compound metric and a couple of sample charts that show this metric for a couple of sample communities. Some obvious questions you might be asking:

  • What’s the value in this metric? Is it actionable?
  • How valid is this metric in the sense of really reflecting “usage” (much less any sense of “knowledge flow”)?

To be honest, so far, I have not been very successful in answering these questions. In terms of being actionable – using this data might lend itself to the types of actions you take based on web analytics; however, there is not an obvious (to me) analog to the conversion that is a fundamental component of web analytics. It seems more likely an after-the-fact measure of what happened instead of a forward-looking tool that can help a community manager or community leader focus the community.

In terms of validity, I’m not sure how to go about measuring if this metric is “valid”. Some ideas that come to my mind, at least for things to compare this to, include:

  • Comparing this metric to the actual usage of a community’s web site (via our web analytics tool); do they correlate in some way?
  • Comparing this compound metric to the simpler metric of posts to the community’s mailing lists – how do these compare and why does (or does not) this compound metric provide any better insight?
  • Taking a different approach to this formula – I think understanding how this metric changes as you hold some parts constant and change others would help understand what it “means”.
    • For example, if membership and posts remain the same, but the # of different posters changes, what happens?
    • If posts and active members remain the same but total membership changes, what happens?
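As a first pass at the last idea in the list above, the simplified formula can be evaluated while holding some inputs fixed and varying one. This is only a sketch with made-up numbers and a hypothetical function name, not an analysis of real community data.

```python
def usage(t_c: int, a_c: int, m_c: int) -> float:
    """Simplified usage proxy: Uc = 2 * Ac * Tc - Ac^2 * Tc / Mc."""
    return 2 * a_c * t_c - a_c ** 2 * t_c / m_c


# Hold threads (100) and membership (200) fixed; vary active members.
for a_c in (10, 20, 40, 80):
    print("active members", a_c, "->", usage(100, a_c, 200))

# Hold threads (100) and active members (40) fixed; vary total membership.
for m_c in (100, 200, 400, 800):
    print("members", m_c, "->", usage(100, 40, m_c))
```

With these numbers, the metric rises as the number of different posters rises. It also rises as membership grows with the same active members, but it levels off toward 2 * Ac * Tc, so adding more lurkers has a diminishing effect.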

I’d be very happy to hear from someone who might have some thoughts on how to validate this metric or (perhaps even better) poke holes in it.

Summing Up

Whew! If you’re still with me, you are a brave or stubborn soul! A few thoughts on all of this to summarize:

  • I do believe that this type of analysis could be useful to understand the flow through a community over time; I think it needs significantly more research to get to a better formula, though the outline above could be a starting point;
  • I have not been able to really validate the ideas expressed here in any way except intuitively, so take with an appropriate grain of salt;
  • I think this type of analysis could also be applied in a variety of other contexts – use of a community Wiki, use of a community blog, attendance at “physical space” meetings, attending virtual knowledge share events, use of community workspaces, etc.; I have not tried this, yet, though;
  • With that last comment in mind, I believe that a key idea here is that this type of compound metric provides an avenue to combine the measurement of knowledge sharing across all of a community’s avenues – raising the possibility of providing something like a “Dow Jones Index” for a community’s knowledge sharing – perhaps collapsing down to a single, measurable quantity that you can track over time.
    • And, yes, I do recognize that such a metric is, at best, on shaky ground and likely not really supportable. I raise this idea because I was once asked to generate a single “knowledge sharing index” that would cover the corporation and this type of analysis could lead in that direction. (For the record, when faced with that question, we resisted spending time
November 14th, 2008

Community of Practice Metrics and Membership, Part 5 – Performance Management

My recent posts have been quite long and detailed with examples in terms of how we have been able to understand and analyze community membership and activity for our community of practice initiative. This post is less focused on numbers and more focused on a particular use of this data in a more strategic manner.

Performance Management

Within my employer, we have a (probably pretty typical) performance management program intended to address both career development (a long term view – “what do you want to be when you grow up?”) and also performance (the shorter term view – “what have you done for me lately?”)

We also have an employee management portal (embedded in the larger intranet) where an employee could manage details about their job, work, etc., including recording their development goals (and efforts) and performance (objectives and work to achieve those).  Managers have a view of this that allows them to see their employees’ data.

Communities and Performance Management

As we worked to drive the communities initiative and adoption of communities of practice as a part of the corporate culture, one of the questions that commonly came up was, “How do these communities contribute to my performance? How can I communicate that to my manager?” That could be asked from the perspective of career development (how can my involvement in communities help me grow?) and also for performance (if I am involved in a community, how does it help me achieve my objectives that are used to measure my performance?)

These are all pretty easily answered but, in an objective sense, we found that managers had a challenge in talking with their employees about their involvement in communities. Part of that challenge was that managers did not necessarily “see” their employees’ community involvement (if they were not part of the same community).

Given that we now had our definitions of what a community member is and what an active community member is, it seemed like we could provide some insight to managers from this data and embed that in the employee management portal.

As we were working through this, we found that there was going to be a new component added to the employee management portal labeled “My involvement”, which was intended to capture and display information about how the employee has been involved in the company at large – things like formal recognition they’ve received or recognition they’ve given to others (as part of our employee recognition program) or other ways in which they’ve been “involved”.

This seemed like a perfectly natural place in which we could expose insights to employees and their managers about an employee’s involvement in communities of practice!

So we had a place and the data – it became a simple matter of getting an enhancement into the queue for the employee management portal to expose the data there. It took a few months, but we managed to do that and now employees can view their own involvement and managers can view their employees’ involvement in our communities. The screenshot below shows the part of the employee management portal where an employee or manager can see this view (as with other images, I’ve obscured some of the details a bit here):

Community Involvement in Employee Management Portal


The Value?

So, what has been the value of this exposure? How has it been used?

While this helps to make some of the conversations between manager and employee about community involvement a bit more concrete, we do recognize that this is still a very partial picture of that involvement. There are many ways in which an employee can be involved in, add value to and learn from a community that go beyond this simplistic data. (I’ll write more about this “partial picture” issue in a future post.)

That being said, providing this insight to managers has proved very valuable in prompting discussions between a manager and an employee about the employee’s community involvement – what they have learned (how it has affected their career development) and also how it might have contributed to their performance. This discussion, by itself, has helped employees demonstrate their growth and value in ways that otherwise could have been a challenge.

For managers, this gives them insight into value their employees provide that otherwise would have been difficult to “see”.

For the community of practice program, this type of visibility has had an ancillary effect of encouraging more people to join communities. I suspect (though cannot quantify) that some managers ask employees about the communities of which they are a member and (more importantly in this regard) the ones of which they are not a member (but might be, either by work focus or interest).

Overall, simply including this insight builds an organizational expectation of involvement.