Lee Romero

On Content, Collaboration and Findability

Enterprise Taxonomy – The Structure in Detail

Tuesday, January 13th, 2009

In my previous post, I started describing the structure of the taxonomy we are using in some detail; originally, the following was part of my last post but it got a bit too long so I’ve split it. In this post, I’ll explore the structure in yet more detail – getting closer to a data model.

If you are going through a process similar to the one we’ve been through and you want to organize your taxonomy in a database, this might provide you with enough detail to get moving.

One note on terminology – much of what we have used is not what I would consider “standard” among taxonomists but was derived during a period when we had numerous systems we were trying to pull together, each of which used one of many different terms – categories, attributes, metadata, fields, tags, etc. I was charged at this point (which was before we started digging into the details of defining an enterprise taxonomy) with trying to define some terms that we could all use so that we could at least understand each other. A taxonomy for taxonomies, I guess.

Classification

The primary construct in the taxonomy is called a “Classification”. I now know that a better term would be “Facet”, as that is what these are. The intent is that a Classification is a specific set of values (perhaps explicitly defined or perhaps defined by a set of guidelines or business rules) with which pieces of content can be associated (that is, content can be tagged with values from the Classification).

In our schema, a Classification itself has a number of elements:

  • Name – The preferred name for the Classification. Typically used as the label for fields on, for example, data entry forms of various sorts.
  • Definition – A concise definition of the Classification. Forcing the explicit definition of this helps reduce fuzzy thinking and gets people to clearly differentiate when a new Classification is needed versus using an existing one. This can be displayed in other systems that allow users to associate classification values with content as a kind of “mini-help”.
  • Life History (create date, modification date, audit trail) – We maintain the create date (actually, date added to the taxonomy) and a modification date so we know what happened when to the Classification. More detail is provided below on the audit trail.
  • Source System – Each classification might be sourced from another system. An example is a product listing – these are not maintained in the taxonomy but in their own systems and the taxonomy simply uses that list. Another example (where we do not have automation) is language (where we reference ISO standards as the master even though the values are still manually maintained in our taxonomy database).
  • Comments – A text field to hold comments for use within the taxonomy. Notes about issues, etc. Not intended for end users as the Definition is.
  • Data Type – The type of values expected for this Classification. Most commonly, just Strings, but we do define (for example) Creation Date and Expiration Date as classifications with data type of Date.
  • Value Indicators – The taxonomy provides indicators to help other systems know what to do with the Classification – Should assignment be constrained to just the values provided by the taxonomy? Should other systems allow content pieces to be associated with multiple values of a classification?
  • Synonyms – We provide for the Classification itself to have synonyms (these are synonyms for the Name of the classification). This can be used when (despite best attempts to the contrary) people want to continue to use different terms for the same classification. An example might be that one system (and its user group) might want to refer to a “Region” whereas another might use the term “Market” or “Area”.
  • Status – We provide a status indicator on pretty much everything within the taxonomy (Classifications, individual values, etc). The usage is consistent and breaks down into:
    • “Active” – the value can be assigned to new/modified content; should be displayed in any type of search UI (say as a pick list) if appropriate; and should be displayed if a user views the taxonomic tagging of an item.
    • “Inactive” – the value should not be able to be assigned to new content or be newly assigned to existing content; it should be displayed in search UIs (if appropriate) and should be displayed if a user views the taxonomic tagging of an item. Basically, it was valid at one point and still has value on content already tagged with it but we do not use it any more.
    • “Deleted” – We don’t delete values physically, but mark them “Deleted”. The value can not be assigned when creating or editing content, it should not be displayed in any search UI and it should not be displayed if a user views the taxonomic tagging of a piece of content. Basically, the value is no longer in the taxonomy (though some systems may still have the value associated with content in some ways).
    • “Proposed” – The first status for most items. The value would only be in the Taxonomy system itself and would not propagate to other systems. Indicates that it’s being considered for adding but has not yet been approved.
  • A set of Classification Levels – Some classifications have an internal structure, described below in the “Level” section.
  • Localizations of Classification – There may be non-English translations of the name and description of a classification in the taxonomy database (see below for more about multiple languages).
  • A set of Classification Values – Most Classifications have a set of explicit values that can be associated with a piece of content. The values might be a flat list or might be hierarchical. The taxonomy database supports both. Currently, we do not support any type of many-to-many relationship or relationships across Classifications – just a simple one-to-many within a Classification, which is a value / sub-value relationship (some Classifications provide more explicit constraints on the intended meaning of the relationship). Also, we do not have a construct that allows for an explicit (in the taxonomy database) meaning for any given relationship (specifically, narrower-than, broader-than, etc.). It’s implicit in the structure of the values.
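
To make the list above more concrete, here is a minimal sketch in Python of a Classification record and the status rules. It is only an illustration under stated assumptions, not the actual schema behind this taxonomy: the field names, the defaults, the helper functions (can_assign, show_in_search, show_in_tagging) and the two boolean "value indicators" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PROPOSED = "Proposed"
    ACTIVE = "Active"
    INACTIVE = "Inactive"
    DELETED = "Deleted"


def can_assign(status: Status) -> bool:
    """Only Active items may be newly assigned to content."""
    return status is Status.ACTIVE


def show_in_search(status: Status) -> bool:
    """Active and Inactive items may appear in search UIs."""
    return status in (Status.ACTIVE, Status.INACTIVE)


def show_in_tagging(status: Status) -> bool:
    """Active and Inactive items appear when viewing an item's tags."""
    return status in (Status.ACTIVE, Status.INACTIVE)


@dataclass
class Classification:
    name: str
    definition: str
    data_type: str = "String"             # e.g. "String" or "Date"
    status: Status = Status.PROPOSED      # first status for most items
    source_system: Optional[str] = None   # None if maintained in the taxonomy
    comments: str = ""                    # internal notes, not for end users
    constrained: bool = True              # assumed indicator: taxonomy values only?
    multi_valued: bool = False            # assumed indicator: many values per item?
    synonyms: List[str] = field(default_factory=list)
    created: date = field(default_factory=date.today)
    modified: date = field(default_factory=date.today)


# Hypothetical example record.
region = Classification(
    name="Region",
    definition="The sales region a piece of content applies to.",
    synonyms=["Market", "Area"],
)
```

Keeping the status rules in small functions, rather than scattering checks through each consuming system, is one way to keep the "usage is consistent" promise described above.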

Given the definition of a Classification as above, the terminology we use is that the taxonomy is, itself, the set of all Classifications we have defined and which can be used to tag content. As with Classification itself, this is not, I think, consistent with standard usage (the hierarchical structure within any one Classification would be considered a taxonomy) but adopting this definition at least got us organizationally out of the confusion of how we have a taxonomy when all of the values are not in a single, strict hierarchy.

Value

A Value is a single (usually textual, though might be dates or numbers) term which can be associated with a piece of content. Values are grouped into Classifications. A value association to a piece of content is what connects that piece of content to the taxonomy.

Like a Classification, a Value has a structure, which is only used when the Classification provides explicit values:

  • ID – the unique identifier within the taxonomy that identifies the value. Most systems using the taxonomy will store this ID as the association (and not the text of the associated value). This allows for the Value to have its textual representation changed without having to revisit any content (say a product name changes or a country’s name changes).
  • Structure details – What classification this value is associated with and which value in this Classification (if any) is the parent of this value. Also, some values have a designated “Level” (see below for more on that).
  • Value – the textual representation of this value. The string users will see and interpret as the “value”.
  • Definition – the definition of this value. As with the classifications, forcing this to be clearly defined provides a good “buffer” against people requesting values to be added that are duplicative or not generally useful. I’m surprised by how often asking a requestor for a clear definition (and how it’s different from another value that seems similar) stops them in their tracks.
  • Life History – same as the Classifications
  • Source System ID – For Classifications whose values come from another system, we maintain the source system’s ID so we can associate it back to the source system for updates. Systems that pull from the taxonomy and also happen (for other business reasons) to pull data from the same source systems can use this ID to cross between the two sets of values.
  • Status – Same as for Classifications
  • Synonyms – Same as for Classifications but applied to the individual values. Synonyms for values are much more common than synonyms for classifications. Systems using the synonyms can potentially do many different things with synonyms (displaying them while a content manager is associating values with content, supporting search on them, etc.)
  • Localization of Value and Definition – Non-English translations of the value and definition. See below for more details.
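
The point about storing the ID rather than the text can be shown with a small Python sketch. Everything here is hypothetical and only for illustration: the Value fields, the sample IDs, the content key "whitepaper-42" and the helpers labels_for and path_to_root are assumptions, not the real database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Value:
    id: int                            # unique within the taxonomy
    classification: str                # name of the owning Classification
    text: str                          # the string users see
    definition: str = ""
    parent_id: Optional[int] = None    # parent value in the same Classification
    source_system_id: Optional[str] = None


# Hypothetical sample values.
values: Dict[int, Value] = {
    1: Value(1, "Geography", "EMEA"),
    2: Value(2, "Geography", "France", parent_id=1),
}

# Content stores only value IDs, never the display text.
content_tags: Dict[str, List[int]] = {"whitepaper-42": [2]}


def labels_for(content_id: str) -> List[str]:
    """Resolve the stored IDs to their current display text."""
    return [values[value_id].text for value_id in content_tags[content_id]]


def path_to_root(value_id: int) -> List[str]:
    """Walk the value / sub-value relationship up to the top-level value."""
    path: List[str] = []
    current: Optional[int] = value_id
    while current is not None:
        path.append(values[current].text)
        current = values[current].parent_id
    return list(reversed(path))


before = labels_for("whitepaper-42")
values[2].text = "French Republic"     # rename the value; content_tags is untouched
after = labels_for("whitepaper-42")
```

After the rename, the content item shows the new label even though nothing stored against the content changed, which is the benefit described in the ID bullet.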

Level

Within a single Classification, we have adopted a mechanism we refer to as a “Level” in order to have a structure within the Classification when it’s meaningful to have different Values grouped into semantically different sets. I think of this as the means by which we support a structure of Classifications.

A good example is Geography. We have a single classification for Geography which contains all necessary values for tagging content for geographic relevance (or irrelevance in some cases). However, each Value within that Classification might represent a different type of Geography. Some values are regions of the world (“North America” or “EMEA”); some values are Countries (“France” or “Japan”); and some might be areas within a country (“Midwest United States”).

A Level is a hierarchy of terms within a Classification and any given Value can be assigned to a Level.

The value of this is that systems using the taxonomy can provide user interfaces that group similar values (a nested, tree-style interface, say) while we do not need to have multiple Classifications with relationships across the Classifications to support this.
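
As a rough illustration of how a consuming system might use Levels, the following Python sketch groups the Geography values mentioned above by their Level. The level names ("Region", "Country", "Area") and the function group_by_level are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class GeoValue(NamedTuple):
    text: str
    level: str    # assumed level names: "Region", "Country", "Area"


geography = [
    GeoValue("North America", "Region"),
    GeoValue("EMEA", "Region"),
    GeoValue("France", "Country"),
    GeoValue("Japan", "Country"),
    GeoValue("Midwest United States", "Area"),
]


def group_by_level(values: Iterable[GeoValue]) -> Dict[str, List[str]]:
    """Group values of one Classification by their Level, keeping input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[value.level].append(value.text)
    return dict(groups)


grouped = group_by_level(geography)
```

A user interface could then render one group per Level (for example, a pick list of countries) while everything still lives in a single Classification.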

Multiple Languages

In order to support multiple languages on our web sites, we have provided a means to localize the entire taxonomy. Because localized content is a critical component of our customer-facing site, we provide a structure so that all text that can be used outside of the taxonomy (primarily things like the names and definitions of Classifications, the name and definition for Values, Level names, and even synonyms of each of these) can be localized.

Systems that pull from the taxonomy can then use the available localized terms in their displays (falling back to English if a particular term is not available in a specific language). This could be used in field labels on forms or navigation labels in a browsing interface, menu items, etc.
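
The fallback behaviour can be sketched in a few lines of Python. The term keys, the language codes and the function name localized are hypothetical; the only rule taken from the text is "use the localized term if present, otherwise fall back to English".

```python
from typing import Dict

# translations[term_key][language_code] -> localized text (sample data only)
translations: Dict[str, Dict[str, str]] = {
    "classification.geography.name": {"en": "Geography", "fr": "Géographie"},
    "value.japan": {"en": "Japan", "fr": "Japon"},
}


def localized(term_key: str, language: str, default_language: str = "en") -> str:
    """Return the term in the requested language, else in the default language."""
    by_language = translations[term_key]
    return by_language.get(language, by_language[default_language])


french_label = localized("classification.geography.name", "fr")
german_label = localized("classification.geography.name", "de")  # no German entry
```

Here german_label falls back to the English text because no German translation exists in the sample data.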

Audit Events

As I mentioned in my post on a vision for an enterprise taxonomy, the taxonomy should provide transparency and allow interested users to examine the history of changes within the taxonomy. This is accomplished by maintaining a history of audit events which can be associated with any of the entities within the taxonomy (classifications, values, levels, etc). Each event is pretty simple:

  • Event type – the type of event that occurred (addition of a new entity, modification of an entity, etc.)
  • Event description – a longer text field describing the event. For entities added / modified manually (as opposed to changes via feed from another system) this description will almost always include a reference to the bug (in our bug database) that describes the change more fully.
  • Date / time of event – When the event occurred
  • User who triggered the event – Who triggered the event
  • Associated entity (the value, classification, level, etc. that changed) – what was changed.

With the above, when a user views the taxonomy, they can see the full lifecycle of any given entity in the taxonomy.

The processes that pull taxonomy values from source systems also populate events, so we are gathering these for automated and manually maintained values.
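
A minimal Python sketch of such an event log, and of looking up the lifecycle of one entity, might look like this. The field names, the sample users ("jdoe", "product-feed"), the bug number and the history helper are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class AuditEvent:
    event_type: str        # e.g. "added", "modified"
    description: str
    occurred_at: datetime
    user: str              # a person or an automated feed
    entity_kind: str       # "classification", "value", "level", ...
    entity_id: int


# Hypothetical sample log mixing manual and automated changes.
audit_log: List[AuditEvent] = [
    AuditEvent("modified", "Renamed value; see bug 1234 (made-up number)",
               datetime(2009, 1, 5, 9, 30), "jdoe", "value", 2),
    AuditEvent("added", "Added via feed from source system",
               datetime(2008, 11, 3, 2, 0), "product-feed", "value", 2),
    AuditEvent("added", "New classification",
               datetime(2008, 10, 1, 12, 0), "jdoe", "classification", 7),
]


def history(entity_kind: str, entity_id: int) -> List[AuditEvent]:
    """Return all events for one entity, oldest first."""
    matching = [
        event for event in audit_log
        if event.entity_kind == entity_kind and event.entity_id == entity_id
    ]
    return sorted(matching, key=lambda event: event.occurred_at)


value_history = history("value", 2)
```

Because the feed process and manual edits write the same kind of record, one query covers both.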

All together, this helps provide interested users with some confidence in what’s changing and why it’s changing. In addition, it provides the ability (not yet exercised) to measure “turbulence” in the taxonomy – amount of change over time, etc.
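
One simple way to put a number on "turbulence" is to count audit events per month. The sketch below is only one possible measure, chosen for the example; the function name changes_per_month and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def changes_per_month(event_times: Iterable[datetime]) -> Dict[str, int]:
    """Count events per calendar month, keyed as 'YYYY-MM', in date order."""
    counts = Counter(moment.strftime("%Y-%m") for moment in event_times)
    return dict(sorted(counts.items()))


# Hypothetical event timestamps.
sample_times = [
    datetime(2008, 11, 3, 2, 0),
    datetime(2008, 11, 20, 14, 15),
    datetime(2009, 1, 13, 9, 0),
]

monthly_changes = changes_per_month(sample_times)
```

The same counting could be narrowed to a single Classification or event type to see where most of the change is happening.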

Up next, I’ll describe the XML schema we use for publishing from the taxonomy.