Yesterday, I delivered my presentation at Taxonomy Boot Camp 2010 on “Enterprise Taxonomy: Six Components of a vision”. You can find the presentation on my site here and also on the Taxonomy Boot Camp site here (the latter requires a login you will need to get from the conference).
Some of the most interesting topics for me this week have been about semantic (web) technologies and also some details on the implementation of taxonomy in SharePoint 2010. Good stuff.
In addition, I’ve had the opportunity to meet and re-meet many people who work in the taxonomy space and also in search, so it’s been a very revitalizing experience.
Recently, I have been trying to better understand the language our users use in our search solution, and to work out which tools and techniques can help with that. This is the first post in a planned series about this effort.
I have many goals in pursuing this. The primary goal has been to identify trends across the whole set of language in use by users (and not just the short head). This goal supports the underlying business desire to identify content gaps or, more generally, places where the variety of content available in certain categories does not match the variety users expect (i.e., how do we know when we need to target the creation and publication of specific content?).
Many approaches to this do focus on the short head – typically the top N terms, where N might be 50 or 100 or even 500 (some number that’s manageable). I am interested in identifying ways to understand the language through the whole long tail as well.
As I have dug into this, I realized an important aspect of this problem is to understand how much commonality there is to the language in use by users and also how much the language in use by users changes over time – and this question leads directly to the topic at hand here.
There is an anecdote I have heard many times about the short head of your search log that “80 percent of your searches are accounted for by the top 20% most commonly-used terms”. I now question this and wonder what others have seen.
I have worked closely with several different search solutions in my career and the three I have worked most closely with (and have most detailed insight on) do not come even close to the above assertion. Chart 1 shows the usage curve for one of these. The X axis is the percent of distinct terms (ordered by use) and the Y axis shows the percent of all searches accounted for by all terms up to X.
From this chart, you can see that it takes approximately 55% of distinct terms to account for 80% of all searches – that is a lot of terms!
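For anyone who wants to draw the same kind of curve from their own search log, here is a minimal sketch in Python. It assumes the log has already been reduced to a plain list of search terms, one entry per search; the function names and the tiny sample log are made up for illustration and are not taken from the search solutions described here.

```python
from collections import Counter


def usage_curve(searches):
    """Return (pct_distinct_terms, pct_searches) points, most-used terms first."""
    counts = sorted(Counter(searches).values(), reverse=True)
    total = sum(counts)
    points = []
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        points.append((100.0 * rank / len(counts), 100.0 * running / total))
    return points


def pct_terms_needed(searches, target_pct=80.0):
    """Smallest percent of distinct terms whose searches reach target_pct."""
    for pct_terms, pct_searches in usage_curve(searches):
        if pct_searches >= target_pct:
            return pct_terms
    return 100.0


# Made-up log: 10 searches over 5 distinct terms.
log = ["hr"] * 4 + ["payroll"] * 3 + ["vpn", "badge", "parking"]
print(pct_terms_needed(log))  # 60.0
```

In this toy log it takes 60% of the distinct terms to reach 80% of searches. A real log would also need some normalization first (case, whitespace, and so on), which is left out here.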
This curve shows the usage for one month. I wondered how similar it would be for other months and found (for this particular search solution) that the curves for every month were nearly identical!
Wondering if this was an anomaly, I looked at a second search solution I have close access to, to see whether it might show signs of the “80/20” rule. Chart 2 adds the curve for this second solution (it’s the blue curve – the higher of the two).
In this case, you will find that the curve is “higher” – it reaches 80% of searches at about 37% of distinct terms. However, it is still pretty far from the “80/20” rule!
After looking at this data in more detail, I have realized why I have always been troubled at the idea of paying close attention to only the so-called “short head” – doing so leaves out an incredible amount of data!
Neither search solution comes close to adhering to the “80/20” rule, yet their usage curves are quite different. In trying to understand why, I realize that there are some important distinctions between the two search solutions:
I’m not sure how (or really if) these factor into the shape of these curves.
In understanding this a bit better, I hypothesize two things: 1) the shape of this curve is stable over time for any given search solution, and 2) the shape of this curve tells you something important about how you can manage your search solution. I am planning to dig more to answer hypothesis #1.
Questions for you:
I will be writing more on these search term usage curves in my next post as I dig more into the time-stability of these curves.
Now that I’ve posted quite a bit on the technical side of an enterprise taxonomy, I thought I’d share a bit on the business process side of how we have managed our taxonomy.
I spoke about this topic at the 2007 Taxonomy Boot Camp. (As an aside, I tried to find if the presentation I used is available on the site but I couldn’t find it – if someone knows of an online archive, please let me know and I can provide a link from here.) The session I delivered was titled, “The Process and Politics of Implementing a Corporate Taxonomy” and focused on the overall process we have implemented.
What follows is an overview of the larger process we used to establish the taxonomy and a description of the smaller process used to maintain it and I’ll close with some of my own thoughts on what it is that triggers changes in a taxonomy.
When we first started trying to formalize a taxonomy, one of the first steps we took was to do an organizational mapping to identify participants in the process. We focused on the following:
We felt that this organizational mapping was important because it would help increase buy-in to the taxonomy from those who have most vested interest in it and also (with help from that last group) would help increase larger scale adoption of the language. Once we felt that we had identified the groups that met these criteria, we engaged with the executives for the groups to help us identify one or more people who could be included in our Taxonomy Review Board.
The rest of the “getting started” process included content audits and analyses to identify terminology used to describe the content, definition of the structure of the taxonomy we wanted to use, organization of the terminology into this structure and then working with the Taxonomy Review Board to confirm the end result as a first version of the (evolving) taxonomy.
We also laid out the objectives we had for the overall process – which you can find in my post on the vision we have developed for our taxonomy. The really pertinent items were that the taxonomy be actively managed and that the management process be transparent.
Now that the taxonomy had been established, we needed to identify the people and process we would use for maintaining and enhancing the taxonomy.
The people who are involved include:
This organization has helped to keep the taxonomy managed, while also keeping overall enterprise expense to manage it fairly small.
Now, I am, at heart, a software engineer. Why is this pertinent? Early on in my career, I came to appreciate the need for and value of change control (or, as I prefer to think of it, change management or change visibility – I’ve always thought “control” seemed a bit stronger than you could really achieve), and that has seeped into our process.
At its heart, our process is similar to a software development team’s change control board (CCB) process:
While it has worked effectively, we still face a number of issues with this process. These include:
What triggers a change in the taxonomy?
As I (re-)gather my thoughts on this topic, one lingering question came back to me about the overall process. The question is external to the process (which takes the approach of “a change comes from somewhere and we’re not going to worry about where it comes from but once it’s been identified, we’ll wedge it into this process”) but I am interested in understanding what other taxonomists might actively do in maintaining a taxonomy. In other words, how much change do you experience that comes from others compared to your own recommendations or insights?
Here’s a list of triggers that have resulted in changes in the taxonomy:
I am continuing my dive into the structure of our taxonomy, which I hope will be of use or interest to you to understand and possibly adapt to your own needs. So far, I’ve provided an outline of the application solution and then a high-level outline of the data model we’re using.
One of the important features of our solution is that our taxonomy system provides the ability for other systems to consume the taxonomy via an XML document. I’ll explore that a bit here.
Access to the XML document for the taxonomy is through a very simple means: a standard HTTP GET. The query string in the request can specify various parameters on the URL – effectively, a very simple web service. The types of parameters supported include:
With regard to the language – one of the business rules followed in our web sites is that you provide content in the user’s selected language when available and return English when the user’s language is not available (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details in German and, say, 99% of the values in German but if there are values that are not translated, those are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it does present a possibly confusing same-page mix of non-English and English).
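The per-value fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not our actual implementation: the dictionary shape, the value identifiers and the function name are all assumptions.

```python
def localized_values(values, language, fallback="en"):
    """Pick each value's label in `language`, falling back per value."""
    result = {}
    for value_id, labels in values.items():
        if language in labels:
            result[value_id] = (labels[language], language)
        else:
            # Not translated: return the English label instead.
            result[value_id] = (labels[fallback], fallback)
    return result


# Hypothetical sample data: one value has no German translation.
values = {
    "country-fr": {"en": "France", "de": "Frankreich"},
    "country-jp": {"en": "Japan", "de": "Japan"},
    "region-emea": {"en": "EMEA"},
}

german = localized_values(values, "de")
print(german["country-fr"])   # ('Frankreich', 'de')
print(german["region-emea"])  # ('EMEA', 'en')
```

Returning the language alongside each label is a design choice for the sketch: it lets a consuming user interface notice when it is about to show a same-page mix of languages.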
The returned XML document looks like the following. I’m not using any formal XML schema syntax; instead, I’m showing the elements and how they relate to each other, with a brief description of the elements that I don’t think are self-explanatory.
And that’s the schema. Looks complicated, but it’s really pretty simple, I think. The advantage of this has been that consuming applications do not need to directly access the database containing this (which would be pretty simple in principle) and so can be insulated from changes in the underlying structure of the database as we need to make them.
Providing access via an HTTP GET keeps the technical cost minimal for consuming applications (they need to be able to read from an HTTP socket and then parse XML, both pretty standard functions in modern languages / libraries).
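To give a feel for how little a consumer needs, here is a rough sketch using only the Python standard library. Everything specific in it is hypothetical: the endpoint URL, the `lang` and `detail` parameter names, and the element and attribute names in the sample document are stand-ins for illustration, not our real interface or schema.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://taxonomy.example.com/export"


def build_url(language="en", detail="brief"):
    """Build the GET URL with its query string."""
    return BASE_URL + "?" + urlencode({"lang": language, "detail": detail})


def fetch_taxonomy(url):
    """Read the XML document over a plain HTTP GET."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def classification_names(xml_text):
    """List classification names from a document shaped like SAMPLE below."""
    root = ET.fromstring(xml_text)
    return [c.get("name") for c in root.iter("classification")]


# Made-up sample document; the real schema differs.
SAMPLE = """<taxonomy>
  <classification name="Geography">
    <value name="France"/>
  </classification>
  <classification name="Product Line"/>
</taxonomy>"""

print(build_url("de", "detailed"))
print(classification_names(SAMPLE))  # ['Geography', 'Product Line']
```

A real consumer would call `fetch_taxonomy(build_url(...))` and pass the result to the parser; the sample string just keeps the sketch runnable without a server.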
One last comment, in regard to the level of detail parameter mentioned above: the “brief” level includes only the names, descriptions and statuses of the classifications, levels and values. The “detailed” level includes all details except the changeHistory elements. The “complete” level includes everything, changeHistory elements included. The “complete” format is probably not very useful for consumers, as most will not care about the life history of elements (though that is of interest and value within the taxonomy).
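As a small sketch of how those three levels relate, assume each entity is held as a dictionary. Only `changeHistory` is a name from our document; the other field names and the function are assumptions made for this example.

```python
# Field names other than "changeHistory" are assumptions for this sketch.
BRIEF_FIELDS = {"name", "description", "status"}


def trim_entity(entity, detail="brief"):
    """Return a copy of `entity` with only the fields for `detail`."""
    if detail == "brief":
        return {k: v for k, v in entity.items() if k in BRIEF_FIELDS}
    if detail == "detailed":
        return {k: v for k, v in entity.items() if k != "changeHistory"}
    if detail == "complete":
        return dict(entity)
    raise ValueError("unknown level of detail: " + detail)


# Made-up sample entity.
value = {
    "name": "France",
    "description": "Country in Western Europe",
    "status": "active",
    "synonyms": ["French Republic"],
    "changeHistory": ["created", "renamed"],
}

print(sorted(trim_entity(value, "brief")))     # ['description', 'name', 'status']
print(sorted(trim_entity(value, "detailed")))  # everything except changeHistory
```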
Just to connect the dots – I know of other XML schemas that we could conceivably have used to publish this document. With help from the Taxonomy community of practice, I found the following while researching for a schema to use (I especially want to say thanks to Leonard Will, Mike Taylor, Marcel van Mackelenbergh and Bob Bater for their insights):
At the time we were designing (defining) a schema to use, we knew we wanted to keep it as simple as possible and (right or wrong) as close to the underlying model as we could, which made sense within our business environment. It wasn’t clear at the time which of the above might provide the most likely path forward (in terms of standard adoption) so we “rolled our own”. And, another factor was that the schemas seemed far more general than our needs warranted; for example, the broader-than / narrower-than type relations were implicit in our structure and specifying those explicitly seemed confusing. (To be honest, all of which could be interpreted as “we weren’t educated enough to understand the options and took the simpler-at-the-time approach of rolling our own”.)
I am still not as familiar as I would like to be with the above, so I still would not be able to say which would be most appropriate, but the SKOS schema, now in draft from the W3C, seems like a potential solution that would fit our needs and could eventually become a broader standard. Does anyone have any insights as to where this is moving?
In my previous post, I started describing the structure of the taxonomy we are using in some detail; originally, the following was part of my last post but it got a bit too long so I’ve split it. In this post, I’ll explore the structure in yet more detail – getting closer to a data model.
If you are going through a similar process that we’ve been through and you want to organize your taxonomy in a database, this might provide you with enough detail to get moving.
One note on terminology – much of what we have used is not what I would consider “standard” among taxonomists but was derived during a period when we had numerous systems we were trying to pull together, each of which used one of many different terms – categories, attributes, metadata, fields, tags, etc. I was charged at this point (which was before we started digging into the details of defining an enterprise taxonomy) with trying to define some terms that we could all use so that we could at least understand each other. A taxonomy for taxonomies, I guess.
The primary construct in the taxonomy is called a “Classification”. A better term for this I now know would be “Facet” as that’s what they are. The intent is that a Classification is a specific set of values (perhaps explicitly defined or perhaps defined by a set of guidelines or business rules) with which pieces of content can be associated (they can be tagged with values from the classification).
In our schema, a Classification itself has a number of elements:
Given the definition of a Classification as above, the terminology we use is that the taxonomy is, itself, the set of all Classifications we have defined and which can be used to tag content. As with Classification itself, this is not, I think, consistent with standard usage (the hierarchical structure within any one Classification would be considered a taxonomy) but adopting this definition at least got us organizationally out of the confusion of how we have a taxonomy when all of the values are not in a single, strict hierarchy.
A Value is a single (usually textual, though might be dates or numbers) term which can be associated with a piece of content. Values are grouped into Classifications. A value association to a piece of content is what connects that piece of content to the taxonomy.
Like a Classification, a Value has a structure, which is only used when the Classification provides explicit values:
Within a single Classification, we have adopted a mechanism we refer to as a “Level” in order to have a structure within the Classification when it’s meaningful to have different Values grouped into semantically different sets. I think of this as the means by which we support a structure of Classifications.
A good example is Geography. We have a single classification for Geography which contains all necessary values for tagging content for geographic relevance (or irrelevance in some cases). However, each Value within that Classification might represent a different type of Geography. Some values are regions of the world (“North America” or “EMEA”); some values are Countries (“France” or “Japan”); and some might be areas within a country of use (“Midwest United States”).
A Level is a hierarchy of terms within a Classification and any given Value can be assigned to a Level.
The value of this is that systems using the taxonomy can provide user interfaces that group similar values (a nested, tree-style interface, say) while we do not need to have multiple Classifications with relationships across the Classifications to support this.
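If it helps to see the three constructs side by side, here is a rough sketch of Classification, Level and Value as Python dataclasses, using the Geography example. The attributes shown are a small assumed subset chosen for the example, and the `parent` link between Levels is my own guess at one way to express the hierarchy, not a statement of our actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Level:
    name: str
    parent: Optional["Level"] = None  # assumed: e.g. Country sits under Region


@dataclass
class Value:
    name: str
    level: Optional[Level] = None  # a Value can be assigned to a Level


@dataclass
class Classification:
    name: str
    levels: List[Level] = field(default_factory=list)
    values: List[Value] = field(default_factory=list)

    def values_by_level(self) -> Dict[str, List[str]]:
        """Group value names by Level, as a tree-style UI might."""
        grouped: Dict[str, List[str]] = {}
        for value in self.values:
            key = value.level.name if value.level else "(no level)"
            grouped.setdefault(key, []).append(value.name)
        return grouped


region = Level("Region")
country = Level("Country", parent=region)
geography = Classification(
    name="Geography",
    levels=[region, country],
    values=[
        Value("North America", region),
        Value("EMEA", region),
        Value("France", country),
        Value("Japan", country),
    ],
)
print(geography.values_by_level())
# {'Region': ['North America', 'EMEA'], 'Country': ['France', 'Japan']}
```

The point of the sketch is that one Classification carries all of the Geography values, and the Level on each Value is enough for a consuming system to group them.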
In order to support multiple languages on our web sites, we have provided a means to localize the entire taxonomy. Because localized content is a critical component of our customer-facing site, we provide a structure so that all text that can be used outside of the taxonomy (primarily things like the names and definitions of Classifications, the name and definition for Values, Level names, and even synonyms of each of these) can be localized.
Systems that pull from the taxonomy can then use the available localized terms in their displays (falling back to English if a particular term is not available in a specific language). This could be used in field labels on forms or navigation labels in a browsing interface, menu items, etc.
As I mentioned in my post on a vision for an enterprise taxonomy, the taxonomy should provide transparency and allow interested users to examine the history of changes within the taxonomy. This is accomplished by maintaining a history of audit events which can be associated with any of the entities within the taxonomy (classifications, values, levels, etc). Each event is pretty simple:
With the above, when a user views the taxonomy, they can see the full lifecycle of any given entity in the taxonomy.
The processes that pull taxonomy values from source systems also populate events, so we are gathering these for automated and manually maintained values.
All together, this helps provide interested users with some confidence in what’s changing and why it’s changing. In addition, it provides the ability (not yet exercised) to measure “turbulence” in the taxonomy – amount of change over time, etc.
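Since we have not actually exercised the “turbulence” idea, the following is only a sketch of what it might look like: count audit events per month. The event fields, identifiers and sample dates are all made up for the example and are not our real event structure.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class AuditEvent:
    # Hypothetical fields; the real event structure may differ.
    entity_id: str
    action: str
    occurred_on: date


def turbulence_by_month(events):
    """Count audit events per calendar month, oldest month first."""
    counts = Counter(e.occurred_on.strftime("%Y-%m") for e in events)
    return dict(sorted(counts.items()))


# Made-up sample events.
events = [
    AuditEvent("value:france", "created", date(2008, 1, 15)),
    AuditEvent("value:france", "renamed", date(2008, 1, 20)),
    AuditEvent("classification:geography", "updated", date(2008, 3, 2)),
]
print(turbulence_by_month(events))  # {'2008-01': 2, '2008-03': 1}
```

The same counting could be split by classification or by action type to see where the change is concentrated.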
Up next, I’ll describe the XML schema we use for publishing from the taxonomy.
(Editor’s note – I started this several weeks ago and managed to get myself busy with a lot of other things in the meantime and am finally getting back to it now. Apologies for the lengthy pause in the discussion.)
In my last post, I described the vision we developed for our taxonomy and provided a little bit of insight on how it’s managed. I thought some might find it interesting to understand the structure within the taxonomy at a deeper level.
When we initiated our taxonomy effort, we started (as I think most do) by collecting a lot of the language used throughout our enterprise in a big spreadsheet. We went through the language and organized it into a variety of facets and for many of those facets, we organized the values into a hierarchy. We managed the taxonomy in a spreadsheet for a while with some success but there were problems (of course):
Given this challenge and a developer resource and some good insights about what the taxonomy needed to do, we have created a relatively simple application that has enabled the taxonomy to be much more visible and also much more directly integrated with other systems. Note: It’s very likely that a commercial product would provide what we’ve done and a lot more, but when we set out on this it was not feasible to spend “hard” money on this, so we spent “soft” money in the form of a developer’s time. Perhaps not the best strategy but it’s been successful for our needs so far.
Given the above challenges we had with the “spreadsheet approach”, my primary interest was in solving the problems of access, display and integration. I was not interested in a system that provided a UI for maintaining the taxonomy. Two things supported that choice: I’ve strived to have most of the taxonomy sourced from business systems, and the management of the other values has primarily been a one-person job, done by a person who was familiar with databases and could update them directly.
So, the taxonomy system comprises the following components:
In my next post (possibly later today, even), I’ll provide more details on the structure – closer to a data model for the bits and pieces that comprise the entire taxonomy.
In my continuing coverage of a variety of content management and knowledge management topics, I thought it time to share some thoughts and experiences on managing an enterprise taxonomy for a corporation. I am planning a few posts on the topic – starting with a vision for the taxonomy that we developed at the start of our efforts and that has helped to guide us, then moving to covering the management process, some insight on usage of the taxonomy and also a description of what the taxonomy looks like.
When we started out in developing an enterprise taxonomy, the company had nothing in place as any kind of content taxonomy – there was an implicit navigational taxonomy for web sites and there was ad hoc taxonomy in “keywords” type fields in a number of content management systems throughout the company. We knew that to be successful, we needed to have more formality to the taxonomy.
As we set about trying to define what we wanted in the taxonomy, we also realized we needed to ensure we were on a common ground for what we were trying to accomplish – otherwise, it was easy to imagine the taxonomy pulled all over the place, making it hard to achieve meaningful results in the long run. We needed some type of common vision for the taxonomy.
In working with a core group of stakeholders, we came up with the following statements as our vision for the enterprise taxonomy.
The Enterprise Taxonomy will:
One note on this vision – it uses the term “classification” in a number of locations. Within our nomenclature, you can read “classification” as meaning the same thing as a “facet” in a faceted taxonomy.
Some of these are pretty straightforward statements, but I thought I’d share a few thoughts on some of them.
First – part of the vision is that the taxonomy is managed as its own asset – what does that mean? It means:
The vision also notes that it will use systems of record. Our taxonomy is broken into many classifications (facets), several of which overlap with other business entities in the company – product lines, solution, geographies, etc. Whenever possible, we literally (in a system, database sense) integrate the taxonomy to pull data from systems of record for those classifications that have a system of record. This provides many advantages:
Given that the taxonomy is managed as an asset, we also felt that it was important that content managers be able to monitor changes within the taxonomy. This means:
So there’s a start to taxonomy. Up next, I’ll provide some insight on the details of what the taxonomy looks like.