Lee Romero

On Content, Collaboration and Findability

Archive for January 14th, 2009

Enterprise Taxonomy – An XML schema for Publishing a Taxonomy

Wednesday, January 14th, 2009

In my continuing dive into the structure of our taxonomy, which, hopefully might be of use or interest to you to understand and possibly adopt to your own needs, so far, I’ve provided an outline of the application solution and then a high level outline of the data model we’re using.

One of the important features of our solution is that our taxonomy system provides the ability for other systems to consume the taxonomy via an XML document. I’ll explore that a bit here.

Accessing the XML

Access to the XML document for the taxonomy is through a very simple means: a standard HTTP GET. The query string in the request can specify various parameters on the URL – effectively, a very simple web service. The types of parameters supported include:

  • Identifying which classification is desired (default is to return all)
  • Specifying the statuses of values to include (default will return all)
  • Specifying the language to include (default returns English)
  • Specifying the level of detail of interest (default returns the briefest format)

With regard to the language – one of the business rules followed in our web sites is that you provide content in the user’s selected language when available and return English when the user’s language is not available (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details in German and, say, 99% of the values in German but if there are values that are not translated, those are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it does present a possibly confusing same-page mix of non-English and English).

Document structure

The returned XML document looks like the following. I’m not using any formal XML schema syntax – instead showing the elements and how they relate to each other with a brief description of th elements that I don’t think are self-explanatory.

  • taxonomy
    • classification – has an attribute id (the ID of the classification)
      • name – has an attribute lang (the language code describing the language of the name element)
      • description – has an attribute lang (the language code describing the language of the description element)
      • status
      • createDate
      • updateDate
      • sourceSystem
      • comments
      • hasValues (a Y/N indicating if a consuming application should expect to find values in the values element)
      • constrained (a Y/N indicating if a consuming application should enforce the rule that values for this classification must come from the list of values provided)
      • multiValued (a Y/N indicating if a consuming application should allow multiple values be assigned for any given content piece)
      • dataType
      • changeHistory – an element with a sequence of elements, one for each auditable event in this item’s life history
      • aliases – has attribute count (the number of alias elements included)
        • alias – a structured element providing details on an alias
      • levels – has an attribute count (the number of levels included)
        • level – a structured element providing details on the level (omitted here)
      • values – has an attribute count (the number of values included)
        • value – has an atribute id (the ID of the value in the taxonomy system)
          • name – has an attribute lang (the language code describing the language of the name element)
          • description – has an attribute lang (the language code describing the language of the description element)
          • status
          • createDate
          • updateDate
          • sourceSystemId
          • levelRef – attribute id (identifies the specific level [in the levels element above] with which this value is associated)
          • aliases – attribute count (the number of aliases for this value)
            • alias – a structured element providing details on an alias
          • changeHistory – Same as for classification
          • values – recursive structure reflecting hierarchy within a classification’s set of values
            • value (etc.)

And that’s the schema. Looks complicated, but it’s really pretty simple, I think. The advantage of this has been that consuming applications do not need to directly access the database containing this (which would be pretty simple in principle) and so can be insulated from changes in the underlying structure of the database as we need to make them.

Providing access via an HTTP get keeps the technical cost minimal for consuming applications (they need to be able to read from an HTTP socket and then parse XML, both pretty standard functions in modern languages / libraries).

One last comment – in regard to the level of detail parameter mentioned above – the “brief” level includes the names , descriptions and statuses only of the classifications, levels and values.  The “detailed” includes all details except the changeHistory elements.  The “complete” level includes all of the above.  The “complete” format is probably not very useful for consumers as most will not care about the life history of elements (though that is of interest and value within the taxonomy).

Relationship to other Schemas

Just to connect the dots – I know of other XML schemas that we could conceivably have used to publish this document.  With help from the Taxonomy community of practice, I found the following while researching for a schema to use (I especially want to say thanks to Leonard Will, Mike Taylor, Marcel van Mackelenbergh and Bob Bater for their insights):

At the time we were designing (defining) a schema to use, we knew we wanted to keep it as simple as possible and (right or wrong) as close to the underlying model as we could, which made sense within our business environment. It wasn’t clear at the time which of the above might provide the most likely path forward (in terms of standard adoption) so we “rolled our own”. And, another factor was that the schemas seemed far more general than our needs warranted; for example, the broader-than / narrower-than type relations were implicit in our structure and specifying those explicitly seemed confusing. (To be honest, all of which could be interpreted as “we weren’t educated enough to understand the options and took the simpler-at-the-time approach of rolling our own”.)

I am still not as familiar as I would like to be with the above, so I still would not be able to say which would be most appropriate, but the SKOS schema, now in draft from the W3C seems like a potential solution that would fit our needs and could eventually become a broader standard.  Does anyone have any insights as to where this is moving?