Enterprise Taxonomy – An XML schema for Publishing a Taxonomy
This post continues my dive into the structure of our taxonomy, which I hope will be useful or interesting enough for you to understand and possibly adapt to your own needs. So far, I’ve provided an outline of the application solution and then a high-level outline of the data model we’re using.
One of the important features of our solution is that our taxonomy system provides the ability for other systems to consume the taxonomy via an XML document. I’ll explore that a bit here.
Accessing the XML
Access to the XML document for the taxonomy is through a very simple means: a standard HTTP GET. The query string in the request can specify various parameters on the URL – effectively, a very simple web service. The types of parameters supported include:
- Identifying which classification is desired (default is to return all)
- Specifying the statuses of values to include (default will return all)
- Specifying the language to include (default returns English)
- Specifying the level of detail of interest (default returns the briefest format)
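As a rough illustration of the parameters above, a consuming application might assemble the request URL as follows. This is only a sketch: the host name and the parameter names (`classification`, `status`, `lang`, `detail`) are invented for the example and are not the actual names used by the service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real URL is not given in this post.
BASE_URL = "http://taxonomy.example.com/taxonomy.xml"


def build_taxonomy_url(classification=None, statuses=None, lang=None, detail=None):
    """Build the GET URL, omitting any parameter the caller did not set
    so that the service-side defaults apply."""
    params = {}  # parameter names below are assumptions
    if classification is not None:
        params["classification"] = classification
    if statuses:
        params["status"] = ",".join(statuses)
    if lang is not None:
        params["lang"] = lang
    if detail is not None:
        params["detail"] = detail
    query = urlencode(params)
    return f"{BASE_URL}?{query}" if query else BASE_URL


# One classification, German where available, default level of detail.
url = build_taxonomy_url(classification="42", lang="de")
```

Leaving a parameter out entirely, rather than sending an empty value, is what lets the defaults listed above take effect.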
With regard to the language – one of the business rules followed in our web sites is that you provide content in the user’s selected language when available and return English when the user’s language is not available (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details in German and, say, 99% of the values in German but if there are values that are not translated, those are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it does present a possibly confusing same-page mix of non-English and English).
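The per-value fallback rule can be sketched in a few lines. This is an illustration of the rule rather than the actual implementation; the function name, the dictionary shape and the language codes `"de"` and `"en"` are assumptions made for the example.

```python
def pick_text(translations, requested_lang, fallback_lang="en"):
    """Return (lang, text) in the requested language when a translation
    exists, otherwise in the fallback language (assumed always present)."""
    if requested_lang in translations:
        return requested_lang, translations[requested_lang]
    return fallback_lang, translations[fallback_lang]


# Hypothetical value names keyed by language code.
translated = {"en": "Printers", "de": "Drucker"}
untranslated = {"en": "Scanners"}

german = pick_text(translated, "de")      # translation exists
fallback = pick_text(untranslated, "de")  # falls back to English
```

Returning the language code together with the text mirrors the `lang` attribute on the `name` and `description` elements, so a consumer can tell which values came back in English.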
Document structure
The returned XML document looks like the following. I’m not using any formal XML schema syntax; instead, I show the elements and how they relate to each other, with a brief description of the elements that I don’t think are self-explanatory.
- taxonomy
  - classification – has an attribute id (the ID of the classification)
    - name – has an attribute lang (the language code describing the language of the name element)
    - description – has an attribute lang (the language code describing the language of the description element)
    - status
    - createDate
    - updateDate
    - sourceSystem
    - comments
    - hasValues (a Y/N indicating if a consuming application should expect to find values in the values element)
    - constrained (a Y/N indicating if a consuming application should enforce the rule that values for this classification must come from the list of values provided)
    - multiValued (a Y/N indicating if a consuming application should allow multiple values to be assigned for any given content piece)
    - dataType
    - changeHistory – an element with a sequence of elements, one for each auditable event in this item’s life history
    - aliases – has an attribute count (the number of alias elements included)
      - alias – a structured element providing details on an alias
    - levels – has an attribute count (the number of levels included)
      - level – a structured element providing details on the level (omitted here)
    - values – has an attribute count (the number of values included)
      - value – has an attribute id (the ID of the value in the taxonomy system)
        - name – has an attribute lang (the language code describing the language of the name element)
        - description – has an attribute lang (the language code describing the language of the description element)
        - status
        - createDate
        - updateDate
        - sourceSystemId
        - levelRef – has an attribute id (identifies the specific level [in the levels element above] with which this value is associated)
        - aliases – has an attribute count (the number of aliases for this value)
          - alias – a structured element providing details on an alias
        - changeHistory – same as for classification
        - values – recursive structure reflecting hierarchy within a classification’s set of values
          - value (etc.)
      - value – has an attribute id (the ID of the value in the taxonomy system)
  - classification – has an attribute id (the ID of the classification)
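To make the outline concrete, here is a small made-up document in that shape, roughly at the briefest level of detail, together with a sketch of how a consumer might walk the recursive values structure. The IDs, names and status text are invented, and the `count` attribute is assumed here to count direct children only.

```python
import xml.etree.ElementTree as ET

# Invented sample data following the element outline above.
SAMPLE = """\
<taxonomy>
  <classification id="c1">
    <name lang="en">Product</name>
    <status>Active</status>
    <values count="2">
      <value id="v1">
        <name lang="en">Printers</name>
        <status>Active</status>
        <values count="1">
          <value id="v2">
            <name lang="en">Laser printers</name>
            <status>Active</status>
          </value>
        </values>
      </value>
      <value id="v3">
        <name lang="en">Scanners</name>
        <status>Active</status>
      </value>
    </values>
  </classification>
</taxonomy>
"""


def walk_values(values_elem, depth=0):
    """Yield (depth, id, lang, name) for each value, descending into
    any nested values element to follow the hierarchy."""
    if values_elem is None:
        return
    for value in values_elem.findall("value"):
        name = value.find("name")
        yield depth, value.get("id"), name.get("lang"), name.text
        yield from walk_values(value.find("values"), depth + 1)


root = ET.fromstring(SAMPLE)
rows = []
for classification in root.findall("classification"):
    rows.extend(walk_values(classification.find("values")))

for depth, value_id, lang, name in rows:
    print("  " * depth + f"{value_id} [{lang}] {name}")
```

Because a `values` element can appear inside any `value`, a recursive walk like this handles hierarchies of any depth without knowing the number of levels in advance.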
And that’s the schema. Looks complicated, but it’s really pretty simple, I think. The advantage of this has been that consuming applications do not need to directly access the database containing this (which would be pretty simple in principle) and so can be insulated from changes in the underlying structure of the database as we need to make them.
Providing access via an HTTP GET keeps the technical cost minimal for consuming applications (they need to be able to read from an HTTP socket and then parse XML, both pretty standard functions in modern languages / libraries).
One last comment, regarding the level-of-detail parameter mentioned above: the “brief” level includes only the names, descriptions and statuses of the classifications, levels and values. The “detailed” level includes all details except the changeHistory elements. The “complete” level includes everything, changeHistory elements as well. The “complete” format is probably not very useful for consumers, as most will not care about the life history of elements (though that is of interest and value within the taxonomy).
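The three levels can be thought of as progressively looser filters over the same document. The sketch below is one reading of that description, not the service’s actual code: the function names are made up, and treating `aliases` as excluded from “brief” is an assumption.

```python
import xml.etree.ElementTree as ET

# Fields kept at the "brief" level, plus the structural elements
# needed to hold them (an interpretation of the description above).
BRIEF_FIELDS = {"name", "description", "status"}
CONTAINERS = {"taxonomy", "classification", "levels", "level", "values", "value"}


def keep(tag, detail):
    """Decide whether an element with this tag survives at a detail level."""
    if detail == "complete":
        return True
    if detail == "detailed":
        return tag != "changeHistory"
    if detail == "brief":
        return tag in BRIEF_FIELDS or tag in CONTAINERS
    raise ValueError(f"unknown detail level: {detail!r}")


def prune(elem, detail):
    """Return a trimmed copy of elem, leaving the original untouched."""
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = elem.text
    for child in elem:
        if keep(child.tag, detail):
            copy.append(prune(child, detail))
    return copy


# Invented fragment to show the effect of each level.
full = ET.fromstring(
    '<classification id="c1"><name lang="en">Product</name>'
    "<status>Active</status><dataType>text</dataType>"
    "<changeHistory><event/></changeHistory></classification>"
)
brief_tags = [e.tag for e in prune(full, "brief")]
detailed_tags = [e.tag for e in prune(full, "detailed")]
```

Here `brief_tags` keeps only `name` and `status`, while `detailed_tags` also keeps `dataType` and drops only `changeHistory`.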
Relationship to other Schemas
Just to connect the dots – I know of other XML schemas that we could conceivably have used to publish this document. With help from the Taxonomy community of practice, I found the following while researching for a schema to use (I especially want to say thanks to Leonard Will, Mike Taylor, Marcel van Mackelenbergh and Bob Bater for their insights):
- RDF-XML – A (relatively) low-level schema used to publish facts (ontologies when used in whole)
- SKOS – A schema from the W3C, in draft as I write this, for sharing knowledge organization systems. Based on RDF.
- Zthes – An XML schema for publishing thesauri.
- Topic Maps – An XML schema for representing topics and relationships among topics (Bill French also mentioned this in a comment on my previous post.)
- A good list of references available from the ANSI/NISO Z39.19 – Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
At the time we were designing (defining) a schema to use, we knew we wanted to keep it as simple as possible and (right or wrong) as close to the underlying model as we could, which made sense within our business environment. It wasn’t clear at the time which of the above might provide the most likely path forward (in terms of standard adoption) so we “rolled our own”. And, another factor was that the schemas seemed far more general than our needs warranted; for example, the broader-than / narrower-than type relations were implicit in our structure and specifying those explicitly seemed confusing. (To be honest, all of which could be interpreted as “we weren’t educated enough to understand the options and took the simpler-at-the-time approach of rolling our own”.)
I am still not as familiar as I would like to be with the above, so I still would not be able to say which would be most appropriate, but the SKOS schema, now in draft from the W3C, seems like a potential solution that would fit our needs and could eventually become a broader standard. Does anyone have any insights as to where this is moving?
January 15th, 2009 at 10:06 am
In developing the British Standard for structured vocabularies, BS8723, we produced a data model for a thesaurus, as well as an XML schema derived from it. These were published in DD8723 (produced as a “Draft for Development” rather than a British Standard as we did not consider that it was sufficiently developed at that stage). This is available at http://schemas.bs8723.org/, and comments are welcome.
Subsequently, on work for the corresponding international standard ISO25964, we developed the model further. As the ISO standard is still at draft stage we cannot make it public (regrettably!) but you can see the current draft data model attached to a message I sent to the SKOS mailing list at http://lists.w3.org/Archives/Public/public-esw-thes/2008Dec/0003.html
We think that this model represents all that is needed for a standard thesaurus, including several elements which are not covered by SKOS. It may be expanded later to cover other types of structured vocabulary such as classification schemes and other systems using pre-coordinated headings.
Do you think that this gives a sound foundation on which XML schemas or other exchange formats can be built?
Leonard Will
January 15th, 2009 at 2:32 pm
Hi Leonard – Thank you very much for providing insight on another alternative (one I missed :-/).
I just took a look (admittedly not as deep as I could) at the Draft on http://schemas.bs8723.org/. It does seem to present a good foundation.
Applying it to the needs I’ve had to address, it does appear that the availability of the ThesaurusConcept structure (I’d relate that to the “value” concept I use) and the ThesaurusArray structure (which I’d roughly equate to the “Classification” and probably the “Classification Level” concept I use) cover all of the needs we would have. The hierarchy we need to express between values is inherent in the ThesaurusConcept (which, in fact, is *far* more general than the hierarchy we have in use).
Not being deeply familiar with this or SKOS – can you share some insight (a pointer to a document, perhaps?) that provides details on how this differs?
Thanks again for the pointer!
January 15th, 2009 at 3:02 pm
Lee –
I’d encourage you to look at the draft ISO model that I referred to rather than the one on the BSI site, as the later model has the important addition of a “conceptGroup” structure. This is more appropriate for development into a classification scheme than the “thesaurusArray” structure. The idea of an array is that it is a group of sibling concepts, sharing a common parent to which they have a BT/NT relationship. These concepts must therefore necessarily all be of the same fundamental type, i.e. must all be in the same facet. The conceptGroup was introduced to allow concepts to be grouped without this restriction – all concepts relating to a subdivision of the subject area, for example – these groups are sometimes called “microthesauri”, “themes” or the like.
You will find more information in DD8723, available from the British Standards Institution, but unfortunately they charge for it. The ISO draft is more up-to-date, but will not be available for public comment until later this year, we hope. Again we are frustrated by the policy of these standardising bodies in not allowing us to distribute drafts for comment until they are issued officially to those who choose to pay for them. 🙁
If you don’t mind sitting through audio recordings with accompanying PowerPoint slides you might find it useful to look at the record of a meeting of ISKO UK held last year at:
http://www.iskouk.org/SKOS_July2008.htm
Leonard
January 15th, 2009 at 3:16 pm
Thanks again, Leonard – I looked at the model you’d linked to in the mailing list post but hadn’t looked closely enough to realize what the differences were. And, yes, that looks much closer to the structure we use (or, at least I can more easily see how it would translate to it).