A Thesaurus Data Model for British Standard 8723

by Alistair Miles

The working group producing the new BS 8723 standard for thesauri (structured vocabularies) is currently focusing on the issue of standard formats for interchange of thesaurus data. At a recent meeting it was concluded that a (semi-)formal data model for thesaurus data, using some sort of established modeling language, would be a good starting point.

Here is my first attempt to use UML to capture the data model expressed informally as prose in BS 8723 part 2 (monolingual thesauri). The UML was generated using StarUML, which is free, and I read this tutorial on UML. I’ve tried to be as faithful to BS 8723 part 2 as possible, capturing no more than what is expressed therein and adding no interpretation …

[Download the .uml file for StarUML – includes diagrams for both “Model A” and “Model B” described in the next posting.]

Model A (Minimal)

The class diagram for the data model is below (the XMI is at this link).

This model expresses the following features of a monolingual thesaurus (note that the word “association” has a special meaning in UML – it simply means a relationship) …

  • A monolingual thesaurus is associated with one language.
  • A monolingual thesaurus contains one or more thesaurus terms.
  • A thesaurus term is contained by one thesaurus.
  • A thesaurus term has one lexical value, which is a string.
  • The thesaurus term class is a generalisation of preferred terms and of non-preferred terms.
  • A preferred term is associated with zero or more non-preferred terms (equivalent).
  • A non-preferred term is associated with one preferred term (equivalent).
  • The equivalent association has two roles: UF (used for) and USE (use).
  • A preferred term is associated with zero or more other preferred terms (broader/narrower).
  • The broader/narrower association has two roles: BT (broader term) and NT (narrower term).
  • A preferred term is associated with zero or more other preferred terms (related).
  • The related association has two roles, both called RT (related term).
  • A thesaurus term has zero or more annotations (notes).
  • The note class is a generalisation of scope notes, definitions, history notes and editorial notes.
  • A monolingual thesaurus has zero or more thesaurus arrays.
  • A thesaurus array contains one or more preferred terms in some order.
  • A thesaurus array has zero or one node label, which is a string.
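To make the list above more concrete, here is a rough sketch of how those features might look as Python classes. It is only an illustration, not part of BS 8723 or of the UML model: every class, attribute and function name is an assumption chosen for this example, and the "one or more" multiplicities are noted in comments rather than enforced.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Note:
    """An annotation on a term; the subclasses are the four note types."""
    text: str


class ScopeNote(Note):
    pass


class Definition(Note):
    pass


class HistoryNote(Note):
    pass


class EditorialNote(Note):
    pass


@dataclass(eq=False)  # compare terms by identity, not by field values
class ThesaurusTerm:
    lexical_value: str                                # exactly one string
    notes: List[Note] = field(default_factory=list)   # zero or more notes


@dataclass(eq=False)
class PreferredTerm(ThesaurusTerm):
    # repr=False keeps printed output short when terms refer to each other
    uf: List["NonPreferredTerm"] = field(default_factory=list, repr=False)
    bt: List["PreferredTerm"] = field(default_factory=list, repr=False)
    nt: List["PreferredTerm"] = field(default_factory=list, repr=False)
    rt: List["PreferredTerm"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class NonPreferredTerm(ThesaurusTerm):
    # The model requires exactly one USE target; None only means "not linked yet".
    use: Optional[PreferredTerm] = field(default=None, repr=False)


@dataclass(eq=False)
class ThesaurusArray:
    members: List[PreferredTerm]        # ordered; the model requires one or more
    node_label: Optional[str] = None    # zero or one node label


@dataclass(eq=False)
class MonolingualThesaurus:
    language: str                                             # exactly one language
    terms: List[ThesaurusTerm] = field(default_factory=list)  # model: one or more
    arrays: List[ThesaurusArray] = field(default_factory=list)


def link_equivalent(preferred: PreferredTerm, non_preferred: NonPreferredTerm) -> None:
    """Set both roles (UF and USE) of the equivalent association."""
    non_preferred.use = preferred
    preferred.uf.append(non_preferred)


def link_hierarchy(narrower: PreferredTerm, broader: PreferredTerm) -> None:
    """Set both roles (BT and NT) of the broader/narrower association."""
    narrower.bt.append(broader)
    broader.nt.append(narrower)


def link_related(a: PreferredTerm, b: PreferredTerm) -> None:
    """Set the RT role on both ends of the related association."""
    a.rt.append(b)
    b.rt.append(a)


# A tiny example thesaurus (the terms are made up).
cats = PreferredTerm("cats", notes=[ScopeNote("Domestic cats only")])
felines = NonPreferredTerm("felines")
mammals = PreferredTerm("mammals")
link_equivalent(cats, felines)
link_hierarchy(narrower=cats, broader=mammals)
thesaurus = MonolingualThesaurus("en", terms=[cats, felines, mammals])
```

The helper functions exist because each association has two roles; setting both ends in one place keeps UF/USE and BT/NT from drifting apart.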

This model does not express some features …

  • N-ary relationships between preferred and non-preferred terms (e.g. USE A + B).
  • Specialisations of the broader/narrower relationship (e.g. broader generic, broader partitive … )
  • The constraint that the transitive closure of the broader/narrower relationship should be irreflexive (i.e. there shouldn’t be any loops in the hierarchy).
  • The constraint that the transitive closure of the broader/narrower relationship should be disjoint with the related relationship (i.e. a term cannot be related to another term and at the same time broader or narrower than that term, directly or indirectly).
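The last two constraints can at least be checked in code, even though the class diagram does not express them. The following is a minimal sketch under some assumptions: terms are represented simply by their lexical values, `bt` maps each term to the set of its directly broader terms, and `rt` is a collection of related pairs. The function names are invented for this example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def broader_closure(bt: Dict[str, Set[str]], term: str) -> Set[str]:
    """Return every term reachable from `term` by following BT one or more times."""
    seen: Set[str] = set()
    stack = list(bt.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(bt.get(current, ()))
    return seen


def terms_in_loops(bt: Dict[str, Set[str]]) -> Set[str]:
    """Terms that are (directly or indirectly) broader than themselves."""
    return {term for term in bt if term in broader_closure(bt, term)}


def related_conflicts(
    bt: Dict[str, Set[str]], rt: Iterable[Tuple[str, str]]
) -> Set[FrozenSet[str]]:
    """Related pairs where one term is also broader or narrower than the other."""
    conflicts: Set[FrozenSet[str]] = set()
    for a, b in rt:
        if b in broader_closure(bt, a) or a in broader_closure(bt, b):
            conflicts.add(frozenset((a, b)))
    return conflicts


# Example data (made up): cats BT mammals BT animals.
hierarchy = {"cats": {"mammals"}, "mammals": {"animals"}}
loops = terms_in_loops(hierarchy)
conflicts = related_conflicts(hierarchy, [("cats", "animals"), ("cats", "dogs")])
```

Here `loops` is empty because the hierarchy has no cycle, while `conflicts` contains the pair cats/animals, since "animals" is indirectly broader than "cats".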

Note also that the model makes the assumption that any type of note may be associated with either type of term. BS 8723 part 2 suggests that scope notes and definitions may only be associated with preferred terms, but does not explicitly prohibit their use with non-preferred terms.

What’s the Point of a Model?

Why bother with a UML model? Why not just write a W3C XML schema or DTD and be done with it?

I find myself wrestling with this question, because the answer is not as obvious as I would like to think. Ultimately, the goal is to provide a way of passing data between systems that, internally, store and manage data in different ways – and, crucially, to make sure that each of these systems interprets the data it receives “in the correct way”. Interpreting data correctly (i.e. as intended by the sender) depends on sharing an understanding of the conceptual and logical model underlying the data itself.

A UML class diagram is undoubtedly a pretty good way of expressing a conceptual model for data, in a way that is independent of any specific implementation paradigm. There are fairly intuitive ways of translating from a UML class diagram to an XML schema or a relational structure, and of course there are precise ways of deriving a set of object-oriented class definitions. However, as far as I know, there is no way of telling with mathematical certainty that an XML schema or a relational structure has the same expressivity as a UML class diagram. Ultimately, a person has to design the schema or the relational model, and the best they can hope for is to demonstrate that it has the same expressivity by means of a set of test cases.

This is a major problem for a standardisation initiative, because one cannot prove with absolute certainty that any particular serialisation format has at least the same expressivity as the conceptual model – i.e. one cannot prove “conformance”. The best one can do is demonstrate a lossless bi-directional transformation for a set of test data. Good test data – data that exercises all of the expressive features of the model – would have to be carefully engineered by hand.
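A round-trip test of this kind might look something like the sketch below. The XML element and attribute names (`thesaurus`, `term`, `label`, `lang`) and the dictionary layout are hypothetical, invented purely for this illustration; they are not a proposed BS 8723 format. Only the USE, BT and RT roles are serialised, on the assumption that UF and NT can be derived as their inverses.

```python
import xml.etree.ElementTree as ET


def to_xml(thesaurus: dict) -> str:
    """Serialise a thesaurus dictionary to a (hypothetical) XML format."""
    root = ET.Element("thesaurus", {"lang": thesaurus["lang"]})
    for label in sorted(thesaurus["terms"]):
        term = thesaurus["terms"][label]
        element = ET.SubElement(root, "term", {"label": label})
        for role in ("USE", "BT", "RT"):
            for target in sorted(term.get(role, ())):
                ET.SubElement(element, role).text = target
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict:
    """Parse the XML produced by to_xml back into a dictionary."""
    root = ET.fromstring(text)
    terms = {}
    for element in root.findall("term"):
        term = {}
        for child in element:
            term.setdefault(child.tag, set()).add(child.text)
        terms[element.get("label")] = term
    return {"lang": root.get("lang"), "terms": terms}


def round_trips(thesaurus: dict) -> bool:
    """True if serialising and parsing again gives back the same data."""
    return from_xml(to_xml(thesaurus)) == thesaurus


# Hand-made test data (made up) exercising USE, BT and RT.
sample = {
    "lang": "en",
    "terms": {
        "animals": {},
        "mammals": {"BT": {"animals"}},
        "cats": {"BT": {"mammals"}, "RT": {"dogs"}},
        "dogs": {"BT": {"mammals"}, "RT": {"cats"}},
        "felines": {"USE": {"cats"}},
    },
}
sample_ok = round_trips(sample)
```

The sample above survives the round trip, but that says nothing about inputs the sample does not exercise. For instance, a term recorded with an explicitly empty BT set comes back with no BT entry at all, so `round_trips` reports a difference for it. That is exactly why the test data has to be engineered to cover every expressive feature of the model.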

So the conclusion I am left with is this: a UML model provides a good way for people to share a conceptual model. People can use that conceptual model to guide their design and interpretation of data structures within applications, and of the serialisation formats used to pass data between applications, which improves their chances of writing code that makes sense. The closest we can currently get to objectively assessing whether the code makes sense, however, is a set of test cases.

I found the following links helpful when researching this problem …

[1] Modeling XML Vocabularies with UML: Part I
[2] Modeling XML Vocabularies with UML: Part II
[3] Modeling XML Vocabularies with UML: Part III