A Thesaurus Data Model for British Standard 8723 (Part 2)

by Alistair Miles

Continuing on from my initial exploration of using UML to capture the monolingual thesaurus data model described in BS 8723 part 2 (written up here), below is an alternative UML model attempting to represent the underlying conceptual structure of a monolingual thesaurus. This model is more complicated, so I’ve broken it into separate class diagrams for easier viewing …

[Download the .uml file for StarUML – includes diagrams for both “Model B” and the “Model A” described in the previous posting.]

Model B (Concept-Oriented)

[Download the XMI.]

The main classes are depicted below …

The class diagram above captures the following features …

  • A monolingual thesaurus contains one or more thesaurus concepts.
  • A thesaurus concept is contained by one and only one monolingual thesaurus.
  • A thesaurus concept is associated with one or more thesaurus terms (labels).
  • A thesaurus term is associated with one and only one thesaurus concept (meaning).
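As a rough illustration, the four features above could be sketched in Python along the following lines. The class, attribute and method names (`add_term`, `is_valid` and so on) are assumptions made for this sketch; they are not taken from BS 8723 or from the UML model itself.

```python
class MonolingualThesaurus:
    def __init__(self):
        self.concepts = []  # one or more concepts once populated

    def is_valid(self):
        # "contains one or more thesaurus concepts"
        return len(self.concepts) >= 1


class ThesaurusTerm:
    def __init__(self, lexical_value, concept):
        self.lexical_value = lexical_value
        self.concept = concept  # exactly one concept (the term's meaning)


class ThesaurusConcept:
    def __init__(self, thesaurus, first_label):
        # A concept belongs to exactly one thesaurus and must start
        # with at least one term, so both are constructor arguments.
        self.thesaurus = thesaurus
        self.terms = [ThesaurusTerm(first_label, self)]
        thesaurus.concepts.append(self)

    def add_term(self, lexical_value):
        term = ThesaurusTerm(lexical_value, self)
        self.terms.append(term)
        return term


thesaurus = MonolingualThesaurus()
concept = ThesaurusConcept(thesaurus, "cats")
concept.add_term("felines")
```

Making the thesaurus and the first label constructor arguments is one way to keep the "one and only one" and "one or more" multiplicities from being violated by accident.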

The use of notes is depicted below …

The class diagram above is intended to capture an ambiguity in BS 8723 part 2 regarding whether notes (annotations) are associated with concepts, with terms, or with either or both. The diagram therefore captures the following features …

  • A note is associated with zero or one thesaurus concept and zero or one thesaurus term.
  • A thesaurus concept is associated with zero or more notes (annotation).
  • A thesaurus term is associated with zero or more notes (annotation).
  • The class of notes is a generalisation of scope notes, definitions, editorial notes and history notes.
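A minimal sketch of the note features above, assuming Python and hypothetical names throughout (in particular, the `Annotatable` base class is a helper invented for this example and does not appear in the model):

```python
class Annotatable:
    """Hypothetical helper: anything that can carry zero or more notes."""

    def __init__(self):
        self.notes = []


class ThesaurusConcept(Annotatable):
    pass


class ThesaurusTerm(Annotatable):
    pass


class Note:
    def __init__(self, text, concept=None, term=None):
        self.text = text
        self.concept = concept  # zero or one concept
        self.term = term        # zero or one term
        if concept is not None:
            concept.notes.append(self)
        if term is not None:
            term.notes.append(self)


# The class of notes generalises these four kinds of note.
class ScopeNote(Note):
    pass


class Definition(Note):
    pass


class EditorialNote(Note):
    pass


class HistoryNote(Note):
    pass


concept = ThesaurusConcept()
ScopeNote("Example scope note text", concept=concept)
```

Leaving both `concept` and `term` optional mirrors the ambiguity the diagram is trying to preserve: a note may hang off a concept, a term, both, or neither.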

The associations between concepts and terms are depicted by the diagram below …

The diagram above captures the following …

  • The class of thesaurus terms is a generalisation of preferred terms and non-preferred terms.
  • A thesaurus concept is associated with one and only one preferred term (preferred label).
  • A thesaurus concept is associated with zero or more non-preferred terms (non-preferred label).
  • A preferred term is associated with zero or more non-preferred terms (UF).
  • A non-preferred term is associated with one and only one preferred term (USE).

The diagram above fails to capture several important logical characteristics of a monolingual thesaurus. For example, it does not capture the axiom that the USE/UF association is implied by the preferred label and non-preferred label associations. It also does not capture the axiom that no two thesaurus terms in a monolingual thesaurus may have the same lexical value (the simpler model A in the previous blog entry does not capture this either).
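The two axioms that the diagram misses are easy to state in code. The following Python sketch is only one possible reading, with all names assumed for illustration: USE and UF are not stored but derived from the preferred and non-preferred label associations, and a dictionary keyed on lexical value enforces uniqueness across the thesaurus.

```python
class ThesaurusConcept:
    def __init__(self, preferred_label):
        self.preferred_label = preferred_label  # exactly one
        self.non_preferred_labels = []          # zero or more


class MonolingualThesaurus:
    def __init__(self):
        self._concept_for = {}  # lexical value -> owning concept

    def _claim(self, lexical_value, concept):
        # Axiom: no two terms may share a lexical value.
        if lexical_value in self._concept_for:
            raise ValueError(f"duplicate lexical value: {lexical_value!r}")
        self._concept_for[lexical_value] = concept

    def add_concept(self, preferred_label):
        concept = ThesaurusConcept(preferred_label)
        self._claim(preferred_label, concept)
        return concept

    def add_non_preferred(self, concept, lexical_value):
        self._claim(lexical_value, concept)
        concept.non_preferred_labels.append(lexical_value)

    def use(self, non_preferred):
        """USE: derived from the labels of the shared concept."""
        concept = self._concept_for[non_preferred]
        if non_preferred == concept.preferred_label:
            raise ValueError("USE is only defined for non-preferred terms")
        return concept.preferred_label

    def used_for(self, preferred):
        """UF: derived from the labels of the shared concept."""
        concept = self._concept_for[preferred]
        if preferred != concept.preferred_label:
            raise ValueError("UF is only defined for preferred terms")
        return list(concept.non_preferred_labels)


thesaurus = MonolingualThesaurus()
cats = thesaurus.add_concept("cats")
thesaurus.add_non_preferred(cats, "felines")
```

Because `use` and `used_for` are computed from the concept's labels, the USE/UF association cannot drift out of step with the label associations.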

Relationships between concepts are depicted by the diagram below …

The diagram above is intended to capture two classes of association between thesaurus concepts. I’m not at all sure about how to model classes of association, or generalisations of associations, in UML, so this part might be a bit dodgy. Modelling generalisations of relationships is very important for capturing the extensibility of thesaurus relationships, so this needs further attention.
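One way to picture generalisations of relationships outside UML is to treat each relationship as an object whose class sits in a hierarchy. The Python sketch below assumes that the two classes of association are hierarchical and associative relationships (the text above does not name them), and `PartitiveRelationship` is a hypothetical extension added only to show how a new kind of relationship would slot in. None of these names come from the model.

```python
class ThesaurusConcept:
    def __init__(self, label):
        self.label = label


class ConceptRelationship:
    """General association between two concepts."""

    def __init__(self, source, target):
        self.source = source
        self.target = target


class HierarchicalRelationship(ConceptRelationship):
    """Assumed reading: source is broader than target."""


class AssociativeRelationship(ConceptRelationship):
    """Assumed reading: source is related to target."""


class PartitiveRelationship(HierarchicalRelationship):
    """Hypothetical extension: a whole/part hierarchical relationship."""


def related(relationships, concept, kind=ConceptRelationship):
    # A query for a general kind also matches every specialisation of it.
    return [
        r.target
        for r in relationships
        if r.source is concept and isinstance(r, kind)
    ]
```

Under this reading, a query for `HierarchicalRelationship` also returns partitive relationships, which is the behaviour a generalisation of associations is meant to give.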

Finally, the notions of “arrays” and “node labels” are captured in the following diagram …

The diagram above captures the following …

  • A monolingual thesaurus contains zero or more thesaurus arrays.
  • A thesaurus array contains one or more thesaurus concepts, in some order.
  • A thesaurus concept may be contained by zero or one thesaurus array.
  • A thesaurus array has zero or one node label, which is a string.
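
The array features above could be sketched in Python as follows. This is an illustration under assumed names, not part of the model: a list preserves the order of concepts, `node_label` is an optional string, and each concept records the single array (if any) that contains it. The thesaurus itself would simply hold a list of zero or more such arrays.

```python
class ThesaurusConcept:
    def __init__(self, label):
        self.label = label
        self.array = None  # zero or one containing array


class ThesaurusArray:
    def __init__(self, concepts, node_label=None):
        concepts = list(concepts)
        # "contains one or more thesaurus concepts"
        if not concepts:
            raise ValueError("an array needs at least one concept")
        # Reject the same concept listed twice in one array.
        if len({id(c) for c in concepts}) != len(concepts):
            raise ValueError("a concept may appear only once in an array")
        # "contained by zero or one thesaurus array"
        for concept in concepts:
            if concept.array is not None:
                raise ValueError(f"{concept.label!r} is already in an array")
        self.concepts = concepts      # order is significant
        self.node_label = node_label  # zero or one node label (a string)
        for concept in concepts:
            concept.array = self


kittens = ThesaurusConcept("kittens")
adult_cats = ThesaurusConcept("adult cats")
by_age = ThesaurusArray([kittens, adult_cats], node_label="cats by age")
```

All checks run before any concept is modified, so a rejected array leaves the concepts untouched.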