What’s in a ‘term’?

by Alistair Miles

A new issue thesaurusRepresentation-11 has recently been added to the SKOS proposals and issues list. There have been several requests for a solution to this issue to involve the addition of a new class to SKOS, to represent ‘terms’, which are linguistic entities somewhere between RDF plain literals and concepts. I believe there are strong arguments for not adding a new class to SKOS, and this article is an attempt to explain some of my reasoning.

SKOS currently makes use of two basic classes of entity: (1) the class skos:Concept, (2) the class of plain literals.

The first argument against introducing a third class to represent some sort of linguistic entity is that it is simply not necessary. All of the motivating use cases I have in mind for SKOS, which focus on its use in information retrieval applications, can be satisfied without it. Of course, until we have a proper description of the motivating use cases for SKOS, and agreement on them, this argument cannot be substantiated. This will be the first task for the Semantic Web Deployment Working Group, if it is chartered as currently specified. I have begun developing the outline of a vision for SKOS, in this paper submitted to the Dublin Core 2006 conference, and in this presentation to the Ecoterm working group. All comments on these would be gratefully received.

The second argument against introducing a third class to represent some kind of linguistic entity is that expectations about the logical characteristics of this class differ dramatically, depending on the audience and the intended usage. These differences would inevitably lead to very serious problems when merging data from different sources and drawing even the simplest of logical conclusions.

To illustrate some of these problems, consider first a hypothetical class foo:Descriptor whose members are thesaurus descriptors (a.k.a. ‘preferred terms’), a hypothetical class foo:NonDescriptor, disjoint with foo:Descriptor, whose members are thesaurus non-descriptors (a.k.a. ‘non-preferred terms’), and a hypothetical class foo:Term defined as the union of foo:Descriptor and foo:NonDescriptor. I.e.

foo:Descriptor a owl:Class.
foo:NonDescriptor a owl:Class; owl:disjointWith foo:Descriptor.
foo:Term a owl:Class; owl:unionOf (foo:Descriptor foo:NonDescriptor).

In a thesaurus, a ‘term’ can only have a single lexical form. I.e. ‘economic cooperation’ and ‘economic co-operation’ are treated as different ‘terms’ in a thesaurus. So, we can define a property foo:lexicalForm with domain foo:Term and range xs:string, and we can declare this property to be a functional property. I.e.

foo:lexicalForm a owl:DatatypeProperty, owl:FunctionalProperty;
 rdfs:domain foo:Term; rdfs:range xs:string.

Now, if a foo:Term in some thesaurus X has the same lexical form as a foo:Term in some other thesaurus Y, this does NOT entitle us to make any logical inferences whatsoever. This is obvious for two reasons: firstly because foo:Descriptor and foo:NonDescriptor are disjoint, and therefore inferring identity of a foo:Descriptor and a foo:NonDescriptor would lead to a logical inconsistency; and secondly because a foo:Term with lexical form ‘bank’ in thesaurus X might be used to denote something completely different to a foo:Term with lexical form ‘bank’ in thesaurus Y.

I.e. the foo:lexicalForm property is definitely NOT an inverse functional property.
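As a minimal sketch of the situation just described, here is some instance data for the two thesauri. The namespace URIs, the x: and y: prefixes, and the term identifiers are all hypothetical, chosen only for illustration.

```turtle
# Hypothetical namespaces, not part of any published vocabulary.
@prefix foo: <http://example.org/foo#> .
@prefix x:   <http://example.org/thesaurusX#> .
@prefix y:   <http://example.org/thesaurusY#> .

# Thesaurus X uses 'bank' as a descriptor.
x:term42 a foo:Descriptor ;
    foo:lexicalForm "bank" .

# Thesaurus Y uses 'bank' as a non-descriptor.
y:term7 a foo:NonDescriptor ;
    foo:lexicalForm "bank" .

# foo:lexicalForm is functional but not inverse functional, so the shared
# value "bank" entails nothing about x:term42 and y:term7 being the same.
# Asserting their identity would clash with the disjointness of
# foo:Descriptor and foo:NonDescriptor.
```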

Finally, we create a class foo:Concept, and a property foo:meaning with domain foo:Term and range foo:Concept. From the thesaurus data model, each ‘term’ has only one meaning – this is the entire purpose of a controlled vocabulary, to remove ambiguity. We can represent this logically by declaring foo:meaning to be a functional property, i.e.

foo:Concept a owl:Class.
foo:meaning a owl:ObjectProperty, owl:FunctionalProperty;
  rdfs:domain foo:Term; rdfs:range foo:Concept.

Now, consider a completely different set of classes and properties. The bar:Term class is used to represent a word or multi-word term from some natural language, which may have multiple lexical forms, perhaps where alternative spellings are possible, or perhaps where multiple character sets for a single language are used. So we declare a bar:lexicalForm property with domain bar:Term, and with a minimum cardinality restriction to ensure that every bar:Term has at least 1 lexical form. The bar:Term class is used to capture linguistic entities that are part of some particular natural language, therefore the combination of the language of a term and any of its lexical forms may be used to infer the identity of two entities of type bar:Term. If we use plain literals with language tags as the values of the bar:lexicalForm property, then this property can be declared as an inverse functional property. I.e.

bar:Term a owl:Class;
  rdfs:subClassOf [
    a owl:Restriction;
    owl:onProperty bar:lexicalForm;
    owl:minCardinality "1"^^xs:integer;
  ].
bar:lexicalForm a owl:DatatypeProperty, owl:InverseFunctionalProperty;
  rdfs:domain bar:Term; rdfs:range rdfs:Literal.
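To make the consequence of the inverse functional declaration concrete, here is a small sketch of data from two sources. The namespace URIs, the srcA: and srcB: prefixes, and the identifiers are hypothetical.

```turtle
# Hypothetical namespaces, for illustration only.
@prefix bar:  <http://example.org/bar#> .
@prefix srcA: <http://example.org/sourceA#> .
@prefix srcB: <http://example.org/sourceB#> .

# Source A records two spellings of one English word.
srcA:w1 a bar:Term ;
    bar:lexicalForm "colour"@en , "color"@en .

# Source B records one spelling under its own identifier.
srcB:w9 a bar:Term ;
    bar:lexicalForm "colour"@en .

# Because bar:lexicalForm is inverse functional, the shared value
# "colour"@en entails:
#   srcA:w1 owl:sameAs srcB:w9 .
```

This is exactly the inference that the foo: model above must not license.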

Of course, any linguistic entity must have at least one meaning, and may have any number of meanings, so to represent this we declare a class bar:Concept, and a property bar:meaning with domain bar:Term and range bar:Concept, and with the appropriate cardinality constraints, i.e.

bar:Concept a owl:Class.
bar:meaning a owl:ObjectProperty;
  rdfs:domain bar:Term; rdfs:range bar:Concept.
bar:Term rdfs:subClassOf [
  a owl:Restriction;
  owl:onProperty bar:meaning;
  owl:minCardinality "1"^^xs:integer;
].

Because a bar:Term can have many meanings, and because two terms may share the same meaning, the property bar:meaning is *neither functional nor inverse functional*.

The point of these two examples is to illustrate how different notions of what a ‘term’ is can lead to logical frameworks that appear superficially similar, and yet are *fundamentally different* at the logical level.

For example, foo:lexicalForm is a functional property and not an inverse functional property, whereas bar:lexicalForm is an inverse functional property and not a functional property. Likewise, foo:meaning is a functional property, whereas bar:meaning is not functional and has a minimum cardinality restriction.

Any confusion between these different approaches would lead to entirely inappropriate inferences being drawn, especially with respect to the identity of individuals, which would cause unexpected and severely erroneous behaviour in applications that are built for one or the other interpretation.

I have only given two possible logical configurations above, and I fully expect that there could be many more. Because of the enormous potential for logical ambiguity, I would like to explore the absolute limitations of just the two basic classes of SKOS, i.e. skos:Concept and plain literal. Where linguistic associations become necessary, I strongly favour a representation based on n-ary relations, and believe that the associations may be represented quite sufficiently by n-ary relations between plain literals and, where necessary, resources of type skos:Concept. Specifying the conditions for determining the logical identity of two n-ary relationships is, I believe, vastly easier than specifying the conditions for the logical identity of two ‘terms’.

Of course the logical conditions for establishing the identity of two plain literals and of two resources of type skos:Concept must be specified, but I believe this to be a tractable problem.

The case of plain literals is already handled by the RDF semantics, and can be informally stated as follows: A plain literal consists of a string of characters (the literal value) and a language tag, which is itself a string of characters, and which may be empty. The logical identity of two plain literals may be established by a string comparison of the literal values and of the language tags. If two plain literals have the same literal value and the same language tag, then they are the same logical entity.
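The informal rule above can be sketched with a few triples. The ex: namespace and the ex:label property are hypothetical, used only to give the literals somewhere to sit.

```turtle
# Hypothetical namespace and property, for illustration only.
@prefix ex: <http://example.org/demo#> .

ex:s ex:label "bank"@en .
ex:s ex:label "bank"@en .   # same value and same tag: the same literal, so this
                            # repeats the triple above and adds nothing to the graph
ex:s ex:label "bank"@de .   # same value, different language tag: a different literal
ex:s ex:label "bank" .      # same value, no language tag: a different literal again
ex:s ex:label "Bank"@en .   # same tag, different characters: a different literal
```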

Two members of the skos:Concept class may be treated as the same logical entity if and only if it is deemed true that their properties (including labelling and annotation) are identical. In practice, this means that logical identity is only ever used to describe the situation where two URIs have been used to denote a single conceptual unit in some concept scheme. Two conceptual units in different concept schemes are NEVER treated as logically identical entities, even if it is deemed that they share the same meaning. This avoids the confusion that would result from the merging of labels and annotations from different sources. Where two conceptual units from different concept schemes are deemed to share the same meaning, then another property may be used to assert this equivalence of meaning (currently ‘skos:exactMatch’), which *does not* denote logical identity.
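As a sketch of the distinction, consider two concepts from different concept schemes. The x: and y: namespaces, the concept identifiers, and the labels are hypothetical. The skos: prefix is assumed to be bound to the namespace used by the later SKOS Recommendation, which places skos:exactMatch in the core namespace.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Hypothetical concept schemes, for illustration only.
@prefix x:    <http://example.org/schemeX#> .
@prefix y:    <http://example.org/schemeY#> .

x:c1 a skos:Concept ;
    skos:prefLabel "banks"@en .

y:c5 a skos:Concept ;
    skos:prefLabel "financial institutions"@en .

# Equivalence of meaning, without logical identity:
x:c1 skos:exactMatch y:c5 .

# By contrast, asserting
#   x:c1 owl:sameAs y:c5 .
# would make them one individual carrying both preferred labels,
# which is the merging of labels from different sources that the
# approach above is designed to avoid.
```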

Finally, I offer the prize of an unspecified quantity of beer for the best answer to the following question: when are two words the same? 🙂