INSERT OR REPLACE explained in more detail

A few weeks ago we were asked to improve data entry performance of Tracker’s RDF store.

From earlier investigations we knew that a large part of the RDF store’s update time went into two things: the application first having to delete triples, and the insert internally having to look up preexisting values.

For this reason we came up with the idea of providing a replace feature on top of standard SPARQL 1.1 Update.

When working with triples, a feature like replace is of course a bit ambiguous. I’ll first briefly explain how working with triples goes. When we want to describe a person Mark who has two dogs, we could do it like this:

  • Max is a Dog
  • Max is 10 years old
  • Mimi is a Dog
  • Mimi is 11 years old
  • Mark is a Person
  • Mark is 30 years old
  • Mark owns Max
  • Mark owns Mimi

If you look at those descriptions, you can simplify each by writing exactly three things: the subject, the property and the value.

In RDF we call these three subject, predicate and object. All subjects and predicates are resources; an object can be either a resource or a literal. You wrap resources in angle brackets (the less-than and greater-than signs).

You can continue talking about a resource using a semicolon, and you continue talking about a predicate using a comma. When you want to finish talking about a resource, you write a dot. Now you know how the Turtle format works.
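To make the semicolon, comma and dot rules concrete, here is a minimal Python sketch that groups (subject, predicate, object) tuples and prints them Turtle-style. The function name to_turtle and the four-space indentation are my own choices for illustration; a real Turtle writer also handles prefixes, escaping and typed literals.

```python
from collections import defaultdict

# Mark's statements as (subject, predicate, object) tuples.
# Resources keep their angle brackets; the literal 30 is written as text.
mark = [
    ("<Mark>", "a", "<Person>"),
    ("<Mark>", "<hasAge>", "30"),
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Mimi>"),
]

def to_turtle(triples):
    """Hypothetical helper: serialize triples using ';', ',' and '.'."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    blocks = []
    for s, preds in grouped.items():
        # A comma continues the same predicate, a semicolon the same subject.
        parts = [f"{p} {', '.join(objs)}" for p, objs in preds.items()]
        # A dot finishes the subject.
        blocks.append(f"{s} " + " ;\n    ".join(parts) + " .")
    return "\n".join(blocks)

print(to_turtle(mark))
```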

In SPARQL Update you insert data with INSERT { Turtle formatted data }. Let’s translate that to Mark’s story:

INSERT {
  <Max> a <Dog> ;
        <hasName> 'Max' ;
        <hasAge> 10 .
  <Mimi> a <Dog> ;
         <hasName> 'Mimi' ;
         <hasAge> 11 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 30 ;
         <owns> <Max>, <Mimi>
}

In the example we are using both single value properties and multiple value properties. You can have only one name and one age, so <hasName> and <hasAge> are single value properties. But you can own more than one dog, so <owns> is a multiple value property.

The ambiguity of a replace feature for SPARQL Update lies with multiple value properties. Does it need to replace the entire list of values? Does it need to append to the list? Does it need to update just one item in the list? And which one? This probably explains why it’s not specified in SPARQL Update.

For single value properties there’s no ambiguity. For multiple value properties on a resource where the particular triple already exists, there’s also no ambiguity: RDF doesn’t allow duplicate triples. This means that in RDF you can’t own <Max> twice. This is also true for separate insert executions.

In the next two examples the first query is equivalent to the second query. Keep this in mind because it will matter for our replace feature:

INSERT { <Mark> <owns> <Max>, <Max>, <Mimi> }

Is the same as

INSERT { <Mark> <owns> <Max>, <Mimi> }
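The equivalence of the two queries follows directly if you think of the store as a set of triples. A minimal sketch in Python, where the set named store and the helper insert are toy stand-ins and not Tracker’s actual storage:

```python
# Toy model: the store is a plain set of (subject, predicate, object) tuples.
store = set()

def insert(store, triples):
    """Hypothetical stand-in for INSERT { ... }: a set ignores duplicates."""
    store.update(triples)

# First query: <Max> is listed twice.
insert(store, [
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Mimi>"),
])
after_first = set(store)

# Second query, run as a separate insert: nothing new is added.
insert(store, [
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Mimi>"),
])

print(len(store))
```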

There is no ambiguity for single value properties so we can implement replace for single value properties:

INSERT OR REPLACE {
  <Max> a <Dog> ;
        <hasName> 'Max' ;
        <hasAge> 11 .
  <Mimi> a <Dog> ;
         <hasName> 'Mimi' ;
         <hasAge> 12 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 31 ;
         <owns> <Max>, <Mimi>
}

As mentioned earlier, RDF doesn’t allow duplicate triples, so nothing changes in Mark’s ownerships. However, had we added a new dog, he would have been added to Mark’s ownerships, just as if OR REPLACE was not there. The following example actually adds Morm to Mark’s dogs (this differs from single value properties, which are overwritten instead).

INSERT OR REPLACE {
  <Morm> a <Dog> ;
         <hasName> 'Morm' ;
         <hasAge> 2 .
  <Max> a <Dog> ;
        <hasName> 'Max' ;
        <hasAge> 12 .
  <Mimi> a <Dog> ;
         <hasName> 'Mimi' ;
         <hasAge> 13 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 32 ;
         <owns> <Max>, <Mimi>, <Morm>
}

We know that this looks a bit strange, but in RDF it kinda makes sense too. Note again that our replace feature is not part of standard SPARQL 1.1 Update (and will probably never be).
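The difference between the two kinds of properties can be modelled in a few lines of Python. This is only a toy model of the semantics described here, not Tracker’s implementation: the set SINGLE_VALUED and the function insert_or_replace are hypothetical, and I assume the store knows each property’s cardinality from the ontology.

```python
# Assumption: the ontology tells us which properties hold a single value.
SINGLE_VALUED = {"<hasName>", "<hasAge>"}

def insert_or_replace(store, triples):
    """Toy model: overwrite single value properties, append to the others."""
    for s, p, o in triples:
        if p in SINGLE_VALUED:
            # Drop whatever value was stored for this subject and property.
            old = {t for t in store if t[0] == s and t[1] == p}
            store.difference_update(old)
        # Adding to a set is a no-op when the triple already exists.
        store.add((s, p, o))

store = {
    ("<Mark>", "<hasAge>", 31),
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Mimi>"),
}

insert_or_replace(store, [
    ("<Mark>", "<hasAge>", 32),
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Morm>"),
])

ages = {o for s, p, o in store if p == "<hasAge>"}
dogs = {o for s, p, o in store if p == "<owns>"}
```

After the call, Mark’s age has been overwritten while Morm has been added next to Max and Mimi.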

If for some reason you want to completely overwrite Mark’s ownerships then you need to precede the insert with a delete. If you also want to remove the dogs from the store (let’s say because, however unfortunate, they died), then you also have to remove their rdfs:Resource type:

DELETE { <Mark> <owns> ?dog . ?dog a rdfs:Resource }
WHERE { <Mark> <owns> ?dog }
INSERT OR REPLACE {
  <Fred> a <Dog> ;
         <hasName> 'Fred' ;
         <hasAge> 1 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 32 ;
         <owns> <Fred> .
}
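In the same toy set-of-triples model, the preceding delete could be sketched as follows. The helper delete_owned is hypothetical, and I assume here that removing a resource’s rdfs:Resource type removes every triple that has that resource as its subject.

```python
def delete_owned(store, owner):
    """Toy model of the DELETE ... WHERE above (assumed semantics)."""
    # WHERE { <owner> <owns> ?dog }
    dogs = {o for s, p, o in store if s == owner and p == "<owns>"}
    # Remove the ownership triples and everything stored about each dog.
    doomed = {
        t for t in store
        if (t[0] == owner and t[1] == "<owns>") or t[0] in dogs
    }
    store.difference_update(doomed)
    return dogs

store = {
    ("<Mark>", "a", "<Person>"),
    ("<Mark>", "<owns>", "<Max>"),
    ("<Mark>", "<owns>", "<Mimi>"),
    ("<Max>", "a", "<Dog>"),
    ("<Max>", "<hasAge>", 12),
    ("<Mimi>", "a", "<Dog>"),
}

removed = delete_owned(store, "<Mark>")

# Only now insert the new dog and the new ownership.
store.add(("<Fred>", "a", "<Dog>"))
store.add(("<Mark>", "<owns>", "<Fred>"))
```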

We don’t plan to add a syntax for overwriting, adding or deleting individual items or entire lists of a multiple value property at this time (other than with the preceding delete). There are technical reasons for this, but I will spare you the details. You can find the code that implements replace in the branch sparql-update, where it’s awaiting review before being merged to master.

We saw performance improvements of 30% and more, although this depends greatly on the use-case. A use-case that was tested in particular was synchronizing contact data. The original query varied in time between 17s and 23s for 1000 contacts. With the replace feature it takes around 13s for 1000 contacts. For more information on this performance test, read this mailing list thread and experiment yourself with this example.
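For the contact synchronization numbers, the relative improvement can be worked out like this. The helper improvement is just arithmetic on the timings quoted above, nothing more:

```python
def improvement(before_s, after_s):
    """Relative time saved, as a percentage of the original time."""
    return (before_s - after_s) / before_s * 100

low = improvement(17, 13)   # best-case original run
high = improvement(23, 13)  # worst-case original run

print(f"{low:.1f}% to {high:.1f}%")  # prints 23.5% to 43.5%
```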

The team working on qtcontacts-tracker, which is a backend for the QtContacts API that uses Tracker’s RDF store, are working on integrating with our replace feature. They promised me tests and numbers by next week.