INSERT OR REPLACE explained in more detail

A few weeks ago we were asked to improve data entry performance of Tracker’s RDF store.

From earlier investigations we knew that a large amount of the RDF store’s update time was going to the application having to first delete triples and internally to the insert having to look up preexisting values.

For this reason we came up with the idea of providing a replace feature on top of standard SPARQL 1.1 Update.

When working with triples is a feature like replace of course a bit ambiguous. I’ll first briefly explain working with triples to describe things. When I want to describe a person Mark who has two dogs, we could do it like this:

  • Max is a Dog
  • Max is 10 years old
  • Mimi is a Dog
  • Mimi is 11 years old
  • Mark is a Person
  • Mark is 30 years old
  • Mark owns Max
  • Mark owns Mimi

If you look at those descriptions, you can simplify each by writing exactly three things: the subject, the property and the value.

In RDF we call these three subject, predicate and object. All subjects and predicates will be resources, the objects can either be a resource or a literal. You wrap resources in inequality signs.

You can continue talking about a resource using semicolon, and you continue talking about a predicate using comma. When you want to finish talking about a resource, you write a dot. Now you know how the Turtle format works.

In SPARQL Update you insert data with INSERT { Turtle formatted data }. Let’s translate that to Mark’s story:

INSERT {
  <Max> a <Dog> ;
        <hasName> ‘Max’ ;
        <hasAge> 10 .
  <Mimi> a <Dog> ;
        <hasName> ‘Mimi’ ;
        <hasAge> 11 .
  <Mark> a <Person> ;
         <hasName> ‘Mark’ ;
         <hasAge> 30 ;
         <owns> <Max>, <Mimi>
}

In the example we are using both single value property and multiple value properties. You can have only one name and one age, so <hasName> and <hasAge> are single value properties. But you can own more than one dog, so <owns> is a multiple value property.

The ambiguity with a replace feature for SPARQL Update is at multiple value properties. Does it need to replace the entire list of values? Does it need to append to the list? Does it need to update just one item in the list? And which one? This probably explains why it’s not specified in SPARQL Update.

For single value properties there’s no ambiguity. For multiple value properties on a resource where the particular triple already exists, there’s also no ambiguity: RDF doesn’t allow duplicate triples. This means that in RDF you can’t own <Max> twice. This is also true for separate insert executions.

In the next two examples the first query is equivalent to the second query. Keep this in mind because it will matter for our replace feature:

INSERT { <Mark> <owns> <Max>, <Max>, <Mimi> }

Is the same as

INSERT { <Mark> <owns> <Max>, <Mimi> }

There is no ambiguity for single value properties so we can implement replace for single value properties:

INSERT OR REPLACE {
  <Max> a <Dog> ;
        <hasName> ‘Max’ ;
        <hasAge> 11 .
  <Mimi> a <Dog> ;
        <hasName> ‘Mimi’ ;
        <hasAge> 12 .
  <Mark> a <Person> ;
         <hasName> ‘Mark’ ;
         <hasAge> 31 ;
         <owns> <Max>, <Mimi>
}

As mentioned earlier doesn’t RDF allow duplicate triples, so nothing will change to the ownerships of Mark. However, would we have added a new dog then just as if OR REPLACE was not there would he be added to Mark’s ownerships. The following example will actually add Morm to Mark’s dogs (and this is different than with the single value properties, they are overwritten instead).

INSERT OR REPLACE {
  <Morm> a <Dog> ;
        <hasName> ‘Morm’ ;
        <hasAge> 2 .
  <Max> a <Dog> ;
        <hasName> ‘Max’ ;
        <hasAge> 12 .
  <Mimi> a <Dog> ;
         <hasName> ‘Mimi’ ;
         <hasAge> 13 .
  <Mark> a <Person> ;
          <hasName> ‘Mark’ ;
          <hasAge> 32 ;
          <owns> <Max>, <Mimi>, <Morm>
}

We know that this looks a bit strange, but in RDF it kinda makes sense too. Note again that our replace feature is not part of standard SPARQL 1.1 Update (and will probably never be).

If for some reason you want to completely overwrite Mark’s ownerships then you need to precede the insert with a delete. If you also want to remove the dogs from the store (let’s say because, however unfortunate, they died), then you also have to remove their rdfs:Resource type:

DELETE { <Mark> <owns> ?dog . ?dog a rdfs:Resource }
WHERE { <Mark> <owns> ?dog }
INSERT OR REPLACE {
  <Fred> a <Dog> ;
        <hasName> ‘Fred’ ;
        <hasAge> 1 .
  <Mark> a <Person> ;
         <hasName> ‘Mark’ ;
         <hasAge> 32 ;
         <owns> <Fred> .
}

We don’t plan to add a syntax for overwriting, adding or deleting individual items or entire lists of a multiple value property at this time (other than with the preceding delete). There are technical reasons for this, but I will spare you the details. You can find the code that implements replace in the branch sparql-update where it’s awaiting review and then merge to master.

We saw performance improvements, whilst greatly depending on the use-case, of 30% and more. A use-case that was tested in particular was synchronizing contact data. The original query was varying in time between 17s and 23s for 1000 contacts. With the replace feature it takes around 13s for 1000 contacts. For more information on this performance test, read this mailing list thread and experiment yourself with this example.

The team working on qtcontacts-tracker, which is a backend for the QtContacts API that uses Tracker’s RDF store, are working on integrating with our replace feature. They promised me tests and numbers by next week.

A REPLACE extension for Tracker’s SPARQL’s Update

SPARQL Update has INSERT and DELETE. To update an existing triple in RDF you need to DELETE it first. You of course already have our INSERT-SILENT but that just ignores certain errors; it doesn’t replace triples.

A (performance) problem is that with each DELETE having to solve all possible solutions you create an extra query for each time you want to update using a ‘DELETE-WHERE INSERT’-construction.

INSERT also checks for old values. It has to do this to implement SPARQL Update where you can’t insert a triple with a different value than the old value: If the value of a triple is identical, the insert for that triple is ignored; if the triple didn’t exist yet, it’s inserted; if the values aren’t identical, error is thrown — you need to use DELETE upfront.

Both having to do the extra delete and the old-values come at a performance price.

To solve this we plan to provide Tracker specific support for REPLACE. It’ll be Tracker specific simply because this isn’t specified in SPARQL Update. That has a probable reason:

Replacing or updating doesn’t fit well in the RDF world. Updating properties that have multiple values, like nie:keyword, is ambiguous: does it need to replace the entire list of values; does it need to append to the list; does it need to update just one item in the list, and which one? This probably explains why it’s not specified in SPARQL Update.

We decided to let our REPLACE be only different than INSERT for single value properties. For multi value properties will our REPLACE behave the same as normal INSERT.

How a GraphUpdated triggered by a REPLACE behaves is still being decided. Especially the value of the object’s ID for resource objects in the ‘deletes’-array. Having to look up the old ID kinda defeats the purpose of having a REPLACE (as we’d still need to look it up, like what an INSERT does, destroying part of the performance gain).

Either way, let me show you some examples:

We start with an insert of a resource that has a single value and two times a multi value property filled in:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title';
             nie:keyword 'keyw1';
             nie:keyword 'keyw2' }

A quick query to verify, and yes it’s in:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title, keyw1
  title, keyw2

If we repeat the query a second time then the old-values check will turn the insert into a noop:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title';
             nie:keyword 'keyw1';
             nie:keyword 'keyw2' }

And a quick query to verify that, and indeed nothing has changed:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title, keyw1
  title, keyw2

If we’d do that last insert query but with different values, we’d get this:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title new';
             nie:keyword 'keyw4';
             nie:keyword 'keyw3' }

SparqlError.Constraint: Unable to insert multiple values for subject
`r' and single valued property `dc:title' (old_value: 'title', new
 value: 'title new')

Note that for the two nie:keyword triples this would have worked, but given that each query is a transaction and because the nie:title part failed, aren’t those two written either.

Let’s now try the same with INSERT OR REPLACE (edit: changed from just REPLACE to INSERT OR REPLACE):

INSERT OR REPLACE { <r> a nie:InformationElement ;
                        nie:title 'title new';
                        nie:keyword 'keyw4';
                        nie:keyword 'keyw3' }

And a quick query now yields:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title new, keyw1
  title new, keyw2
  title new, keyw3
  title new, keyw4

You can see that how it behaved for nie:title was different than for nie:keyword. That’s because nie:title is a single value -and nie:keyword is a multi value property.

What if we do want to reset the multi value property and insert a complete new list? Simple, just do this as a single query (space or newline delimited) (edit: changed to INSERT OR REPLACE from just REPLACE):

DELETE { <r> nie:keyword ?k } WHERE { <r> nie:keyword ?k }
INSERT OR REPLACE { <r> a nie:InformationElement ;
                        nie:title 'title new';
                        nie:keyword 'keyw4';
                        nie:keyword 'keyw3' }

And a quick query now yields:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title new, keyw3
  title new, keyw4

The work on this is in progress. You can find it in the branch sparql-update. It’s working but especially the GraphUpdated stuff is unfinished.

Also note that the final syntax may change.