INSERT OR REPLACE explained in more detail

A few weeks ago we were asked to improve data entry performance of Tracker’s RDF store.

From earlier investigations we knew that a large amount of the RDF store’s update time was going to the application having to first delete triples and internally to the insert having to look up preexisting values.

For this reason we came up with the idea of providing a replace feature on top of standard SPARQL 1.1 Update.

When working with triples is a feature like replace of course a bit ambiguous. I’ll first briefly explain working with triples to describe things. When I want to describe a person Mark who has two dogs, we could do it like this:

  • Max is a Dog
  • Max is 10 years old
  • Mimi is a Dog
  • Mimi is 11 years old
  • Mark is a Person
  • Mark is 30 years old
  • Mark owns Max
  • Mark owns Mimi

If you look at those descriptions, you can simplify each by writing exactly three things: the subject, the property and the value.

In RDF we call these three subject, predicate and object. All subjects and predicates will be resources, the objects can either be a resource or a literal. You wrap resources in inequality signs.

You can continue talking about a resource using semicolon, and you continue talking about a predicate using comma. When you want to finish talking about a resource, you write a dot. Now you know how the Turtle format works.

In SPARQL Update you insert data with INSERT { Turtle formatted data }. Let’s translate that to Mark’s story:

INSERT {
  <Max> a <Dog> ;
        <hasName> ‘Max’ ;
        <hasAge> 10 .
  <Mimi> a <Dog> ;
        <hasName> ‘Mimi’ ;
        <hasAge> 11 .
  <Mark> a <Person> ;
         <hasName> ‘Mark’ ;
         <hasAge> 30 ;
         <owns> <Max>, <Mimi>
}

In the example we are using both single value property and multiple value properties. You can have only one name and one age, so <hasName> and <hasAge> are single value properties. But you can own more than one dog, so <owns> is a multiple value property.

The ambiguity with a replace feature for SPARQL Update is at multiple value properties. Does it need to replace the entire list of values? Does it need to append to the list? Does it need to update just one item in the list? And which one? This probably explains why it’s not specified in SPARQL Update.

For single value properties there’s no ambiguity. For multiple value properties on a resource where the particular triple already exists, there’s also no ambiguity: RDF doesn’t allow duplicate triples. This means that in RDF you can’t own <Max> twice. This is also true for separate insert executions.

In the next two examples the first query is equivalent to the second query. Keep this in mind because it will matter for our replace feature:

INSERT { <Mark> <owns> <Max>, <Max>, <Mimi> }

Is the same as

INSERT { <Mark> <owns> <Max>, <Mimi> }

There is no ambiguity for single value properties so we can implement replace for single value properties:

INSERT OR REPLACE {
  <Max> a <Dog> ;
        <hasName> ‘Max’ ;
        <hasAge> 11 .
  <Mimi> a <Dog> ;
        <hasName> ‘Mimi’ ;
        <hasAge> 12 .
  <Mark> a <Person> ;
         <hasName> ‘Mark’ ;
         <hasAge> 31 ;
         <owns> <Max>, <Mimi>
}

As mentioned earlier doesn’t RDF allow duplicate triples, so nothing will change to the ownerships of Mark. However, would we have added a new dog then just as if OR REPLACE was not there would he be added to Mark’s ownerships. The following example will actually add Morm to Mark’s dogs (and this is different than with the single value properties, they are overwritten instead).

INSERT OR REPLACE {
  <Morm> a <Dog> ;
        <hasName> ‘Morm’ ;
        <hasAge> 2 .
  <Max> a <Dog> ;
        <hasName> ‘Max’ ;
        <hasAge> 12 .
  <Mimi> a <Dog> ;
         <hasName> ‘Mimi’ ;
         <hasAge> 13 .
  <Mark> a <Person> ;
          <hasName> ‘Mark’ ;
          <hasAge> 32 ;
          <owns> <Max>, <Mimi>, <Morm>
}

We know that this looks a bit strange, but in RDF it kinda makes sense too. Note again that our replace feature is not part of standard SPARQL 1.1 Update (and will probably never be).

If for some reason you want to completely overwrite Mark’s ownerships then you need to precede the insert with a delete. If you also want to remove the dogs from the store (let’s say because, however unfortunate, they died), then you also have to remove their rdfs:Resource type:

DELETE { <Mark> <owns> ?dog . ?dog a rdfs:Resource }
WHERE { <Mark> <owns> ?dog }
INSERT OR REPLACE {
  <Fred> a <Dog> ;
        <hasName> ‘Fred’ ;
        <hasAge> 1 .
  <Mark> a <Person> ;
         <hasName> ‘Mark’ ;
         <hasAge> 32 ;
         <owns> <Fred> .
}

We don’t plan to add a syntax for overwriting, adding or deleting individual items or entire lists of a multiple value property at this time (other than with the preceding delete). There are technical reasons for this, but I will spare you the details. You can find the code that implements replace in the branch sparql-update where it’s awaiting review and then merge to master.

We saw performance improvements, whilst greatly depending on the use-case, of 30% and more. A use-case that was tested in particular was synchronizing contact data. The original query was varying in time between 17s and 23s for 1000 contacts. With the replace feature it takes around 13s for 1000 contacts. For more information on this performance test, read this mailing list thread and experiment yourself with this example.

The team working on qtcontacts-tracker, which is a backend for the QtContacts API that uses Tracker’s RDF store, are working on integrating with our replace feature. They promised me tests and numbers by next week.

11 thoughts on “INSERT OR REPLACE explained in more detail”

  1. Hello,

    I’m writing a music player and have ran into issues with speeds regarding TreeViews and TreeModels. I have found lots of ways to hack around the problems including removing the model from treeview on populate, removing sorting while populating, etc.

    However, at this point I want to write my own treemodel and I know you have written a lot on that sort of stuff so I was wondering if you could help.

    1)Should just rewriting the model be enough or will I have to rewrite the treeview to?
    2)I need filtering and sorting, any advice on that?

    any other advice you can give me from your own experience?

    Thank you very much,

    Scott

  2. Thanks for your fast reply. Could you elaborate on the second part? So if I am using sql database backing, you think i should sort and filter in the query, and then populate the tree based on that? Wouldn’t that require, for the most part, a full repopulation?

    Scott

  3. Each time you change either the sort order or the filter the model should be repopulated, yes. As a consequence should the tree be invalidated (so that it fetches the currently visible rows from the model and updates or refreshes itself).

    You could also set a new model each time you change either the order or the filter (make both part of the model’s construction).

    Perhaps try to restore the selection (= which of the items in the TreeView are selected) after setting the new model for usability reasons. It can of course happen that items that where selected before, aren’t available after changing the filter.

    This can not happen when changing the sort order. But the code for restoring the selection can be the same for changing the sort order as for changing the filter, of course.

    What isn’t going to give you good performance is doing filtering or sorting on top of a raw non-sorted and unfiltered model (this is what most people do, as filter-models and sort-models are usually plug-in components in the toolkit).

    Although this depends on the amount of items. The more items your TreeModel has, the more doing filtering and sorting in the query will help with your performance.

    For a TreeModel that has 20 items, there’s of course absolutely no point in not using filter-models and sort-models. That’ll be fast enough and that wont be a performance bottleneck anyway. When your TreeModel has 20 million items; that’s different.

    What also, and of course, helps a lot is trying to involve LIMIT in the TreeModel: if only 50 items are visible; then there’s no point getting all items from the database immediately. Again, this depends on the amount of items that your query yields.

    If that’s 50 million items then it’s probably not a good idea to pump them all into your heap memory. Instead, use LIMIT in the model’s code to fetch the results around the row-number that the TreeView requests from the TreeModel. Or use a cursor (try to avoid all 50 million from getting copied, if only 50 are visible at one time). A DB cursor is more or less exactly that, but at the level of the database. You can compare a DB cursor with Enumerator (in .NET) or Iterator (in Java) by the way.

    If your query yields 20 results; then just don’t bother (it’s not going to be a bottleneck).

  4. There’s some great advice in there, that leads me to some more questions.

    Unfortunately, for my app I am not in a situation where reading from the db is the best solution. I have a hashmap of song objects, and that is readily availaible to easily access any song based on it’s id. At this point, it would be unfeasible to switch how things are done in the back end, so I have a few qustions, if you would be so kind, regarding using a hasmap of songs rather than fetching from a DB.

    As for the amount of rows, I think it is safe to assume that most users have less than 5000 songs. But, like you said, I am worried about the users with large music libraries. I think i can assume quite confidently less than 1 million songs though.

    Now, to actually ask some questions:

    Based on what you said in the first 2 paragraphs, i read that the best approach would be to create an entirely new treemodel, populate that treemodel, destroy the old one, and use the new one EVERY time the user filters or sorts. It seems to me that this could be an inefficient way to go about this if I am not using a db to do my filtering and sorting.

    What i was originally thinking for my non-db use situation is that every time the user filters, to add/remove (instead of entirely repopoulating) rows based on whether they should show or not. that way we i’m not doing a total repopulate every time i search.

    For the sort, I was definitely thinking of doing an outside sort, but i am not sure how to do this if i don’t repopulate everytime. It could get messy.

    So, a some questions I have:
    1) Do you think that my proposed way of filtering will be faster than the TreeModel Filter?
    2) How much faster?
    3) If I were to have instant/direct access to sorting variables in my sort func, do you think it will be faster then the one in the default TreeModel or especially TreeModel sort?
    4) What is a good way to make populating, in general, faster? Why is it so slow in the TreeModel that comes with Gtk?
    5) Overall, how much speed improvements do you think I can get not using a db and rewriting a Custom treemodel?

    Also, I am curious about what you are doing at

    http://pvanhoof.be/blog/index.php/2006/01/13/three-million-rows-in-a-gtktreeview

    Sorry if I am assuming too much of you, you just seem to know your stuff based on some of your blog posts :)

    Scott

  5. This is surprising to me.

    I feel like the TreeModel stuffed in a filter, stuffed in a sort is very heavy and slows things down a lot.

    I think at this point this is my plan:

    Write a custom treemodel that is specific to my use case
    Include a function for inserting multiple rows all at once
    Include a function for turning sorting on/off (to stop it from sorting on insert)
    Include a custom/specific-for-my-situation filter.

    Both filtering and sorting will be done in the same tree, to increase speed.

    Anything you see wrong with this?

    Scott

  6. What is wrong with your idea is that gtk_tree_model_row_inserted is (translated to) a signal that you can’t easily do for multiple rows at once. And that this signal must be emitted with the gdk_threads_enter/leave lock used or in the context of gtk’s mainloop. It’s the TreeView that would have to be adapted to accept a signal with multiple rows: you can’t implement a workaround for this in your custom TreeModel. This is in my opinion the big problem of gtk’s treeview-treemodel architecture. Other than that, no problems with your ideas.

Comments are closed.