IPC performance improvements for insert queries

Although SQLite's WAL mode now gives us direct access for read queries, we don't support direct access for insert and delete SPARQL queries. When made using libtracker-sparql, those queries still go over D-Bus, using Adrien's FD passing D-Bus IPC technique; the library handles that for you.
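
For reference, here is a minimal sketch of what a single insert looks like through libtracker-sparql (assuming the 0.10-era API, error handling trimmed); all of the D-Bus and FD passing machinery hides behind this one call:

#include <tracker-sparql.h>

GError *error = NULL;
TrackerSparqlConnection *connection;

/* The library picks the IPC mechanism for us: updates go over D-Bus */
connection = tracker_sparql_connection_get (NULL, &error);

/* One insert transaction, one IPC round trip */
tracker_sparql_connection_update (connection,
                                  "INSERT { _:f a nfo:FileDataObject ; "
                                  "nie:url 'file:///tmp/example.txt' }",
                                  G_PRIORITY_DEFAULT,
                                  NULL,
                                  &error);

g_object_unref (connection);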

After investigating a performance analysis by somebody from Intel, we learned that there is still significant overhead per IPC call. In the analysis the person made miner-fs combine multiple insert transactions and send them over as a single big transaction. This was noticeably faster than making many individual IPC requests.
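
Concretely, that kind of batching amounts to something like this sketch (reusing the connection from above; the uris array and its size are made-up placeholders):

GString *batch = g_string_new (NULL);
guint i;

/* Append many independent INSERTs into one big query string */
for (i = 0; i < n_uris; i++) {
        g_string_append_printf (batch,
                                "INSERT { <%s> a nfo:FileDataObject } ",
                                uris[i]);
}

/* The whole batch costs a single IPC round trip, but it is now
 * one transaction: if one INSERT fails, they all fail */
tracker_sparql_connection_update (connection, batch->str,
                                  G_PRIORITY_DEFAULT, NULL, &error);

g_string_free (batch, TRUE);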

The problem with this is that if one of the many insert queries fails, they all fail: not good.

We’re now experimenting with a private API that allows you to pass n individual insert transactions, and get n errors back, using one IPC call.
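
In rough strokes, the experimental API looks like this (a sketch based on the multi-insert branch that is linked from the comments below; names and signatures aren't final):

static void
update_array_done (GObject *source, GAsyncResult *result, gpointer user_data)
{
        GError *error = NULL;
        GPtrArray *errors;
        guint i;

        /* One error slot per transaction; NULL means that one succeeded */
        errors = tracker_sparql_connection_update_array_finish (
                TRACKER_SPARQL_CONNECTION (source), result, &error);

        if (errors == NULL) {
                g_warning ("Whole call failed: %s", error->message);
                g_error_free (error);
                return;
        }

        for (i = 0; i < errors->len; i++) {
                GError *e = g_ptr_array_index (errors, i);

                if (e != NULL)
                        g_warning ("Transaction %u failed: %s", i, e->message);
        }

        g_ptr_array_unref (errors);
}

/* Inside some setup function: n independent insert transactions in,
 * n errors back, one IPC call */
gchar *queries[] = {
        "INSERT { _:a a nie:InformationElement }",
        "INSERT { _:b a nie:InformationElement }",
};

tracker_sparql_connection_update_array_async (connection,
                                              queries, G_N_ELEMENTS (queries),
                                              G_PRIORITY_DEFAULT, NULL,
                                              update_array_done, NULL);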

The numbers are promising even on the desktop session D-Bus (the test):

$ cd tests/functional-tests/
$ ./update-array-performance-test
First run (first update then array)
Array: 0.103675, Update: 0.139094
Reversing run (first array then update)
Array: 0.290607, Update: 0.161749
$ ./update-array-performance-test
First run (first update then array)
Array: 0.105920, Update: 0.137554
Reversing run (first array then update)
Array: 0.118785, Update: 0.130630
$ ./update-array-performance-test
First run (first update then array)
Array: 0.108501, Update: 0.136524
Reversing run (first array then update)
Array: 0.117308, Update: 0.151192
$

We’re now deciding whether or not the API will become public; returning arrays of errors isn’t exactly ‘nice’ or ‘standard’.

3 thoughts on “IPC performance improvements for insert queries”

  1. Why not run them all serially and stop on the first error, returning the error + the index of the failed query? Seems like a saner API.

  2. I would love to see someone do a comparison using POSIX message queues, or even a semaphore/shared memory approach.

  3. @Alexander Larsson: One reason is that, in this API, each query is independent of the other queries: queries 7 and 8 should not stop when query 6 has an error. Each ‘query’ is a transaction by itself (a ‘query’ can contain multiple INSERTs, with whitespace as the delimiter). With this API it would be wrong to make a query 7 depend on a query 6; instead, queries 6 and 7 should be merged into a single query, separated by a space (that merged query is then one transaction; see the short example below).

    Your proposal is of course still possible if we allow clients to just keep sending queries (transactions, if you prefer that word) and then, at the end, return the index plus error of everything that was fed in. But we’d still need to return an array of index+error pairs from the commit API, just like the current API does; there would be no benefit, except that right now you have to collect all queries and pass them as one big array, whereas with a split API client developers could perhaps use less memory … hmm.

    Returning the error for each query immediately would require executing that query right away, which would mean we are once again blocking per query (not good).

    Anyway, yes, the API isn’t final and might change. It isn’t really intended to be public at the moment, either.

    This one illustrates it better:

    http://git.gnome.org/browse/tracker/tree/tests/tracker-steroids/tracker-test.c?h=multi-insert#n297
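
    To make the merging point above concrete, here is a hypothetical example (the URNs and ontology terms are made up for illustration):

    const gchar *queries[] = {
            /* ‘query 6’ and ‘query 7’ merged into one transaction,
             * because 7 depends on 6: */
            "INSERT { <urn:album:a> a nmm:MusicAlbum } "
            "INSERT { <urn:song:b> nmm:musicAlbum <urn:album:a> }",

            /* an unrelated transaction stays a separate array element: */
            "INSERT { <urn:photo:c> a nfo:Image }",
    };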

    @Conrad: Check the link behind “Adrien’s FD passing D-Bus IPC technique” in the article for a performance analysis of various IPC methods.

    We don’t think shared memory with message queues is a good idea for short-lived data like queries. But you are welcome to adapt Adrien’s test software to include the IPC mechanism you have in mind, and to publish your findings. Also note that a query result (or the data to insert) can be large; a shared-memory-with-message-queues solution would require us to buffer all of that in memory (per message) until the other side has received it. With FD passing you can elegantly stream it, which keeps memory consumption low and flat (which is important).
