Comments on: IPC performance, the report

By: Adrien Bustany

Adrien Bustany — Thu, 20 May 2010 15:14:40 +0000

@Peter Lund: Interesting point, I admit I’m not totally aware of what happens at kernel level. Cachegrind does give some cache hit/miss numbers, I should maybe have a look at them for various buffer sizes.

You might want to read the updated version of the report: http://blogs.gnome.org/abustany/2010/05/20/ipc-performance-the-return-of-the-report/
It does include a new IPC method, and also a report about my vmsplice experiment.

By: Peter Lund

Peter Lund — Wed, 19 May 2010 17:36:16 +0000

“On Linux, a memory page is (generally) 4096 bytes, as a consequence buffers smaller than 4kB will use a full memory page when sent over the socket and waste memory bandwidth.”

Say what?

Only the parts actually written/read will be copied — the kernel pretty much just does a memcpy(). The page size is irrelevant.

Btw, copying (with memcpy or similar) is practically free as long as everything stays within the CPU cache(s). If you are moving big amounts of data around you’d rather not stream through it more than once because if it is big it will /not/ fit in cache. The CPU might be able to 1) do smart prefetching, 2) have a sensible write allocation policy for its cache(s), 3) not blow the cache(s) entirely but retain most of the non-copied data. I don’t believe we’re quite there yet ;)

Using vmsplice() (or mmap() tricks) or copying smaller chunks that each fit in the cache ought to be good ideas. You also don’t have the 2x spike in memory use around the copying (1x for the sender, 1x for the receiver).
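
To make the chunked-copy idea concrete, here is a minimal sketch in Python (not the code from the report). It streams a large buffer through a pipe in fixed-size pieces, so the receiving side never holds more than one piece at a time. The 64 KiB chunk size and the name `stream_in_chunks` are assumptions chosen for illustration; whether a chunk actually stays in cache depends on the CPU.

```python
import os
import threading

CHUNK_SIZE = 64 * 1024  # assumed chunk size, not a measured optimum


def stream_in_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Push payload through a pipe piece by piece.

    Returns (total bytes received, size of the largest single read).
    """
    read_fd, write_fd = os.pipe()

    def writer():
        view = memoryview(payload)  # slicing a memoryview does not copy
        try:
            for offset in range(0, len(view), chunk_size):
                piece = view[offset:offset + chunk_size]
                while piece:  # os.write may write fewer bytes than asked
                    written = os.write(write_fd, piece)
                    piece = piece[written:]
        finally:
            os.close(write_fd)  # closing the write end signals EOF

    thread = threading.Thread(target=writer)
    thread.start()

    received = 0
    largest = 0
    try:
        while True:
            chunk = os.read(read_fd, chunk_size)
            if not chunk:
                break
            received += len(chunk)
            largest = max(largest, len(chunk))
    finally:
        os.close(read_fd)
    thread.join()
    return received, largest
```

The reader only ever sees one chunk at a time, which is the property that avoids the second full-size copy on the receiving side.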

By: pvanhoof

pvanhoof — Sun, 16 May 2010 11:44:28 +0000

@Aigars Mahinovs: Apples and oranges: Tokyo Cabinet is a key-value store, not a database with an SQL front end. We translate the SPARQL query into SQL, so we need the expressiveness of such an SQL front end (Tokyo Cabinet’s query expressiveness isn’t sufficient, no). With a key-value store, implementing a SPARQL endpoint would be too difficult to do in a well-performing way. Note that at some point we considered Tokyo Cabinet for our FTS needs. But because sqlite-fts was already well integrated with SQLite, we opted for this instead. When somebody finds the time to implement a virtual table for Tokyo Cabinet in SQLite, then we might indeed reconsider this (but only for FTS, and note that for FTS, too, we already have a few specific needs. We would like snippet support, for example).

Keeping everything in RAM also isn’t an option: the target hardware is a mobile device with only a few hundred megabytes of RAM. Not yet gigabytes. But maybe in the future this will be an option? BTW, if you have the right indexes in your tables, then on-disk databases aren’t going to perform badly either.

In any case, the storage of the data also wasn’t really the point of the experiment. But we did want a realistic measurement, not a sort of “everything-is-in-RAM-dream-come-true”; that’s not useful to measure. In particular, sqlite3_value_bytes(), sqlite3_value_text() and the memcpy on their results had to be realistic for this experiment.

By: Aigars Mahinovs

Aigars Mahinovs — Sun, 16 May 2010 11:17:33 +0000

If you want to increase the performance 5-10 times, use Tokyo Cabinet and not SQLite to actually get the data. But if you want maximum performance and don’t care that much about RAM, then keeping the data in RAM can easily be 10-50 times faster than any database.

By: pvanhoof

pvanhoof — Fri, 14 May 2010 11:54:41 +0000

@Alexander Larsson: Ah, yes, with a pipe it would indeed be possible to use a FD. You’re right about that. I guess that, at least on Linux, a pipe will give similar results to a Unix domain socket? Under the hood I expect that the kernel deals with both the same way. I need to look up how vmsplice works; splicing user pages into a pipe sounds good, yes.

By: Alexander Larsson

Alexander Larsson — Fri, 14 May 2010 11:15:45 +0000

pvanhoof:

A file descriptor is only a pipe2() call away. You just create one, send the reading side to the client and then write out the data on the writing side. Additionally it might be possible to get less copying if you use vmsplice() to write to the buffer.
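
As a rough illustration of the pattern (a sketch, not gvfs or Tracker code), the Python below creates a pipe, hands the reading side to a "client" over a Unix domain socket using SCM_RIGHTS, and then writes the data on the writing side. `os.pipe()` stands in for pipe2(), a socket pair stands in for the real control channel, and the helper names are hypothetical. Both ends run in one process here, so the payload has to fit in the pipe buffer.

```python
import array
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def pass_pipe_to_client(payload: bytes) -> bytes:
    """Server creates a pipe, passes the read end, writes; client reads."""
    server_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()
    try:
        send_fd(server_sock, read_fd)
        os.close(read_fd)  # the server keeps only the writing side
        client_fd = recv_fd(client_sock)
    finally:
        server_sock.close()
        client_sock.close()

    os.write(write_fd, payload)  # small payload: fits in the pipe buffer
    os.close(write_fd)           # EOF for the reader

    with os.fdopen(client_fd, "rb") as reader:
        return reader.read()
```

In a real setup the client would be a separate process and the server would keep writing while the client reads, so the payload size would not be limited by the pipe buffer.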

I don’t understand what sqlite3_step overwriting data has to do with anything, this would obviously copy the data just like your custom socket protocol.

By: pvanhoof

pvanhoof — Fri, 14 May 2010 11:09:58 +0000

@Andres Freund: That wasn’t the point of this experiment, though. Note that with shared-cache and read_uncommitted mode you can get pretty far with SQLite. Although we’d need a bit more, indeed. SQLite is a design decision for Tracker and we’re of course well aware of SQLite’s limitations. Long term we might consider a database engine like Firebird, but this isn’t being planned at this moment (not for the 0.9 series of Tracker).

By: Andres Freund

Andres Freund — Fri, 14 May 2010 10:32:00 +0000

Perhaps the problems referenced in the beginning are a good indication that making sqlite into a multi-user, multi-client database might not be the right thing to do. It’s not exactly intended to be that…

By: pvanhoof

pvanhoof — Fri, 14 May 2010 09:07:53 +0000

@Kalle Vahlman: Yes, we already knew that D-Bus wasn’t fit for large transfers. The experiment was, however, done to see how well such a custom protocol and Unix domain socket would fit with Tracker’s tracker-store RDF query service, and how fast we could make it given SQLite’s limitations (sqlite3_value_bytes is rather slow); not to prove that D-Bus is slow at large transfers, which everybody indeed already knows. If it were files then we’d of course use D-Bus’s FD passing mechanisms. Your so-called *real conclusion* about Prepare() is precisely why we’ve split the experiment’s socket protocol into a Prepare() and a Fetch(), so that we could easily adapt the solution to something like this afterward.

@Alexander Larsson: When you have a file descriptor then using D-Bus’s FD passing is of course the preferred way. In our case we have sqlite3_step(), though. When (and if) we do replace D-Bus for the transfer of a query’s result data, we won’t replace D-Bus for the other IPC needs, of course. sqlite3_step() overwrites the previous pointers to the cell data, so we’d have to copy the data if we wanted to pass info about shared memory or something similar, causing a lot of memory usage in the server. So instead, we opted for sending it over in chunks of 64k.
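
For readers who want to picture the 64k chunking, here is a minimal Python sketch. It is not Tracker’s actual wire format: the length-prefixed cell encoding, the name `iter_chunks` and the demo table are all assumptions made for illustration. Rows are serialized into a buffer as the cursor steps, and the buffer is handed off in 64 KiB pieces, so the server never holds the whole result set.

```python
import sqlite3
import struct

CHUNK_SIZE = 64 * 1024  # "chunks of 64k"


def iter_chunks(cursor, chunk_size: int = CHUNK_SIZE):
    """Yield the serialized result set in pieces of at most chunk_size bytes.

    Hypothetical framing: each cell is a 4-byte little-endian length
    followed by its UTF-8 text (NULL becomes an empty string).
    """
    buf = bytearray()
    for row in cursor:  # each iteration steps the statement once
        for cell in row:
            data = b"" if cell is None else str(cell).encode("utf-8")
            buf += struct.pack("<I", len(data)) + data
        while len(buf) >= chunk_size:
            yield bytes(buf[:chunk_size])
            del buf[:chunk_size]
    if buf:
        yield bytes(buf)  # final partial chunk


def demo_chunk_sizes():
    """Build a small in-memory table and report the size of each chunk."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (a TEXT, b TEXT)")
    db.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [("x" * 100, "y" * 100) for _ in range(1000)],
    )
    sizes = [len(chunk) for chunk in iter_chunks(db.execute("SELECT a, b FROM t"))]
    db.close()
    return sizes
```

Each yielded chunk would then be written to the socket; the buffer holds at most one chunk plus one row at any time.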

@Rodrigo Kumpera & tvst (streaming-mode suggestions): Sounds like a good idea to me.

By: Alexander Larsson

Alexander Larsson — Fri, 14 May 2010 08:04:01 +0000

This is why gvfs uses dbus for everything except the actual file contents. File contents are instead sent via a custom binary protocol over a pipe that is passed from the gvfs daemon to the client as a result of the open() dbus method.

For gvfs the code handling the fd passing involved is a bit nasty. However, recent versions of dbus have fd passing built in, which should make this very clean and nice.

By: Kalle Vahlman

Kalle Vahlman — Fri, 14 May 2010 06:30:18 +0000

“The real conclusion of this study is: if you have to pass a lot of data between two programs and don’t need a lot of flexibility, then DBus is not the right answer, and never intended to be.”

I’m surprised that someone still needed a study to prove that, I thought it was already common knowledge…

The recommended way to get both the flexibility of D-Bus and the bandwidth of raw sockets is to use the Prepare() call to set up a socket from which the data can be transferred.

*That* should be the conclusion ;)

By: Rodrigo Kumpera

Rodrigo Kumpera — Fri, 14 May 2010 03:30:29 +0000

Can’t DBUS be changed to do progressive message decoding? Or even support a streaming mode? This would solve most of your concerns while improving DBUS at the same time.

By: tvst

tvst — Fri, 14 May 2010 00:36:00 +0000

Great post!

Maybe it would be a good idea to write a dbus api exactly for these bulk-data transmissions. This api would use dbus for service discovery, then transparently forward the data through unix sockets.

This is similar to the way “bluetooth over wifi” works. Bluetooth is used for discovery / handshakes, then wifi is used for bulk data transfer.

From ZDNet:
“Devices will use the regular low-power Bluetooth radios to recognize each other and establish connections,” writes ZDNet’s Rik Fairlie. “If they need to transfer a large file, they will be able to turn on their Wi-Fi radios, then turn them off to save power after finishing the transfer.”
(http://www.zdnet.com/blog/soho-networking/bluetooth-and-wi-fi-combo-could-yield-faster-bulk-data-transfers/139?p=139)

What do you think?