Het Internet Der Crap

Ik heb in Belgische bedrijven al bugs gefiled waarbij een buffer afkomstig van een Angular HTTP POST sessie gewoon met memcpy in een stack buffer van 1024 bytes gekopieerd wordt, met uiteraard strlen op de origin buffer als size_t n parameter. Daar kwam zelfs ruzie van want de geniale programmeur vond dat dat geen bug was, omdat je de admin credentials nodig had om te reproduceren. Het probleem was dat ook zonder in te loggen de buffer tot aan de memcpy kon geraken.

Het is ook geen uitzondering om wanneer je met de strings tool op de binaries op vele routers losgaat, je gewoonweg backdoors en hardcoded passwords krijgt. Je komt programmeurs tegen die dat er gewoon ingestopt hebben. Dat vinden ze normaal. Ruwe anti-security arrogantie. En ze zijn nog trots op hun onprofessionalisme ook.

Deze tijd is dan ook niet de tijd van het Internet Der Dingen, maar wel van Het Internet Der Crap. Gemaakt door prutsers. Zelfs de backdoors zijn prutswerk.

Het is zeer erg gesteld. Ik voel me beschaamd om vele van mijn collega’s en wens mij te distantiëren.

Ik heb nog eens m’n steentje bijgedrage

Heb dit geyoutubed gisterenavond toen m’n duikclub was komen vlees op een rooster en gloeiende koolen verbranden. Oh met den Hans Teeuwen en de Theo Maassen zijn we ook aan’t lachen geweest. Na één uur de Geubels vond ik dat we ook eens andere komieken moeten laten zien aan de wereld. Duikers zijn een leutige bende. Een beetje zoals ons, neurdjes. Ik raad alle andere computermannekens aan om ook te gaan duiken. Toffe mensen.

https://www.youtube.com/watch?v=yi1XwrPwz1k

Dit is niet de eerste keer dat ik dit filmpje blog. Maar ik zou graag meer Astronomers willen tegenkomen zodat we misschien effectief die planeet hier kunnen opkopen. En dan een gigantisch ruimteschip maken. Samen! Brand new world order 2.0. Or, shut the fuck up.

Daarnet nog een goed idee besproken met diene dat is blijven crashen. Ik wil hier in Rillaar toch ooit eens een zwembad zetten. Misschien moet ik er ééntje maken in de vorm van een schotel? Dan kunnen de duikers bij mij komen oefenen.

Hey guys

Have you guys stopped debating systemd like a bunch of morons already? Because I’ve been keeping myself away from the debate a bit: the amount of idiot was just too large for my mind.People who know me also know that quite a bit of idiot fits into it.

I remember when I was younger, somewhere in the beginning of the century, that we first debated ORBit-2, then Bonobo, then software foolishly written with it like Evolution, Mono (and the idea of rewriting Evolution in C#. But first we needed a development environment MonoDevelop to write it in – oh the gnomes). XFree86 and then the X.Org fork. Then Scaffolding and Anjuta. Beagle and Tracker (oh the gnomes). Rhythmbox versus Banshee (oh the gnomes). Desktop settings in gconf, then all sorts of gnome services, then having a shared mainloop implementation with Qt.

Then god knows what. Dconf, udev, gio, hal, FS monitoring: a lot of things that were silently but actually often bigger impact changes than systemd is because much much more real code had to be rewritten, not just badly written init.d-scripts. The Linux eco-system has reinvented itself several times without most people having noticed it.

Then finally D-Bus came. And yes, evil Lennart was there too. He was also one of those young guys working on stuff. I think evil evil pulseaudio by that time (thank god Lennart replaced the old utter crap we had with it). You know, working on stuff.

D-Bus’s debate began a bit like systemd’s debate: Everybody had feelings about their own IPC system being better because of this and that (most of which where really bad copies of xmms’s remote control infrastructure). It turned out that KDE got it mostly right with DCOP, so D-Bus copied a lot from it. It also opened a lot of IPC author’s eyes that message based IPC, uniform activation of services, introspection and a uniform way of defining the interface are all goddamned important things. Also other things, like tools for monitoring and debugging plus libraries for all goddamn popular programming environments and most importantly for IPC their mainloops, appeared to be really important. The uniformity between Qt/KDE and Gtk+/GNOME application’s IPC systems was quite a nice thing and a real enabler: suddenly the two worlds’ softwares could talk with each other. Without it, Tracker could not have happened on the N900 and N9. Or how do you think qt/qsparql talks with it?

Nowadays everybody who isn’t insane or has a really, really, really good reason (like awesome latency or something, although kdbus solves that too), and with exception of all Belgian Linux programmers (who for inexplicable reasons all want their own personal IPC – and then endlessly work on bridges to all other Belgian Linux programmer’s IPC systems), won’t write his own IPC system. They’ll just use D-Bus and get it over with (or they initially bridge to D-Bus, and refactor their own code away over time).

But anyway.

The D-Bus debate was among software developers. And sometimes teh morons joined. But they didn’t understand what the heck we where doing. It was easy to just keep the debate mostly technical. Besides, we had some (for them) really difficult to understand stuff to reply like “you have file descriptor passing for that”, “study it and come back”. Those who came back are now all expert D-Bus users (I btw think and/or remember that evil Lennart worked on FD passing in D-Bus).

Good times. Lot’s of debates.

But the systemd debate, not the software, the debate, is only moron.

Recently I’ve seen some people actually looking into it and learning it. And then reporting about what they learned. That’s not really moron of course. But then their blogs get morons in the comments. Morons all over the place.

Why aren’t they on their fantastic *BSD or Devuan or something already?

ps. Lennart, if you read this (I don’t think so): I don’t think you are evil. You’re super nice and fluffy. Thanks for all the fish!

De dierentuin: geboortebeperking versus slachten

Michel Vandenbosch versus Dirk Draulans: ben ooit zo’n 15 jaar vegetariër geweest om vandaag tevreden te zijn over dit goed voorbereid en mooi gebalanceerd debat. Mijn dank aan de redactie van Terzake.

Ik was het eens met beide heren. Daarom was dit een waardig filosofische discussie: geboortebeperking versus het aan de leeuwen voeden van overbodige dieren plus het nut en de zin van goed gerunde dierenparken. Dat nut is me duidelijk: educatie voor de dwaze mens (z’n kinders, in de hoop dat de opvolging minder dwaas zal zijn)

Hoewel ik het eens was met beide ben ik momenteel zelf meer voor het aan de leeuwen voeden van overbodige dieren dan dat ik voor geboortebeperking van wel of niet bedreigde diersoorten ben. Dat leek me, met groot respect voor Vandenbosch’s, Draulans’ standpunt te zijn. Ethisch snapte ik Vandenbosch ook: is het niet beter om aan geboortebeperking te doen teneinde het leed van een slachting te vermijden?

Ik kies voor het standpunt van Draulans omdat dit het meeste de echte wereld nabootst. Ik vind het ook zeer goed dat het dierenpark de slachting van de giraffe aan de kinderen toonde. Want dit is de werkelijkheid. Onze kinderen moeten de werkelijkheid zien. We moeten met ons verstand de werkelijkheid nabootsen. Geen eufemisme zoals het doden van een giraffe een euthanasie noemen. Laten we onze kinders opvoeden met de werkelijkheid.

Maalwarkstrodon

It’s a mythical beast that speaks in pornographic subplots and maintains direct communication with your girlfriends every wants and desires so as better to inform you on how to best please her. It has the feet of bonzi buddy, the torso of that man who uses 1 weird trick to perfect his abs, and the arms of the scientists that hate her. Most impressively, Maalwarkstrodon has a skull made from a Viagra, Levitra, Cialis, and Propecia alloy. This beast of malware belches sexy singles from former east-bloc soviet satellite states and is cloaked in the finest fashions from paris and milan, imported directly from Fujian china.

Maalwarkstrodon is incapable of offering any less than the best deals at 80% to 90% off, and will not rest until your 2 million dollar per month work-at-home career comes to fruition and the spoils of all true nigerian royalty are delivered unto those most deserving of a kings riches.

Maalwarkstrodon will also win the malware arms race.

RE: Scudraketten

Wanneer Isabel Albers iets schrijft ben ik aandachtig: wij moeten investeren in infrastructuur.

Dit creëert welvaart, distribueert efficiënt het geld en investeert in onze kinderen hun toekomst: iets wat nodig is; en waar we voor staan.

De besparingsinspanningen kunnen we beperken wat betreft investeringen in infrastructuur; we moeten ze des te meer doorvoeren wat betreft andere overheidsuitgaven.

Misschien moeten we bepaalde scudraketten lanceren? Een scudraket op de overheidsomvang zou geen slecht doen.

Een week mediastorm meemaken over hoe hard we snoeien in bepaalde overheidssectoren: laten we dat hard en ten gronde doen.

Laten we tegelijk investeren in de Belgische infrastructuur. Laten we veel investeren.

Rusland

De cultuur die heerst, leeft,  ploert en zweet op het grondgebied van Rusland zal niet verdwijnen. Zelfs niet na een nucleaire aanval tenzij die echt verschrikkelijk wreedaardig is. En dan zou ik ons verdommen dat we dat gedaan hebben zoals ik hoor te doen. Wij, Europeanen, moeten daar mee leven zoals zij met ons moeten leven.

Mr. Dillon; smartphone innovation in Europe ought to be about people’s privacy

Dear Mark,

Your team and you yourself are working on the Jolla Phone. I’m sure that you guys are doing a great job and although I think you’ve been generating hype and vaporware until we can actually buy the damn thing, I entrust you with leading them.

As their leader you should, I would like to, allow them to provide us with all of the device’s source code and build environments of their projects so that we can have the exact same binaries. With exactly the same I mean that it should be possible to use MD5 checksums. I’m sure you know what that means and you and I know that your team knows how to provide geeks like me with this. I worked with some of them together during Nokia’s Harmattan and Fremantle and we both know that you can easily identify who can make this happen.

The reason why is simple: I want Europe to develop a secure phone similar to how, among other open source projects, the Linux kernel can be trusted. By peer review of the source code.

Kind regards,

A former Harmattan developer who worked on a component of the Nokia N9 that stores the vast majority of user’s privacy.

ps. I also think that you should reunite Europe’s finest software developers and secure the funds to make this workable. But that’s another discussion which I’m eager to help you with.

We delivered

Damned guys, we’re too shy about what we delivered. When the N900 was made public we flooded the planets with our blogs about it. And now?

I’m proud of the software on this device. It’s good. Look at what Engadget is writing about it! Amazing. We should all be proud! And yes, I know about the turbulence in Nokia-land. Deal with it, it’s part of our job. Para-commandos don’t complain that they might get shot. They just know. It’s called research and development! (I know, bad metaphor)

I don’t remember that many good reviews about even the N900, and that phone was by many of its owners seen as among the best they’ve ever owned. Now is the time to support Harmattan the same way we passionately worked on the N900 and its predecessor tablets (N810, N800 and 770). Even if the N9’s future is uncertain: who cares? It’s mostly open source! And not open source in the ‘Android way’. You know what I mean.

The N9 will be a good phone. The Harmattan software is awesome. Note that Tracker and QSparql are being used by many of its standard applications. We have always been allowed to develop Tracker the way it’s supposed to be done. Like many other similar projects: in upstream.

As for short term future I can announce that we’re going to make Michael Meeks happy by finally solving the ever growing journal problem. Michael repeatedly and rightfully complained about this to us at conferences. Thanks Michael. I’ll write about how we’ll do it, soon. We have some ideas.

We have many other plans for long term future. But let’s for now work step by step. Our software, at least what goes to Harmattan, must be rock solid and very stable from now on. Introducing a serious regression would be a catastrophe.

I’m happy because with that growing journal – problem, I can finally focus on a tough coding problem again. I don’t like bugfixing-only periods. But yeah, I have enough experience to realize that sometimes this is needed.

And now, now we’re going to fight.

IPC performance improvements for insert queries

Although with SQLite WAL we have direct-access now, we don’t support direct-access for insert and delete SPARQL queries. Those queries when made using libtracker-sparql still go over D-Bus using Adrien’s FD passing D-Bus IPC technique. The library will do that for you.

After investigating a performance analysis by somebody from Intel we learned that there is still a significant overhead per each IPC call. In the analysis the person made miner-fs combine multiple insert transactions together and then send it over as a single big transaction. This was noticeably faster than making many individual IPC requests.

The problem with this is that if one of the many insert queries fail, they all fail: not good.

We’re now experimenting with a private API that allows you to pass n individual insert transactions, and get n errors back, using one IPC call.

The numbers are promising even on Desktop D-Bus (the test):

$ cd tests/functional-tests/
$ ./update-array-performance-test
First run (first update then array)
Array: 0.103675, Update: 0.139094
Reversing run (first array then update)
Array: 0.290607, Update: 0.161749
$ ./update-array-performance-test
First run (first update then array)
Array: 0.105920, Update: 0.137554
Reversing run (first array then update)
Array: 0.118785, Update: 0.130630
$ ./update-array-performance-test
First run (first update then array)
Array: 0.108501, Update: 0.136524
Reversing run (first array then update)
Array: 0.117308, Update: 0.151192
$

We’re now deciding whether or not the API will become public; returning arrays of errors isn’t exactly ‘nice’ or ‘standard’.

LRU cache for prepared statements in Tracker’s RDF store

While trying to handle a bug that had a description like “if I do this, tracker-store’s memory grows to 80MB and my device starts swapping”, we where surprised to learn that a sqlite3_stmt consumes about 5 kb heap. Auwch.

Before we didn’t think that those prepared statements where very large, so we threw all of them in a hashtable for in case the query was ran again later. However, if you collect thousands of such statements, memory consumption obviously grows.

We decided to implement a LRU cache for these prepared statements. For clients that access the database using direct-access the cache will be smaller, so that max consumption is only a few megabytes. Because our INSERT and DELETE queries are more reusable than SELECT queries, we split it into two different caches per thread.

The implementation is done with a simple intrinsic linked ring list. We’re still testing it a little bit to get good cache-size numbers. I guess it’ll go in master soon. For your testing pleasure you can find the branch here.

Support for SPARQL IN and NOT IN, the new class signals

I made some documentation about our SPARQL-IN feature that we recently added. I added some interesting use-cases like doing an insert and a delete based on in values.

For the new class signal API that we’re developing this and next week, we’ll probably emit the IDs that tracker:id() would give you if you’d use that on a resource. This means that IN is very useful for the purpose of giving you metadata of resources that are in the list of IDs that you just received from the class signal.

We never documented tracker:id() very much, as it’s not an RDF standard; rather it’s something Tracker specific. But neither are the class signals a RDF standard; they are Tracker specific too. I guess here that makes it usable in combo and turns the status of ‘internal API’, irrelevant.

We’re right now prototyping the new class signals API. It’ll probably be a “sa(iii)a(iii)”:

That’s class-name and two arrays of subject-id, predicate-id, object-id. The class-name is to allow D-Bus filtering. The first array are the deletes and the second are the inserts. We’ll only give you object-ids of non-literal objects (literal objects have no internal object-id). This means that we don’t throw literals to you in the signal (you need to make a query to get them, we’ll throw 0 to you in the signal).

We give you the object-ids because of a use-case that we didn’t cover yet:

Given triple <a> nie:isLogicalPartOf <b>. When <a> is deleted, how do you know <b> during the signal? So the feature request was to do a select ?b { <a> nie:isLogicalPartOf ?b } when <a> is deleted (so the client couldn’t do that query anymore).

With the new signal we’ll give you the ID of <b> when <a> is deleted. We’ll also implement a tracker:uri(integer id) allowing you to get <b> out of that ID. It’ll do something like this, but then much faster: select ?subject { ?subject a rdfs:Resource . FILTER (tracker:id(?subject) IN (%d)) }

I know there will be people screaming for all objects, also literals, in the signals, but we don’t want to flood your D-Bus daemon with all that data. Scream all you want. Really, we don’t. Just do a roundtrip query.

Domain indexes finished, technical conclusions

The support for domain specific indexes is, awaiting review / finished. Although we can further optimize it now. More on that later in this post. Image that you have this ontology:

nie:InformationElement a rdfs:Class .

nie:title a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nie:InformationElement ;
  rdfs:range xsd:string .

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement .

nmm:beatsPerMinute a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nmm:MusicPiece ;
  rdfs:range xsd:integer .

With that ontology there are three tables called “Resource”, “nmo:MusicPiece” and “nie:InformationElement” in SQLite’s schema:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” one has ID and “nmm:beatsPerMinute”

That’s fairly simple, right? The problem is that when you ORDER BY “nie:title” that you’ll cause a full table scan on “nie:InformationElement”. That’s not good, because there are less “nmm:MusicPiece” records than “nie:InformationElement” ones.

Imagine that we do this SPARQL query:

SELECT ?title WHERE {
   ?resource a nmm:MusicPiece ;
             nie:title ?title
} ORDER BY ?title

We translate that, for you, to this SQL on our schema:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nie:InformationElement2"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

OK, so with support for domain indexes we change the ontology like this:

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement ;
  tracker:domainIndex nie:title .

Now we’ll have the three tables called “Resource”, “nmo:MusicPiece” and “nie:InformationElement” in SQLite’s schema. But they will look like this:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” table now has three columns called ID, “nmm:beatsPerMinute” and “nie:title”

The same data, for titles of music pieces, will be in both “nie:InformationElement” and “nmm:MusicPiece”. We copy to the mirror column during ontology change coping, and when new inserts happen.

When now the rdf:type is known in the SPARQL query as a nmm:MusicPiece, like in the query mentioned earlier, we know that we can use the “nie:title” from the “nmm:MusicPiece” table in SQLite. That allows us to generate you this SQL query:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1"
  WHERE  "title_u" IS NOT NULL
) ORDER BY "title_u"

A remaining optimization is when you request a rdf:type that is a subclass of nmm:MusicPiece, like this:

SELECT ?title WHERE {
  ?resource a nmm:MusicPiece, nie:InformationElement ;
            nie:title ?title
} ORDER BY ?title

It’s still not as bad as now the “nie:title” is still taken from the “nmm:MusicPiece” table. But the join with “nie:InformationElement” is still needlessly there (we could just do the earlier SQL query in this case):

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

We will probably optimize this specific use-case further later this week.

Friday’s performance improvements in Tracker

The crawler’s modification time queries

Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.

Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObjects wasn’t too large, that wasn’t a huge problem. But it didn’t scale linear. I started with optimizing the function itself. It was using a strlen() so I replaced that with a sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me ~ 10%; not enough.

So this commit was a better improvement. First it makes nfo:belongsToContainer an indexed property. The x nfo:belongsToContainer p means x is in a directory p for file resources. The commit changes the query to use the property that is now indexed.

The original query before we started with this optimization took 1.090s when you had ~ 300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison because now we use an indexed property. Adding the index only took a total of 10s for a ~ 300,000 large table and the table is being queried while we index (while we insert into it). Do the math, it’s a huge win in all situations. For the SQLite freaks; the SQLite database grew by 4 MB, with all items in the table indexed.

PDF extractor

Another optimization I did earlier was the PDF extractor. Originally, we used the poppler-glib library. This library doesn’t allow us to set the OutputDev at runtime. If compiled with Cairo, the OutputDev is in some versions a CairoOutputDev. We don’t want all images in the PDF to be rendered to a Cairo surface. So I ported this back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master poppler_page_get_text_page is always using a TextOutputDev).

Another major problem with poppler-glib is the huge amount of copying strings in heap. The performance to extract metadata and content text for a 70 page PDF document without any images went from 1.050s to 0.550s. A lot of it was caused by copying strings and GValue boxing due to GObject properties.

Table locked problem

Last week I improved D-Bus marshaling by using a database cursor. I forgot to handle SQLITE_LOCKED while Jürg and Carlos had been introducing multithreaded SELECT support. Not good. I fixed this; it was causing random Table locked errors.

Performance DBus handling of the query results in Tracker’s RDF service

Before

For returning the results of a SPARQL SELECT query we used to have a callback like this. I removed error handling, you can find the original here.

We need to marshal a database result_set to a GPtrArray because dbus-glib fancies that. This is a lot of boxing the strings into GValue and GStrv. It does allocations, so not good.

static void
query_callback(TrackerDBResultSet *result_set,GError *error,gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  GPtrArray *values = tracker_dbus_query_result_to_ptr_array (result_set);
  dbus_g_method_return (info->context, values);
  tracker_dbus_results_ptr_array_free (&values);
}

void
tracker_resources_sparql_query (TrackerResources *self, const gchar *query,
                                DBusGMethodInvocation *context, GError **error)
{
  TrackerDBusMethodInfo *info = ...; guint request_id;
  TrackerResourcesPrivate *priv= ...; gchar *sender;
  info->context = context;
  tracker_store_sparql_query (query, TRACKER_STORE_PRIORITY_HIGH,
                              query_callback, ...,
                              info, destroy_method_info);
}

After

Last week I changed the asynchronous callback to return a database cursor. In SQLite that means an sqlite3_step(). SQLite returns const pointers to the data in the cell with its sqlite3_column_* APIs.

This means that now we’re not even copying the strings out of SQLite. Instead, we’re using them as const to fill in a raw DBusMessage:

static void
query_callback(TrackerDBCursor *cursor,GError *error,gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  DBusMessage *reply; DBusMessageIter iter, rows_iter;
  guint cols; guint length = 0;
  reply = dbus_g_method_get_reply (info->context);
  dbus_message_iter_init_append (reply, &iter);
  cols = tracker_db_cursor_get_n_columns (cursor);
  dbus_message_iter_open_container (&iter, DBUS_TYPE_ARRAY,
                                    "as", &rows_iter);
  while (tracker_db_cursor_iter_next (cursor, NULL)) {
    DBusMessageIter cols_iter; guint i;
    dbus_message_iter_open_container (&rows_iter, DBUS_TYPE_ARRAY,
                                      "s", &cols_iter);
    for (i = 0; i < cols; i++, length++) {
      const gchar *result_str = tracker_db_cursor_get_string (cursor, i);
      dbus_message_iter_append_basic (&cols_iter,
                                      DBUS_TYPE_STRING,
                                      &result_str);
    }
    dbus_message_iter_close_container (&rows_iter, &cols_iter);
  }
  dbus_message_iter_close_container (&iter, &rows_iter);
  dbus_g_method_send_reply (info->context, reply);
}

Results

The test is a query on 13500 resources where we ask for two strings, repeated eleven times. I removed a first repeat from each round, because the first time the sqlite3_stmt still has to be created. This means that our measurement would get a few more milliseconds. I also directed the standard out to /dev/null to avoid the overhead created by the terminal. The results you see below are the value for “real”.

There is of course an overhead created by the “tracker-sparql” program. It does demarshaling using normal dbus-glib. If your application uses DBusMessage directly, then it can avoid the same overhead. But since for both rounds I used the same “tracker-sparql” it doesn’t matter for the measurement.

$ time tracker-sparql -q "SELECT ?u  ?m { ?u a rdfs:Resource ;
          tracker:modified ?m }" > /dev/null

Without the optimization:

0.361s, 0.399s, 0.327s, 0.355s, 0.340s, 0.377s, 0.346s, 0.380s, 0.381s, 0.393s, 0.345s

With the optimization:

0.279s, 0.271s, 0.305s, 0.296s, 0.295s, 0.294s, 0.295s, 0.244s, 0.289s, 0.237s, 0.307s

The improvement ranges between 7% and 40% with average improvement of 22%.

Focus on query performance

Every (good) developer knows that copying of memory and boxing, especially when dealing with a large amount of pieces like members of collections and the cells in a table, are a bad thing for your performance.

More experienced developers also know that novice developers tend to focus on just their algorithms to improve performance, while often the single biggest bottleneck is needless boxing and allocating. Experienced developers come up with algorithms that avoid boxing and copying; they master clever pragmatical engineering and know how to improve algorithms. A lot of newcomers use virtual machines and script languages that are terrible at giving you the tools to control this and then they start endless religious debates about how great their programming language is (as if it matters). (Anti-.NET people don’t get on your horses too soon: if you know what you are doing, C# is actually quite good here).

We were of course doing some silly copying ourselves. Apparently it had a significant impact on performance.

Once Jürg and Carlos have finished the work on parallelizing SELECT queries we plan to let the code that walks the SQLite statement fill in the DBusMessage directly without any memory copying or boxing (for marshalling to DBus). We found the get_reply and send_reply functions; they sound useful for this purpose.

I still don’t really like DBus as IPC for data transfer of Tracker’s RDF store’s query results. Personally I think I would go for a custom Unix socket here. But Jürg so far isn’t convinced. Admittedly he’s probably right; he’s always right. Still, DBus to me doesn’t feel like a good IPC for this data transfer..

We know about the requests to have direct access to the SQLite database from your own process. I explained in the bug that SQLite3 isn’t MVCC and that this means that your process will often get blocked for a long time on our transaction. A longer time than any IPC overhead takes.