A REPLACE extension for Tracker’s SPARQL Update

SPARQL Update has INSERT and DELETE. To update an existing triple in RDF you need to DELETE it first. You of course already have our INSERT-SILENT but that just ignores certain errors; it doesn’t replace triples.

A performance problem is that each DELETE has to solve all possible solutions first, so every update done with a ‘DELETE-WHERE INSERT’ construction costs you an extra query.

INSERT also checks for old values. It has to do this to implement SPARQL Update, where you can’t insert a triple with a different value than the old value: if the value of a triple is identical, the insert for that triple is ignored; if the triple didn’t exist yet, it’s inserted; if the values aren’t identical, an error is thrown, and you need to use DELETE up front.

Both the extra delete and the old-values check come at a performance price.

To solve this we plan to provide Tracker specific support for REPLACE. It’ll be Tracker specific simply because this isn’t specified in SPARQL Update. That has a probable reason:

Replacing or updating doesn’t fit well in the RDF world. Updating properties that have multiple values, like nie:keyword, is ambiguous: does it need to replace the entire list of values; does it need to append to the list; does it need to update just one item in the list, and which one? This probably explains why it’s not specified in SPARQL Update.

We decided to let our REPLACE differ from INSERT only for single-value properties. For multi-value properties our REPLACE will behave the same as a normal INSERT.

How a GraphUpdated triggered by a REPLACE behaves is still being decided. Especially the value of the object’s ID for resource objects in the ‘deletes’-array. Having to look up the old ID kinda defeats the purpose of having a REPLACE (as we’d still need to look it up, like what an INSERT does, destroying part of the performance gain).

Either way, let me show you some examples:

We start with an insert of a resource that has a single value and two times a multi value property filled in:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title';
             nie:keyword 'keyw1';
             nie:keyword 'keyw2' }

A quick query to verify, and yes it’s in:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title, keyw1
  title, keyw2

If we repeat the insert a second time, then the old-values check will turn it into a no-op:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title';
             nie:keyword 'keyw1';
             nie:keyword 'keyw2' }

And a quick query to verify that, and indeed nothing has changed:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title, keyw1
  title, keyw2

If we’d do that last insert query but with different values, we’d get this:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title new';
             nie:keyword 'keyw4';
             nie:keyword 'keyw3' }

SparqlError.Constraint: Unable to insert multiple values for subject
`r' and single valued property `dc:title' (old_value: 'title', new
 value: 'title new')

Note that for the two nie:keyword triples this would have worked, but because each query is a transaction and the nie:title part failed, those two aren’t written either.

Let’s now try the same with INSERT OR REPLACE (edit: changed from just REPLACE to INSERT OR REPLACE):

INSERT OR REPLACE { <r> a nie:InformationElement ;
                        nie:title 'title new';
                        nie:keyword 'keyw4';
                        nie:keyword 'keyw3' }

And a quick query now yields:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title new, keyw1
  title new, keyw2
  title new, keyw3
  title new, keyw4

You can see that it behaved differently for nie:title than for nie:keyword. That’s because nie:title is a single-value property and nie:keyword is a multi-value property.
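To make the difference between INSERT and INSERT OR REPLACE a bit more tangible, here is a toy model in Python. It is only a sketch of the semantics described in this post, not Tracker's implementation: the `SINGLE_VALUED` set, the `ConstraintError` class and the `insert()` helper are all made up for illustration.

```python
# Toy model of the semantics described above; not Tracker's implementation.
SINGLE_VALUED = {"nie:title"}  # assumption: stands in for nrl:maxCardinality 1


class ConstraintError(Exception):
    """Raised when INSERT would give a single-valued property a second value."""


def insert(store, subject, triples, replace=False):
    """Apply (predicate, value) pairs to `subject` as one transaction."""
    # Work on a copy so that a failure leaves the store untouched.
    updated = {pred: set(vals) for pred, vals in store.get(subject, {}).items()}
    for predicate, value in triples:
        values = updated.setdefault(predicate, set())
        if predicate in SINGLE_VALUED and values and value not in values:
            if not replace:
                raise ConstraintError(f"{predicate}: old {values}, new {value!r}")
            values.clear()  # INSERT OR REPLACE: drop the old value
        values.add(value)  # an identical value is a no-op in a set
    store[subject] = updated


store = {}
insert(store, "<r>", [("nie:title", "title"),
                      ("nie:keyword", "keyw1"), ("nie:keyword", "keyw2")])
try:
    insert(store, "<r>", [("nie:keyword", "keyw3"), ("nie:title", "title new")])
except ConstraintError:
    pass  # the whole transaction is rejected, so keyw3 is not written either
insert(store, "<r>", [("nie:title", "title new"),
                      ("nie:keyword", "keyw4"), ("nie:keyword", "keyw3")],
       replace=True)
```

After the last call the title has been replaced while the keywords have simply been appended, which matches the query results shown above.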

What if we do want to reset the multi value property and insert a complete new list? Simple, just do this as a single query (space or newline delimited) (edit: changed to INSERT OR REPLACE from just REPLACE):

DELETE { <r> nie:keyword ?k } WHERE { <r> nie:keyword ?k }
INSERT OR REPLACE { <r> a nie:InformationElement ;
                        nie:title 'title new';
                        nie:keyword 'keyw4';
                        nie:keyword 'keyw3' }

And a quick query now yields:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title new, keyw3
  title new, keyw4

The work on this is in progress. You can find it in the branch sparql-update. It’s working but especially the GraphUpdated stuff is unfinished.

Also note that the final syntax may change.

Synchronizing your application’s data with Tracker’s RDF store

A few months ago we added the implicit tracker:modified property to all resources. This property is an auto-increment. It used to be that the property was incremented on ~ each SQL update-query that happens. The value is stored per resource.

We are now changing this to be per transaction. A transaction in Tracker is one set of SPARQL-Update INSERT or DELETE queries. You can do inserts and deletes about multiple resources in one such sentence (a sentence can contain multiple space-delimited Update queries). An exception is everything related to ontology changes. These ontology changes get the first increment as their value for tracker:modified. This also holds for ontology changes that happen after the initial ontology transaction (this first transaction is made at the first start). The exception is made to support future ontology changes and the data conversions they might need.

The per-resource tracker:modified value is useful for your application’s synchronization purposes: you can test your application’s stored tracker:modified value against Tracker’s always-increasing tracker:modified value (with an exception at integer overflow) to know whether or not your version is older.
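As a rough sketch of how an application could use this, here is a small Python model. The `ToyStore` class and the `needs_refresh()` helper are hypothetical; they only mimic "one counter bump per transaction" and ignore both integer overflow and the ontology exception mentioned above.

```python
def needs_refresh(cached_modified, store_modified):
    """True when the store has seen a newer transaction than our cached copy."""
    # Assumption: values only grow; integer overflow is ignored in this sketch.
    return store_modified > cached_modified


class ToyStore:
    """Hypothetical stand-in for the store: one counter bump per transaction."""

    def __init__(self):
        self.counter = 0
        self.modified = {}  # resource -> tracker:modified value

    def transaction(self, resources):
        self.counter += 1
        for resource in resources:
            self.modified[resource] = self.counter


toy = ToyStore()
toy.transaction(["<a>", "<b>"])  # both resources get the same value
cached = dict(toy.modified)      # the application synchronizes here
toy.transaction(["<b>"])         # only <b> changes afterwards
stale = [r for r in cached if needs_refresh(cached[r], toy.modified[r])]
```

Only `<b>` ends up in `stale`, so that is the only resource the application would have to fetch again.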

The reason why we are changing this to per-transaction is that this way we can guarantee that the value will be restored after a journal replay and/or a backup restore without having to store it in either the journal or the backup. This means that we don’t have to change either the backup’s format or the journal’s format.

Having a persistent journal we actually make a simple copy of the journal to deliver you a backup in a fast file-copy. But let this deception be known only by the people who care about the implementation. Sssht!

We’re already rotating and compressing the rotated chunks for reducing the journal size. We’re working on not journaling data that is embedded in local files this week. A re-index of that local file will re-insert the data anyway. This will significantly reduce the size of the journal too.

All quiet on the Tracker front

It has been a long time since we wrote propaganda about the Tracker project. That has a lot to do with both the holiday season and the fact that we’re preparing for a stable release. This means that we are increasingly reluctant to add new features.

We still made quite some progress, though. We for example ported almost everything from dbus-glib and dbus-1 to GDBus and GVariant. This was quite a bit of work; the next few weeks will be used for cleaning up and regression fixing. Jürg decided that it was easier to simply port ~ all of tracker-store to Vala than to port tracker-store’s dbus-glib and dbus-1 C code to GDBus. This should please contributors, who will now have a codebase that is much easier to understand.

Tracker’s tracker-store is mostly an IPC layer above libtracker-data plus the signaling mechanism. The public libtracker-sparql is an API layer above the same private libtracker-data that protects you from doing things that you should not do, like trying to do writes without going over the IPC layer.

This week we’re working on transient properties and on adding tracker:modified to the journal and backup file. The tracker:modified property is an auto-incremented value. It’s useful for synchronization purposes. Right now when you replay the journal or restore a backup, we reset the count. This means that in between journal replays and/or backup restores you won’t have the same tracker:modified values for your resources. This is what we are changing. We plan to restore the tracker:modified value during journal replay and backup restore.

Transient properties are properties that are reset after a restart of tracker-store (per desktop session); they aren’t backed up (and therefore not restored either) and they aren’t journaled. They are useful for, for example, the presence status of a contact.

The nice guys and girls at the QSparql team are working on a simple cursor that neither does any buffering nor uses any threads. This is useful for Qt applications that want maximum performance while reading the results of a query.

Last week we introduced and ported to our newest location ontology, SLO.

IPC performance improvements for insert queries

Although with SQLite WAL we have direct-access now, we don’t support direct-access for insert and delete SPARQL queries. Those queries when made using libtracker-sparql still go over D-Bus using Adrien’s FD passing D-Bus IPC technique. The library will do that for you.

After investigating a performance analysis by somebody from Intel we learned that there is still a significant overhead for each IPC call. In the analysis the person made miner-fs combine multiple insert transactions and then send them over as a single big transaction. This was noticeably faster than making many individual IPC requests.

The problem with this is that if one of the many insert queries fails, they all fail: not good.

We’re now experimenting with a private API that allows you to pass n individual insert transactions, and get n errors back, using one IPC call.
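The idea of "n transactions in, n errors out" can be sketched in a few lines of Python. This is not the private API itself: `update_array()`, `fake_run_update()` and the `applied` list are hypothetical names, and the real thing of course crosses D-Bus once instead of calling a local function.

```python
def update_array(run_update, updates):
    """Run each update as its own transaction; return one error slot per update."""
    errors = []
    for update in updates:
        try:
            run_update(update)
            errors.append(None)       # this transaction went fine
        except Exception as exc:      # one failure must not abort the others
            errors.append(exc)
    return errors


applied = []


def fake_run_update(sparql):
    """Hypothetical store: rejects anything mentioning <bad>."""
    if "<bad>" in sparql:
        raise ValueError("constraint violated")
    applied.append(sparql)


results = update_array(fake_run_update, [
    "INSERT { <a> a nie:InformationElement }",
    "INSERT { <bad> a nie:InformationElement }",
    "INSERT { <b> a nie:InformationElement }",
])
```

The caller gets a list that lines up with its input, so it can tell exactly which of the three transactions failed while the other two were still applied.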

The numbers are promising even on Desktop D-Bus (the test):

$ cd tests/functional-tests/
$ ./update-array-performance-test
First run (first update then array)
Array: 0.103675, Update: 0.139094
Reversing run (first array then update)
Array: 0.290607, Update: 0.161749
$ ./update-array-performance-test
First run (first update then array)
Array: 0.105920, Update: 0.137554
Reversing run (first array then update)
Array: 0.118785, Update: 0.130630
$ ./update-array-performance-test
First run (first update then array)
Array: 0.108501, Update: 0.136524
Reversing run (first array then update)
Array: 0.117308, Update: 0.151192
$

We’re now deciding whether or not the API will become public; returning arrays of errors isn’t exactly ‘nice’ or ‘standard’.

LRU cache for prepared statements in Tracker’s RDF store

While trying to handle a bug that had a description like “if I do this, tracker-store’s memory grows to 80MB and my device starts swapping”, we were surprised to learn that a sqlite3_stmt consumes about 5 kB of heap. Ouch.

Before, we didn’t think that those prepared statements were very large, so we threw all of them in a hashtable in case the query was run again later. However, if you collect thousands of such statements, memory consumption obviously grows.

We decided to implement an LRU cache for these prepared statements. For clients that access the database using direct-access the cache will be smaller, so that max consumption is only a few megabytes. Because our INSERT and DELETE queries are more reusable than SELECT queries, we split it into two different caches per thread.

The implementation is done with a simple intrinsic linked ring list. We’re still testing it a little bit to get good cache-size numbers. I guess it’ll go in master soon. For your testing pleasure you can find the branch here.
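For readers who want to see the eviction policy itself, here is a minimal sketch in Python. Note that it swaps in `collections.OrderedDict` for the intrinsic linked ring list used in the branch, and that `StatementCache`, `prepare` and `finalize` are hypothetical names; in the real code the latter two would correspond to something like sqlite3_prepare_v2() and sqlite3_finalize().

```python
from collections import OrderedDict


class StatementCache:
    """LRU cache mapping SQL text to a prepared-statement object."""

    def __init__(self, prepare, finalize, max_size):
        self._prepare = prepare      # creates a statement for a SQL string
        self._finalize = finalize    # releases an evicted statement
        self._max_size = max_size
        self._entries = OrderedDict()

    def get(self, sql):
        if sql in self._entries:
            self._entries.move_to_end(sql)  # mark as most recently used
            return self._entries[sql]
        stmt = self._prepare(sql)
        self._entries[sql] = stmt
        if len(self._entries) > self._max_size:
            _, oldest = self._entries.popitem(last=False)  # least recently used
            self._finalize(oldest)
        return stmt


finalized = []
# Two caches, as in the text: one for SELECTs, one for updates.
select_cache = StatementCache(lambda sql: ("stmt", sql), finalized.append, 2)
update_cache = StatementCache(lambda sql: ("stmt", sql), finalized.append, 2)

select_cache.get("SELECT 1")
select_cache.get("SELECT 2")
select_cache.get("SELECT 1")  # refreshes "SELECT 1"
select_cache.get("SELECT 3")  # evicts "SELECT 2", the least recently used
```

The cache sizes here are arbitrary; finding good numbers is exactly the testing that is still going on.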

Tracker’s new class signal system being developed

Tracker 0.8’s situation

In Tracker 0.8 we have a signal system that causes quite a bit of overhead. The overhead comes from:

  A. Having to store the URIs of the resources involved in a changeset in tracker-store’s memory;
  B. Having to store the predicates involved in a changeset in tracker-store’s memory (less severe than A because we can store a pointer to an instance instead of a string);
  C. Having to UTF-8 validate the strings when we emit them over D-Bus (D-Bus does this implicitly);
  D. D-Bus’s own copying and handling of string data;
  E. Heavy traffic on D-Bus;
  F. Context switching between tracker-store and dbus-daemon;
  G. We have to wait with turning on the D-Bus objects until after we have the latest ontology. So after journal replay. And we need to reset the situation after a backup restore. Complex!

Not all aggregators show this list as A, B, C, D, E, F and G. Sorry for that. I’ll nevertheless refer to the items as such later in this article.

Consumer’s problems with Tracker 0.8’s signal

  1. Aforementioned overhead: consumes a lot of D-Bus traffic. This is caused by sending over URIs for the subjects and the predicates;
  2. Doesn’t make it possible, in case of a delete of <a>, to know <b> in <a> nie:isLogicalPartOf <b>, as <a> is removed at the point of signal emission;
  3. Round trips to know the literals create more D-Bus traffic;
  4. Transactional changes can’t be reliably identified with SubjectsAdded, SubjectsChanged and SubjectsRemoved being separate signals;
  5. A lot of D-Bus objects, instead of letting clients use D-Bus’s filtering system.

The solution that we’re developing for Tracker 0.9

Direct access

With direct-access we remove most of the round-trip cost of a query coming from a consumer that wants a literal object involved in a changeset: by utilizing the TrackerSparqlCursor API with direct-access enabled, you end up doing sqlite3_step() in your own process, directly on meta.db.

For the consumers of the signal, this removes 3.

Sending integer IDs instead of string URIs

A while ago we introduced the SPARQL function tracker:id(resource uri). The tracker:id(resource uri) function gives you a unique number that Tracker’s RDF store uses internally.

Each resource, each class and each predicate (the latter are resources like any other) has such a unique internal ID.

Given that Tracker’s class signal system is specific anyway, we decided not to give you subject URL strings. Instead, we’ll give you the integer IDs.

The Writeback signal also got changed to do this, for the same reasons. But this API is entirely internal and shouldn’t be used outside of the project.

This for us removes A, B, C, D and E. For the consumers of the signal, this removes 1.

Merge added, changed and removed into the one signal

We give you two arrays in one signal: inserts and deletes.

For consumers of the signal, this removes 4.

Add the class name to the signal

This allows you to use a string filter on your signal subscription in D-Bus.

For us this removes G. For consumers of the signal, this removes 5.

Pass the object-id for resource objects

You’ll get a third number in the inserts and deletes arrays: object-id. We don’t send object literals, although for integral objects we’re still discussing this. For resource objects, however, we can give you the object-id without much extra cost.

For consumers of the signal, this removes 2.

SPARQL IN, tracker:id(resource uri) and tracker:uri(int id)

We recently added support for SPARQL IN, we already had tracker:id(resource uri) and I implemented tracker:uri(int id).

This makes things like this possible:

SELECT ?t { ?r nie:title ?t .
            FILTER (tracker:id(?r) IN (800, 801, 802, 807)) }

Where 800, 801, 802 and 807 will be the IDs that you receive in the class signal. And with tracker:uri(int id) it goes like:

SELECT tracker:uri (800) tracker:uri (801)
       tracker:uri (802) tracker:uri (807) { }

For consumers this removes most of the burden introduced by the IDs.

Context switching of processes

What is left is context switching between tracker-store and dbus-daemon, F. This is mostly important for mobile targets (ARM hardware). We reduce the switches by grouping transactions together and then bursting larger sets. It’s both timeout and data-size based (after either a certain amount of time or a certain memory limit, we emit). We’re still testing what the ideal timeouts and sizes are on target hardware.
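The "emit after a timeout or after a size limit, whichever comes first" policy could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `SignalBatcher` class, the use of an event count as a stand-in for the memory limit, and passing the clock in as `now` (which just keeps the example deterministic).

```python
class SignalBatcher:
    """Groups change events and emits them in bursts."""

    def __init__(self, emit, max_events, max_delay):
        self._emit = emit
        self._max_events = max_events  # stand-in for the memory limit
        self._max_delay = max_delay    # seconds since the first pending event
        self._pending = []
        self._first_at = None

    def add(self, events, now):
        if not self._pending:
            self._first_at = now       # the timeout starts at the first event
        self._pending.extend(events)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        if not self._pending:
            return
        too_big = len(self._pending) >= self._max_events
        too_old = now - self._first_at >= self._max_delay
        if too_big or too_old:
            self._emit(list(self._pending))
            self._pending.clear()


emitted = []
batcher = SignalBatcher(emitted.append, max_events=3, max_delay=1.0)
batcher.add([(1, 2, 0)], now=0.0)
batcher.add([(1, 3, 0)], now=0.5)
batcher.add([(2, 2, 0)], now=0.6)  # size limit reached: first burst
batcher.add([(3, 2, 0)], now=2.0)
batcher.maybe_flush(now=3.5)       # timeout reached: second burst
```

Four change events result in only two emissions, which is the whole point: fewer wake-ups of dbus-daemon for the same information.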

Where is the stuff?

The work isn’t reviewed or thoroughly tested yet. This will happen in the next few days and weeks.

Anyway, here’s the branch, documentation, example in Plain C, example in Vala

Support for SPARQL IN and NOT IN, the new class signals

I made some documentation about our SPARQL-IN feature that we recently added. I added some interesting use-cases like doing an insert and a delete based on in values.

For the new class signal API that we’re developing this and next week, we’ll probably emit the IDs that tracker:id() would give you if you’d use that on a resource. This means that IN is very useful for the purpose of giving you metadata of resources that are in the list of IDs that you just received from the class signal.

We never documented tracker:id() very much, as it’s not an RDF standard; rather it’s something Tracker specific. But the class signals aren’t an RDF standard either; they are Tracker specific too. I guess that makes the two usable in combination and makes the ‘internal API’ status irrelevant.

We’re right now prototyping the new class signals API. It’ll probably be a “sa(iii)a(iii)”:

That’s the class-name and two arrays of (subject-id, predicate-id, object-id). The class-name is there to allow D-Bus filtering. The first array holds the deletes and the second the inserts. We’ll only give you object-ids of non-literal objects (literal objects have no internal object-id). This means that we don’t throw literals at you in the signal (you need to make a query to get them; we’ll throw 0 at you in the signal).
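To give an idea of what a consumer would do with such a payload, here is a small Python sketch. It assumes the signal arrives as a class-name string plus two lists of (subject-id, predicate-id, object-id) tuples, as described above; the helper names, the class-name value and the IDs are all made up, and the final signature may still change.

```python
def changed_subject_ids(deletes, inserts):
    """Collect the distinct subject IDs touched by one signal emission."""
    return sorted({subject for subject, _pred, _obj in deletes + inserts})


def titles_query(ids):
    """Build a SPARQL query that fetches the titles for the given IDs."""
    id_list = ", ".join(str(int(i)) for i in ids)  # int() guards the query text
    return ("SELECT ?r ?t { ?r nie:title ?t . "
            f"FILTER (tracker:id(?r) IN ({id_list})) }}")


# Hypothetical payload: the object-id is 0 because the objects are literals.
class_name = "nmm:MusicPiece"
deletes = [(800, 5, 0)]
inserts = [(800, 5, 0), (801, 5, 0)]

ids = changed_subject_ids(deletes, inserts)
query = titles_query(ids)
```

The resulting `query` is the roundtrip you make to get the literals that the signal itself doesn't carry.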

We give you the object-ids because of a use-case that we didn’t cover yet:

Given triple <a> nie:isLogicalPartOf <b>. When <a> is deleted, how do you know <b> during the signal? So the feature request was to do a select ?b { <a> nie:isLogicalPartOf ?b } when <a> is deleted (so the client couldn’t do that query anymore).

With the new signal we’ll give you the ID of <b> when <a> is deleted. We’ll also implement a tracker:uri(integer id) allowing you to get <b> out of that ID. It’ll do something like this, but then much faster: select ?subject { ?subject a rdfs:Resource . FILTER (tracker:id(?subject) IN (%d)) }

I know there will be people screaming for all objects, also literals, in the signals, but we don’t want to flood your D-Bus daemon with all that data. Scream all you want. Really, we don’t. Just do a roundtrip query.

“You’re just making an excuse” is a relative phrase

I recently stumbled upon this marvelous piece. I title the quote “making an excuse“:

Saying that you’re forced to do something when you really aren’t is a failure to take responsibility for your actions. I generally don’t think users of proprietary software are primarily to blame for the challenges of software freedom — nearly all the blame lies with those who write, market, and distribute proprietary software. However, I think that software users should be clear about why they are using the software. It’s quite rare for someone to be compelled under threat of economic (or other) harm to use proprietary software. Therefore, only rarely is it justifiable to say you have to use proprietary software. In most cases, saying so is just making an excuse.

Bradley M. Kuhn – 2010, on his blog

I’ll translate this for you to Catholicism. You can definitely adapt this to most religions (for some, add death penalties like stoning here and there):

Saying that you’re forced by your nature to masturbate when you really aren’t is a failure to take responsibility for your actions. The church generally doesn’t think masturbaters are primarily to blame for the challenges of sexuality — nearly all the blame lies with pornography. However, I think that people who masturbate should be clear about why they have sex with themselves: It’s quite rare for someone to be compelled under the desire of sexual pleasure. Therefore, only rarely is it justifiable to say you have to masturbate. In most cases, saying so is just making an excuse.

The translation

There you go.

Tracker this, Tracker that, everything Tracker

Busy handling

I made an article about reporting busy status in Tracker before.

But back then it wasn’t yet possible to queue a query while Tracker’s RDF store is busy. We’re making this possible following the next unstable release. Yeah I know you guys hate that Tracker’s RDF store can be busy. But you tell us what else to do while restoring a backup, or while replaying a journal?

While we are replaying the journal, or restoring a backup, we’ll accept your result-hungry queries into our queue. Meanwhile you get progress and status indication over a DBus signal. Some documentation about this is available here.

SPARQL 1.1 Draft features: IN and NOT IN

We had a feature request for supporting SPARQL IN and NOT IN. As usual, we’re ahead of the SPARQL Draft specification. But I don’t think IN and NOT IN will look much different in the end. Anyway, it was straightforward so I just implemented both.

It goes like this:

SELECT ?abc { ?abc a nie:InformationElement ;
                   nie:title ?title .
               FILTER (?title IN ('abc', 'def')) }
SELECT ?abc { ?abc a nie:InformationElement ;
                   nie:title ?title .
               FILTER (?title NOT IN ('xyz', 'def')) }

It’s particularly useful to get metadata about a defined set of resources (give me the author of this, this and that file)

Direct access

This work is progressing nicely. Most of the guys on the team are working on this, and it’s going to be awesome thanks to SQLite’s WAL journal mode. SQLite’s WAL mode is still under development and probably unstable here and there, but we’re trusting the SQLite guys with this anyway.

What is left to do for direct-access is cleaning up a bit, getting the small nasty things right. You know. The basics are all in place now.

We’re doing most of the library code in Vala, but clever people can easily imagine the C API valac makes from the .vala files here. That’s the abstract API that client developers will use. Unless you use a higher level API like libqttracker, QSparql, Hormiga or sparql-glib.

All of which still need to be adapted to the direct-access work that we’re doing. But we’re in close contact with all of the developers involved in those libraries. And they’re all thrilled to implement backends for the new stuff.

Plans

We plan to change the signals-on-changes or class-signals feature a bit so that the three signals are merged into one. The problem with three is that you can’t reliably identify a change-transaction this way (a rename of a file, for example).

Another thing on our list is merging Zeitgeist’s ontology. To the other team members at Tracker: guys, Zeitgeist has been waiting for three months now. Let’s just get this done!

Oh there are a lot of plans, to be honest.

I wonder when, if ever, we go in feature freeze. Hehe. I guess we’ll just have very short feature-freeze periods. Whatever, it’s fun.

MeeGo in cars

Hey BMW & co, if you guys want to learn how to write music players and playlists for car entertainment on MeeGo, get in touch! This Tracker that I’m talking about is on that MeeGo OS; being the Music’s metadata database is among its purposes.

I can’t wait to have a better music player playlist in my car.

Or maybe some integration with the in-car GPS and the car owner’s appointments and meetings? With geo-tagged photos on the car owner’s phone? Automatic and instant synchronization with Nokia’s future phones? Sounds all very doable, even easy, to me. I’d want all that stuff. Use-cases!

Let’s talk!

Julian on TED

I try to avoid posting about the same subject twice in a row. But I also really think that Wikileaks is worth violating about any such rule in existence. Maybe I should make a category on my blog just for Wikileaks?

So TED has decided to do an interview with Julian Assange:

I’d like to point out that I congratulate and thank everybody, not just but also Julian, who’s involved. Thank you.

That today’s gonna be a good day

Today is the day the world is witnessing the most significant military leak in the history of mankind, so I have a feeling that today’s gonna be a good day.

To all the people at Wikileaks, and to all whistleblowers in past, present and future: you are heroes. Your ideas will be with us for centuries to come. You’ll be remembered in history books. Let’s make sure you guys will.

Why make things complicated?

There are no open source companies. There are companies and there are open source projects.

Some companies work on open source projects, some parent open source projects, some don’t.

Some of those companies are good at fostering a community that contributes to these open source projects. Others are unwilling and some don’t yet understand the process. And again others have many open source projects being done by teams that do get it and have at the same time other projects being done by teams that don’t get it. Actually that last dual situation is the most common among the large companies. You know, the ones that often sponsor your community’s main conference and the ones that employ your heroes.

If you do a quick reality-check then you’ll conclude there are no black / white companies. Actually, nothing in life nor in ethics is black / white. Nothing at all.

What you do have is a small group of amazingly disturbing purists who do zero coding themselves (that is, near zero) but do think black / white, and consequently write a lot of absurd nonsense in blog post-comments, on slashdot in particular, forums and mailing lists. These people are the reason numéro uno why many companies quit trying to understand open source.

It’s sad that the actual (open source) developers have to waste time explaining to companies, for whom they do consultancy, that these people can be ignored. It’s also sad that these purists have turned so vocal, even violent, that they often can’t really be ignored anymore: people’s employers have been harassed.

“You have to fire somebody because he’s being unethical by disagreeing with my religious belief system that Microsoft is evil!”. Maybe it’s just me who’s behind on ethics in this world? Well, those people can still get lost because I, in ethics, disagree with them.

Now, let’s get back to the projects and away from the open source vs. open core debates. We have a lot of work to do. And a lot of companies to convince to open their projects.

Open source developers succeeded in (for example) getting some software on phones. The people who did aren’t the religious black / white people. Maybe the media around open source should track down the people who did, and write quite a bit more about their work, ideas and passion?

Finally, the best companies are driven by the ideas and passions of their best employees. Those are the people who you should admire. Not their company’s open core PR.

Wrapping up 4.57 billion years

In 4.57 billion years our solar system went from creating simple bacteria to a large group of species, several of which are highly capable of making fairly intelligent decisions, and one of which is capable of the indulgence of believing that it can think. That’s us.

The sun has an estimated 5 billion years to go before it turns into a Red Giant that in its very early stages will wipe out truly every single idea that exists inside at least our own solar system.

Unless radio waves that our planet started emitting since we invented radio are seen and understood (which requires a recipient in the first place), that will be the ultimate end of all of our ideas and culture. Unless we figure out a way to let the ideas cultivate outside of our solar system. Just the ideas would already be an insane achievement.

But imagine going from bacteria to beings, colonized by bacteria, that think that they can think, in far less time than the current age of our sun. Unless, of course, bacteria somehow arrived into our solar system from outside (unlikely, but perhaps as unlikely as us ever exporting our ideas and culture to another solar system).

Imagine what could happen in the next 5 billion years …

Domain indexes finished, technical conclusions

The support for domain-specific indexes is finished and awaiting review, although we can further optimize it now. More on that later in this post. Imagine that you have this ontology:

nie:InformationElement a rdfs:Class .

nie:title a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nie:InformationElement ;
  rdfs:range xsd:string .

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement .

nmm:beatsPerMinute a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nmm:MusicPiece ;
  rdfs:range xsd:integer .

With that ontology there are three tables called “Resource”, “nmm:MusicPiece” and “nie:InformationElement” in SQLite’s schema:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” one has ID and “nmm:beatsPerMinute”

That’s fairly simple, right? The problem is that when you ORDER BY “nie:title”, you’ll cause a full table scan on “nie:InformationElement”. That’s not good, because there are fewer “nmm:MusicPiece” records than “nie:InformationElement” ones.

Imagine that we do this SPARQL query:

SELECT ?title WHERE {
   ?resource a nmm:MusicPiece ;
             nie:title ?title
} ORDER BY ?title

We translate that, for you, to this SQL on our schema:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nie:InformationElement2"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

OK, so with support for domain indexes we change the ontology like this:

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement ;
  tracker:domainIndex nie:title .

Now we’ll still have the three tables called “Resource”, “nmm:MusicPiece” and “nie:InformationElement” in SQLite’s schema. But they will look like this:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” table now has three columns called ID, “nmm:beatsPerMinute” and “nie:title”

The same data, for titles of music pieces, will be in both “nie:InformationElement” and “nmm:MusicPiece”. We copy to the mirror column during ontology change coping, and when new inserts happen.

When now the rdf:type is known in the SPARQL query as a nmm:MusicPiece, like in the query mentioned earlier, we know that we can use the “nie:title” from the “nmm:MusicPiece” table in SQLite. That allows us to generate you this SQL query:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1"
  WHERE  "title_u" IS NOT NULL
) ORDER BY "title_u"
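If you want to play with the idea, the following Python snippet rebuilds a tiny version of the two tables in an in-memory SQLite database and checks that the query on the mirror column gives the same titles as the query with the join. It’s a sketch only: the sample rows are made up, the “Resource” table is left out, and the SQL is simplified compared to what Tracker generates.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE "nie:InformationElement" ("ID" INTEGER PRIMARY KEY,
                                       "nie:title" TEXT);
CREATE TABLE "nmm:MusicPiece" ("ID" INTEGER PRIMARY KEY,
                               "nmm:beatsPerMinute" INTEGER,
                               "nie:title" TEXT);  -- the mirror column
INSERT INTO "nie:InformationElement" VALUES
    (1, 'b song'), (2, 'a document'), (3, 'a song');
INSERT INTO "nmm:MusicPiece" VALUES
    (1, 120, 'b song'), (3, 90, 'a song');
""")

# Without the domain index: join both tables to reach the title.
joined = [row[0] for row in db.execute("""
    SELECT i."nie:title"
    FROM "nmm:MusicPiece" AS m, "nie:InformationElement" AS i
    WHERE m."ID" = i."ID" AND i."nie:title" IS NOT NULL
    ORDER BY i."nie:title"
""")]

# With the domain index: the mirror column makes the join unnecessary.
mirrored = [row[0] for row in db.execute("""
    SELECT "nie:title" FROM "nmm:MusicPiece"
    WHERE "nie:title" IS NOT NULL
    ORDER BY "nie:title"
""")]
```

Both lists contain only the titles of the music pieces, in the same order; the second query just never has to touch “nie:InformationElement”.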

A remaining optimization is when you also request an rdf:type that is a superclass of nmm:MusicPiece, like this:

SELECT ?title WHERE {
  ?resource a nmm:MusicPiece, nie:InformationElement ;
            nie:title ?title
} ORDER BY ?title

It’s not as bad anymore, as the “nie:title” is now taken from the “nmm:MusicPiece” table. But the join with “nie:InformationElement” is still needlessly there (we could just use the earlier SQL query in this case):

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

We will probably optimize this specific use-case further later this week.

Smile or Die

As a follow-up to the RSA animation videos, here’s the original talk by Barbara Ehrenreich titled Smile or Die.

I think part of GNOME’s crisis is caused by the same atmosphere of “go with the program, don’t complain, or you’re out”. I wrote about this before:

It’s not popular to be critical about a (the leader of a) popular idea. This is illustrated by the intellectually absurd criticisms David Schlesinger receives.

Yet the critic who monitors the organs of a society is key to whether such an organ produces for its stakeholders, or fails and drags the entire society it serves down with it.

Acknowledging the problem and changing course is what I seek in a candidate this year.

OK, two is enough. Back to technical articles.

FWD: The Secret Powers of Time



Video link

Friday’s performance improvements in Tracker

The crawler’s modification time queries

Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.

Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObjects wasn’t too large, that wasn’t a huge problem. But it scaled poorly as the table grew. I started by optimizing the function itself. It was using strlen(), so I replaced that with sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me ~ 10%; not enough.
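
The strlen() replacement works because the byte length is already known, so the string doesn’t have to be walked a second time. Here is a minimal sketch in plain C of a parent check that takes the lengths as arguments. The name uri_is_parent_n and its exact semantics (direct child only, separated by a single ‘/’) are assumptions for illustration; this is not Tracker’s actual function. In SQLite the two lengths would come from sqlite3_value_bytes().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Tracker's real implementation.
 * Returns 1 when `uri` names a direct child of directory `parent`.
 * Both byte lengths are supplied by the caller, so no strlen() pass
 * over either string is needed in here. */
int
uri_is_parent_n (const char *parent, size_t parent_len,
                 const char *uri, size_t uri_len)
{
  const char *rest;
  size_t rest_len;

  assert (parent != NULL && uri != NULL);

  /* Ignore trailing slashes on the parent. */
  while (parent_len > 0 && parent[parent_len - 1] == '/')
    parent_len--;

  /* A child needs at least "/x" after the parent. */
  if (parent_len == 0 || uri_len < parent_len + 2)
    return 0;

  /* The parent must be a prefix, followed by a path separator. */
  if (memcmp (uri, parent, parent_len) != 0 || uri[parent_len] != '/')
    return 0;

  rest = uri + parent_len + 1;
  rest_len = uri_len - parent_len - 1;

  /* A further '/' means the URI lives deeper than one level. */
  return memchr (rest, '/', rest_len) == NULL;
}
```

Even with the lengths for free, such a function still has to be evaluated for every row, which is why this alone wasn’t enough.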

So this commit was a better improvement. First it makes nfo:belongsToContainer an indexed property. For file resources, x nfo:belongsToContainer p means that x is in directory p. The commit changes the query to use the property that is now indexed.

The original query, before we started with this optimization, took 1.090s when you had ~ 300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison because now we use an indexed property. Adding the index took a total of only 10s for a table of ~ 300,000 rows, and the table is being queried while we index (while we insert into it). Do the math, it’s a huge win in all situations. For the SQLite freaks: the SQLite database grew by 4 MB, with all items in the table indexed.
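
To spell that math out, here is a small sketch in C. It assumes that the 10s is a one-time cost and that every crawler query saves the difference between the old and the new timing; break_even_queries is a made-up helper name, and all values are in milliseconds.

```c
#include <assert.h>

/* Hypothetical helper: number of queries after which a one-time index
 * build has paid for itself. All arguments are in milliseconds and the
 * result is rounded up to a whole query. */
unsigned int
break_even_queries (unsigned int index_cost_ms,
                    unsigned int old_query_ms,
                    unsigned int new_query_ms)
{
  unsigned int saving_ms;

  /* The new query must actually be faster, or there is no break-even. */
  assert (old_query_ms > new_query_ms);

  saving_ms = old_query_ms - new_query_ms;
  return (index_cost_ms + saving_ms - 1) / saving_ms;
}
```

With the numbers above, break_even_queries (10000, 1090, 90) gives 10: under these assumptions the index has paid for itself after ten modification time queries.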

PDF extractor

Another optimization I did earlier was the PDF extractor. Originally, we used the poppler-glib library. This library doesn’t allow us to set the OutputDev at runtime. If compiled with Cairo, the OutputDev is in some versions a CairoOutputDev. We don’t want all images in the PDF to be rendered to a Cairo surface. So I ported this back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master poppler_page_get_text_page is always using a TextOutputDev).

Another major problem with poppler-glib is the huge amount of string copying on the heap. The time to extract metadata and content text for a 70 page PDF document without any images went from 1.050s to 0.550s. A lot of the original time was caused by copying strings and by GValue boxing due to GObject properties.

Table locked problem

Last week I improved D-Bus marshaling by using a database cursor. I forgot to handle SQLITE_LOCKED while Jürg and Carlos had been introducing multithreaded SELECT support. Not good. I fixed this; it was causing random Table locked errors.

RDF propaganda, time for change

I’m not supposed to but I’m proud. It’s not only me who’s doing it.

Adrien is one of the new guys on the block. He’s working on integration with Tracker’s RDF service and various web services like Flickr, Facebook, Twitter, picasaweb and RSS. This is the kind of guy several companies should be afraid of. His work is competing with what they are trying to do: integrating the social web with mobile.

Oh come on Steve, stop pretending that you aren’t. And you better come up with something good, because we are.

Not only that, Adrien is implementing so-called writeback. It means that when you change a local resource’s properties, this integration will update Flickr, Facebook, picasaweb and Twitter.

You change a piece of info about a photo on your phone, and it’ll be replicated to Flickr. It’ll also be synchronized onto your phone as soon as somebody else makes a change.

This is the future of computing and information technology. Integration with social networking and the phone is what people want. Dear Mark, it’s unstoppable. You better keep your eyes open, because we are going fast. Faster than your business.

I’m not somebody trying to guess how technology will look in a few years. I try to be in the middle of the technical challenge of actually doing it. Talking about it is telling history before your lip’s muscles moved.

At the Tracker project we are building a SPARQL endpoint that uses D-Bus as IPC. This is ideal on Nokia’s Meego. It’ll be a centerpiece for information gathering. On Meego you won’t ask the filesystem; instead you’ll ask Tracker using SPARQL and RDF.

To be challenged is likely the most beautiful state of mind.

I invite everybody to watch this demo by Adrien. It’s just the beginning. It’s going to get better.

Tracker writeback & web service integration demo / MeegoTouch UI from Adrien Bustany on Vimeo.

I tagged this as ‘extremely controversial’. That’s fine, Adrien told me that “people are used to me anyway”.

Performance DBus handling of the query results in Tracker’s RDF service

Before

For returning the results of a SPARQL SELECT query we used to have a callback like this. I removed the error handling; you can find the original here.

We need to marshal a database result_set to a GPtrArray because dbus-glib fancies that. This is a lot of boxing the strings into GValue and GStrv. It does allocations, so not good.

static void
query_callback (TrackerDBResultSet *result_set, GError *error, gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  GPtrArray *values = tracker_dbus_query_result_to_ptr_array (result_set);
  dbus_g_method_return (info->context, values);
  tracker_dbus_results_ptr_array_free (&values);
}

void
tracker_resources_sparql_query (TrackerResources *self, const gchar *query,
                                DBusGMethodInvocation *context, GError **error)
{
  TrackerDBusMethodInfo *info = ...; guint request_id;
  TrackerResourcesPrivate *priv= ...; gchar *sender;
  info->context = context;
  tracker_store_sparql_query (query, TRACKER_STORE_PRIORITY_HIGH,
                              query_callback, ...,
                              info, destroy_method_info);
}

After

Last week I changed the asynchronous callback to return a database cursor. In SQLite that means stepping through the rows with sqlite3_step(). SQLite returns const pointers to the data in the cell with its sqlite3_column_* APIs.

This means that now we’re not even copying the strings out of SQLite. Instead, we’re using them as const to fill in a raw DBusMessage:

static void
query_callback (TrackerDBCursor *cursor, GError *error, gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  DBusMessage *reply; DBusMessageIter iter, rows_iter;
  guint cols; guint length = 0;
  /* Build the reply by hand instead of letting dbus-glib marshal it */
  reply = dbus_g_method_get_reply (info->context);
  dbus_message_iter_init_append (reply, &iter);
  cols = tracker_db_cursor_get_n_columns (cursor);
  /* Outer array: one entry per row, each entry an array of strings */
  dbus_message_iter_open_container (&iter, DBUS_TYPE_ARRAY,
                                    "as", &rows_iter);
  while (tracker_db_cursor_iter_next (cursor, NULL)) {
    DBusMessageIter cols_iter; guint i;
    /* Inner array: the cells of the current row */
    dbus_message_iter_open_container (&rows_iter, DBUS_TYPE_ARRAY,
                                      "s", &cols_iter);
    for (i = 0; i < cols; i++, length++) {
      /* Const pointer to the cell; the string isn't copied out first */
      const gchar *result_str = tracker_db_cursor_get_string (cursor, i);
      dbus_message_iter_append_basic (&cols_iter,
                                      DBUS_TYPE_STRING,
                                      &result_str);
    }
    dbus_message_iter_close_container (&rows_iter, &cols_iter);
  }
  dbus_message_iter_close_container (&iter, &rows_iter);
  dbus_g_method_send_reply (info->context, reply);
}

Results

The test is a query on 13500 resources where we ask for two strings, repeated eleven times. I discarded the first run of each round, because the first time the sqlite3_stmt still has to be created, which would add a few milliseconds to the measurement. I also directed standard output to /dev/null to avoid the overhead created by the terminal. The results you see below are the values for “real”.

There is of course an overhead created by the “tracker-sparql” program. It does demarshaling using normal dbus-glib. If your application uses DBusMessage directly, then it can avoid the same overhead. But since for both rounds I used the same “tracker-sparql” it doesn’t matter for the measurement.

$ time tracker-sparql -q "SELECT ?u  ?m { ?u a rdfs:Resource ;
          tracker:modified ?m }" > /dev/null

Without the optimization:

0.361s, 0.399s, 0.327s, 0.355s, 0.340s, 0.377s, 0.346s, 0.380s, 0.381s, 0.393s, 0.345s

With the optimization:

0.279s, 0.271s, 0.305s, 0.296s, 0.295s, 0.294s, 0.295s, 0.244s, 0.289s, 0.237s, 0.307s

The improvement ranges between 7% and 40%, with an average improvement of 22%.
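
For those who want to check these percentages, here is a small sketch in C. It assumes the runs are compared pairwise in the order listed (first run without against first run with, and so on), which is one way to arrive at the quoted figures. The Improvement struct and summarize_improvement are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the relative improvement across paired runs. */
typedef struct {
  double min;   /* smallest improvement, as a fraction of the old time */
  double max;   /* largest improvement */
  double mean;  /* arithmetic mean of the per-pair improvements */
} Improvement;

/* The "real" timings listed above, in seconds. */
const double without_opt[] = { 0.361, 0.399, 0.327, 0.355, 0.340, 0.377,
                               0.346, 0.380, 0.381, 0.393, 0.345 };
const double with_opt[]    = { 0.279, 0.271, 0.305, 0.296, 0.295, 0.294,
                               0.295, 0.244, 0.289, 0.237, 0.307 };

/* Assumption: before[i] and after[i] belong to the same pair of runs. */
Improvement
summarize_improvement (const double *before, const double *after, size_t n)
{
  Improvement s;
  double sum = 0.0;
  size_t i;

  assert (n > 0);

  s.min = s.max = 1.0 - after[0] / before[0];
  for (i = 0; i < n; i++) {
    double gain = 1.0 - after[i] / before[i];
    if (gain < s.min) s.min = gain;
    if (gain > s.max) s.max = gain;
    sum += gain;
  }
  s.mean = sum / (double) n;
  return s;
}
```

Calling summarize_improvement (without_opt, with_opt, 11) yields a minimum just under 7%, a maximum just under 40% and a mean of a little under 22%.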

Focus on query performance

Every (good) developer knows that copying of memory and boxing, especially when dealing with a large number of items like members of collections and the cells in a table, are a bad thing for your performance.

More experienced developers also know that novice developers tend to focus on just their algorithms to improve performance, while often the single biggest bottleneck is needless boxing and allocating. Experienced developers come up with algorithms that avoid boxing and copying; they master clever, pragmatic engineering and know how to improve algorithms. A lot of newcomers use virtual machines and scripting languages that are terrible at giving you the tools to control this, and then they start endless religious debates about how great their programming language is (as if it matters). (Anti-.NET people, don’t get on your horses too soon: if you know what you are doing, C# is actually quite good here.)

We were of course doing some silly copying ourselves. Apparently it had a significant impact on performance.

Once Jürg and Carlos have finished the work on parallelizing SELECT queries we plan to let the code that walks the SQLite statement fill in the DBusMessage directly without any memory copying or boxing (for marshalling to DBus). We found the get_reply and send_reply functions; they sound useful for this purpose.

I still don’t really like DBus as IPC for data transfer of Tracker’s RDF store’s query results. Personally I think I would go for a custom Unix socket here. But Jürg so far isn’t convinced. Admittedly he’s probably right; he’s always right. Still, DBus to me doesn’t feel like a good IPC for this data transfer.

We know about the requests to have direct access to the SQLite database from your own process. I explained in the bug that SQLite3 isn’t MVCC and that this means that your process will often get blocked for a long time on our transaction. A longer time than any IPC overhead takes.