Tracker’s new class signal system being developed

Tracker 0.8’s situation

In Tracker 0.8 we have a signal system that causes quite a bit of overhead. The overhead comes from:

  A. Having to store the URIs of the resources involved in a changeset in tracker-store’s memory;
  B. Having to store the predicates involved in a changeset in tracker-store’s memory (less severe than A because we can store a pointer to an instance instead of a string);
  C. Having to UTF-8 validate the strings when we emit them over D-Bus (D-Bus does this implicitly);
  D. D-Bus’s own copying and handling of string data;
  E. Heavy traffic on D-Bus;
  F. Context switching between tracker-store and dbus-daemon;
  G. Having to wait with turning on the D-Bus objects until after we have the latest ontology, so until after journal replay. We also need to reset the situation after a backup restore. Complex!

Not all aggregators show this list as A, B, C, D, E, F and G. Sorry for that. I’ll nevertheless refer to the items as such later in this article.

Consumer’s problems with Tracker 0.8’s signal

  1. The aforementioned overhead: it generates a lot of D-Bus traffic, caused by sending over URI strings for the subjects and the predicates;
  2. It isn’t possible, when <a> is deleted, to know <b> in <a> nie:isLogicalPartOf <b>, as <a> is already removed at the point of signal emission;
  3. Round trips to fetch the literals create more D-Bus traffic;
  4. Transactional changes can’t be reliably identified with SubjectsAdded, SubjectsChanged and SubjectsRemoved being separate signals;
  5. There are a lot of D-Bus objects, instead of letting clients use D-Bus’s filtering system.

The solution that we’re developing for Tracker 0.9

Direct access

With direct-access we remove most of the round-trip cost of a query coming from a consumer that wants a literal object involved in a changeset: by utilizing the TrackerSparqlCursor API with direct-access enabled, you end up doing sqlite3_step() in your own process, directly on meta.db.

For the consumers of the signal, this removes 3.

Sending integer IDs instead of string URIs

A while ago we introduced the SPARQL function tracker:id(resource uri). The tracker:id(resource uri) function gives you a unique number that Tracker’s RDF store uses internally.

Each resource, each class and each predicate (the latter two are resources like any other) has such a unique internal ID.

Given that Tracker’s class signal system is Tracker-specific anyway, we decided not to give you subject URI strings. Instead, we’ll give you the integer IDs.

The Writeback signal also got changed to do this, for the same reasons. But this API is entirely internal and shouldn’t be used outside of the project.

For us this removes A, B, C, D and E. For the consumers of the signal, this removes 1.

Merge added, changed and removed into one signal

We give you two arrays in one signal: inserts and deletes.

For consumers of the signal, this removes 4.

Add the class name to the signal

This allows you to use a string filter on your signal subscription in D-Bus.

For us this removes G. For consumers of the signal, this removes 5.
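To illustrate the filtering idea, here is a minimal Python sketch that builds a D-Bus match rule using the standard arg0 key, so that only signals whose first string argument equals a given class name are delivered. The interface and member names are placeholders invented for this example, since the final names of the new signal aren't given here; whether the class name is sent as a full URI is also an assumption.

```python
def class_signal_match_rule(class_name,
                            interface="org.example.Tracker.ClassSignal",  # hypothetical name
                            member="ClassSignal"):                        # hypothetical name
    """Build a D-Bus match rule that filters on the signal's first argument."""

    def quote(value):
        # Match rule values are single-quoted; a literal apostrophe is
        # written by closing the quote, adding \' and reopening the quote.
        return "'" + value.replace("'", "'\\''") + "'"

    parts = [
        ("type", "signal"),
        ("interface", interface),
        ("member", member),
        ("arg0", class_name),  # arg0 is the class name carried by the signal
    ]
    return ",".join("%s=%s" % (key, quote(value)) for key, value in parts)


if __name__ == "__main__":
    # Assumed example value: the class name as a full ontology URI.
    print(class_signal_match_rule(
        "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document"))
```

The resulting string is what a client would hand to its D-Bus library when adding a match, so that dbus-daemon does the filtering instead of the client.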

Pass the object-id for resource objects

You’ll get a third number in the inserts and deletes arrays: object-id. We don’t send object literals, although for integral objects we’re still discussing this. For resource objects, however, we can give you the object-id without much extra cost.

For consumers of the signal, this removes 2.

SPARQL IN, tracker:id(resource uri) and tracker:uri(int id)

We recently added support for SPARQL IN. We already had tracker:id(resource uri), and I implemented tracker:uri(int id).

This makes things like this possible:

SELECT ?t { ?r nie:title ?t .
            FILTER (tracker:id(?r) IN (800, 801, 802, 807)) }

Here 800, 801, 802 and 807 are the IDs that you receive in the class signal. With tracker:uri(int id) it goes like:

SELECT tracker:uri (800) tracker:uri (801)
       tracker:uri (802) tracker:uri (807) { }

For consumers this removes most of the burden introduced by the IDs.
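A consumer will typically build these two queries from whatever IDs arrived in the signal. The following Python sketch only does the string building and shows one way to do it; the function names are made up for this example, and actually running the query against the store is left out.

```python
def _checked_ids(ids):
    """Return the IDs as a list, accepting only plain integers."""
    ids = list(ids)
    if not ids:
        raise ValueError("need at least one ID")
    for value in ids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError("IDs must be integers, got %r" % (value,))
    return ids


def titles_query(ids):
    """Query the nie:title of every resource whose internal ID is in ids."""
    id_list = ", ".join(str(i) for i in _checked_ids(ids))
    return ("SELECT ?t { ?r nie:title ?t . "
            "FILTER (tracker:id(?r) IN (%s)) }" % id_list)


def uris_query(ids):
    """Query the URI strings that belong to the given internal IDs."""
    columns = " ".join("tracker:uri (%d)" % i for i in _checked_ids(ids))
    return "SELECT %s { }" % columns


if __name__ == "__main__":
    received = [800, 801, 802, 807]  # IDs as received in the class signal
    print(titles_query(received))
    print(uris_query(received))
```

Restricting the input to integers means nothing else can end up inside the query text.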

Context switching of processes

What is left is context switching between tracker-store and dbus-daemon, F. This is mostly important for mobile targets (ARM hardware). We reduce the number of context switches by grouping transactions together and then bursting larger sets. It’s both timeout and data-size based: we emit after either a certain amount of time or a certain memory limit. We’re still testing what the best timeouts and sizes are on target hardware.
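The grouping idea can be sketched in a few lines of Python. This is not tracker-store's code, only an illustration of "flush on timeout or on size, whichever comes first". The class name, the default limits and the use of an event count in place of a real memory limit are all assumptions made for this example.

```python
import time


class SignalBatcher:
    """Collect events and emit them in bursts (illustrative sketch only)."""

    def __init__(self, emit, max_events=1000, max_delay=1.0,
                 clock=time.monotonic):
        self._emit = emit              # callback that receives a list of events
        self._max_events = max_events  # stand-in for a memory limit
        self._max_delay = max_delay    # seconds the oldest event may wait
        self._clock = clock
        self._pending = []
        self._first_at = None          # arrival time of the oldest pending event

    def add(self, event):
        now = self._clock()
        if not self._pending:
            self._first_at = now
        self._pending.append(event)
        self._maybe_flush(now)

    def tick(self):
        """Call periodically (e.g. from a timer) so quiet periods still flush."""
        self._maybe_flush(self._clock())

    def _maybe_flush(self, now):
        if not self._pending:
            return
        too_many = len(self._pending) >= self._max_events
        too_old = now - self._first_at >= self._max_delay
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._pending:
            batch, self._pending = self._pending, []
            self._first_at = None
            self._emit(batch)


if __name__ == "__main__":
    batcher = SignalBatcher(print, max_events=2)
    batcher.add((800, 12, 0))
    batcher.add((801, 12, 0))  # second event reaches the size limit and flushes
```

Each flush corresponds to one emitted signal, so larger batches mean fewer wake-ups of dbus-daemon at the cost of some latency.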

Where is the stuff?

The work hasn’t been reviewed or thoroughly tested yet. This will happen in the next few days and weeks.

Anyway, here’s the branch, documentation, example in Plain C, example in Vala

Support for SPARQL IN and NOT IN, the new class signals

I wrote some documentation about the SPARQL IN feature that we recently added. I included some interesting use-cases, like doing an insert and a delete based on IN values.

For the new class signal API that we’re developing this week and next, we’ll probably emit the IDs that tracker:id() would give you if you used it on a resource. This means that IN is very useful for getting the metadata of the resources that are in the list of IDs that you just received from the class signal.

We never documented tracker:id() very much, as it’s not an RDF standard; rather, it’s something Tracker-specific. But the class signals aren’t an RDF standard either; they are Tracker-specific too. I guess that makes the two usable in combination here, and makes the ‘internal API’ status irrelevant.

We’re prototyping the new class signals API right now. Its D-Bus signature will probably be “sa(iii)a(iii)”:

That’s the class-name and two arrays of (subject-id, predicate-id, object-id). The class-name is there to allow D-Bus filtering. The first array holds the deletes and the second holds the inserts. We’ll only give you object-ids of non-literal objects (literal objects have no internal object-id). This means that we don’t throw literals at you in the signal: we’ll send 0 as the object-id instead, and you need to make a query to get the literal.
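On the receiving side, a handler for such a signal might unpack the payload along these lines. This Python sketch assumes the arguments arrive as a string and two lists of 3-tuples, in the order described above (deletes first); the names Change, decode_class_signal and subjects_needing_query are invented for the example.

```python
from collections import namedtuple

# One entry of either array: three integer IDs.
Change = namedtuple("Change", "subject_id predicate_id object_id")


def decode_class_signal(class_name, deletes, inserts):
    """Turn the raw signal arguments into named records."""
    return {
        "class_name": class_name,
        "deletes": [Change(*triple) for triple in deletes],
        "inserts": [Change(*triple) for triple in inserts],
    }


def subjects_needing_query(changes):
    """Subject IDs with at least one literal object (object_id == 0).

    For these the value itself isn't in the signal, so the consumer
    has to query the store if it wants it.
    """
    return sorted({c.subject_id for c in changes if c.object_id == 0})


if __name__ == "__main__":
    signal = decode_class_signal(
        "nfo:Document",                  # assumed form of the class name
        [(800, 12, 0)],                  # deletes
        [(800, 12, 0), (801, 15, 900)],  # inserts
    )
    print(subjects_needing_query(signal["inserts"]))
```

The IDs returned by subjects_needing_query are the ones a consumer would then put in a FILTER with IN.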

We give you the object-ids because of a use-case that we didn’t cover yet:

Given the triple <a> nie:isLogicalPartOf <b>: when <a> is deleted, how do you know <b> at the time of the signal? The feature request was to be able to get the result of select ?b { <a> nie:isLogicalPartOf ?b } when <a> is deleted, at which point the client can no longer run that query.

With the new signal we’ll give you the ID of <b> when <a> is deleted. We’ll also implement a tracker:uri(integer id) function, allowing you to get <b> out of that ID. It’ll do the equivalent of this, but much faster: select ?subject { ?subject a rdfs:Resource . FILTER (tracker:id(?subject) IN (%d)) }
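For this use-case, a consumer could pick the parent IDs straight out of the deletes array. The Python sketch below assumes the consumer already knows the integer ID of nie:isLogicalPartOf (predicates are resources, so tracker:id() should give it); the function name and the numbers are made up for the example.

```python
def deleted_parents(deletes, part_of_predicate_id):
    """Map each deleted subject ID to the ID of the resource it was part of.

    deletes is a list of (subject_id, predicate_id, object_id) tuples;
    entries with object_id 0 are literals and are skipped.
    """
    return {
        subject_id: object_id
        for subject_id, predicate_id, object_id in deletes
        if predicate_id == part_of_predicate_id and object_id != 0
    }


if __name__ == "__main__":
    PART_OF = 7  # hypothetical ID of nie:isLogicalPartOf
    deletes = [(800, PART_OF, 900), (800, 12, 0), (801, PART_OF, 901)]
    for child, parent in sorted(deleted_parents(deletes, PART_OF).items()):
        # The parent ID can then be resolved with tracker:uri().
        print(child, "was part of", parent)
```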

I know there will be people screaming for all objects, also literals, in the signals, but we don’t want to flood your D-Bus daemon with all that data. Scream all you want. Really, we don’t. Just do a roundtrip query.