Tracker 0.8’s situation
In Tracker 0.8 we have a signal system that causes quite a bit of overhead. The overhead comes from:
A. Having to store the URIs of the resources involved in a changeset in tracker-store’s memory;
B. Having to store the predicates involved in a changeset in tracker-store’s memory (less severe than A, because we can store a pointer to an instance instead of a string);
C. Having to UTF-8 validate the strings when we emit them over D-Bus (D-Bus does this implicitly);
D. D-Bus’s own copying and handling of string data;
E. Heavy traffic on D-Bus;
F. Context switching between tracker-store and dbus-daemon;
G. Having to delay turning on the D-Bus objects until we have the latest ontology (that is, until after journal replay), and having to reset the situation after a backup restore. Complex!
I’ll refer to these items by their letters, A to G, later in this article.
Consumers’ problems with Tracker 0.8’s signals
1. The aforementioned overhead: a lot of D-Bus traffic, caused by sending over URL strings for the subjects and the predicates;
2. When <a> gets deleted, there is no way to find <b> in <a> nfo:isLogicalPartOf <b>, because <a> is already removed by the time the signal is emitted;
3. Round trips to fetch the literal objects create even more D-Bus traffic;
4. Transactional changes can’t be reliably identified, because SubjectsAdded, SubjectsChanged and SubjectsRemoved are separate signals;
5. A lot of D-Bus objects, instead of letting clients use D-Bus’s filtering system.
The solution that we’re developing for Tracker 0.9
Direct access
With direct access we remove most of the round-trip cost of the query that a consumer makes when it wants a literal object involved in a changeset: by using the TrackerSparqlCursor API with direct access enabled, you end up doing sqlite3_step() in your own process, directly on meta.db.
For the consumers of the signal, this removes 3.
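Here is a minimal sketch of such a direct-access query from C, assuming the libtracker-sparql API as it currently looks in the 0.9 branch (the query itself is just an illustration; error handling omitted):

#include <tracker-sparql.h>

int
main (void)
{
	GError *error = NULL;
	TrackerSparqlConnection *connection;
	TrackerSparqlCursor *cursor;

	g_type_init ();

	/* With direct access this opens meta.db in-process: each
	 * tracker_sparql_cursor_next() below is a sqlite3_step(),
	 * not a D-Bus round trip */
	connection = tracker_sparql_connection_get (NULL, &error);

	cursor = tracker_sparql_connection_query (connection,
	                                          "SELECT ?title { ?r nie:title ?title }",
	                                          NULL, &error);

	while (tracker_sparql_cursor_next (cursor, NULL, &error))
		g_print ("%s\n", tracker_sparql_cursor_get_string (cursor, 0, NULL));

	g_object_unref (cursor);
	g_object_unref (connection);

	return 0;
}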
Sending integer IDs instead of string URIs
A while ago we introduced the SPARQL function tracker:id(resource uri), which gives you the unique number that Tracker’s RDF store uses internally for a resource.
Each resource, each class and each predicate (the latter two are resources like any other) has such a unique internal ID.
Given that Tracker’s class signal system is specific to Tracker anyway, we decided not to give you subject URL strings. Instead, we’ll give you the integer IDs.
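For example, this illustrative query (nmm:MusicPiece is just an example class) shows each resource’s URI next to its internal ID, the same number you’ll later receive in the class signal:
SELECT ?r tracker:id(?r) { ?r a nmm:MusicPiece }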
The Writeback signal also got changed to do this, for the same reasons. But this API is entirely internal and shouldn’t be used outside of the project.
For us this removes A, B, C, D and E. For the consumers of the signal, this removes 1.
Merge added, changed and removed into one signal
We give you two arrays in one signal: inserts and deletes.
For consumers of the signal, this removes 4.
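To picture it: at this point the signal carries something like ClassSignal(a(ii) deletes, a(ii) inserts), each entry being a (subject-id, predicate-id) pair. The name and exact D-Bus signature here are illustrative only; the next two sections add a class name argument and a third number per entry.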
Add the class name to the signal
This allows you to use a string filter on your D-Bus signal subscription.
For us this removes G. For consumers of the signal, this removes 5.
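With GDBus, a consumer could then subscribe roughly like this; the sender, interface, signal and class names are my assumptions for illustration, not the final API:

#include <gio/gio.h>

static void
on_class_signal (GDBusConnection *connection,
                 const gchar     *sender_name,
                 const gchar     *object_path,
                 const gchar     *interface_name,
                 const gchar     *signal_name,
                 GVariant        *parameters,
                 gpointer         user_data)
{
	/* unpacking of @parameters is shown in the next section */
}

static void
subscribe (GDBusConnection *connection)
{
	/* The arg0 match means dbus-daemon only wakes us up for
	 * signals whose first string argument equals the class we
	 * care about; a running GMainLoop is needed to receive them */
	g_dbus_connection_signal_subscribe (connection,
	                                    "org.freedesktop.Tracker1",           /* sender (assumption) */
	                                    "org.freedesktop.Tracker1.Resources", /* interface (assumption) */
	                                    "ClassSignal",                        /* signal (assumption) */
	                                    NULL,                                 /* any object path */
	                                    "nmm:MusicPiece",                     /* arg0: the class name */
	                                    G_DBUS_SIGNAL_FLAGS_NONE,
	                                    on_class_signal,
	                                    NULL, NULL);
}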
Pass the object-id for resource objects
You’ll get a third number in the inserts and deletes arrays: the object-id. We don’t send object literals (although for integral objects we’re still discussing this), but for resource objects we can give you the object-id without much extra cost.
For consumers of the signal, this removes 2.
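Filling in the handler from the previous sketch, the payload could be unpacked like this; the (sa(iii)a(iii)) signature is my assumption of class name plus two arrays of (subject-id, predicate-id, object-id) triples, not the final wire format:

static void
on_class_signal (GDBusConnection *connection,
                 const gchar     *sender_name,
                 const gchar     *object_path,
                 const gchar     *interface_name,
                 const gchar     *signal_name,
                 GVariant        *parameters,
                 gpointer         user_data)
{
	const gchar *class_name;
	GVariantIter *deletes, *inserts;
	gint subject_id, predicate_id, object_id;

	/* Assumed signature: the class name, then two arrays of
	 * (subject-id, predicate-id, object-id) triples */
	g_variant_get (parameters, "(&sa(iii)a(iii))",
	               &class_name, &deletes, &inserts);

	while (g_variant_iter_next (deletes, "(iii)",
	                            &subject_id, &predicate_id, &object_id)) {
		/* object_id is the resource object’s ID; what literal
		 * objects carry here is still being discussed */
	}

	while (g_variant_iter_next (inserts, "(iii)",
	                            &subject_id, &predicate_id, &object_id)) {
		/* e.g. collect subject_id for a tracker:id() IN query */
	}

	g_variant_iter_free (deletes);
	g_variant_iter_free (inserts);
}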
SPARQL IN, tracker:id(resource uri) and tracker:uri(int id)
We recently added support for SPARQL’s IN operator; we already had tracker:id(resource uri), and I implemented the reverse, tracker:uri(int id).
This makes queries like this possible:
SELECT ?t { ?r nie:title ?t . FILTER (tracker:id(?r) IN (800, 801, 802, 807)) }
Here 800, 801, 802 and 807 are the IDs that you receive in the class signal. And with tracker:uri(int id) it goes like this:
SELECT tracker:uri(800) tracker:uri(801) tracker:uri(802) tracker:uri(807) { }
For consumers this removes most of the burden introduced by the IDs.
Context switching of processes
What is left is F, the context switching between tracker-store and dbus-daemon. This matters mostly for mobile targets (ARM hardware). We reduce the switches by grouping transactions together and then bursting larger sets over the bus. The grouping is both timeout and data-size based: we emit after either a certain amount of time has passed or a certain memory limit is reached. We’re still testing which timeouts and sizes are ideal on target hardware.
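In a simplified sketch the idea looks like this; the names and both limits are made up for illustration, this is not the actual tracker-store code:

#include <glib.h>

typedef struct {
	gint subject_id;
	gint predicate_id;
	gint object_id;
} Event;

/* Made-up numbers; finding the ideal values on target
 * hardware is exactly what we’re still testing */
#define FLUSH_TIMEOUT_MS  1000
#define FLUSH_MAX_EVENTS  500

static GArray *pending_events; /* created elsewhere with g_array_new (FALSE, FALSE, sizeof (Event)) */
static guint   flush_source = 0;

/* Hypothetical: bursts all queued events as one emission per class */
static void
emit_class_signals (GArray *events)
{
	g_print ("emitting %u events\n", events->len);
}

static gboolean
flush_events (gpointer user_data)
{
	emit_class_signals (pending_events);
	g_array_set_size (pending_events, 0);
	flush_source = 0;
	return FALSE; /* one-shot timeout */
}

static void
queue_event (gint subject_id, gint predicate_id, gint object_id)
{
	Event event = { subject_id, predicate_id, object_id };

	g_array_append_val (pending_events, event);

	if (pending_events->len >= FLUSH_MAX_EVENTS) {
		/* data-size limit hit: burst right away */
		if (flush_source != 0)
			g_source_remove (flush_source);
		flush_events (NULL);
	} else if (flush_source == 0) {
		/* otherwise wait a bit, so more events can join the burst */
		flush_source = g_timeout_add (FLUSH_TIMEOUT_MS, flush_events, NULL);
	}
}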
Where is the stuff?
The work isn’t reviewed nor thoroughly tested yet. This will happen over the next few days and weeks.
Anyway, here’s the branch, the documentation, the example in Plain C and the example in Vala.