Making use of relationships between metadata

I recently claimed somewhere that a system which collects relationships (where, when, with whom, why) about content, rather than mere metadata (title, date, author, etc.), could offer a solution to a problem that users of digital media will increasingly face: they will have collected too much material to ever find anything in it quickly enough again.

I think relationships should be given more weight than the mere metadata, because it is by means of relationships that we humans remember information in our brains. Not by means of facts (title, date, author, etc.) but by means of relationships (where, when, with whom, why).

As a hypothetical example I said that I wanted to find a video that I had watched with Erika while I was on holiday with her, and which she had marked as really great.

What are the relationships we need to collect? This is a simple little exercise in analysis: just underline the nouns and write the problem out again:

  • That I was on holiday when I last saw it. That is a point of interest (where)
  • That it is a movie (what; this is a fact about the thing I'm trying to find and thus not a relationship, but we'll take it along)
  • With whom I watched the movie, and when (with whom, when)
  • That Erika, with whom I watched the movie, thought it was really great (why)

So let me pour this use-case into RDF and solve it with SPARQL. This is what we need to collect. I'll write it in pseudo TTL. Just imagine for a moment that this ontology actually exists:

<erika> a Person ; name "Erika" .
<vakantiePlek> a PointOfInterest ; title "De vakantieplek" .
<filmA> a Movie ; lastSeenAt <vakantiePlek> ; sharedWith <erika>; title "The movie" .
<erika> likes <filmA> .

This is then the SPARQL query:

SELECT ?m {
  ?v a Movie ; title ?m .
  ?v lastSeenAt ?p .
  ?p title ?pt .
  ?v sharedWith <erika> .
  <erika> likes ?v .
  FILTER (REGEX (?pt, 'vakantieplek'))
}

I leave it as an exercise for the reader to convert this to the Nepomuk ontology (I think it can handle this entire use-case). You can then test it on your N9 or on a standard GNOME desktop with the tracker-sparql tool. I bet it works. :-)

The big problem is indeed the data acquisition of the relationships. Writing the query is fairly easy. Settling on the ontology and agreeing on it with all parties is already a bit harder. Collecting the information is the real difficulty.

Oh, and once collected: keeping the information safe without my privacy being violated. That seems simply impossible these days. Unfortunately.

In any case there is no need for a supercomputer or the like to solve this centrally (with AI and all of today's horribly complex hype stuff).

Every small device can solve this kind of use-case on its own. The inserts and query above are easy to handle. SQLite does this in a few milliseconds with a denormalized schema. Your fancy hipster NoSQL solution probably does too.

That's because the weight of the data acquisition lies on the relationships instead of on the facts.

nrl:maxCardinality one-to-many ontology changes

I added support for changing the nrl:maxCardinality property of an rdfs:Property from one to many. Earlier, Martyn Russell reverted such an ontology change because it was a blocker for the Debian packaging by Michael Biebl.

We only support going from one to many. That’s because going from many to one would obviously imply data-loss (a string-list could work with CSV, but an int-list can’t be stored as CSV in a single-value int type – instead of trying to support nonsense I decided to just not do it at all).
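
To give a rough idea (ex:tag is a made-up property purely for illustration; the full list of supported changes is behind the link below), such a change boils down to dropping the cardinality restriction from the property's definition in the ontology:

# before: at most one value allowed
ex:tag a rdf:Property ;
    rdfs:domain nie:InformationElement ;
    rdfs:range xsd:string ;
    nrl:maxCardinality 1 .

# after: multiple values allowed (the nrl:maxCardinality restriction is gone)
ex:tag a rdf:Property ;
    rdfs:domain nie:InformationElement ;
    rdfs:range xsd:string .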

More supported ontology changes can be found here.

Not sure if people care but this stuff was made while listening to Infected Mushroom.

Tracker supports volume management under a minimal environment

While Nemo Mobile OS ships with neither udisks2 nor the GLib/GIO GVfs2 modules that interact with it, we still wanted removable volume management to work with the file indexer.

It means that types like GVolume and GVolumeMonitor in GLib’s GIO will fall back to GUnixVolume and GUnixVolumeMonitor using GUnixMount and GUnixMounts instead of using the more competent GVfs2 modules.

The GUnixMounts fallback uses the _PATH_MNTTAB, which generally points to /proc/mounts, to know what the mount points are.

Removable volumes usually aren't configured in the /etc/fstab file (which would or could affect /proc/mounts), and even if you did it that way, the UUID label can't be known upfront (you don't know which sdcard the user will insert). Tracker's FS miner needs this label to uniquely identify a removable volume, to know whether a previously seen volume is returning.

If you look at gunixvolume.c’s g_unix_volume_get_identifier you’ll notice that it always returns NULL in case the UUID label isn’t set in the mtab file: the pure-Unix fall back implementations aren’t fit for non-typical desktop usage; it’s what udisks2 and GVfs2 normally provide for you. But we don’t have it on the Nemo Mobile OS.

The mount_add in miners/fs/tracker-storage.c luckily has an alternative that uses the mountpoint's name (line ~592). We'll use this facility to compensate for the missing UUID.

Basically, we add the UUID of the device to the mountpoint’s directory name and Tracker’s existing volume management will generate a unique UUID using MD5 for each unique mountpoint directory. What follows is specific for Nemo Mobile and its systemd setup.

We added some udev rules to /etc/udev/rules.d/90-mount-sd.rules:

SUBSYSTEM=="block", KERNEL=="mmcblk1*", ACTION=="add", MODE="0660", TAG+="systemd", 
  ENV{SYSTEMD_WANTS}="mount-sd@%k.service", ENV{SYSTEMD_USER_WANTS}="tracker-miner-fs.service
  tracker-store.service"

We added /etc/systemd/system/mount-sd@.service:

[Unit]
Description=Handle sdcard
After=init-done.service dev-%i.device
BindsTo=dev-%i.device

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/mount-sd.sh add %i
ExecStop=/usr/sbin/mount-sd.sh remove %i

And we created mount-sd.sh:

if [ "$ACTION" = "add" ]; then
    eval "$(/sbin/blkid -c /dev/null -o export /dev/$2)"
    test -d $MNT/${UUID} || mkdir -p $MNT/${UUID}
    chown $DEF_UID:$DEF_GID $MNT $MNT/${UUID}
    touch $MNT/${UUID}
    mount ${DEVNAME} $MNT/${UUID} -o $MOUNT_OPTS || /bin/rmdir $MNT/${UUID}
    test -d $MNT/${UUID} && touch $MNT/${UUID}
else
    DIR=$(mount | grep -w ${DEVNAME} | cut -d \  -f 3)
    if [ -n "${DIR}" ] ; then
        umount $DIR || umount -l $DIR
    fi
fi

Now we just have to configure Tracker right:

gsettings set org.freedesktop.Tracker.Miner.Files index-removable-devices true

Let’s try that:

# Insert sdcard
[nemo@Jolla ~]$ mount | grep sdcard
/dev/mmcblk1 on /media/sdcard/F6D0-FC42 type vfat (rw,nosuid,nodev,noexec,...
[nemo@Jolla ~]$ 

[nemo@Jolla ~]$ touch  /media/sdcard/F6D0-FC42/test.txt
[nemo@Jolla ~]$ tracker-sparql -q "select tracker:available(?s) nfo:fileName(?s) \
     { ?s nie:url 'file:///media/sdcard/F6D0-FC42/test.txt' }"
Results:
  true, test.txt

# Take out the sdcard

[nemo@Jolla ~]$ mount | grep sdcard
[nemo@Jolla ~]$ tracker-sparql -q "select tracker:available(?s) nfo:fileName(?s) \
     { ?s nie:url 'file:///media/sdcard/F6D0-FC42/test.txt' }"
Results:
  (null), test.txt
[nemo@Jolla ~]$

FOSDEM presentation about Metadata Tracker

I will be doing a presentation about Tracker at FOSDEM this year.

Metadata Tracker is now being used not only on GNOME, the N900 and the N9, but also on the Jolla Phone. On top of that, Pelagicore, a software developer for several car brands, claims to be using it with custom made ontologies; SerNet told us they are integrating Tracker as a search engine backend for Apple OS X SMB clients, and last year Tracker integration with Netatalk was done by NetAFP. Other hardware companies have approached the team about integrating the software with their products. In this presentation I'd like to highlight the difficulties those companies encountered and how the project deals with them, the dependencies needed to get a minimal system up and running cleanly, and recent things the upstream team is working on, and I'd like to propose some future ideas.

Link on fosdem.org

Mr. Dillon; smartphone innovation in Europe ought to be about people’s privacy

Dear Mark,

Your team and you yourself are working on the Jolla Phone. I’m sure that you guys are doing a great job and although I think you’ve been generating hype and vaporware until we can actually buy the damn thing, I entrust you with leading them.

As their leader I would like you to allow them to provide us with all of the device's source code and the build environments of their projects, so that we can reproduce the exact same binaries. By exactly the same I mean that it should be possible to verify them with MD5 checksums. I'm sure you know what that means, and you and I know that your team knows how to provide geeks like me with this. I worked together with some of them during Nokia's Harmattan and Fremantle days, and we both know that you can easily identify who can make this happen.

The reason why is simple: I want Europe to develop a secure phone similar to how, among other open source projects, the Linux kernel can be trusted. By peer review of the source code.

Kind regards,

A former Harmattan developer who worked on a component of the Nokia N9 that stores the vast majority of the user's private data.

ps. I also think that you should reunite Europe’s finest software developers and secure the funds to make this workable. But that’s another discussion which I’m eager to help you with.

A use-case for SPARQL and Nepomuk

As I got contacted by two different companies in the last few days, both with questions about integrating Tracker into their device, I started thinking that perhaps I should illustrate what Tracker can already do today.

I’m going to make a demo for the public transportation industry in combination with contacts and places of interest. Tracker’s ontologies cross many domains, of course (this is just an example).

I agree that in principle what I’m showing here isn’t rocket science. You can do this with almost any database technology. What is interesting is that as soon as many domains start sharing the ontology and store their data in a shared way, interesting queries and use-cases are made possible.

So let’s first insert a place of interest: the Pizza Hut in Nossegem

tracker-sparql -uq "
INSERT { _:1 a nco:PostalAddress ; nco:country 'Belgium';
               nco:streetAddress 'Weiveldlaan 259 Zaventem' ;
               nco:postalcode '1930' .
        _:2 a slo:Landmark; nie:title 'Pizza Hut Nossegem';
              slo:location [ a slo:GeoLocation;
                  slo:latitude '50.869949'; slo:longitude '4.490477';
                  slo:postalAddress _:1 ];
              slo:belongsToCategory slo:predefined-landmark-category-food-beverage  }"

And let's add some bus stops:

tracker-sparql -uq "
INSERT { _:1 a nco:PostalAddress ; nco:country 'Belgium';
               nco:streetAddress 'Leuvensesteenweg 544 Zaventem' ;
               nco:postalcode '1930' .
         _:2 a slo:Landmark; nie:title 'Busstop Sint-Martinusweg';
               slo:location [ a slo:GeoLocation;
                   slo:latitude '50.87523'; slo:longitude '4.49426';
                   slo:postalAddress _:1 ];
               slo:belongsToCategory slo:predefined-landmark-category-transport  }"
tracker-sparql -uq "
INSERT  { _:1 a nco:PostalAddress ; nco:country 'Belgium';
                nco:streetAddress 'Leuvensesteenweg 550 Zaventem' ;
                nco:postalcode '1930' .
          _:2 a slo:Landmark; nie:title 'Busstop Hoge-Wei';
                slo:location [ a slo:GeoLocation;
                    slo:latitude '50.875988'; slo:longitude '4.498208';
                    slo:postalAddress _:1 ];
                slo:belongsToCategory slo:predefined-landmark-category-transport  }"
tracker-sparql -uq "
INSERT  { _:1 a nco:PostalAddress ; nco:country 'Belgium';
                nco:streetAddress 'Guldensporenlei Turnhout' ;
                nco:postalcode '2300' .
          _:2 a slo:Landmark; nie:title 'Busstop Guldensporenlei';
                slo:location [ a slo:GeoLocation;
                    slo:latitude '51.325463'; slo:longitude '4.938047';
                    slo:postalAddress _:1 ];
                slo:belongsToCategory slo:predefined-landmark-category-transport  }"

Let's now get all the bus stops near the Pizza Hut in Nossegem:

tracker-sparql -q "
SELECT ?name ?lati ?long WHERE {
   ?p slo:belongsToCategory slo:predefined-landmark-category-food-beverage;
       slo:location [ slo:latitude ?plati; slo:longitude ?plong ] .
   ?b slo:belongsToCategory slo:predefined-landmark-category-transport ;
       slo:location [ slo:latitude ?lati; slo:longitude ?long ] ;
      nie:title ?name .
   FILTER (tracker:cartesian-distance (?lati, ?plati, ?long, ?plong) < 1000)
}"
Results:
  Busstop Sint-Martinusweg, 50.87523, 4.49426
  Busstop Hoge-Wei, 50.875988, 4.498208

This of course was an example with only slo:Landmark. But that slo:location property can be placed on any nie:InformationElement. Meaning that for example a nco:PersonContact can also be involved in such a cartesian-distance query (which is of course just an example).

Let’s make an example use-case: We want contact details of friends (with publicized coordinates) who are nearby a slo:Landmark that is in a food and beverage landmark category, so that the messenger application can prepare a text message window where you’ll type that you want to get together to get lunch at the Pizza Hut.

Ok, so let’s add some nco:PersonContact to our SPARQL endpoint who are nearby the Pizza Hut:

tracker-sparql -uq "
INSERT { _:1 a nco:PersonContact ; nco:fullname 'John Carmack';
               slo:location [ a slo:GeoLocation;
                   slo:latitude '50.874715'; slo:longitude '4.49154' ];
               nco:hasEmailAddress [ a nco:EmailAddress;
                 nco:emailAddress 'john.carmack@somewhere.com'] }"
tracker-sparql -uq "
INSERT { _:1 a nco:PersonContact ; nco:fullname 'Greg Kroah-Hartman';
               slo:location [ a slo:GeoLocation;
                   slo:latitude '50.874715'; slo:longitude '4.49158' ];
               nco:hasEmailAddress [ a nco:EmailAddress;
                 nco:emailAddress 'greg.kroah@somewhere.com'] }"

And let’s add one person who isn’t nearby the Pizza Hut in Nossegem:

tracker-sparql -uq "
INSERT { _:1 a nco:PersonContact ; nco:fullname 'Jean Pierre';
               slo:location [ a slo:GeoLocation;
                   slo:latitude '50.718091'; slo:longitude '4.880134' ];
               nco:hasEmailAddress [ a nco:EmailAddress;
                 nco:emailAddress 'jean.pierre@somewhere.com'] }"

And now, the query:

tracker-sparql -q "
SELECT ?name ?email ?lati ?long WHERE {
   ?p slo:belongsToCategory slo:predefined-landmark-category-food-beverage;
       slo:location [ slo:latitude ?plati; slo:longitude ?plong ] ;
      nie:title ?pname .
   ?b a nco:PersonContact;
        slo:location [ slo:latitude ?lati; slo:longitude ?long ] ;
      nco:fullname ?name ; nco:hasEmailAddress [ nco:emailAddress ?email ].
   FILTER (tracker:cartesian-distance (?lati, ?plati, ?long, ?plong) < 10000)
}"
Results:
  Greg Kroah-Hartman, greg.kroah@somewhere.com, 50.874715, 4.49158
  John Carmack, john.carmack@somewhere.com, 50.874715, 4.49154

These use-cases of course only illustrate the simplified location ontology in combination with the Nepomuk contacts ontology. There are many such domains in Nepomuk and when defining your own platform and/or a new domain on the desktop you can add (your own) ontologies. Mind that for the desktop you should preferably talk to Nepomuk first.
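
To give a rough idea of what adding a small domain of your own looks like in Tracker's Turtle-based ontology format (the class and property below are made up purely for illustration):

myonto:Ticket a rdfs:Class ;
    rdfs:subClassOf nie:InformationElement .

myonto:validAt a rdf:Property ;
    rdfs:domain myonto:Ticket ;
    rdfs:range slo:Landmark ;
    nrl:maxCardinality 1 .

Instances of such a class can then be inserted and queried exactly like the slo:Landmark and nco:PersonContact examples above.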

The strength of such a platform is also its weakness: if no information sources put their data into the SPARQL endpoint, no information sink can do queries that’ll yield meaningful results. You of course don’t have this problem in a contained environment where you define what does and what doesn’t get stored and where, like an embedded device.

A desktop like KDE or GNOME shouldn’t have this problem either, if only everybody would agree on the technology and share the ontologies. Which isn’t necessarily happening (fair point), although both KDE with Nepomuk-KDE and GNOME with Tracker share most of Nepomuk.

But indeed; if you don’t store anything in Tracker, it’s useless. That’s why Tracker comes with a file system miner and provides a framework for writing your own miners. The idea is that with time more and more applications will use Tracker, making it increasingly useful. Hopefully.

 

Bypassing Tracker’s file system miner, for example for MTP daemons

Recapping from my last blog article: I worked a bit on this concept during the weekend.

When a program is responsible for delivery of a file to the file system that program knows precisely when the rename syscall, completing the file transfer transaction, takes place.

An example of such a program is an MTP daemon. I quote from wikipedia: A main reason for using MTP rather than, for example, the USB mass-storage device class (MSC) is that the latter operates at the granularity of a mass storage device block (usually in practice, a FAT block), rather than at the logical file level.

One solution for metadata extraction for those files is to have file monitoring on the target storage directory with Tracker’s FS miner. The unfortunate thing with such a solution is that file monitoring will inevitably always trigger after the rename syscall. This means that only moments after the transfer has completed, the system can update the RDF storage. Not during and not just in time.

With this new feature I plan to allow software like an MTP daemon to get ahead of that: for example while the file is being transferred, or just in time upfront, and/or just after the rename syscall, depending on the use-case and how the developer plans to use the feature.

The API might still change. For example, I plan to allow passing the value of tracker:available, among other useful properties whose values an MTP daemon might want to safely tamper with (edit: this is done and the API in this blog article is adapted). The tracker:available property can be used to indicate to other software the availability of a file. For example, while the file is being transferred you could set it to false, and right after the rename you set it to true.

When you are building a device that has no other entry points for user files or documents than MTP, this feature helps you turn off Tracker's FS miner completely. This could be ideal for certain tablets and phones.

Currently it looks like this. Branch is available here:

#include <time.h>
#include <glib.h>
/* plus the header from the branch that declares tracker_extract_get_sparql() */

static void
on_finished (GObject *none, GAsyncResult *result, gpointer user_data) {
    GMainLoop *loop = user_data;
    GError *error = NULL;
    gchar *sparql = tracker_extract_get_sparql_finish (result, &error);
    if (error == NULL) {
        g_print ("%s", sparql);
        g_free (sparql);
    } else
        g_error ("%s", error->message);
    g_clear_error (&error);
    g_main_loop_quit (loop);
}

int main (int argc, char **argv) {
    const gchar *file = "/tmp/file.png";
    const gchar *dest = "file:///home/pvanhoof/Documents/Photos/photo.png";
    const gchar *graph = "urn:mygraph";
    GMainLoop *loop;
    g_type_init ();
    loop = g_main_loop_new (NULL, FALSE);
    tracker_extract_get_sparql (file, dest, graph, time(0), time(0),
                                TRUE, on_finished, loop);
    g_main_loop_run (loop);
    g_main_loop_unref (loop);
}

This will result in something like this:

INSERT SILENT { GRAPH  <urn:mygraph> {
    _:file a nfo:FileDataObject , nie:InformationElement ;
	 nfo:fileName "photo.png" ;
	 nfo:fileSize 38155 ;
	 nfo:fileLastModified "2012-12-17T09:20:18Z" ;
	 nfo:fileLastAccessed "2012-12-17T09:20:18Z" ;
	 nie:isStoredAs _:file ;
	 nie:url "file:///home/pvanhoof/Documents/Photos/photo.png" ;
	 nie:mimeType "image/png" ;
	 a nfo:FileDataObject ;
	 nie:dataSource <urn:nepomuk:datasource:9291a450-etc-etc> ;
	 tracker:available true .
    _:file a nfo:Image , nmm:Photo ;
	 nfo:width 150 ;
	 nfo:height 192 ;
	 nmm:dlnaProfile "PNG_LRG" ;
         # more extracted metadata
	 nmm:dlnaMime "image/png" .
  } }

As usual with stuff that I blog about: this feature isn’t finished, it’s not in master yet, not even reviewed. The API might change. All the usual stuff.

Battery drain on N9 caused by a combination of Battery-Icon, Tracker and Smartsearch

Tired of the fact that my N9 had poor battery life, I decided to investigate my device a little bit, "as a developer". Last time I did that I was still contracted by Nokia, and a few days later I had to fly to Helsinki to help fix a bug involving Tracker in combination with contactsd. By the way, I haven't been working for Nokia for a few months now. So this time I can't fix it for everyone. Lemme write it here instead.

It’s pretty funny what is going on: I installed Battery-Icon at some point. The software is writing periodically to /usr/share/applications/battery-icon.desktop. Having been a developer at Nokia for the metadata subsystem I know that tracker-miner-fs will reindex .desktop files that change. You don’t really need to be a developer to know that: Tracker’s FS miner is, among other things, responsible for keeping up to date a list of known applications.

Because of Battery-Icon, which people are probably installing to monitor their battery, tracker-miner-fs wakes up to update the metadata. That in turn wakes up tracker-store to store the metadata. That in turn wakes up smartsearch which will fetch from Tracker some textual data. All three will consume power periodically because of this .desktop file write trigger. I’m guessing the power consumption is triggering Battery-Icon to update the .desktop file. And circular power consumption was born.

I guess I should file a bug on Battery-Icon and tell its author to update the .desktop file less often. He could, for example, wait ten minutes before doing that write. Or is the user really interested in accurate battery information each and every second? Looks like Battery-Icon is even writing to the file more frequently every hour. Interesting behavior for a battery monitoring tool: doing things in a way that significantly influences power consumption.

Btw, while it's not fixed: devel-su on your N9 (enable developer mode, install a terminal; the password for devel-su is rootme) and chmod -x /usr/bin/smartsearch, reboot, then uninstall Battery-Icon and your battery will last longer. I know the guys who were or are on the smartsearch team are going to hate me for that advice. Sorry guys.

Avoiding duplicate album art storage on the N9

At Tracker (a core component of the Nokia N9's MeeGo Harmattan Content Framework) we extract album art out of music files like MP3s, and we do a heuristic scan in the same directory as the music files for files like cover.jpg.

Right now we use the media art storage spec, which we came up with together with the Banshee guys at a Boston Summit a few years ago. This specification allows for artist + album media art.

This is a bit problematic now on the N9 because (embedded) album art is getting increasingly bigger. We've seen music stores with album art of up to 2MB. The storage space for this kind of data isn't unlimited on the device. In particular it is a problem that for an album with, say, 20 songs by 20 different artists, each with embedded album art, the same album art is stored 20 times. Just each time for a different artist-album combination.

To fix this we’re working on a solution that compares the MD5 of the image data of the file album-md5(space)-md5(album).jpg with the MD5 of the image data of the file album-md5(artist)-md5(album).jpg. If the contents are the same we will make a symlink from the latter to the former instead of creating a normal new album art file.

When none exist yet, we first make album-md5(space)-md5(album).jpg and then symlink album-md5(artist)-md5(album).jpg to it. And when the contents aren’t the same we create a normal file called album-md5(artist)-md5(album).jpg.

Consumers of the album art can now choose between using a space for artist if they are only interested in ‘just album’ album art, or filling in both artist and album for artist-album album art.

This is a first idea to solve this issue, we have some other ideas in mind for in case this solution comes with unexpected problems.

I usually blog about unfinished stuff. Also this time. You can find the work in progress here.

Null support for INSERT OR REPLACE available in master

About

Last week I wrote about adding a feature to our SPARQL Update's INSERT OR REPLACE. With that feature you don't need to put a DELETE in front of the INSERT to clear a field. This makes our SPARQL-ish INSERT OR REPLACE in some ways more powerful than SQL's UPDATE. Note, however, that all of INSERT OR REPLACE is non-standard in the SPARQL language. And this new null support certainly isn't standard either.

Support for null with INSERT OR REPLACE is now available in Tracker‘s master branch. How to use it is illustrated in the functional test. I’ll briefly explain the test.

For single value properties:

This is of course very simple.

INSERT { <subject> nie:title 'new test' }
INSERT OR REPLACE { <subject> nie:title null }

If you now select nie:title for <subject> then of course you’ll get that its nie:title field is unset.
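
A quick query to verify, in the same style as the other examples (just a sketch, using the same placeholder <subject> as above):

SELECT ?title WHERE { <subject> nie:title ?title }

After the INSERT OR REPLACE with null this gives zero results.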

For multi value properties:

Begin situation:

INSERT { <subject> a nie:DataObject, nie:InformationElement }
INSERT { <ds1> a nie:DataSource }
INSERT { <ds2> a nie:DataSource }
INSERT { <ds3> a nie:DataSource }
INSERT { <subject> nie:dataSource <ds1>, <ds2>, <ds3> }

This will be the test query I’ll use for all cases:

SELECT ?ds WHERE { <subject> nie:dataSource ?ds }

For the begin situation that of course gives us <ds1>, <ds2> and <ds3>.

With null upfront, reset of list, rewrite of new list:

INSERT OR REPLACE { <subject> nie:dataSource null, <ds1>, <ds2> }

This will give us <ds1> and <ds2> for the test query. The first null resets the existing list, then <ds1> and <ds2> are added. This is probably the most sensible one to use for multi value properties.

With null in the middle, rewrite of new list:

INSERT OR REPLACE { <subject> nie:dataSource <ds1>, null, <ds2>, <ds3> }

This gives us <ds2> and <ds3>. First <ds1> is added, but the null that follows clears it again. Then <ds2> and <ds3> get added. So the <ds1> there doesn't make much sense, indeed.

With null at the end:

INSERT OR REPLACE { <subject> nie:dataSource <ds1>, <ds2>, <ds3>, null }

This one doesn’t make much sense either. The <ds1>, <ds2> and <ds3> get cleared by the null at the end. So the query gives us zero results.

With null as only element:

INSERT OR REPLACE { <subject> nie:dataSource null }

This one makes sense, you can use it to clear a multi value property of a resource. The query gives us zero results.

Multiple nulls:

INSERT OR REPLACE { <subject> nie:dataSource null, <ds1>, null, <ds2>, <ds3> }

Again doesn’t make much sense. First the list is cleared, then <ds1> is added, then it’s again cleared, then <ds2> and <ds3> are added. So the query gives <ds2> and <ds3>.

Support for null with Tracker’s INSERT OR REPLACE feature.

I believe it was the QtContacts Tracker team who requested this feature. When they have to unset the value of a resource's property and at the same time set a bunch of other properties, they need to put a DELETE statement in front of an INSERT OR REPLACE. The DELETE increases the number of queries and internally introduces a SQL SELECT for solving the SPARQL DELETE's WHERE.

Instead of that they wanted a way to express this in the INSERT OR REPLACE, and that way gain a bit of performance. Today I implemented this.

So let’s say we start with:

INSERT { <subject> a nie:InformationElement ; nie:title 'test' }

And then we replace the nie:title:

INSERT OR REPLACE { <subject> nie:title 'new test' }

Then of course we get ‘new test’ for the nie:title of the resource:

SELECT ?title { <subject> nie:title ?title }

Then let’s say we want to unset the nie:title, we can either use:

DELETE { <subject> nie:title ?title } WHERE { <subject> nie:title ?title }

or we can now also use this (and avoid an extra internal SQL SELECT to solve the SPARQL DELETE’s WHERE):

INSERT OR REPLACE { <subject> nie:title null }

For multi value properties, a null object in INSERT OR REPLACE results in a reset of the entire list of objects. There is still a SQL SELECT happening internally to get the so-called old values, but that one is more or less unavoidable and is also used by a normal DELETE. I hope this feature helps the QtContacts Tracker team gain performance for their contact synchronization use cases.
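
As a sketch of the kind of combined update this enables (the resource URI and the exact properties here are only illustrative, not necessarily what qtcontacts-tracker uses), unsetting one field while setting others now fits in a single query:

INSERT OR REPLACE { <mycontact> nco:nickname null ;
                                nco:fullname 'New Name' ;
                                nco:note 'synchronized today' }

No separate DELETE, and thus no extra internal SELECT for a DELETE's WHERE, is needed for the unset.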

You can find this in a branch, it might take some time before it reaches master as most of the Tracker team is at the Berlin Desktop Summit; it must be reviewed, of course. Since it doesn’t really change any of the existing APIs, as it only adds a feature, we might also bring it to 0.10. Although now that we started with 0.11, I think it probably belongs in 0.11 only. Distributions should probably just upgrade, wait for the new features until they decide to bump the version of their packages, or backport features themselves.

Refactoring our writeback system

Tracker writes back certain metadata to your files. For example, it writes back the title of a JPEG file in XMP, among other fields that XMP supports.

We had a service that runs in the background waiting for signals coming from the RDF store that tell it to perform a writeback.

To avoid our FS miner picking up the changes that the writeback service made, and thereby indexing the file again, we introduced a D-Bus API for our FS miner called IgnoreNextUpdate. When that API is called, the FS miner ignores the next filesystem event that would otherwise be handled for the specified file.

That API is now among our biggest sources of race conditions. Although we won't remove it from 0.10 due to API promises, we don't like it and want to get rid of it. Or at least we want to replace all its users.

To get rid of it we of course had to change the writeback service in a way that it wouldn’t need the API call on the FS miner any longer.

The solution we came up with was to move the handling of the signal and the queuing to the FS miner‘s process. There we have all the control we need.

The original reason why writing back was done as a service was to be robust against the libraries used for the actual writeback crashing or hanging. We wanted to keep this capability, so just like the extractor, a portion of the writeback system is going to run out of process of the FS miner.

When a queued writeback task is to be run, an IPC call to a writeback process is made and returns only when it’s finished. Then the next task in the queue, in the FS miner, is selected. A lot like how the extracting of metadata works.

We have been and will be working on this in the writeback-refactor branches the next few days.

The ever growing journal problem

Current upstream situation

In Tracker‘s RDF store we journal all inserts and deletes. When we replay the journal, we replay every event that ever happened. That way you end up in precisely the same situation as when the last journal entry was appended. We use the journal also for making a backup. At restore we remove the SQLite database, put your backup file where the journal belongs, and replay it.

We also use the journal to cope with ontology changes. When an ontology change takes place for which we have no support using SQLite’s limited ALTER, we replay the journal over a new SQLite database schema. While we replay we ignore errors; some ontology changes can cause loss of data (ie. removal of a property or class).

This journal has a few problems:

  • First the obvious space problem: when you insert a lot of data and later remove it all, then instead of ending up consuming no space the journal consumes twice the amount (the inserts plus the deletes) while the database itself is empty. Unless you remove the journal, you can't get that space back. It's all textual data, so even when trying really, really hard you won't consume gigabytes that way. Nowadays typical hard drives are several hundred gigabytes in size. But yes, it's definitely not nice.
  • Second problem is less obvious, but far worse: your privacy. When you delete data you expect it to be gone. Especially when a lot of desktop interaction involves inserting or deleting data with Tracker. For example recently visited websites. When a user wants to permanently remove his browser history, he doesn’t want us to keep a copy of the insert and the delete of that information. With some effort it’s still retrievable. That’s not only bad, it’s a defect!

This was indeed not acceptable for Nokia’s N9. We decided to come up with an ad-hoc solution which we plan to someday replace with a permanent solution. I’ll discuss the permanent solution last.

The ad-hoc solution for the N9

For the N9 we decided to add a compile option to disable our own journal and instead use SQLite’s synchronous journaling. In this mode SQLite guarantees safe writes using fsync.

Before, we didn't use SQLite's synchronous journaling; we had replaced it with our own journal for the earlier features (backup, coping with ontology changes) but also, more importantly, because the N9's storage hardware has a high latency on fsync: we wanted to take full control by using our own journal. Also because at first we were told it wouldn't be possible to force-shutdown the device, and then this suddenly was possible again in some ways: we needed high performance, plus we don't want to lose your data, ever.

The storage space issue was less severe: the device's storage capacity is huge compared to the significance of that problem. The privacy issue, however, we did not want, so I managed to get this problem the right priority before any launch of the N9.

The performance was significantly worse with SQLite’s synchronous journaling, so we implemented manual checkpointing in a background thread for our usage of SQLite. With this we have more control over when fsync happens on SQLite’s WAL journal. After some tuning we got comparable performance figures even with our high latency storage hardware.

We of course replaced the backup / restore to just use a copy of the SQLite database using SQLite’s backup API.

Above solution means that we lost an important feature: coping with certain ontology changes. It’s true that the N9 will not cope with just any ontology change, whereas upstream Tracker does cope with more kinds of ontology changes.

The solution for the N9 will be pragmatic: we won’t do any ontology changes, on any future release that is to be deployed on the phone, that we can’t cope with, unless the new ontology gets shipped alongside a new release of Tracker that is specifically adapted and tested to cope with that ontology change.

Planned permanent solution for upstream

The permanent solution will probably be one where the custom journal isn’t disabled and periodically gets truncated to have a first transaction that contains an entire copy of the SQLite database. This doesn’t completely solve the privacy issue, but we can provide an API to make the truncating happen at a specific time, wiping deleted information from the journal.

We delivered

Damned guys, we’re too shy about what we delivered. When the N900 was made public we flooded the planets with our blogs about it. And now?

I’m proud of the software on this device. It’s good. Look at what Engadget is writing about it! Amazing. We should all be proud! And yes, I know about the turbulence in Nokia-land. Deal with it, it’s part of our job. Para-commandos don’t complain that they might get shot. They just know. It’s called research and development! (I know, bad metaphor)

I don't remember that many good reviews about even the N900, and that phone was seen by many of its owners as among the best they've ever owned. Now is the time to support Harmattan the same way we passionately worked on the N900 and its predecessor tablets (N810, N800 and 770). Even if the N9's future is uncertain: who cares? It's mostly open source! And not open source in the 'Android way'. You know what I mean.

The N9 will be a good phone. The Harmattan software is awesome. Note that Tracker and QSparql are being used by many of its standard applications. We have always been allowed to develop Tracker the way it’s supposed to be done. Like many other similar projects: in upstream.

As for short term future I can announce that we’re going to make Michael Meeks happy by finally solving the ever growing journal problem. Michael repeatedly and rightfully complained about this to us at conferences. Thanks Michael. I’ll write about how we’ll do it, soon. We have some ideas.

We have many other plans for long term future. But let’s for now work step by step. Our software, at least what goes to Harmattan, must be rock solid and very stable from now on. Introducing a serious regression would be a catastrophe.

I'm happy because with that growing-journal problem I can finally focus on a tough coding problem again. I don't like bugfixing-only periods. But yeah, I have enough experience to realize that sometimes this is needed.

And now, now we’re going to fight.

INSERT OR REPLACE explained in more detail

A few weeks ago we were asked to improve data entry performance of Tracker’s RDF store.

From earlier investigations we knew that a large amount of the RDF store’s update time was going to the application having to first delete triples and internally to the insert having to look up preexisting values.

For this reason we came up with the idea of providing a replace feature on top of standard SPARQL 1.1 Update.

When working with triples, a feature like replace is of course a bit ambiguous. I'll first briefly explain working with triples to describe things. When I want to describe a person Mark who has two dogs, we could do it like this:

  • Max is a Dog
  • Max is 10 years old
  • Mimi is a Dog
  • Mimi is 11 years old
  • Mark is a Person
  • Mark is 30 years old
  • Mark owns Max
  • Mark owns Mimi

If you look at those descriptions, you can simplify each by writing exactly three things: the subject, the property and the value.

In RDF we call these three the subject, the predicate and the object. All subjects and predicates will be resources; the objects can be either a resource or a literal. You wrap resources in angle brackets.

You can continue talking about a resource using a semicolon, and you continue talking about a predicate using a comma. When you want to finish talking about a resource, you write a dot. Now you know how the Turtle format works.

In SPARQL Update you insert data with INSERT { Turtle formatted data }. Let’s translate that to Mark’s story:

INSERT {
  <Max> a <Dog> ;
        <hasName> 'Max' ;
        <hasAge> 10 .
  <Mimi> a <Dog> ;
        <hasName> 'Mimi' ;
        <hasAge> 11 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 30 ;
         <owns> <Max>, <Mimi>
}

In the example we are using both single value and multiple value properties. You can have only one name and one age, so <hasName> and <hasAge> are single value properties. But you can own more than one dog, so <owns> is a multiple value property.

The ambiguity with a replace feature for SPARQL Update is at multiple value properties. Does it need to replace the entire list of values? Does it need to append to the list? Does it need to update just one item in the list? And which one? This probably explains why it’s not specified in SPARQL Update.

For single value properties there’s no ambiguity. For multiple value properties on a resource where the particular triple already exists, there’s also no ambiguity: RDF doesn’t allow duplicate triples. This means that in RDF you can’t own <Max> twice. This is also true for separate insert executions.
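
In other words, executing the same insert twice leaves you with just a single triple:

INSERT { <Mark> <owns> <Max> }
INSERT { <Mark> <owns> <Max> }

After both executions, Mark still owns Max exactly once.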

In the next two examples the first query is equivalent to the second query. Keep this in mind because it will matter for our replace feature:

INSERT { <Mark> <owns> <Max>, <Max>, <Mimi> }

Is the same as

INSERT { <Mark> <owns> <Max>, <Mimi> }

There is no ambiguity for single value properties so we can implement replace for single value properties:

INSERT OR REPLACE {
  <Max> a <Dog> ;
        <hasName> 'Max' ;
        <hasAge> 11 .
  <Mimi> a <Dog> ;
        <hasName> 'Mimi' ;
        <hasAge> 12 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 31 ;
         <owns> <Max>, <Mimi>
}

As mentioned earlier, RDF doesn't allow duplicate triples, so nothing will change in Mark's ownerships. However, had we added a new dog, it would have been added to Mark's ownerships just as if OR REPLACE wasn't there. The following example will actually add Morm to Mark's dogs (this is different from the single value properties, which are overwritten instead).

INSERT OR REPLACE {
  <Morm> a <Dog> ;
        <hasName> 'Morm' ;
        <hasAge> 2 .
  <Max> a <Dog> ;
        <hasName> 'Max' ;
        <hasAge> 12 .
  <Mimi> a <Dog> ;
         <hasName> 'Mimi' ;
         <hasAge> 13 .
  <Mark> a <Person> ;
          <hasName> 'Mark' ;
          <hasAge> 32 ;
          <owns> <Max>, <Mimi>, <Morm>
}

We know that this looks a bit strange, but in RDF it kinda makes sense too. Note again that our replace feature is not part of standard SPARQL 1.1 Update (and will probably never be).

If for some reason you want to completely overwrite Mark’s ownerships then you need to precede the insert with a delete. If you also want to remove the dogs from the store (let’s say because, however unfortunate, they died), then you also have to remove their rdfs:Resource type:

DELETE { <Mark> <owns> ?dog . ?dog a rdfs:Resource }
WHERE { <Mark> <owns> ?dog }
INSERT OR REPLACE {
  <Fred> a <Dog> ;
        <hasName> 'Fred' ;
        <hasAge> 1 .
  <Mark> a <Person> ;
         <hasName> 'Mark' ;
         <hasAge> 32 ;
         <owns> <Fred> .
}

We don’t plan to add a syntax for overwriting, adding or deleting individual items or entire lists of a multiple value property at this time (other than with the preceding delete). There are technical reasons for this, but I will spare you the details. You can find the code that implements replace in the branch sparql-update where it’s awaiting review and then merge to master.

We saw performance improvements of 30% and more, though this greatly depends on the use-case. A use-case that was tested in particular was synchronizing contact data. The original query varied between 17s and 23s for 1000 contacts. With the replace feature it takes around 13s for 1000 contacts. For more information on this performance test, read this mailing list thread and experiment yourself with this example.

The team working on qtcontacts-tracker, which is a backend for the QtContacts API that uses Tracker’s RDF store, are working on integrating with our replace feature. They promised me tests and numbers by next week.

A REPLACE extension for Tracker’s SPARQL’s Update

SPARQL Update has INSERT and DELETE. To update an existing triple in RDF you need to DELETE it first. You of course already have our INSERT-SILENT but that just ignores certain errors; it doesn’t replace triples.

A (performance) problem is that each DELETE has to solve all possible solutions of its WHERE, so you create an extra query every time you want to update using a 'DELETE-WHERE INSERT' construction.

INSERT also checks for old values. It has to do this to implement SPARQL Update, where you can't insert a triple with a different value than the old value: if the value of a triple is identical, the insert for that triple is ignored; if the triple didn't exist yet, it's inserted; if the values aren't identical, an error is thrown and you need a DELETE upfront.

Both the extra DELETE and the old-values check come at a performance price.
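
For illustration, the 'DELETE-WHERE INSERT' construction that updating a single value property requires today looks like this (using the same example resource <r> as further below):

DELETE { <r> nie:title ?t } WHERE { <r> nie:title ?t }
INSERT { <r> nie:title 'title new' }

It is exactly this extra DELETE, and the old-values check of the INSERT, that REPLACE is meant to avoid.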

To solve this we plan to provide Tracker specific support for REPLACE. It’ll be Tracker specific simply because this isn’t specified in SPARQL Update. That has a probable reason:

Replacing or updating doesn’t fit well in the RDF world. Updating properties that have multiple values, like nie:keyword, is ambiguous: does it need to replace the entire list of values; does it need to append to the list; does it need to update just one item in the list, and which one? This probably explains why it’s not specified in SPARQL Update.

We decided to let our REPLACE differ from INSERT only for single value properties. For multi value properties our REPLACE behaves the same as a normal INSERT.

How a GraphUpdated triggered by a REPLACE behaves is still being decided. Especially the value of the object’s ID for resource objects in the ‘deletes’-array. Having to look up the old ID kinda defeats the purpose of having a REPLACE (as we’d still need to look it up, like what an INSERT does, destroying part of the performance gain).

Either way, let me show you some examples:

We start with an insert of a resource that has a single value property and a multi value property filled in twice:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title';
             nie:keyword 'keyw1';
             nie:keyword 'keyw2' }

A quick query to verify, and yes it’s in:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title, keyw1
  title, keyw2

If we repeat the insert a second time then the old-values check will turn it into a noop:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title';
             nie:keyword 'keyw1';
             nie:keyword 'keyw2' }

And a quick query to verify that, and indeed nothing has changed:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title, keyw1
  title, keyw2

If we’d do that last insert query but with different values, we’d get this:

INSERT { <r> a nie:InformationElement ;
             nie:title 'title new';
             nie:keyword 'keyw4';
             nie:keyword 'keyw3' }

SparqlError.Constraint: Unable to insert multiple values for subject
`r' and single valued property `dc:title' (old_value: 'title', new
 value: 'title new')

Note that for the two nie:keyword triples this would have worked, but given that each query is a transaction, and because the nie:title part failed, those two aren't written either.

Let’s now try the same with INSERT OR REPLACE (edit: changed from just REPLACE to INSERT OR REPLACE):

INSERT OR REPLACE { <r> a nie:InformationElement ;
                        nie:title 'title new';
                        nie:keyword 'keyw4';
                        nie:keyword 'keyw3' }

And a quick query now yields:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title new, keyw1
  title new, keyw2
  title new, keyw3
  title new, keyw4

You can see that it behaved differently for nie:title than for nie:keyword. That's because nie:title is a single value property and nie:keyword is a multi value property.

What if we do want to reset the multi value property and insert a complete new list? Simple, just do this as a single query (space or newline delimited) (edit: changed to INSERT OR REPLACE from just REPLACE):

DELETE { <r> nie:keyword ?k } WHERE { <r> nie:keyword ?k }
INSERT OR REPLACE { <r> a nie:InformationElement ;
                        nie:title 'title new';
                        nie:keyword 'keyw4';
                        nie:keyword 'keyw3' }

And a quick query now yields:

SELECT ?t ?k { <r> nie:title ?t; nie:keyword ?k }
Results:
  title new, keyw3
  title new, keyw4

The work on this is in progress. You can find it in the branch sparql-update. It’s working but especially the GraphUpdated stuff is unfinished.

Also note that the final syntax may change.

Synchronizing your application’s data with Tracker’s RDF store

A few months ago we added the implicit tracker:modified property to all resources. This property is an auto-increment. It used to be that the property was incremented on roughly each SQL update-query that happens. The value is stored per resource.

We are now changing this to be per transaction. A transaction in Tracker is one set of SPARQL Update INSERT or DELETE queries. You can do inserts and deletes about multiple resources in one such sentence (a sentence can contain multiple space delimited update queries). An exception is everything related to ontology changes. These ontology changes get the first increment as their value for tracker:modified. This also holds for ontology changes that happen after the initial ontology transaction (that first transaction is made at the first start). The exception is made to support future ontology changes and the data conversions they may need.

The per-resource tracker:modified value is useful for an application's synchronization purposes: you can compare the tracker:modified value your application stored against Tracker's always increasing value (barring integer overflow) to know whether or not your version is older.
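
As a sketch, an application that keeps local copies of resources could store the tracker:modified value it last saw and later compare it with what the store currently reports (the resource URI here is just a placeholder):

SELECT ?m WHERE { <my-resource> tracker:modified ?m }

If the value that comes back is higher than the one the application stored, the resource has changed since the last synchronization.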

The reason why we are changing this to per-transaction is that this way we can guarantee that the value will be restored after a journal replay and/or a backup's restore, without having to store it in either the journal or the backup. This means that we now guarantee the value being restored without having to change either the backup's format or the journal's format.

Having a persistent journal, we actually make a simple copy of the journal to deliver you a backup as a fast file copy. But let this deception be known only by the people who care about the implementation. Sssht!

We’re already rotating and compressing the rotated chunks for reducing the journal size. We’re working on not journaling data that is embedded in local files this week. A re-index of that local file will re-insert the data anyway. This will significantly reduce the size of the journal too.

IPC performance improvements for insert queries

Although with SQLite WAL we have direct-access now, we don’t support direct-access for insert and delete SPARQL queries. Those queries when made using libtracker-sparql still go over D-Bus using Adrien’s FD passing D-Bus IPC technique. The library will do that for you.

After investigating a performance analysis by somebody from Intel we learned that there is still a significant overhead per each IPC call. In the analysis the person made miner-fs combine multiple insert transactions together and then send it over as a single big transaction. This was noticeably faster than making many individual IPC requests.

The problem with this is that if one of the many insert queries fail, they all fail: not good.

We’re now experimenting with a private API that allows you to pass n individual insert transactions, and get n errors back, using one IPC call.

The numbers are promising even on Desktop D-Bus (the test):

$ cd tests/functional-tests/
$ ./update-array-performance-test
First run (first update then array)
Array: 0.103675, Update: 0.139094
Reversing run (first array then update)
Array: 0.290607, Update: 0.161749
$ ./update-array-performance-test
First run (first update then array)
Array: 0.105920, Update: 0.137554
Reversing run (first array then update)
Array: 0.118785, Update: 0.130630
$ ./update-array-performance-test
First run (first update then array)
Array: 0.108501, Update: 0.136524
Reversing run (first array then update)
Array: 0.117308, Update: 0.151192
$

We’re now deciding whether or not the API will become public; returning arrays of errors isn’t exactly ‘nice’ or ‘standard’.

LRU cache for prepared statements in Tracker’s RDF store

While trying to handle a bug that had a description like "if I do this, tracker-store's memory grows to 80MB and my device starts swapping", we were surprised to learn that a sqlite3_stmt consumes about 5 kB of heap. Auwch.

Before, we didn't think that those prepared statements were very large, so we threw all of them in a hashtable in case the query was run again later. However, if you collect thousands of such statements, memory consumption obviously grows.

We decided to implement a LRU cache for these prepared statements. For clients that access the database using direct-access the cache will be smaller, so that max consumption is only a few megabytes. Because our INSERT and DELETE queries are more reusable than SELECT queries, we split it into two different caches per thread.

The implementation is done with a simple intrusive linked ring list. We're still testing it a little bit to get good cache-size numbers. I guess it'll go in master soon. For your testing pleasure you can find the branch here.

Less exciting features also need to be done, return types

We have a feature request to support return types and to give back variable names. We currently return an array (of array) of just strings, with no typing. This doesn’t work very well for knowing whether a cell is (for example) unbound. Empty string isn’t the same as unbound. So what can you do?

With direct-access the implementation is easy: we'll just read it from the SPARQL engine; we have all this info already anyway. For file descriptor passing with D-Bus we need to marshal it over the protocol.

Although we might come back to this decision short term, we won't yet do it for our "normal" (non-FD passing) D-Bus query method. SPARQL's type system is different from D-Bus's, so we shouldn't try to match them somehow. Any custom format that we'd come up with would be arbitrary.

Maybe someday we’ll add another “normal” D-Bus method that gives you a big string with SPARQL Query Results in JSON or SPARQL Query Results in XML back. Right now this has no priority for us, plus it would be a lot slower due to serialization. Post 0.9 everybody should be using libtracker-sparql and that’ll select either FD passing or direct-access.

Anyway, this will likely be the API for Sparql.Cursor. The methods get_value_type and get_variable_name got added.

public enum Tracker.Sparql.ValueType {
	UNBOUND, URI, STRING, INTEGER,
	DOUBLE, DATETIME, BLANK_NODE
}

public abstract class Tracker.Sparql.Cursor : Object {
	public Connection connection { get; }
	public abstract int n_columns { get; }
+	public abstract ValueType get_value_type (int column);
+	public abstract unowned string? get_variable_name (int column);
	public abstract unowned string? get_string (int column, out long length = null);
	public abstract bool next (Cancellable? cancellable = null) throws GLib.Error;
	public async abstract bool next_async (Cancellable? cancellable = null) throws GLib.Error;
	public abstract void rewind ();
}

I usually post about work in progress, not about something that is done. Same this time, of course. You can find the branch where we’re working on this here.