Reporting busy status

We’re nearing our first release in a long while, so I’ll do another technical blog post about Tracker ;)

When the RDF store is replaying its journal at startup, or when it is restoring a backup, it can be in a busy state. This means that we can’t handle your DBus requests during that time; your DBus method call will return late.

Because that’s not very nice from a UI perspective (the “uh, what is going on??” syndrome kicks in), we’re adding a signal that emits the progress and status. You can also query them using the DBus methods GetProgress and GetStatus.

The miners already had something like this, so I kept the API more or less the same.

signal sender=:1.99 -> dest=(null destination) serial=1454
  path=/org/freedesktop/Tracker1/Status;
  interface=org.freedesktop.Tracker1.Status; member=Progress
   string "Journal replaying"
   double 0.197824
signal sender=:1.99 -> dest=(null destination) serial=1455
  path=/org/freedesktop/Tracker1/Status;
  interface=org.freedesktop.Tracker1.Status; member=Progress
   string "Journal replaying"
   double 0.698153
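
If you’d rather poll than listen for the signal, here’s a quick sketch using dbus-glib. Note the assumptions: the well-known bus name org.freedesktop.Tracker1 isn’t visible in the signal dump above, and I’m assuming GetProgress returns a plain double; introspect the service to be sure.

#include <dbus/dbus-glib.h>

static gdouble
get_tracker_progress (void)
{
  DBusGConnection *conn;
  DBusGProxy *proxy;
  GError *error = NULL;
  gdouble progress = 0.0;

  conn = dbus_g_bus_get (DBUS_BUS_SESSION, &error);
  /* Bus name is an assumption; object path and interface are taken
   * from the signal dump above */
  proxy = dbus_g_proxy_new_for_name (conn,
                                     "org.freedesktop.Tracker1",
                                     "/org/freedesktop/Tracker1/Status",
                                     "org.freedesktop.Tracker1.Status");

  dbus_g_proxy_call (proxy, "GetProgress", &error,
                     G_TYPE_INVALID,
                     G_TYPE_DOUBLE, &progress,
                     G_TYPE_INVALID);

  g_object_unref (proxy);
  return progress;
}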

Jürg just reviewed yesterday’s SPARQL regex performance improvement, so that’s now in master. If you want this busy-status notification today already, you can test with the busy-notifications branch.

Performance improvements for SPARQL’s regex in Tracker

Tracker’s original SPARQL regex support uses a custom SQLite function. Back when we wrote it we didn’t yet think much about optimization: we used g_regex_match_simple, which recompiles the regular expression on every call.

Today Jürg and I found out about sqlite3_get_auxdata and sqlite3_set_auxdata, which allow us to cache the compiled regex for a custom SQLite function for the duration of the query.

This is much better:

static void
function_sparql_regex (sqlite3_context *context,
                       int              argc,
                       sqlite3_value   *argv[])
{
  gboolean ret;
  const gchar *text, *pattern, *flags;
  GRegexCompileFlags regex_flags;
  GRegex *regex;

  if (argc != 3) {
    sqlite3_result_error (context, "Invalid argument count", -1);
    return;
  }

  /* Reuse the regex compiled for a previous row of this query, if any */
  regex = sqlite3_get_auxdata (context, 1);
  text = (const gchar *) sqlite3_value_text (argv[0]);
  flags = (const gchar *) sqlite3_value_text (argv[2]);
  if (regex == NULL) {
    gchar *err_str;
    GError *error = NULL;
    pattern = (const gchar *) sqlite3_value_text (argv[1]);
    regex_flags = 0;
    while (*flags) {
      switch (*flags) {
      case 's': regex_flags |= G_REGEX_DOTALL; break;
      case 'm': regex_flags |= G_REGEX_MULTILINE; break;
      case 'i': regex_flags |= G_REGEX_CASELESS; break;
      case 'x': regex_flags |= G_REGEX_EXTENDED; break;
      default:
        err_str = g_strdup_printf ("Invalid SPARQL regex flag '%c'", *flags);
        sqlite3_result_error (context, err_str, -1);
        g_free (err_str);
        return;
      }
      flags++;
    }
    regex = g_regex_new (pattern, regex_flags, 0, &error);
    if (error) {
      sqlite3_result_error (context, error->message, -1);
      g_clear_error (&error);
      return;
    }
    /* Cache the compiled regex; SQLite calls g_regex_unref when done */
    sqlite3_set_auxdata (context, 1, regex, (void (*) (void*)) g_regex_unref);
  }
  ret = g_regex_match (regex, text, 0, NULL);
  sqlite3_result_int (context, ret);
  return;
}
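
For completeness, this is roughly how such a custom function gets registered with SQLite; the database handle and the SQL-side function name here are illustrative, not necessarily what Tracker uses:

sqlite3_create_function (db, "SparqlRegex", 3, SQLITE_ANY, NULL,
                         function_sparql_regex, NULL, NULL);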

Before (this was a test against a huge number of resources):

$ time tracker-sparql -q "select ?u { ?u a rdfs:Resource . FILTER (regex(?u, '^titl', 'i')) }"
real	0m3.337s
user	0m0.004s
sys	0m0.008s

After:

$ time tracker-sparql -q "select ?u { ?u a rdfs:Resource . FILTER (regex(?u, '^titl', 'i')) }"
real	0m1.887s
user	0m0.008s
sys	0m0.008s

This will hit Tracker’s master today or tomorrow.

Working hard at the Tracker project

Today we improved journal replaying: for my test of 25,249 resources it went from 1050s to 58s.

Journal replaying happens when your cache database gets corrupted, and also when you restore a backup: restore uses the same code as journal replaying, since a backup is just a copy of your journal.

While working on the performance improvements we of course found other areas related to data entry that can be improved. It looks like we’re entering a period of focus on performance: we already have a few interesting ideas for next week, centered on some SPARQL functions like regex.

Meanwhile, Michele Tameni and Roberto Guido are working on an RSS miner for Tracker, and Adrien Bustany has been working on other web miners, for Flickr, GData, Twitter and Facebook.

I think the first pieces of the RSS and other web miners will start becoming available in this week’s unstable 0.7 release. Martyn is still reviewing their branches, but we’re very lucky to have such good software developers as contributors. Very nice work, Michele, Roberto and Adrien!

An ode to our testers

You know about those guys that use your software against huge datasets like their entire filesystem, with thousands of files?

We do. His name is Tshepang Lekhonkhobe, and we owe him a few beers for reporting many scalability issues to us.

Today we found and fixed such a scalability issue: the update query that resets the availability of file resources (part of the support for removable media) was causing VmRSS usage to grow at least linearly with the number of file resources. For Tshepang’s situation that meant 600 MB of VmRSS. Jürg reduced this to 30 MB of peak VmRSS in the same use case, and turned minutes of work into two or three seconds. And without memory fragmentation: glibc returns almost all of the VmRSS back to the kernel.

Thursday is our usual release day. I invite all the 0.7 pioneers to test us with their huge filesystems, just like Tshepang always does.

So long and thanks for all the testing, Tshepang! I’m glad we finally found it.

Invisible costs


We would rather suffer the visible costs of a few bad decisions than incur the many invisible costs that come from decisions made too slowly – or not at all – because of a stifling bureaucracy.

Letter by Warren E. Buffett to the shareholders of Berkshire, February 26, 2010

Working hard

I don’t decide about Tracker‘s release. The team of course does.

But when you look at our roadmap you notice one remaining ‘big feature’. That’s coping with modest ontology changes.

Right now, if we made even a small ontology change, all of our users would have to recreate their databases. We also don’t support restoring a backup of your metadata over a modified ontology.

This is about to change. This week I started working in a branch on supporting class and property ontology additions.

I finished it today and it appears to be working. The patches obviously need a thorough review by the other team members, and then testing, of course. I invite all the contributors and people who have been testing Tracker 0.7’s releases to try out the branch. It only supports additions, so don’t try to change or remove properties or classes; you can only add new ones. You might have noticed the nao:deprecated property in the ontology files? That’s what we do with deleted properties.
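
Concretely, a deleted property stays in the ontology file and just gets marked. A sketch (the property name here is made up; check the shipped ontology files for the exact usage of nao:deprecated):

nie:someOldProperty a rdf:Property ;
    nao:deprecated true .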

Anyway

Meanwhile, Martyn and Carlos are working on a bugfix for duplicate entries for file resources in the miner, and on a timeout for the extractor, so that extracting large or complicated documents doesn’t block the entire filesystem miner.

Jürg is working on timezone storage for xsd:dateTime fields, and in the last few days he implemented limited support for named graphs.

By the looks of it, one would almost believe that Tracker’s first new stable release is almost ready!

Please don’t rewrite software (that is) written in .NET

This (super) cool .NET developer and good friend came to me at the FOSDEM bar to tell me he was confused about why during the Tracker presentation I was asking people to replace F-Spot and Banshee.

I hope I didn’t say it like that; I would never intend to say that. But I’ll review the video of the presentation as soon as Rob publishes it.

Anyway, to make sure everybody understood correctly what I wanted to say (whether or not I actually said it that way is another question):

The call was to inspire people to reimplement, or to provide alternative implementations of, F-Spot’s and Banshee’s data backends, so that they would use an RDF store like tracker-store instead of each app having its own metadata database.

I think I also mentioned Rhythmbox in the same sentence, because the last thing I would want is to turn this into a .NET versus anti-.NET debate. It just happens to be that the best GNOME software for photo and music management is written in .NET (and there’s a good reason for that).

People who know me also know that I think those anti-.NET people are disruptive, ignorable people. I also actively and willingly ignore them (and they should know this). I’m actually a big fan of the Mono platform.

I’ll try to ensure that I don’t create this confusion during presentations anymore.

FWD: [Tracker] tracker-miner-rss 0.3

This is the kind of stuff that needs a forward on the planets:

From: Roberto -MadBob- Guido

This is just an update on the tracker-miner-rss effort, already mentioned on this list some time ago.

Website, SVN, Last release (0.3)

Since 0.2 we (Michele and I) have dropped the dependency on rss-glib due to some limitations we found, and created our own GLib-oriented feed-handling library, libgrss, starting from the code of Liferea and adding nice stuff such as a PubSub subscriber implementation. At the moment it is shipped with tracker-miner-rss itself; in the future it may be split out to ease usage by other developers.

Next will come integration with libchamplain to describe geographic points found in GeoRSS-enabled feeds, integration with libedataserver to better handle “person” representation (suggestions for a better PIM-like shared library with useful objects?), and perhaps a first full-featured feed reader using Tracker as its backend.

Enjoy :-)

Roberto is doing a demo of FSter at FOSDEM during our presentation. My role in the presentation will be light this year: I decided to give most of the talk to Rob Taylor and Roberto. I will probably demo Debarshi Ray’s Solang and, if time permits, his work on the Nautilus integration. Regrettably Debarshi can’t come, so he asked me to do the demo.

Solang, a photo manager

For the last few weeks, Debarshi Ray has been contributing to Tracker’s Nautilus plugin and working on Solang, a photo manager that will start using Tracker’s SPARQL support as its language to query for metadata about the photos, and for the photos themselves.

Debarshi explains it all very well himself on his own blog.

We’ll probably do a lightning demo during our Tracker presentation at FOSDEM about how Solang did this integration. We’re also planning to demo code from a few other applications that are working on integrating with Tracker’s store.

Somebody should port Solang to the next version of Maemo!

SPARQL subqueries

This style of subqueries will also work (you can do this one without a subquery too, as shown below, but it’s just an example of course):

SELECT ?name COUNT(?msg)
WHERE {
    ?from a nco:Contact ;
          nco:hasEmailAddress ?name .
    {
        SELECT ?from
        WHERE {
            ?msg a nmo:Email ;
                 nmo:from ?from .
        }
    }
} GROUP BY ?from
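
For comparison, the version without a subquery would look something like this:

SELECT ?name COUNT(?msg)
WHERE {
    ?from a nco:Contact ;
          nco:hasEmailAddress ?name .
    ?msg a nmo:Email ;
         nmo:from ?from .
} GROUP BY ?from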

The subquery version in QtTracker will look like this (I have not tested this, let me know if it’s wrong, Iridian):

#include <QObject>
#include <QtTracker/Tracker>
#include <QtTracker/ontologies/nco.h>
#include <QtTracker/ontologies/nmo.h>

void someFunction () {
	RDFSelect outer;
	RDFVariable from;
	RDFVariable name = outer.newColumn<nco::Contact>("name");
	from.isOfType<nco::Contact>();
	from.property<nco::hasEmailAddress>(name);
	RDFSelect inner = outer.subQuery();
	RDFVariable in_from = inner.newColumn("from");
	RDFVariable msg;
	msg.property<nmo::from>(in_from);
	msg.isOfType<nmo::Email>();
	outer.addCountColumn("total messages", msg);
	outer.groupBy(from);
	LiveNodes from_and_count = ::tracker()->modelQuery(outer);
}

What you find in this branch already supports it. Early support for subqueries in QtTracker lives in this branch.

To quickly put some stuff about E-mails into your RDF store, read this page (copy-paste the Turtle examples into a file and use the tracker-import tool). You can also enable our Evolution Tracker plugin, of course.

ps. Yes, somebody building a GLib/GObject based client library for Tracker should copy ideas from QtTracker.

Bla bla bla, subqueries in SPARQL, bla bla

Coming to you in a few days is what Jürg has been working on for the last week.

Yeah, you guessed it right by looking at the query below: subqueries!

This example shows you the number of E-mails each contact has ever sent to you:

SELECT ?address
    (SELECT COUNT(?msg) AS ?msgcnt WHERE { ?msg nmo:from ?from })
WHERE {
    ?from a nco:Contact ;
          nco:hasEmailAddress ?address .
}

The usual warnings apply here: I’m way early with this announcement. It’s somewhat implemented, but insanely experimental. The SPARQL spec only has something for this in a draft wiki page. And due to the lack of error reporting and detection, it’s easy to make stuff crash or to get it to generate wrong native SQL queries.

But then again, you guys are developers. You like that!

Why are we doing this? Ah, some team at an undisclosed company was worried about performance and D-Bus overhead: they had to do a lot of small queries after doing a parent query. You know, a bunch of aggregate functions for counts, showing the last message of somebody, stuff like that.

I should probably not mention this feature yet. It’s too experimental. But so exciting!

Anyway, here’s the messy branch and here’s the reviewed stuff for bringing this feature into master.

ps. I wish I could show you guys the query that we support for that team. It’s awesome. I’ll ask around.

Tracker’s write back support now in master

Whoohoo!

We just committed write back support to master.

What is it?

Tracker has a limited capability to write metadata back into the data resource. In the case of a file, that means writing it back into the file itself. For example, writing some of the metadata that the user sets using a SPARQL Update back into an MP3 file as ID3 tags.
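
For example, a SPARQL Update along these lines (the file URL is made up) is the kind of change that can now end up back in the file as an ID3 title tag:

INSERT { <file:///home/user/Music/song.mp3> nie:title "My corrected title" }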

Which ones do we support already?

Right now the write back capability is under development and only supports a handful of fields for a few XMP formats (JPEG, PNG and TIFF) and the title of MP3 files. In the near future we will support more and more fields.

Documentation?

For people who want to write support for their properties and file formats, read the documentation.

Party like it’s 2009!

Handling triples arriving in tracker-store, CouchDB integration as use-case

At GCDS Jamie told us that he wants to make a plugin for tracker-store that writes all the triples to a CouchDB instance.

Letting a CouchDB be a sort of offline backup isn’t very interesting. You want triples to go into the CouchDB at the moment of guaranteed storage: at commit time.

For the purpose of developing this we provide the following internal API.

typedef void (*TrackerStatementCallback) (const gchar *graph,
                                          const gchar *subject,
                                          const gchar *predicate,
                                          const gchar *object,
                                          GPtrArray   *rdf_types,
                                          gpointer     user_data);
typedef void (*TrackerCommitCallback)    (gpointer     user_data);

void tracker_data_add_insert_statement_callback (TrackerStatementCallback callback,
                                                 gpointer                 user_data);
void tracker_data_add_delete_statement_callback (TrackerStatementCallback callback,
                                                 gpointer                 user_data);
void tracker_data_add_commit_statement_callback (TrackerCommitCallback callback,
                                                 gpointer              user_data);

You’ll need to make a plugin for tracker-store and install the hooks at your plugin’s initialization.

The current behaviour is that when graph is NULL, the default graph is being used. If it’s not NULL, you probably don’t want the data in CouchDB: it’s data coming from a miner. You probably only want to store data that comes from the user, and the user’s applications won’t use FROM and INTO in their SPARQL Update queries, meaning that graph will be NULL.

It’s very important that your callback handler works with bottom halves: put the expensive task on a queue and handle the queued items somewhere else, as sketched below. You can for example use a GThreadPool, or a GQueue plus a g_idle_add_full with a G_PRIORITY_LOW callback picking items one by one on the mainloop. A TrackerStatementCallback or TrackerCommitCallback must never block. Not even a tiny, tiny bit of blocking: it’ll bring everything in tracker-store to its knees. This is why we aren’t giving you a public plugin API with a way to install your own plugins outside of the Tracker project.
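
Here’s a minimal sketch of that bottom-half pattern, using the GQueue plus g_idle_add_full option. The plugin entry point and the statement copying are mine, not part of the real API, and I only handle inserts; it assumes the internal tracker-data header above is available.

#include <glib.h>

typedef struct {
  gchar *subject;
  gchar *predicate;
  gchar *object;
} QueuedStatement;

static GQueue *queue = NULL;
static guint idle_id = 0;

static gboolean
process_one (gpointer user_data)
{
  /* Runs on the mainloop at low priority: do the expensive work
   * (for example a CouchDB HTTP request) here, one item at a time */
  QueuedStatement *st = g_queue_pop_head (queue);

  g_free (st->subject);
  g_free (st->predicate);
  g_free (st->object);
  g_slice_free (QueuedStatement, st);

  if (g_queue_is_empty (queue)) {
    idle_id = 0;
    return FALSE;
  }

  return TRUE;
}

static void
on_insert_statement (const gchar *graph,
                     const gchar *subject,
                     const gchar *predicate,
                     const gchar *object,
                     GPtrArray   *rdf_types,
                     gpointer     user_data)
{
  QueuedStatement *st;

  /* A non-NULL graph means miner data: skip it, per the above */
  if (graph != NULL)
    return;

  /* Never block here: just copy the strings and queue the item */
  st = g_slice_new (QueuedStatement);
  st->subject = g_strdup (subject);
  st->predicate = g_strdup (predicate);
  st->object = g_strdup (object);
  g_queue_push_tail (queue, st);

  if (idle_id == 0)
    idle_id = g_idle_add_full (G_PRIORITY_LOW, process_one, NULL, NULL);
}

void
my_plugin_init (void) /* hypothetical plugin entry point */
{
  queue = g_queue_new ();
  tracker_data_add_insert_statement_callback (on_insert_statement, NULL);
}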

By the way: we want to see code instead of talk before we further optimize things for this purpose.

Writeback, writing metadata back into your files

Today, I feel like exposing you to some bleeding edge development going on as we speak at the Tracker team. I know you’re scared of that and that’s precisely why I want to expose you! Hah.

We are prototyping writeback support for Tracker.

With writeback we mean writing metadata that the user passes to us via SPARQL UPDATE into the file that he’s describing.

This means that the thing being described must be stored somewhere, that the update must touch a property that we want to write back, and that we must support the file’s format.

OK, that’s three requirements before we write anything back. Let’s explain how this stuff works in the prototype!

In our prototype you mark the properties that are eligible for write-back with tracker:writeback.

It goes like this:

nie:title a rdf:Property ;
   rdfs:label "Title" ;
   rdfs:comment "The title of the document" ;
   rdfs:subPropertyOf dc:title ;
   nrl:maxCardinality 1 ;
   rdfs:domain nie:InformationElement ;
   rdfs:range xsd:string ;
   tracker:fulltextIndexed true ;
   tracker:weight 10 ;
   tracker:writeback true .

Next you need a writeback module for tracker-writeback. We implemented a prototype one that can only write the title of MP3 files. It uses ID3lib‘s C API.

When the user is describing a file, the resource must have a nie:isStoredAs. The property being changed must have tracker:writeback true. And we want the value of the property too. That’s simple in SPARQL, right? Sure it is!

SELECT ?url ?predicate ?object {
    <$subject> ?predicate ?object ;
               nie:isStoredAs ?url .
    ?predicate tracker:writeback true
}

You’ll find this query in the code, go look!

Now it’s simple: using ID3lib we map Nepomuk to ID3 and write it.

No, don’t be afraid: we’re not going to write back metadata that we found ourselves. We’ll only write back data that the user provided in the form of a SPARQL Update on the default graph. No panic. Besides, using tracker-writeback is going to be completely optional (just don’t run it).

This is a prototype, I repeat, this is a prototype. No expectations yet, please. Just feel exposed to scary stuff, get overly excited and then join us by contributing. Everything we’re doing is public in the ‘writeback’ branch.

ps. Whether this will be Maemo’s future metadata-write stuff? Hmm, I don’t know. Do you know? ;-)

FWD: Using Tumbler in Client Applications

Today Tumbler‘s maintainer wrote an interesting tutorial on how to use the thumbnail DBus API.

Check it out if your application needs to use thumbnails.

Tumbler

The last few weeks I have been working on the new thumbnail infrastructure for future Maemo products.

Last year I made a specification for requesting thumbnails over D-Bus. Afterward I made a quick prototype and replaced Maemo’s hildon-thumbnailer library with it. That prototype will be deployed on the standard N900 image; it’s too late to replace Fremantle’s thumbnailer with the newer stuff, as it takes time to test it properly.

While I was developing both the specification and the prototype, XFCE developer Jannis Pohlmann contacted me about rewriting my prototype for use in the XFCE project. Tumbler was born.

The nice people at Nokia are more interested in working with upstream projects than in maintaining their own products separately, so I shifted my focus from hildon-thumbnail to contributing to Jannis’ Tumbler project.

We realized that we needed different kinds of schedulers, so while Jannis was developing Tumbler I kindly asked him to consider abstracting the scheduling a bit. Tumbler now has two schedulers. The background one sets I/O and scheduler priorities to IDLE and processes its thumbnail tasks in FIFO order. The foreground one uses LIFO and, instead of grouping Ready signals together, emits them immediately after each single thumbnail is finished. The default is of course the foreground scheduler.

We also realized that thumbnail flavors are going to be platform-specific, so we added support for this in the DBus APIs, which we further fine-tuned and versioned.

Congratulations and appreciation to Jannis, who made Tumbler’s code and design really nice. Also thanks a lot for constructively considering our requirements and helping adapt Tumbler’s code to cope with them.

I know you worked one long night on this stuff, for example, so I officially owe you a few beers and/or cocktails at the next conference.

How about FOSDEM?

Keeping the autotools guys happy with qmake

I’m still figuring out how to do the same thing with CMake, but various bloggers and commenters appear to promise that it’ll be even easier.

But this is a message for probably all Nokia teams who are making Qt-based libraries:

First open your src/src.pro file and add this stuff:

# create_pc makes qmake generate and install a pkg-config .pc file
# (create_prl does the same for qmake's own .prl link metadata);
# this assumes your .pro already defines target.path and the
# headers.files/headers.path install set
CONFIG += create_pc create_prl
QMAKE_PKGCONFIG_REQUIRES = QtGui
QMAKE_PKGCONFIG_DESTDIR = pkgconfig
INSTALLS += target headers

Now open your debian/$package-dev.install file and add this line:

usr/lib/pkgconfig

You’ll be doing all the autotools people a tremendous favor.
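
Once the .pc file is installed, consuming the library works like for any other pkg-config-aware library. A made-up example with a library called libfoo:

$ pkg-config --cflags --libs libfoo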

Next, open the README file and document that on Debian you need to use qmake-qt4, or make either qmake-qt3 or qmake-qt4 work flawlessly with your build environment. Perhaps also mention how to set the install prefix, how to make qmake find and install .pc files in another location, stuff like that. I find this lacking for almost every Qt-based library.

You’ll be doing everybody who wants to use your software a tremendous favor.

Release fast, release often (has finally started)

Martyn is right, we did release Tracker 0.7!

Now remember, kids: it’s only an alpha, or at least an unstable, release. The 0.8 series will be what we call stable for RDF, SPARQL, the new miner infrastructure, etc.

Database cursors used in Tracker

A cursor on a database query is like a finger pointing at the current row. Most databases implement this without pulling the entire result set into memory. It’s indeed much like a C/C++ pointer, except that a pointer can only point at your process’ virtual memory. A cursor is a bit more abstract.

POSIX developers can compare a cursor with a pointer into an mmap region. For people who don’t know about mmap: mmap maps a file into your process’ memory. You get a C/C++ pointer back, from which you can read the data as if it were in memory. With mmap, when you touch a page that isn’t resident yet you cause a page fault, and the kernel pulls the page in (from the file, or whichever resource is behind the mapping).
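
A tiny standalone illustration of that analogy, reading a file through the mapping instead of copying it into a buffer first (error handling omitted):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  int fd = open ("/etc/hostname", O_RDONLY);
  struct stat st;
  char *data;

  fstat (fd, &st);
  data = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

  /* Touching the memory makes the kernel page the file in */
  fwrite (data, 1, st.st_size, stdout);

  munmap (data, st.st_size);
  close (fd);

  return 0;
}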

In Tracker, all database operations used to work much like g_file_get_contents does: you read the entire thing into memory, and then you operate on that memory. Internally it used the database’s cursor API too, of course; sqlite3_step is SQLite’s cursor API.

First the database fills its page cache with the data, then you copy it into your application’s memory, then you use it, then you free it.

That’s kinda silly! Why not read the data straight from the database’s caches instead? That’s what you use a DB cursor for.
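
In terms of SQLite’s own C API, the cursor pattern is simply this fragment (the db handle is assumed to be open; the table and column names are illustrative, not Tracker’s actual schema):

sqlite3_stmt *stmt;

sqlite3_prepare_v2 (db, "SELECT Uri FROM Resource", -1, &stmt, NULL);

while (sqlite3_step (stmt) == SQLITE_ROW) {
  /* The row is borrowed from the database's cache: use it now
   * instead of copying it into an application-side result set */
  const char *uri = (const char *) sqlite3_column_text (stmt, 0);
  g_print ("%s\n", uri);
}

sqlite3_finalize (stmt);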

The result is less copying of memory. This means less memory fragmentation and fewer memory operations to perform (which should result in a small performance improvement).

This effort is ongoing, but a lot of Tracker’s internal loops over result sets now use a cursor instead of an in-memory result set.

A reason to get up in the morning

Ever since Nokia contacted me about improving Tinymail to make it suitable for their Modest E-mail client, they have given me a reason to get up in the morning: to work on something that I knew would someday kick ass.

With the Maemo 5 based Nokia N900 device we’ll have Modest shipped by default, and Tracker actively used by several of its applications. The future is going to shine even brighter for Tracker. It’s hard to brag about it, since Tracker is inherently a background thing. Ah well, technical people know about it.

Having worked on Tracker for more than a year, I now understand Tracker’s potential. At first, while I was trying to make an API for, and a store of, summaries of E-mail envelope headers, so that E-mail clients can access them in a memory-efficient way, I was critical of this Tracker stuff.

But then I joined Ivan, Urho, Ottela, Martyn and Carlos, who were working on Tracker. Later Jürg joined, and at the Berlin Hackfest people like Rob Taylor, Jürg and Urho discussed replacing Tracker’s own, poorer ontology with Nepomuk and replacing its query language with SPARQL.

Given the implied complexity I was again critical, but then that crazy Jürg guy turned Tracker into 99.9% pure fine awesomeness in a few weeks’ time. I quickly joined the work on this crazy “vstore” branch. A few months ago we convinced the other Tracker guys to just start calling it “master”.

Ever since, I feel like a student again, learning how to develop software. Jürg is using so many good techniques, and we’re implementing so many specifications that are just “the right thing to do”, that the beauty of it all could sometimes make me cry with happiness.

By creating the opportunity to work on software that will be used on, for example, their N900 device, Nokia keeps giving people like me a reason to get up in the morning.

Don’t tell the native Nokians, but that’s why the N900 announcements secretly also made me a little bit proud. To all of us who worked on this stuff: guys, we’re doing a great job. Let’s make the next one even better!