Rearchitecting Tracker

Jürg and I have started working on the rearchitecture plans that we have for Tracker. You can follow the code being changed here and here.

What is finished?

  • Jürg took all database code out of the indexer. The indexer is now a consumer of tracker-store like any other. It commands tracker-store to store metadata, and it also queries tracker-store for things like the modification time (see the SPARQL sketch after this list). Currently it has no direct access to the database. This might change for performance reasons; we’re not sure about that yet.
  • The trackerd process got renamed to tracker-store.
  • The DBus object in tracker-store now executes the SPARQL Update requests itself. It used to forward these requests to tracker-indexer.

  • Jürg moved the watching and crawling code that used to be in the daemon to the indexer. This means that tracker-store doesn’t depend on inotify anymore. This work makes it possible to write your own indexer, or to have no indexer at all. This was quite a big task and got pushed today. It is of course being tested as we speak.

  • I wrote an internal API to queue database store requests, making it possible to deal asynchronously with large amounts of data when multiple metadata deliverers give tracker-store commands to store their metadata.
  • I also ported existing code to use this internal API. This task is ongoing and being tested. For example, the Turtle import, support for removable device caches in Turtle, the push modules (support for E-mail clients) and the DBus SPARQL Update API are affected by this.
  • The class signals feature, which no longer requires involvement of the indexer, got fixed.
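
To give an idea, here is a minimal sketch of the kind of SPARQL a consumer like the indexer now sends to tracker-store. It assumes the nie: and nfo: prefixes are predefined; the file URI and the timestamp are of course made up, and the exact ontology terms and Update syntax we end up with may differ slightly:

    # Ask tracker-store what it knows about a file, e.g. its modification time
    SELECT ?mtime WHERE {
      ?file nie:url 'file:///home/user/example.txt' ;
            nfo:fileLastModified ?mtime .
    }

    # Command tracker-store to store metadata with a SPARQL Update
    INSERT {
      _:file a nfo:FileDataObject ;
             nie:url 'file:///home/user/example.txt' ;
             nfo:fileLastModified '2009-04-01T12:00:00Z' .
    }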

What is left to do?

Right now the indexer will instruct an extractor process to extract metadata from a file. This extractor process communicates the metadata first to the indexer, which in turn communicates the same metadata to tracker-store. This can be done more efficiently by letting the extractor communicate the metadata directly to tracker-store.

We also have quite a few other plans for the indexer’s code. Those plans are a bit longer term; for example, splitting support for removable devices and the normal filesystem into two separate processes.

E-mail as a desktop service: this is how it should be done

While developing Tinymail, a library for writing E-mail clients, I was convinced that the storage of the summary was something Tinymail itself had to handle. Back then there was, even pragmatically, nothing that could cope with the requirements that E-mail on mobile devices puts on this task.

Meanwhile I got the opportunity to work on the Tracker project. Using the Nepomuk Ontology, I made sure that the message ontology that Tracker uses can actually handle these requirements. I believe that the adoption of Nepomuk and SPARQL evolved Tracker from something that wasn’t useful for E-mail software to something that should be involved when writing a desktop service for E-mail today.

Ryan, a pioneer in experimenting with E-mail as a desktop service, advised me to be careful with my bias for RDF and SPARQL. I’ll keep it in mind! However…

I believe such a desktop service for E-mail should:

  • Download metadata by getting and parsing ENVELOPE and BODYSTRUCTURE using the FETCH command of an IMAP server, as explained in this document (see the first example after this list).
  • Give priority to downloading the metadata of the E-mails near the user’s scroll position.
  • Use IMAP’s pipelining. It gives the user the feeling that his technology operates faster than his human brain, even on high-latency connections like GPRS.
  • Cache the information, using Tracker’s Nepomuk Message Ontology as schema.
  • Make it possible to fetch just one particular MIME part and not the entire message.
  • Enable it to create a new message consisting of individual MIME parts.
  • Make it possible for those MIME parts to have their source in existing messages on an IMAP server when creating a message. When the IMAP server supports CATENATE, it should be used for this purpose.
  • Make applications use SPARQL with Tracker’s NMO to query metadata about E-mails (see the second example after this list).
  • Provide a stream API to get access to the actual data of individual MIME parts. If not cached, the service should download the MIME part on demand. The DBus stream API should look like GInputStream, except for read(): for transferring the actual chunks of data I think Unix sockets or named pipes are better than D-Bus.
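
To make the first few points concrete, this is roughly the IMAP traffic I have in mind, assuming a mailbox has already been SELECTed; the message range and the part number are of course made up. The first command downloads only summary metadata for a range of messages, the second fetches a single MIME part of one message on demand (and both can be pipelined):

    a1 FETCH 1:50 (UID FLAGS ENVELOPE BODYSTRUCTURE)
    a2 FETCH 12 (BODY.PEEK[2.1])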
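
And this is a sketch of the kind of query an application could then run against the cache using Tracker’s NMO, assuming the nmo: prefix is predefined and without claiming these are the final property names:

    SELECT ?msg ?subject ?date WHERE {
      ?msg a nmo:Email ;
           nmo:messageSubject ?subject ;
           nmo:receivedDate ?date .
    } ORDER BY DESC(?date) LIMIT 20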

To this I would like to add that although many people falsely believe that E-mails are like files, E-mails are more like recursive directories (container MIME parts) with items: the E-mail’s MIME parts. Any API that doesn’t admit this is incorrectly designed.

This goes all the way up to the protocol, where you fetch per MIME part. You don’t fetch entire messages. You can indeed do that, but that doesn’t mean it isn’t wrong. IMAP is not POP3. It’s also better to design for IMAP than to use IMAP as a POP3 service; better to have hacks in your model to support POP3 than the other way around (I’m serious).

Please don’t make the same mistake nearly every newcomer to E-mail solutions makes. There’s plenty of rubbish already, seriously.