Rearchitecting Tracker

Jürg and I have started working on our rearchitecture plans for Tracker. You can follow the code being changed here and here.

What is finished?

  • Jürg took all database code out of the indexer. The indexer is now a consumer of tracker-store like any other: it commands tracker-store to store metadata, and it queries tracker-store for things like modification times. Currently it has no direct access to the database. This might change for performance reasons; we’re not sure about that yet.
  • The trackerd process got renamed to tracker-store.
  • The DBus object in tracker-store now executes the SPARQL Update requests itself. It used to send this request to tracker-indexer.

  • Jürg moved the watching and crawling code that used to be in the daemon to the indexer. This means that tracker-store no longer depends on inotify. This work makes it possible to write your own indexer, or to have no indexer at all. It was quite a big task and was pushed today; it is of course being tested as we speak.

  • I wrote an internal API to queue database store requests, making it possible to deal asynchronously with large amounts of data when multiple metadata deliverers give tracker-store commands to store their metadata.
  • I also ported existing code to use this internal API. This task is ongoing and being tested. For example, Turtle import, support for removable-device caches in Turtle, the push modules (support for e-mail clients) and the DBus SPARQL Update API are affected by this.
  • The class-signals feature was fixed; it no longer requires the involvement of the indexer.
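
As an illustration of the kind of request a metadata deliverer hands to tracker-store, a SPARQL Update could look roughly like the sketch below. The ontology prefix and property names are assumptions chosen for illustration, not necessarily Tracker’s actual ontology:

```sparql
# Hypothetical store command a consumer might send to tracker-store.
# The nie: prefix and the property names are illustrative assumptions.
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>

INSERT {
  <file:///home/user/photo.jpg> a nie:InformationElement ;
      nie:title "Holiday photo" ;
      nie:contentLastModified "2009-04-01T12:00:00Z" .
}
```

With the new architecture, any process (indexer, push module, Turtle importer) can queue such a request; tracker-store executes it itself rather than forwarding it to the indexer.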

What is left to do?

Right now the indexer instructs an extractor process to extract metadata from a file. This extractor process communicates the metadata first to the indexer, which in turn communicates the same metadata to tracker-store. This can be done more efficiently by letting the extractor communicate the metadata directly to tracker-store.

We also have quite a few other, less short-term plans for the indexer’s code, for example splitting support for removable devices and the normal filesystem into two processes.

7 thoughts on “Rearchitecting Tracker”

  1. Hi, I have an idea, but I’m not sure if it’s a good one. It involves packagers for distributions.
    If Tracker is installed by default by a distributor, then shouldn’t the distributor create an initial index right before the final build goes live?
    Then the time needed to create an index of the system, when a user tries Tracker for the first time on their fresh install, would be reduced, right?

  2. @tretle: right (and that’s up to distributors). They can, for example, make a Turtle file and import it right after package install. We’re still working on a backup function (which will also emit a Turtle file). You can already import Turtle files.
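    For illustration, a minimal importable Turtle file could look like this (the nie: prefix and property names are an example ontology, not necessarily what Tracker emits):

    ```turtle
    @prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> .

    <file:///usr/share/doc/README>
        a nie:InformationElement ;
        nie:title "README" ;
        nie:mimeType "text/plain" .
    ```

    A distributor could ship a pre-generated file like this and import it on first boot, so users start with a warm index.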

  3. Hi Philip,
    one thing I’m still not quite sure about: will Tracker become an RDF triple store, or a quadruple store that also answers the question “who told me about this triple”?
    What I’m aiming at: how will tracker differentiate between e.g. tags that are user-set and tags that were found by the indexer (and can be overwritten) ?
    Can this information somehow be preserved in the Turtle-formatted cache files on external media that you described before? (For this use case, it might be enough to just remember the specific cache file as the source, though.)

  4. @pixelpapst: We have ideas about storing quadruples in parallel with our decomposed triple store, indeed to support things like named graphs (which, as you probably know, will allow you to store tags while differentiating between who told you about the triple: remote, local, user-set, indexer-set, etc.).

    We also plan to use this capability for backup purposes, for example. And to give developers a store to write a synchronization framework for.

    However, this is long-term planning. We can definitely use your help if you are interested in this specific aspect.

    I asked Jürg to reply here too, as the quadruple ideas that we have came from him.

  5. @pixelpapst: At the moment, Tracker is only a triple store. However, we are considering adding support for named graphs / quadruple store as it would help in various situations. There is an overhead in disk space and indexing performance, though, so we have to make sure that the advantages of a quad store outweigh the disadvantages.
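    To sketch what named graphs would buy here (the graph and tag URIs below are made up for illustration), the same triple can be stored under a per-source graph and later queried back for its origin:

    ```sparql
    PREFIX nao: <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#>

    # Record which source asserted the tag by using a graph per source.
    INSERT {
      GRAPH <urn:source:indexer> {
        <file:///home/user/photo.jpg> nao:hasTag <urn:tag:beach> .
      }
    }

    # Later, ask who told us about this triple:
    SELECT ?source WHERE {
      GRAPH ?source {
        <file:///home/user/photo.jpg> nao:hasTag <urn:tag:beach> .
      }
    }
    ```

    The fourth element (the graph URI) is what turns the triple store into a quad store, and it is also the per-source bookkeeping that the extra disk space and indexing overhead would pay for.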

  6. Why is another triple store being invented? Virtuoso or Hexastore (using file:// URIs, or even inode integers to survive renames/moves) are certainly ‘doing one thing well’ in that category.

    Also, requiring SPARQL and D-Bus is not condoning fractal decomposition/reuse of the (meta)data, and ‘crawlers’ are universally annoying and certainly not necessary on a local FS. I’m much more interested in an inotify-type solution which updates the indexes whenever files of specific types are created (e.g. EXIF from JPEGs, HTTP-environment data from browser requests and the RDFa within those documents) and exposes this index data as directories which can be queried using sh and/or Python/Ruby etc.’s stdlib FS tools to compose queries (see the Reiser4 ‘vision’ docs from about 8 or 9 years ago), and of course exposing the basic join/set-intersection primitives in a composable manner rather than requiring the godawful SPARQL syntax – something that sadly most triple stores don’t do, but the hash-table-type DFS tools are providing (whether precalculated views à la Couch or Hadoop-esque reductions).

    I’ve already written much of what I described, but it is pure Ruby, so I’ve yet to offload the heavy bits to the kernel and mess with userspace bridges between FS events and the aforementioned stores.

    Heck, even a decent stab at automated indexing/querying of POSIX extended attributes would be fantastically useful compared to our crude hierarchical path treewalking as the only way to list groups of files. One can use URIs for the attribute names, after all.

    I would really only find this sort of automated indexing useful for certain directories anyway (mail, browser cache, media), and of course it also entails patching apps (browser caches being a few opaque blobs, completely unusable without the accompanying VM-beast firepig/webkit/opera and their hodgepodge of ad-hoc SQLite databases, bugs the crap out of me – not to mention the complete lack of interoperability between said implementations in anything other than completely vain/trivial things like CSS ACID-test compliance). Bottling up the web in some mainframe-pickle track is preventing an evolution to a distributed/decentralized web, and ignoring all the metadata inherent in our files is just silly.
    Well, at least I can agree with you on that last point!
