Tumbler

For the last few weeks I have been working on the new thumbnail infrastructure for future Maemo products.

Last year I wrote a specification for requesting thumbnails over D-Bus. Afterward I made a quick prototype and replaced Maemo’s hildon-thumbnailer library with it. This prototype will be deployed on the standard N900 image; it’s too late to replace Fremantle’s thumbnailer with the new stuff, because it takes time to properly test it.

While I was developing both the specification and the prototype, XFCE developer Jannis Pohlmann contacted me about rewriting my prototype for use in the XFCE project. Tumbler was born.

The nice people at Nokia are more interested in working with upstream projects than in maintaining their own products separately, so I shifted my focus from hildon-thumbnail to contributing to Jannis’ Tumbler project.

We realized that we needed different kinds of schedulers, so while Jannis was developing Tumbler I kindly asked him to consider abstracting the scheduling a bit. Tumbler now has two schedulers. The background one sets its I/O and CPU scheduling priorities to IDLE and processes its thumbnail tasks in FIFO order. The foreground one uses LIFO order and, instead of grouping Ready signals together, emits them immediately after each single thumbnail is finished. The default is of course the foreground scheduler.
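
For the curious, this is roughly what “priorities to IDLE” boils down to on Linux. A minimal sketch, not Tumbler’s actual code; the ioprio constants are copied from the kernel headers since glibc doesn’t wrap ioprio_set:

#define _GNU_SOURCE
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_WHO_PROCESS  1

static void
become_background_scheduler (void)
{
	struct sched_param param = { 0 };

	/* CPU: only run when nothing else wants the processor */
	sched_setscheduler (0, SCHED_IDLE, &param);

	/* I/O: only touch the disk when it is otherwise idle */
	syscall (SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
	         IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT);
}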

We also realized that thumbnail flavors are going to be platform specific, so we added support for this in the D-Bus APIs, which we further fine-tuned and versioned.
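
As an illustration, requesting a thumbnail over the D-Bus API looks roughly like this. This is a sketch following the draft specification of the time; the exact service name, flavor names and Queue() signature may differ in the released API:

dbus-send --session --print-reply \
   --dest=org.freedesktop.thumbnails.Thumbnailer1 \
   /org/freedesktop/thumbnails/Thumbnailer1 \
   org.freedesktop.thumbnails.Thumbnailer1.Queue \
   array:string:"file:///tmp/photo.png" \
   array:string:"image/png" \
   string:"normal" string:"foreground" uint32:0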

Congratulations and appreciation to Jannis, who made Tumbler’s code and design really nice. Also thanks a lot for constructively considering our requirements and helping adapt Tumbler’s code to cope with them.

I know you, for example, worked one long night on this stuff, so I officially owe you a few beers and/or cocktails at the next conference.

How about FOSDEM?

Keeping the autotools guys happy with qmake

I’m still figuring out how to do the same thing with CMake, but various bloggers and commenters appear to promise that it’ll be even easier.

But this is a message for probably all Nokia teams who are making Qt-based libraries:

First open your src/src.pro file and add this stuff:

# generate .pc (and .prl) files so that pkg-config can find the library
CONFIG += create_pc create_prl
QMAKE_PKGCONFIG_REQUIRES = QtGui
QMAKE_PKGCONFIG_DESTDIR = pkgconfig
# assumes target.path, headers.files and headers.path are defined
# elsewhere in this .pro file
INSTALLS += target headers

Now open your debian/$package-dev.install file and add this line:

usr/lib/pkgconfig

You’ll be doing all the autotools people a tremendous favor.
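
To illustrate what they gain: with the .pc file installed, an autotools project can pick up your library with the standard macro. The library and variable names below are hypothetical:

# configure.ac of a consumer project
PKG_CHECK_MODULES([MYQTLIB], [myqtlib >= 1.0])

# Makefile.am of the same project
myapp_CPPFLAGS = $(MYQTLIB_CFLAGS)
myapp_LDADD = $(MYQTLIB_LIBS)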

Next, open the README file and document that on Debian you need to use qmake-qt4, or make either qmake-qt3 or qmake-qt4 work flawlessly with your build environment. Perhaps also mention how to set the install prefix, how to make qmake find and install .pc files in another location, things like that. I find that this is lacking for almost every Qt-based library.

You’ll be doing everybody who wants to use your software a tremendous favor.

Database cursors used in Tracker

A cursor on a database query is a finger pointing to the current row. Most databases implement this without pulling the entire resultset into memory. It’s indeed much like a C/C++ pointer, except that a pointer can only point to memory in your process’ virtual memory. A cursor is a bit more abstract.

POSIX developers can compare a cursor with a pointer to a region in an mmap. For people who don’t know about mmap: mmap can be used to map a file into your process’ memory. You get a C/C++ pointer back, from which you can read the data as if it were in memory. With mmap, when you trigger a pagefault, the kernel pulls pages into memory (from the file, or whichever resource is behind the mapping).

In Tracker, all database operations used to work much like g_file_get_contents: you read the entire thing into memory, and then you operate on that memory. Internally it used the database’s cursor API too, of course; sqlite3_step() is SQLite’s cursor API.

First the database filled up its pagecache with this data, then you copied it to your application’s memory, then you used it, then you freed it.

That’s kinda silly! Why not use it straight from the database’s caches instead? That’s what you use a DB cursor for.

The result is less copying of memory. This means less memory fragmentation and fewer memory operations to perform (which should result in a small performance improvement).

This effort is ongoing, but a lot of Tracker’s internal loops over resultsets now use a cursor instead of an in-memory resultset.
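
As a minimal sketch of the difference, this is how a resultset is walked with SQLite’s cursor API; each row is read straight from SQLite’s caches instead of being copied into a full in-memory list first. Table and column names are hypothetical:

#include <stdio.h>
#include <sqlite3.h>

static void
print_titles (sqlite3 *db)
{
	sqlite3_stmt *stmt = NULL;

	if (sqlite3_prepare_v2 (db, "SELECT title FROM resources",
	                        -1, &stmt, NULL) != SQLITE_OK)
		return;

	/* each sqlite3_step() advances the cursor by one row */
	while (sqlite3_step (stmt) == SQLITE_ROW)
		printf ("%s\n", (const char *) sqlite3_column_text (stmt, 0));

	sqlite3_finalize (stmt);
}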

A reason to get up in the morning

Ever since Nokia contacted me about improving Tinymail to make it suitable for their Modest E-mail client, they have given me a reason to get up in the morning: to work on something that I knew would someday kick ass.

With the Maemo5-based Nokia N900 device we’ll have Modest shipped by default, and Tracker actively used by several of its applications. The future is going to shine even brighter for Tracker. It’s hard to brag about it, as Tracker is inherently a background thing. Ah, well, technical people know about it.

Having worked on Tracker for more than a year, I now understand Tracker’s potential. At first, while I was trying to make an API for storing summaries of E-mail envelope headers, so that E-mail clients can access them in a memory-efficient way, I was critical of this Tracker stuff.

But then I joined Ivan, Urho, Ottela, Martyn and Carlos, who were working on Tracker. Later Jürg joined, and at the Berlin Hackfest people like Rob Taylor, Jürg and Urho discussed replacing Tracker’s own, poorer ontology with Nepomuk and replacing its query language with SPARQL.

Given the implied complexity I was again critical, but then that crazy Jürg guy turned Tracker, in a few weeks’ time, into 99.9% pure fine awesomeness. I quickly joined the work on this crazy “vstore” branch. A few months ago we convinced the other Tracker guys to just start calling it “master”.

Ever since, I feel again like a student who is learning how to develop software. Jürg is utilizing so many good techniques, and we’re implementing so many specifications that are just “the right thing to do”, that the beauty of it all could sometimes make me cry with happiness.

By creating the opportunity to develop software that will be used on, for example, their N900 device, Nokia continues to give people like me a reason to get up in the morning.

Don’t tell the native Nokians, but that’s why the N900 announcements secretly also made me a little bit proud. To whoever among us worked on this stuff: guys, we’re all doing a great job. Let’s make the next one even better!

SPARQL’s str() function in Tracker

Today I implemented the str() function for our SPARQL engine.

This makes it possible to use a <subject> just like a string.

Let’s first insert some data into our SPARQL store.

tracker-sparql -u -q \
   "INSERT { <urn:baaa> a rdfs:Resource }"

The following query doesn’t work, as the variable ?s isn’t bound to an xsd:string here, but to a rdfs:Resource.

tracker-sparql -q \
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER REGEX (?s, '.*baaa', 's')
}"

This version works, because here we use the str() function.

tracker-sparql -q \
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER REGEX (str(?s), '.*baaa', 's')
}"
  urn:uuid:94baaa45-99a6-e0f4-0bd9-f83ca90a9039
  urn:uuid:6e909006-a6ac-baaa-2ae4-cc01adcd5de7
  urn:baaa

You can also use a direct match, of course.

tracker-sparql -q \
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER (str(?s) = 'urn:baaa')
}"
  urn:baaa

By the way, Ivan made a cute tool in Python for typing in your queries.

It even does some code completion. If you type nco:[TAB] it’ll show you the NCO ontology. Nice!

More introduction to RDF and SPARQL

Introduction

I plan to give an introduction to features like COUNT, FILTER REGEX and GROUP BY, which are supported by Tracker‘s SPARQL engine. We support more such features, but I have to start the introduction somewhere, and overloading people with introductions to all the features won’t help me much with explaining things.

Since my last introduction to RDF and SPARQL I have added a few relationships and actors to the game.

Morrel, Max and Sasha are dogs, Sheeba and Query are cats, Picca is still a parrot, and Fred and John are contacts. Fred claims that John is his friend. I changed the ontology to allow friendships between the animals too: Sasha claims that Morrel and Max are her friends, and Sheeba claims Query is her friend. John bought Query. Fred, inspired by John, decided to also get some pets: Morrel, Sasha and Sheeba.

Ontology

Let’s put this story in Turtle:

<test:Picca> a test:Parrot, test:Pet ;
	test:name "Picca" .

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .

<test:Morrel> a test:Dog, test:Pet ;
	test:name "Morrel" ;
	test:hasFriend <test:Max> .

<test:Sasha> a test:Dog, test:Pet ;
	test:name "Sasha" ;
	test:hasFriend <test:Morrel> ;
	test:hasFriend <test:Max> .

<test:Sheeba> a test:Cat, test:Pet ;
	test:name "Sheeba" ;
	test:hasFriend <test:Query> .

<test:Query> a test:Cat, test:Pet ;
	test:name "Query" .

<test:John> a test:Contact ;
	test:owns <test:Max> ;
	test:owns <test:Picca> ;
	test:owns <test:Query> ;
	test:name "John" .

<test:Fred> a test:Contact ;
	test:hasFriend <test:John> ;
	test:name "Fred" ;
	test:owns <test:Morrel> ;
	test:owns <test:Sasha> ;
	test:owns <test:Sheeba> .

Querytime!

Let’s first start with all friend relationships:

SELECT ?subject ?friend
WHERE { ?subject test:hasFriend ?friend }

  test:Morrel, test:Max
  test:Sasha, test:Morrel
  test:Sasha, test:Max
  test:Sheeba, test:Query
  test:Fred, test:John

Just counting these is pretty simple. In SPARQL all selectable fields must have a variable name, so we add the “AS c” here.

SELECT COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend }

  5

We counted friend relationships, of course. Let’s say we want to count how many friends each subject has. This is a more interesting query than the previous one.

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend }
GROUP BY ?subject

  test:Fred, 1
  test:Morrel, 1
  test:Sasha, 2
  test:Sheeba, 1

Actually, we’re only interested in the human friends:

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend .
        ?friend a test:Contact
} GROUP BY ?subject

  test:Fred, 1

No no, we are only interested in friends that are either cats or dogs:

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend .
       ?friend a ?type .
       FILTER ( ?type = test:Dog || ?type = test:Cat)
} GROUP BY ?subject

  test:Morrel, 1
  test:Sasha, 2
  test:Sheeba, 1

Now we are only interested in friends that are either a cat or a dog, but this time only counted for subjects whose name starts with an ‘S’.

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend ;
                 test:name ?n .
       ?friend a ?type .
       FILTER ( ?type = test:Dog || ?type = test:Cat) .
       FILTER REGEX (?n, '^S', 'i')
} GROUP BY ?subject

  test:Sasha, 2
  test:Sheeba, 1

Conclusions

Should we stop talking about ontologies and start talking about searchboxes and user interfaces instead? Although I certainly agree more UI-stuff is needed, I’m not sure yet. RDF and SPARQL are also about relationships and roles. Not just about matching stuff. Whenever we explain the new Tracker to people, most are stuck with ‘matching’ in their mind. They don’t think about a lot of other use-cases.

Such a search is just one use-case starting point: the user entered a random search string and gave no other hint about what he needs. Many more situations can be starting points: when I select a contact in a user interface designed to show an archive of the messages that he once sent to me, the searchbox becomes much more narrow, much more helpful.

As soon as you have RDF and SPARQL, and with Tracker you do, an application developer can start taking into account relationships between resources: The relationship between a contact in Instant Messaging and the attachments in an E-mail that he as a person has sent to you. Why not combine it with friendship relationships synced from online services?

With a populated store you can make the relationship between a friend who joined you on a trip, and photos of a friend of your friend who suggested the holiday location.

With GeoClue integration we could link his photos up with actual location markers. You’d find these photos that came from the friend of your friend, and we could immediately feed the location markers to the GPS software on your phone.

I really hope application developers have more imagination than just global searchboxes.

And this is just a use-case that is technically already possible with today’s high-end phones.

Introduction to RDF and SPARQL

Let’s start with a relatively simple graph. The graph shows the relationships between John, Fred, Max and Picca. John and Fred are humans who we’ll refer to as contacts. Max and Picca are pets. Max is a dog and Picca is a parrot. Both Picca and Max are owned by John. Fred claims that John is his friend.

If we wanted to represent this story semantically, we would first need to make a dictionary that describes pets, contacts, dogs and parrots. The dictionary would also describe possible relationships, like ownership of a pet and the friendship between two contacts. Don’t forget: making something semantic means that you want to give meaning to the things that interest you.

Giving meaning is exactly what we’ll start with. We will write the schema for making this story possible. We will call this an ontology.

We describe our ontology using the Turtle format. In Turtle you can have prefixes. The prefix test: for example is the same as writing <http://www.tracker-project.org/ontologies/test#>.

In Turtle you describe statements by giving a subject, a predicate and then an object. The subject is what you are talking about. The predicate is the aspect of the subject that you are talking about. And finally the object is the value. This value can be a resource or a literal.

When you write a . (a dot) in Turtle, it means that you are done describing the subject. When you write a ; (semicolon), it means that you continue with the same subject but start describing a new predicate. When you write a , (comma), it means that you continue with the same subject and predicate and only add another object. The same rules apply in the WHERE section of a SPARQL query. But first things first: the ontology.

Note that the “test” ontology is not officially registered at tracker-project.org. It serves merely as an example.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix test: <http://www.tracker-project.org/ontologies/test#> .

test: a tracker:Namespace ;
	tracker:prefix "test" .

test:Entity a rdfs:Class .

test:Contact a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Pet a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Dog a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Parrot a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:name a rdf:Property ;
	rdfs:domain test:Entity ;
	rdfs:range xsd:string .

test:owns a rdf:Property ;
	rdfs:domain test:Contact ;
	rdfs:range test:Pet .

test:hasFriend a rdf:Property ;
	rdfs:domain test:Contact ;
	rdfs:range test:Contact .

Now that we have meaning, we will introduce the actors: Picca, Max, John and Fred. First copy the @prefix lines of the ontology from above into the ontology file, put that file in the share/tracker/ontologies directory, run tracker-processes -r, and restart tracker-store from master. After doing all that you can store the snippet below as /tmp/import.ttl and run tracker-import /tmp/import.ttl; it should import just fine, ready for the queries below to be executed with the tracker-sparql -q ‘$query’ command.

Note that tracker-processes -r destroys all your RDF data in Tracker. We don’t yet support adding custom ontologies at runtime, so for doing this test you have to start everything from scratch.
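
Summarized as shell steps (the ontology file name and install prefix are examples):

cp test.ontology /usr/share/tracker/ontologies/
tracker-processes -r
tracker-store &
tracker-import /tmp/import.ttl
tracker-sparql -q "SELECT ?s WHERE { ?s a test:Parrot }"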

<test:Picca> a test:Parrot, test:Pet ;
	test:name "Picca" .

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .

<test:John> a test:Contact ;
	test:owns <test:Max> ;
	test:owns <test:Picca> ;
	test:name "John" .

<test:Fred> a test:Contact ;
	test:hasFriend <test:John> ;
	test:name "Fred" .

Let’s do some simple SPARQL queries. You can execute these queries this way:

tracker-sparql -q "SELECT ?subject WHERE { ?subject a test:Parrot }"

In this query we ask for the subject of each entity that is a parrot. The query will yield test:Picca because Picca is the only parrot in our situation.

  test:Picca

Usually we aren’t interested in the subject, but in a real property of the parrot. We can ask for such a property this way:

SELECT ?subject ?name WHERE { ?subject a test:Parrot ; test:name ?name}
  test:Picca, Picca

Another simple example, give me all the contacts:

SELECT ?subject WHERE { ?subject a test:Contact }
  test:John
  test:Fred

Just the contacts doesn’t illustrate much. Give me all contacts that have a friend, and display the contact’s and the friend’s names:

SELECT ?name ?friend
WHERE { ?subject test:hasFriend ?f ;
                 test:name ?name .
        ?f test:name ?friend }
  Fred, John

Let’s ask for all the pets that are owned:

SELECT ?subject WHERE { ?unknown test:owns ?subject }
  test:Max
  test:Picca

Oh, not the subject. The names. How did we do that again? Right:

SELECT ?name
WHERE { ?unknown test:owns ?subject .
        ?subject test:name ?name }
  Max
  Picca

This will of course yield the same results in our situation:

SELECT ?name
WHERE { <test:John> test:owns ?subject .
        ?subject test:name ?name }
  Max
  Picca

But this won’t: Fred doesn’t own any pets, only John does.

SELECT ?name
WHERE { <test:Fred> test:owns ?subject .
        ?subject test:name ?name }

Let’s print the owner’s and the pet’s names:

SELECT ?owner ?name
WHERE { ?unknown test:owns ?subject ;
                 test:name ?owner .
        ?subject test:name ?name }
  John, Max
  John, Picca

Still with me? Let’s now conclude with requesting the names of the contacts who are a friend of the person who owns Picca:

SELECT ?name
WHERE { ?subject test:owns <test:Picca> .
        ?unknown test:hasFriend ?subject ;
                 test:name ?name }
  Fred

An invitation for Jürg and Rob: how about you guys writing an introduction to OPTIONAL, SUM, COUNT, GROUP BY, FILTER, etc. in SPARQL? :-) The more advanced stuff.

The subject of a resource, Nepomuk’s isStoredAs

After the many discussions the Tracker team had at the Desktop Summit in Gran Canaria, I think a lot of people will start trying out Tracker’s master. We will indeed start making 0.7.x releases somewhere this or next month.

Meanwhile I’d like to point out that among the decisions we made during the meetings and at the Ontology BOFs is that we won’t use the URL of a resource as the RDF subject anymore. Instead we’ll use the nie:isStoredAs predicate for storing the URL.

Right now we already set nie:isStoredAs, but we still use the URL as subject. This will change, though. Just treat the subject as a unique piece of data about the resource, pointing at it (in the RDF store). More details can be found here. If you want the thing itself (the file, the E-mail, the .desktop file, the website’s URL), ask for nie:isStoredAs.

For example:

<file:///tmp/myfile.png> a nfo:FileDataObject .
<urn:nepomuk:file:d7ea...> a nfo:Image ;
	nie:isStoredAs <file:///tmp/myfile.png> .

And to query:

tracker-sparql -q "SELECT ?url WHERE { ?subject a nfo:Image ; nie:isStoredAs ?url }

We know that many people want these 0.7.x releases to happen soon. I can only invite those people to just join coding. Awesome stuff is indeed taking place, but at the same time there is a lot of work and decision making to do.

Things like a user interface similar to the T-S-T (Tracker Search Tool) from Tracker 0.6, and documentation with a lot of examples. SPARQL, SPARQL Update and Nepomuk all have quite a lot of documentation by themselves, but people are still asking for even more examples. Anybody interested in making that? Maybe somebody who was at Rob Taylor’s BOF could write down his and Jürg’s lectures on RDF and SPARQL? I think they explained it all very well.

Tracker experimental merged to main development tree, Ivan’s presentation

I’m currently involved in the Tracker project and our project will be presented by Ivan Frade at the Desktop Summit this Sunday.

We merged our experimental branch tracker-store to master. This means that our rearchitecture plans for Tracker have mostly been implemented and are being pushed forward in the main development tree.

I will start with a comparison with Tracker’s 0.6.x series.

Tracker master:

  • Uses SPARQL as its query language
  • Uses Nepomuk for its base ontologies
  • Supports SPARQL Update
  • Supports the aggregates COUNT, AVG, SUM, MIN and MAX in SPARQL
  • Runs all its storage functionality as a separate binary
  • Runs all its indexing, crawling and monitoring functionality in a separately packageable binary

Tracker 0.6.9x:

  • Uses RDFQuery as its query language
  • Has its own ontology
  • Has very limited support for storing your own data
  • Supports several aggregate functions in its query language
  • Runs all its storage functionality in the indexer
  • Runs all its query functionality in the permanent daemon
  • Does file monitoring and crawling in the permanent daemon
  • Runs all its indexing functionality in a separately packageable binary

Architecture of Tracker master

The storage service uses the Nepomuk ontologies as schema. It allows you to both query and insert or delete data.

The fs-miner finds, crawls and monitors file resources. It also analyses those files and extracts the metadata. It instructs the storage service to store the metadata.

External applications and other such miners are allowed to both query and insert using the storage service. Priority is given to queries over stores.

Plugins that run in-process with the application can push information into Tracker. We indeed don’t try to scan Evolution’s cache formats; we have a plugin that gets the data out of Evolution and into Tracker.

Storage service’s API and IPC

The storage service gives priority to SELECT queries to ensure that apps in need of metadata get serviced quickly.

INSERT and DELETE calls get queued; SELECT ones get executed immediately. For apps that require consistency and/or insertion speed we provide a batch mode that has a commit barrier. When the commit calls back, you know that everything that came before it is in a consistent shape. We don’t support full transactions with rollback.

The standard API operates over DBus. This means while using it you are subject to DBus’s performance limitations. In SPARQL Update it is possible to group a lot of writes. Due to DBus’s latency overhead this is recommended when inserting larger sets of data. We’re experimenting with a custom IPC system, based on unix sockets, to get increased throughput for apps that want to put a lot of INSERTs onto our queue.
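
For example, a batched insert like this one (resource names are made up) costs one D-Bus round trip instead of three:

INSERT {
  <urn:example:photo1> a nfo:Image ; nie:title "Beach" .
  <urn:example:photo2> a nfo:Image ; nie:title "Sunset" .
  <urn:example:photo3> a nfo:Image ; nie:title "Dunes" .
}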

We provide a feature that signals on changes happening to certain types. You can see this as a poor man’s live search. Full live search for SPARQL is fairly complicated; maybe in the future we’ll implement something like that.

Ontology

We support the majority of the Nepomuk base ontologies, and our so-called filesystem miners store found metadata using Nepomuk’s ontologies. We support static custom ontologies right now, meaning that it’s impossible to dynamically add a new ontology unless you reset the entire database first.

We’re planning to support dynamically adding and removing ontologies. The ontology format that we use is Turtle.

Backup and import

Right now we support loading data into our database using SPARQL Update, using an experimental unix-socket based IPC, or by passing us a Turtle file.

We currently have no support for making a backup. Support for this is high on the priority list. It will write a Turtle file (which can be loaded back afterward).

Backup and import of ontology specific metadata

When we introduce support for custom ontologies, it’ll be useful for apps that provide their own custom ontology to get a backup of just the data that is relevant to said ontology. We plan to provide a method to do that.

Volume support

Thanks to a static custom ontology for volume support, volumes and their status are queryable over SPARQL. File resources also get linked to said volumes. This makes it possible to get the availability of a file resource: for example, return metadata about all photos that are located on a specific camera, even though the camera isn’t connected to this device.

Volume support is a work in progress at this moment.

Rearchitecting Tracker

Jürg and I have started working on the rearchitecture plans that we have for Tracker. You can follow the code being changed here and here.

What is finished?

  • Jürg took all database code out of the indexer. The indexer is now a consumer of tracker-store like any other: it commands tracker-store to store metadata, and it queries tracker-store for things like the modification time. Currently it has no direct access to the database. This might change for performance reasons; we’re not sure about that yet.
  • The trackerd process got renamed to tracker-store.
  • The DBus object in tracker-store now executes the SPARQL Update requests itself. It used to send this request to tracker-indexer.

  • Jürg moved the watching and crawling code that used to be in the daemon to the indexer. This means that tracker-store doesn’t depend on inotify anymore. This work made it possible to make your own indexer or not to have an indexer at all. This was quite a big task and got pushed today. This is of course being tested as we speak.

  • I wrote an internal API to queue database store requests, making it possible to asynchronously deal with large amounts of data when multiple metadata deliverers give tracker-store commands to store their metadata.
  • I also ported existing code to use this internal API. This task item is ongoing and being tested. For example the Turtle Import, support for removable device caches in Turtle, Push modules (support for E-mail clients) and the DBus SPARQL Update API are affected by this.
  • The class signals feature, which now doesn’t require involvement of the indexer, got fixed.

What is left to do?

Right now the indexer instructs an extractor process to extract metadata from a file. This extractor process communicates the metadata first to the indexer, which in turn communicates the same metadata to tracker-store. This can be done more efficiently by letting the extractor communicate the metadata directly to tracker-store.

We also have quite a few other plans for the indexer’s code. Those plans are less short-term: for example, splitting support for removable devices and the normal filesystem into two processes.

E-mail as a desktop service, this is how it should be done

While developing Tinymail, a library for writing E-mail clients, I was convinced that the storage of the summary was something Tinymail itself must handle. Back then there was, even pragmatically, nothing that could cope with the requirements of E-mail on mobile devices for this task.

Meanwhile I got the opportunity to work on the Tracker project. Using the Nepomuk Ontology, I made sure that the message ontology that Tracker uses can actually handle these requirements. I believe that the adoption of Nepomuk and SPARQL evolved Tracker from something that wasn’t useful for E-mail software into something that should be involved when writing a desktop service for E-mail today.

Ryan, a pioneer in experimenting with E-mail as a desktop service, advised me to be careful with my bias for RDF and SPARQL. I’ll keep it in mind! However…

I believe such a desktop service for E-mail should:

  • Download metadata by getting and parsing ENVELOPE and BODYSTRUCTURE using the FETCH command of an IMAP server, as explained in this document (see the session sketch after this list).
  • Give priority to downloading metadata of the E-mails nearby the user’s scroll position.
  • Use IMAP’s pipelining. It gives the user the feeling that his technology operates faster than his human brain, even on high-latency connections like GPRS.
  • Cache the information, using Tracker’s Nepomuk Message Ontology as schema.
  • Make it possible to fetch just one particular MIME part and not the entire message.
  • Make it possible to create a new message, consisting of individual MIME parts.
  • Make it possible for those MIME parts to have their source in existing messages on an IMAP server when creating a message. When the IMAP server supports CATENATE, it should be used for this purpose.
  • Make applications use SPARQL with Tracker’s NMO to query metadata about E-mails.
  • Provide a stream API to get access to the actual data of individual MIME parts. If a part is not cached, the service should download it on demand. The D-Bus stream API should look like GInputStream, except for read(): for transferring the chunks of data I think Unix sockets or named pipes are better than D-Bus.
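
To make the first bullet point and the MIME-part fetching concrete, here is a sketch of an IMAP session; message numbers, part numbers and the literal size are made up:

C: a1 FETCH 1:20 (ENVELOPE BODYSTRUCTURE)
S: * 1 FETCH (ENVELOPE (...) BODYSTRUCTURE (...))
S: ...
S: a1 OK FETCH completed
C: a2 FETCH 5 (BODY.PEEK[2])
S: * 5 FETCH (BODY[2] {3105}
S: ...3105 bytes of just that MIME part...
S: )
S: a2 OK FETCH completed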

To this I would like to add that although many people falsely believe that E-mails are like files, E-mails are more like recursive directories (container MIME parts) with items: the E-mail’s MIME parts. Any API that doesn’t admit this is incorrectly designed.

This goes all the way up to the protocol, where you fetch per MIME part. You don’t fetch entire messages. You can indeed do that, but that doesn’t mean it isn’t wrong. IMAP is not POP3. It’s also better to design for IMAP than to use IMAP as a POP3 service. Better to have hacks to support POP3 in your model (I’m serious).

Please don’t make the same mistake nearly every newcomer to E-mail solutions makes. There’s plenty of rubbish already, seriously.

Tracker, our near future plans

Apparently

Apparently this hasn’t been echoed enough times. A lot of teams are still wondering what they should use if they want to store RDF metadata in Nepomuk and how to query it.

What happened before

We have been refactoring to bring Tracker’s codebase into a better state; this is being released as Tracker 0.6.9x. One sentence is really not enough to describe the changes, but we can’t continue talking about the past forever. Sorry guys.

We have introduced support for SPARQL and Nepomuk in Tracker. We also added the class-signals feature, Turtle import & export, and many other features like SPARQL Update support, making the storage engine effectively a generic Nepomuk RDF store that can be used to store and query RDF data.

What will happen

We are at this moment planning to rearchitect Tracker a little bit.
Among our plans we want to make the RDF metadata store standalone. The store stores your metadata using Nepomuk as ontology and enables the application developer to query in SPARQL. This means that it’ll be possible to use this storage service without the indexer even installed. This is already possible but right now we do the crawling and monitoring in the storage service.

We plan to move the crawling and monitoring to the indexer. One idea is that the indexer will instruct the extractor to do an analysis and then the extractor will push the extracted metadata to the RDF storage service. Making the indexer and extractor a provider & consumer like any other. Making them optional and separately packagable.

This is because we get requests from other teams who don’t want the indexing. Modularizing is usually a good thing, so we now have plans to make this possible as a feature.

Other plans

Other plans that we haven’t thoroughly planned yet include support for custom ontologies. We have a good idea for this, though we want to wait with it until after the rearchitecting. Support for custom ontologies will include removing ontologies, installing ontologies, and asking for a backup that’ll contain the metadata that is specific to an installed ontology.

Support for custom ontologies doesn’t mean that application developers should all go spastic and start making ontologies. I know you guys! Don’t do it! We want applications to reuse as much of the Nepomuk set as possible. The more Nepomuk gets reused, the more interoperability between apps is possible.

Volume support in experimental Tracker

In Tracker’s trunk we have support for volumes. This means that we track removable devices appearing and disappearing. When a removable device disappears, your search results won’t include the resources that are on the disconnected device.

In trunk we keep a separate table with volume registrations for this.

In master we simply use the ontology, which also means that we now make this information available to you as metadata in a clean way.

Some examples. List all the volumes that we know about:

tracker-sparql -q "SELECT ?o ?m ?z WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:unmountDate ?z }"

That will return something like this (wrapping the line):

urn:nepomuk:datasource:/org/freed../Hal/devices/volume_uuid_XXX_ABCDE,
    file:///media/USBStick, 2009-04-01T13:38:20

The ones that we know about but aren’t mounted:

tracker-sparql -q "SELECT ?o ?m WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:isMounted false } "

The ones that we know about and are mounted:

tracker-sparql -q "SELECT ?o ?m WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:isMounted true } "

Let’s just for fun list the volumes that got unmounted before a specific date and didn’t get mounted anymore:

tracker-sparql -q "SELECT ?o ?m WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:isMounted false ;
   tracker:unmountDate ?z .
   FILTER (?z < \"`date -u +"%Y-%m-%dT%H:%M:%S%z"`\") }"

I'd like to add that replacing HAL isn't our intention. Not at all. We depend on HAL to track this information ourselves. We need to know the availability of a volume, and because we link every file resource to the volume's resource, we can that way know about the availability of each resource.

It's under Tracker's own ontology prefix, and it's not really to be considered a decided or stable ontology API yet. We might change some things about it. It wasn't even a design goal to make this publicly accessible.

Why am I showing it? Well, because it's a nice way to explain one of our sprint tasks for the next two weeks. I promised you guys that I would talk about what we are up to at Tracker. So here you have it!

Documentation. Documentation. Documentation!

As I was rereading Ivan Frade’s blog post about the class-signals feature that Ivan and I developed for Tracker last week, I got reminded of something I wrote a long time ago:

In my opinion, there’s nothing as important as developer documentation for a framework or library.

Ivan decided to more or less dump our internal planning document on GNOME’s Live wiki service. He didn’t write the document so he’s of course not to blame, but the document was in my opinion not suitable as end-user framework documentation. Application developers should not be required to understand what goes on in our team’s minds. So I rewrote it.

You can find the document at the SignalsOnChanges page. Let’s see if starting next week I can convince the other team members to document their new features in a similar way. Many new things in Tracker’s experimental branch could use more clarification and examples.

I happen to believe that undocumented libraries and frameworks aren’t meaningful. By letting his work remain undocumented, a developer renders it utterly irrelevant for a serious application developer. I also believe that undocumented infrastructure software is worse than no infrastructure software.

I strongly recommend that application developers refrain from using any undocumented library or framework. Don’t lower yourself to their standards. Demand documentation by refusing to use undocumented infrastructure. Doing it as free software is no excuse for delivering the extremely low quality that undocumented infrastructure is (in my opinion).

Thanks to Ivan’s blog post for reminding me of what I once wrote, and still believe in today. While passionately coding new features you sometimes forget about this.

Which is unacceptable.

Tracker’s class signals, live search capabilities

Ivan Frade already blogged this, so that means I can keep this one shorter.

But yeah, we have just finished what we decided to call class-signals for Tracker. You are looking at software development taking place as we speak, because we still need to do a peer-review of the branch and then merge it to our experimental branch. This will probably happen in a few minutes.

The new feature means that if a rdfs:Class in the ontology has the property tracker:notify set to true, then Tracker will offer a DBus object for that class. Whenever a subject of that class is added, updated or removed, you will receive a signal about it on that DBus object.

Ok, ok, I know. You don’t want to click links and search for it yourself. Here are some “screenshots”:

nmo:FeedMessage a rdfs:Class ;
         tracker:notify true ;
         rdfs:subClassOf nmo:Message .

The DBus API is split per rdfs:Class. That’s because we don’t want a lot of applications listening for each and every change to each and every subject. Now they can instead select just the classes that they are interested in, and avoid getting woken up constantly by our damn signal.
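
For illustration, you could watch such a per-class object with dbus-monitor. The interface and object path naming here follows our draft and may still change:

dbus-monitor --session "type='signal',interface='org.freedesktop.Tracker.Resources.Class',path='/org/freedesktop/Tracker/Resources/Classes/nmo/FeedMessage'"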

This feature is a change signal mechanism. We will eventually implement support for full live-queries.

Supporting full live-queries means that we’ll need to store some state about your session to test whether your SPARQL query requires an update from Tracker to get your client’s previous result list synchronized. It’s quite a bit more tricky, especially if you want to support most of SPARQL to go together with the live-query notifications, and especially if you want to be accurate.

Finally you don’t want the live-query capability to be a huge performance burden each time material becomes updated.

This is why we are now putting in place this feature. We think that for 90% of the applications the capabilities of the class-signals feature will be sufficient for their live update needs.

Biweekly Tracker development planning

We realize

At the Tracker team we realize that we should not only tell you guys about Tracker’s new features. We should also involve you in the planning and realization of software development at the Tracker project.

I have discussed this with our team, and we agreed that I will post a summary at the beginning of each of our biweekly plannings.

The reason is fairly simple: we want to involve the community and inspire them to become active contributors. But how can they contribute unless they know what the plans of the development team of Tracker for the next few days are? You are right, they can’t.

Which is a situation that we want to change. Therefore, here’s the summary of the plans for the next two weeks:

Ontologies

  • Making tickets to propose to upstream Nepomuk an audio ontology that we made during a previous sprint.
  • Craft an ontology for images that will later be proposed to Nepomuk. We will convert the extractors in our experimental branch to use the new image ontology.
  • Applications need to be able to create arbitrary key value pairs to existing resources. We plan to add a custom ontology on top of Nepomuk’s NAO ontology that will make this possible.

Fixing things in the 0.6.9x series

  • Changing unique value sorting to use collations, and changing the unique value queries for better performance.
  • Making sure tracker-indexer and tracker-extract support detecting unmount events, and marking resources as unavailable, at a very early stage. This helps to cleanly unmount a removable device without keeping the device busy while the unmount request takes place.
  • Supporting listening for a signal that tells us to pause the indexer. This will be helpful on devices that want us to be silent while they are performing a higher priority task.
  • Evaluating and testing concurrent access.

Our experimental branch

  • Start discussing merging our experimental branch to trunk together with Jamie.
  • We have updated the new .ontology files in our experimental branch to include more specific rdfs:range information.
  • Making it possible to observe changes about resources that belong to a class. For example observing changes in all nfo:Document classes. The granularity will be at the level of a resource. You’ll know whether a resource is updated, deleted or created. This task might take more than a week to complete. Ivan Frade is planning to write a blog item dedicated to this subject.
  • To fulfill the requirements of certain use-cases we’re planning to make it possible to delete a complete resource (or a set of) with one function call.
  • We might start working on adding support for application domain (custom) ontologies.
  • Porting recent improvements on support for detecting file renames to our experimental branch.
  • Evaluating isLogicalPartOf or alternative link as cascading rule for deletes.

When people are interested in joining the development of Tracker, they can ping us in the channel #tracker on GimpNet. They can also send an E-mail to Tracker’s mailing list.

The ontology descriptions of Tracker, now in Turtle

Thanks to Jürg, the experimental branch of Tracker now stores its ontology descriptions in the Turtle format.

What is an ontology anyway?

Wikipedia sums it up pretty well: In computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.

What is Turtle?

The W3C specification explains it as a textual syntax for RDF that allows RDF graphs to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes.

Turtle is the format that we want to standardize on. We for example plan to use it for some of our interprocess communication needs, we already use it for backup and restore support, and we used it as the base format for persisting user metadata on removable devices.

An example snippet from the ontology in Turtle:

nie:InformationElement a rdfs:Class ;
     rdfs:label "Information Element" ;
     rdfs:subClassOf rdfs:Resource .

nie:title a rdf:Property ;
     rdfs:label "Title" ;
     rdfs:comment "The title of the document" ;
     rdfs:subPropertyOf dc:title ;
     nrl:maxCardinality 1 ;
     rdfs:domain nie:InformationElement ;
     rdfs:range xsd:string ;
     tracker:fulltextIndexed true .

nfo:Document a rdfs:Class ;
      rdfs:label "Document" ;
      rdfs:comment "A generic document. A common superclass for all documents on the desktop." ;
      rdfs:subClassOf nie:InformationElement .

Okay …

It only made sense for us to destroy and eliminate inifile formats like the ontology descriptions of the current non-experimental Tracker. Let me explain:

We have plans to add support for domain-specific custom ontologies. By that I mean that it’ll be possible for an application to install and remove an ontology, to get the application-specific metadata out, and to restore this data as part of reinstalling a custom ontology.

As I already pointed out during my Tracker presentation at FOSDEM, this doesn’t mean that we encourage application developers not to care about the base ontology. In fact we strongly recommend application developers to stick to Nepomuk instead of diverging much from it.

Meanwhile experimental Tracker’s indexer has started to work and is storing things in our decomposed RDF storage that uses Nepomuk as schema and can be queried using SPARQL.

Vision

My vision for metadata storage is that files are just one kind of resource. Tracker’s indexer mostly collects metadata about those file-based resources. However, metadata is everywhere: in META tags of websites and RSS feeds, in both locally stored and remotely streamable multimedia resources, in E-mails, meeting requests and calendar items, in your roster and contacts list, in daily events, in computer events and installed applications.

Computer events? I hear you thinking. Well, for example a “hardware” event like your location changing just before you took a picture. That location can be harvested as metadata about the picture. There are many more examples imaginable.

For some kinds of data, metadata is the only thing really being stored; contact resources, for example, don’t carry much more data than what today’s ontologies describe. For most other resources, metadata describes the resource and, often more importantly, its relationships with other resources.

We want Tracker to be your framework for metadata on both desktops and mobile devices. This is why we want to use W3C standards like Turtle instead of our own formats. We just happen to think big.

No more old XML RDFQuery, but SPARQL. No more inifiles, but Turtle. No more home brewed ontology, but Nepomuk.

Merge-to-trunk plans

We are planning to start merging the experimental stuff to trunk, and start calling it the 0.7.x series of Tracker. Let’s see how many cocktails we’ll need for Jamie to get drunk enough to start accepting our insane, but already working, ideas.

TrackerClue!

I was just skimming past some memes and version control system flame debates when I suddenly bumped into an interesting blog post by Henri Bergius on how he sees integration of the GeoClue project with the desktop.

I’m more into mobile myself, considering the desktop a necessary evil that people don’t really enjoy using, but have to use because there’s nothing better. Meanwhile there is a trend towards mobile use: music players, cameras, in-car entertainment, navigation assistance, movie players, set-top boxes, and sooner or later ePaper devices to replace magazines, books and newspapers.

But whenever the desktop’s software gets integrated with a location framework, it wakes me up. That’s because I consider access to meaningful location clues an enabler of a large number of very interesting use-cases for mobile. Use-cases which we might not be seeing yet today, because we humans are walking blindly into the future.

Which means that we at the Tracker project must and will welcome such integration. We too want to enable the app developers of tomorrow and today to convert their innovative ideas into elegant solutions. Location clues about events and resources will be very interesting meta information for those apps, indeed.

Such systems can already update the Nepomuk-structured meta information that Tracker collects, using the SPARQL INSERT, DELETE and UPDATE support that Jürg started working on a week or so ago.

It’s actually finished… though maybe not sufficiently tested. But isn’t crashing hard pure fun anyway? It gives you a reason to go code and fix it!

Although we are very hard at work getting the indexer working again, our experimental branch won’t index your documents just yet. We have been testing our query stuff by importing generated Turtle files, to be honest.

Nonetheless I kindly invite people to completely break their Tracker install by trying out our experimental stuff. Read a bit about Nepomuk’s ontology and mentally glue that together with the query flexibility SPARQL enables, and you’ll pretty soon grasp how cool it will be.

And yeah, there’s still a lot of hard work to do. But that’s great and a lot of fun.

And you should grow a beard, put on a pointy hat, drink Jolt Cola, and join the fun!

Ok. That’s enough Tracker propaganda for a day. Let’s now check if Nepomuk’s current stuff is good for storing GeoClue’s info.

SPARQL, Nepomuk, StreamAnalyzer and Tracker

We at the Tracker team should in my opinion report more often in our blogs about our progress on things like Tracker’s SPARQL and Nepomuk support.

Let’s start with the awesome SPARQL support that Jürg has been working on. Until just a few minutes ago, when you made a SPARQL query that had an unknown predicate, Tracker returned an empty array over D-Bus:

dbus-send --print-reply --dest=org.freedesktop.Tracker \
   --type=method_call /org/freedesktop/Tracker/Search \
   org.freedesktop.Tracker.Search.SparqlQuery \
   string:'SELECT ?title WHERE { ?s nie:ttle
      ?title FILTER regex(?title, ".*in.*") . }'

method return sender=:1.66 -> dest=:1.98 reply_serial=2
   array [
   ]

This left you in the dark about your query being in error. Jürg fixed this, and now you get something like this instead:

tracker-sparql --query="SELECT ?title WHERE { ?s nie:ttle
      ?title FILTER regex(?title, \".*in.*\") . }"
Could not query search, Unknown property `http://.../nie#ttle'

This way you can fix your query’s error and do something like this instead:

tracker-sparql --query="SELECT ?title WHERE { ?s nie:title
      ?title FILTER regex(?title, \".*in.*\") . }"

  The final metadata solution
  Tracker in gnome bugzilla

Today I migrated the code in Tracker that implements support for the metadata D-Bus API for E-mail to the Nepomuk Message Ontology. Meaning that Tracker will store the metadata it receives from E-mail clients like KMail and Evolution using the NMO ontology and that it’ll make this metadata available to the SPARQL query engine.

Great news that we got informed of this week is that a developer has started implementing the metadata D-Bus API for E-mail in Thunderbird. He left a pointer to his git repository on the wiki-page.

Meanwhile I have implemented the API in KMail. This patch is pending review. We are planning to add support for this in Modest soon too.

Next, we are migrating the indexers and extractors to Nepomuk. These tasks come with all sorts of extra work related to integrating with Nepomuk as ontology.

I have also implemented integration with Strigi’s truly awesome StreamAnalyzer. I have rarely seen such a beautifully designed piece of code that in my opinion outperforms whatever Tracker has at this moment for extracting metadata in several interesting ways.

I don’t know why we shouldn’t join Strigi in making StreamAnalyzer kick ass. I can find no reason why we should try to compete with it instead of integrating with it. I’m pushing our team to consider the integration option, and so far they are enthusiastic about it.

StreamAnalyzer needs a migration from Xesam to Nepomuk as ontology, but Evgeny Egorochkin and Jos Vandenoever already told me that they have put this on their agenda. After that, with the integration that I did for Tracker, StreamAnalyzer can become the core analysis code that Tracker uses. Right now the plan is to let StreamAnalyzer run first and then let Tracker’s own extractors follow up.

Let’s make some more bridges with KDE projects. Why not!

E-mail metadata, “E-mail as a (desktop) service”

On mobiles, if not on the desktop, I think the era of E-mail clients will soon be over. Just like the era of filemanagers will be over. A person who’s using a mobile device or a phone doesn’t really want to start and stop applications. Those users don’t start and stop applications to receive phonecalls and text messages; why would they want to start and stop E-mail clients?

Not only that. People also want their text messages, E-mails, history of calls, meetings, contacts and photos to be integrated.

When we search for a meeting, we want to find the photos we took at that meeting. We also want to find the contacts that were at that meeting. We want to find the invitation E-mail and all the replies to the invitation. And when we select a contact, we want to see a tree of E-mail discussions that we once had with the contact.

On a mobile we don’t want one big application that does all this. Instead we want all applications to integrate with all this information. And we want it to be very easy for application developers to integrate with this system. An application framework.

This will be the purpose of Tracker on mobiles.

This means that the concept of an E-mail client will eventually be moved to the background of the mobile device. E-mail must just be there, not started. Something must communicate with the IMAP server whenever needed. Meanwhile all applications on the mobile need easy access to the metadata of the E-mails. It must be easy for them to get a MIME part of an E-mail, perhaps as an InputStream.

The combination of E-mail metadata querying and handling of the E-mail’s MIME parts is what I refer to as “E-mail as a service”. Some people in the past tried to explain to me that just putting JavaMail’s API over D-Bus would already solve “E-mail as a service”. I don’t think this is true. Camel, which based its API on JavaMail, offers a truly weak query interface. It’s for example not possible to ask for MIME parts that are images that have a specific Exif tag set.
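
To make that concrete, this is the kind of query an RDF store enables. A sketch assuming NMO’s nmo:hasAttachment and a Nepomuk Exif property such as nexif:flash; the exact property names may differ:

SELECT ?msg ?part WHERE {
  ?msg a nmo:Email ;
       nmo:hasAttachment ?part .
  ?part a nfo:Image ;
        nexif:flash true .
}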

That’s of course because it’s not Camel’s purpose to do metadata indexing of the attachments. But that was yesterday. Yesterday was boring.

With modern IMAP servers that have the CONVERT capability it will be possible to ask for a converted MIME part of an E-mail, converted to Exif plain-text data. This means we don’t have to fetch 5 MB of JPEG data just to read a few hundred bytes of Exif metadata.

Meanwhile normal IMAP servers already offer ENVELOPE and BODYSTRUCTURE which of course gives us a lot of metadata too.

To assist people who want to write an “E-mail as a service” D-Bus service today, I have decided to write a document that explains some of the capabilities of modern IMAP.

I think the future of E-mail “infrastructure” lies in:

  1. Using an RDF store that can be accessed using SPARQL, like Tracker. This stores the ENVELOPEs and BODYSTRUCTUREs of your E-mails next to the attachments’ other metadata triples. The query language can then be used to query against metadata found by analyzers like Tracker’s own extractors and/or Strigi’s StreamAnalyzer, as well as against metadata coming from IMAP itself.

    People who saw my presentation at FOSDEM already know that we are planning to push Tracker in the direction of SPARQL + Nepomuk as ontology. Meanwhile we are in discussion with the Xesam and Nepomuk people to change the Nepomuk Message Ontology to be suitable for this. As a result Evgeny Egorochkin made this proposal.

  2. Having a small service for dealing with E-mail specific things: getting the contents of MIME parts as streams, requesting them to be downloaded if they aren’t cached locally yet, requesting a CONVERTed version of them.

    There are some experiments happening that will implement this capability. It’s all still very early. If I ever start a Tinymail 2.0, I will probably make it focus primarily on this.

I don’t think the conventional E-mail client will survive for very long. Especially not on mobiles where “integration” is far more important for the end-user.

Every (mobile) application can soon become as capable of handling E-mail as what we today call “the E-mail client”. At least from the point of view of the user. In reality a desktop service will solve the hard stuff.