Archive for the ‘Tracker’ Category

Handling triplets arriving in tracker-store, CouchDB integration as use-case

Sunday, November 22nd, 2009

At GCDS Jamie told us that he wants to make a plugin for tracker-store that writes all the triplets to a CouchDB instance.

Letting a CouchDB be a sort of offline backup isn’t very interesting. You want triples to go into the CouchDB at the moment of guaranteed storage: at commit time.

For the purpose of developing this we provide the following internal API.

typedef void (*TrackerStatementCallback) (const gchar *graph,
                                          const gchar *subject,
                                          const gchar *predicate,
                                          const gchar *object,
                                          GPtrArray   *rdf_types,
                                          gpointer     user_data);
typedef void (*TrackerCommitCallback)    (gpointer     user_data);

tracker_data_add_insert_statement_callback (TrackerStatementCallback callback,
                                            gpointer                 user_data);
tracker_data_add_delete_statement_callback (TrackerStatementCallback callback,
                                            gpointer                 user_data);
tracker_data_add_commit_statement_callback (TrackerCommitCallback callback,
                                            gpointer              user_data);

You’ll need to make a plugin for tracker-store and make the hook at the initialization of your plugin.

Current behaviour is when graph is NULL, it means that the default graph is being used. If it’s not NULL, it means that you probably don’t want the data in CouchDB: it’s data that’s coming from a miner. You probably only want to store data that is coming from the user. His applications won’t use FROM and INTO for their SPARQL Update queries, meaning that graph is NULL.

Very important is that your callback handler works with bottom halves: put your expensive task on a queue and handle the queued item somewhere else. You can for example use a GThreadPool or a GQueue plus a g_idle_add_full with G_PRIORITY_LOW callback picking items one by one on the mainloop. You should never have a TrackerStatementCallback or a TrackerCommitCallback that blocks. Not even a tiny tiny bit of blocking: it’ll bring everything in tracker-store on its knees. It’s why we aren’t giving you a public plugin API with a way to install your own plugins outside of the Tracker project.

By the way: we want to see code instead of talk before we further optimize things for this purpose.

Writeback, writing metadata back into your files

Wednesday, November 11th, 2009

Today, I feel like exposing you to some bleeding edge development going on as we speak at the Tracker team. I know you’re scared of that and that’s precisely why I want to expose you! Hah.

We are prototyping writeback support for Tracker.

With writeback we mean writing metadata that the user passes to us via SPARQL UPDATE into the file that he’s describing.

This means that it must be about a thing that is stored, that it must update a property that we want to writeback and it means that we need to support the format.

OK, that’s three requirements before we write anything back. Let’s explain how this stuff works in the prototype!

In our prototype you mark properties that are eligible for being written into the files using tracker:writeback.

It goes like this:

nie:title a rdf:Property ;
   rdfs:label "Title" ;
   rdfs:comment "The title of the document" ;
   rdfs:subPropertyOf dc:title ;
   nrl:maxCardinality 1 ;
   rdfs:domain nie:InformationElement ;
   rdfs:range xsd:string ;
   tracker:fulltextIndexed true ;
   tracker:weight 10 ;
   tracker:writeback true .

Next you need a writeback module for tracker-writeback. We implemented a prototype one that can only write the title of MP3 files. It uses ID3lib‘s C API.

When the user is describing a file, the resource must have nie:isStoredAs. The property being changed ‘s tracker:writeback must be true. We want the value of the property too. That’s simple in SPARQL, right? Sure it is!

SELECT ?url ?predicate ?object {
    <$subject> ?predicate ?object ;
               nie:isStoredAs ?url .
    ?predicate tracker:writeback true
 }

You’ll find this query in the code, go look!

Now it’s simple: using ID3lib we map Nepomuk to ID3 and write it.

No don’t be afraid, we’re not going to writeback metadata that we found ourselves. We’ll only writeback data that the user provided in the form of a SPARQL Update on the default graph. No panic. Besides, using tracker-writeback is going to be completely optional (just don’t run it).

This is a prototype, I repeat, this is a prototype. No expectations yet please. Just feel exposed to scary stuff, get overly excited and then join us by contributing. It’s all public what we’re doing in the branch ‘writeback’.

ps. Whether this will be Maemo’s future metadata-write stuff? Hmm, I don’t know. Do you know? ;-)

FWD: Using Tumbler in Client Applications

Friday, October 30th, 2009

Tumbler‘s maintainer wrote an interesting tutorial on how to use the thumbnail DBus API today.

Check it out if your application needs to use thumbnails.

Tumbler

Wednesday, October 28th, 2009

Last few weeks I have been working on the new thumbnail infrastructure for future Maemo products.

Last year I made a specification for requesting thumbnails over D-Bus. Afterward I made a quick prototype and replaced the hildon-thumbnailer library of Maemo with it. This prototype will be deployed on the standard N900 image. It’s too late to replace Fremantle’s thumbnailer with the new stuff. It takes time to properly test it.

While I was developing both the specification and the prototype XFCE developer Jannis Pohlmann contacted me about rewriting my prototype for use in the XFCE project. Tumbler was born.

The nice people at Nokia are more interested in working with upstream projects instead of maintaining own products separately, so I shifted my focus from hildon-thumbnail to contributing to Jannis’ Tumbler project.

We realized that we needed different kinds of schedulers so while Jannis was developing Tumbler I kindly asked to consider abstracting scheduling a bit. Tumbler now has two schedulers. The background one sets I/O and scheduler priorities to IDLE and processes its thumbnail tasks in FIFO order. The foreground uses LIFO and will instead of grouping Ready signals together, emit them immediately after each single thumbnail is finished. Default is of course foreground.

We also realized that thumbnail flavors are going to be platform specific. So we added some support for this in the DBus APIs that we further fine tuned and versioned.

Congratulations and appreciation to Jannis who made Tumbler’s code and design really nice. Also thanks a lot for constructively considering our requirements and helping adapting Tumbler’s code to cope with them.

I know you for example worked one long night on this stuff, so I officially owe you a few beers and/or cocktails next conference.

How about FOSDEM?

Keeping the autotools guys happy with qmake

Tuesday, October 20th, 2009

I’m still figuring out how to do the same thing with cmake, but various bloggers and comments appear to be promising that it’ll be even more easy.

But this is a message for probably all Nokia teams who are making Qt-based libraries:

First open your src/src.pro file and add this stuff:

CONFIG += create_pc create_prl
QMAKE_PKGCONFIG_REQUIRES = QtGui
QMAKE_PKGCONFIG_DESTDIR = pkgconfig
INSTALLS += target headers

Now open your debian/$package-dev.install file and add this line:

usr/lib/pkgconfig

You’ll be doing all the autotools people a tremendous favor.

Next, open the README file and document that you need to use qmake-qt4 on Debian or make either qmake-qt3 or qmake-qt4 work flawlessly with your build environment. Perhaps also mention how to set the install prefix, how to make qmake find and install .pc files in another location, stuff like that. I find that this is lacking for almost every Qt-based library.

You’ll be doing everybody who wants to use your software a tremendous favor.

Release fast, release often (has finally started)

Wednesday, September 30th, 2009

Martyn is right, we did release Tracker 0.7!

Now remember kids. It’s only a alpha or at least unstable release. The 0.8 will be what we will call the stable series for RDF, SPARQL, the new miner infrastructure, etc.

Database cursors used in Tracker

Monday, August 31st, 2009

A cursor on a query of a database is a finger pointing to the current row. Most databases do this without pulling the entire resultset into memory. It’s indeed much like a C/C++ pointer, except that a pointer can only point to memory in your process’ virtual memory. A cursor is a bit more abstract.

POSIX developers can compare a cursor with a pointer to a region in an mmap. For people who don’t know about mmap, mmap can be used to map a file into your process’ memory. You get a C/C++ pointer back, from which you can read the data as-if it’s in memory. With mmap, when you create a pagefault, the kernel will pull pages into your memory (from the file, or whichever resource is behind the mapping).

In Tracker all database operations used to be much like how using g_file_get_contents works: you read the entire thing into memory, and then you operate on that memory. Internally it used the database’s cursor API too, of course. The sqlite3_step is sqlite’s cursor API too.

First the database has filled up its pagecache with this data, then you copied it to your application’s memory, then you used it, then you freed it.

That’s kinda silly! Why not use it straight from the database’s caches instead? That’s what you use a DB cursor for.

The result is less copying of memory. This means less memory fragmentation and fewer memory operations to perform (which should result in a small performance improvement).

This effort is ongoing but a lot of Tracker’s internal loops over resultsets are now using a cursor instead of a in-memory result-set.

A reason to get up in the morning

Friday, August 28th, 2009

Ever since Nokia contacted me about improving Tinymail to make it suitable for their Modest E-mail client have they given me a reason to get up in the morning, to work on something of which I knew would someday kick ass.

With the Maemo5 based Nokia N900 device we’ll have Modest shipped by default, and Tracker being actively used by several of its softwares. Future is going to shine even brighter for Tracker. Hard to brag about it, Tracker is inherently a background thing. Ah, well, technical people know about it.

Having worked on Tracker for more than a year, I now understand Tracker’s potential. At first, while I was trying to make an API for- and store the summary of E-mail envelope headers, so that E-mail clients can access this in a memory efficient way, I was critical of this Tracker stuff.

But then I joined Ivan, Urho, Ottela, Martyn and Carlos who were working on Tracker. Later Jürg joined and at the Berlin Hackfest people like Rob Taylor, Jürg and Urho discussed replacing Tracker’s poorer own ontology with Nepomuk and replacing its query language with SPARQL.

Given the implied complexity I was again critical, but then that crazy Jürg guy in a few weeks time turned Tracker into 99.9% pure fine awesomeness. I quickly joined working on this crazy “vstore” branch. Since a few months we have convinced the other Tracker guys to just start calling it “master”.

Ever since I feel again like a student who is learning how to develop software. Jürg is utilizing so many good techniques and we’re implementing so many specifications that are just “the right thing to do”, that the beautify of it all could sometimes make me cry of happiness.

Thanks to creating the opportunity to develop on software that will be used on for example their N900 device, Nokia continues giving people like me a reason to get up in the morning.

Don’t tell the native Nokians, but that’s why the N900 announcements secretly also made me a little bit proud. To whoever of us that worked on this stuff: guys, we’re all doing a great job. Let’s make the next one even better!

SPARQL’s str() function in Tracker

Thursday, July 30th, 2009

Today I implemented the str() function for our SPARQL engine.

This makes it possible to use a <subject> just like a string.

Let’s first insert some data into our SPARQL store.

tracker-sparql -u -q \
   "INSERT { <urn:baaa> a rdfs:Resource }"

Following query doesn’t work, as variable ?s isn’t assigned with a xsd:string here, but a rdfs:Resource.

tracker-sparql -q
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER REGEX (?s, '.*baaa', 's')
}"

This version works, because we introduce the str() function.

tracker-sparql -q
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER REGEX (str(?s), '.*baaa', 's')
}"
  urn:uuid:94baaa45-99a6-e0f4-0bd9-f83ca90a9039
  urn:uuid:6e909006-a6ac-baaa-2ae4-cc01adcd5de7
  urn:baaa

You can also use a direct match, of course.

tracker-sparql -q
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER (str(?s) = 'urn:baaa')
}"
  urn:baaa

By the way. Ivan made a cute tool in Python for typing in your queries:

It even does some code completion. If you type nco:[TAB] it’ll show you the NCO ontology. Nice!

More introduction to RDF and SPARQL

Sunday, July 19th, 2009

Introduction

I plan to give an introduction to features like COUNT, FILTER REGEX and GROUP BY which are supported by Tracker‘s SPARQL engine. We support more such features but I have to start the introduction somewhere. And overloading people with introductions to all features wont help me much with explaining things.

Since my last introduction to RDF and SPARQL I have added a few relationships and actors to the game.

We have Morrel, Max and Sasha being dogs, Sheeba and Query are cats, Picca is still a parrot, Fred and John are contacts. Fred claims that John is his friend. I changed the ontology to allow friendships between the animals too: Sasha claims that Morrel and Max are her friends. Sheeba claims Query is her friend. John bought Query. Fred being inspired by John decided to also get some pets: Morrel, Sasha and Sheeba.

Ontology

Let’s put this story in Turtle:

<test:Picca> a test:Parrot, test:Pet ;
	test:name "Picca" .

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .

<test:Morrel> a test:Dog, test:Pet ;
	test:name "Morrel" ;
	test:hasFriend <test:Max> .

<test:Sasha> a test:Dog, test:Pet ;
	test:name "Sasha" ;
	test:hasFriend <test:Morrel> ;
	test:hasFriend <test:Max> .

<test:Sheeba> a test:Cat, test:Pet ;
	test:name "Sheeba" ;
	test:hasFriend <test:Query> .

<test:Query> a test:Cat, test:Pet ;
	test:name "Query" .

<test:John> a test:Contact ;
	test:owns <test:Max> ;
	test:owns <test:Picca> ;
	test:owns <test:Query> ;
	test:name "John" .

<test:Fred> a test:Contact ;
	test:hasFriend <test:John> ;
	test:name "Fred" ;
	test:owns <test:Morrel> ;
	test:owns <test:Sasha> ;
	test:owns <test:Sheeba> .

Querytime!

Let’s first start with all friend relationships:

SELECT ?subject ?friend
WHERE { ?subject test:hasFriend ?friend }

  test:Morrel, test:Max
  test:Sasha, test:Morrel
  test:Sasha, test:Max
  test:Sheeba, test:Query
  test:Fred, test:John

Just counting these is pretty simple. In SPARQL all selectable fields must have a variable name, so we add the “as c” here.

SELECT COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend }

  5

We counted friend relationships, of course. Let’s say we want to count how many friends each subject has. This is a more interesting query than the previous one.

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend }
GROUP BY ?subject

  test:Fred, 1
  test:Morrel, 1
  test:Sasha, 2
  test:Sheeba, 1

Actually, we’re only interested in the human friends:

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend .
        ?friend a test:Contact
} GROUP BY ?subject

  test:Fred, 1

No no, we are only interested in friends that are either cats or dogs:

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend .
       ?friend a ?type .
       FILTER ( ?type = test:Dog || ?type = test:Cat)
} GROUP BY ?subject"

  test:Morrel, 1
  test:Sasha, 2
  test:Sheeba, 1

Now we are only interested in friends that are either a cat or a dog, but whose name starts with a ‘S’.

SELECT ?subject COUNT (?friend) as c
WHERE { ?subject test:hasFriend ?friend ;
                 test:name ?n .
       ?friend a ?type .
       FILTER ( ?type = test:Dog || ?type = test:Cat) .
       FILTER REGEX (?n, '^S', 'i')
} GROUP BY ?subject

  test:Sasha, 2
  test:Sheeba, 1

Conclusions

Should we stop talking about ontologies and start talking about searchboxes and user interfaces instead? Although I certainly agree more UI-stuff is needed, I’m not sure yet. RDF and SPARQL are also about relationships and roles. Not just about matching stuff. Whenever we explain the new Tracker to people, most are stuck with ‘matching’ in their mind. They don’t think about a lot of other use-cases.

Such a search is just one use-case starting point: user entered a random search string and gives zero other meaning about what he needs. Many more situations can be starting points: When I select a contact in a user interface designed to show an archive of messages that he once sent to me, the searchbox becomes much more narrow, much more helpful.

As soon as you have RDF and SPARQL, and with Tracker you do, an application developer can start taking into account relationships between resources: The relationship between a contact in Instant Messaging and the attachments in an E-mail that he as a person has sent to you. Why not combine it with friendship relationships synced from online services?

With a populated store you can make the relationship between a friend who joined you on a trip, and photos of a friend of your friend who suggested the holiday location.

With GeoClue integration we could link his photos up with actual location markers. You’d find these photos that came from the friend of your friend, and we could immediately feed the location markers to the GPS software on your phone.

I really hope application developers have more imagination than just global searchboxes.

And this is just a use-case that is technically already possible with today’s high-end phones.

Introduction to RDF and SPARQL

Tuesday, July 14th, 2009

Let’s start with a relatively simple graph. The graph shows the relationships between John, Fred, Max and Picca. John and Fred are humans who we’ll refer to as contacts. Max and Picca are pets. Max is a dog and Picca is a parrot. Both Picca and Max are owned by John. Fred claims that John is his friend.

If we would want to represent this story semantically we would first need to make an dictionary that describes pets, contacts, dogs, parrots. The dictionary would also describe possible relationships like ownership of a pet and the friendship between two contacts. Don’t forget, making something semantic means that you want to give meaning to the things that interest you.

Giving meaning is exactly what we’ll start with. We will write the schema for making this story possible. We will call this an ontology.

We describe our ontology using the Turtle format. In Turtle you can have prefixes. The prefix test: for example is the same as using <http://test.org/ontologies/tracker#>.

In Turtle you describe statements by giving a subject, a predicate and then an object. The subject is what you are talking about. The predicate is what about the subject your are talking about. And finally the object is the value. This value can be a resource or a literal.

When you write a . (a dot) in Turtle it means that you end describing the subject. When you write a ; (semicolon) it means that you continue with the same subject, but will start describing a new predicate. When you write a , (comma) it means that you even continue with the same predicate. The same rules apply in the WHERE section of a SPARQL query. But first things first: the ontology.

Note that the “test” ontology is not officially registered at tracker-project.org. It serves merely as an example.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix test: <http://www.tracker-project.org/ontologies/test#> .

test: a tracker:Namespace ;
	tracker:prefix "test" .

test:Entity a rdfs:Class .

test:Contact a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Pet a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Dog a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Parrot a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:name a rdf:Property ;
	rdfs:domain test:Entity ;
	rdfs:range xsd:string .

test:owns a rdf:Property ;
	rdfs:domain test:Contact ;
	rdfs:range test:Pet .

test:hasFriend a rdf:Property ;
	rdfs:domain test:Contact ;
	rdfs:range test:Contact .

Now that we have meaning, we will introduce the actors: Picca, Max, John and Fred. Copy the @prefix lines of the ontology file from above, put the ontology file in the share/tracker/ontologies directory and run tracker-processes -r before restarting tracker-store in master. After doing all that you can actually store this as a /tmp/import.ttl file and then run tracker-import /tmp/import.ttl and it should import just fine. Ready for the queries below to be executed with the tracker-sparql -q ‘$query’ command.

Note that tracker-processes -r destroys all your RDF data in Tracker. We don’t yet support adding custom ontologies at runtime, so for doing this test you have to start everything from scratch.

<test:Picca> a test:Parrot, test:Pet ;
	test:name "Picca" .

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .

<test:John> a test:Contact ;
	test:owns <test:Max> ;
	test:owns <test:Picca> ;
	test:name "John" .

<test:Fred> a test:Contact ;
	test:hasFriend <test:John> ;
	test:name "Fred" .

Let’s do some simple SPARQL queries. You can execute these queries this way:

tracker-sparql -q "SELECT ?subject WHERE { ?subject a test:Parrot }"

In this query we ask for the subject of each entity that is a parrot. The query will yield test:Picca because Picca is the only parrot in our situation.

  test:Picca

Usually we aren’t interested in the subject, but in a real property of the parrot. We can ask for such a property this way:

SELECT ?subject ?name WHERE { ?subject a test:Parrot ; test:name ?name}
  test:Picca, Picca

Another simple example, give me all the contacts:

SELECT ?subject WHERE { ?subject a test:Contact }"
  test:John
  test:Fred

Just the contacts doesn’t illustrate much. Give me all contacts that have a friend. And display the contact and the friend’s names:

SELECT ?name ?friend
WHERE { ?subject test:hasFriend ?f ;
                 test:name ?name .
        ?f test:name ?friend }
  Fred, John

Let’s ask for all the pets that are owned:

SELECT ?subject WHERE { ?unknown test:owns ?subject }
  test:Max
  test:Picca

Oh, not the subject. The names. How did we do that again? Right:

SELECT ?name
WHERE { ?unknown test:owns ?subject .
        ?subject test:name ?name }
  Max
  Picca

This will of course yield the same results in our situation:

SELECT ?name
WHERE { <test:John> test:owns ?subject .
        ?subject test:name ?name }
  Max
  Picca

But this wont, Fred doesn’t own any pets. Only John owns pets.

SELECT ?name
WHERE { <test:Fred> test:owns ?subject .
        ?subject test:name ?name }

Let’s print the owner’s and the pet’s names:

SELECT ?owner ?name
 WHERE { ?unknown test:owns ?subject ;
                  test:name ?owner .
         ?subject test:name ?name }"
  John, Max
  John, Picca

Still with me? Let’s now conclude with requesting the names of the contacts who are a friend of the person who owns Picca:

SELECT ?name
WHERE { ?subject test:owns <test:Picca> .
        ?unknown test:hasFriend ?subject ;
                 test:name ?name }
  Fred

Invitation for Jürg and Rob: How about you guys writing a introduction to OPTIONAL, SUM, COUNT, GROUP-BY and FILTER, etc in SPARQL? :-) The more advanced stuff.

The subject of a resource, Nepomuk’s isStoredAs

Monday, July 13th, 2009

After the many discussions the Tracker team did at the Desktop Summit in Gran Canaria I think a lot of people will start trying out Tracker’s master. We will indeed start making 0.7.x releases somewhere this or next month.

Meanwhile I’d like to point out that among the decisions that we made during the meetings and at the Ontology BOFs is that we wont use the URL of resources as the RDF’s subject field anymore. Instead we’ll use the nie:isStoredAs predicate for storing the URL.

Right now we already set nie:isStoredAs, but we still use the URL as subject. This will change, though. Just assume the subject to be something you should only use as an unique piece of data about the resource, pointing at it (in the RDF store). More details can be found here. If you want the thing itself (the file, the E-mail, the .desktop file, the website’s URL), ask for nie:isStoredAs.

For example:

<file:///tmp/myfile.png> a nfo:FileDataObject .
<urn:nepomuk:file:d7ea...> a nfo:Image ;
	nie:isStoredAs <file:///tmp/myfile.png> .

And to query:

tracker-sparql -q "SELECT ?url WHERE { ?subject a nfo:Image ; nie:isStoredAs ?url }

We know that many people want these 0.7.x releases to happen soon. I can only invite those people to just join coding. Awesome stuff is indeed taking place, but at the same time there is a lot of work and decision making to do.

Things like a user interface like the T-S-T (Tracker Search Tool) from Tracker 0.6, documentation with a lot of examples. SPARQL, SPARQL Update and Nepomuk all have quite a lot of documentation by themselves. But people are still asking for even more examples. Anybody interested in making that? Maybe if somebody who was at Rob Taylor’s BOF could write down his and Jürg’s lectures on RDF and SPARQL? I think they explained it all very well.

Tracker experimental merged to main development tree, Ivan’s presentation

Thursday, July 2nd, 2009

I’m currently involved in the Tracker project and our project will be presented by Ivan Frade at the Desktop Summit this Sunday.

We merged our experimental branch tracker-store to master. This means that our reachitecture plans for Tracker have mostly been implemented and are being pushed forward into the main development tree.

I will start with a comparison with Tracker’s 0.6.x series.

Tracker master:

  • Uses SPARQL as query language
  • Uses Nepomuk for its base ontologies
  • Supports SPARQL Update
  • Supports aggregates COUNT, AVG, SUM, MIN and MAX in SPARQL
  • Operates for all its storage functionality as a separate binary
  • Operates all its indexing, crawling and monitoring functionalities in a separately packagable binary

Tracker 0.6.9x:

  • Uses RDFQuery as query language
  • Has its own ontology
  • Has very limited support for storing your own data
  • Supports several aggregate functions in its query language
  • Operates for all its storage functionality in the indexer
  • Operates for all its query functionality in the permanent daemon
  • Does file monitoring and crawling in the permanent daemon
  • Operates all its indexing functionality in a separately packagable binary

Tracker master:

Architecture

The storage service uses the Nepomuk ontologies as schema. It allows you to both query and insert or delete data.

The fs-miner finds, crawls and monitors file resources. It also analyses those files and extracts the metadata. It instructs the storage service to store the metadata.

External applications and other such miners are allowed to both query and insert using the storage service. Priority is given to queries over stores.

Plugins that run in process of the application can push information into Tracker. We indeed don’t try to scan Evolution’s cache formats, we have a plugin that gets it out of Evolution and into Tracker.

Storage service’s API and IPC

The storage service gives priority to SELECT queries to ensure that apps in need of metadata get serviced quickly.

INSERT and DELETE calls get queued. SELECT ones get executed immediately. For apps that require consistency and/or insertion speed we provide a batch mode that has a commit barrier. When the commit calls back you know that everything that came before it, is in a consistent shape. We don’t support full transactions with rollback.

The standard API operates over DBus. This means while using it you are subject to DBus’s performance limitations. In SPARQL Update it is possible to group a lot of writes. Due to DBus’s latency overhead this is recommended when inserting larger sets of data. We’re experimenting with a custom IPC system, based on unix sockets, to get increased throughput for apps that want to put a lot of INSERTs onto our queue.

We provide a feature that signals on changes happening to certain types. You can see this as a poor man’s live search. Full live search for SPARQL is fairly complicated. Maybe in future we’ll implement something like that.

Ontology

We support the majority of the Nepomuk base ontologies and our so called filesystem miners will store found metadata using Nepomuk’s ontologies. We support static custom ontologies right now. This means that it’s impossible to dynamically add a new ontology unless you reset the entire database first.

We’re planning to support dynamically adding and removing ontologies. The ontology format that we use is Turtle.

Backup and import

Right now we support loading data into our database using either SPARQL Update, an experimental unix-socket based IPC, and by passing us a Turtle file.

We currently have no support for making a backup. Support for this is on priority planning. It will write a Turtle file (which can be loaded afterward).

Backup and import of ontology specific metadata

When we introduce support for custom ontologies it’ll be useful for apps that provided their own custom ontology to get a backup of just the data that has relevance to said ontology. We plan to provide a method to do that.

Volume support

Having a static custom ontology for volume support, volumes and their status is queryable over SPARQL. File resources also get linked to said volumes. This makes it possible to get the availability of a file resource. For example: return metadata about all photos that are located on a specific camera, although the camera isn’t connected to this device.

Volume support is a work in progress at this moment.

Rearchitecting Tracker

Monday, May 25th, 2009

Jürg and me have started working on the rearchitecture plans that we have for Tracker. You can follow the code being changed here and here.

What is finished?

  • Jürg took all database code out of the indexer. The indexer is now a consumer of tracker-store like any other. It commands tracker-store to store metadata. The indexer now also queries tracker-store for things like the modification time. Currently it has no access to the database directly. This might change, for performance reasons, we’re not sure about that yet.
  • The trackerd process got renamed to tracker-store.
  • The DBus object in tracker-store now executes the SPARQL Update requests itself. It used to send this request to tracker-indexer.

  • Jürg moved the watching and crawling code that used to be in the daemon to the indexer. This means that tracker-store doesn’t depend on inotify anymore. This work made it possible to make your own indexer or not to have an indexer at all. This was quite a big task and got pushed today. This is of course being tested as we speak.

  • I wrote an internal API to queue database store requests, making it possible to asynchronously deal with large amounts of data when multiple metadata deliverers will be giving tracker-store commands to store their metadata.
  • I also ported existing code to use this internal API. This task item is ongoing and being tested. For example the Turtle Import, support for removable device caches in Turtle, Push modules (support for E-mail clients) and the DBus SPARQL Update API are affected by this.
  • The class signals feature, which now doesn’t require involvement of the indexer, got fixed.

What is left to do?

Right now the indexer will instruct an extractor process to extract metadata from a file. This extractor process communicates the metadata first to the indexer, which in turn communicates the same metadata to tracker-store. This can be done more efficient by letting the extractor communicate the metadata directly to tracker-store.

We also have quite a few other plans for the indexer’s code. Such plans are a bit less short term planning. For example splitting support for the removable devices and the normal filesystem into two processes.

E-mail as a desktop service, this is how it should be done

Monday, May 4th, 2009

While developing Tinymail, a library for writing E-mail clients, I was convinced that the storage of the summary was something Tinymail itself must handle. Back then there was, even pragmatically, nothing that could cope with the requirements of E-mail on mobile devices for this task.

Meanwhile I got the opportunity to work on the Tracker project. Using the Nepomuk Ontology, I made sure that the message ontology that Tracker uses can actually handle these requirements. I believe that the adoption of Nepomuk and SPARQL evolved Tracker from something that isn’t useful for E-mail software to something that should be involved when writing a desktop service for E-mail today.

Ryan, a pioneer in experimenting E-mail as a desktop service, advised me to be careful with my bias for RDF and SPARQL. I’ll keep it in mind! However…

I believe such a desktop service for E-mail should:

  • Download metadata by getting and parsing ENVELOPE and BODYSTRUCTURE using the FETCH programme of an IMAP server. As explained in this document.
  • Give priority to downloading metadata of those E-mails nearby the user’s scroll position.
  • Use IMAP’s pipelining. It gives the user the feeling that his technology operates faster than his human brains, even when on high latency connections like GPRS.
  • Cache the information, using Tracker’s Nepomuk Message Ontology as schema.
  • Make it possible to fetch just one particular MIME part and not the entire messages.
  • Enable it to create a new message, consisting of individual MIME parts.
  • Make it possible for those MIME parts to have their source in existing messages on an IMAP server when creating a message. When the IMAP server supports CATENATE, it should be used for this purpose.
  • Make applications use SPARQL with Tracker’s NMO to query metadata about E-mails.
  • Provide a stream API to get access to the the actual data of individual MIME parts. If not cached, the service should download the MIME part on demand. The DBus Stream API should look like GInputStream. Except for read(): I think for the transfer of the chunks of data that Unix Sockets or named pipes are better than using D-Bus.

To this I would like to add that although many people falsely believe that E-mails are like files, E-mails are more like recursive directories (container MIME parts) with items: the E-mail’s MIME parts. Any API that doesn’t admit this, is incorrectly designed.

This goes all the way up to the protocol, where you fetch per MIME part. You don’t fetch entire messages. You can indeed do that but that doesn’t mean it isn’t wrong. IMAP is not POP3. It’s also better to design for IMAP, than to use IMAP as a POP3 service. Better have hacks to support POP3 in your model (I’m serious).

Please don’t make the same mistake nearly every newcomer of E-mail solutions makes. There’s plenty of rubbish already, seriously.

Tracker, our near future plans

Friday, April 24th, 2009

Apparently

Apparently this hasn’t been echoed enough times. A lot of teams are still wondering what they should use if they want to store RDF metadata in Nepomuk and how to query it.

What happened before

We have refactoring to bring Tracker’s codebase into a better state. This is being released as Tracker 0.6.9x. This one sentence is really not enough to describe the changes. We can’t continue talking about the past forever. Sorry guys.

We have introduced support for SPARQL and Nepomuk in Tracker. We also added the class-signals feature, Turtle import & export, and many other features like SPARQL UPDATE support. Making the storage engine effectively a generic Nepomuk RDF store that can be used to store and query RDF data.

What will happen

We are at this moment planning to rearchitect Tracker a little bit.
Among our plans we want to make the RDF metadata store standalone. The store stores your metadata using Nepomuk as ontology and enables the application developer to query in SPARQL. This means that it’ll be possible to use this storage service without the indexer even installed. This is already possible but right now we do the crawling and monitoring in the storage service.

We plan to move the crawling and monitoring to the indexer. One idea is that the indexer will instruct the extractor to do an analysis and then the extractor will push the extracted metadata to the RDF storage service. Making the indexer and extractor a provider & consumer like any other. Making them optional and separately packagable.

This because we get requests from other teams who don’t want the indexing. Modularizing is usually a good thing, so we now have plans to make this possible as a feature.

Other plans

Other plans that we haven’t thoroughly planned yet include support for custom ontologies. We have a good idea for this, though. We want to wait for it until after the rearchitecturing. Support for custom ontologies will include removing ontologies, installing ontologies and asking for a backup that’ll contain the metadata that is specific for an installed ontology.

Support for custom ontologies doesn’t mean that application developers should all go spastic and start making ontologies. I know you guys! Don’t do it! We want applications to reuse as much of the Nepomuk set as possible. The more Nepomuk gets reused, the more interopability between apps is possible.

Volume support in experimental Tracker

Wednesday, April 1st, 2009

In Tracker’s trunk we have support for volumes. This means that we track removable devices appearing and disappearing. A removable device that disappears means that in your search result you wont see the resources that are on the disconnected removable device.

In trunk we keep a separate table with volume registrations for this.

In master we simply use the ontology. Which also means that we now make this information available to you as metadata in a clean way.

Some examples. List all the volumes that we know about:

tracker-sparql -q "SELECT ?o ?m ?z WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:unmountDate ?z }"

That will return something like this (wrapping the line):

urn:nepomuk:datasource:/org/freed../Hal/devices/volume_uuid_XXX_ABCDE,
    file:///media/USBStick, 2009-04-01T13:38:20

The ones that we know about but aren’t mounted:

tracker-sparql -q "SELECT ?o ?m WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:isMounted false } "

The ones that we know about and are mounted:

tracker-sparql -q "SELECT ?o ?m WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:isMounted true } "

Let’s just for fun list the volumes that got unmounted before a specific date and didn’t get mounted anymore:

tracker-sparql -q "SELECT ?o ?m WHERE {
   ?o a tracker:Volume ;
   tracker:mountPoint ?m ;
   tracker:isMounted false ;
   tracker:unmountDate ?z .
   FILTER (?z < \"`date -u +"%Y-%m-%dT%H:%M:%S%z"`\") }"

I'd like to add that replacing HAL isn't our intention. Not at all. We depend on HAL to track this information ourselves. We need to know the availability of a volume and because we link every file resource to the volume's resource we can that way know about the availability of a resource.

It's under Tracker's own ontology prefix, and it's not really to be considered as decided or stable ontology API yet. We might change some things about it. It wasn't even a design goal to make this publicly accessible.

Why am I showing it? Well, because it's a nice way to explain one of our sprint tasks for the the next two weeks. I promised you guys that I would talk about what we are up to at Tracker. So here you have it!

Documentation. Documentation. Documentation!

Sunday, March 22nd, 2009

As I was rereading Ivan Frade’s blog post about the class-signals feature that me and Ivan developed for Tracker last week I got reminded of something I wrote a long time ago:

In my opinion, there’s nothing as important as developer documentation for a framework or library.

Ivan decided to ~ dump our internal planning document on GNOME’s Live wiki service. He didn’t write the document so he’s of course not to blame, but the document was in my opinion not suitable as end user framework documentation. Application developers should not be required to understand what goes on in our team’s minds. So I rewrote this.

You can find the document at the SignalsOnChanges page. Let’s see if starting next week I can convince the other team members to document their new features in a similar way. Many new things in Tracker’s experimental branch could use more clarification and examples.

I happen to believe that undocumented libraries and frameworks aren’t meaningful. That’s because letting it remain undocumented the developer renders his work utterly irrelevant for a serious application developer. I also believe that undocumented infrastructure software is worse than no infrastructure software.

I strongly recommend to application developers to refrain from using any piece of undocumented library or framework. Don’t lower yourself to their standards. Demand documentation by refusing to use undocumented infrastructure. Doing it as free software is no excuse for delivering extremely low quality, like what undocumented infrastructure is (in my opinion).

Thanks to Ivan’s blog post for reminding me of what I once wrote, and today still believe in, myself. While passionately coding new features you sometimes forget about this.

Which is unacceptable.

Tracker’s class signals, live search capabilities

Thursday, March 19th, 2009

Ivan Frade already blogged this, so that that means that I can keep this one shorter.

But yeah, we have just finished what we decided to call class-signals for Tracker. You are looking at software development taking place as we speak, because we still need to do a peer-review of the branch and then merge it to our experimental branch. This will probably happen in a few minutes.

The new feature means that if the ontology of a rdfs:Class contains a rdfs:Property tracker:notify that is set to true, then Tracker will offer a DBus object for that class. Whenever a subject of that class is added, updated or removed you will receive a signal about it on that DBus object.

Ok, ok, I know. You don’t want to click links and search for it yourself. Here are some “screenshots”:

nmo:FeedMessage a rdfs:Class ;
         tracker:notify true ;
         rdfs:subClassOf nmo:Message .

The DBus API is split per rdfs:Class. That’s because we don’t want that a lot of applications will be listening for each and every change to each and every subject. Now they can instead select the classes that they are interested in. That way they can avoid getting woken up constantly by our damn signal.

This feature is a change signal mechanism. We will eventually implement support for full live-queries.

Supporting full live-queries means that we’ll need to store some state about your session to test whether your SPARQL query requires an update from Tracker to get your client’s previous result list synchronized. It’s quite a bit more tricky, especially if you want to support most of SPARQL to go together with the live-query notifications, and especially if you want to be accurate.

Finally you don’t want the live-query capability to be a huge performance burden each time material becomes updated.

This is why we are now putting in place this feature. We think that for 90% of the applications the capabilities of the class-signals feature will be sufficient for their live update needs.

Biweekly Tracker development planning

Tuesday, March 17th, 2009

We realize

At the Tracker team we realize that we should not only tell you guys about Tracker’s new features. We should also involve you in the planning and realization of software development at the Tracker project.

I have discussed this with our team and we agreed that I will make a summary at each beginning of our biweekly planning.

The reason is fairly simple: we want to involve the community and inspire them to become active contributors. But how can they contribute unless they know what the plans of the development team of Tracker for the next few days are? You are right, they can’t.

Which is a situation that we want to change. Therefor, here’s the summary of the plans for the next two weeks:

Ontologies

  • Making tickets for an audio ontology that we made during a previous sprint to upstream Nepomuk.
  • Craft an ontology for images that will later be proposed to Nepomuk. We will convert the extractors in our experimental branch to use the new image ontology.
  • Applications need to be able to create arbitrary key value pairs to existing resources. We plan to add a custom ontology on top of Nepomuk’s NAO ontology that will make this possible.

Fixing things in the 0.6.9x series

  • Changing unique value sorting to use collations and we’re planning to change the unique value queries for better performance.
  • Making sure tracker-indexer and tracker-extract support detecting unmount events, and marking resources as unavailable, at a very early stage. This helps cleanly unmounting a removable device without keeping the device busy as the unmount request takes place.
  • Supporting listening for a signal that tells us to pause the indexer. This will be helpful on devices that want us to be silent while they are performing a higher priority task.
  • Evaluating and testing concurrent access.

Our experimental branch

  • Start discussing merging our experimental branch to trunk together with Jamie.
  • We have updated the new .ontology files in our experimental branch to include more specific rdfs:range information.
  • Making it possible to observe changes about resources that belong to a class. For example observing changes in all nfo:Document classes. The granularity will be at the level of a resource. You’ll know whether a resource is updated, deleted or created. This task might take more than a week to complete. Ivan Frade is planning to write a blog item dedicated to this subject.
  • To fulfill the requirements of certain use-cases we’re planning to make it possible to delete a complete resource (or a set of) with one function call.
  • We might start working on adding support for application domain (custom) ontologies.
  • Porting recent improvements on support for detecting file renames to our experimental branch.
  • Evaluating isLogicalPartOf or alternative link as cascading rule for deletes.

When people are interested in joining development of Tracker, they can ping us at the channel #tracker on GimpNet. They can also send a E-mail to Tracker’s mailing list.