SPARQL’s str() function in Tracker

Today I implemented the str() function for our SPARQL engine.

This makes it possible to use a <subject> just like a string.

Let’s first insert some data into our SPARQL store.

tracker-sparql -u -q \
   "INSERT { <urn:baaa> a rdfs:Resource }"

The following query doesn’t work, because the variable ?s isn’t bound to an xsd:string here, but to an rdfs:Resource.

tracker-sparql -q \
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER REGEX (?s, '.*baaa', 's')
}"

This version works, because here we use the new str() function.

tracker-sparql -q \
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER REGEX (str(?s), '.*baaa', 's')
}"
  urn:uuid:94baaa45-99a6-e0f4-0bd9-f83ca90a9039
  urn:uuid:6e909006-a6ac-baaa-2ae4-cc01adcd5de7
  urn:baaa

You can also use a direct match, of course.

tracker-sparql -q \
"SELECT ?s WHERE {
	?s a rdfs:Resource .
	FILTER (str(?s) = 'urn:baaa')
}"
  urn:baaa

By the way, Ivan made a cute tool in Python for typing in your queries.

It even does some code completion. If you type nco:[TAB] it’ll show you the NCO ontology. Nice!

Async with the mainloop

A technique that we started using in Tracker is utilizing the mainloop to do asynchronous functions. We decided that avoiding threads is often a good idea.

Instead of instantly falling back to throwing work at a worker thread, we try to encapsulate the work in a GSource callback, and then let the callback run repeatedly until all of the work is done.
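The core of the trick is the GSourceFunc contract: return TRUE from the callback to be scheduled again, FALSE once the work is done. Here is a minimal sketch of the pattern (MyWork and my_work_remaining are placeholder names, not real API):

#include <glib.h>

static gboolean
do_chunk (gpointer user_data)
{
  MyWork *work = user_data;

  /* Do one small slice of the work here, then tell the
   * mainloop whether it should call us again. */

  return my_work_remaining (work) ? TRUE : FALSE;
}

/* Somewhere during setup: */
g_idle_add (do_chunk, work);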

An example

You probably know sqlite3’s backup API? If not, it’s fairly simple: you do sqlite3_backup_init, followed by a bunch of sqlite3_backup_step calls, finalizing with sqlite3_backup_finish. How does that work if we don’t want to block the mainloop?

I removed all error handling to keep the code snippet short. If you want that, you can take a look at the original code.
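The snippets assume a BackupInfo struct roughly like the following; the original code defines this elsewhere, so the field types here are inferred from how the functions below use them:

#include <glib.h>
#include <sqlite3.h>

typedef void (*OnBackupFinished) (GError *error, gpointer user_data);

/* Error domain and code; the real code declares these properly. */
#define DB_BACKUP_ERROR          g_quark_from_static_string ("db-backup-error")
#define DB_BACKUP_ERROR_UNKNOWN  0

typedef struct {
  sqlite3          *db;         /* source database */
  sqlite3          *backup_db;  /* destination database */
  sqlite3_backup   *backup;     /* ongoing backup handle */
  int               result;     /* last sqlite3_backup_step() result */
  OnBackupFinished  finished;
  gpointer          user_data;
  GDestroyNotify    destroy;
} BackupInfo;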

static gboolean
backup_file_step (gpointer user_data)
{
  BackupInfo *info = user_data;
  int i;

  for (i = 0; i < 100; i++) {
    /* Copy five pages per step; anything other than SQLITE_OK
     * (SQLITE_DONE or an error) means we are finished. */
    if ((info->result = sqlite3_backup_step (info->backup, 5)) != SQLITE_OK)
      return FALSE;
  }

  return TRUE;
}

static void
backup_file_finished (gpointer user_data)
{
  BackupInfo *info = user_data;
  GError *error = NULL;
  if (info->result != SQLITE_DONE) {
    g_set_error (&error, DB_BACKUP_ERROR,
                 DB_BACKUP_ERROR_UNKNOWN,
                 "%s", sqlite3_errmsg (info->backup_db));
  }
  if (info->finished)
    info->finished (error, info->user_data);
  if (info->destroy)
    info->destroy (info->user_data);
  g_clear_error (&error);
  sqlite3_backup_finish (info->backup);
  sqlite3_close (info->db);
  sqlite3_close (info->backup_db);
  g_free (info);
}

void
my_function_make_backup (const gchar *dbf, OnBackupFinished finished,
                         gpointer user_data, GDestroyNotify destroy)
{
  BackupInfo *info = g_new0 (BackupInfo, 1);

  info->user_data = user_data;
  info->destroy = destroy;
  info->finished = finished;

  sqlite3_open_v2 (dbf, &info->db, SQLITE_OPEN_READONLY, NULL);
  sqlite3_open ("/tmp/backup.db", &info->backup_db);

  info->backup = sqlite3_backup_init (info->backup_db, "main",
                                      info->db, "main");

  g_idle_add_full (G_PRIORITY_DEFAULT, backup_file_step,
                   info, backup_file_finished);
}
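For completeness, calling this could look as follows; the OnBackupFinished signature is the one inferred above, and the database path is just an example:

static void
on_backup_finished (GError *error, gpointer user_data)
{
  if (error != NULL)
    g_warning ("Backup failed: %s", error->message);
  else
    g_message ("Backup finished");
}

/* ... then, somewhere in your code: */
my_function_make_backup ("/home/user/mydata.db",
                         on_backup_finished, NULL, NULL);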

Note that I’m not suggesting that you throw away all your threads and GThreadPool uses now.
Note that, just like with threads, you have to be careful about shared data: this way you allow other events on the mainloop to interleave with your backup procedure. This is async(ish) behaviour, and it’s precisely what you want, of course.

More introduction to RDF and SPARQL

Introduction

I plan to give an introduction to features like COUNT, FILTER REGEX and GROUP BY, which are supported by Tracker’s SPARQL engine. We support more such features, but I have to start the introduction somewhere, and overloading people with introductions to all the features won’t help me much with explaining things.

Since my last introduction to RDF and SPARQL I have added a few relationships and actors to the game.

We have Morrel, Max and Sasha being dogs, Sheeba and Query are cats, Picca is still a parrot, and Fred and John are contacts. Fred claims that John is his friend. I changed the ontology to allow friendships between the animals too: Morrel claims Max is his friend, and Sasha claims that both Morrel and Max are her friends. Sheeba claims Query is her friend. John bought Query. Fred, being inspired by John, decided to also get some pets: Morrel, Sasha and Sheeba.

Ontology

Let’s put this story in Turtle:

<test:Picca> a test:Parrot, test:Pet ;
	test:name "Picca" .

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .

<test:Morrel> a test:Dog, test:Pet ;
	test:name "Morrel" ;
	test:hasFriend <test:Max> .

<test:Sasha> a test:Dog, test:Pet ;
	test:name "Sasha" ;
	test:hasFriend <test:Morrel> ;
	test:hasFriend <test:Max> .

<test:Sheeba> a test:Cat, test:Pet ;
	test:name "Sheeba" ;
	test:hasFriend <test:Query> .

<test:Query> a test:Cat, test:Pet ;
	test:name "Query" .

<test:John> a test:Contact ;
	test:owns <test:Max> ;
	test:owns <test:Picca> ;
	test:owns <test:Query> ;
	test:name "John" .

<test:Fred> a test:Contact ;
	test:hasFriend <test:John> ;
	test:name "Fred" ;
	test:owns <test:Morrel> ;
	test:owns <test:Sasha> ;
	test:owns <test:Sheeba> .

Querytime!

Let’s first start with all friend relationships:

SELECT ?subject ?friend
WHERE { ?subject test:hasFriend ?friend }

  test:Morrel, test:Max
  test:Sasha, test:Morrel
  test:Sasha, test:Max
  test:Sheeba, test:Query
  test:Fred, test:John

Just counting these is pretty simple. In SPARQL all selectable fields must have a variable name, so we add the “as c” here.

SELECT COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend }

  5

We counted friend relationships, of course. Let’s say we want to count how many friends each subject has. This is a more interesting query than the previous one.

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend }
GROUP BY ?subject

  test:Fred, 1
  test:Morrel, 1
  test:Sasha, 2
  test:Sheeba, 1

Actually, we’re only interested in the human friends:

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend .
        ?friend a test:Contact
} GROUP BY ?subject

  test:Fred, 1

No no, we are only interested in friends that are either cats or dogs:

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend .
       ?friend a ?type .
       FILTER ( ?type = test:Dog || ?type = test:Cat)
} GROUP BY ?subject

  test:Morrel, 1
  test:Sasha, 2
  test:Sheeba, 1

Now we are still only counting friends that are either a cat or a dog, but this time just for the subjects whose name starts with an ‘S’.

SELECT ?subject COUNT (?friend) AS c
WHERE { ?subject test:hasFriend ?friend ;
                 test:name ?n .
       ?friend a ?type .
       FILTER ( ?type = test:Dog || ?type = test:Cat) .
       FILTER REGEX (?n, '^S', 'i')
} GROUP BY ?subject

  test:Sasha, 2
  test:Sheeba, 1

Conclusions

Should we stop talking about ontologies and start talking about searchboxes and user interfaces instead? Although I certainly agree more UI-stuff is needed, I’m not sure yet. RDF and SPARQL are also about relationships and roles. Not just about matching stuff. Whenever we explain the new Tracker to people, most are stuck with ‘matching’ in their mind. They don’t think about a lot of other use-cases.

Such a search is just one use-case starting point: the user entered a random search string and gave no other indication of what he needs. Many more situations can be starting points: when I select a contact in a user interface designed to show an archive of messages that he once sent to me, the searchbox becomes much more narrow, much more helpful.

As soon as you have RDF and SPARQL, and with Tracker you do, an application developer can start taking into account relationships between resources: The relationship between a contact in Instant Messaging and the attachments in an E-mail that he as a person has sent to you. Why not combine it with friendship relationships synced from online services?

With a populated store you can make the relationship between a friend who joined you on a trip, and photos of a friend of your friend who suggested the holiday location.

With GeoClue integration we could link his photos up with actual location markers. You’d find these photos that came from the friend of your friend, and we could immediately feed the location markers to the GPS software on your phone.

I really hope application developers have more imagination than just global searchboxes.

And this is just a use-case that is technically already possible with today’s high-end phones.

Introduction to RDF and SPARQL

Let’s start with a relatively simple graph. The graph shows the relationships between John, Fred, Max and Picca. John and Fred are humans who we’ll refer to as contacts. Max and Picca are pets. Max is a dog and Picca is a parrot. Both Picca and Max are owned by John. Fred claims that John is his friend.

If we wanted to represent this story semantically, we would first need to make a dictionary that describes pets, contacts, dogs and parrots. The dictionary would also describe possible relationships, like ownership of a pet and the friendship between two contacts. Don’t forget: making something semantic means that you want to give meaning to the things that interest you.

Giving meaning is exactly what we’ll start with. We will write the schema for making this story possible. We will call this an ontology.

We describe our ontology using the Turtle format. In Turtle you can have prefixes. The prefix test:, for example, is the same as writing <http://www.tracker-project.org/ontologies/test#>.

In Turtle you describe statements by giving a subject, a predicate and then an object. The subject is what you are talking about. The predicate is what about the subject you are talking about. And finally the object is the value. This value can be a resource or a literal.

When you write a . (a dot) in Turtle it means that you stop describing the subject. When you write a ; (semicolon) it means that you continue with the same subject, but start describing a new predicate. When you write a , (comma) it means that you continue with the same subject and predicate, and only add another object. The same rules apply in the WHERE section of a SPARQL query. But first things first: the ontology.
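For example, these three statements about Max:

<test:Max> a test:Dog .
<test:Max> a test:Pet .
<test:Max> test:name "Max" .

can be written with the shorthand punctuation as:

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .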

Note that the “test” ontology is not officially registered at tracker-project.org. It serves merely as an example.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix test: <http://www.tracker-project.org/ontologies/test#> .

test: a tracker:Namespace ;
	tracker:prefix "test" .

test:Entity a rdfs:Class .

test:Contact a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Pet a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Dog a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:Parrot a rdfs:Class ;
	rdfs:subClassOf test:Entity .

test:name a rdf:Property ;
	rdfs:domain test:Entity ;
	rdfs:range xsd:string .

test:owns a rdf:Property ;
	rdfs:domain test:Contact ;
	rdfs:range test:Pet .

test:hasFriend a rdf:Property ;
	rdfs:domain test:Contact ;
	rdfs:range test:Contact .

Now that we have meaning, we will introduce the actors: Picca, Max, John and Fred. To try this at home: copy the @prefix lines of the ontology file from above into the instance data below, put the ontology file in the share/tracker/ontologies directory, and run tracker-processes -r before restarting tracker-store in master. After doing all that you can store the instance data as a /tmp/import.ttl file, run tracker-import /tmp/import.ttl, and it should import just fine. You are then ready to execute the queries below with the tracker-sparql -q ‘$query’ command.
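Summarised as shell steps (the ontology file name and the install prefix are illustrative, adjust them for your setup):

cp test.ontology /usr/share/tracker/ontologies/
tracker-processes -r                   # destroys all RDF data in Tracker!
tracker-store &                        # restart the store (master)
tracker-import /tmp/import.ttl
tracker-sparql -q "SELECT ?s WHERE { ?s a test:Pet }"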

Note that tracker-processes -r destroys all your RDF data in Tracker. We don’t yet support adding custom ontologies at runtime, so for doing this test you have to start everything from scratch.

<test:Picca> a test:Parrot, test:Pet ;
	test:name "Picca" .

<test:Max> a test:Dog, test:Pet ;
	test:name "Max" .

<test:John> a test:Contact ;
	test:owns <test:Max> ;
	test:owns <test:Picca> ;
	test:name "John" .

<test:Fred> a test:Contact ;
	test:hasFriend <test:John> ;
	test:name "Fred" .

Let’s do some simple SPARQL queries. You can execute these queries this way:

tracker-sparql -q "SELECT ?subject WHERE { ?subject a test:Parrot }"

In this query we ask for the subject of each entity that is a parrot. The query will yield test:Picca because Picca is the only parrot in our situation.

  test:Picca

Usually we aren’t interested in the subject, but in a real property of the parrot. We can ask for such a property this way:

SELECT ?subject ?name WHERE { ?subject a test:Parrot ; test:name ?name}
  test:Picca, Picca

Another simple example, give me all the contacts:

SELECT ?subject WHERE { ?subject a test:Contact }
  test:John
  test:Fred

Just the contacts doesn’t illustrate much. Give me all contacts that have a friend, and display the contact’s and the friend’s names:

SELECT ?name ?friend
WHERE { ?subject test:hasFriend ?f ;
                 test:name ?name .
        ?f test:name ?friend }
  Fred, John

Let’s ask for all the pets that are owned:

SELECT ?subject WHERE { ?unknown test:owns ?subject }
  test:Max
  test:Picca

Oh, not the subject. The names. How did we do that again? Right:

SELECT ?name
WHERE { ?unknown test:owns ?subject .
        ?subject test:name ?name }
  Max
  Picca

This will of course yield the same results in our situation:

SELECT ?name
WHERE { <test:John> test:owns ?subject .
        ?subject test:name ?name }
  Max
  Picca

But this won’t, as Fred doesn’t own any pets. Only John owns pets.

SELECT ?name
WHERE { <test:Fred> test:owns ?subject .
        ?subject test:name ?name }

Let’s print the owner’s and the pet’s names:

SELECT ?owner ?name
 WHERE { ?unknown test:owns ?subject ;
                  test:name ?owner .
         ?subject test:name ?name }
  John, Max
  John, Picca

Still with me? Let’s now conclude with requesting the names of the contacts who are a friend of the person who owns Picca:

SELECT ?name
WHERE { ?subject test:owns <test:Picca> .
        ?unknown test:hasFriend ?subject ;
                 test:name ?name }
  Fred

An invitation for Jürg and Rob: how about you guys writing an introduction to OPTIONAL, SUM, COUNT, GROUP BY, FILTER, etc. in SPARQL? :-) The more advanced stuff.

The subject of a resource, Nepomuk’s isStoredAs

After the many discussions the Tracker team had at the Desktop Summit in Gran Canaria, I think a lot of people will start trying out Tracker’s master. We will indeed start making 0.7.x releases somewhere this or next month.

Meanwhile I’d like to point out that among the decisions that we made during the meetings and at the Ontology BOFs is that we won’t use the URL of resources as the RDF subject field anymore. Instead we’ll use the nie:isStoredAs predicate for storing the URL.

Right now we already set nie:isStoredAs, but we still use the URL as subject. This will change, though. Just treat the subject as a unique piece of data about the resource, pointing at it (in the RDF store). More details can be found here. If you want the thing itself (the file, the E-mail, the .desktop file, the website’s URL), ask for nie:isStoredAs.

For example:

<file:///tmp/myfile.png> a nfo:FileDataObject .
<urn:nepomuk:file:d7ea...> a nfo:Image ;
	nie:isStoredAs <file:///tmp/myfile.png> .

And to query:

tracker-sparql -q "SELECT ?url WHERE { ?subject a nfo:Image ; nie:isStoredAs ?url }

We know that many people want these 0.7.x releases to happen soon. I can only invite those people to just join coding. Awesome stuff is indeed taking place, but at the same time there is a lot of work and decision making to do.

Things like a user interface akin to the T-S-T (Tracker Search Tool) from Tracker 0.6, and documentation with a lot of examples. SPARQL, SPARQL Update and Nepomuk all have quite a lot of documentation by themselves, but people are still asking for even more examples. Anybody interested in making that? Maybe somebody who was at Rob Taylor’s BOF could write down his and Jürg’s lectures on RDF and SPARQL? I think they explained it all very well.

Tracker experimental merged to main development tree, Ivan’s presentation

I’m currently involved in the Tracker project and our project will be presented by Ivan Frade at the Desktop Summit this Sunday.

We merged our experimental branch tracker-store to master. This means that our rearchitecture plans for Tracker have mostly been implemented and are now being pushed forward in the main development tree.

I will start with a comparison with Tracker’s 0.6.x series.

Tracker master:

  • Uses SPARQL as query language
  • Uses Nepomuk for its base ontologies
  • Supports SPARQL Update
  • Supports aggregates COUNT, AVG, SUM, MIN and MAX in SPARQL
  • Operates all its storage functionality as a separate binary
  • Operates all its indexing, crawling and monitoring functionality in a separately packageable binary

Tracker 0.6.9x:

  • Uses RDFQuery as query language
  • Has its own ontology
  • Has very limited support for storing your own data
  • Supports several aggregate functions in its query language
  • Operates all its storage functionality in the indexer
  • Operates all its query functionality in the permanent daemon
  • Does file monitoring and crawling in the permanent daemon
  • Operates all its indexing functionality in a separately packageable binary

Tracker master:

Architecture

The storage service uses the Nepomuk ontologies as schema. It allows you to both query and insert or delete data.

The fs-miner finds, crawls and monitors file resources. It also analyses those files and extracts the metadata. It instructs the storage service to store the metadata.

External applications and other such miners are allowed to both query and insert using the storage service. Priority is given to queries over stores.

Plugins that run in-process with the application can push information into Tracker. We indeed don’t try to scan Evolution’s cache formats: we have a plugin that gets the data out of Evolution and into Tracker.

Storage service’s API and IPC

The storage service gives priority to SELECT queries to ensure that apps in need of metadata get serviced quickly.

INSERT and DELETE calls get queued; SELECT ones get executed immediately. For apps that require consistency and/or insertion speed we provide a batch mode that has a commit barrier. When the commit calls back, you know that everything that came before it is in a consistent shape. We don’t support full transactions with rollback.

The standard API operates over DBus. This means that while using it you are subject to DBus’s performance limitations. SPARQL Update makes it possible to group a lot of writes, and due to DBus’s latency overhead this is recommended when inserting larger sets of data. We’re experimenting with a custom IPC system, based on Unix sockets, to get increased throughput for apps that want to put a lot of INSERTs onto our queue.
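For example, rather than issuing one call per statement, a client can group several inserts into a single SPARQL Update string (the urn:example resource names are just placeholders):

INSERT {
  <urn:example:a> a rdfs:Resource .
  <urn:example:b> a rdfs:Resource .
  <urn:example:c> a rdfs:Resource .
}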

We provide a feature that signals on changes happening to certain types. You can see this as a poor man’s live search. Full live search for SPARQL is fairly complicated; maybe in the future we’ll implement something like that.

Ontology

We support the majority of the Nepomuk base ontologies, and our so-called filesystem miners store the metadata they find using Nepomuk’s ontologies. We support static custom ontologies right now. This means that it’s impossible to dynamically add a new ontology unless you reset the entire database first.

We’re planning to support dynamically adding and removing ontologies. The ontology format that we use is Turtle.

Backup and import

Right now we support loading data into our database using SPARQL Update, using an experimental Unix-socket based IPC, or by passing us a Turtle file.

We currently have no support for making a backup. Support for this is high on our planning; it will write a Turtle file (which can be loaded back afterward).

Backup and import of ontology specific metadata

When we introduce support for custom ontologies, it’ll be useful for apps that provide their own custom ontology to get a backup of just the data that is relevant to said ontology. We plan to provide a method to do that.

Volume support

We have a static custom ontology for volume support, so volumes and their status are queryable over SPARQL. File resources also get linked to said volumes. This makes it possible to query the availability of a file resource. For example: return metadata about all photos that are located on a specific camera, even though the camera isn’t connected to this device.
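Such a query could look roughly like the sketch below; note that the volume predicate and the volume URN are illustrative placeholders, not the actual ontology terms:

# Sketch only: example:storedOnVolume and the URN are illustrative
SELECT ?photo
WHERE { ?photo a nfo:Image ;
               example:storedOnVolume <urn:volume:my-camera> }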

Volume support is a work in progress at this moment.