Why do you need Tracker?

(Or why our project’s name wasn’t wrong after all)

First and foremost, because the Internet isn’t available everywhere, all the time. To put it simply: 3G (and 4G) suck. The latency is a joke even in the most modern countries, even at the center of their capital cities. Reliable availability of the Internet simply doesn’t exist for most people. Not even after a decade of promises and billion-dollar investments from advertising firms like Google. Balloons that bring the Internet everywhere? Yeah, sure.

I invite you to try a serious Google Maps use-case in a Swiss tunnel. The kind of use-case that nineties Microsoft Automap / Streets & Trips-like software easily managed decades ago for planning vacations and trips across Europe (without the Internet). Or how about reading the newspaper on an airplane? Pre-1700 technology could do it. Today Google’s tablets and glasses can’t, because I have no reliable Internet (the flight attendant will reliably hand me a newspaper, though; go ahead, try it, Sergey). And if Google installed 3G routers in those tunnels, airplanes, forests, all third-world countries, the seas and the truly remote areas of the planet, I could still come up with a lot more places. Everybody can. By the time the Googles of today are finished, we’ll all be travelling to Mars with latencies of up to an hour, while Google’s data still only travels at the speed of light (that, or they fix quantum entanglement to be workable and, more importantly, scalable to billions of users).

It’s great that we can sometimes use Google Maps, sure. But if I can’t rely on it always and everywhere, then in embedded terms it doesn’t exist. I want my Boeing 747’s landing software to work (oh, and when we land on Mars, too). Always. I want software to land us safely. Same for my car’s ABS, and the subway’s and train’s braking systems. If those systems fail when I need them, I’ll probably be dead. During those moments I don’t care about Google fans and their Google HTML5 religion. Screw HTML5, JavaScript, WebSocket and 3G, and thank you, interrupt-based real-time kernels that open my airbag, stop the tram and land the airplane.

For the car industry it’s probably cheaper to provide a storage hardware upgrade when the car gets serviced than it would be to sell your company’s soul to privacy-invading cloud hosting services. Because in the future I would like you to provide Facebook-like services in my car, as reliably as my airbag works. Without the Internet having to be everywhere. I want you to deliver it to me when I’m visiting Mars. And my kids … who knows what they’ll want?!

Your embedded technology needs to provide graph data about the user’s activity to the services your business wants to share that data with. I’ll illustrate this with This-is-Possible-Today use cases:

– The fridge contains no more milk. While walking down the street looking at his smartphone, the user opens a recipe for a meal that requires milk. When the user is at the supermarket, the technology that will in future be installed on supermarket shopping carts (or his glasses) needs to show the recipe and its ingredients, and highlight the fact that the fridge doesn’t contain some of the required ingredients (milk, sugar, butter). And if the user allows this, advertise different brands of milk, sugar and butter based on who paid most, plus his wife’s buying habits.

– Your kid talked with a school friend on Facebook about an amusement park (De Efteling! Phantasialand!). Your wife decides that because the weather is good this weekend, the family should (will) go to an amusement park (yes, she’s the boss. And that’s fine: you own the car – don’t worry, she’ll drive the way back so you can have your weekend nap). So the entire family gets into the (your) car and you ask your son: which amusement park shall I drive to?! Your kid opens the infotainment system in the back seat of the car and sees what he has been interested in over the last few weeks. After privacy authorization (or not), you as the driver see the list on the dashboard infotainment system. You select it and the navigation software of the (your) car navigates there. What a dad! Meanwhile the front passenger seat’s infotainment system goes to the amusement park’s ticket-ordering website. What a mom! Advertising related to amusement parks and ticket vending is shown. Of course! Phantasialand!

That is why you need Tracker’s Nepomuk-based storage with its SPARQL querying and updating capability.

It lets your embedded appliance do what Facebook does, but in a lightweight way, isolated (or not) from the rest of the world. You decide what happens with the data and who receives it. That lets you build a trust relationship with your customers and consumers. You are the industry providing those cars, fridges and TV sets.
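
To make that concrete, here’s a minimal sketch of what talking to such an isolated store could look like with libtracker-sparql (assuming the 0.10-era C entry points; the nao: tagging terms are real Nepomuk vocabulary, but the fridge URN and the ‘needs-milk’ label are made up for illustration):

#include <libtracker-sparql/tracker-sparql.h>

int
main (void)
{
  GError *error = NULL;
  TrackerSparqlConnection *connection;

  connection = tracker_sparql_connection_get (NULL, &error);
  if (connection == NULL) {
    g_printerr ("Couldn't connect: %s\n", error->message);
    return 1;
  }

  /* Record a fact in the local store: the (made-up) fridge resource
   * is out of milk, expressed with Nepomuk's nao: tagging terms. */
  tracker_sparql_connection_update (connection,
      "INSERT { _:t a nao:Tag ; nao:prefLabel 'needs-milk' . "
      "         <urn:appliance:fridge> nao:hasTag _:t }",
      G_PRIORITY_DEFAULT, NULL, &error);

  /* Any service the user authorizes can later ask for that fact with
   * a plain SPARQL SELECT; no cloud round-trip is required. */

  g_object_unref (connection);
  return error == NULL ? 0 : 1;
}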

As a BMW driver myself, I would stop buying BMW cars as soon as I learned that BMW sells my driving habits (or whatever) to Google or Facebook (or the NSA). Today I’d trust BMW to integrate those habits into the infotainment system of my next car. IF I can trust BMW. I think that in future it’ll be the difference between succeeding and failing as an industry.

With Tracker you get the use-cases and features. But it’s not for free: you must hire brains instead of paying Google’s or Facebook’s marketing boys. Does that come as a surprise? It has always been that way in tech: brains and hard work innovate. Ask Wernher von Braun. They landed on the wrong planet, but in the sixties and seventies his rockets got us to the moon.

A car is a car. A fridge is a fridge. A TV set is a TV set. They shouldn’t be Google’s or Facebook’s data-mining devices. Besides, why would you give the data away? Your appliances collected it, not theirs. Talk with your customers fairly and openly about how it can and can’t be used.

As managers in these industries, it’s up to you to solve the crisis of social features vs. privacy mining.

Kind regards, from one of the guys who developed such technology for Nokia’s N9.

Morals? Forbidding stuff?

It isn’t freedom to have to adopt Richard Stallman’s world view. It isn’t ‘freedom’ to be called immoral just because you choose another ethic. It isn’t freedom when a single person or group with a single view on morality tries to forbid you something based on just their point of view.

For example, Stallman has repeatedly said about Trusted Computing (which he, rather childishly, calls Treacherous Computing) that it ‘should be illegal’ (that’s a quote from official FSF and GNU pages). I also recall Stallman trying to forbid blog posts about proprietary software (it was about VMware) on planet-gnome (original thread here).

Richard Stallman and some of his followers don’t seem to understand that it isn’t necessarily moral to impose your world view about morality on everybody else by claiming, for example, that the other’s view ‘should be illegal’ or ‘is immoral’ (terms that he and some of his followers frequently use).

Firstly, something should only be illegal once all the procedures for making a new law in a country have been followed. In most democratic countries that means getting a majority in parliament, but also getting advice from your country’s judges and from experts in the field who’ll be affected by your new law. Not just listening to and following a guy like Richard Stallman blindly. This is why I was very much against a rule for planet-gnome forbidding posts about proprietary software that uses GNOME: neither the majority of GNOME Foundation members, nor all experts in the field who’d be affected by that new rule, nor all the maintainers of planet-gnome (its judges) followed Richard’s opinion.

In this new situation, too, Richard Stallman isn’t the only one who should be listened to. Ubuntu needs to take all stakeholders into account, not just Stallman and his followers.

Secondly, morality is defined by a person’s own views and, for a huge part, by that person’s culture. ‘How we ought to live’ is (also) a question at the individual level, not by definition answered by Richard Stallman alone. Although, sure, it can be one’s choice to strictly copy Richard’s morals. Morality is not necessarily a single option, nor is it necessarily written in a single book.

For me it’s not fine when your morality includes forcing others to copy your morals exactly. To put it in a way that Richard’s strict followers might understand: for me, morality isn’t like the GPL; agreeing with some of Stallman’s morals does not mean having to puristically copy them all.

How I think companies like Jolla should do it

I’ll focus on the technical stuff; I think I would only Peter Principle myself if I tried to give management advice.

What I’ve seen too many of are community projects, companies or groups who think that the synchronization of Harmattan with Moblin or MeeGo was done well enough to make what is now the OS on the N9. Luckily Jolla is hiring Harmattan staff, so they understand the situation.

For me it was always clear that “MeeGo” was a more or less failed PR thing between Intel and Nokia. By the time the N9 was first released, Harmattan hadn’t technically been synchronized with Moblin or MeeGo very much. And after several updates of Harmattan it still isn’t.

The situation on the N9 now is an OS that bears relatively little technical resemblance to “MeeGo”. For me the N9’s software is Harmattan, or Maemo 6. It’s the continuation of the software on the N900: Maemo 5, or Fremantle (after roughly two or three rather big rewrites, that much is true). That the rewrites happened doesn’t mean that during those rewrites Harmattan suddenly became MeeGo. MeeGo is, in other words, a different platform.

A successful project will have to work with what Harmattan is, and not try to replace it with what MeeGo is today. If they do want to end up with “MeeGo” on an N9, they will have to progressively improve Harmattan towards that goal: for example by asking Nokia to open closed components, by developing fixes for the software that is already open source (a lot of it is), by repackaging it, and by explaining to N9 owners how to add a repository and upgrade their phone safely.

I understand the idea isn’t to deploy on an N9, but if you want a new phone or device that resembles what the N9 is: the N9’s software is in my opinion not MeeGo but Harmattan. Rewrites have happened too often already. It’s my opinion that yet another rewrite of Harmattan isn’t a good idea at all.

For example, replacing the Debian package management system with RPM doesn’t sound like a viable option to me at all. Nor is replacing any of the major middleware really doable within the timeframe in which you’d have to deliver to stay relevant.

Instead, improve the phone’s OS software project by software project. Kind of like how Ximian did Red Carpet many years ago (which also supported multiple package management systems).

No more big rewrites, no more starting from scratch. No more politics about how it should have been done. Start with the platform as it is. There are reasons why the OS is good, and among the reasons is that good middleware choices and compromises were made.

Kind regards, good luck.

 

We delivered

Damn, guys, we’re too shy about what we delivered. When the N900 was made public we flooded the planets with our blog posts about it. And now?

I’m proud of the software on this device. It’s good. Look at what Engadget is writing about it! Amazing. We should all be proud! And yes, I know about the turbulence in Nokia-land. Deal with it, it’s part of our job. Para-commandos don’t complain that they might get shot. They just know. It’s called research and development! (I know, bad metaphor)

I don’t remember that many good reviews about even the N900, and that phone was seen by many of its owners as among the best they’ve ever owned. Now is the time to support Harmattan the same way we passionately worked on the N900 and its predecessor tablets (N810, N800 and 770). Even if the N9’s future is uncertain: who cares? It’s mostly open source! And not open source in the ‘Android way’. You know what I mean.

The N9 will be a good phone. The Harmattan software is awesome. Note that Tracker and QSparql are being used by many of its standard applications. We have always been allowed to develop Tracker the way it’s supposed to be done. Like many other similar projects: in upstream.

As for the short-term future, I can announce that we’re going to make Michael Meeks happy by finally solving the ever-growing-journal problem. Michael repeatedly and rightfully complained about this to us at conferences. Thanks Michael. I’ll write about how we’ll do it soon. We have some ideas.

We have many other plans for the long term. But for now let’s work step by step. Our software, at least what goes to Harmattan, must be rock solid and very stable from now on. Introducing a serious regression would be a catastrophe.

I’m happy because with that growing-journal problem I can finally focus on a tough coding problem again. I don’t like bugfixing-only periods. But yeah, I have enough experience to realize that sometimes this is needed.

And now, now we’re going to fight.

IPC performance improvements for insert queries

Although with SQLite’s WAL we now have direct access, we don’t support direct access for insert and delete SPARQL queries. Those queries, when made using libtracker-sparql, still go over D-Bus using Adrien’s FD-passing D-Bus IPC technique. The library does that for you.

After investigating a performance analysis by somebody from Intel, we learned that there is still a significant overhead per IPC call. In the analysis, the person made miner-fs combine multiple insert transactions and send them over as a single big transaction. This was noticeably faster than making many individual IPC requests.

The problem with this is that if one of the many insert queries fails, they all fail: not good.

We’re now experimenting with a private API that allows you to pass n individual insert transactions, and get n errors back, using one IPC call.
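
Treat the names and the exact signature as provisional (the API is private and may still change), but the shape we’re experimenting with looks roughly like this:

static void
update_array_cb (GObject *source, GAsyncResult *res, gpointer user_data)
{
  GError *error = NULL;
  guint i;

  /* One slot per input query: NULL on success, a GError on failure.
   * A single bad INSERT no longer fails the whole batch. */
  GPtrArray *errors = tracker_sparql_connection_update_array_finish (
      TRACKER_SPARQL_CONNECTION (source), res, &error);

  for (i = 0; errors != NULL && i < errors->len; i++) {
    if (g_ptr_array_index (errors, i) != NULL)
      g_warning ("Query %u failed", i);
  }

  if (errors != NULL)
    g_ptr_array_unref (errors);
}

static void
batch_inserts (TrackerSparqlConnection *connection)
{
  const gchar *queries[] = {
    "INSERT { <urn:example:a> nie:title 'a' }",
    "INSERT { <urn:example:b> nie:title 'b' }",
  };

  /* n queries in, one IPC call, n individual results back */
  tracker_sparql_connection_update_array_async (connection,
      (gchar **) queries, G_N_ELEMENTS (queries),
      G_PRIORITY_DEFAULT, NULL, update_array_cb, NULL);
}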

The numbers are promising even on Desktop D-Bus (the test):

$ cd tests/functional-tests/
$ ./update-array-performance-test
First run (first update then array)
Array: 0.103675, Update: 0.139094
Reversing run (first array then update)
Array: 0.290607, Update: 0.161749
$ ./update-array-performance-test
First run (first update then array)
Array: 0.105920, Update: 0.137554
Reversing run (first array then update)
Array: 0.118785, Update: 0.130630
$ ./update-array-performance-test
First run (first update then array)
Array: 0.108501, Update: 0.136524
Reversing run (first array then update)
Array: 0.117308, Update: 0.151192
$

We’re now deciding whether or not the API will become public; returning arrays of errors isn’t exactly ‘nice’ or ‘standard’.

LRU cache for prepared statements in Tracker’s RDF store

While trying to handle a bug that had a description like “if I do this, tracker-store’s memory grows to 80MB and my device starts swapping”, we were surprised to learn that a sqlite3_stmt consumes about 5 KB of heap. Ouch.

Before, we didn’t think those prepared statements were very large, so we threw all of them into a hashtable in case the query was run again later. However, if you collect thousands of such statements, memory consumption obviously grows.

We decided to implement an LRU cache for these prepared statements. For clients that access the database using direct access, the cache will be smaller, so that maximum consumption is only a few megabytes. Because our INSERT and DELETE queries are more reusable than SELECT queries, we split it into two separate caches per thread.

The implementation is done with a simple intrusive linked ring list. We’re still testing it a little bit to get good cache-size numbers. I guess it’ll go into master soon. For your testing pleasure you can find the branch here.
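
The actual code is in the branch, but the idea fits in a few lines. A minimal sketch of such an intrusive ring-list LRU, where the hashtable gives O(1) lookup and the ring gives O(1) move-to-front and eviction (names and error handling are illustrative, not Tracker’s exact code):

#include <sqlite3.h>
#include <glib.h>

typedef struct StmtNode StmtNode;

struct StmtNode {
  gchar *sql;
  sqlite3_stmt *stmt;
  StmtNode *prev, *next;          /* intrusive ring links */
};

typedef struct {
  GHashTable *index;              /* sql string -> StmtNode */
  StmtNode *head;                 /* most recently used; head->prev is the LRU tail */
  guint size, max_size;
} StmtLruCache;

static void
lru_unlink (StmtLruCache *cache, StmtNode *node)
{
  if (node->next == node) {
    cache->head = NULL;
  } else {
    node->prev->next = node->next;
    node->next->prev = node->prev;
    if (cache->head == node)
      cache->head = node->next;
  }
}

static void
lru_push_head (StmtLruCache *cache, StmtNode *node)
{
  if (cache->head == NULL) {
    node->prev = node->next = node;
  } else {
    node->prev = cache->head->prev;
    node->next = cache->head;
    node->prev->next = node;
    node->next->prev = node;
  }
  cache->head = node;
}

sqlite3_stmt *
stmt_cache_get (StmtLruCache *cache, sqlite3 *db, const gchar *sql)
{
  StmtNode *node = g_hash_table_lookup (cache->index, sql);

  if (node != NULL) {                    /* hit: move to front, reuse stmt */
    lru_unlink (cache, node);
    lru_push_head (cache, node);
    sqlite3_reset (node->stmt);
    return node->stmt;
  }

  if (cache->size == cache->max_size) {  /* full: evict the LRU tail */
    StmtNode *tail = cache->head->prev;
    lru_unlink (cache, tail);
    g_hash_table_remove (cache->index, tail->sql);
    sqlite3_finalize (tail->stmt);
    g_free (tail->sql);
    g_slice_free (StmtNode, tail);
    cache->size--;
  }

  node = g_slice_new0 (StmtNode);        /* miss: prepare and cache */
  node->sql = g_strdup (sql);
  sqlite3_prepare_v2 (db, sql, -1, &node->stmt, NULL);
  g_hash_table_insert (cache->index, node->sql, node);
  lru_push_head (cache, node);
  cache->size++;

  return node->stmt;
}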

Support for SPARQL IN and NOT IN, the new class signals

I wrote some documentation about our SPARQL IN feature that we recently added, including some interesting use-cases like doing an INSERT and a DELETE based on IN values.

For the new class-signal API that we’re developing this week and next, we’ll probably emit the IDs that tracker:id() would give you if you used it on a resource. This makes IN very useful for fetching the metadata of exactly those resources whose IDs you just received from the class signal.

We never documented tracker:id() very much, as it’s not an RDF standard; it’s something Tracker-specific. But neither are the class signals an RDF standard; they are Tracker-specific too. I guess that makes them usable in combination, and renders the ‘internal API’ status irrelevant.

We’re right now prototyping the new class signals API. It’ll probably be a “sa(iii)a(iii)”:

That’s the class name plus two arrays of subject-id, predicate-id, object-id. The class name is there to allow D-Bus filtering. The first array holds the deletes and the second the inserts. We’ll only give you object-ids of non-literal objects (literal objects have no internal object-id). This means that we don’t send literals to you in the signal (you need to make a query to get them; we’ll send you 0 in the signal).

We give you the object-ids because of a use-case that we didn’t cover yet:

Given the triple <a> nie:isLogicalPartOf <b>: when <a> is deleted, how do you know <b> at signal time? The feature request was to be able to answer select ?b { <a> nie:isLogicalPartOf ?b } for a deleted <a> (by then, the client can’t run that query anymore).

With the new signal we’ll give you the ID of <b> when <a> is deleted. We’ll also implement a tracker:uri(integer id), allowing you to get <b> out of that ID. It’ll do something like this, but much faster: select ?subject { ?subject a rdfs:Resource . FILTER (tracker:id(?subject) IN (%d)) }
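
Putting the class signals and IN together, a client could handle the prototype signal roughly like this (the payload follows the “sa(iii)a(iii)” sketch above; since the interface and signal names aren’t settled yet, the subscription itself is elided):

#include <gio/gio.h>

/* Connect this with g_dbus_connection_signal_subscribe() once the
 * interface and signal names of the class-signals API are settled. */
static void
on_class_signal (GDBusConnection *connection, const gchar *sender,
                 const gchar *object_path, const gchar *interface_name,
                 const gchar *signal_name, GVariant *parameters,
                 gpointer user_data)
{
  const gchar *class_name;
  GVariantIter *deletes, *inserts;
  gint subject_id, predicate_id, object_id;
  GString *ids = g_string_new (NULL);

  g_variant_get (parameters, "(&sa(iii)a(iii))",
                 &class_name, &deletes, &inserts);

  while (g_variant_iter_next (inserts, "(iii)",
                              &subject_id, &predicate_id, &object_id))
    g_string_append_printf (ids, "%s%d", ids->len > 0 ? "," : "", subject_id);

  if (ids->len > 0) {
    /* fetch metadata for exactly the changed resources */
    gchar *query = g_strdup_printf (
        "SELECT ?r ?title { ?r a rdfs:Resource ; nie:title ?title . "
        "FILTER (tracker:id(?r) IN (%s)) }", ids->str);
    /* ... run the query with libtracker-sparql ... */
    g_free (query);
  }

  g_variant_iter_free (deletes);
  g_variant_iter_free (inserts);
  g_string_free (ids, TRUE);
}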

I know there will be people screaming for all objects, also literals, in the signals, but we don’t want to flood your D-Bus daemon with all that data. Scream all you want. Really, we don’t. Just do a roundtrip query.

“You’re just making an excuse” is a relative phrase

I recently stumbled upon this marvelous piece. I title the quote “making an excuse”:

Saying that you’re forced to do something when you really aren’t is a failure to take responsibility for your actions. I generally don’t think users of proprietary software are primarily to blame for the challenges of software freedom — nearly all the blame lies with those who write, market, and distribute proprietary software. However, I think that software users should be clear about why they are using the software. It’s quite rare for someone to be compelled under threat of economic (or other) harm to use proprietary software. Therefore, only rarely is it justifiable to say you have to use proprietary software. In most cases, saying so is just making an excuse.

Bradley M. Kuhn – 2010, on his blog

I’ll translate this for you to Catholicism. You can definitely adapt this to most religions (for some, add death penalties like stoning here and there):

Saying that you’re forced by your nature to masturbate when you really aren’t is a failure to take responsibility for your actions. The church generally doesn’t think masturbaters are primarily to blame for the challenges of sexuality — nearly all the blame lies with pornography. However, I think that people who masturbate should be clear about why they have sex with themselves: It’s quite rare for someone to be compelled under the desire of sexual pleasure. Therefore, only rarely is it justifiable to say you have to masturbate. In most cases, saying so is just making an excuse.

The translation

There you go.

Tracker this, Tracker that, everything Tracker

Busy handling

I wrote an article about reporting busy status in Tracker before.

Back then it wasn’t yet possible to queue a query while Tracker’s RDF store is busy. We’re making this possible in the next unstable release. Yeah, I know you guys hate that Tracker’s RDF store can be busy at all. But tell us: what else should we do while restoring a backup, or while replaying the journal?

While we are replaying the journal or restoring a backup, we’ll accept your result-hungry queries into our queue. Meanwhile you get progress and status indication over a D-Bus signal. Some documentation about this is available here.
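
For instance, with GDBus a client could watch that progress signal like the sketch below. I’m quoting the org.freedesktop.Tracker1.Status interface and its Progress(s, d) signature from memory, so check the linked documentation for the authoritative names:

#include <gio/gio.h>

static void
on_progress (GDBusConnection *connection, const gchar *sender,
             const gchar *object_path, const gchar *interface_name,
             const gchar *signal_name, GVariant *parameters,
             gpointer user_data)
{
  const gchar *status;
  gdouble progress;

  /* e.g. status = "Replaying journal", progress = 0.42 */
  g_variant_get (parameters, "(&sd)", &status, &progress);
  g_print ("%s: %d%%\n", status, (gint) (progress * 100));
}

static guint
watch_store_status (GDBusConnection *connection)
{
  return g_dbus_connection_signal_subscribe (connection,
      "org.freedesktop.Tracker1",           /* sender */
      "org.freedesktop.Tracker1.Status",    /* interface (from memory) */
      "Progress",                           /* signal (from memory) */
      "/org/freedesktop/Tracker1/Status",   /* object path */
      NULL, G_DBUS_SIGNAL_FLAGS_NONE,
      on_progress, NULL, NULL);
}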

SPARQL 1.1 Draft features: IN and NOT IN

We had a feature request for supporting SPARQL IN and NOT IN. As usual, we’re ahead of the SPARQL draft specification, but I don’t think IN and NOT IN will look much different in the end. Anyway, it was straightforward, so I just implemented both.

It goes like this:

SELECT ?abc { ?abc a nie:InformationElement ;
                   nie:title ?title .
               FILTER (?title IN ('abc', 'def')) }
SELECT ?abc { ?abc a nie:InformationElement ;
                   nie:title ?title .
               FILTER (?title NOT IN ('xyz', 'def')) }

It’s particularly useful for getting metadata about a defined set of resources (give me the author of this, this and that file).
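
For example, something along these lines; I’m using nco:creator as the ‘author’ property, which is an assumption (pick whichever property your ontology actually uses), and the file URLs are made up:

SELECT ?file ?creator {
  ?file a nfo:FileDataObject ;
        nie:url ?url ;
        nco:creator ?creator .
  FILTER (?url IN ('file:///home/me/a.pdf',
                   'file:///home/me/b.pdf',
                   'file:///home/me/c.pdf'))
}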

Direct access

This work is progressing nicely. Most of the guys on the team are working on it, and it’s going to be awesome thanks to SQLite’s WAL journal mode. WAL is still under development and probably unstable here and there, but we’re trusting the SQLite guys on this anyway.

What is left to do for direct-access is cleaning up a bit, getting the small nasty things right. You know. The basics are all in place now.

We’re writing most of the library code in Vala, but clever people can easily imagine the C API that valac generates from the .vala files here. That’s the abstract API that client developers will use, unless they use a higher-level API like libqttracker, QSparql, Hormiga or sparql-glib.

All of those still need to be adapted to the direct-access work that we’re doing. But we’re in close contact with all of the developers involved in those libraries, and they’re all thrilled to implement backends for the new stuff.
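
As a teaser, a hedged sketch of what client code against that abstract API looks like (names as in current libtracker-sparql; the point of direct access is that each cursor step becomes an sqlite3_step() in your own process instead of a D-Bus round-trip):

#include <libtracker-sparql/tracker-sparql.h>

static void
dump_titles (TrackerSparqlConnection *connection)
{
  GError *error = NULL;
  TrackerSparqlCursor *cursor;

  cursor = tracker_sparql_connection_query (connection,
      "SELECT ?u ?title { ?u a nie:InformationElement ; nie:title ?title }",
      NULL, &error);

  /* With the direct-access backend, each iteration reads straight
   * from SQLite (WAL) instead of demarshaling a D-Bus reply. */
  while (cursor != NULL && tracker_sparql_cursor_next (cursor, NULL, &error))
    g_print ("%s\n", tracker_sparql_cursor_get_string (cursor, 1, NULL));

  if (cursor != NULL)
    g_object_unref (cursor);

  g_clear_error (&error);
}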

Plans

We plan to change the signals-on-changes or class-signals feature a bit so that the three signals are merged into one. The problem with three is that you can’t reliably identify a change-transaction this way (a rename of a file, for example).

Another thing on our list is merging Zeitgeist’s ontology. To the other team members at Tracker: guys, Zeitgeist has been waiting for three months now. Let’s just get this done!

Oh there are a lot of plans, to be honest.

I wonder when, if ever, we go in feature freeze. Hehe. I guess we’ll just have very short feature-freeze periods. Whatever, it’s fun.

MeeGo in cars

Hey BMW & co, if you guys want to learn how to write music players and playlists for in-car entertainment on MeeGo, get in touch! This Tracker that I’m talking about is in that MeeGo OS; serving as the music metadata database is among its purposes.

I can’t wait to have a better music-player playlist in my car.

Or maybe some integration with the in-car GPS and the car owner’s appointments and meetings? With geo-tagged photos on the car owner’s phone? Automatic and instant synchronization with Nokia’s future phones? Sounds all very doable, even easy, to me. I’d want all that stuff. Use-cases!

Let’s talk!

Julian on TED

I try to avoid posting about the same subject twice in a row. But I also really think that Wikileaks is worth violating just about any such rule in existence. Maybe I should make a category on my blog just for Wikileaks?

So TED has decided to do an interview with Julian Assange:

I’d like to point out that I congratulate and thank everybody who’s involved, not just Julian. Thank you.

That today’s gonna be a good day

Today is the day the world is witnessing the most significant military leak in the history of mankind, so I have a feeling that today’s gonna be a good day.

To all the people at Wikileaks, and to all whistleblowers past, present and future: you are heroes. Your ideas will be with us for centuries to come. You’ll be remembered in the history books. Let’s make sure of it.

Why make things complicated?

There are no open source companies. There are companies and there are open source projects.

Some companies work on open source projects, some act as parents to open source projects, and some don’t.

Some of those companies are good at fostering a community that contributes to these open source projects. Others are unwilling, and some don’t yet understand the process. And yet others have many open source projects being done by teams that do get it, while at the same time other projects are done by teams that don’t. That last, dual situation is actually the most common among large companies. You know, the ones that often sponsor your community’s main conference and employ your heroes.

If you do a quick reality check, you’ll conclude there are no black/white companies. Actually, nothing in life or in ethics is black/white. Nothing at all.

What you do have is a small group of amazingly disturbing purists who do zero coding themselves (that is, near zero) but do think in black/white, and consequently write a lot of absurd nonsense in blog comments (on Slashdot in particular), forums and mailing lists. These people are reason numéro uno why many companies quit trying to understand open source.

It’s sad that the actual (open source) developers have to waste time explaining to the companies they consult for that these people can be ignored. It’s also sad that these purists have become so vocal, even violent, that they often can’t really be ignored anymore: people’s employers have been harassed.

“You have to fire somebody because he’s being unethical by disagreeing with my religious belief system that Microsoft is evil!” Maybe it’s just me who’s behind on ethics in this world? Well, those people can still get lost, because I, in ethics, disagree with them.

Now, let’s get back to the projects and away from the open source vs. open core debates. We have a lot of work to do, and a lot of companies to convince to open up their projects.

Open source developers succeeded in (for example) getting some software on phones. The people who did aren’t the religious black / white people. Maybe the media around open source should track down the people who did, and write quite a bit more about their work, ideas and passion?

Finally, the best companies are driven by the ideas and passions of their best employees. Those are the people who you should admire. Not their company’s open core PR.

Wrapping up 4.57 billion years

In 4.57 billion years our solar system went from creating simple bacteria to creating a large group of species, several of which are highly capable of making fairly intelligent decisions, and one of which indulges in believing that it can think. That’s us.

The sun has an estimated 5 billion years to go before it turns into a Red Giant that in its very early stages will wipe out truly every single idea that exists inside at least our own solar system.

Unless the radio waves our planet has been emitting since we invented radio are seen and understood (which requires a recipient in the first place), that will be the ultimate end of all of our ideas and culture. Unless we figure out a way to let the ideas cultivate outside of our solar system. Just the ideas alone would already be an insane achievement.

But imagine going from bacteria to beings, colonized by bacteria, that think that they can think, in far less time than the current age of our sun. Unless, of course, bacteria somehow arrived in our solar system from outside (unlikely, though perhaps no more unlikely than us ever exporting our ideas and culture to another solar system).

Imagine what could happen in the next 5 billion years …

Domain indexes finished, technical conclusions

The support for domain-specific indexes is finished and awaiting review. We can still optimize it further; more on that later in this post. Imagine that you have this ontology:

nie:InformationElement a rdfs:Class .

nie:title a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nie:InformationElement ;
  rdfs:range xsd:string .

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement .

nmm:beatsPerMinute a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nmm:MusicPiece ;
  rdfs:range xsd:integer .

With that ontology there are three tables called “Resource”, “nmm:MusicPiece” and “nie:InformationElement” in SQLite’s schema:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” one has ID and “nmm:beatsPerMinute”

That’s fairly simple, right? The problem is that when you ORDER BY “nie:title”, you cause a full table scan on “nie:InformationElement”. That’s not good, because there are fewer “nmm:MusicPiece” records than “nie:InformationElement” ones.

Imagine that we do this SPARQL query:

SELECT ?title WHERE {
   ?resource a nmm:MusicPiece ;
             nie:title ?title
} ORDER BY ?title

We translate that, for you, to this SQL on our schema:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nie:InformationElement2"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

OK, so with support for domain indexes we change the ontology like this:

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement ;
  tracker:domainIndex nie:title .

We still have the three tables called “Resource”, “nmm:MusicPiece” and “nie:InformationElement” in SQLite’s schema. But now they look like this:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” table now has three columns called ID, “nmm:beatsPerMinute” and “nie:title”

The same data, for titles of music pieces, will be in both “nie:InformationElement” and “nmm:MusicPiece”. We copy it into the mirror column while coping with the ontology change, and whenever new inserts happen.

Now, when the rdf:type in the SPARQL query is known to be nmm:MusicPiece, as in the query mentioned earlier, we know that we can use the “nie:title” from the “nmm:MusicPiece” table in SQLite. That allows us to generate this SQL query for you:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1"
  WHERE  "title_u" IS NOT NULL
) ORDER BY "title_u"

A remaining optimization is for when you also request an rdf:type that nmm:MusicPiece is a subclass of, like this:

SELECT ?title WHERE {
  ?resource a nmm:MusicPiece, nie:InformationElement ;
            nie:title ?title
} ORDER BY ?title

It’s still not as bad as before, since “nie:title” is still taken from the “nmm:MusicPiece” table. But the join with “nie:InformationElement” is still needlessly there (we could just generate the earlier SQL query in this case):

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

We will probably optimize this specific use-case further later this week.

Smile or Die

As a follow-up on the RSA Animate videos, here’s the original talk by Barbara Ehrenreich, titled Smile or Die.

I think part of GNOME’s crisis is caused by the same atmosphere of “go with the program, don’t complain, or you’re out”. I wrote about this before:

It’s not popular to be critical about a (the leader of a) popular idea. This is illustrated by the intellectually absurd criticisms David Schlesinger receives.

Yet the critic who monitors the organs of a society is key to such an organ either producing for its stakeholders, or failing and dragging down the entire society it serves.

Acknowledging the problem and changing course is what I seek in a candidate this year.

OK, two is enough. Back to technical articles.

Friday’s performance improvements in Tracker

The crawler’s modification time queries

Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.

Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObjects wasn’t too large, that wasn’t a huge problem. But it didn’t scale linearly. I started by optimizing the function itself: it was using a strlen(), which I replaced with a sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me ~10%; not enough.

So this commit was a better improvement. First, it makes nfo:belongsToContainer an indexed property. A triple x nfo:belongsToContainer p means that file resource x is in directory p. The commit then changes the query to use this now-indexed property.

The original query, before we started with this optimization, took 1.090s when you had ~300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison, because now we use an indexed property. But adding the index only took a total of 10s for a ~300,000-row table, and the table can be queried while we index (while we insert into it). Do the math: it’s a huge win in all situations. For the SQLite freaks: the SQLite database grew by 4 MB, with all items in the table indexed.
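
For the curious, this is the kind of function we’re talking about. Not Tracker’s actual implementation, just a simplified sketch showing how a SPARQL helper gets registered as a custom SQLite function, and where sqlite3_value_bytes() replaces the strlen() walk (SQLite already tracks the byte length of every value):

#include <sqlite3.h>
#include <string.h>

static void
uri_is_parent (sqlite3_context *context, int argc, sqlite3_value **argv)
{
  const unsigned char *parent = sqlite3_value_text (argv[0]);
  const unsigned char *uri = sqlite3_value_text (argv[1]);
  /* sqlite3_value_bytes() returns the length SQLite already knows,
   * so no strlen() walk over the UTF-8 string is needed. */
  int parent_len = sqlite3_value_bytes (argv[0]);
  int uri_len = sqlite3_value_bytes (argv[1]);
  int result = 0;

  (void) argc;  /* always 2, enforced by the registration below */

  /* uri is a direct child of parent when it is "parent/child" with
   * no further '/' in the child part (simplified logic) */
  if (parent != NULL && uri != NULL &&
      uri_len > parent_len + 1 &&
      memcmp (uri, parent, parent_len) == 0 &&
      uri[parent_len] == '/' &&
      memchr (uri + parent_len + 1, '/', uri_len - parent_len - 1) == NULL)
    result = 1;

  sqlite3_result_int (context, result);
}

/* registered once per database connection */
static void
register_functions (sqlite3 *db)
{
  sqlite3_create_function (db, "tracker_uri_is_parent", 2,
                           SQLITE_UTF8, NULL, uri_is_parent, NULL, NULL);
}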

PDF extractor

Another optimization I did earlier was in the PDF extractor. Originally, we used the poppler-glib library. This library doesn’t allow us to set the OutputDev at runtime. If compiled with Cairo, the OutputDev is, in some versions, a CairoOutputDev. We don’t want all the images in the PDF to be rendered to a Cairo surface, so I ported this back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master, poppler_page_get_text_page always uses a TextOutputDev).

Another major problem with poppler-glib is the huge amount of string copying on the heap. The time to extract the metadata and content text of a 70-page PDF document without any images went from 1.050s to 0.550s. A lot of it was caused by copying strings and by GValue boxing due to GObject properties.

Table locked problem

Last week I improved the D-Bus marshaling by using a database cursor, but I forgot to handle SQLITE_LOCKED while Jürg and Carlos had been introducing multithreaded SELECT support. Not good. I fixed it; it was causing random “Table locked” errors.
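
The fix boils down to the canonical pattern from SQLite’s shared-cache documentation: when sqlite3_step() returns SQLITE_LOCKED, register an unlock notification, block until the other connection finishes, then retry. A sketch (using GLib threading primitives for brevity; it assumes SQLite was built with SQLITE_ENABLE_UNLOCK_NOTIFY):

#include <sqlite3.h>
#include <glib.h>

typedef struct {
  GMutex mutex;
  GCond cond;
  gboolean fired;
} UnlockCtx;

static void
unlock_notify_cb (void **args, int n_args)
{
  int i;

  for (i = 0; i < n_args; i++) {
    UnlockCtx *ctx = args[i];
    g_mutex_lock (&ctx->mutex);
    ctx->fired = TRUE;
    g_cond_signal (&ctx->cond);
    g_mutex_unlock (&ctx->mutex);
  }
}

static int
wait_for_unlock_notify (sqlite3 *db)
{
  UnlockCtx ctx = { 0, };
  int rc;

  g_mutex_init (&ctx.mutex);
  g_cond_init (&ctx.cond);

  /* SQLITE_LOCKED from this call means a deadlock was detected */
  rc = sqlite3_unlock_notify (db, unlock_notify_cb, &ctx);

  if (rc == SQLITE_OK) {
    g_mutex_lock (&ctx.mutex);
    while (!ctx.fired)
      g_cond_wait (&ctx.cond, &ctx.mutex);
    g_mutex_unlock (&ctx.mutex);
  }

  g_mutex_clear (&ctx.mutex);
  g_cond_clear (&ctx.cond);

  return rc;
}

/* step a statement, waiting out locks held by other threads */
static int
step_retry_on_locked (sqlite3 *db, sqlite3_stmt *stmt)
{
  int rc;

  while ((rc = sqlite3_step (stmt)) == SQLITE_LOCKED) {
    if (wait_for_unlock_notify (db) != SQLITE_OK)
      break;
    sqlite3_reset (stmt);
  }

  return rc;
}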

RDF propaganda, time for change

I’m not supposed to but I’m proud. It’s not only me who’s doing it.

Adrien is one of the new guys on the block. He’s working on integrating Tracker’s RDF service with various web services like Flickr, Facebook, Twitter, Picasaweb and RSS. This is the kind of guy several companies should be afraid of. His work is competing with what they are trying to do: integrating the social web with mobile.

Oh come on Steve, stop pretending that you aren’t. And you better come up with something good, because we are.

Not only that, Adrien is implementing so-called writeback. It means that when you change a local resource’s properties, this integration will update Flickr, Facebook, Picasaweb and Twitter accordingly.

You change a piece of info about a photo on your phone, and it’ll be replicated to Flickr. And it’ll be synchronized back onto your phone as soon as somebody else makes a change.

This is the future of computing and information technology. Integration with social networking and the phone is what people want. Dear Mark, it’s unstoppable. You better keep your eyes open, because we are going fast. Faster than your business.

I’m not somebody trying to guess how technology will look in a few years; I try to be in the middle of the technical challenge of actually doing it. Talking about it is telling history before your lips’ muscles have moved.

At the Tracker project we are building a SPARQL endpoint that uses D-Bus as its IPC. This is ideal on Nokia’s MeeGo. It’ll be a centerpiece for information gathering. On MeeGo you won’t ask the filesystem; instead you’ll ask Tracker, using SPARQL and RDF.

To be challenged is likely the most beautiful state of mind.

I invite everybody to watch this demo by Adrien. It’s just the beginning. It’s going to get better.

Tracker writeback & web service integration demo / MeegoTouch UI from Adrien Bustany on Vimeo.

I tagged this as ‘extremely controversial’. That’s fine, Adrien told me that “people are used to me anyway”.

Performant D-Bus handling of the query results in Tracker’s RDF service

Before

For returning the results of a SPARQL SELECT query, we used to have a callback like the one below. I removed the error handling; you can find the original here.

We need to marshal a database result_set to a GPtrArray, because dbus-glib fancies that. This means a lot of boxing of strings into GValue and GStrv. It does allocations, so: not good.

static void
query_callback (TrackerDBResultSet *result_set, GError *error, gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  GPtrArray *values = tracker_dbus_query_result_to_ptr_array (result_set);
  dbus_g_method_return (info->context, values);
  tracker_dbus_results_ptr_array_free (&values);
}

void
tracker_resources_sparql_query (TrackerResources *self, const gchar *query,
                                DBusGMethodInvocation *context, GError **error)
{
  TrackerDBusMethodInfo *info = ...; guint request_id;
  TrackerResourcesPrivate *priv = ...; gchar *sender;
  info->context = context;
  tracker_store_sparql_query (query, TRACKER_STORE_PRIORITY_HIGH,
                              query_callback, ...,
                              info, destroy_method_info);
}

After

Last week I changed the asynchronous callback to return a database cursor. In SQLite that means an sqlite3_step(). SQLite returns const pointers to the data in the cell with its sqlite3_column_* APIs.

This means that now we’re not even copying the strings out of SQLite. Instead, we’re using them as const to fill in a raw DBusMessage:

static void
query_callback (TrackerDBCursor *cursor, GError *error, gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  DBusMessage *reply; DBusMessageIter iter, rows_iter;
  guint cols; guint length = 0;
  reply = dbus_g_method_get_reply (info->context);
  dbus_message_iter_init_append (reply, &iter);
  cols = tracker_db_cursor_get_n_columns (cursor);
  dbus_message_iter_open_container (&iter, DBUS_TYPE_ARRAY,
                                    "as", &rows_iter);
  while (tracker_db_cursor_iter_next (cursor, NULL)) {
    DBusMessageIter cols_iter; guint i;
    dbus_message_iter_open_container (&rows_iter, DBUS_TYPE_ARRAY,
                                      "s", &cols_iter);
    for (i = 0; i < cols; i++, length++) {
      const gchar *result_str = tracker_db_cursor_get_string (cursor, i);
      dbus_message_iter_append_basic (&cols_iter,
                                      DBUS_TYPE_STRING,
                                      &result_str);
    }
    dbus_message_iter_close_container (&rows_iter, &cols_iter);
  }
  dbus_message_iter_close_container (&iter, &rows_iter);
  dbus_g_method_send_reply (info->context, reply);
}

Results

The test is a query on 13500 resources where we ask for two strings, repeated eleven times. I removed the first repeat from each round, because the first time the sqlite3_stmt still has to be created, which would add a few milliseconds to our measurement. I also redirected standard output to /dev/null to avoid the overhead created by the terminal. The results you see below are the values for “real”.

There is of course some overhead created by the “tracker-sparql” program itself; it does demarshaling using normal dbus-glib. If your application uses DBusMessage directly, it can avoid that overhead. But since I used the same “tracker-sparql” for both rounds, it doesn’t matter for the measurement.

$ time tracker-sparql -q "SELECT ?u  ?m { ?u a rdfs:Resource ;
          tracker:modified ?m }" > /dev/null

Without the optimization:

0.361s, 0.399s, 0.327s, 0.355s, 0.340s, 0.377s, 0.346s, 0.380s, 0.381s, 0.393s, 0.345s

With the optimization:

0.279s, 0.271s, 0.305s, 0.296s, 0.295s, 0.294s, 0.295s, 0.244s, 0.289s, 0.237s, 0.307s

The improvement ranges between 7% and 40% with average improvement of 22%.

Focus on query performance

Every (good) developer knows that copying memory and boxing, especially when dealing with a large number of pieces, like the members of collections and the cells of a table, are bad for your performance.

More experienced developers also know that novice developers tend to focus on just their algorithms to improve performance, while often the single biggest bottleneck is needless boxing and allocating. Experienced developers come up with algorithms that avoid boxing and copying; they master clever, pragmatic engineering and know how to improve algorithms. A lot of newcomers use virtual machines and scripting languages that are terrible at giving you the tools to control this, and then start endless religious debates about how great their programming language is (as if it matters). (Anti-.NET people, don’t get on your high horses too soon: if you know what you are doing, C# is actually quite good here.)

We were of course doing some silly copying ourselves. Apparently it had a significant impact on performance.

Once Jürg and Carlos have finished the work on parallelizing SELECT queries we plan to let the code that walks the SQLite statement fill in the DBusMessage directly without any memory copying or boxing (for marshalling to DBus). We found the get_reply and send_reply functions; they sound useful for this purpose.

I still don’t really like D-Bus as the IPC for transferring the query results of Tracker’s RDF store. Personally I would go for a custom Unix socket here. But Jürg so far isn’t convinced. Admittedly he’s probably right; he’s always right. Still, D-Bus doesn’t feel to me like a good IPC for this kind of data transfer.

We know about the requests to have direct access to the SQLite database from your own process. I explained in the bug that SQLite isn’t MVCC, and that this means your process would often get blocked for a long time on our transactions; longer than any IPC overhead takes.

Supporting ontology changes in Tracker

It used to be that in Tracker you couldn’t just change the ontology. When you did, you had to reset the database, which means losing all the non-embedded data: for example your tags, or other information that is stored uniquely in Tracker’s RDF store.

This was of course utterly unacceptable and this was among the reasons why we kept 0.8 from being released for so long: we were afraid that we would need to make ontology changes during the 0.8 series.

So during 0.7 I added support for what I call modest ontology changes: adding a class, adding a property. But just that; not changing an existing property. This was sufficient for 0.8, because we could at least make changes like adding a property to a class or adding a new class. You know, making it possible to implement the standard feature requests.

The last two weeks I worked on supporting more intrusive ontology changes. The branch that I’m working on currently supports changing tracker:notify, for the signals-on-changes feature; tracker:writeback, for the writeback features; and tracker:indexed, which controls the indexes in the SQLite tables.

Certain range changes are also supported: for example integer to string, double and boolean; string to integer, double and boolean; and double to integer, string and boolean. Range changes will of course sometimes mean data loss.

Plenty of code was also added to detect unsupported ontology changes, ensuring that in such a case we simply abort the process and don’t make any changes.

It’s all quite complex, so it might take a while before the other team members have tested and reviewed all of this. It will probably take even longer before it hits the stable 0.8 branch.

We won’t yet open the doors to custom ontologies, for several reasons:

  • We want more testing of the support for ontology changes. We know that once we open the doors to custom ontologies, we’ll see usage of this sooner rather than later.
  • We don’t yet support removing properties and classes. This would be easy (drop the table and columns and log the event in the journal), but it’s not yet supported, mostly because we don’t need it ourselves (which is a good reason).
  • We don’t want you to meddle with the standard ontologies (we’ll do that, don’t worry). So we need a bit of ontology management code to also look in other directories, etc.
  • The error handling of unsupported ontology changes shouldn’t abort as mentioned above. Another piece of software shouldn’t be able to make Tracker unusable just because it installs junk ontologies.
  • We actually want to start using OSCAF‘s ontology format. Perhaps it’s better to wait for this instead of asking everybody to convert their custom ontologies later?
  • We’re a bunch of pussies who are afraid of the can of worms that you guys’ custom ontologies will open.

But yes, you could say that the basics are being put in place as we speak.