IPC performance, the report
The Tracker team will be doing a codecamp this month. Among the subjects we will address is the IPC overhead of tracker-store, our RDF query service.
We plan to investigate whether clients can connect directly to our SQLite database. Jürg did some work on this. It turns out that because SQLite is not MVCC, we would need to override some of SQLite’s VFS functions and perhaps even implement a custom page cache ourselves.
Another track that we are investigating involves using a custom UNIX domain socket and sending the data over in such a way that at either side the marshalling is cheap.
For that idea I asked Adrien Bustany, a computer science student doing an internship at Codeminded, to develop three tests: one that uses D-Bus the way tracker-store does (using the DBusMessage API directly), one that uses a custom protocol and technique, as close to ideal as possible, to get the data over a UNIX domain socket, and a simple program that runs the exact same query but connects to SQLite by itself.
Exposing a SQLite database remotely: comparison of various IPC methods
By Adrien Bustany
Computer science student
National Superior School of Informatics and Applied Mathematics of Grenoble (ENSIMAG)
This study aims to compare the overhead of an IPC layer when accessing a SQLite database. The two IPC methods included in this comparison are DBus, a generic message passing system, and a custom IPC method using UNIX sockets. As a reference, we also include in the results the performance of a client directly accessing the SQLite database, without any IPC layer involved.
Comparison methodology
In this section, we detail what the client and server are supposed to do during the test, regardless of the IPC method used.
The server has to:
- Open the SQLite database and listen to the client requests
- Prepare a query at the client’s request
- Send the resulting rows at the client’s request
Queries are “SELECT” queries only; no modification is performed on the database. This restriction is not enforced on the server side, though.
The client has to:
- Connect to the server
- Prepare a “SELECT” query
- Fetch all the results
- Copy the results in memory (not just fetch and forget them), so that memory pages are really used
Test dataset
For testing, we use a SQLite database containing only one table. This table has 31 columns: the first is the identifier and the other 30 are of type TEXT. The table is filled with 300,000 rows, each containing randomly generated strings of 20 lowercase ASCII characters.
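To give an idea of what such a dataset looks like in practice, here is a minimal sketch in C that generates a similar database. The table and column names are invented for illustration; the actual test tooling may differ.

/* Sketch: create a table with an id column plus 30 TEXT columns and fill it
 * with 300,000 rows of random 20-character lowercase ASCII strings. */
#include <sqlite3.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N_TEXT_COLS 30
#define N_ROWS      300000

static void
random_string (char *buf, int len)
{
    int i;

    for (i = 0; i < len; i++)
        buf[i] = 'a' + rand () % 26;
    buf[len] = '\0';
}

int
main (void)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;
    char create_sql[1024] = "CREATE TABLE data (id INTEGER PRIMARY KEY";
    char insert_sql[512] = "INSERT INTO data VALUES (?";
    char value[21];
    int i, row;

    sqlite3_open ("test.db", &db);

    /* build the CREATE TABLE and INSERT statements for the 30 TEXT columns */
    for (i = 1; i <= N_TEXT_COLS; i++) {
        char col[32];

        snprintf (col, sizeof (col), ", col%d TEXT", i);
        strcat (create_sql, col);
        strcat (insert_sql, ", ?");
    }
    strcat (create_sql, ")");
    strcat (insert_sql, ")");

    sqlite3_exec (db, create_sql, NULL, NULL, NULL);
    sqlite3_exec (db, "BEGIN", NULL, NULL, NULL);
    sqlite3_prepare_v2 (db, insert_sql, -1, &stmt, NULL);

    for (row = 0; row < N_ROWS; row++) {
        sqlite3_bind_int (stmt, 1, row);
        for (i = 0; i < N_TEXT_COLS; i++) {
            random_string (value, 20);
            sqlite3_bind_text (stmt, i + 2, value, -1, SQLITE_TRANSIENT);
        }
        sqlite3_step (stmt);
        sqlite3_reset (stmt);
    }

    sqlite3_finalize (stmt);
    sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
    sqlite3_close (db);

    return 0;
}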
Implementation details
In this section, we explain how the server and client for both IPC methods were implemented.
Custom IPC (UNIX socket based)
In this case, we use a standard UNIX socket to communicate between the client and the server. The socket protocol is a binary protocol, detailed below. It has been designed to minimize CPU usage (there is no marshalling/demarshalling of strings, nor intensive computation to decode the messages). It is fast over a local socket, but not suitable for other types of sockets, like TCP sockets.
Message types
There are two types of messages, corresponding to the two operations of the test: preparing a query and fetching results.
Message format
All numbers are encoded in little endian form.
Prepare
Client sends:
| Size | Contents |
| 4 bytes | Prepare opcode (0x50) |
| 4 bytes | Size of the query (without trailing \0) |
| … | Query, in ASCII |
Server answers:
| Size | Contents |
| 4 bytes | Return code of the sqlite3_prepare_v2 call |
Fetch
Client sends:
| Size | Contents |
| 4 bytes | Fetch opcode (0x46) |
Server sends rows grouped in fixed size buffers. Each buffer contains a variable number of rows. Each row is complete. If some padding is needed (when a row doesn’t fit in a buffer, but there is still space left in the buffer), the server adds an “End of Page” marker. The “End of page” marker is the byte 0xFF. Rows that are larger than the buffer size are not supported.
Each row in a buffer has the following format:
| Size | Contents |
| 4 bytes | SQLite return code. This is generally SQLITE_ROW (there is a row to read), or SQLITE_DONE (there are no more rows to read). When the return code is not SQLITE_ROW, the rest of the message must be ignored. |
| 4 bytes | Number of columns in the row |
| 4 bytes | Index of the trailing \0 for the first column (indices are counted from 0 at the start of the row data) |
| 4 bytes | Index of trailing \0 for second column |
| … | |
| 4 bytes | Index of trailing \0 for last column |
| … | Row data. All columns are concatenated together, and separated by \0 |
For the sake of clarity, here is an example row:
100 4 1 7 13 19 1\0aaaaa\0bbbbb\0ccccc\0
The first 100 is the return code, in this case SQLITE_ROW. This row has 4 columns. The four following numbers are the offsets of the \0 terminating each column within the row data. Finally comes the row data itself.
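To make the framing concrete, here is a minimal sketch of how a client could walk one buffer received from the server, following the format above. The names are invented, error handling is omitted, and a little-endian host is assumed (as in the test setup); this is not the actual test code.

/* Sketch: iterate over the rows packed into one fixed-size buffer. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sqlite3.h>    /* for SQLITE_ROW */

#define END_OF_PAGE 0xFF

static void
parse_buffer (const unsigned char *buf, size_t buf_size)
{
    size_t pos = 0;

    while (pos < buf_size) {
        int32_t ret_code, n_cols;
        const int32_t *offsets;
        const char *row_data;
        int i;

        if (buf[pos] == END_OF_PAGE)
            break;    /* padding: the rest of the buffer is unused */

        memcpy (&ret_code, buf + pos, 4);    /* SQLite return code */
        if (ret_code != SQLITE_ROW)
            break;    /* SQLITE_DONE or an error: ignore the rest */

        memcpy (&n_cols, buf + pos + 4, 4);  /* number of columns */
        offsets = (const int32_t *) (buf + pos + 8);
        row_data = (const char *) (buf + pos + 8 + 4 * n_cols);

        for (i = 0; i < n_cols; i++) {
            /* each column starts right after the previous trailing \0 */
            const char *col = (i == 0) ? row_data
                                       : row_data + offsets[i - 1] + 1;
            (void) col;    /* hand the column value to the application here */
        }

        /* the last offset points at the final \0 of the row data */
        pos += 8 + 4 * n_cols + offsets[n_cols - 1] + 1;
    }
}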
Memory usage
We try to minimize the calls to malloc and memcpy in the client and server. As we know the size of a buffer, we allocate the memory only once, and then use memcpy to write the results to it.
DBus
The DBus server exposes two methods, Prepare and Fetch.
Prepare
The Prepare method accepts a query string as a parameter, and returns nothing. If the query preparation fails, an error message is returned.
Fetch
Ideally, we would send all the rows in one batch. DBus, however, puts a limit on the message size. In our case, the complete data to pass over the IPC is around 220 MB, which is more than the maximum size allowed by DBus (moreover, DBus marshals the data, which increases the message size a little). We are therefore obliged to split the result set.
The Fetch method accepts an integer parameter, which is the number of rows to fetch, and returns an array of rows, where each row is itself an array of columns. Note that the server can return fewer rows than requested. When there are no more rows to return, an empty array is returned.
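For illustration, a client built directly on the DBusMessage API could drive the Fetch method along these lines. The bus name, object path and interface are placeholders, not the names used by the actual test programs, and error handling is omitted.

/* Sketch: fetch one batch of up to 75,000 rows over D-Bus with libdbus. */
#include <dbus/dbus.h>

static int
fetch_batch (DBusConnection *conn)
{
    DBusMessage *call, *reply;
    DBusMessageIter iter, rows, cols;
    dbus_int32_t count = 75000;
    int n_rows = 0;

    call = dbus_message_new_method_call ("org.example.SqliteServer",
                                         "/org/example/SqliteServer",
                                         "org.example.SqliteServer",
                                         "Fetch");
    dbus_message_append_args (call, DBUS_TYPE_INT32, &count, DBUS_TYPE_INVALID);

    reply = dbus_connection_send_with_reply_and_block (conn, call, -1, NULL);
    dbus_message_unref (call);

    /* the reply signature is "aas": an array of rows, each an array of strings */
    dbus_message_iter_init (reply, &iter);
    dbus_message_iter_recurse (&iter, &rows);

    while (dbus_message_iter_get_arg_type (&rows) == DBUS_TYPE_ARRAY) {
        dbus_message_iter_recurse (&rows, &cols);

        while (dbus_message_iter_get_arg_type (&cols) == DBUS_TYPE_STRING) {
            const char *value;

            dbus_message_iter_get_basic (&cols, &value);
            /* copy or otherwise use the cell value here */
            dbus_message_iter_next (&cols);
        }

        n_rows++;
        dbus_message_iter_next (&rows);
    }

    dbus_message_unref (reply);

    return n_rows;
}

A complete client simply keeps calling fetch_batch() until it returns 0, which corresponds to the empty array the server sends back when there are no more rows.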
Results
All tests are run against the dataset described above, with a warm disk cache (the database is accessed several times before every run, to be sure the entire database is in the disk cache). We use SQLite 3.6.22 on a 64-bit Linux system (kernel 2.6.33.3). Each test is run 5 times, and we use the average of the 5 intermediate results as the final number.
For the custom IPC, we test with buffer sizes varying from 1 to 256 kilobytes. For DBus, we fetch 75,000 rows with every Fetch call, which is close to the maximum we can fetch per call (see the paragraph on the DBus message size limitation).
The first tests were to determine the optimal buffer size for the UNIX socket based IPC. The following graph describes the time needed to fetch all rows, depending on the buffer size:
The graph shows that the IPC is fastest with 64 kB buffers. Those results depend on the type of system used and might have to be tuned for different platforms. On Linux, a memory page is (generally) 4096 bytes; as a consequence, buffers smaller than 4 kB will still use a full memory page when sent over the socket and waste memory bandwidth. After determining the best buffer size for the socket IPC, we run the speed and memory usage tests using a buffer size of 64 kB for the UNIX socket based method.
Speed
We measure the time it takes for the various methods to fetch a result set. Without any surprise, the time needed to fetch the results grows linearly with the number of rows to fetch.
| IPC method | Best time |
| None (direct access) | 2910 ms |
| UNIX socket | 3470 ms |
| DBus | 12300 ms |
Memory usage
Memory usage varies greatly between IPC methods (actually, so much that we had to use a log scale). DBus memory usage is explained by the fact that we fetch 75,000 rows at a time and that it has to allocate the whole message before sending it, while the socket IPC never uses more than its 64 kB buffers.
Conclusions
The results clearly show that in such a specialized case, designing a custom IPC system can greatly reduce the IPC overhead. The overhead of the UNIX socket based IPC is around 19%, while the overhead of DBus is 322%. However, it is important to take into account that DBus is a much more flexible system, offering far more features than our socket protocol. Comparing DBus with our custom UNIX socket based IPC is like comparing an axe with a Swiss Army knife: it’s much harder to cut down a tree with the Swiss Army knife, but it also includes a tin can opener, a ball pen and a compass (nowadays some of them even include USB keys).
The real conclusion of this study is: if you have to pass a lot of data between two programs and don’t need a lot of flexibility, then DBus is not the right answer, and was never intended to be.
The source code used to obtain these results, as well as the numbers and graphs used in this document, can be checked out from the following git repository: git://git.mymadcat.com/ipc-performance . Please check the various README files to see how to reproduce the results and/or how to tune the parameters.
Friday’s performance improvements in Tracker
The crawler’s modification time queries
Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.
Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObjects wasn’t too large, that wasn’t a huge problem. But it didn’t scale. I started by optimizing the function itself: it was using a strlen(), which I replaced with a sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me about 10%; not enough.
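As an illustration of that strlen() change (a simplified sketch only, not the actual tracker:uri-is-parent() implementation): since SQLite already tracks the byte length of a TEXT value, and only UTF-8 is stored, the function can ask SQLite for the length instead of walking the string.

/* Illustrative sketch only, not the actual tracker:uri-is-parent() code.
 * The idea: SQLite already knows the length of the TEXT value, so a
 * strlen() over it is redundant (we only store UTF-8). */
#include <string.h>
#include <glib.h>
#include <sqlite3.h>

static void
function_uri_is_parent_sketch (sqlite3_context *context,
                               int              argc,
                               sqlite3_value   *argv[])
{
    const gchar *parent, *uri;
    guint parent_len;

    g_assert (argc == 2);

    parent = (const gchar *) sqlite3_value_text (argv[0]);
    uri = (const gchar *) sqlite3_value_text (argv[1]);

    /* before: parent_len = strlen (parent); */
    parent_len = sqlite3_value_bytes (argv[0]);

    /* uri lives directly under parent: same prefix, one more path component */
    sqlite3_result_int (context,
                        strncmp (uri, parent, parent_len) == 0 &&
                        uri[parent_len] == '/' &&
                        strchr (uri + parent_len + 1, '/') == NULL);
}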
So this commit was a better improvement. First, it makes nfo:belongsToContainer an indexed property. For file resources, the triple x nfo:belongsToContainer p means that x is in directory p. The commit then changes the query to use the property that is now indexed.
The original query, before we started with this optimization, took 1.090s when you had ~300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison, because now we use an indexed property. But adding the index only took a total of 10s for a table of ~300,000 rows, and the table can still be queried while we index it (while we insert into it). Do the math: it’s a huge win in all situations. For the SQLite freaks: the SQLite database grew by 4 MB, with all items in the table indexed.
PDF extractor
Another optimization I did earlier was the PDF extractor. Originally, we used the poppler-glib library. This library doesn’t allow us to set the OutputDev at runtime. If compiled with Cairo support, the OutputDev is in some versions a CairoOutputDev, and we don’t want all images in the PDF to be rendered to a Cairo surface. So I ported the extractor back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master, poppler_page_get_text_page always uses a TextOutputDev).
Another major problem with poppler-glib is the huge amount of string copying on the heap. The time to extract metadata and content text from a 70-page PDF document without any images went from 1.050s to 0.550s. A lot of it was caused by copying strings and by GValue boxing due to GObject properties.
Table locked problem
Last week I improved the D-Bus marshalling by using a database cursor. I forgot to handle SQLITE_LOCKED while Jürg and Carlos had been introducing multithreaded SELECT support. Not good. I fixed this; it was causing random “Table locked” errors.
RDF propaganda, time for change
I’m not supposed to but I’m proud. It’s not only me who’s doing it.
Adrien is one of the new guys on the block. He’s working on integration between Tracker’s RDF service and various web services like Flickr, Facebook, Twitter, picasaweb and RSS. This is the kind of guy several companies should be afraid of. His work is competing with what they are trying to do: integrating the social web with mobile.
Oh come on Steve, stop pretending that you aren’t. And you better come up with something good, because we are.
Not only that, Adrien is implementing so-called writeback. It means that when you change a local resource’s properties, this integration will update Flickr, Facebook, picasaweb and Twitter.
You change a piece of info about a photo on your phone, and it’ll be replicated to Flickr. It’ll also be synchronized onto your phone as soon as somebody else makes a change.
This is the future of computing and information technology. Integration with social networking and the phone is what people want. Dear Mark, it’s unstoppable. You better keep your eyes open, because we are going fast. Faster than your business.
I’m not somebody trying to guess how technology will look in a few years. I try to be in the middle of the technical challenge of actually doing it. Talking about it is telling history before your lip’s muscles moved.
At the Tracker project we are building a SPARQL endpoint that uses D-Bus as IPC. This is ideal on Nokia’s Meego. It’ll be a centerpiece for information gathering. On Meego you won’t ask the filesystem; instead you’ll ask Tracker, using SPARQL and RDF.
To be challenged is likely the most beautiful state of mind.
I invite everybody to watch this demo by Adrien. It’s just the beginning. It’s going to get better.
Tracker writeback & web service integration demo / MeegoTouch UI from Adrien Bustany on Vimeo.
I tagged this as ‘extremely controversial’. That’s fine, Adrien told me that “people are used to me anyway”.
Performance DBus handling of the query results in Tracker’s RDF service
Before
For returning the results of a SPARQL SELECT query we used to have a callback like this. I removed the error handling; you can find the original here.
We need to marshal a database result_set into a GPtrArray, because dbus-glib fancies that. This means a lot of boxing of strings into GValue and GStrv, and it does allocations, so not good.
static void
query_callback (TrackerDBResultSet *result_set, GError *error, gpointer user_data)
{
    TrackerDBusMethodInfo *info = user_data;
    GPtrArray *values = tracker_dbus_query_result_to_ptr_array (result_set);

    dbus_g_method_return (info->context, values);
    tracker_dbus_results_ptr_array_free (&values);
}
void
tracker_resources_sparql_query (TrackerResources *self, const gchar *query,
                                DBusGMethodInvocation *context, GError **error)
{
    TrackerDBusMethodInfo *info = ...;
    TrackerResourcesPrivate *priv = ...;
    guint request_id;
    gchar *sender;

    info->context = context;

    tracker_store_sparql_query (query, TRACKER_STORE_PRIORITY_HIGH,
                                query_callback, ...,
                                info, destroy_method_info);
}
After
Last week I changed the asynchronous callback to return a database cursor. In SQLite terms that means an sqlite3_step(); SQLite returns const pointers to the data in the cells through its sqlite3_column_* APIs.
This means that now we’re not even copying the strings out of SQLite. Instead, we’re using them as const to fill in a raw DBusMessage:
static void
query_callback (TrackerDBCursor *cursor, GError *error, gpointer user_data)
{
    TrackerDBusMethodInfo *info = user_data;
    DBusMessage *reply;
    DBusMessageIter iter, rows_iter;
    guint cols;
    guint length = 0;

    reply = dbus_g_method_get_reply (info->context);
    dbus_message_iter_init_append (reply, &iter);
    cols = tracker_db_cursor_get_n_columns (cursor);

    dbus_message_iter_open_container (&iter, DBUS_TYPE_ARRAY,
                                      "as", &rows_iter);

    while (tracker_db_cursor_iter_next (cursor, NULL)) {
        DBusMessageIter cols_iter;
        guint i;

        dbus_message_iter_open_container (&rows_iter, DBUS_TYPE_ARRAY,
                                          "s", &cols_iter);

        for (i = 0; i < cols; i++, length++) {
            const gchar *result_str = tracker_db_cursor_get_string (cursor, i);

            dbus_message_iter_append_basic (&cols_iter,
                                            DBUS_TYPE_STRING,
                                            &result_str);
        }

        dbus_message_iter_close_container (&rows_iter, &cols_iter);
    }

    dbus_message_iter_close_container (&iter, &rows_iter);

    dbus_g_method_send_reply (info->context, reply);
}
Results
The test is a query on 13,500 resources where we ask for two strings, repeated eleven times. I removed the first repeat from each round, because the first time the sqlite3_stmt still has to be created, which would add a few milliseconds to the measurement. I also redirected standard output to /dev/null to avoid the overhead created by the terminal. The results you see below are the values for “real”.
There is of course an overhead created by the “tracker-sparql” program itself: it does the demarshalling using normal dbus-glib. If your application uses DBusMessage directly, it can avoid that overhead. But since I used the same “tracker-sparql” for both rounds, it doesn’t matter for this measurement.
$ time tracker-sparql -q "SELECT ?u ?m { ?u a rdfs:Resource ;
tracker:modified ?m }" > /dev/null
Without the optimization:
0.361s, 0.399s, 0.327s, 0.355s, 0.340s, 0.377s, 0.346s, 0.380s, 0.381s, 0.393s, 0.345s
With the optimization:
0.279s, 0.271s, 0.305s, 0.296s, 0.295s, 0.294s, 0.295s, 0.244s, 0.289s, 0.237s, 0.307s
The improvement ranges between 7% and 40%, with an average improvement of 22%.
Focus on query performance
Every (good) developer knows that copying memory and boxing, especially when dealing with large numbers of items such as the members of collections or the cells of a table, are bad for performance.
More experienced developers also know that novice developers tend to focus on just their algorithms to improve performance, while often the single biggest bottleneck is needless boxing and allocating. Experienced developers come up with algorithms that avoid boxing and copying; they master clever pragmatic engineering and know how to improve algorithms. A lot of newcomers use virtual machines and scripting languages that are terrible at giving you the tools to control this, and then they start endless religious debates about how great their programming language is (as if it matters). (Anti-.NET people, don’t get on your horses too soon: if you know what you are doing, C# is actually quite good here.)
We were of course doing some silly copying ourselves. Apparently it had a significant impact on performance.
Once Jürg and Carlos have finished the work on parallelizing SELECT queries we plan to let the code that walks the SQLite statement fill in the DBusMessage directly without any memory copying or boxing (for marshalling to DBus). We found the get_reply and send_reply functions; they sound useful for this purpose.
I still don’t really like DBus as IPC for transferring the query results of Tracker’s RDF store. Personally I think I would go for a custom UNIX socket here. But Jürg so far isn’t convinced. Admittedly he’s probably right; he’s always right. Still, DBus doesn’t feel to me like a good IPC for this data transfer.
We know about the requests to have direct access to the SQLite database from your own process. I explained in the bug that SQLite3 isn’t MVCC, and that this means your process will often get blocked for a long time on our transactions; a longer time than any IPC overhead takes.
Supporting ontology changes in Tracker
It used to be that in Tracker you couldn’t just change the ontology. When you did, you had to reboot the database, which means losing all the non-embedded data: for example your tags, or other such information that’s stored only in Tracker’s RDF store.
This was of course utterly unacceptable and this was among the reasons why we kept 0.8 from being released for so long: we were afraid that we would need to make ontology changes during the 0.8 series.
So during 0.7 I added support for what I call modest ontology changes: adding a class, adding a property. But just that; not changing an existing property. This was sufficient for 0.8, because now we could at least do some changes like adding a property to a class or adding a new class. You know, making it possible to implement the standard feature requests.
The last two weeks I worked on supporting more intrusive ontology changes. The branch that I’m working on currently supports changing tracker:notify for the signals-on-changes feature, tracker:writeback for the writeback feature, and tracker:indexed, which controls the indexes in the SQLite tables.
Certain range changes are also supported: for example integer to string, double and boolean; string to integer, double and boolean; and double to integer, string and boolean. Range changes will of course sometimes mean data loss.
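For illustration, at the SQLite level such a range change essentially boils down to rebuilding the table and casting the values. Below is a minimal sketch of that general technique only; the table and column names are invented, and Tracker’s real migration code also has to take care of indexes, the journal and the ontology metadata.

/* Sketch: change the declared type of a column by rebuilding the table. */
#include <sqlite3.h>

static int
change_column_to_text (sqlite3 *db)
{
    const char *sql =
        "BEGIN;"
        "CREATE TABLE example_new (id INTEGER PRIMARY KEY, value TEXT);"
        "INSERT INTO example_new SELECT id, CAST(value AS TEXT) FROM example;"
        "DROP TABLE example;"
        "ALTER TABLE example_new RENAME TO example;"
        "COMMIT;";

    /* sqlite3_exec runs the semicolon-separated statements one by one */
    return sqlite3_exec (db, sql, NULL, NULL, NULL);
}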
Plenty of code was also added to detect an unsupported ontology change and to ensure that we just abort the process and don’t do any changes in that case.
It’s all quite complex so it might take a while before the other team members have tested and reviewed all this. It should probably take even longer before it hits the stable 0.8 branch.
We won’t yet open the doors to custom ontologies, for several reasons:
- We want more testing on the support for ontology changes. We know that once we open the doors to custom ontologies that we’ll see usage of this rather sooner than later.
- We don’t yet support removing properties and classes. This would be easy (drop the table and columns, and log the event in the journal) but it’s not yet supported, mostly because we don’t need it ourselves (which is a good reason).
- We don’t want you to meddle with the standard ontologies (we’ll do that, don’t worry). So we need a bit of ontology management code to also look in other directories, etc.
- The error handling of unsupported ontology changes shouldn’t abort as mentioned above. Another piece of software shouldn’t be able to make Tracker unusable just because it installs junk ontologies.
- We actually want to start using OSCAF‘s ontology format. Perhaps it’s better that we wait for this instead of later asking everybody to convert their custom ontologies?
- We’re a bunch of pussies who are afraid of the can of worms that you guys’ custom ontologies will open.
But yes, you could say that the basics are being put in place as we speak.
Wikileaks
MSNBC: You have more tapes like this?
Julian Assange: Yes we do.
Assange: I won’t go into the precise number. But there was a rumor that the tape that we were about to release was about a similar incident in Afghanistan, where 97 people were bombed in May last year. We euhm, have that video.
MSNBC: Do you intend to release that video as well?
Assange: Yes, as soon as we have finished our analysis, we will release it.
Thank you Wikileaks. Thank you Julian Assange. You are bringing Wikileaks’ perspective calmly and clearly to the media. You’re an example to all whistleblowers. Julian, you’re doing a great job.
I understand more people are involved in this leak; thanks everybody. You’re being respected.
Information technology is all about information. Information for humanity.
Don’t you guys stop believing in this! We now believe in you. Many people like me are highly focused and when intelligence services want a battle: we’ll listen. People like me are prepared to act.
I understand you guys like Belgium’s law that protects journalists’ sources. As the owner of a Belgian Ltd. maybe I can help?
I’m not often proud of my country. Last week I told my Swiss friends here in Zürich that I have about 3000 reasons to leave Belgium and 1000 reasons to come to Switzerland. I wasn’t exaggerating.
I’m a guy with principles and ethics. So thank you.
Zürichsee
Today after I brought Tinne to the airport I drove around Zürichsee. She can’t stay in Switzerland the entire month; she has to go back to school on Monday.
While driving on the Seestrasse I started counting luxury cars. After I reached two for Lamborghini and three for Ferrari, I started thinking: Zimmerberg Sihltal and Pfannenstiel must be expensive districts too… And yes, they are.
I was lucky that it was nice weather today. But wow, what a nice view of the mountain tops when you look south over Zürichsee. People from Zürich, you guys are so lucky! Such an immensely calming feeling the view gives me! For me, it beats sauna. And I’m a real sauna fan.
I’m thinking of checking out the area south of Zürich, but not the canton itself; I think house prices are just exaggeratedly high in the canton of Zürich. I was thinking of Sankt Gallen, Toggenburg. I’ve never been there; I’ll check it out tomorrow.
Hmmr, meteoswiss gives rain for tomorrow. Doesn’t matter.
Actually, when I came back from the airport the first thing I really did was fix coping with property changes in ontologies for Tracker. Yesterday wasn’t my day, I think. I couldn’t find this damn problem in my code! And in the evening I lost three chess games in a row against Tinne. That’s really a bad score for me. Maybe after two weeks of playing chess almost every evening, she has gotten better than me? Hmmrr, that’s a troubling idea.
Anyway, so when I got back from the airport I couldn’t resist beating the code problem that I didn’t find on Friday. I found it! It works!
I guess I’m both a dreamer and a realist programmer. But don’t tell my customers that I’m such a dreamer.
Bern, an idyllic capital city
Today Tinne and I visited Switzerland’s capital, Bern.
We were really surprised; we’d never imagined that a capital city could offer so much peace and calm. It felt good to be there.
The fountains, the old houses, the river and the snowy mountain peaks give the city an idyllic image.
Standing on the bridge, you see the roofs of all these lovely small houses.
The bear is the symbol of Bern. Near the House of Parliament there was this statue of a bear, and Tinne just couldn’t resist giving it a hug. Bern has also got real bears. Unfortunately, Tinne was not allowed to cuddle those bears.
The House of Parliament is a truly impressive building. It looks over the snowy mountains, its people and its treasury, the National Bank of Switzerland.
As you can imagine, the National Bank building is a masterpiece as well. And even more impressive: it issues a world-leading currency.
On the market square in Oerlikon we first saw this chess board on the street: black and white stones and giant chess pieces. In Bern there was also a giant chess board, in the backyard of the House of Parliament. Tinne couldn’t resist challenging me to a game of chess. (*edit*: Armin noted in a comment that the initial positions of knight and bishop are swapped. And OMG, he’s right!)
And she won!
At the House of Parliament you get a stunning, idyllic view of the mountains of Switzerland.
Confoederatio Helvetica
It’s crossing my mind to move here in ~ two years.
Today we visited Zug; it has a Ferrari shop.
Zug, where an apartment costs far more than a villa in Belgium. In short: a million euros.

It also comforts me. I could be here. Zug has an aviary with exotic birds, and a lake.
When Tinne and I were driving back to Oerlikon, we listened to Karoliina’s Symphonic dream.
The music; a canvas for the paint, Switzerland.
Die Lichter auf dem Berg. Die sind alle Seelen. (The lights on the mountain. They are all souls.)
From grey mouse to putschist. That was quick.
Congratulations to Mr. Van Rompuy for helping the EU powers to find a compromise.
Diplomats credit him with a shrewd sense of deal-making and a determination that is belied by his quiet anti-charisma, and he has already begun to win plaudits from Paris, Berlin and other capitals.
— Financial Times, Saturday Mar 27 2010 (alt. link)
Finally a politician to be proud of as a Belgian!
The mouse is dull grey
It steps into the sunshine
The mouse is snow white
Reporting busy status
We’re nearing our first release in a very long time, so I’ll do another technical blog post about Tracker ;)
When the RDF store is replaying its journal at startup, or when it is restoring a backup, it is in a busy state. This means that we can’t handle your DBus requests during that time; your DBus method call will only return late.
Because that’s not very nice from a UI perspective (the “uh, what is going on??” syndrome kicks in), we’re adding a signal that emits the progress and status. You can also ask for them using the DBus methods GetProgress and GetStatus.
The miners already had something like this, so I kept the API more or less the same.
signal sender=:1.99 -> dest=(null destination) serial=1454 path=/org/freedesktop/Tracker1/Status; interface=org.freedesktop.Tracker1.Status; member=Progress
   string "Journal replaying"
   double 0.197824
signal sender=:1.99 -> dest=(null destination) serial=1455 path=/org/freedesktop/Tracker1/Status; interface=org.freedesktop.Tracker1.Status; member=Progress
   string "Journal replaying"
   double 0.698153
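On the client side, picking these signals up is straightforward. A minimal sketch with libdbus follows; the interface and member are the ones shown above, the rest is illustrative and error handling is omitted.

/* Sketch: print Tracker's Progress signals as they are emitted. */
#include <dbus/dbus.h>
#include <stdio.h>

int
main (void)
{
    DBusConnection *conn;
    DBusMessage *msg;

    conn = dbus_bus_get (DBUS_BUS_SESSION, NULL);
    dbus_bus_add_match (conn,
                        "type='signal',"
                        "interface='org.freedesktop.Tracker1.Status',"
                        "member='Progress'",
                        NULL);

    while (dbus_connection_read_write (conn, -1)) {
        while ((msg = dbus_connection_pop_message (conn)) != NULL) {
            if (dbus_message_is_signal (msg,
                                        "org.freedesktop.Tracker1.Status",
                                        "Progress")) {
                const char *status;
                double progress;

                dbus_message_get_args (msg, NULL,
                                       DBUS_TYPE_STRING, &status,
                                       DBUS_TYPE_DOUBLE, &progress,
                                       DBUS_TYPE_INVALID);
                printf ("%s: %.1f%%\n", status, progress * 100);
            }
            dbus_message_unref (msg);
        }
    }

    return 0;
}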
Jürg just reviewed the SPARQL regex performance improvement of yesterday, so that’s now in master. If you want this busy-status notification today already, you can test with the busy-notifications branch.
Performance improvements for SPARQL’s regex in Tracker
Tracker’s original SPARQL regex support uses a custom SQLite function. But of course, back when we wrote it we didn’t yet think much about optimization. As a result, we were using g_regex_match_simple, which recompiles the regular expression on every call.
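A simplified sketch of what that looked like (not the exact original code; argument checking and the flag handling are omitted), where the pattern is recompiled for every row the query evaluates:

/* Simplified sketch of the old approach: g_regex_match_simple() compiles
 * the pattern again for every single row. */
static void
function_sparql_regex_old (sqlite3_context *context,
                           int              argc,
                           sqlite3_value   *argv[])
{
    const gchar *text = (const gchar *) sqlite3_value_text (argv[0]);
    const gchar *pattern = (const gchar *) sqlite3_value_text (argv[1]);

    (void) argc;    /* same three arguments as in the new version below */

    sqlite3_result_int (context, g_regex_match_simple (pattern, text, 0, 0));
}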
Today Jürg and I found out about sqlite3_get_auxdata and sqlite3_set_auxdata, which allow us to cache a compiled value for a custom SQLite function for the duration of the query.
This is much better:
static void
function_sparql_regex (sqlite3_context *context,
                       int              argc,
                       sqlite3_value   *argv[])
{
    gboolean ret;
    const gchar *text, *pattern, *flags;
    GRegexCompileFlags regex_flags;
    GRegex *regex;

    if (argc != 3) {
        sqlite3_result_error (context, "Invalid argument count", -1);
        return;
    }

    regex = sqlite3_get_auxdata (context, 1);
    text = (const gchar *) sqlite3_value_text (argv[0]);
    flags = (const gchar *) sqlite3_value_text (argv[2]);

    if (regex == NULL) {
        gchar *err_str;
        GError *error = NULL;

        pattern = (const gchar *) sqlite3_value_text (argv[1]);

        regex_flags = 0;
        while (*flags) {
            switch (*flags) {
            case 's': regex_flags |= G_REGEX_DOTALL; break;
            case 'm': regex_flags |= G_REGEX_MULTILINE; break;
            case 'i': regex_flags |= G_REGEX_CASELESS; break;
            case 'x': regex_flags |= G_REGEX_EXTENDED; break;
            default:
                err_str = g_strdup_printf ("Invalid SPARQL regex flag '%c'", *flags);
                sqlite3_result_error (context, err_str, -1);
                g_free (err_str);
                return;
            }
            flags++;
        }

        regex = g_regex_new (pattern, regex_flags, 0, &error);

        if (error) {
            sqlite3_result_error (context, error->message, -1);
            g_clear_error (&error);
            return;
        }

        /* cache the compiled regex for the rest of the query */
        sqlite3_set_auxdata (context, 1, regex, (void (*) (void*)) g_regex_unref);
    }

    ret = g_regex_match (regex, text, 0, NULL);
    sqlite3_result_int (context, ret);

    return;
}
Before (this was a test on a huge amount of resources):
$ time tracker-sparql -q "select ?u { ?u a rdfs:Resource . FILTER (regex(?u, '^titl', 'i')) }"
real 0m3.337s
user 0m0.004s
sys 0m0.008s
After:
$ time tracker-sparql -q "select ?u { ?u a rdfs:Resource . FILTER (regex(?u, '^titl', 'i')) }"
real 0m1.887s
user 0m0.008s
sys 0m0.008s
This will hit Tracker’s master today or tomorrow.
Working hard at the Tracker project
Today we improved journal replaying from 1050s to 58s for my test of 25,249 resources.
Journal replaying happens when your cache database gets corrupted, and also when you restore a backup: restore uses the same code path as journal replaying, and backup just makes a copy of your journal.
During the performance improvements we of course found other areas related to data entry. It looks like we’re entering a period of focus on performance, as we already have a few interesting ideas for next week, focused on the performance of some SPARQL functions like regex.
Meanwhile, Michele Tameni and Roberto Guido are working on an RSS miner for Tracker, and Adrien Bustany has been working on other web miners, for Flickr, GData, Twitter and Facebook.
I think the first pieces of the RSS and other web miners will start becoming available in this week’s unstable 0.7 release. Martyn is still reviewing the guys’ branches, but we’re very lucky to have such good software developers as contributors. Very nice work, Michele, Roberto and Adrien!
The future of the European community, a European Monetary Fund.
I’m worried about the EURO’s M3 if a European version of the IMF (an EMF) is to be installed.
Nonetheless, I think the European community should do it just to strengthen Europe’s economy. I’m not satisfied by Europe’s economic strength: I want it to be undefeatable.
We must not let the IMF solve our problems. Europe might be a political dwarf, but we Europeans should show that we will solve our own problems. We’re an adult composition of cultures with vast amounts of experience. We know how to solve any imaginable problem. And let’s not, in our defeatism, pretend we don’t.
An EMF is a commitment to future member states: Europe often asks fundamental changes of them; economic strength is what Europe offers in return. This needs to come at a high price: Greece will have to fix its deficit problem, even if its entire population goes on strike. Greece will be an example for countries like my own: Belgium has to fix a serious deficit problem, too.
An EMF comes at an equally high price, and that frightens me a bit: I don’t want the ECB to go as ballistic on money creation as the FED has these last two years. I want the EURO to be the strongest relevant currency mankind has ever created. No matter how insane the rest of the world thinks that ambition is: I believe that keeping the EURO’s M3 in check is a key to creating a wealthy society in Europe.
Politically, I want European nations to negotiate more and more often. The European Union is a political dwarf only because finding agreement is hard. But in the long run our solution will be the most negotiated, most tested one on this planet.
Together we can deal with anything. That doesn’t mean it’ll be easy; it has never been easy: just seventy years ago we were still killing each other. We’re all guilty of that one way or another. And before that it wasn’t any better. Today, not that many people still care: “it wasn’t me”, right? So stop being a bitch about it, then.
It’s time to let it be. It’s time to start a new European century that will be better. With respect for all European cultures, languages, nations, nationalities, values, borders and interests.
But also a European century with economic responsibilities for each member. It’s our strength: we figured out how to keep our population wealthy: let’s continue doing so in the future.
Emotional (and social) intelligence
It was the dawn of the 1970s, at the height of worldwide student protests against the Vietnam War, and a librarian stationed at a U.S. Information Agency post abroad had received bad news: A student group was threatening to burn down her library.
But the librarian had friends among the group of student activists who made the threat. Her response on first glance might seem either naïve or foolhardy — or both: She invited the group to use the library facilities for some of their meetings.
But she also brought Americans living in the country there to listen to them — and so engineered a dialogue instead of a confrontation.
In doing so, she was capitalizing on her personal relationship with the handful of student leaders she knew well enough to trust — and for them to trust her. The tactic opened new channels of mutual understanding, and it strengthened her friendship with the student leaders. The library was never touched.
(More available at the flash preview widget’s page 21)
— Daniel Goleman, Working With Emotional Intelligence, Competencies of the stars. 1998
In Working with Emotional Intelligence, Daniel Goleman explains several practical methods to improve the social skills of people. Before I bought this book a year or two ago, I read Daniel’s first book Emotional Intelligence. This weekend I finally started reading Working With.
I recommend the section Some Misconceptions. Regretfully, this section isn’t available for display in the flash preview widget. Instead of violating copyright law by typing it out here, I recommend just buying the book.
You can find audiobooks online. The section about misconceptions is on track three. Track five talks about two computer programmers, which is very illustrative for many of my blog’s readers (and possibly myself). I hope you won’t download it illegally using torrents; instead, buy the material.
Also very interesting is this lecture by Daniel:
Here you can also find an Authors@Google talk by Daniel Goleman:
What distinguishes Daniel Goleman from old line proponents of positive thinking, however, is his grounding in psychology and neuroscience. Armed with a Ph.D. in psychology from Harvard and a first-rate journalism background at the New York Times, Dr. Goleman has authored half a dozen books that explore the physical and chemical workings of the brain and their relationship with what we experience as everyday life.
— Peter Allen, director of Google university, introduction to Daniel Goleman. August 3, 2007
I hope readers of my blog will shy away from pseudoscience when it comes to emotional and social intelligence, and instead read and learn from authors like Daniel Goleman. I also (still) recommend the books available at The Moral Brain, by for example Dr. Jan Verplaetse.
Tinymail 1.0!
Tinymail’s co-maintainer Sergio Villar just made Tinymail’s first release.
Psst, I have inside information, which I might not be allowed to share, that 1.2 is already being prepared and will have bodystructure and envelope summary fetching. And it’ll fetch E-mail body content per requested MIME part, instead of always fetching entire E-mails. Whoohoo!
An ode to our testers
You know those guys who use your software against huge datasets, like their entire filesystem with thousands of files?
We do. His name is Tshepang Lekhonkhobe, and we owe him a few beers for reporting many scalability issues to us.
Today we found and fixed such a scalability issue: the update query that resets the availability of file resources (this is part of the support for removable media) was causing VmRss usage to grow at least linearly with the number of file resources. For Tshepang’s situation that meant 600 MB of VmRss. Jürg reduced this to 30 MB of peak VmRss in the same use case, and improved the performance from minutes to a second or two, three. Without memory fragmentation, as glibc returns almost all of the VmRss back to the kernel.
Thursday is our usual release day. I invite all of the 0.7 pioneers to test us with your huge filesystems, just like Tshepang always does.
So long and thanks for all the testing, Tshepang! I’m glad we finally found it.
Invisible costs
We would rather suffer the visible costs of a few bad decisions than incur the many invisible costs that come from decisions made too slowly – or not at all – because of a stifling bureaucracy.
— Letter by Warren E. Buffett to the shareholders of Berkshire, February 26, 2010




