The crawler’s modification time queries
Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.
Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObjects wasn’t too large, that wasn’t a huge problem. But it didn’t scale linear. I started with optimizing the function itself. It was using a strlen() so I replaced that with a sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me ~ 10%; not enough.
So this commit was a better improvement. First it makes nfo:belongsToContainer an indexed property. The x nfo:belongsToContainer p means x is in a directory p for file resources. The commit changes the query to use the property that is now indexed.
The original query before we started with this optimization took 1.090s when you had ~ 300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison because now we use an indexed property. Adding the index only took a total of 10s for a ~ 300,000 large table and the table is being queried while we index (while we insert into it). Do the math, it’s a huge win in all situations. For the SQLite freaks; the SQLite database grew by 4 MB, with all items in the table indexed.
PDF extractor
Another optimization I did earlier was the PDF extractor. Originally, we used the poppler-glib library. This library doesn’t allow us to set the OutputDev at runtime. If compiled with Cairo, the OutputDev is in some versions a CairoOutputDev. We don’t want all images in the PDF to be rendered to a Cairo surface. So I ported this back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master poppler_page_get_text_page is always using a TextOutputDev).
Another major problem with poppler-glib is the huge amount of copying strings in heap. The performance to extract metadata and content text for a 70 page PDF document without any images went from 1.050s to 0.550s. A lot of it was caused by copying strings and GValue boxing due to GObject properties.
Table locked problem
Last week I improved D-Bus marshaling by using a database cursor. I forgot to handle SQLITE_LOCKED while Jürg and Carlos had been introducing multithreaded SELECT support. Not good. I fixed this; it was causing random Table locked errors.