The crawler’s modification time queries
Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.
Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObject resources wasn’t too large, that wasn’t a huge problem. But it didn’t scale linearly. I started by optimizing the function itself: it used a strlen(), which I replaced with sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me ~10%; not enough.
So this commit was a better improvement. First, it makes nfo:belongsToContainer an indexed property. A triple “x nfo:belongsToContainer p” means that the file resource x is in the directory p. The commit then changes the query to use this newly indexed property.
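To give an idea of the shape of the change, here is a sketch of the two query styles as the crawler might build them. This is illustrative only: the property names nie:url and nfo:fileLastModified and the folder URI are placeholders, not the literal queries from the commit.

```c
/* Illustrative sketch of the two query shapes; not the literal queries
 * from the commit. nie:url, nfo:fileLastModified and the folder URI are
 * placeholders. */
#include <stdio.h>

int
main (void)
{
    /* Old: the custom SQLite function has to be evaluated against every
     * nfo:FileDataObject row, which is what caused the full table scan. */
    const char *old_query =
        "SELECT ?url ?mtime WHERE { "
        "  ?f a nfo:FileDataObject ; "
        "     nie:url ?url ; "
        "     nfo:fileLastModified ?mtime . "
        "  FILTER (tracker:uri-is-parent (<file:///some/folder>, ?url)) "
        "}";

    /* New: the parent/child relation is a plain triple pattern on the now
     * indexed nfo:belongsToContainer property, so SQLite can do an index
     * lookup instead of scanning the whole table. */
    const char *new_query =
        "SELECT ?url ?mtime WHERE { "
        "  ?f a nfo:FileDataObject ; "
        "     nie:url ?url ; "
        "     nfo:fileLastModified ?mtime ; "
        "     nfo:belongsToContainer <file:///some/folder> . "
        "}";

    printf ("%s\n\n%s\n", old_query, new_query);
    return 0;
}
```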
The original query, before this optimization, took 1.090s with ~300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison, because now we use an indexed property. But adding the index took only about 10s for a table of ~300,000 rows, and the table can still be queried (and inserted into) while we index it. Do the math: it’s a huge win in all situations. For the SQLite freaks: the SQLite database grew by 4 MB, with all items in the table indexed.
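For the curious, at the SQLite level “making a property indexed” boils down to something like the sketch below. This is a minimal sketch under assumptions: the database filename and the generated table and column names are made up, not Tracker’s actual schema.

```c
/* Minimal sketch of what an indexed property means at the SQLite level.
 * The database filename and the table/column names are assumptions, not
 * Tracker's actual schema. */
#include <sqlite3.h>
#include <stdio.h>

int
main (void)
{
    sqlite3 *db;
    char *err = NULL;

    if (sqlite3_open ("meta.db", &db) != SQLITE_OK) {
        fprintf (stderr, "open failed: %s\n", sqlite3_errmsg (db));
        return 1;
    }

    /* One-off cost: roughly 10s for ~300,000 rows in our measurement,
     * while the table kept being queried and inserted into. */
    if (sqlite3_exec (db,
                      "CREATE INDEX IF NOT EXISTS \"nfo:belongsToContainer_idx\" "
                      "ON \"nfo:FileDataObject\" (\"nfo:belongsToContainer\")",
                      NULL, NULL, &err) != SQLITE_OK) {
        fprintf (stderr, "index creation failed: %s\n", err);
        sqlite3_free (err);
    }

    sqlite3_close (db);
    return 0;
}
```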
PDF extractor
Another optimization I did earlier was in the PDF extractor. Originally, we used the poppler-glib library. That library doesn’t allow us to set the OutputDev at runtime, and if poppler is compiled with Cairo support, the OutputDev is in some versions a CairoOutputDev. We don’t want all images in the PDF to be rendered to a Cairo surface. So I ported this back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master, poppler_page_get_text_page always uses a TextOutputDev).
Another major problem with poppler-glib is the huge amount of string copying on the heap. Extracting metadata and content text from a 70-page PDF document without any images went from 1.050s to 0.550s. A lot of the original cost came from string copies and GValue boxing due to GObject properties.
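For context, the poppler-glib path we moved away from looked roughly like the sketch below. The function name is made up and this is not Tracker’s actual extractor code; it also assumes a poppler-glib version where poppler_page_get_text() takes only the page. The point is where the copies come from: g_object_get() hands back newly allocated copies via GValue boxing, and poppler_page_get_text() allocates yet another string.

```c
/* Sketch of the old poppler-glib based path (not Tracker's actual
 * extractor code; extract_pdf is a made-up name). Assumes a poppler-glib
 * version where poppler_page_get_text() takes only the page.
 * Build with: gcc sketch.c `pkg-config --cflags --libs poppler-glib` */
#include <poppler.h>

static void
extract_pdf (const gchar *uri)   /* must be a URI, e.g. file:///tmp/a.pdf */
{
    PopplerDocument *doc;
    gchar *title = NULL, *author = NULL;
    gint n_pages, i;
    GError *error = NULL;

    doc = poppler_document_new_from_file (uri, NULL, &error);
    if (!doc) {
        g_printerr ("Failed to open %s: %s\n", uri, error->message);
        g_error_free (error);
        return;
    }

    /* Metadata arrives as copies via GObject properties (GValue boxing). */
    g_object_get (doc, "title", &title, "author", &author, NULL);
    g_print ("title: %s\nauthor: %s\n",
             title ? title : "(none)", author ? author : "(none)");

    n_pages = poppler_document_get_n_pages (doc);
    for (i = 0; i < n_pages; i++) {
        PopplerPage *page = poppler_document_get_page (doc, i);
        gchar *text = poppler_page_get_text (page);  /* another heap copy */

        /* ... feed the text to the full-text indexer here ... */

        g_free (text);
        g_object_unref (page);
    }

    g_free (title);
    g_free (author);
    g_object_unref (doc);
}

int
main (int argc, char **argv)
{
    if (argc > 1)
        extract_pdf (argv[1]);
    return 0;
}
```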
Table locked problem
Last week I improved D-Bus marshaling by using a database cursor. I forgot to handle SQLITE_LOCKED while Jürg and Carlos were introducing multithreaded SELECT support. Not good. I fixed this; it was causing random “Table locked” errors.
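As a minimal sketch of the idea, not Tracker’s actual code (the helper name is made up): when a concurrent transaction holds a conflicting lock, reset the statement, back off briefly and retry instead of surfacing the error.

```c
/* Minimal sketch (not Tracker's actual code; step_with_retry is a made-up
 * name): handle SQLITE_LOCKED/SQLITE_BUSY from concurrent access by
 * resetting the statement, backing off briefly and retrying, instead of
 * surfacing a "Table locked" error to the caller. */
#include <sqlite3.h>
#include <glib.h>

int
step_with_retry (sqlite3_stmt *stmt)
{
    int rc;

    for (;;) {
        rc = sqlite3_step (stmt);

        if (rc == SQLITE_LOCKED || rc == SQLITE_BUSY) {
            sqlite3_reset (stmt);
            g_usleep (1000);  /* 1 ms back-off, then try again */
            continue;
        }

        return rc;  /* SQLITE_ROW, SQLITE_DONE or a real error */
    }
}
```

A cursor loop would then call step_with_retry() wherever it would otherwise call sqlite3_step() directly.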
Since you’re talking about performance: is it normal that tracker starts an index run whenever I log in for the first time after a reboot? It is not a full index run, but it takes quite some time and slows down my whole computer, because all my files reside on an encrypted disk. (nie:DataObject = 113672)
Besides that, I like tracker :-)
@steve: What you call an ‘initial index’ is precisely what we optimized and what the first part of this blog is about. A lot of the time of that ‘initial index’ goes to the crawler’s modification-time queries. By the way, we don’t call it an ‘initial index’ but an ‘initial crawl’. It’s a process that checks every already indexed file to see whether a miner update is necessary. Carlos and Martyn are working on making this ‘initial crawling’ optional, and/or making it possible to run it at a user-defined frequency or event (like system idle) instead of always at startup of tracker-miner-fs. But given the ~10x performance improvement of its query, disabling it matters less (it’s very fast now).
I recommend you try again with git master and let us know if this ‘initial crawling’ is still too slow for you. My test was with ~300,000 items, yours with ~120,000 items; we’re in the same league. The query’s performance now stays much closer to constant as the total number of items grows (due to the added index), and the query itself is faster in all situations (due to avoiding the full table scan).
PS: we also stopped calling it ‘indexing’; nowadays we call it ‘mining’. I know these are just terms and words, but to understand each other better it helps if we both use the same terms and words ;-)
So, to summarize:
– Crawling is checking whether files got changed while we were not watching them (changed behind our back). Crawling is at the moment done at the start of tracker-miner-fs. The first part of this blog item is about a major performance improvement for it. We have plans to make it either optional, or run it at a user-defined frequency or during system idle time.
– Mining is getting metadata from files and inserting it into the RDF store. The performance improvement of the PDF extractor (second part of the blog item) influences mining performance.
Will the improvements be merged back into tracker 0.8? The “new tracker” looks great now, despite having a rather complicated developer API (I think you’ll probably agree that the SPARQL stuff is at least a significant threshold that has to be conquered).
Hey Philip,
thanks for your quick answer and the vocabulary lesson :-)
Based on your description I took a closer look at why crawling takes so long on my system. My project folder contains lots of source code (363,350 items, totalling 4.4 GB), and given that I don’t really need those to be searchable, I excluded them.
If I encounter any more problems I will be happy to report them here.
But so far, thanks!
@Ulrik: Yes, this will be backported to 0.8 by Martyn either next week or the week after. You probably understand that we need to test things in 0.9 before we move them to 0.8, so that 0.8 never has any of those SQLITE_LOCKED issues ;-)
We think SPARQL is a major improvement as an API, to be honest. It’s a W3C specification that is being adopted by most RDF products: http://www.w3.org/TR/rdf-sparql-query .
Owen Taylor, on one of the GNOME mailing lists, recently more or less refuted the argument that SPARQL is hard to learn with his claim that he mastered it in ~30 minutes. Admittedly, Owen is a very experienced software developer. Apparently he’s even advocating Tracker for gnome-shell now. Cool!
But SPARQL isn’t very hard, to be honest. And neither is Nepomuk, the ontology: http://www.semanticdesktop.org/ontologies . Perhaps check out this introduction video by Rob Taylor: http://vimeo.com/9848513 . We also give trainings and presentations at conferences, we have some documentation at http://live.gnome.org/Tracker/Documentation/ , on our IRC channel #tracker on GimpNET you’ll see the number of SPARQL wizards growing by the day, you can ask questions on our mailing list, etc.
@steve: Try master. It’ll solve a lot of your performance problems during initial crawling, thanks to the query improvement mentioned in this blog item. Reducing the number of items of course also reduces the scalability problem. But that’s not the solution for me, the engineer, of course. Haha. It has to scale, to scale!
Good, good, you guys are getting interested in this stuff. Awesome. “If you build it, they/he will come”?
I guess I am impatient. I think the GNOME wiki needs a for-idiots explanation of the most basic parts of a query. I ended up simply stracing ‘tracker-search’ to see what kind of query it used, and substituting my search term into that :-)
@Ulrik: Learning SPARQL won’t harm you. Besides, why strace a binary if you have the code?
http://git.gnome.org/browse/tracker/tree/src/tracker-utils/tracker-search.c#n243
Btw., while learning SPARQL, feel free to make a “SPARQL for dummies” page on our GNOME Live pages. We’ll correct it if you write any inaccuracies. Just join #tracker on GimpNET and let us know when you change stuff.
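To give a flavour of what tracker-search sends for a plain text search, the query boils down to roughly the shape sketched below (illustrative only; the real query in tracker-search.c differs in its details, and the search term here is just an example).

```c
/* Rough sketch of the shape of query tracker-search sends for a plain
 * text search (illustrative only; see tracker-search.c for the real one). */
#include <stdio.h>

int
main (void)
{
    const char *search_term = "holiday";  /* example term */
    char query[512];

    snprintf (query, sizeof (query),
              "SELECT ?urn nie:url(?urn) "
              "WHERE { ?urn fts:match \"%s\" } "
              "ORDER BY DESC (fts:rank(?urn)) "
              "LIMIT 10",
              search_term);

    printf ("%s\n", query);
    return 0;
}
```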
Of course it won’t harm me. It would not harm to have a gentle introduction rather than massive manuals and conferences for something this simple.
You trace processes, not binaries. I had it installed; I didn’t have the code. Not that I don’t usually read code on git.gnome.org, but because of that I also know that reading and understanding code always takes more time than you expect.
@Ulrik: http://vimeo.com/9848513 is a gentle introduction