LRU cache for prepared statements in Tracker’s RDF store

While trying to handle a bug that had a description like “if I do this, tracker-store’s memory grows to 80MB and my device starts swapping”, we were surprised to learn that a sqlite3_stmt consumes about 5 KB of heap. Ouch.

Until now we didn’t think those prepared statements were very large, so we threw all of them into a hash table in case the query was run again later. However, if you collect thousands of such statements, memory consumption obviously grows.

We decided to implement an LRU cache for these prepared statements. For clients that access the database using direct-access, the cache will be smaller, so that maximum consumption is only a few megabytes. Because our INSERT and DELETE queries are more reusable than our SELECT queries, we split the cache into two separate caches per thread.

The implementation is a simple intrinsic linked ring list. We’re still testing it a bit to settle on good cache-size numbers. I guess it’ll go into master soon. For your testing pleasure you can find the branch here.
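To make that concrete, below is a minimal sketch in C of such a cache. This is not the actual Tracker code: all names (StmtCache, stmt_cache_get and so on) are made up for illustration. A hit unlinks the node and relinks it at the front of the ring; a miss prepares the statement and, when the cache is full, first evicts the ring’s tail, which is the least recently used entry. Lookup is a linear walk to keep the sketch short; a real implementation would presumably pair the ring with a hash table keyed on the query text.

    /* Minimal sketch (not Tracker's actual code) of an LRU cache for
       sqlite3_stmt handles, built as an intrinsic doubly-linked ring
       with a sentinel node. */

    #include <sqlite3.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct CacheNode CacheNode;
    struct CacheNode {
            CacheNode    *prev, *next;  /* intrinsic ring links */
            char         *sql;          /* key: the query text */
            sqlite3_stmt *stmt;         /* the cached prepared statement */
    };

    typedef struct {
            CacheNode head;             /* sentinel; head.next = most recent */
            unsigned  size, max_size;
    } StmtCache;

    static void
    stmt_cache_init (StmtCache *cache, unsigned max_size)
    {
            cache->head.prev = cache->head.next = &cache->head;
            cache->size = 0;
            cache->max_size = max_size;
    }

    static void
    node_unlink (CacheNode *node)
    {
            node->prev->next = node->next;
            node->next->prev = node->prev;
    }

    static void
    node_link_front (StmtCache *cache, CacheNode *node)
    {
            node->prev = &cache->head;
            node->next = cache->head.next;
            node->next->prev = node;
            cache->head.next = node;
    }

    /* Return a prepared statement for sql, preparing it on a miss and
       evicting the least recently used entry when the cache is full. */
    static sqlite3_stmt *
    stmt_cache_get (StmtCache *cache, sqlite3 *db, const char *sql)
    {
            CacheNode *node;

            for (node = cache->head.next; node != &cache->head; node = node->next) {
                    if (strcmp (node->sql, sql) == 0) {
                            node_unlink (node);            /* hit: promote */
                            node_link_front (cache, node);
                            sqlite3_reset (node->stmt);
                            return node->stmt;
                    }
            }

            if (cache->size == cache->max_size) {
                    CacheNode *lru = cache->head.prev;     /* ring tail = LRU */
                    node_unlink (lru);
                    sqlite3_finalize (lru->stmt);
                    free (lru->sql);
                    free (lru);
                    cache->size--;
            }

            node = malloc (sizeof *node);
            node->sql = strdup (sql);
            if (sqlite3_prepare_v2 (db, sql, -1, &node->stmt, NULL) != SQLITE_OK) {
                    free (node->sql);
                    free (node);
                    return NULL;
            }
            node_link_front (cache, node);
            cache->size++;
            return node->stmt;
    }

In this sketch, stmt_cache_get returns a reset statement ready to be bound and stepped, and a failed sqlite3_prepare_v2 surfaces as NULL.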

3 thoughts on “LRU cache for prepared statements in Tracker’s RDF store”

  1. Is it really a good idea to cache every query ever made as a prepared statement?

    That is, shouldn’t a query only be considered for long-term caching if it was used more than X times and parsing etc. took longer than Y on average (to propose a simplistic selection criterion)?

    A ring buffer could still show pathological cache performance:
    Consider a job that creates a lot of unique but short-lived queries, and that job runs every other hour (or day).

    This job would almost always wipe out your ring buffer, replacing queries with good cache-hit rates by queries with very poor cache-hit rates (too unique and short-lived).

  2. @michael: That’s a valid concern, and for that reason it’s likely that we’ll make the size of the cache configurable (so that on a desktop, where memory usage doesn’t matter much nowadays, you can make it very large).

    We have separate LRU caches for SQL SELECT and SQL INSERT/DELETE, which also reduces the problem you mention: a SQL SELECT query is far more often used only once than a SQL INSERT or SQL DELETE query, because we generate most of that SQL from a fixed ontology for SPARQL INSERT and SPARQL DELETE (except for their WHERE part, which is often less reusable). A sketch of this per-thread split follows at the end of this comment.

    A query that gets run twice is shifted forward in the LRU ring list, so it’s definitely not the next one to be evicted: the second time the query runs, it becomes the most recently used entry. A full new loop of the entire ring (a whole new set of queries, filling up the cache again) would be needed to push it out.

    However, for our own main use-case (mobile/limited devices running Harmattan & MeeGo) we will test all important use-cases against the scenario that you describe:

    We’re testing with, for example, qtcontacts-tracker, which is among the pieces of software that produce the widest variety of queries. It was this team that reported the bug described as “if I do this, tracker-store’s memory grows to 80MB and my device starts swapping”. The cause was indeed our sqlite3_stmt cache. For Harmattan & MeeGo devices we currently have a maximum memory limit of about 50MB, so we plan to configure the cache sizes such that we won’t consume more.

    If a use-case with an often-repeated cycle creates cache-size+1 new queries per cycle, then each repetition would push a query out of the cache. That’s not good, so we’re testing all such use-cases and will adjust the cache sizes accordingly.

    Right now the branch uses a cache size of 100 for SELECT and 100 for INSERT+DELETE. This consumes about 12MB of VmRSS while the qtcontacts-tracker performance tests are running (tests which insert and delete thousands of contacts). Testing of this and other use-cases is ongoing.

    ps. A problem with only caching the statements of queries that occur more than once is that, when a query arrives, we of course don’t know whether it will run again later (we can’t predict the future). We could hash each query to an int64, store the hashes of all past queries, and on arrival of a query check whether it has been seen before; if so, prepare the statement the second time around, and so on. But so many queries might arrive that the int64 hashes themselves start consuming a lot of memory, defeating the very idea behind purging the statements in the first place (bounded memory growth). An LRU cache has a roughly fixed memory consumption over time.
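    To make the per-thread split concrete, here is a hypothetical sketch building on the StmtCache type from the post above; none of these names come from the actual branch, and the sizes simply match the numbers quoted in this comment:

        /* Hypothetical sketch: one pair of caches per thread, keeping
           SELECT statements separate from INSERT/DELETE ones.  StmtCache
           and stmt_cache_* are from the sketch in the post above. */

        typedef struct {
                StmtCache select_cache;  /* SELECTs: often used only once */
                StmtCache update_cache;  /* INSERT/DELETE: highly reusable */
        } ThreadStmtCaches;

        static void
        thread_caches_init (ThreadStmtCaches *caches)
        {
                stmt_cache_init (&caches->select_cache, 100);
                stmt_cache_init (&caches->update_cache, 100);
        }

        static sqlite3_stmt *
        thread_caches_get (ThreadStmtCaches *caches, sqlite3 *db,
                           const char *sql, int is_select)
        {
                StmtCache *cache = is_select ? &caches->select_cache
                                             : &caches->update_cache;
                return stmt_cache_get (cache, db, sql);
        }

    Keeping the two caches separate means a burst of one-off SELECTs cannot evict the highly reusable INSERT/DELETE statements.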

  3. By the way, some more documentation on LRU and why we picked it:

    http://en.wikipedia.org/wiki/Page_replacement_algorithm#Least_recently_used

    “LRU works on the idea that pages that have been most heavily used in the past few instructions are most likely to be used heavily in the next few instructions too.”

    Note that the overhead of maintaining the (intrinsic) linked list is relatively small compared to the cost of preparing a statement, so that aspect is not a big concern for us.
