Domain indexes finished, technical conclusions

The support for domain specific indexes is, awaiting review / finished. Although we can further optimize it now. More on that later in this post. Image that you have this ontology:

nie:InformationElement a rdfs:Class .

nie:title a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nie:InformationElement ;
  rdfs:range xsd:string .

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement .

nmm:beatsPerMinute a rdf:Property ;
  nrl:maxCardinality 1 ;
  rdfs:domain nmm:MusicPiece ;
  rdfs:range xsd:integer .

With that ontology there are three tables called “Resource”, “nmo:MusicPiece” and “nie:InformationElement” in SQLite’s schema:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” one has ID and “nmm:beatsPerMinute”

That’s fairly simple, right? The problem is that when you ORDER BY “nie:title” that you’ll cause a full table scan on “nie:InformationElement”. That’s not good, because there are less “nmm:MusicPiece” records than “nie:InformationElement” ones.

Imagine that we do this SPARQL query:

SELECT ?title WHERE {
   ?resource a nmm:MusicPiece ;
             nie:title ?title
} ORDER BY ?title

We translate that, for you, to this SQL on our schema:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nie:InformationElement2"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

OK, so with support for domain indexes we change the ontology like this:

nmm:MusicPiece a rdfs:Class ;
  rdfs:subClassOf nie:InformationElement ;
  tracker:domainIndex nie:title .

Now we’ll have the three tables called “Resource”, “nmo:MusicPiece” and “nie:InformationElement” in SQLite’s schema. But they will look like this:

  • The “Resource” table has ID and the subject string
  • The “nie:InformationElement” has ID and “nie:title”
  • The “nmm:MusicPiece” table now has three columns called ID, “nmm:beatsPerMinute” and “nie:title”

The same data, for titles of music pieces, will be in both “nie:InformationElement” and “nmm:MusicPiece”. We copy to the mirror column during ontology change coping, and when new inserts happen.

When now the rdf:type is known in the SPARQL query as a nmm:MusicPiece, like in the query mentioned earlier, we know that we can use the “nie:title” from the “nmm:MusicPiece” table in SQLite. That allows us to generate you this SQL query:

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1"
  WHERE  "title_u" IS NOT NULL
) ORDER BY "title_u"

A remaining optimization is when you request a rdf:type that is a subclass of nmm:MusicPiece, like this:

SELECT ?title WHERE {
  ?resource a nmm:MusicPiece, nie:InformationElement ;
            nie:title ?title
} ORDER BY ?title

It’s still not as bad as now the “nie:title” is still taken from the “nmm:MusicPiece” table. But the join with “nie:InformationElement” is still needlessly there (we could just do the earlier SQL query in this case):

SELECT   "title_u" FROM (
  SELECT "nmm:MusicPiece1"."ID" AS "resource_u",
         "nmm:MusicPiece1"."nie:title" AS "title_u"
  FROM   "nmm:MusicPiece" AS "nmm:MusicPiece1",
         "nie:InformationElement" AS "nie:InformationElement2"
  WHERE  "nmm:MusicPiece1"."ID" = "nie:InformationElement2"."ID"
  AND    "title_u" IS NOT NULL
) ORDER BY "title_u"

We will probably optimize this specific use-case further later this week.

SQLite’s WAL, deleting a domain specific index

SQLite’s WAL

SQLite is working on WAL, which stands for Write Ahead Logging.

The new logging technique means that we can probably keep read statements open for multiple processes. It’s not full MVCC yet as writes are still not doable simultaneously. But in our use-case is reading with multiple processes vastly more important anyway.

We’re investigating WAL mode of SQLite thoroughly these next few days. Jürg is working most on this at the moment. If WAL is fit for our purpose then we’ll probably also start developing a direct-access library that’ll allow your process to connect directly with our SQLite database, avoiding any form of IPC.

Adrien‘s FD-passing is in master, though. And it’s performing quite well!

We’re thrilled that SQLite’s team is taking this direction with WAL. Very awesome guys!

Domain specific indexes

Yesterday I worked on support for deleting a domain specific index from the ontology. Because SQLite doesn’t support dropping a column with its ALTER support, I had to do it by renaming the original table, recreating the table without the mirror column, and then copying the data from the renamed table. And finally dropping the renamed table. It’s nasty, but it works. I think SQLite should just add DROP COLUMN to ALTER. Why is this so hard to add?

I finally got it working, now it must of course be tested and then again tested.

Next for the feature is adapting the SPARQL engine to start using the indexed mirror column and produce better performing SQL queries.

Working on domain specific indexes

So … what is involved in a “simple change” like what I wrote about yesterday?

First you add support for annotating the domain specific index in the ontology files. This is straight forward as we of course have a generic Turtle parser, and it’s just a matter of adding properties to certain classes, and filling the values from the ontology in in the instances in our in-memory representation of the ontology. You of course also need to change the CREATE-TABLE statements. Trivial.

Then you implement detecting changes in the ontology. And more complex; coping with the changes. This means doing ALTER on the SQL tables. You also need to copy from the InformationElement table to the MusicPiece table (I’m using MusicPiece to clarify, it’s of course generic) in case of such a domain specific index being added during an ontology change, and put an implicit index on the column. After all, that index is why we’re doing this.

I finished those two yesterday. I have not finished detecting a deletion of a domain specifix index yet. That will have to ALTER the table with a DROP of the column. The most difficult here is detecting the deletion itself. We don’t yet have any code to diff on multivalue properties in the ontology (the ontology is a collection of RDF statements like everything else, describing itself).

Today I finished writing copy values to the MusicPiece table’s mirror column

Next few days will be about adapting the SPARQL engine and of course coping with a deletion of a domain specific index. And then testing, and again testing. Mind that this has to work from a journal replay situation too. In which case no ontology is involved (it’s all stored in the history of the persistent journal).

Where’s my Redbull? Ah, waiting for me in the fridge. Good!