January 4, 2008 – How easy it is to make people believe a lie, and how hard it is to undo that work again

First of all, a summary is the overview of your E-mail folder or mailbox. It shows you the cc, to, from and subject of each E-mail. In IMAP terminology people also call this the ENVELOPE of each E-mail. Showing all ENVELOPEs of a folder is showing the summary to the user. Some people want more than just the cc, to, from and subject to be visible. Most E-mail clients also indicate the read and importance status of the E-mails in this view. Some E-mail clients also show the size of the E-mail. Whatever yours shows, that is what I’ll call … the summary here.

About a year ago I was telling people about how little memory Tinymail consumed: it consumes less memory than most other E-mail clients because it maps the summary data. That’s still true and, in my opinion, a lot better than just copying the strings in memory. However …

The implementation didn’t care enough about the VmRss. The analysis of memory usage was focused on what valgrind’s massif tool showed. That of course shows just the heap and the stack. Both are relevant, but they are not the only numbers that you have to consider. The referred-to page explains this too (don’t worry, I was never trying to hide this fact).

What matters for a mobile device is the VmRss. The VmRss is a number that indicates how much of your data is resident in real memory. For a mobile device this is important because these devices often have no swap partition (or only a slow one on a wear-leveling flash device) and have relatively little RAM installed, since more RAM consumes more battery and/or makes each unit more expensive to produce. I know RAM is not important for your server, and I don’t care about your server.

Oh .. that desktop or laptop that you just bought? In terms of amount of memory, it’s a server too in my eyes. I don’t care about your desktop machine either.

Another thing that ain’t really good about the current implementation is that writing the file takes a long time, that the file grows relatively large, and that it can contain redundant information. This redundant information contributes to the VmSize growing.

Note: VmSize is the total size of the process’s virtual memory, including things that got swapped out, libraries, heap, stack, everything that is mapped. As a number it’s good for making wannabe geeks scared about the memory consumption of your application. Other than that, it’s not interesting as a number unless you have a good idea what exactly it means (like, know how shared libraries are handled nowadays and things like that).
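On Linux both numbers can be read from /proc/<pid>/status. A minimal sketch of pulling them out, assuming that file format; the helper names `parse_vm_fields` and `read_own_vm_fields` are made up for this illustration and are not part of Tinymail:

```python
def parse_vm_fields(status_text):
    """Return {field: kB} for the Vm* lines of a /proc/<pid>/status dump."""
    fields = {}
    for line in status_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not name.startswith("Vm"):
            continue
        parts = rest.split()
        # Lines look like "VmRSS:      7604 kB"; the value is in kB.
        if len(parts) >= 2 and parts[0].isdigit() and parts[1] == "kB":
            fields[name] = int(parts[0])
    return fields


def read_own_vm_fields(path="/proc/self/status"):
    """Linux only: read the Vm* fields of the current process."""
    with open(path) as f:
        return parse_vm_fields(f.read())
```

Comparing `VmRSS` with `VmSize` before and after an operation is a cheap way to see whether that operation really pulled data into RAM or only enlarged the address space.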

What is worse is that the redundancy scatters the data you actually need over more pages: locality suffers and you’ll get more page faults. This contributes to a growing VmRss, which is bad news for a mobile device that is short on memory. I’d love to explain in detail why this contributes to VmRss, but it comes down to “a kernel can only page in using buffers/blocks of 4k”, “no matter how small your string is”, “so keep the ‘needed’ things close together and try to fit them in as few pages as possible”.
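The page arithmetic is easy to check for yourself. A small sketch, assuming 4096-byte pages and made-up offsets (the 100 strings of 25 bytes and their positions are purely illustrative, not measurements):

```python
PAGE_SIZE = 4096  # assumed page size, the "4k" mentioned above


def pages_touched(spans, page_size=PAGE_SIZE):
    """Return the set of page numbers covered by (offset, length) byte spans."""
    pages = set()
    for offset, length in spans:
        if length <= 0:
            continue
        first = offset // page_size
        last = (offset + length - 1) // page_size
        pages.update(range(first, last + 1))
    return pages


# 100 strings of 25 bytes each, say what is needed to draw one screen.
scattered = [(i * 8192, 25) for i in range(100)]  # one string every two pages
packed = [(i * 25, 25) for i in range(100)]       # stored back to back

scattered_pages = len(pages_touched(scattered))   # 100 pages, about 400 kB resident
packed_pages = len(pages_touched(packed))         # 1 page, 4 kB resident
```

Same 2,500 bytes of useful data in both cases; only the layout decides whether the kernel has to bring in one page or a hundred.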

Average Joe Six Pack the kernel developer will simply translate that to: try to avoid page faults.

I started writing an experimental new summary storage engine which will in future be used by Tinymail. This one will store duplicate strings only once, will sort the strings on reference count, will store lists of addresses (the to and cc fields) in a sorted way (in the hopes of creating more duplicate strings this way) and finally, to speed up the writing, it will store blocks of summary data rather than the whole summary of a folder in one big file. The reason for the blocks is that a summary is relatively read-only and usually only grows in size. (ps. Desrt is the cool guy who came up with the idea of sorting the lists of addresses)
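A rough sketch of the string handling described above, in Python rather than the C of the real engine. The names `normalize_addresses` and `build_string_table` are hypothetical, the comma-separated address format is a simplification, and I am assuming "sorted on reference count" means most-referenced first:

```python
from collections import Counter


def normalize_addresses(field):
    """Sort a comma-separated address list so equal sets give equal strings."""
    parts = [a.strip() for a in field.split(",") if a.strip()]
    return ", ".join(sorted(parts))


def build_string_table(items):
    """items: list of dicts with 'from', 'to', 'cc' and 'subject' strings.

    Returns (table, rows): table holds each distinct string once, most
    referenced first; rows hold indexes into the table instead of strings.
    """
    prepared = []
    for item in items:
        prepared.append((
            item["from"],
            normalize_addresses(item["to"]),
            normalize_addresses(item["cc"]),
            item["subject"],
        ))
    refcount = Counter(s for row in prepared for s in row)
    # Assumption: most referenced strings first, so the hot ones share pages.
    table = sorted(refcount, key=lambda s: (-refcount[s], s))
    index = {s: i for i, s in enumerate(table)}
    rows = [tuple(index[s] for s in row) for row in prepared]
    return table, rows
```

With two mails whose to fields are "b@x, c@x" and "c@x, b@x", sorting makes both fields the same string, so the table stores it once and both rows point at the same entry.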

Next to all these improvements, it’ll have a few new features like out of order adding and expunging of items using both the UID and the sequence number of the item. The reason for this flexibility in the API is that modern IMAP servers more or less require you to look them up using both the UID and the sequence number while handling EXPUNGE, FETCH and VANISHED responses.
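To illustrate what looking items up "using both the UID and the sequence number" means, here is a minimal sketch. The class `SummaryIndex` and its methods are invented for this example and say nothing about the real API; it only models the IMAP rule that expunging a message shifts all higher sequence numbers down by one:

```python
class SummaryIndex:
    """Keeps items reachable by IMAP sequence number and by UID."""

    def __init__(self):
        self._by_seq = []  # position i holds (uid, item) for sequence number i + 1
        self._by_uid = {}  # uid -> item

    def add(self, seq, uid, item):
        """Store an item at 1-based sequence number seq, in any order."""
        while len(self._by_seq) < seq:
            self._by_seq.append(None)  # placeholder for items not seen yet
        self._by_seq[seq - 1] = (uid, item)
        self._by_uid[uid] = item

    def get_by_seq(self, seq):
        if 1 <= seq <= len(self._by_seq) and self._by_seq[seq - 1] is not None:
            return self._by_seq[seq - 1][1]
        return None

    def get_by_uid(self, uid):
        return self._by_uid.get(uid)

    def expunge_seq(self, seq):
        """Handle an EXPUNGE response: later sequence numbers shift down."""
        if not 1 <= seq <= len(self._by_seq):
            return
        entry = self._by_seq.pop(seq - 1)
        if entry is not None:
            del self._by_uid[entry[0]]

    def expunge_uid(self, uid):
        """Handle a VANISHED response, which reports UIDs instead."""
        if uid not in self._by_uid:
            return
        del self._by_uid[uid]
        for pos, entry in enumerate(self._by_seq):
            if entry is not None and entry[0] == uid:
                del self._by_seq[pos]
                break
```

The linear scan in `expunge_uid` keeps the sketch short; a real store would want something faster for large folders.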

These are the results of a summary with 50,000 completely unique summary items (each string in each field of the summary item is unique). It’s a worst-case scenario. Everybody probably understands that a large number of the strings in the summary view of his E-mail’s INBOX are duplicates. Right? The strings were around 20 to 30 bytes each. This is a low average, most E-mails have larger strings. (But most blog items have smaller texts, I know, patience, I’m almost finished)

Mmap file's size: 13 MB

VmPeak:    22076 kB
VmSize:    22076 kB <-
VmLck:         0 kB
VmHWM:      7604 kB
VmRSS:      7604 kB <-
VmData:     7084 kB
VmStk:        88 kB
VmExe:        16 kB
VmLib:      2152 kB
VmPTE:        16 kB

We can see the large VmSize (large, as we expected, since the mapped file is 13 MB in size). Interestingly, the VmRss is just around 8 MB. These eight megs of mostly heap and stack are being used by pointers that point to the data in the mapping, hashtable nodes and admin info like the reference-count integer of the items. I did the measurement before touching any of the items' data. The kernel has therefore effectively not paged in any of the mapped file's data (on-demand paging). This memory, however, is what I call "must have": you won't ever get rid of those eight megs with a folder that has 50,000 items loaded. If the VmRss grows beyond that because of pages from the mapped file, your other applications can get that memory back from the kernel when memory gets scarce (depending on how actively you are using the E-mail client's summary data at the moment, of course).
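The split between "must have" memory and the mapped file can be sketched like this. It is an illustration in Python using the standard `mmap` module, not the actual storage engine; `write_strings`, `MappedStrings` and `demo` are hypothetical names and the NUL-terminated file layout is an assumption:

```python
import mmap
import os
import tempfile


def write_strings(path, strings):
    """Write NUL-terminated UTF-8 strings; return their (offset, length) pairs."""
    spans = []
    with open(path, "wb") as f:
        for s in strings:
            data = s.encode("utf-8")
            spans.append((f.tell(), len(data)))
            f.write(data + b"\0")
    return spans


class MappedStrings:
    """Holds only offsets on the heap; the string bytes stay in the mapping."""

    def __init__(self, path, spans):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._spans = spans  # the "must have" part

    def get(self, i):
        offset, length = self._spans[i]
        # Reading the slice touches the mapping; pages come in on demand here.
        return self._map[offset:offset + length].decode("utf-8")

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    """Round-trip two strings through a temporary mapped file."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "summary.bin")
    spans = write_strings(path, ["Re: lunch?", "desrt@example.org"])
    store = MappedStrings(path, spans)
    try:
        return [store.get(i) for i in range(len(spans))]
    finally:
        store.close()
        os.remove(path)
        os.rmdir(directory)
```

Until `get` is called for an item, nothing of that item's string data needs to be resident; the offsets are the part you always pay for.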

You can get the experimental summary store here (it's attached to the mail).



Warning. This one is a little bit technical