Tundra is the initial implementation of an object store for the
server. This object store makes it possible to move local email files to remote object
storage such as Amazon S3 and fetch them on demand when requested by the imap/pop servers,
all transparently to the clients and to dovecot. The implementation is pretty simple
and takes advantage of various known preconditions within
The Tundra implementation relies on certain attributes, or preconditions, to function.
First, dovecot must be configured to use the sdbox storage format so that each email
is stored as an immutable single file in either the primary mail storage location or the
alternative storage location.
Second, all the email file open() calls in the dovecot source code must be readily and
comprehensively identifiable. Fortunately, it appears that there is a single piece of code in
src/lib-storage/index/dbox-common/dbox-file.c where these files are opened for
Third, the expectations of the upper layers when getting a file descriptor from
dbox-file.c must be minimal. Specifically, they should not expect to do anything more than
seek and read the content, with no other expectations such as what an fstat() call might
return. See the Seekability section for more details.
The final precondition is reliance on the fact that dovecot always tries to open the
email file from the primary storage location first, and only if that open fails does it
attempt to open the same named email file from the alternative storage location. This
‘open order guarantee’ makes it easy to safely replace email files at any time by simply
ensuring that the primary is only removed after the alternate file is created or, if the
alternate is being removed, that the primary is created first.
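The ‘open order guarantee’ can be illustrated with a minimal Python sketch. The directory layout and the helper names open_mail and move_to_alternate are hypothetical and exist only for illustration; dovecot's real logic lives in its C code and is not reproduced here.

```python
import os
import shutil
import tempfile


def open_mail(primary_dir, alt_dir, name):
    """Try the primary location first; use the alternate only if that open fails."""
    try:
        return open(os.path.join(primary_dir, name), "rb")
    except OSError:
        return open(os.path.join(alt_dir, name), "rb")


def move_to_alternate(primary_dir, alt_dir, name):
    """Create the alternate copy first, and only then remove the primary."""
    shutil.copy2(os.path.join(primary_dir, name), os.path.join(alt_dir, name))
    os.remove(os.path.join(primary_dir, name))


def demo():
    with tempfile.TemporaryDirectory() as root:
        primary = os.path.join(root, "primary")
        alt = os.path.join(root, "alt")
        os.makedirs(primary)
        os.makedirs(alt)
        with open(os.path.join(primary, "u.1"), "wb") as f:
            f.write(b"hello")
        with open_mail(primary, alt, "u.1") as f:
            before = f.read()
        move_to_alternate(primary, alt, "u.1")
        with open_mail(primary, alt, "u.1") as f:
            after = f.read()
        return before, after
```

Because the alternate copy exists before the primary disappears, a reader following the primary-then-alternate order finds the message at every point during the move.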
Other important considerations
Seekability
Initially, our limited POC suggested that dovecot could survive with a non-seekable
file descriptor, which simplified the implementation considerably. Unfortunately, not so.
It turns out that while the dovecot pop server only requires a non-seekable mail
file descriptor, the imap server requires a seekable one. For reasons unknown, it wants to
seek backwards after reading part of a message. This is unfortunate as it means that
amoProxy has to support fully caching the remote object as a local file
before passing the fd back to the client. This complicates amoProxy significantly, incurs
a fairly serious performance impact on the local file system and increases latency before
the imap code receives the first byte of the message.
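The difference between the two kinds of file descriptor can be shown with a small Python sketch. It is not amoProxy's code: the helper names is_seekable and cache_to_file are hypothetical, and an anonymous temporary file stands in for whatever local cache the real implementation uses.

```python
import os
import tempfile


def is_seekable(fd):
    """Report whether lseek() works on this descriptor (it fails on pipes)."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)
        return True
    except OSError:
        return False


def cache_to_file(pipe_fd):
    """Drain a pipe into a temporary file and return a seekable descriptor."""
    tmp = tempfile.TemporaryFile()
    while True:
        chunk = os.read(pipe_fd, 65536)
        if not chunk:
            break
        tmp.write(chunk)
    tmp.flush()
    tmp.seek(0)
    fd = os.dup(tmp.fileno())  # keep a descriptor that outlives the wrapper
    tmp.close()
    return fd


def demo():
    r, w = os.pipe()
    os.write(w, b"header\r\nbody\r\n")
    os.close(w)
    pipe_seekable = is_seekable(r)
    fd = cache_to_file(r)
    os.close(r)
    first = os.read(fd, 6)
    os.lseek(fd, 0, os.SEEK_SET)  # the backwards seek a pipe cannot satisfy
    again = os.read(fd, 6)
    file_seekable = is_seekable(fd)
    os.close(fd)
    return pipe_seekable, file_seekable, first, again
```

The whole object has to be drained before the seekable descriptor can be handed over, which is where the added latency described above comes from.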
The net result is that
amoProxy has to manage a local file system cache of recently
Dovecot code-base risk
Anyone who has examined the
dovecot code-base knows that it is huge, brittle, very
vulnerable to change and/or refactoring, poorly documented and thus largely inscrutable.
The intent is not to criticize dovecot; after all, it is a very good imap server and
such beasts are few and far between. It is also open source and free for anyone to use.
No, the intent is merely to recognize that adding code deep into
risky. Unless you are prepared to invest in a long-term deep-dive into the huge code-base
you may well end up having no clue what you are doing or why your code is failing. An
examination of the main dovecot mailing list confirms that very few people, if any, outside
the original authors ever make any code changes. The collective hive-mind has spoken.
This code-base risk is the reason this approach was chosen: it allows for
minimal changes to the code base with no internal knowledge of
Performance and Costs
This document does not discuss performance and cost implications. Obviously there are many, but they are out of scope here. Please see the project and deployment documentation for details on that front.
The Tundra system takes advantage of the aforementioned preconditions and makes a
significant effort to minimize dovecot code changes due to the code-base risk. There are
two components to the Tundra implementation which interact with
amoClient detects object store files and sends fetch
Component Relationship Diagram
Mail File Open Sequence
dbox-file.c are replaced with calls to a shim-library
- If the shim-library determines an email file represents a remote object it sends a fetch request to an on-system
- amoProxy manages the fetch of the remote object and fd-passes the read fd of a pipe back to the shim-library
- The shim-library passes the read fd back to dovecot as the result of the replacement
- amoProxy writes the incoming remote object to the write fd side of the pipe
- dovecot reads the read fd of the pipe as if it's a local mail file
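The fd-passing step in this sequence can be sketched in Python (3.9 or later, Unix only) using socket.send_fds and socket.recv_fds over a Unix domain socket. This is an illustration only, not the Tundra wire protocol: the function names proxy_serve and shim_fetch, the b"ok" status message and the single-process demo are all assumptions.

```python
import os
import socket


def proxy_serve(conn, payload):
    """Proxy side: create a pipe, pass its read end, then write the object."""
    read_fd, write_fd = os.pipe()
    socket.send_fds(conn, [b"ok"], [read_fd])
    os.write(write_fd, payload)  # small payload, fits in the pipe buffer
    os.close(write_fd)
    os.close(read_fd)  # the receiver holds its own copy of the read end


def shim_fetch(conn):
    """Shim side: receive the passed descriptor and hand it to the caller."""
    _msg, fds, _flags, _addr = socket.recv_fds(conn, 1024, 1)
    return fds[0]


def demo():
    shim_sock, proxy_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        proxy_serve(proxy_sock, b"Subject: hi\r\n\r\nbody\r\n")
        fd = shim_fetch(shim_sock)
        with os.fdopen(fd, "rb") as f:
            return f.read()  # read the pipe as if it were a local mail file
    finally:
        shim_sock.close()
        proxy_sock.close()
```

In a real deployment the proxy would be a separate process and would stream the object into the pipe while the reader consumes it; the demo writes everything up front only to stay within one process.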
The main advantage of this approach is that most of the intelligence is in
rather than modified
dovecot code. The second advantage is that it’s a very minor code
dovecot which can easily be applied to newer versions with minimal effort.
Perhaps most importantly, the final advantage is that if there are any debugging issues
dovecot then by the simple expedient of making the relevant mail files local, you
are back to a situation where you can run a completely standard
dovecot and completely
amoProxy from the debugging equation.
But how does the shim-library determine whether an email file represents a remote object or not?
Symlinks to the rescue
When it is determined that a local email file is suitable for remote storage, the file is first copied to the remote object store, a symlink with the remote URL is then created in the alternate storage location and finally the original local email file in the primary storage is deleted. This migration sequence relies on the ‘open order guarantee’ mentioned earlier.
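The ordering of the migration steps can be sketched as follows. The migrate function, its upload callback and the dictionary standing in for the remote object store are hypothetical; a real migration would call an actual object store client.

```python
import os
import tempfile


def migrate(primary_path, alt_path, link_target, upload):
    """Upload first, then create the symlink, and only then delete the primary."""
    with open(primary_path, "rb") as f:
        upload(link_target, f.read())      # 1. copy to the remote object store
    os.symlink(link_target, alt_path)      # 2. symlink in the alternate location
    os.remove(primary_path)                # 3. remove the primary last


def demo():
    store = {}  # stand-in for the remote object store
    with tempfile.TemporaryDirectory() as root:
        primary = os.path.join(root, "primary-u.1")
        alt = os.path.join(root, "alt-u.1")
        with open(primary, "wb") as f:
            f.write(b"message body")
        target = "amo:s:s3://amob1/u1/u.1"
        migrate(primary, alt, target, store.__setitem__)
        return (
            os.path.exists(primary),
            os.path.islink(alt),
            os.readlink(alt),
            store[target],
        )
```

If the process is interrupted between any two steps, the message is still reachable: either the primary file is still present or the symlink already points at an uploaded object.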
With that in mind, all the shim-library has to do on each open request is detect whether
the path is a suitable symlink. If so, send a fetch request with the embedded URL to
amoProxy and expect a readable pipe file-descriptor in return. If not, open it as usual and
return the opened file-descriptor to the caller.
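The decision made on each open request can be sketched in Python. The real shim-library is linked into dovecot's C code; the name shim_open, the fetch callback and the use of a simple "amo:" prefix test (borrowed from the link format described later in this document) are assumptions made for illustration.

```python
import os
import tempfile

MAGIC = "amo:"  # assumed prefix marking an object store symlink


def shim_open(path, fetch):
    """Return a readable fd: via fetch() for object store symlinks, else a plain open."""
    try:
        target = os.readlink(path)
    except OSError:
        target = None  # not a symlink: fall through to a normal open
    if target is not None and target.startswith(MAGIC):
        return fetch(target)
    return os.open(path, os.O_RDONLY)


def fake_fetch(target):
    """Stand-in for the proxy: return the read end of a pipe holding the object."""
    r, w = os.pipe()
    os.write(w, b"remote:" + target.encode())
    os.close(w)
    return r


def read_all(fd):
    with os.fdopen(fd, "rb") as f:
        return f.read()


def demo():
    with tempfile.TemporaryDirectory() as root:
        local = os.path.join(root, "u.2")
        remote = os.path.join(root, "u.1")
        with open(local, "wb") as f:
            f.write(b"local mail")
        os.symlink("amo:s:s3://amob1/u1/u.1", remote)
        return read_all(shim_open(local, fake_fetch)), read_all(shim_open(remote, fake_fetch))
```

The caller receives a file descriptor in both cases and does not need to know which path was taken.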
If you’re wondering about embedding URLs in symlinks, wonder no more. No Unix-like operating system cares about the target data of a symlink. Sure, it may try to open it if you ask, but otherwise a symlink is nothing more than a container of bytes that can have anything in them. Importantly, on many file systems a short symlink target is stored inline in the file system's metadata rather than in separate data blocks, so it takes up little or no additional file-system space.
If you want to create “URL-type” symlinks for yourself, try this sequence of shell commands:
$ ln -s 'http://www.yahoo.com' myurl
$ ls -l myurl
lrwxr-xr-x 1 markd staff 20 3 Mar 18:54 myurl -> http://www.yahoo.com
$ echo My URL link is `readlink myurl`
$ curl `readlink myurl`
Hopefully you can see that a symlink can contain anything we want it to contain.
Symbolic Link Format
The object store symlinks are not just raw URLs as shown above; rather, they have a structure which wraps the various functions and helps protect against mis-detecting real symlinks as object store symlinks.
Here’s an example of a mailbox folder where some messages have been relocated to a remote object on AWS S3 while other messages are still stored as local files.
$ ls -l u.*
lrwxr-xr-x 1 markd staff   23 16 Jan 16:17 u.1 -> amo:s:s3://amob1/u1/u.1
lrwxr-xr-x 1 markd staff   25 21 Jan 08:38 u.154 -> amo:s:s3://amob1/u1/u.154
-rw-r--r-- 1 markd staff  840 19 Jan 16:30 u.2
lrwxr-xr-x 1 markd staff   23 21 Jan 08:10 u.3 -> amo:s:s3://amob1/u1/u.3
-rw-r--r-- 1 markd staff 4122 03 Feb 13:45 u.36
Symbolic Link Content semantics
The symbolic link consists of three components separated by colons. These three components are “magic pattern”, “URL Type” and “URL” respectively. Only the first two colons act as separators, since the “URL” itself may contain colons. In the above examples it can be seen that ‘amo’ is the magic pattern that identifies object store symbolic links.
The second component in the example is ‘s’, which indicates that the “URL Type” is an AWS S3 URL, with the third component being the actual S3 URL.
Other “URL Types” are “/” for file system with the “URL” being a Unix path and “a” for Azure storage with the “URL” being an Azure blob store URL. Note that Azure fetching has not been implemented.
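A minimal Python sketch of parsing this format follows. The names ObjectLink, URL_TYPES and parse_link are hypothetical; the one detail worth noting is that the split is limited to the first two colons so that colons inside the “URL” are preserved.

```python
from typing import NamedTuple, Optional


class ObjectLink(NamedTuple):
    url_type: str
    url: str


# URL types listed in this document; "a" is defined but fetching is not implemented.
URL_TYPES = {"s": "AWS S3", "/": "file system", "a": "Azure storage"}


def parse_link(target: str) -> Optional[ObjectLink]:
    """Return the parsed link, or None if the target is not an object store symlink."""
    parts = target.split(":", 2)  # at most three components
    if len(parts) != 3 or parts[0] != "amo":
        return None
    _magic, url_type, url = parts
    if url_type not in URL_TYPES or not url:
        return None
    return ObjectLink(url_type, url)
```

An ordinary symlink such as ../other/file fails the magic pattern check and is returned as None, which is how real symlinks avoid being mistaken for object store links.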
It may be that we want to embed some other attributes in the symbolic link, such as message size, compression algorithm, encryption key index and so on. If so, it might be better to reserve some positional colon-separated parameters ahead of the “URL” before the format is set in stone.