federated storage – TWENTY FIRST CENTURY STORAGE

This text has been posted to Docker development community, at https://github.com/docker/docker/issues/14049 – might be a bit too technical at times.

1. Terms

Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while at the same time supporting unmodified clients. Not to be confused with LXC pid etc. namespaces

2. sha256

Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.

Side note: sha256 covers a space of more than 10 ** 77 unique random digests, which is about as much as the number of atoms in the observable universe. Apart from this unimaginable number sha256 has all the good crypto-qualities including collision resistance, avalanche effect for small changes, pre-image resistance and second pre-image resistance.

The same applies to sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2 to name a few.

Those are the distinct properties that allows us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable “asset”.

This simple fact can be used not only to self-validate the images and index them locally via Graph’s in-memory index. This can be further used to support global container/image namespace and global deduplication. That is:

Global Namespace
Global Deduplication

for image layers. Hence, this Proposal.

3. Docker Cluster

Rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:

The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs docker daemon. Each node performs a dual role: it is NFS server to all other nodes, with NFS share sitting directly on the node’s local rootfs. Simultaneously, each node is NFS client, as per the diagram below:

Blue arrows reflect actual NFS mounts.

There are no separate NAS servers: each node, on one hand, shares its docker (layers, images) metadata and, separately, driver-specific data. And vice versa, each node mounts all clustered shares locally, under respective hostnames as shown above.

Note: hyper-convergence

Often times this type of depicted clustered symmetry, combined with the lack of physically separate storage backend is referred to as storage/compute “hyper-convergence”. But that’s another big story outside this scope..

Note: runtime mounting

As far as this initial implementation (link above) all the NFS shares are mounted statically and prior to the daemon’s startup. This can be changed to on-demand mount and more..

Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). This patch patches them both – the latter currently is done for aufs only.

4. Benefits

An orchestrator can run container on an image-less node, without waiting for the image to get pulled
Scale-out: by adding a new node to the cluster, we incrementally add CPU, memory and storage capacity for more docker images and containers that, in turn, can use the aggregated resource
Deduplication: any image or layer that exists in two or more instances can be, effectively, deduplicated. This may require pause/commit and restart of associated containers; this will require reference-counting (next)

5. Comments

It’s been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea – makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.

6. What’s Next

The patch works as it is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident others may be less.

The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would be in-turn configurable to use the underlying OS aufs or overlayfs mount/unmount.

This is easy but this, as well as the other points below, requires positive feedback and consensus.

Other immediate steps include:

graph.TagStore to tag all layers including remote
rootNFS setting via .conf for Graph
fix migrate.go accordingly

Once done, next steps could be:

on demand mounting and remounting via distributed daemon (likely etcd)
node add/delete runtime support – same
local cache invalidation upon new-image-pulled, image-deleted, etc. events (“cache” here implies Graph.idIndex, etc.)
image/layer reference counting, to correctly handle remote usage vs. ‘docker rmi’ for instance
and more

And later:

shadow copying of read-only layers, to trade local space for performance
and vice versa, removal of duplicated layers (the “dedup”)
container inter-node migration
container HA failover
object storage as the alternative backend for docker images and layers (which are in fact immutable versioned objects, believe it or not).

Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.

7. Instead of Conclusion

In the end the one thing that makes it – all of the above – doable and feasible is the immutable nature of image layers and their unique and global naming via crypto-content-hashes.

cubes

The idea is probably not new, bits and pieces of it exist in multiple forms for many years and likely constitute some Platonic truth that is above and beyond anyconcrete implementation. This applies to image storage systems(*), object storage systems, file archival storage systems and even block storage. There’s often a content checksum that accompanies the (image, object, file, block) as part of the corresponding metadata – checksum of the corresponding content that is used to validate the latter AND, at least sometimes, address it – that is, find its logical and/or physical location when reading/searching.

Historically those content checksums were 16bit and 32bit wide CRCs and simlar; nowadays, however, more often than not storage systems use cryptographically-secure, collision- and pre-image-attack resistant SHA-2 and SHA-3, possibly Edon-R and Blake2 as well.

Now.. example. Let’s say, Vendor A and Vendor B both store a given popular image, say, one of those Ubuntu LTS as per https://wiki.ubuntu.com/LTS. For instance. Vendor A would store this content as a file over blocks, while Vendor B would use object backend to store the same image as bunch of distributed chunks.

When storing the image, in addition to (and preferably, in parallel with) checksumming blocks – A, or chunks – B, both vendors would compute the whole-image checksum: SHA-2 and/or SHA-3 for starters (NIST’s final report provides other viable alternatives). The latter then must become the stored-image metadata – alias on the image’s name, or more exactly, a specific version of the latter. Further, users may request this crypto-alias, or aliases, to be returned as part of the operation of storing the image, or later, in “exchange” for the stored version name.

Let’s see what it gives..

User can now perform
```
Get(hash-type, crypto-hash alias)
```
– on any vendor that supports the above, to uniquely retrieve and validate a given version of the image
The like/dislike and trust/distrust feedback can be metadata-attached to the image’s cryptohash and thus becomes globally cumulative and available to all users
Vendors can begin partnering to deduplicate common content, independently of their respective vendor-specific and often-proprietary backend mechanisms
The content can be uniquely identified and end-to-end validated with 100% certainty

The long and short of it is that secure crypto-hash on a given content can be used for vendor-neutral access to trusted images which can further enable cross-vendor interoperability, global internet-wide deduplication, peer-to-peer networking with torrent-like service to distribute the same trusted self-validating content, and beyond.

See related: Global Namespace for Docker

TWENTY FIRST CENTURY STORAGE

more than just faster disks

Tag: federated storage

Global Namespace for Docker