Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while at the same time supporting unmodified clients. Not to be confused with LXC pid etc. namespaces
Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.
Side note: sha256 covers a space of more than 10 ** 77 unique random digests, which is about as much as the number of atoms in the observable universe. Apart from this unimaginable number sha256 has all the good crypto-qualities including collision resistance, avalanche effect for small changes, pre-image resistance and second pre-image resistance.
The same applies to sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2 to name a few.
Those are the distinct properties that allows us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable “asset”.
This simple fact can be used not only to self-validate the images and index them locally via Graph’s in-memory index. This can be further used to support global container/image namespace and global deduplication. That is:
- for image layers. Hence, this Proposal.
3. Docker Cluster
Rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:
The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs docker daemon. Each node performs a dual role: it is NFS server to all other nodes, with NFS share sitting directly on the node’s local rootfs. Simultaneously, each node is NFS client, as per the diagram below:
Blue arrows reflect actual NFS mounts.
There are no separate NAS servers: each node, on one hand, shares its docker (layers, images) metadata and, separately, driver-specific data. And vice versa, each node mounts all clustered shares locally, under respective hostnames as shown above.
Often times this type of depicted clustered symmetry, combined with the lack of physically separate storage backend is referred to as storage/compute “hyper-convergence”. But that’s another big story outside this scope..
Note: runtime mounting
As far as this initial implementation (link above) all the NFS shares are mounted statically and prior to the daemon’s startup. This can be changed to on-demand mount and more..
Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). This patch patches them both – the latter currently is done for aufs only.
- An orchestrator can run container on an image-less node, without waiting for the image to get pulled
- Scale-out: by adding a new node to the cluster, we incrementally add CPU, memory and storage capacity for more docker images and containers that, in turn, can use the aggregated resource
- Deduplication: any image or layer that exists in two or more instances can be, effectively, deduplicated. This may require pause/commit and restart of associated containers; this will require reference-counting (next)
It’s been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea – makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.
6. What’s Next
The patch works as it is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident others may be less.
The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would be in-turn configurable to use the underlying OS aufs or overlayfs mount/unmount.
This is easy but this, as well as the other points below, requires positive feedback and consensus.
Other immediate steps include:
- graph.TagStore to tag all layers including remote
- rootNFS setting via .conf for Graph
- fix migrate.go accordingly
Once done, next steps could be:
- on demand mounting and remounting via distributed daemon (likely etcd)
- node add/delete runtime support – same
- local cache invalidation upon new-image-pulled, image-deleted, etc. events (“cache” here implies Graph.idIndex, etc.)
- image/layer reference counting, to correctly handle remote usage vs. ‘docker rmi’ for instance
- and more
- shadow copying of read-only layers, to trade local space for performance
- and vice versa, removal of duplicated layers (the “dedup”)
- container inter-node migration
- container HA failover
- object storage as the alternative backend for docker images and layers (which are in fact immutable versioned objects, believe it or not).
Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.
7. Instead of Conclusion
In the end the one thing that makes it – all of the above – doable and feasible is the immutable nature of image layers and their unique and global naming via crypto-content-hashes.