Docker, NFS, ZFS, and extended attributes

It may be difficult to develop an emotional connection to all of the features of filesystems and filers. Take deduplication for instance. Dedup is cool. Rabin-Karp rolling hash, sliding-window Content Defined Chunking (CDC) – those were cool 15 years ago and remain cool today. Improvements and products (and startups) keep pouring in.

But when it comes to extended file attributes (xattrs), emotions range from a blank stare to dismay. As in: wouldn’t touch with a ten-foot pole.

Come to think of it, part of the problem is – NFS. And part of the NFS problem is that neither v3 nor v4 supports xattrs. There is no support whatsoever: none, nada, zilch. And how could there be, with no interoperable standard?

Docker Detour

Docker keeps fascinating me, purely as a use case. From the image hosting perspective, there are a couple of things missing at its current stage of development. The biggest and most obvious one is a shared, distributed, and deduplicated store for both image manifests (image metadata) and layer content (the data).

Because both are immutable and sha256-protected, the related complexity is about 3 orders of magnitude lower than it would be for anything less specialized.

Distributing the content-hashed and stacked stuff like this:


HyperConverged Storage Directions — How to get users down the path

I wanted to title this Hypo Diverged Storage, as seemingly there are many ideas of what “hyper converged” storage and architectures are in the industry. Targeted at developers and solution providers alike, this piece will strive to combine a string of ideas of where I think we want to go with our solutions, and I hope to dispel some arguments that are assumptive or techno-religious in nature. Namely:
  • What is really necessary for a strong foundational technology/product
  • Emulate the service architectures
  • It just needs to be good enough
  • Containers: Just build out the minimum and rely on VMs/hypervisors to handle the rest
  • You must build fresh cloud native, immutable applications only
In my role at Stanford Electrical Engineering I have numerous interactions with researchers and vendors, including research projects that have now become the formative infrastructure companies in the industry. My hand in many of these has been slight, but we have had more effect in guiding the build-out of usable solutions, taking many into production use first at our facilities. Among the many projects over time, we have dealt with the beginnings and maturation of virtualization in compute, storage, and networking. Today’s emerging on-prem and cloud infrastructure now takes as a given that we need to have all three. Let’s target the assumptions.
1) To build out a production quality solution that works well in the ecosystem, you just need to make a killer (storage/compute/network) appliance or product.
In many discussions, people may notice a mantra that most companies rightfully focus on building technology from a position of strength such as storage or networking, but miss out when they only attempt to integrate with one of the other legs of the virtualization stool. They rush to market but fail to execute well or deliver what end users actually want because the integration is half-baked at best. I’ve worked with great storage products that focus on network access alone, but can’t deal with bare-metal container hypervisors; SDN products that can help make OpenStack work well for tenants, but ignore the storage component that itself is on the network; storage products that require dedicated use of network interfaces for I/O without awareness of virtualized interfaces; hypervisors that use networks for storage but can’t themselves hide transient network failure from the guests they host. These are just samples, but they inform us that it’s hard to get it all right.
Although it’s a lot for any going concern to address, it is necessary for products and solutions to have well-thought-out scale-up and scale-out approaches that address all three legs of the stool. One could argue that the fourth leg is orchestration, and that might need to be taken into account as well. Regardless, focusing on an integrated storage/networking/virtualization solution for the simpler cases, and building from there, is more advantageous to customers looking to grow with these solutions long term.
2) To build a successful hyper converged product, you need to emulate how Google/AWS/etc build out their cloud infrastructure. 
I’d argue that most potential customers won’t necessarily want to pick a random vendor to bring AWS-like infrastructure on site. The economics and operational complexity would likely sway them to just host their products on AWS or similar, regardless of how much performance is lost in the multiple layered services offered by Amazon. In most cases, mid-enterprise and higher customers already have various solutions adopted, thus necessitating a brown field approach to adopting hyper converged solutions.
This gets to one important belief I have. Regardless of your solution, what matters is the ability for a site to bootstrap its use with minimal initial nodes (1 or more), truly enable migration of work loads onto it, and, with time and confidence, grow out the installation. This pushes out another assumption: that the hardware is uniform. With gradual roll outs, you should expect variation in the hardware as a site goes from proof-of-concept, to initial roll out, and finally to full deployment. All the while, management of the solution should allow the use of existing networks where possible and the phase-out or upgrade of nodes over time. There is no “end of deployment”, as it is more of a continuous journey between site and solutions vendor.
You’ll notice a theme: bootstrapping the initial storage and networking, and doing so simply, is more important than emulating the full stack from the get-go. Don’t try to be Amazon; just make sure you have a strong storage/compute/networking foundation so that you can span multiple installations and even pre-existing solutions like Amazon.
3) Just get the product in the door with good initial bootstrapping.
Extending from the above, simply bootstrapping is not enough. Support of a product tends to kill many startups or seasoned companies. However, it really bedevils startups as they strive to get initial market acceptance for their investors. In the end, a hyper converged solution is a strong relationship between vendor(s) and the customer. Deployment should be considered continuous up front, so the ability to add, maintain, and upgrade a product is not only central to the relationship, it informs the product design. Consider what hyper converged truly means. Each node is self-sufficient, but any work load should be able to migrate to another node, and the loss of any given node should not overly strain the availability of the infrastructure.
I don’t want to take away too much from how one should focus on ease of initial deployment. That can win customers over. However, keeping users of converged infrastructure happy with the relationship requires happiness with the overall architecture and not the initial technology.
4) Containers can run well enough in virtual machines. Just build on top of VMs and then layer on top our storage and network integrations.
VMs alone are no longer the target deployment, but rather applications and/or app containers. In the rush to enable container deployment, vendors take the shortcut and build on top of existing VM infrastructure. It’s worse when it comes to existing hyperconverged solutions that have heavy infrastructure requirements to support VMs. If one now sees lightweight containers as a target, one’s assumptions about what makes hyperconverged infrastructure change. Merely adopting existing mature VM infrastructure becomes its own issue.
I love the work that Joyent has done in this space. They’ve argued well, using mature container technology not found in Linux, and brought Linux and Docker into their infrastructure. What makes them better than the present competition is that they run their containers on bare-metal and right on top of virtualized storage (ZFS backed) and maturing network virtualization. The solution presented works wonders for those operating the tech as an IAAS. Market acceptance of a non-native Linux solution or some of the present non-clustered FS components may limit the appeal of excellent technology. At the same time, the primary market as a public cloud places them in the shadow of the larger players there.
A Joyent-like technological approach married with full SDN integration could be the dream product, if made into such a product. Already, the tech proves the assumption of using VMs to be a weak proposition. What VM infrastructure has and containers need for wider adoption is SDNs, namely overlay networks. If we enable not only pooled storage but pooled networking, we can gain better utilization of the network stack, inherent path redundancy, and, via an SDN overlay, a safe way to build tenant storage networks arbitrarily over multiple physical networks, perhaps across multiple vendor data centers. Think how this serves the customer’s potential container mobility.
5) Docker is containers / Containers need to be immutable by design for a scale out architecture
All the above leads to this point. Taking past experience and intuition, I think that what holds back adoption of next generation infrastructure throughout the industry is in fact the religious arguments around immutability. The work loads that could make use of better solutions coming to market require some level of persistence per node/container. Though live migration may not be warranted in many cases, the ability to upgrade a container, vertically scale it, or run “legacy” applications using these new operation approaches will be required of a mature container solution. We just haven’t gotten there yet. A scale-out infrastructure can allow for persistent type containers, as long as end users architect for the expected failure modes that may occur for this application profile.
What matters to customers is the ability to use the build out and scale out nature of the cloud, with known data and operational redundancy and recovery at its core. In many cases, work loads that the market presents may not yet fit into the Docker world, but they can definitely be containerized and virtualized to hyper converged storage solutions.
If you believe that these are the target users, who don’t necessarily need to develop the whole solution with a ground-up “cloud” architecture, don’t wish to run on the public cloud in all cases, and need a way to on-board cloud technology into the present organization in a progressive way, you will see where we all collectively need to go. It’s not so much a hyper converged world, but an inviting, enabling solution space that starts simple (perhaps one compute+storage node) and scales out non-uniformly, using SDNs to tie in the nodes as they come and go. Customers can adopt small, and grow with you as you build out better, more complete infrastructure.
Forward thoughts:
My experience shows that the IT work place is changing, and those managing data centers need to adjust to adopt the skill sets of cloud operators. There is pressure I’ve seen in many organizations to convert both the business and the work force to adopt cloud technology. The opposing forces are institutional resistance (or requirements) to not adopt the public cloud, or critical work loads that can’t trivially be converted to Docker-style immutable containers. That is a present market condition needing solutions. A standalone product won’t work, but a long-term approach that grows with an organization through this IT landscape change will serve both customers and solution providers well.

Go Anagram

The history is well known. Go started as a pragmatic effort by Google to answer its own software needs in managing hundreds of thousands (some say, tens of millions) of servers in Google’s data centers. If scale exists anywhere, Google has it. Quoting the early introduction (which I strongly suggest reading), Go at Google: Language Design in the Service of Software Engineering:

The Go programming language was conceived in late 2007 as an answer to some of the problems we were seeing developing software infrastructure at Google. The computing landscape today is almost unrelated to the environment in which the languages being used, mostly C++, Java, and Python, had been created. The problems introduced by multicore processors, networked systems, massive computation clusters, and the web programming model were being worked around rather than addressed head-on.
Go was designed and developed to make working in this environment more productive. Besides its better-known aspects such as built-in concurrency and garbage collection, Go’s design considerations include rigorous dependency management, the adaptability of software architecture as systems grow, and robustness across the boundaries between components.

When I first started looking at Go, aka golang [1], it was strictly in connection with Docker, coreos/rkt, LXD and various other container projects, all of which happen to be coded in Go.

Your First Go

“Like many programmers I like to try out new languages” – a quote from Adam Leventhal’s blog on his first Rust program: anagrammer. My contention as well [2]. Not sure about Rust, but my anagrammer in Go follows below, likely far from the most elegant:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sort"
	"strings"
)

// normalize lowercases a word and sorts its letters, so that all
// anagrams of the same word end up with the same key.
func normalize(word string) string {
	lcword := strings.ToLower(word)
	letters := []string{}

	for _, cp := range lcword {
		letters = append(letters, string(cp))
	}
	sort.Strings(letters)
	return strings.Join(letters, "")
}

// do_wordmap reads the system dictionary and groups its words by
// their normalized (sorted-letters) key.
func do_wordmap() map[string][]string {
	file, err := os.Open("/usr/share/dict/words")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	wordmap := make(map[string][]string)
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		w := scanner.Text()
		nw := normalize(w)
		wordmap[nw] = append(wordmap[nw], w)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	// A map is already a reference type, so no pointer is needed.
	return wordmap
}

func main() {
	wordmap := do_wordmap()

	syn := bufio.NewReader(os.Stdin)
	for {
		fmt.Print("Enter a word: ")
		myword, err := syn.ReadString('\n')
		if err != nil {
			return // EOF or read error: stop instead of looping forever
		}
		normal_w := normalize(strings.TrimSpace(myword))

		fmt.Println(wordmap[normal_w])
	}
}

It runs like this:

Enter a word: spare
[pares parse pears rapes reaps spare spear]

Distributed, Concurrent and Parallel

To facilitate distributed concurrent parallel processing on a massive scale, the language must include developer friendly primitives. To that end, Go includes for instance:

  • goroutine, for multitasking
  • channel, to communicate between the tasks

The latter were adopted from communicating sequential processes (CSP) first described in a 1978 paper by C. A. R. Hoare.

Here’s a snippet of code that sends a chunk to all servers concurrently and waits until 3 of them have stored a replica (it assumes len(servers) >= 3; with fewer servers the final loop would block forever):

replicas_count := make(chan bool, len(servers))

for _, s := range servers {
    go func(s *StorageServer) {
        s.Put(chunk)
        replicas_count <- true
    }(s)
}

for n := 0; n < 3; n++ {
    <-replicas_count
}

That’s the power of Go.
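For readers who want to run the snippet, here is a minimal self-contained sketch of the same pattern. The StorageServer type, its Put method signature, and the putQuorum helper are hypothetical stand-ins (the original snippet does not show them). Unlike the snippet, this version also counts failed Puts, so it returns an error instead of blocking when too few replicas can be stored.

```go
package main

import (
	"fmt"
	"sync"
)

// StorageServer is a hypothetical in-memory stand-in for a real storage node.
type StorageServer struct {
	mu     sync.Mutex
	chunks [][]byte
}

// Put stores one replica of the chunk on this server.
func (s *StorageServer) Put(chunk []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.chunks = append(s.chunks, chunk)
	return nil
}

// putQuorum sends the chunk to every server concurrently and returns nil as
// soon as `want` servers have acknowledged it, or an error if too many fail.
func putQuorum(servers []*StorageServer, chunk []byte, want int) error {
	if want < 1 || want > len(servers) {
		return fmt.Errorf("cannot store %d replicas on %d servers", want, len(servers))
	}
	// Buffered, so goroutines that finish after we return never block.
	results := make(chan error, len(servers))
	for _, s := range servers {
		go func(s *StorageServer) {
			results <- s.Put(chunk)
		}(s)
	}
	acked := 0
	for i := 0; i < len(servers); i++ {
		if err := <-results; err == nil {
			acked++
			if acked == want {
				return nil
			}
		}
	}
	return fmt.Errorf("only %d of %d replicas stored", acked, want)
}

func main() {
	servers := []*StorageServer{{}, {}, {}, {}, {}}
	if err := putQuorum(servers, []byte("chunk-0"), 3); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("chunk acknowledged by at least 3 servers")
}
```

Note that the remaining servers still receive the chunk in the background; the function only stops waiting for them once the quorum is reached.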

  1. Use “golang” to disambiguate your google searches 
  2. Writing code focuses your mind and untroubles your soul (c) 

Global Namespace for Docker

This text has been posted to the Docker development community; it might be a bit too technical at times.

1. Terms

Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while at the same time supporting unmodified clients. Not to be confused with LXC namespaces (pid, etc.).

2. sha256

Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.

Side note: sha256 covers a space of more than 10 ** 77 unique random digests, which is about as much as the number of atoms in the observable universe. Apart from this unimaginable number sha256 has all the good crypto-qualities including collision resistance, avalanche effect for small changes, pre-image resistance and second pre-image resistance.
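The size of the digest space is easy to check with arbitrary-precision arithmetic. The sketch below (the digestSpace helper is a name made up for this illustration) computes 2^256 and compares it with 10^77:

```go
package main

import (
	"fmt"
	"math/big"
)

// digestSpace returns 2^bits, the number of distinct digests of that width.
func digestSpace(bits uint) *big.Int {
	return new(big.Int).Lsh(big.NewInt(1), bits)
}

func main() {
	space := digestSpace(256)
	tenTo77 := new(big.Int).Exp(big.NewInt(10), big.NewInt(77), nil)

	fmt.Println("2^256 =", space)
	fmt.Println("decimal digits:", len(space.String()))
	fmt.Println("exceeds 10^77:", space.Cmp(tenTo77) > 0)
}
```

2^256 has 78 decimal digits, that is, roughly 1.16 * 10^77.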

The same applies to sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2 to name a few.

Those are the distinct properties that allow us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable “asset”.
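To make the content-addressing idea concrete, here is a minimal in-memory sketch. It is not Docker’s implementation: BlobStore, Digest, Put and Get are hypothetical names chosen for this illustration. Blobs are keyed by the sha256 of their bytes, so storing the same content twice keeps only one copy, and every read can be verified against its key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// BlobStore is a hypothetical content-addressable store:
// each blob is keyed by the sha256 digest of its bytes.
type BlobStore struct {
	blobs map[string][]byte
}

func NewBlobStore() *BlobStore {
	return &BlobStore{blobs: make(map[string][]byte)}
}

// Digest returns the digest of data in "sha256:<hex>" form.
func Digest(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Put stores data under its digest and reports whether the blob was new.
// Identical content is stored only once (deduplication).
func (s *BlobStore) Put(data []byte) (digest string, added bool) {
	digest = Digest(data)
	if _, ok := s.blobs[digest]; ok {
		return digest, false
	}
	s.blobs[digest] = append([]byte(nil), data...) // private copy
	return digest, true
}

// Get returns the blob after re-verifying it against its digest.
func (s *BlobStore) Get(digest string) ([]byte, error) {
	data, ok := s.blobs[digest]
	if !ok {
		return nil, fmt.Errorf("blob %s not found", digest)
	}
	if Digest(data) != digest {
		return nil, fmt.Errorf("blob %s failed verification", digest)
	}
	return data, nil
}

func main() {
	store := NewBlobStore()
	d1, added1 := store.Put([]byte("layer contents"))
	d2, added2 := store.Put([]byte("layer contents")) // same bytes, e.g. seen on another node
	fmt.Println(d1 == d2, added1, added2, len(store.blobs))
}
```

Running it prints `true true false 1`: both Puts yield the same digest, only the first one adds a blob, and the store holds a single copy.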

This simple fact can be used not only to self-validate the images and index them locally via Graph’s in-memory index, but also to support a global container/image namespace and global deduplication. That is:

  • Global Namespace
  • Global Deduplication

for image layers. Hence, this Proposal.

3. Docker Cluster

The rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:

The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs the docker daemon. Each node performs a dual role: it is an NFS server to all other nodes, with the NFS share sitting directly on the node’s local rootfs, and simultaneously an NFS client of every other node, as per the diagram below:


Blue arrows reflect actual NFS mounts.

There are no separate NAS servers: each node, on one hand, shares its docker (layers, images) metadata and, separately, driver-specific data. And vice versa, each node mounts all clustered shares locally, under respective hostnames as shown above.

Note: hyper-convergence

This kind of clustered symmetry, combined with the lack of a physically separate storage backend, is often referred to as storage/compute “hyper-convergence”. But that’s another big story, outside this scope.

Note: runtime mounting

In this initial implementation (link above), all the NFS shares are mounted statically, prior to the daemon’s startup. This can be changed to on-demand mounting, and more.

Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). The patch modifies both; the Driver part is currently done for aufs only.
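As an illustration of what “seeing” remote layers through per-host mounts can look like, here is a small sketch. The directory layout `<nfsRoot>/<hostname>/graph/<layerID>` and the names findLayer and newDemoTree are assumptions made for this example only; the actual patch may organize things differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findLayer looks for a layer directory under every per-host mount point.
// Assumed (hypothetical) layout: <nfsRoot>/<hostname>/graph/<layerID>.
// It returns the hostname that holds the layer and the full path to it.
func findLayer(nfsRoot, layerID string) (host, path string, err error) {
	entries, err := os.ReadDir(nfsRoot) // entries come back sorted by name
	if err != nil {
		return "", "", err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		candidate := filepath.Join(nfsRoot, e.Name(), "graph", layerID)
		if info, statErr := os.Stat(candidate); statErr == nil && info.IsDir() {
			return e.Name(), candidate, nil
		}
	}
	return "", "", fmt.Errorf("layer %s not found under %s", layerID, nfsRoot)
}

// newDemoTree builds a throwaway directory tree that mimics the shares of
// two nodes mounted side by side. Local directories stand in for NFS mounts.
func newDemoTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "cluster-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(root) }
	for _, dir := range []string{
		filepath.Join(root, "node-a", "graph", "111aaa"),
		filepath.Join(root, "node-b", "graph", "abc123"),
	} {
		if err = os.MkdirAll(dir, 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := newDemoTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()

	host, path, err := findLayer(root, "abc123")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("layer abc123 is served by", host, "at", path)
}
```

Because layer IDs are content hashes, it does not matter which node answers first: any copy found under any hostname is the same layer.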

4. Benefits

  • An orchestrator can run a container on an image-less node, without waiting for the image to get pulled
  • Scale-out: by adding a new node to the cluster, we incrementally add CPU, memory and storage capacity for more docker images and containers that, in turn, can use the aggregated resource
  • Deduplication: any image or layer that exists in two or more instances can be, effectively, deduplicated. This may require pause/commit and restart of associated containers; this will require reference-counting (next)


It’s been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea: it makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.

6. What’s Next

The patch works as it is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident, others perhaps less so.

The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would in turn be configurable to use the underlying OS aufs or overlayfs mount/unmount.

This is easy but this, as well as the other points below, requires positive feedback and consensus.

Other immediate steps include:

  • graph.TagStore to tag all layers including remote
  • rootNFS setting via .conf for Graph
  • fix migrate.go accordingly

Once done, next steps could be:

  • on demand mounting and remounting via distributed daemon (likely etcd)
  • node add/delete runtime support – same
  • local cache invalidation upon new-image-pulled, image-deleted, etc. events (“cache” here implies Graph.idIndex, etc.)
  • image/layer reference counting, to correctly handle remote usage vs. ‘docker rmi’ for instance
  • and more
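The reference-counting item above can be sketched as follows. This is a single-process illustration with hypothetical names (RefCounter, Acquire, Release, Remove); in a real cluster the counts would have to live in a shared, consistent store such as the distributed daemon mentioned in the list.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInUse is returned when a layer still has users somewhere in the cluster.
var ErrInUse = errors.New("layer still referenced")

// RefCounter is a hypothetical cluster-wide usage counter for layers.
type RefCounter struct {
	mu   sync.Mutex
	refs map[string]int // layer ID -> number of containers using it
}

func NewRefCounter() *RefCounter {
	return &RefCounter{refs: make(map[string]int)}
}

// Acquire records one more user of the layer (local or remote).
func (r *RefCounter) Acquire(layerID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs[layerID]++
}

// Release drops one user of the layer; extra releases are ignored.
func (r *RefCounter) Release(layerID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[layerID] > 0 {
		r.refs[layerID]--
	}
}

// Remove models a 'docker rmi'-style request: it succeeds only when
// no container anywhere still uses the layer.
func (r *RefCounter) Remove(layerID string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[layerID] > 0 {
		return ErrInUse
	}
	delete(r.refs, layerID)
	return nil
}

func main() {
	rc := NewRefCounter()
	rc.Acquire("abc123") // container started on node A
	rc.Acquire("abc123") // container started on node B

	fmt.Println(rc.Remove("abc123")) // refused: still in use remotely

	rc.Release("abc123")
	rc.Release("abc123")
	fmt.Println(rc.Remove("abc123")) // allowed now
}
```

The first Remove prints the ErrInUse message and the second prints `<nil>`, which is the behaviour needed so that deleting an image on one node cannot pull a layer out from under a container running on another.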

And later:

  • shadow copying of read-only layers, to trade local space for performance
  • and vice versa, removal of duplicated layers (the “dedup”)
  • container inter-node migration
  • container HA failover
  • object storage as the alternative backend for docker images and layers (which are in fact immutable versioned objects, believe it or not).

Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.

7. Instead of Conclusion

In the end the one thing that makes it – all of the above – doable and feasible is the immutable nature of image layers and their unique and global naming via crypto-content-hashes.