Embracing and Abandoning ZFS

If you are not familiar with ZFS, you should be. It is the best local file system ever developed.

It’s also likely to retain that title forever, because improving the local file system is not where the storage industry’s head is at these days. Distributed storage is the only way to scale out to the degree that storage now requires.

My first design for Object Storage at Nexenta sought to leverage ZFS as one implementation of a “Local File System” superclass. It could bring benefits to both our Object Storage cluster and to OpenStack Swift. This was a natural progression with evolutionary design. Nexenta is a ZFS company and we were very familiar with ZFS. We wanted to combine the protection provided by ZFS within one server with distributed data protection. We dubbed this the “2+2” strategy. I presented the idea at the OpenStack Folsom summit.

The key advantage that ZFS enabled was the ability to mix local replication with network replication, as summarized by the following two slides from that presentation:

[Slides: OpenStackCCOW4, OpenStackCCOW5]

The obstacles to this approach proved to be too high. Relying on two network replicas where each machine keeps a high reliability mirror (“2 by 2”) can only achieve the same availability if you have two network ports on each storage server. Without dual network ports the loss of a single network interface leaves you only a single failure from losing access to data.

But dual ports are a concept that virtualization management simply does not want to understand. It wants a single Ethernet link to a single storage node. They either work as a complete unit, or they do not. A “2×2” solution requires tracking nodes that are in “limping” states, such as knowing when a storage node is only reachable by one of its two links. Keeping track of the state of each server as being either “working”, “limping” or “dead” sounds simple enough, but just “working” or “dead” is a lot simpler. There are other conditions that can put a storage device in a “limping” state where it can be read but should not be written to, such as a drive that is starting to fail. But this is the only thing that would require the management plane to add this concept.

Management plane developers hate adding more concepts when 90% of the world is happy without the additional work. So that wasn’t going anywhere.

We also realized that multicast UDP was a much better solution. Rather than battling to get network management improved so that we could go from 2 excess deliveries to 1 excess delivery, we could just use multicast UDP and end up with 0 excess deliveries.

All of these issues were actually minor compared to the challenges of providing high performance object storage using Python.

Basically, it does not work.

Swift advocates will claim otherwise, but they are trying to con you into believing that Object Storage should not be expected to be as high performance as SAN or NAS storage. I don’t buy that line of thinking.

There were several new ideas that we wanted to incorporate in the design, all of which will be covered in later blogs.

  • Totally decentralized control and namespaces.
  • Using multicast communications rather than point-to-point protocols such as TCP/IP.
  • Avoiding the constraints of Consistent Hashing.
  • Truly embracing Key/Value storage, top-to-bottom.

But there was also a lot that we wanted to inherit from ZFS. Building upon ideas sometimes works better than directly reusing the code via modularized or layered architectures. Those eternal ZFS ideas, or at least some of them, are:

  • Copy-on-Write: never overwrite in-use data
  • Never Trust Storage Devices
  • Always be Consistent on Disk
  • Use Transaction Logs to Improve Performance
  • Use Snapshots and Clones
  • Replication Is Important
  • "Rampant Layering Violation"


Copy-on-Write

ZFS never overwrites in-use data. It writes new data, and then references the new data. The new object storage system would end up taking this even further. Each new chunk has its content written exactly once. It can be replicated as needed, but a chunk once written is never updated.

Never Trust Storage Devices

It is not enough to respond to errors when disks report them. You need to validate the data read versus checksums stored elsewhere.

The new object storage system uses cryptographic hashes of each chunk’s payload to identify, locate and validate each chunk.
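As a rough illustration of how one hash can identify, locate and validate a chunk, here is a minimal in-memory sketch. The class name `ChunkStore`, its methods, and the choice of SHA-256 are assumptions made for this example; they are not taken from the actual object storage code.

```python
import hashlib


class ChunkStore:
    """Toy in-memory content-addressed store (hypothetical, illustration only)."""

    def __init__(self):
        self._chunks = {}  # hex digest -> payload bytes

    def put(self, payload: bytes) -> str:
        # Identify: the chunk's name is derived from its content.
        # SHA-256 is an assumption; any cryptographic hash would do here.
        chunk_id = hashlib.sha256(payload).hexdigest()
        # Write once: a chunk that already exists is never updated.
        self._chunks.setdefault(chunk_id, payload)
        return chunk_id

    def get(self, chunk_id: str) -> bytes:
        # Locate: the same hash is the lookup key.
        payload = self._chunks[chunk_id]
        # Validate: never trust the device, re-hash what was read.
        if hashlib.sha256(payload).hexdigest() != chunk_id:
            raise IOError(f"chunk {chunk_id} failed validation")
        return payload
```

Putting the same payload twice yields the same identifier and leaves the stored chunk untouched, which is how copy-on-write and checksum validation fall out of a single mechanism.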

Always Be Consistent on Disk

The easiest way to always be consistent on what is written to persistent storage is to never write any data which a later action can invalidate.

The CCOW (Cloud Copy-on-Write) object storage system does not rely on any information stored about any chunk other than its cryptographic hash identifier.

Use Transaction Logs to Improve Performance

ZFS relies upon transaction logs to maintain its “always consistent on disk” goal. After a sudden reboot, the data on disk is consistent once you apply the pending transactions in the log. Updating the root after every transaction is the only other way to always be consistent on disk, and that would require far too many disk writes.

NexentaEdge uses the same technique to allow eventual update of related data structures after a new Version Manifest is written. It cuts the number of disk writes required before an acknowledgement of a Version Manifest Put transaction nearly in half.
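The general pattern (acknowledge after one log append, update related structures later, replay the log after a restart) can be sketched as follows. This is a toy model under stated assumptions: `LoggedIndex`, its fields, and the plain Python list standing in for a durable log are all hypothetical and do not reflect the real ZFS or NexentaEdge data structures.

```python
class LoggedIndex:
    """Toy model of deferred updates driven by a transaction log."""

    def __init__(self):
        self.log = []     # stand-in for a durable, append-only log
        self.index = {}   # related structure that is updated lazily
        self.applied = 0  # number of log entries already applied

    def put(self, name, manifest_id):
        # Only this append has to be durable before acknowledging.
        self.log.append((name, manifest_id))
        return "ack"

    def apply_pending(self):
        # Run in the background, or at startup after a sudden reboot.
        for name, manifest_id in self.log[self.applied:]:
            self.index[name] = manifest_id
        self.applied = len(self.log)
```

Until `apply_pending()` runs, the index lags behind the log, but nothing is lost: the log holds everything needed to bring the index up to date.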

Use Snapshots and Clones

ZFS creates snapshots simply by not deleting the older versions that copy-on-write leaves behind. It can turn them into clones by simply allowing new version forks to branch from there.

Keeping this with a distributed object system was a challenge, but we came up with a method of truly snapshotting a distributed set of metadata. To push the photo analogy, it is a true snapshot that captures the system state at one instant; it just needs to be developed before you can see the picture. I’ll describe that in a later blog.
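The single-machine idea, where a snapshot is a retained old version and a clone is a new branch from it, can be sketched in a few lines. `VersionedTree` is a hypothetical name for this example, and the sketch says nothing about the distributed method mentioned above.

```python
class VersionedTree:
    """Toy copy-on-write namespace: each write produces a new root."""

    def __init__(self, root=None):
        # Copy the starting mapping so branches never share a root.
        self.root = dict(root or {})

    def write(self, name, value):
        # Copy-on-write: build a new root and leave the old one untouched.
        new_root = dict(self.root)
        new_root[name] = value
        self.root = new_root

    def snapshot(self):
        # A snapshot is just a retained reference to the current root.
        return self.root

    @staticmethod
    def clone(snapshot):
        # A clone is a new writable branch that starts from a snapshot.
        return VersionedTree(snapshot)
```

Because a write never modifies an existing root, holding on to an old root is all a snapshot costs, and a clone only diverges from it as new writes arrive.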

Replication Is Important

How a system replicates data after it is put is not an afterthought. ZFS features snapshot driven replication. That feature is retained by NexentaEdge, just using NexentaEdge snapshots instead.

“Rampant Layering Violation”

Perhaps the most important lesson is an inherited attitude.

ZFS was accused of being a “Rampant Layering Violation”. Jeff Bonwick’s response (https://blogs.oracle.com/bonwick/entry/rampant_layering_violation) was that it was merely more intelligent layering that picked more effective divisions of responsibilities.

NexentaEdge will likely be accused of far worse layering violations.

  • The Replicast protocol considers storage distribution and network utilization at the same time.
  • Since the end user ultimately pays for both we see nothing wrong with optimizing both of them.
  • This is still layered – we only optimize traffic on the storage network. It is physically or logically (VLAN) separated from other networking. Optimizing the storage network for storage is just common sense.

Embracing by Abandoning

NexentaEdge storage servers use key/value storage, whether physical or a software simulation of key/value storage. This is the simplest API for copy-on-write storage to use. While it might have been possible to define a key/value ZVol, it just isn’t worth the effort. There is too little of single machine ZFS left to make it worth building NexentaEdge on top of ZFS.

The ideas of ZFS, however, inspired the entire design effort.