Map-free Multi-site Replication

A distributed storage system has to do a great job of replicating content. Building a highly available storage service out of mildly reliable storage devices requires re-replicating content after each failure. Replication is not an exception; it is a continuous process. So it’s important.

The design of the CCOW (Cloud Copy-on-Write) object storage system deals with replication within a local network very well. What it does not deal with anywhere near as well is spreading content over links with very long round-trip times, as between different sites. I needed a feature that would allow customers to federate multiple sites without requiring 40 GbE links connecting the sites.

This is not specific to CCOW object storage. The need to provide efficient storage locally will force any design to partition between “local” updating, which is fairly fast, and “remote” updating, which will be considerably slower. If you don’t recognize this difference you’ll end up with a design that does all updates slowly.

But I needed a solution for CCOW. My first thought on how to replicate content between federated sites was to just layer on top of software bridges.

The problem actually maps fairly well:

  • Each site can be thought of as a switch.
  • The “switch” can deliver frames (chunks) to local destinations, or to links to other switches (sites).
  • A frame (chunk) is never echoed to the same link that it arrived upon.
  • There is no need to provision a site topology; the Spanning Tree Protocol will discover it, and the later spanning tree algorithms do it even better. (A sketch of this mapping follows the list.)
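
To make the analogy concrete, here is a minimal sketch of how the site graph might be modeled. The names (SiteID, InterSiteLink, Site) are hypothetical, chosen for illustration, not CCOW interfaces:

    package multisite

    // SiteID names a federated site, playing the role of a switch identifier.
    type SiteID string

    // InterSiteLink is a long-haul tunnel to a peer site, playing the role of
    // a switch port. Spanning tree may logically de-activate a link to break loops.
    type InterSiteLink struct {
        Peer   SiteID
        Active bool
    }

    // Site plays the role of a switch: it delivers chunks (frames) to local
    // destinations or forwards them over its links to other sites.
    type Site struct {
        ID    SiteID
        Links []InterSiteLink
    }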

If you are not familiar with the Spanning Tree algorithms, I suggest you go read up on them. Spanning Tree made Ethernet networking possible. See https://en.wikipedia.org/wiki/Spanning_Tree_Protocol.

Basically, we start with a set of linked Sites:

[Figure MultiSite1: a set of sites connected by ad hoc inter-site links]

Spanning Tree algorithms effectively prevent loops by logically de-activating some of the links.

[Figure MultiSite2: the same sites with some links logically de-activated by spanning tree]
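
To show the effect rather than the protocol itself, here is a small sketch of what spanning tree accomplishes: given an arbitrarily linked set of sites, it keeps only the links that form a loop-free tree. This is just a breadth-first walk from an arbitrary root; the real protocol elects the root and exchanges BPDUs, which the sketch deliberately skips.

    package multisite

    // ActiveLinks picks, for each site, the neighbors it may forward to once
    // loop-creating links have been logically de-activated. "graph" maps each
    // site name to all of its physically linked neighbors.
    func ActiveLinks(graph map[string][]string, root string) map[string][]string {
        active := make(map[string][]string)
        visited := map[string]bool{root: true}
        queue := []string{root}

        for len(queue) > 0 {
            site := queue[0]
            queue = queue[1:]
            for _, peer := range graph[site] {
                if visited[peer] {
                    continue // this link would close a loop; leave it inactive
                }
                visited[peer] = true
                // keep the link active in both directions
                active[site] = append(active[site], peer)
                active[peer] = append(active[peer], site)
                queue = append(queue, peer)
            }
        }
        return active
    }

For a fully meshed triangle of sites U, V and W rooted at U, only the U-V and U-W links stay active; the V-W link is logically de-activated, so no frame can circle forever.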

So re-using existing software (software switches and existing long-haul tunnels) was a very tempting solution, especially since it directly re-used networking layer code to solve storage layer problems. Pontificating that networking solutions are applicable to storage is one thing; actually leveraging network layer code would prove it.

This leads to the following distribution pattern, in this example for a frame originating from Site V.

[Figure MultiSite3: the spanning-tree distribution pattern for a frame originating from Site V]

But it turns out that the storage layer problem is fundamentally simpler.

Radia Perlman’s spanning tree prevents loops by discovering the links between the switches and, without any central decision making, de-activating the subset of the links that would have enabled forwarding loops. It does this no matter how the switches are linked. This is needed because Ethernet frames do not have a time-to-live marker. Without spanning tree it would be very easy for Switch A to forward a frame to Switch B, which would forward it to Switch C, which would forward it back to Switch A. While rare, such loops would crash any Ethernet network. Spanning tree was a simple algorithm that prevented that totally.

The fundamental assumption behind spanning tree was that links had to be de-activated because switches could not possibly remember what frames they had previously forwarded. Recognizing that a frame was being looped around would require remembering frames that had been forwarded eons previously, possibly even whole milliseconds.

If you understand how switches work, you know the last thing any switch wants to do is remember anything. Remembering things takes RAM, but more critically it takes time to update that RAM. RAM may get cheaper, but remembering things always takes longer than not remembering them, and switches will kill to trim a microsecond from their relay time.

But remembering things is exactly what a storage cluster is designed to do.

It turns out that the forwarding algorithm for multi-site storage of fingerprinted chunks is stunningly simple:

On receipt of a new Chunk X on an inter-site link:

    If Chunk X is already known to this cluster:
        Do nothing.
    Otherwise:
        Replicate Chunk X on all other inter-site links.
        Remember Chunk X locally.

That’s it. Having a cryptographic hash fingerprint of each chunk makes it easy to identify already-seen chunks. This allows multiple sites to forward chunks over ad hoc links without having to build a site-wide map, and all links can be utilized at all times.
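
Here is a minimal sketch of that rule, assuming hypothetical names (Fingerprint, Chunk, Link, Cluster) rather than actual CCOW interfaces; the fingerprint table is the only state the rule requires:

    package multisite

    import "sync"

    // Fingerprint is the cryptographic hash of a chunk's payload, e.g. SHA-256.
    type Fingerprint [32]byte

    // Chunk is a fingerprinted unit of replicated content.
    type Chunk struct {
        ID      Fingerprint
        Payload []byte
    }

    // Link is an inter-site tunnel this cluster can replicate a chunk over.
    type Link interface {
        Send(c Chunk)
    }

    // Cluster remembers which chunks it already holds, which is exactly the
    // thing a storage cluster is designed to do.
    type Cluster struct {
        mu    sync.Mutex
        known map[Fingerprint]bool
        links []Link
    }

    func NewCluster(links []Link) *Cluster {
        return &Cluster{known: make(map[Fingerprint]bool), links: links}
    }

    // OnReceive applies the forwarding rule: if the chunk is already known, do
    // nothing; otherwise remember it locally and replicate it on every
    // inter-site link except the one it arrived on.
    func (c *Cluster) OnReceive(chunk Chunk, arrivedOn Link) {
        c.mu.Lock()
        if c.known[chunk.ID] {
            c.mu.Unlock()
            return // already seen: never re-forward, so no loops can form
        }
        c.known[chunk.ID] = true // "remember Chunk X locally" (payload storage omitted)
        links := c.links
        c.mu.Unlock()

        for _, l := range links {
            if l != arrivedOn {
                l.Send(chunk)
            }
        }
    }

Because the check is keyed on the content fingerprint, a chunk that arrives a second time over a different link is dropped rather than echoed onward, which is what makes the ad hoc links loop-safe without any spanning tree.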

Now the forwarding pattern uses every link, and there is no delay to build a tree:

[Figure MultiSite4: the forwarding pattern with all inter-site links in use]