July 2017 – TWENTY FIRST CENTURY STORAGE

Numbers don’t lie. Take any storage stack – local or distributed, eventually consistent or ACID-transactional, highly available or otherwise. Ask an innocent question: how does it perform? The benchmarks – if they are current, valid, and most importantly, published – will tell only a part of the story.

In reality, an infinitesimally small part. Consider the following, very modest, example with comments below:

(*) To get an idea of scope and diversity of the performance tunables, let’s see some popular examples:

Ext2/3/4 man page: http://man7.org/linux/man-pages/man5/ext4.5.html
Sample Ceph config: https://github.com/ceph/ceph/blob/master/src/sample.ceph.conf
MySQL: https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html
HDFS: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

In each case, the number of tunables ranges anywhere from 20 to 40. On top of any or all of the above there is often a storage transport, with its own 10 or 20 client-side and server-side knobs.

(**) The knobs themselves will frequently have continuous ranges. The most popular methods to enumerate continuous ranges include divide-by-a-constant and divide-by-a-power-of-two. If neither of these helps, we can always apply Latin Hypercube Sampling – it’s brutal but still better than citing a single default and accompanying it with a stern warning that any change is at your own risk, etc.
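To make the enumeration methods above a bit more concrete, here is a minimal sketch in Python. The two knobs and their ranges are hypothetical and chosen only for illustration; the sampler is a plain textbook Latin Hypercube (one point per stratum per knob), not anything taken from a particular storage stack.

```python
import random

def latin_hypercube(ranges, n_samples, seed=0):
    """Return n_samples points. Each knob's range is split into n_samples
    equal strata, and every stratum is used exactly once per knob."""
    rng = random.Random(seed)
    columns = []
    for low, high in ranges:
        width = (high - low) / n_samples
        # One uniformly random point inside each stratum of this knob.
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(column)  # decouple the strata of different knobs
        columns.append(column)
    return list(zip(*columns))

def power_of_two_grid(low, high):
    """Divide-by-a-power-of-two enumeration: low, 2*low, 4*low, ... up to high."""
    values = []
    value = low
    while value <= high:
        values.append(value)
        value *= 2
    return values

# Hypothetical knobs: a buffer size in KB and a dimensionless ratio.
knobs = [(4, 1024), (0.0, 1.0)]
samples = latin_hypercube(knobs, n_samples=8)
grid = power_of_two_grid(4, 1024)

for point in samples:
    print(point)
print(grid)
```

Eight sampled configurations cover each knob's range evenly, whereas a full grid over the same two knobs at the same resolution would need 64.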

(***) As for the workloads, on the most basic level they are defined by: synchronous/asynchronous and random/sequential permutations as well as read/write ratios and (application-defined) transfer sizes. They also include specified numbers of worker threads, protocol-specific containers and objects per container, and depth of the container hierarchy (if applicable).

Using those primitives as a starter, and taking into account that read/write ratios are often applied at 25% increments, sequential write is different from sequential rewrite, O_DSYNC is different from NFS fsync – we then combine all this together and come up with estimates. Unsurprisingly, they all will be bigger than the 32 number from the table, by at least a couple orders of magnitude.
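As a rough back-of-the-envelope check of that estimate, the primitives can be multiplied out directly. The sketch below is only illustrative: the 25% read/write increments and the write/rewrite distinction come from the text, while the transfer sizes, thread counts, and sync modes are assumed values. It also over-counts slightly by crossing every dimension with every other.

```python
from itertools import product

io_mode = ["sync", "async"]
access = ["random", "sequential"]
read_pct = [0, 25, 50, 75, 100]                          # 25% increments
write_kind = ["write", "rewrite"]                        # write differs from rewrite
transfer_kb = [4, 8, 16, 32, 64, 128, 256, 512, 1024]    # assumed sizes
threads = [1, 2, 4, 8, 16, 32, 64]                       # assumed worker counts
sync_kind = ["none", "O_DSYNC", "fsync"]                 # assumed set of modes

# Crude upper bound: the full cross product of all dimensions.
workloads = list(product(io_mode, access, read_pct, write_kind,
                         transfer_kb, threads, sync_kind))
print(len(workloads))  # 2 * 2 * 5 * 2 * 9 * 7 * 3 = 7560
```

Even with these modest assumed lists, the count lands more than two orders of magnitude above 32.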

However: this presumably corrected workload number (whatever it is) would still be a far, far cry from full workload characterization – because the latter includes I/O burstiness, spatial and temporal localities, size of the working set, compress-ability and deduplication-ability.

Moreover, each of the resulting workloads must be cross-tested across a massive variety of influential environmental factors: on-disk layouts of allocated blocks/chunks/shards, the presence of snapshots and clones and their numbers, the depth of the metadata hierarchy and its distribution, the raw bit error rate as well as its post-ECC BER component. Many of these factors accumulate over time, thus adding to the condition called (quite literally) – aging.

But there is more.

(****) Constant traffic creates a new reality. If you have lived long enough and seen your share of performance charts, you might have noticed that a 10-minute interval may look strikingly different before and after a couple of hours of continuous workload. This nagging (unconfirmed) observation has ample evidence behind it – the horror stories on the web posted by unsuspecting users, multi-hour testing recommendations from the vendor community, and extensive multi-year studies:

A study titled The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments (documenting in particular “storage performance instability in the field”)
The most recent FAST’17 paper: On the Performance Variation in Modern Storage Stacks
A six-year-long large-scale field study: Flash Reliability in Production: The Expected and the Unexpected

(*****) “One would expect that repeated, carefully controlled experiments might yield nearly identical performance results but we found otherwise,” – write the authors of the FAST’17 paper, correctly identifying the widespread, albeit naive, trust in technological determinism.

But even though every single ‘if’ and ‘for’ statement is ostensibly quite deterministic, there is often no determinism at all when it comes to massively complex systems. Tiny variations in the order of execution, the environment, or the workload produce dramatically different performance results. Anecdotal evidence abounds with examples that create, for instance, small files in a slightly different order, and register a 15-175 times slow-down, etc.

The noise, the variance, the non-reproducibility of the common benchmarks drive the only available inference: the process of measuring storage performance is genuinely stochastic. As such, its results must be accompanied by first and second moments (mean and variance) along with confidence intervals.

It is difficult, however, to have at least 95% confidence when the sample size is below 100. It is, in fact, practically impossible. Which means that the very last number in the table above – the 10 runs – must be considered totally inadequate, much like all the previously discussed numbers.

(As a corollary, a single run is a fluke and an outlier if it performs below expectations. Always a silver lining.)
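For illustration, here is a minimal sketch of how the mean, the standard deviation, and a 95% confidence interval could be reported for a set of benchmark runs. It assumes the normal approximation for the interval, which is only reasonable for larger samples, and the throughput numbers are synthetic.

```python
import math
import random
import statistics

def mean_confidence_interval(samples, confidence=0.95):
    """Return (mean, sample stdev, (low, high)) using the normal approximation."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev / math.sqrt(n)
    return mean, stdev, (mean - half_width, mean + half_width)

# Synthetic "throughput per run" numbers, skewed the way real results often are.
rng = random.Random(42)
runs = [rng.lognormvariate(math.log(100.0), 0.3) for _ in range(200)]

for n in (10, 100, 200):
    mean, stdev, (low, high) = mean_confidence_interval(runs[:n])
    print(f"n={n:3d}  mean={mean:7.2f}  stdev={stdev:6.2f}  95% CI=({low:7.2f}, {high:7.2f})")
```

The half-width of the interval shrinks only as the square root of the number of runs, which is why 10 runs say so little.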

Clustered CDF

Different sources cite different numbers. For instance, the already mentioned FAST’17 study compares three popular local filesystems. According to this research, the total benchmark time ranges anywhere between 10¹⁵ and 10³³ years (per filesystem). Which, incidentally, exceeds the age of the universe by at least 4 orders of magnitude.

(The good news, though, is that, given enough hardware, the time to evaluate the storage stack performance can be mitigated.)

Scale is part of the problem. Suppose we have a server that 99% of the time handles requests with latency <= A. Now compare the two latency CDFs, for a single server (blue) and for 100 identical servers (red):

In a 100-node cluster, for a request that has to wait on every node, the odds of observing latencies greater than A skyrocket to (1 – 0.99¹⁰⁰) = 63.4%. For an industry-grade five nines and a 1000-node cluster the same exercise gives 0.995%. Generally, the so-called tail latency becomes a real issue at scale, even when none of the specific standalone tails is fat, long, heavy or otherwise out of whack. This corroborates the old adage that anything that can possibly go wrong, does, and with ever-growing chances.
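The arithmetic behind these two percentages is easy to reproduce. The sketch below assumes that node latencies are independent and that a request is only as fast as the slowest node it touches; the function name is made up for this example.

```python
def prob_any_slow(per_node_ok, nodes):
    """Probability that at least one of `nodes` independent servers
    exceeds the latency threshold, given each stays under it with
    probability `per_node_ok`."""
    return 1.0 - per_node_ok ** nodes

print(f"{prob_any_slow(0.99, 100):.1%}")       # two nines, 100 nodes
print(f"{prob_any_slow(0.99999, 1000):.3%}")   # five nines, 1000 nodes
```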

Inherence

In light of the above, it should be no wonder that the performance-related discussions typically sound too open-ended at best, ambiguous or even hostile, at worst. Personally, I believe that the only way to cope with the associated unease is to state, and accept, the following:

The performance of a qualified storage stack cannot be known. (By qualified, I mean any stack that stores at least one petabyte in production – which seems like a reasonable threshold today – and that is used for/by mission-critical applications requiring low latency.) The stack’s performance is inherently unknowable.

The word “inherence”, by the way, originates from Empedocles’ idea that the qualities of matter come from the relative proportions of each of the four elements: earth, water, air, and fire. This idea, as we know today, does not describe matter correctly, much like the still prevalent view that a storage system consists of four components: a controller attached to its memory and a target attached to its disk…

COPs

The scale of the cluster, the size of the working set, the number of concurrently-active tiers – all these factors exponentialize the complexity of the software/hardware constructions. Freeze all of the above – and what will remain is (only!) a parameter space of all possible workloads and all valid configurations.

As shown above, the parameter space in question is enormous – infinite, for all intents and purposes. Which is unfortunate, but maybe not the end of the world – if we could devise an analytical model or framework, to compute/estimate the stuff that we can never test. This model would, potentially, include a DAG for each IO request type, with edges reflecting causal and/or precedence relationships between the request’s parent and children (and their children) – at various stages of the IO execution.

It would also include inter-DAG causal and precedence relationships between the concurrent IOs within a context of a single transaction which, depending on the semantic model of the storage protocol, may or may not possess some or all ACID properties. (As well as inter-transactional relationships, etc.)

Further, any given IO DAG will be spanning local (virtual and physical) memory hierarchies, local (virtual and physical) disks, and – in the distributed-storage case – remote servers with their own layers of volatile and persistent caches, memories, journals, and disks.

As such, this DAG would be connecting multiple cross-over points (COPs) where the IO parent and its children belong to different domains: CPU caches vs. RAM, user vs. kernel, virtual vs. physical, fast memory (or disk) vs slow memory (or disk), etc. In a simplified model/framework, every single COP becomes a queue with consumers and producers having different resources and executing at vastly different rates – from nanoseconds (CPU caches, RAM) to milliseconds (TCP, HDD):

While bottlenecks and SPOFs are often in-your-face obvious and even documented, much of the performance trouble is subtle and elusive – sinister if you will. Much of it lies in and around those COPs – and here are some of the (maybe) less obvious reasons:

the number of simultaneously existing COPs is proportional to the (extreme) heterogeneity of the volatile and persistent tiers “multiplied” by the scale/velocity/volume of the concurrent IOs;
without designed-in deterministic mechanisms – for instance, resource reservations in the data path – it is extremely difficult to keep in-check utilizations on both sides of each logical COP;
none of the popular storage protocols utilize resource reservations in the data path (yet).

In addition, there are the usual overheads: queuing overhead, interrupt-handling overhead, polling overhead, data copying overhead, context switching overhead, locking-of-the-shared-resources overhead, etc. All of these overheads “consolidate” in and around the edges of each and every COP.

To conclude this line, I’ll illustrate the importance of keeping utilization in-check. There are many ways to do that. Consider, for example, a queue that “connects” a Markovian producer with a single server – the Pollaczek–Khinchine formula:

Expectedly, at high utilizations the queue length L and, therefore, the waiting time approach infinity. The formula works for an M/G/1 queue – and not for an M/G/k queue (let alone a G/G/k queue). It is also only a single queue connected to a single “server” – and not an entire hypothetical super-multi-queued DAG where the arrivals and service times are non-deterministic and non-Markovian.
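To see how sharply the queue grows, the Pollaczek–Khinchine mean-value formula can be evaluated numerically. The sketch below uses its standard textbook form, L = ρ + (ρ² + λ²·Var(S)) / (2(1 – ρ)), with ρ = λ·E[S]. Apart from L, the symbols and parameter names are this example's own notation, and the service-time figures are made up.

```python
def pk_mean_queue_length(arrival_rate, mean_service, var_service):
    """Mean number of requests in an M/G/1 system (Pollaczek-Khinchine).

    arrival_rate: lambda, Poisson arrivals per unit of time
    mean_service: E[S], mean service time
    var_service:  Var(S), variance of the service time
    """
    rho = arrival_rate * mean_service  # utilization
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho + (rho ** 2 + arrival_rate ** 2 * var_service) / (2 * (1 - rho))

mean_service = 1.0   # hypothetical: one time unit per request
var_service = 1.0    # exponential service time: variance equals mean squared
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = rho / mean_service
    length = pk_mean_queue_length(lam, mean_service, var_service)
    # Little's law: mean time in system W = L / lambda
    print(f"utilization={rho:.2f}  L={length:7.2f}  W={length / lam:7.2f}")
```

With exponential service times the result reduces to the familiar ρ / (1 – ρ), so going from 90% to 99% utilization multiplies the queue length by roughly eleven.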

Combinatorial Explosion

The only way known to humanity to deal with exponential complexity is to decompose things into fairly isolated modules/components, and design/implement – or, better, reuse – them one at a time. Modular programming, SEDA, multi-tier architectures, workflow systems, normalized systems, microservices architecture – all that.

“Let me try to explain to you” – wrote Dijkstra in the essay called On the role of scientific thought – “what to my taste is characteristic for all intelligent thinking. It is, that one is willing to study in depth an aspect of one’s subject matter in isolation for the sake of its own consistency, all the time knowing that one is occupying oneself only with one of the aspects <snip> It is what I sometimes have called the separation of concerns, which, even if not perfectly possible, is yet the only available technique for effective ordering of one’s thoughts”

Today, 43 years later, a logical question to ask would be: what’s modular or pluggable about the existing storage stacks, and how do the best of designs address the combinatorial effects of (environment, workload, configuration) changes multiplied by the passing of time (and therefore, changing = aging)?

Not shockingly, the answer will depend on whom you ask. If you ask Google, for instance, the search results may appear to be limited, in a sense.

And so, here’s my final point. It may sound controversial, at first glance. Outrageous, at the second. But here it is:

Is SoC itself – a good thing? After all, when we separate IO performance from other vital concerns, such as:

data consistency, fault tolerance, data protection and security, usability, maintain-ability and upgrade-ability, features A, B, and C, protocols D, E, and F, APIs X, Y, and Z

when we do all that (separation), don’t we also, inadvertently, de-prioritize some concerns over the others?

And once de-prioritized, doesn’t the concern sort of vanish?

Well, think about it. Perhaps there will be answers, in due time. Until then, what remains for the prospective users (aka prospects) is – walking up and down the marketplace, being slightly dazzled by the variety, and wondering what kind of a package deal they’ll end up having…

Four decades of tangled concerns