Even though machine learning is used extensively and successfully in numerous distinct areas, the idea to apply it to hardcore (fast) datapaths may seem farfetched. Which is also why I’m currently looking at the distributed storage use case where storage nodes employ different I/O load balancing strategies. Some of these nodes run neural networks, others – maybe a more conventional logic. Some of these strategies perform better than others, but only when they compete against their deterministic counterparts. The text below is only scratching the surface, of course:

# Category: Load balancing

# Bayesian Optimization for Clusters

What is the difference between storage cluster and multi-armed bandit (MAB)? This is not a trick question – to help answer it, think about this one: what’s common between MAB and Cloud configurations for big data analytics?

Henceforth, I’ll refer to these use cases as cluster, MAB, and Cloud configurations, respectively. The answer’s down below.

In storage clusters, there are storage initiators and storage targets that continuously engage in routine interactions called I/O processing. In this “game of storage” an initiator is typically the one that makes multi-choice load-balancing decisions (e.g., which of the 3 available copies do I read?), with the clear objective to optimize its own performance. Which can be pictured as, often rather choppy, diagram where the Y axis reflects a commonly measured and timed property: I/O latency, I/O throughput, IOPS, utilization (CPU, memory, network, disk), read and write (wait, active) queue sizes, etc.

**Fig. 1.** Choosy Initiator: 1MB chunk latencies (us) in the cluster of 90 targets and 90 initiators.

A couple observations, and I’ll get to the point. The diagram above is (an example of) a real-time time-series, where the next measured value depends on both the *previous *history and the current *action*. In the world of distributed storages the action consists in choosing a given target at a given time for a given (type, size) I/O transaction. In the world of multi-armed bandits, *acting *means selecting an *arm*, and so on. All of this may sound quite obvious, on one hand, and disconnected, on another. But only at the first glance.

In all cases the observed response carries a noise, simply because the underlying system is a bunch of physical machines executing complex logic where a degree of nondeterminism results, in part, from myriad micro-changes on all levels of the respective processing stacks including hardware. No man ever steps in the same river twice (Heraclitus).

Secondly, there always exists a true (*latent,* *objective,* *black-box)* function **Y = f(history, action)** that can potentially be inferred given a good model, enough time, and a bit of luck. Knowing this function would translate as the ability to optimize across the board: I/O performance – for storage cluster, gambling returns – for multi-armed bandits, $/IOPS ratio – for big data clouds. Either way, this would be a good thing, and one compelling reason to look deeper.

Thirdly and finally, a brute-force search for the best possible solution is either too expensive (for MAB, and Cloud configurations), or severely limited in time (which is to say – too expensive) for the cluster. In other words, the solution must converge fast – or be unfeasible.

## The Noise

A wide range of dynamic systems operate in an *environment*, respond to *actions* by an *agent*, and* *generate serialized or timed *output:*

**Fig. 2.** Agent ⇔ Environment diagram.

This output is a function **f(history, action)** of the previous system states – the *history* – and the current action*.* If we could discover, approximate, or at least regress to this function, we could then, presumably, optimize – via a judicious choice of an optimal action at each next iteration…

And so, my first idea was: neural networks! Because, when you’ve got for yourself a new power, you better use it. Go ahead and dump all the cluster-generated time-series into neural networks of various shapes, sizes, and hyperparameters. Sit back and see if any precious latent **f()** shows up on the other end.

Which is exactly what I did, and well – it didn’t.

There are obstacles that are somewhat special and specific to distributed storage. But there’s also a common random *noise* that slows down the *learning* process, undermining its converge-ability and obfuscating the underlying system properties. Which is why the very second idea: get rid of the noise, and see if it helps!

To filter the noise out, we typically compute some sort of a moving average and say that this one is a real *signal*. Well, not so fast…

First off, averaging across time-series is a *lossy transformation*. The information that gets lost includes, for instance, the dispersion of the original series (which may be worse than Poissonian or, on the contrary, better than Binomial). In the world of storage clusters the index of dispersion may roughly correspond to “burstiness” or “spikiness” – an important characteristic that cannot (should not) be simplified out.

Secondly, there’s the technical question of whether we compute the average on an entire stream or just on a sliding window, and also – what would be the window size, and what about other hyperparameters that include moving weights, etc.

Putting all this together, here’s a very basic noise-filtering logic:**Fig. 3.** Noise filtering pseudocode.

This code, in effect, is saying that as long as the time-series fits inside the 3-sigma wide envelope around its moving average, we consider it devoid of noise and therefore we keep it as is. But, if the next data point ventures outside this “envelope”, we do a weighted average (line 3.b.i), giving it a slightly more credence and weight (0.6 vs 0.4). Which will then, at line 3.e, contribute to changing the first and the second moments based either on the entire timed history or on its more recent and separately configurable “window” (as in Fig. 1, which averages over a window of 10).

*As a side: it is the code like this (and its numerous variations – all based on pure common sense and intuition) – that makes one wonder: why? Why are there so many new knobs (aka hyperparameters) per line of code? And how do we go about finding optimal values for those hyperparameters?..*

Be as it may, some of the questions that emerge from Fig. 3 include: do we actually remove the noise? what is the underlying assumption behind the 3 (or whatever is configured) number of sigmas and other tunables? how do we know that an essential information is not getting lost in addition to, or maybe even instead of, “de-jittering” the stream?

## All things being Bayesian

It is always beneficial to see a big picture, even at the risk of totally blowing the locally-observable one out of proportion ©. The proverbial big picture may look as follows:

**Fig. 4.** Bayesian inference.

Going from top to bottom, left to right:

- A noise filtering snippet (Fig. 3) implicitly relies on an underlying assumption that the noise is normally distributed – that it is Gaussian (
*white*) noise. This is likely okay to assume in the cluster and Cloud configurations above, primarily due to size/complexity of the underlying systems and the working force of the Central Limit Theorem (CLT). - On the other hand, the conventional way to filter out the noise is called the Kalman Filter, which, in fact, does not make any assumptions in re Gaussian noise (which is a plus).
- The Kalman filter, in turn, is based on Bayesian inference, and ultimately, on the Bayes’ rule for conditional probabilities.
- Bayesian inference (Fig. 4) is at the core of very-many things, including Bayesian linear regression (that scales!) and iterative learning in Bayesian networks (not shown); together with Gaussian (or normal) distribution Bayesian inference forms a founding generalization for both hidden Markov models (not shown) and Kalman filters.
- On the
*third*hand, Bayesian inference happens to be a sub-discipline of statistical inference (not shown) which includes several non-Bayesian paradigms with their own respective taxonomies of methods and applications (not shown in Fig. 4). - Back to the Gaussian distribution (top right in Fig. 4): what if not only the (CLT-infused) noise but the function
**f(history, action)**itself can be assumed to be probabilistic? What if we could model this function as having a deterministic**mean(Y) = m(history, action)**, with the**Y=f()**values at each data point normally distributed around each of its**m()**means? - In other words, what if the
**Y = f(history, action)**is (sampled from) a Gaussian Process (GP) – the generalization of the Gaussian distribution to a distribution over functions:

- There’s a fast-growing body of research (including already cited MAB and Cloud configurations) that answers Yes to this most critical question.
- But there is more…
- Since the sum of independent Gaussian processes is a (joint) GP as well, and
- since a Gaussian distribution is itself a special case of a GP (with a constant
*K*=**σ**^{2}covariance, where*I*is the identity matrix), and finally,*I*- since white noise is an independent Gaussian

- since a Gaussian distribution is itself a special case of a GP (with a constant

– the noise filtering/smoothing issue (previous section) sort of “disappears” altogether – instead, we simply go through well-documented steps to find the joint GP, and the **f()** as part of the above.

- When a Gaussian process is used in Bayesian inference, the combination yields a prior probability distribution over functions, an iterative step to compute the (better-fitting) posterior, and ultimately – a distinct machine learning technique called Bayesian optimization (BO) – the method widely (if not exclusively) used today to optimize machine learning hyperparameters.
- One of the coolest things about Bayesian optimization is a so-called acquisition function – the method of computing the next most promising sampling point in the process that is referred to as exploration-versus-exploitation.
- The types of acquisition functions include (in publication order): probability of improvement (PI), expected improvement (EI), upper confidence bound (UCB), minimum regret search (MRS), and more. In each case, the acquisition function computes the next optimal (confidence-level wise) sampling point – in the process that is often referred to as exploration-versus-exploitation.
- The use of acquisition functions is an integral part of a broader topic called optimal design of experiments (not shown). The optimal design tracks at least 100 years back, to Kirstine Smith and her 1918 dissertation, and as such does not explain the modern Bayesian renaissance.
- In the machine learning setting, though, the technique comes to the rescue each and every time the number of
*training examples*is severely limited, while the need (or the temptation) to produce at least some interpretable results is very strong. - In that sense, a minor criticism of the Cloud configurations study is that, generally, the performance of real systems under-load is spiky, bursty and heavily-tailed. And it is, therefore, problematic to assume anything-Gaussian about the associated (true) cost function, especially when this assumption is then tested on a small sample out of the entire “population” of possible {configuration, benchmark} permutations.
- Still, must be better than random. How much better? Or rather, how much better versus other possible Bayesian/GP (i.e., having other possible kernel/covariance functions), Bayesian/non-GP, and non-Bayesian machine learning techniques?
- That is the question.

## Only a few remaining obstacles

When the objective is to optimize or (same) approximate the next data point in a time-series, both Bayesian and RNN optimizers perform the following 3 (three) generic meta-steps:

- Given the current state h(t-1) of the model, propose the next query point x(t)
- Process the response y(t)
- Update the internal state h(t+1)

Everything else about these two techniques – is different. The speed, the scale, the amount of engineering. The applicability, after all. Bayesian optimization (BO) has proven to work extremely well for a (constantly growing) number of applications – the list includes the already cited MAB and Cloud configurations, hyperparameter optimization, and more.

However. GP-based BO does not scale: its computational complexity grows as O(n^{3}), where n is the number of sampled data points. This is clearly an impediment – for many apps. As far as storage clustering, the other obstacles include:

- a certain adversarial aspect in the relationships between the clustered “players”;
- a lack of covariance-induced
*similarity*between the data points that – temporally, at least – appear to be very close to each other; - the need to rerun BO from scratch every time there is a
*change*in the*environment*(Fig. 2) – examples including changes in the workload, nodes dropping out and coming in, configuration and software updates and upgrades; - the microsecond latencies that separate load balancing decisions/actions, leaving little time for machine learning.

Fortunately, there’s a bleeding-edge research that strives to combine the power of neural networks with the Bayesian power to squeeze the “envelope of uncertainty” around the true function (Fig. 6) – in just a few iterations:

**Fig. 6.** Courtesy of: Taking the Human Out of the Loop: A Review of Bayesian Optimization.

To be continued…

# The Better Protocol (part II)

## The Proxy’s conundrum

In the end, group proxy must deliver:Requirements #2 and #3 are limiting, while #1 and (compound) #4 are *optimizing*, which also means that the proxy’s problem belongs to the vast field of multi-criteria decision making – aka vector optimization. This is a loosely named branch of computing where the domain-specific *decision maker* (DM) must be continuously taking multiple dynamic factors into account.

Most of the time the DM’s *decisions* will not maximize all factors/objectives simultaneously (see for instance Pareto optimality), which is especially true when the number of state variables is very large and the resulting behavior, even though deterministic at each of its state transitions, appears to be rather stochastic if not probabilistic.

Speaking of load balancing groups of servers, at any point in time the state of the group is described by a fixed number of variables that record and reflect the combination of network and disk resources already committed and scheduled by previous I/O requests. The following is generally true for all distributed storages where nodes have their own local (and locally owned) resources: each new I/O is effectively a “bump” in the utilization of the corresponding storage and network resource. Resource freeing process, on the other hand, is continuous and *constant-speed* – the resource-specific bandwidth or throughput.

The beauty of modeling, however, lies in the eye of beholder and in the immediacy of the results. Here’s a piece of model-generated log where the target/proxy T#46 makes a load-balancing selection for a newly arrived (highlighted) chunk:This log tells a story. At the 16.9..ms timestamp all targets in this group have pending requests referencing previously submitted/requested chunks. The model selects 3 targets denoted by ‘=>’ on the left, giving preference to the T#50. From the log we see that target T#52 was not selected even though it could ostensibly start writing (‘startW=’) the new chunk a bit earlier.

Runtime considerations of this sort where optimal write latency is weighed against other factors in play – is what I call the Proxy’s conundrum. Tradeoffs include the need to optimize and balance the reads, on one hand, while maintaining uniform distribution and balanced utilization, on another.

## Group Tetris

OK, so each new I/O is effectively a “bump”. Figure 6 helps to visualize it with a group of six (horizontal axis), and the corresponding per-server pending data chunks (vertical axis). One unit on a vertical scale is the time to deliver one fixed-size chunk from initiator to target, or vise versa.Figure 6. Write request targeting a group of 6 storage servers

Proxy’s choice in this case indicates targets 1, 3, and 6, with a horizontal dashed line at the point in the timeline when the very first bit of the new chunk can cross respective local network interfaces.

Now for a more complex (and realistic) visualization, here is a plotted benchmark where the proxied model is subjected to random 128K size, 50/50 read/write workload, and where frontend/backend of the cluster consists of 90 storage initiators and 90 targets, respectively. Each vertical step in the stacked bar chart captures newly received bytes and is exactly 100μs interval, with colored bars indicating the same exact moment in running time.Figure 7. Cumulative received bytes (vertical axis) measured in 100μs increments

One immediate observation: even though during a measured interval a given server receives almost nothing or nothing at all, overall as time progresses the 9 group members in this case find themselves more or less on the same cumulative (received-bytes) level. It appears that the model manages to avoid falling into the Pareto “trap”, whereby servers that did get utilized keep getting utilized even further.

It takes a handful of runs such as the one above, to start seeing a pattern. This pattern just happens to be reminiscent of a game of Tetris, where writes are blue, reads are green, and future is wide-open and uncharted:

## About the Tail

In part I of this series I talked about fat-tailed latency distributions: why do we often see them in the distributed post-Dynamo world, and how to make them go away.

The chart:Figure 9. Write latencies (in microseconds)

shows proxied (blue) and proxy-less (orange) sorted write latencies under the same 128K, 50/50 workload, as well as 95th percentiles as horizontal dotted lines of respective coloring.

Not to jump to conclusions, but a (tentative) observation that can be drawn from this limited experiment (and a few others that I don’t show) is that load balancing a group of clustered servers makes things **overall better**. At the very least it makes the proverbial tail lighter and more subdued (so to speak). Not clear what’ll happen if the model runs through couple million chunks instead of 3,500 (above). But all in due time.

## Optimizing the Future

Ultimately, the crux of the difference between proxy-less distributed system and its proxied counterpart is – information: load balancing proxy has just more of it as far as its group of servers is concerned, more than any given storage initiator at any time.

What’s next and where do we go from here? The following is a generalized 2-step sequence the proxy can run to optimize for the stated multi-objectives:

- identify R+K targets capable to execute new request with the best latency, where R is the required redundancy, K>0 is configurable, and R+K <= size of the load balancing group;
- apply the
**uniformity and balance**criteria to each R-element subset of the R+K set.

Most intriguing aspect for me would be to see if the past I/Os can be used to optimize future ones. Or rather, not ‘if’ and not ‘how’ but ‘whether’ – whether it’d make a better than single-digit percentage difference in the performance.

Hence, the idea. The setup is a very large distributed cluster under a given stable workload. To compute the next step, load-balancing proxy utilizes a *few* previous I/Os, where *a few* is greater equal 2 and less equal, say, 16.

The proxy uses past I/Os to generate same number of *future *I/Os while preserving respective sizes, read/write ratios and relative arrival frequencies. Next, the proxy executes what is normally called a **dry run**.

Once all of the above is set and done, the proxy then selects those R targets out of (R+K) that provide for the optimal extrapolated future.

This of course assumes that the immediate past – the past defined in terms of I/O sizes, read/write ratios and relative arrival frequencies – has a good instructional value as far as the future. It usually does.

## Post Scriptum

A zillion burning questions will have to remain out of scope. How will the proxied model perform for unicast datapath? What’s the performance delta between the with and without-proxy protocols which otherwise must be totally apples-to-apples comparable? What are the upper and lower bounds of this delta, and how would it depend on average chunk/block sizes, read/write ratios, sizes of the load balancing group, ratios between the network and disk bandwidths?..

One thing is clear to me, though. Thinking that initiator-driven hashed distribution means uniform and balanced distribution – this thinking is wrong (it is also naïve, irresponsible and borderlines on tastelessness, but that’s just me going off the record). This text tries to make a step. Stuff can be modeled, certain ideas tested – quickly..