Four decades of tangled concerns

Numbers don’t lie. Take any storage stack – local or distributed, eventually consistent or ACID-transactional, highly available or otherwise. Ask an innocent question: how does it perform? The benchmarks – if they are current, valid, and most importantly, published – will tell only a part of the story.

In reality, an infinitesimally small part. Consider the following, very modest, example with comments below:

tab1

(*) To get an idea of scope and diversity of the performance tunables, let’s see some popular examples:

In all cases, the numbers of tunables fluctuate anywhere between 20 and 40. On top of any/all of the above there’d often be a storage transport, with its own 10 or 20 client-and-server side knobs.

(**) The knobs themselves will frequently have continuous ranges. The most popular methods to enumerate continuous ranges include divide-by-a-constant and divide-by-a-power-of-two. If these two wouldn’t help, we then could just go ahead and apply Latin Hypercube Sampling – it’s brutal but still better than citing a single default and accompanying it with a stern warning not to change at your own risk, etc.

(***) As for the workloads, on the most basic level they are defined by: synchronous/asynchronous and random/sequential permutations as well as read/write ratios and (application-defined) transfer sizes. They also include specified numbers of worker threads, protocol-specific containers and objects per container, and depth of the container hierarchy (if applicable).

Using those primitives as a starter, and taking into account that read/write ratios are often applied at 25% increments, sequential write is different from sequential rewrite, O_DSYNC is different from NFS fsync – we then combine all this together and come up with estimates. Unsurprisingly, they all will be bigger than the 32 number from the table, by at least a couple orders of magnitude.

However: this presumably corrected workload number (whatever it is) would still be a far, far cry from full workload characterization – because the latter includes I/O burstiness, spatial and temporal localities, size of the working set, compress-ability and deduplication-ability.

Moreover, each of the resulting workloads must be cross-tested across a massive variety of influential environmental factors: on-disk layouts of allocated blocks/chunks/shards, the presence of snapshots and clones and their numbers, the depth of the metadata hierarchy and its distribution, the raw bit error rate as well as its post-ECC BER component. Many of these factors accumulate over time, thus adding to the condition called (quite literally) – aging.

But there is more.

(****) Constant traffic creates a new reality. If you have lived long enough and seen your share of performance charts, you might have noticed that a 10-minute interval may look strikingly different – before and after a couple hours of continuous workload. This nagging (unconfirmed) observation has an ample evidence – the horror stories on the web posted by unsuspecting users, multi-hour testing recommendations from the vendor community, and extensive multi-year studies:

(*****) “One would expect that repeated, carefully controlled experiments might yield nearly identical performance results but we found otherwise,” – write the authors of the FAST’ 17 paper, correctly identifying the widespread, albeit naive, trust in the technological determinism.

But even though every single ‘if’ and ‘for’ statement is ostensibly quite deterministic, there is often no determinism at all when it comes to massively-complex systems. Tiny variations in the order of execution, the environment, the workload produce dramatically different performance results. Anecdotal evidence abounds with examples that create, for instance, small files in a slightly different order, and register a 15-175 times slow-down, etc.

The noise, the variance, the non-reproducibility of the common benchmarks drives the only available inference: a process of measuring storage performance is genuinely stochastic. As such, it must be accompanied by first and second moments along with confidence intervals.

It is difficult, however, to have at least 95% confidence when a sample size is below 100. It is, in fact, fairly impossible. Which means that the very last number in the table above – the 10 runs – must be considered totally inadequate, much like all the previously discussed numbers.

(As a corollary, a single run is a fluke and an outlier if performed below the expectations. Always a silver lining.)

Clustered CDF

Different sources cite different numbers. For instance, the already mentioned FAST’17 study compares three popular local filesystems. According to this research, the total benchmark time ranges anywhere between 1015 to 1033 years (per filesystem). Which, incidentally, exceeds the age of the universe by at least 4 orders of magnitude.

(The good news, though, is that, given enough hardware, the time to evaluate the storage stack performance can be mitigated.)

Scale is part of the problem. Suppose we have a server that 99% of the time handles requests with latency <= A. Now compare the two latency CDFs, for a single server (blue) and for 100 identical servers (red):

fig1

In a 100-node cluster the odds to observe greater than A latencies skyrocket to (1 – 0.99100) = 63.4%. For an industry-grade five nines and a 1000-node cluster the same exercise gives 0.995%. Generally, the so-called tail latency becomes a real issue at scale, even when none of the specific standalone tails is fat, long, heavy or otherwise out of whack. Thus, corroborating the old adage that anything that can possibly go wrong, does with ever-growing chances.

Inherence

In light of the above, it should be no wonder that the performance-related discussions typically sound too open-ended at best, ambiguous or even hostile, at worst. Personally, I believe that the only way to cope with the associated unease is to state, and accept, the following:

The performance of a qualified storage stack cannot be known. (By qualified, I mean any stack that stores at least one petabyte in production – which seems like a reasonable threshold today – and that is used for/by mission-critical applications requiring low latency.) The stack’s performance is inherently unknowable

The word “inherence”, by the way, originates from the Empedocles’ idea that the qualities of matter come from the relative proportions of each of the four elements: earth, water, air, and fire. This idea, as we know today, does not describe matter correctly, much like the still prevalent view that a storage system consists of four components: a controller attached to its memory and a target attached to its disk…

COPs

The scale of the cluster, the size of the working set, the number of concurrently-active tiers – all these factors exponentialize the complexity of the software/hardware constructions. Freeze all of the above – and what will remain is (only!) a parameter space of all possible workloads and all valid configurations.

As shown above, the parameter space in question is enormous – infinite, for all intents and purposes. Which is unfortunate, but maybe not the end of the world – if we could devise an analytical model or framework, to compute/estimate the stuff that we can never test. This model would, potentially, include a DAG for each IO request type, with edges reflecting causal and/or precedence relationships between the request’s parent and children (and their children) – at various stages of the IO execution.

It would also include inter-DAG causal and precedence relationships between the concurrent IOs within a context of a single transaction which, depending on the semantic model of the storage protocol, may or may not possess some or all ACID properties. (As well as inter-transactional relationships, etc.)

Further, any given IO DAG will be spanning local (virtual and physical) memory hierarchies, local (virtual and physical) disks, and – in the distributed-storage case – remote servers with their own layers of volatile and persistent caches, memories, journals, and disks.

As such, this DAG would be connecting multiple cross-over points (COPs) where the IO parent and its children belong to different domains: CPU caches vs. RAM, user vs. kernel, virtual vs. physical, fast memory (or disk) vs slow memory (or disk), etc. In a simplified model/framework, every single COP becomes a queue with consumers and producers having different resources and executing at vastly different rates – from nanoseconds (CPU caches, RAM) to milliseconds (TCP, HDD):

While bottlenecks and SPOFs are often in-your-face obvious and even documented, much of the performance trouble is subtle and elusive – sinister if you will. Much of it lies in and around those COPs – and here are some of the (maybe) less obvious reasons:

  • the number of simultaneously existing COPs is proportional to the (extreme) heterogeneity of the volatile and persistent tiers “multiplied” by the scale/velocity/volume of the concurrent IOs;
  • without designed-in deterministic mechanisms – for instance, resource reservations in the data path – it is extremely difficult to keep in-check utilizations on both sides of each logical COP;
  • none of the popular storage protocols utilize resource reservations in the data path (yet).

In addition, there are the usual overheads: queuing overhead, interrupt-handling overhead, polling overhead, data copying overhead, context switching overhead, locking-of-the-shared-resources overhead, etc. All the overheads “consolidating” in and around the edges of each and every COP.

To conclude this line, I’ll illustrate the importance of keeping utilization in-check. There are many ways to do that. Consider, for example, a queue that “connects” a Markovian producer with a single server – the Pollaczek–Khinchine formula:

fig2

Expectedly, at high utilizations the queue length L and, therefore, the waiting time approaches infinity. The formula works for an M/G/1 queue – and not for an M/G/k queue (let alone G/G/k queue). It is also only a single queue connected to a single “server” – and not an entire hypothetical super-multi-queued DAG where the arrivals and service times are non-deterministic and non-Markovian.

Combinatorial Explosion

fig3

The only known to humanity way to deal with an exponential complexity is to decompose things down to fairly isolated modules/components, and design/implement – or, better – reuse them one by one, one at a time. Modular programming, SEDA, multi-tier architectures, workflow systems, normalized systems, microservices architecture – all that.

“Let me try to explain to you” – wrote Dijkstra in the essay called On the role of scientific thought “what to my taste is characteristic for all intelligent thinking. It is, that one is willing to study in depth an aspect of one’s subject matter in isolation for the sake of its own consistency, all the time knowing that one is occupying oneself only with one of the aspects <snip> It is what I sometimes have called the separation of concerns, which, even if not perfectly possible, is yet the only available technique for effective ordering of one’s thoughts”

Today, 43 years later, a logical question to ask would be: what’s modular or pluggable about the existing storage stacks, and how do the best of designs address the combinatorial effects of (environment, workload, configuration) changes multiplied by the passing of time (and therefore, changing = aging)?

Not shockingly, the answer will depend on who you ask. If you ask google, for instance, search results may appear to be limited, in a sense.

And so, here’s my final point. It may sound controversial, at first glance. Outrageous, at the second. But here it is:

Is SoC itself – a good thing? After all, when we separate IO performance from other vital concerns, such as:

data consistency, fault tolerance, data protection and security, usability, maintain-ability and upgrade-ability, features A, B, and C, protocols D, E, and F, APIs X, Y, and Z

when we do all that (separation), don’t we also, inadvertently, de-prioritize some concerns over the others?

And once de-prioritized, doesn’t the concern sort of vanish?

Well, think about it. Perhaps there will be answers, in due time. Until then, what remains for the prospective users (aka prospects) is – walking up and down the marketplace, being slightly dazzled by the variety, and wondering what kind of a package deal they’ll end up having…

fig4

Neural Networks for Storage Games

Even though machine learning is used extensively and successfully in numerous distinct areas, the idea to apply it to hardcore (fast) datapaths may seem farfetched. Which is also why I’m currently looking at the distributed storage use case where storage nodes employ different I/O load balancing strategies. Some of these nodes run neural networks, others – maybe a more conventional logic. Some of these strategies perform better than others, but only when they compete against their deterministic counterparts. The text below is only scratching the surface, of course:

Neural Networks for Storage Games (pdf)

Parallel optimization via multiple neural networks

When training a neural network, it is not uncommon to have to run through millions of samples, with each training sample (Xi, Yi) separately obtained by a (separate) evaluation of a system function F that maps ℜn ⇒ ℜ1 and that, when given an input Xi, produces an output Yi.

Therein lies the problem: evaluations are costly. Or – slow, which also means costly and includes a variety of times: time to evaluate the function Y=F(X), time to train the network, time to keep not using the trained network while the system is running, etc.

Let’s say, there’s a system that must be learned and that is already running. Our goal would be to start optimizing early, without waiting for a fully developed, trained-and-trusted model. Would that even be possible?

Furthermore, what if the system is highly dimensional, stateful, non-linear (as far as its multi-dimensional input), and noisy (as far as its observed and measured output). The goal would be to optimize the system’s runtime behavior via controlled actions after having observed only a few, or a few hundred, (Xi, Yi) pairs. The fewer, the better.

The Idea

In short, it’s a gradient ascent via multiple neural networks. The steps are:

  • First, diversify the networks so that each one ends up with its own unique training “trajectory”.
  • Second, train the networks, using (and reusing) the same (X, Y) dataset of training samples.
  • In parallel, exploit each of the networks to execute gradient ascents from the current local maximums of its network “siblings”.

As the term suggests, gradient ascent utilizes gradient vectors to ascend – all the way up to the function’s maximum. Since maximizing F(X) is the same as minimizing -F(X), the only thing that matters is the subject of optimization: the system’s own output versus, for instance, a distance from the corresponding function F, which is a function in its own right, often called a cost or, interchangeably, a loss.

In our case, we’d want both – concurrently. But first and foremost, we want the global max F.

The Logic

The essential logic goes as follows:

The statements 1 through 3 construct the specified number of neural networks (NNs) and their respective “runners” – the entities that asynchronously and concurrently execute NN training cycles. Network “diversity” (at 2) is achieved via hyperparameters and network architectures. In the benchmark (next section) I’ve also used a variety of optimization algorithms (namely Rprop, ADAM, RMSprop), and random zeroing-out of the weight matrices as per the specified sparsity.

At 4, we pretrain the networks on a first portion of the common training pool. At 6, we execute the main loop that consists of 3 nested loops, with the innermost 6.1.1.1 and 6.1.1.2 generating new training samples, incrementing numevals counter, and expanding the pool.

That’s about it. There are risks though, and pitfalls. Since partially trained networks generate gradients that can only be partially “trusted,” the risks involve mispartitioning of the evaluation budget between the pretraining phase at 4 and the main loop at 6. This can be dealt with by gradually increasing (multiplicatively decreasing) the number of gradient “steps,” and monitoring the success.

There’s also a general lack of convexity of the underlying system, manifesting itself in runners getting stuck in their respective local maximums. Imagine a system with 3 local maximums and 30 properly randomized neural networks – what would be the chances of all 30 getting stuck on the left and/or the right sides of the picture:

What would be the chances for k >> m, where k is the number of random networks, m – the number of local maximums?..

The Benchmark

A simplified and reduced Golang implementation of the above can be found on my github. Here, all NN runners execute their respective goroutines using their own fixed-size training “windows” into a common stream of training samples. The (simplified) division of responsibilities is as follows: runners operate on their respective windows, centralized logic rotates the windows counterclockwise when the time is right. In effect, each runner “sees,” and takes advantage of, every single evaluation, in parallel.

A number of synthetic benchmarks is available and widely used to compare global optimization methods. This includes Hartmann-6 featuring multiple local optima in a 6-dimensional unit hypercube. For the numbers of concurrent networks varying between 1 and 30, the resulting picture looks as follows:

The vertical axis on this 3-dimensional chart represents the distance from the global maximum (smaller is better), the horizontal axis – Hartmann-6 calls (from 300 to 2300), the “depth” axis – number of neural networks. The worst result is for the {k = 1} configuration consisting of a single network.

Evolution

There’s an alternative to SGD-based optimization – the so-called Natural Evolution Strategies (NES) family of the algorithms. This one comes with important benefits that allow to, for instance, fork the most “successful” or “promising” network, train it separately for a while, then merge back with its parent – the merger producing a better trained result.

The motivation remains the same: parallel training combined with global collaboration. One reason to collaborate globally boils down to finding a global maximum (as above), or – running an effective and fast multimodal (as in: good enough) optimization. In the NES case, though, collaboration gets an extra “dimension:” reusing the other network’s weights, hyperparameters, and architecture.

To be continued.

On TCP, avoiding congestion, and robbing the banks

The Transmission Control Protocol (TCP) was first proposed in 1974, first standardized in 1981, first implemented in 1984, and published in 1988. This 1988 paper, by Van Jacobson, for the first time described the TCP’s AIMD policy to control congestion. But that was then. Today, the protocol carries the lion’s share of all Internet traffic, which, according to Cisco, keeps snowballing at a healthy 22% CAGR.

TCP, however, is not perfect. “Although there is no immediate crisis, TCP’s AIMD algorithm is showing its limits in a number of areas”, says the responsible IETF group. Notwithstanding the imperfections, the protocol is, in effect, hardwired into all routers and middleboxes produced since 1984, which in part explains its ubiquity and dominance. But that’s not the point.

The point is – congestion. Or rather, congestion avoidance. At best – prevention, at least – control. Starting from its very first AIMD implementation, much of what TCP does is taking care of congestion. The difference between numerous existing TCP flavors (and much more numerous suggested improvements) boils down to variations in a single formula that computes TCP sending rate as a function of loss events and roundtrip times (RTT).

Of course, there is also a slow start and a receiver’s advertised window but that’s, again, beside the point…

The point is that the congested picture always looks the same:Time t1 – everything is cool, t2 – a spike, a storm, a flood, a “Houston, we have a problem”, t3 – back to cool again.

What we always want – for ourselves and for our apps – is this:But instead, it is often like this:Or worse.

Fundamentally, “congestion” (the word) relates to shortage of any shared resource. In networking, it’s just shorthand for a limited resource (a bunch of interconnected “pipes”) shared by multiple “users” (TCP flows).

When a) the resource is limited (it always is!) and when b) resource users are dynamic, numerous, bursty, short-long-lived, and all greedy to grab a share – then we would always have a c) congestion situation. Potentially. And likely.Especially taking into account that TCP flows are mutually-independent, scheduling-wise.

Therefore, from the pure-common-sense perspective, there must be only two ways to not have congestion:

  • better algorithms (that would either do a better job at scheduling TCP flows, or that would support some form of coordinated QoS), or
  • bigger pipes (as in: overprovisioning)

Logically, there seem to be no third alternative. That is, unless…

Unless we consider the following chain of logic. TCP gets congested – why? Because the pipe is limited and shared. But why then the congestion becomes a what’s called an exceptional event? Why can’t we simply ignore it? Because the two communicating endpoints expect the data to be delivered there and acknowledged now.

(Think about this for a second before reading the next sentence.)

The idea would be to find (invent, or reinvent) an app that would break with this there-and-now paradigm. It would, instead, start communicating at time t1:When it probably would – prior to t2 – send and receive some data. And then, more data at around and after t3. Which is fine, as long as the app in question could function without this data altogether and maybe (preferably) take advantage of having any of it, if available.

The behavior that we’d be looking for is called, of course, caching. There are, of course, a ton of networking apps that do cache. There are also a few technical details to flush out before any of it gets anywhere close to being useful (which would be outside the scope of this post).

What’s important, though, is breaking with the there-and-now paradigm, so entrenched that it’s almost hardwired into our brains. Like TCP – into middleboxes.

PS.

Which reminds of a Sutton’s law: when diagnosing, one should first consider the obvious. Why a bank robber robs banks? Because that’s where the money is. Why should you communicate at non-congested times?  Because that’s when you can…

Bayesian Optimization for Clusters

What is the difference between storage cluster and multi-armed bandit (MAB)? This is not a trick question – to help answer it, think about this one: what’s common between MAB and Cloud configurations for big data analytics?

Henceforth, I’ll refer to these use cases as cluster, MAB, and Cloud configurations, respectively. The answer’s down below.

In storage clusters, there are storage initiators and storage targets that continuously engage in routine interactions called I/O processing. In this “game of storage” an initiator is typically the one that makes multi-choice load-balancing decisions (e.g., which of the 3 available copies do I read?), with the clear objective to optimize its own performance. Which can be pictured as, often rather choppy, diagram where the Y axis reflects a commonly measured and timed property: I/O latency, I/O throughput, IOPS, utilization (CPU, memory, network, disk), read and write (wait, active) queue sizes, etc.


Fig. 1. Choosy Initiator: 1MB chunk latencies (us) in the cluster of 90 targets and 90 initiators.

A couple observations, and I’ll get to the point. The diagram above is (an example of) a real-time time-series, where the next measured value depends on both the previous history and the current action. In the world of distributed storages the action consists in choosing a given target at a given time for a given (type, size) I/O transaction. In the world of multi-armed bandits, acting means selecting an arm, and so on. All of this may sound quite obvious, on one hand, and disconnected, on another. But only at the first glance.

In all cases the observed response carries a noise, simply because the underlying system is a bunch of physical machines executing complex logic where a degree of nondeterminism results, in part, from myriad micro-changes on all levels of the respective processing stacks including hardware. No man ever steps in the same river twice (Heraclitus).

Secondly, there always exists a true (latent, objective, black-box) function Y = f(history, action) that can potentially be inferred given a good model, enough time, and a bit of luck. Knowing this function would translate as the ability to optimize across the board: I/O performance – for storage cluster, gambling returns – for multi-armed bandits, $/IOPS ratio – for big data clouds. Either way, this would be a good thing, and one compelling reason to look deeper.

Thirdly and finally, a brute-force search for the best possible solution is either too expensive (for MAB, and Cloud configurations), or severely limited in time (which is to say – too expensive) for the cluster. In other words, the solution must converge fast – or be unfeasible.

The Noise

A wide range of dynamic systems operate in an environment, respond to actions by an agent, and generate serialized or timed output:

Fig. 2. Agent ⇔ Environment diagram.
This output is a function f(history, action) of the previous system states – the history – and the current action. If we could discover, approximate, or at least regress to this function, we could then, presumably, optimize – via a judicious choice of an optimal action at each next iteration…

And so, my first idea was: neural networks! Because, when you’ve got for yourself a new power, you better use it. Go ahead and dump all the cluster-generated time-series into neural networks of various shapes, sizes, and hyperparameters. Sit back and see if any precious latent f() shows up on the other end.

Which is exactly what I did, and well – it didn’t.

There are obstacles that are somewhat special and specific to distributed storage. But there’s also a common random noise that slows down the learning process, undermining its converge-ability and obfuscating the underlying system properties. Which is why the very second idea: get rid of the noise, and see if it helps!

To filter the noise out, we typically compute some sort of a moving average and say that this one is a real signal. Well, not so fast…

First off, averaging across time-series is a lossy transformation. The information that gets lost includes, for instance, the dispersion of the original series (which may be worse than Poissonian or, on the contrary, better than Binomial). In the world of storage clusters the index of dispersion may roughly correspond to “burstiness” or “spikiness” – an important characteristic that cannot (should not) be simplified out.

Secondly, there’s the technical question of whether we compute the average on an entire stream or just on a sliding window, and also – what would be the window size, and what about other hyperparameters that include moving weights, etc.

Putting all this together, here’s a very basic noise-filtering logic:Fig. 3. Noise filtering pseudocode.

This code, in effect, is saying that as long as the time-series fits inside the 3-sigma wide envelope around its moving average, we consider it devoid of noise and therefore we keep it as is. But, if the next data point ventures outside this “envelope”, we do a weighted average (line 3.b.i), giving it a slightly more credence and weight (0.6 vs 0.4). Which will then, at line 3.e, contribute to changing the first and the second moments based either on the entire timed history or on its more recent and separately configurable “window” (as in Fig. 1, which averages over a window of 10).

As a side: it is the code like this (and its numerous variations – all based on pure common sense and intuition) – that makes one wonder: why? Why are there so many new knobs (aka hyperparameters) per line of code? And how do we go about finding optimal values for those hyperparameters?..

Be as it may, some of the questions that emerge from Fig. 3 include: do we actually remove the noise? what is the underlying assumption behind the 3 (or whatever is configured) number of sigmas and other tunables? how do we know that an essential information is not getting lost in addition to, or maybe even instead of, “de-jittering” the stream?

All things being Bayesian

It is always beneficial to see a big picture, even at the risk of totally blowing the locally-observable one out of proportion ©. The proverbial big picture may look as follows:

Fig. 4. Bayesian inference.
Going from top to bottom, left to right:

  1. A noise filtering snippet (Fig. 3) implicitly relies on an underlying assumption that the noise is normally distributed – that it is Gaussian (white) noise. This is likely okay to assume in the cluster and Cloud configurations above, primarily due to size/complexity of the underlying systems and the working force of the Central Limit Theorem (CLT).
  2. On the other hand, the conventional way to filter out the noise is called the Kalman Filter, which, in fact, does not make any assumptions in re Gaussian noise (which is a plus).
  3. The Kalman filter, in turn, is based on Bayesian inference, and ultimately, on the Bayes’ rule for conditional probabilities.
  4. Bayesian inference (Fig. 4) is at the core of very-many things, including Bayesian linear regression (that scales!) and iterative learning in Bayesian networks (not shown); together with Gaussian (or normal) distribution Bayesian inference forms a founding generalization for both hidden Markov models (not shown) and Kalman filters.
  5. On the third hand, Bayesian inference happens to be a sub-discipline of statistical inference (not shown) which includes several non-Bayesian paradigms with their own respective taxonomies of methods and applications (not shown in Fig. 4).
  6. Back to the Gaussian distribution (top right in Fig. 4): what if not only the (CLT-infused) noise but the function f(history, action) itself can be assumed to be probabilistic? What if we could model this function as having a deterministic mean(Y) = m(history, action), with the Y=f() values at each data point normally distributed around each of its m() means?
  7. In other words, what if the Y = f(history, action) is (sampled from) a Gaussian Process (GP) – the generalization of the Gaussian distribution to a distribution over functions:


Fig. 5. Gaussian Process.

  1. There’s a fast-growing body of research (including already cited MAB and Cloud configurations) that answers Yes to this most critical question.
  2. But there is more…
  3. Since the sum of independent Gaussian processes is a (joint) GP as well, and
    • since a Gaussian distribution is itself a special case of a GP (with a constant K=σ2I covariance, where I is the identity matrix), and finally,
      • since white noise is an independent Gaussian

– the noise filtering/smoothing issue (previous section) sort of “disappears” altogether – instead, we simply go through well-documented steps to find the joint GP, and the f() as part of the above.

  1. When a Gaussian process is used in Bayesian inference, the combination yields a prior probability distribution over functions, an iterative step to compute the (better-fitting) posterior, and ultimately – a distinct machine learning technique called Bayesian optimization (BO) – the method widely (if not exclusively) used today to optimize machine learning hyperparameters.
  2. One of the coolest things about Bayesian optimization is a so-called acquisition function – the method of computing the next most promising sampling point in the process that is referred to as exploration-versus-exploitation.
  3. The types of acquisition functions include (in publication order): probability of improvement (PI), expected improvement (EI), upper confidence bound (UCB), minimum regret search (MRS), and more. In each case, the acquisition function computes the next optimal (confidence-level wise) sampling point – in the process that is often referred to as exploration-versus-exploitation.
  4. The use of acquisition functions is an integral part of a broader topic called optimal design of experiments (not shown). The optimal design tracks at least 100 years back, to Kirstine Smith and her 1918 dissertation, and as such does not explain the modern Bayesian renaissance.
  5. In the machine learning setting, though, the technique comes to the rescue each and every time the number of training examples is severely limited, while the need (or the temptation) to produce at least some interpretable results is very strong.
  6. In that sense, a minor criticism of the Cloud configurations study is that, generally, the performance of real systems under-load is spiky, bursty and heavily-tailed. And it is, therefore, problematic to assume anything-Gaussian about the associated (true) cost function, especially when this assumption is then tested on a small sample out of the entire “population” of possible {configuration, benchmark} permutations.
  7. Still, must be better than random. How much better? Or rather, how much better versus other possible Bayesian/GP (i.e., having other possible kernel/covariance functions), Bayesian/non-GP, and non-Bayesian machine learning techniques?
  8. That is the question.

Only a few remaining obstacles

When the objective is to optimize or (same) approximate the next data point in a time-series, both Bayesian and RNN optimizers perform the following 3 (three) generic meta-steps:

  • Given the current state h(t-1) of the model, propose the next query point x(t)
  • Process the response y(t)
  • Update the internal state h(t+1)

Everything else about these two techniques – is different. The speed, the scale, the amount of engineering. The applicability, after all. Bayesian optimization (BO) has proven to work extremely well for a (constantly growing) number of applications – the list includes the already cited MAB and Cloud configurations, hyperparameter optimization, and more.

However. GP-based BO does not scale: its computational complexity grows as O(n3), where n is the number of sampled data points. This is clearly an impediment – for many apps. As far as storage clustering, the other obstacles include:

  • a certain adversarial aspect in the relationships between the clustered “players”;
  • a lack of covariance-induced similarity between the data points that – temporally, at least – appear to be very close to each other;
  • the need to rerun BO from scratch every time there is a change in the environment (Fig. 2) – examples including changes in the workload, nodes dropping out and coming in, configuration and software updates and upgrades;
  • the microsecond latencies that separate load balancing decisions/actions, leaving little time for machine learning.

Fortunately, there’s a bleeding-edge research that strives to combine the power of neural networks with the Bayesian power to squeeze the “envelope of uncertainty” around the true function (Fig. 6) – in just a few iterations:

Fig. 6. Courtesy of: Taking the Human Out of the Loop: A Review of Bayesian Optimization.

To be continued…

A Quick Note on Getting Better at Near Duplicates

Finding near-duplicates on the web is important for several unrelated reasons that include archiving and versioning, plagiarism and copyright, SEO and optimization of storage capacity. It is maybe instructive to re-review the original groundbreaking research in the space – an eleven-year-old highly cited study of 1.6 billion distinct web pages.

What’s interesting about this work is that it takes two relatively faulty heuristics and combines them both to produce a third one that is superior.

Specifically, let A and B be the two heuristics in question, and (p1, p2) – a random pair of web pages that get evaluated for near-duplication (in one of its domain-specific interpretations). Each method, effectively, computes a similarity between the two pages – let’s denote this as sim-A(p1, p2) and sim-B(p1, p2), respectively.

Next, based on their own configured tunables A and B would, separately, determine whether p1 and p2 are near-duplicate:

  • duplicate-A(p1, p2) = sim-A(p1, p2) > threshold-A
  • duplicate-B(p1, p2) = sim-B(p1, p2) > threshold-B

The combined algorithm – let’s call it C – evaluates as follows from left to right:

  • duplicate-C(p1, p2) = duplicate-A(p1, p2) && sim-B(p1, p2) > threshold-C

where threshold-C is a yet another constant that gets carefully tailored to minimize both false negatives and false positives (the latter, when a given pair of different web pages is falsely computed as near-duplicate).

Easy to notice that, given enough time to experiment and large enough datasets, this can work for more than two unrelated heuristics that utilize minhash, simhash and various other popular near-duplication detection techniques.

Maybe not so obvious is another observation: given real-life time and space limitations, any near-duplication algorithm will have to remain information-lossy and irreversible. And since the information that is being lost cannot be compared in terms of duplication (or non-duplication), we cannot eliminate the scenarios of non-identical (and maybe even disjoint) subsets of the respective false positives.

Therefore, given algorithms A and B and their respective false positive sets FP-A and FP-B, we should always be able to come up with a combined heuristics C(A, B) that optimizes its own inevitable FP-C as a function of (FP-A, FP-B).

PS. A blog that (at the time of this writing) comes on top when searching for “near-duplicate detection”. Easy and illustrated.

Learning to Learn by Gradient Descent with Rebalancing

Neural networks, as the name implies, comprise many little neurons. Often, multiple layers of neurons. How many? Quick googling on the “number of layers” or “number of neurons in a layer” leaves one with a strong impression that there are no good answers.

The first impression is right. There is a ton of recipes on the web, with the most popular and often-repeated rules of thumb boiling down to “keep adding layers until you start to overfit” (Hinton) or “until the test error does not improve anymore” (Bengio).

Part I of this post stipulates that selecting the optimal neural network architecture is, or rather, can be a search problem. There are techniques to do massive searches. Training a neural network (NN) can be counted as one such technique, where the search target belongs to the function space defined by both this environment and this NN architecture. The latter includes a certain (and fixed) number of layers and number of neurons per each layer. The question then is, would it be possible to use a neural network to search for the optimal NN architecture? To search for it in the entire NN domain, defined only and exclusively by the given environment?

Ignoratio elenchi

Aggregating multiple neural networks into one (super) architecture comes with a certain number of tradeoffs and risks including the one that is called ignoratio elenchi – missing the point. Indeed, a super net (Figure 1) would likely have its own neurons and layers (including hidden ones), and activation functions. (Even a very cursory acquaintance with neural networks would allow one to draw this conclusion.)

Which means that training this super net would inexorably have an effect of training its own internal “wiring” – instead of, or maybe at the expense of, helping to select the best NN – for instance, one of the 9 shown in Figure 1. And that would be missing the point, big time.

Fig. 1. Super network that combines 9 neural nets to generate 4 (green) outputs

The primary goal remains: not to train super-network per se but rather to use it to search the vast NN domain for an optimal architecture. This text describes one solution to circumvent the aforementioned ignoratio elenchi.

I call it a Weighted Mixer (Figure 2):

Fig. 2. Weighted sum of NN outputs

Essentially, a weighted mixer, or WM, is a weighted sum of the contained neural nets, with a couple important distinctions…

TL;DR – WM can automatically grade multiple network architectures (a link to the white paper and the supporting code follows below):

Animated Weights

One picture that is worth a thousand words. This shows a bunch of NN architectures, with the sizes and numbers of hidden layer ranging from 16 to 56 and from 1 to 3, respectively. The column charts (Figure 8) depict running weights that the WM assigns to each of the 16 outputs of each of the constituent nets that it (the WM) drives through the training:

Fig. 8. Weight updates in-progress

The winner here happens to be the (56, 2) network, with the (48, 2) NN coming close second.  This result, and the fact that, indeed, the 56-neurons-2-layers architecture converges faster and produces better MSE, can be separately confirmed by running each of the 18 NNs in isolation…

Searching Stateful Spaces

Optimizing a nonlinear, multidimensional, stateful system is equivalent to performing a search in the space of the (performance affecting) actions and system states.

Recurrent neural networks (RNN) have proved to be extremely efficient at searching function spaces. But, they come with a baggage.

For a given stateful transformation Y(t) = F(X(t), S(t)), there’s an RNN space – a function space in its own right, inversely defined by the original system function F.

The question then is: how to search for the optimal RNN? The one that would feature the fastest convergence and, simultaneously, minimal cost/loss?

The ability to quickly search RNN domain becomes totally crucial in presence of real-time requirements (e.g., when optimizing performance of a running storage system), where the difference between “the best” and “the rest” is the difference between usable and unusable…

  • PDF (full paper)
  • Keywords: RNN, reinforcement learning, hybrid storage, meta-learning, NFL