# Tag: Distributed Storage

# AIStore: an open system for petascale deep learning

*So tell me, what do you truly desire?*

* Lucifer Morningstar*

## Introduction

AIStore (or AIS) has been in development for more than three years so far and has accumulated a fairly long list of capabilities, all duly noted via release notes on the corresponding GitHub pages. At this stage, AIS meets common expectations to *a storage solution* – its usability, manageability, data protection, scalability and performance.

AIStore is a highly available partition-tolerant distributed system with n-way mirroring, erasure coding, and read-after-write consistency(*). But it is not purely – or not only – a storage system: it’ll shuffle user datasets and run custom extract-transform-load workloads. From its very inception, the idea was to provide an open-source, open-format software stack for AI apps – an ambitious undertaking that required incremental evolution via multiple internal releases and continuous refactoring…

AIS is lightweight, fully reliable storage that can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size. Prerequisites boil down to having a Linux and a disk. Getting started with AIS will take only a few minutes and can be done either by running a prebuilt all-in-one docker image or directly from the source.

AIS provides S3 and native APIs and can be deployed as fast storage (or a fast on-demand cache) in front of the 5 (five) supported backends (whereby AIS itself would be the number 6, respectively):

## Review

The focus on training apps and associated workloads results in a different set of optimization priorities and a different set of inevitable tradeoffs. Unlike most distributed storage systems, AIS does *not* break objects into uniform pieces – blocks, chunks, or fragments – with the corresponding metadata *manifests* stored separately (and with an elaborate datapath to manage all of the above and reassemble distributed pieces within each read request, which is also why the resulting latency is usually bounded by the slowest link in the corresponding flow chart).

Instead, AIS supports **direct** (compute <=> disk) I/O flows with the further focus on (re)sharding datasets **prior** to any workload that **actually** must perform – model training in the first place. The idea was to make a system that can easily run massive-parallel jobs converting small original files and/or existing shards to preferred sharded formats that the apps in question can readily and optimally consume. The result – linear scalability under random-read workloads, as in:

total-throughput = N * single-disk-throughput

where N is the total number of clustered disks.

It is difficult to talk about a storage solution and not say a few words about actual I/O flows. Here’s a high-level READ diagram, which however does not show checksumming, self-healing, versioning, and *scenarios* (that also include operation in the presence of nodes and disks being added or removed):

## Further

You can always start *small.* As long as there’s HTTP connectivity, AIS clusters can see and operate on each other’s datasets. One can, therefore, start with a single all-in-one container or a single (virtual) machine – one cluster, one deployment at a time. The resulting global namespace is easily extensible via Cloud buckets and other supported backends. Paraphrasing the epigraph, the *true desire* is to run on *commodity* Linux servers, perform close-to-data user-defined *transforms,* and, of course, radically simplify training models at *any scale.*

# High performance I/O for large scale deep learning

Abstract – Training deep learning (DL) models on petascale datasets is essential for achieving competitive and state-of-the-art performance in applications such as speech, video analytics, and object recognition. However, existing distributed filesystems were not developed for the access patterns and usability requirements of DL jobs. In this paper, we describe AIStore, a highly scalable, easy-to-deploy […]

# Finding the right storage system

What are the odds that you find the right storage system to run all the workloads that need to be running and to forever sustain all the apps that consume storage? Given the current state of the art, what would be your chances of doing so today?

Apart from being existential, the question also appears to be formalizable. Meaning that it can be approached with a certain degree of scientific rigor and scrupulousness. Of course, I’m talking about the Drake equation.

The Drake equation is a probabilistic argument used to estimate the number of active, communicative extraterrestrial civilizations in the Milky Way galaxy. Since its 1961 publication, and notwithstanding constructive criticism (see, for instance, “The Drake Equation Is Broken; Here’s How To Fix It“) the equation proved to be usable across wide range of application domains.

Perhaps the most popular application of Drake is estimating one’s chances to find true love (variations: girlfriend, boyfriend, soulmate, BFF).

Given a massive body of work on the subject I’ll have to limit citations to maybe just one: the 2008 paper “Why I don’t have a girlfriend” where Peter Backus clearly and succinctly explains **a)** how to apply Drake, and **b)** why at the time of the writing he didn’t have a girlfriend.

Back to storage systems though – equipped with Drake’s methodology, the question of finding the one and only storage system can be factorized (step 1), with the probabilities of each factor estimated (step 2), and fed into the equation (step 3) to generate an estimate. Consider a real-life example, whereby the storage system in question features:

**SDS**: software stack on commodity x86 servers**Price**$0.10/GB or below that covers hardware + software + 5 years of support**Petascale**clustering with 1PB usable today and growing to 100PB later in 2019**Storage media**: hard drives**Linearity**: must scale close to linear with each added node and each added spindle**HA**: must be highly available**Tiering**: unlimited**Data redundancy**: both local (RAID1/5/6/10) and replicated**AI**: built-in accelerations for AI workloads

A quick look at the above reveals the truth: the chance of locating a match is probably low.

Indeed, at the time of this writing, the Wikipedia’s computer storage category includes a total of about 150 companies. Even if we super-generously give each of the 9 factors an average 0.5 probability, the resulting (150 * 1/512) can only be interpreted as: oops, doesn’t exist.

There are numerous ways to confirm the first suspicion. For instance, given the limited dataset of 150, we could run Monte Carlo simulations by randomly selecting storage companies and filling-in samples, e.g.:

Storage company X.Y.Z. |
SDS: No |
Petascale: No |
Price: Not cheap |
Commodity x86: No |
HA: Yes |

Having accumulated enough samples to generate confidence intervals, we then compute the odds, find them depressingly low, and resort to finding some peace in the pure logic of this exercise, a consolation similar to the one Peter Backus found years ago.

We could also take a more conventional route which (conventionally) entails: reviewing datasheets, meeting with vendors, gathering references, reading Top-N IDC and Gartner reports, and then again reviewing the product datasheets. Here, for instance, are the top 13 companies from the IDC Worldwide Object Storage 2018 – we could take this report and painstakingly fill-in a 13-by-9 comparison table with those companies in the rows and the distinguishing characteristics in the columns (I will leave this as an exercise for the reader).

Source: The Register

Most likely, though, if nothing even remotely plausible comes up on the first day or two of screening, referencing, and filling out comparison tables, the chance (of finding the right storage system) must be considered extremely slim. The storage system that you need, seek, and depend upon may* not* exist.

Or, stated differently, the storage system that you *really* need may *really* not exist.

It’ll be hard to come to terms with this statement – the process will usually take anywhere between three to nine months. Eventually, though, the dilemma will crystallize as a simple binary choice: to build or not to build?

This is a tough choice to make. Here’s one possible answer. Why does it work for the 9 (nine) factors listed above and where does it go from there – that will be another story and a different time.

# Neural Networks for Storage Games

Even though machine learning is used extensively and successfully in numerous distinct areas, the idea to apply it to hardcore (fast) datapaths may seem farfetched. Which is also why I’m currently looking at the distributed storage use case where storage nodes employ different I/O load balancing strategies. Some of these nodes run neural networks, others – maybe a more conventional logic. Some of these strategies perform better than others, but only when they compete against their deterministic counterparts. The text below is only scratching the surface, of course:

# Bayesian Optimization for Clusters

What is the difference between storage cluster and multi-armed bandit (MAB)? This is not a trick question – to help answer it, think about this one: what’s common between MAB and Cloud configurations for big data analytics?

Henceforth, I’ll refer to these use cases as cluster, MAB, and Cloud configurations, respectively. The answer’s down below.

In storage clusters, there are storage initiators and storage targets that continuously engage in routine interactions called I/O processing. In this “game of storage” an initiator is typically the one that makes multi-choice load-balancing decisions (e.g., which of the 3 available copies do I read?), with the clear objective to optimize its own performance. Which can be pictured as, often rather choppy, diagram where the Y axis reflects a commonly measured and timed property: I/O latency, I/O throughput, IOPS, utilization (CPU, memory, network, disk), read and write (wait, active) queue sizes, etc.

**Fig. 1.** Choosy Initiator: 1MB chunk latencies (us) in the cluster of 90 targets and 90 initiators.

A couple observations, and I’ll get to the point. The diagram above is (an example of) a real-time time-series, where the next measured value depends on both the *previous *history and the current *action*. In the world of distributed storages the action consists in choosing a given target at a given time for a given (type, size) I/O transaction. In the world of multi-armed bandits, *acting *means selecting an *arm*, and so on. All of this may sound quite obvious, on one hand, and disconnected, on another. But only at the first glance.

In all cases the observed response carries a noise, simply because the underlying system is a bunch of physical machines executing complex logic where a degree of nondeterminism results, in part, from myriad micro-changes on all levels of the respective processing stacks including hardware. No man ever steps in the same river twice (Heraclitus).

Secondly, there always exists a true (*latent,* *objective,* *black-box)* function **Y = f(history, action)** that can potentially be inferred given a good model, enough time, and a bit of luck. Knowing this function would translate as the ability to optimize across the board: I/O performance – for storage cluster, gambling returns – for multi-armed bandits, $/IOPS ratio – for big data clouds. Either way, this would be a good thing, and one compelling reason to look deeper.

Thirdly and finally, a brute-force search for the best possible solution is either too expensive (for MAB, and Cloud configurations), or severely limited in time (which is to say – too expensive) for the cluster. In other words, the solution must converge fast – or be unfeasible.

## The Noise

A wide range of dynamic systems operate in an *environment*, respond to *actions* by an *agent*, and* *generate serialized or timed *output:*

**Fig. 2.** Agent ⇔ Environment diagram.

This output is a function **f(history, action)** of the previous system states – the *history* – and the current action*.* If we could discover, approximate, or at least regress to this function, we could then, presumably, optimize – via a judicious choice of an optimal action at each next iteration…

And so, my first idea was: neural networks! Because, when you’ve got for yourself a new power, you better use it. Go ahead and dump all the cluster-generated time-series into neural networks of various shapes, sizes, and hyperparameters. Sit back and see if any precious latent **f()** shows up on the other end.

Which is exactly what I did, and well – it didn’t.

There are obstacles that are somewhat special and specific to distributed storage. But there’s also a common random *noise* that slows down the *learning* process, undermining its converge-ability and obfuscating the underlying system properties. Which is why the very second idea: get rid of the noise, and see if it helps!

To filter the noise out, we typically compute some sort of a moving average and say that this one is a real *signal*. Well, not so fast…

First off, averaging across time-series is a *lossy transformation*. The information that gets lost includes, for instance, the dispersion of the original series (which may be worse than Poissonian or, on the contrary, better than Binomial). In the world of storage clusters the index of dispersion may roughly correspond to “burstiness” or “spikiness” – an important characteristic that cannot (should not) be simplified out.

Secondly, there’s the technical question of whether we compute the average on an entire stream or just on a sliding window, and also – what would be the window size, and what about other hyperparameters that include moving weights, etc.

Putting all this together, here’s a very basic noise-filtering logic:**Fig. 3.** Noise filtering pseudocode.

This code, in effect, is saying that as long as the time-series fits inside the 3-sigma wide envelope around its moving average, we consider it devoid of noise and therefore we keep it as is. But, if the next data point ventures outside this “envelope”, we do a weighted average (line 3.b.i), giving it a slightly more credence and weight (0.6 vs 0.4). Which will then, at line 3.e, contribute to changing the first and the second moments based either on the entire timed history or on its more recent and separately configurable “window” (as in Fig. 1, which averages over a window of 10).

*As a side: it is the code like this (and its numerous variations – all based on pure common sense and intuition) – that makes one wonder: why? Why are there so many new knobs (aka hyperparameters) per line of code? And how do we go about finding optimal values for those hyperparameters?..*

Be as it may, some of the questions that emerge from Fig. 3 include: do we actually remove the noise? what is the underlying assumption behind the 3 (or whatever is configured) number of sigmas and other tunables? how do we know that an essential information is not getting lost in addition to, or maybe even instead of, “de-jittering” the stream?

## All things being Bayesian

It is always beneficial to see a big picture, even at the risk of totally blowing the locally-observable one out of proportion ©. The proverbial big picture may look as follows:

**Fig. 4.** Bayesian inference.

Going from top to bottom, left to right:

- A noise filtering snippet (Fig. 3) implicitly relies on an underlying assumption that the noise is normally distributed – that it is Gaussian (
*white*) noise. This is likely okay to assume in the cluster and Cloud configurations above, primarily due to size/complexity of the underlying systems and the working force of the Central Limit Theorem (CLT). - On the other hand, the conventional way to filter out the noise is called the Kalman Filter, which, in fact, does not make any assumptions in re Gaussian noise (which is a plus).
- The Kalman filter, in turn, is based on Bayesian inference, and ultimately, on the Bayes’ rule for conditional probabilities.
- Bayesian inference (Fig. 4) is at the core of very-many things, including Bayesian linear regression (that scales!) and iterative learning in Bayesian networks (not shown); together with Gaussian (or normal) distribution Bayesian inference forms a founding generalization for both hidden Markov models (not shown) and Kalman filters.
- On the
*third*hand, Bayesian inference happens to be a sub-discipline of statistical inference (not shown) which includes several non-Bayesian paradigms with their own respective taxonomies of methods and applications (not shown in Fig. 4). - Back to the Gaussian distribution (top right in Fig. 4): what if not only the (CLT-infused) noise but the function
**f(history, action)**itself can be assumed to be probabilistic? What if we could model this function as having a deterministic**mean(Y) = m(history, action)**, with the**Y=f()**values at each data point normally distributed around each of its**m()**means? - In other words, what if the
**Y = f(history, action)**is (sampled from) a Gaussian Process (GP) – the generalization of the Gaussian distribution to a distribution over functions:

- There’s a fast-growing body of research (including already cited MAB and Cloud configurations) that answers Yes to this most critical question.
- But there is more…
- Since the sum of independent Gaussian processes is a (joint) GP as well, and
- since a Gaussian distribution is itself a special case of a GP (with a constant
*K*=**σ**^{2}covariance, where*I*is the identity matrix), and finally,*I*- since white noise is an independent Gaussian

- since a Gaussian distribution is itself a special case of a GP (with a constant

– the noise filtering/smoothing issue (previous section) sort of “disappears” altogether – instead, we simply go through well-documented steps to find the joint GP, and the **f()** as part of the above.

- When a Gaussian process is used in Bayesian inference, the combination yields a prior probability distribution over functions, an iterative step to compute the (better-fitting) posterior, and ultimately – a distinct machine learning technique called Bayesian optimization (BO) – the method widely (if not exclusively) used today to optimize machine learning hyperparameters.
- One of the coolest things about Bayesian optimization is a so-called acquisition function – the method of computing the next most promising sampling point in the process that is referred to as exploration-versus-exploitation.
- The types of acquisition functions include (in publication order): probability of improvement (PI), expected improvement (EI), upper confidence bound (UCB), minimum regret search (MRS), and more. In each case, the acquisition function computes the next optimal (confidence-level wise) sampling point – in the process that is often referred to as exploration-versus-exploitation.
- The use of acquisition functions is an integral part of a broader topic called optimal design of experiments (not shown). The optimal design tracks at least 100 years back, to Kirstine Smith and her 1918 dissertation, and as such does not explain the modern Bayesian renaissance.
- In the machine learning setting, though, the technique comes to the rescue each and every time the number of
*training examples*is severely limited, while the need (or the temptation) to produce at least some interpretable results is very strong. - In that sense, a minor criticism of the Cloud configurations study is that, generally, the performance of real systems under-load is spiky, bursty and heavily-tailed. And it is, therefore, problematic to assume anything-Gaussian about the associated (true) cost function, especially when this assumption is then tested on a small sample out of the entire “population” of possible {configuration, benchmark} permutations.
- Still, must be better than random. How much better? Or rather, how much better versus other possible Bayesian/GP (i.e., having other possible kernel/covariance functions), Bayesian/non-GP, and non-Bayesian machine learning techniques?
- That is the question.

## Only a few remaining obstacles

When the objective is to optimize or (same) approximate the next data point in a time-series, both Bayesian and RNN optimizers perform the following 3 (three) generic meta-steps:

- Given the current state h(t-1) of the model, propose the next query point x(t)
- Process the response y(t)
- Update the internal state h(t+1)

Everything else about these two techniques – is different. The speed, the scale, the amount of engineering. The applicability, after all. Bayesian optimization (BO) has proven to work extremely well for a (constantly growing) number of applications – the list includes the already cited MAB and Cloud configurations, hyperparameter optimization, and more.

However. GP-based BO does not scale: its computational complexity grows as O(n^{3}), where n is the number of sampled data points. This is clearly an impediment – for many apps. As far as storage clustering, the other obstacles include:

- a certain adversarial aspect in the relationships between the clustered “players”;
- a lack of covariance-induced
*similarity*between the data points that – temporally, at least – appear to be very close to each other; - the need to rerun BO from scratch every time there is a
*change*in the*environment*(Fig. 2) – examples including changes in the workload, nodes dropping out and coming in, configuration and software updates and upgrades; - the microsecond latencies that separate load balancing decisions/actions, leaving little time for machine learning.

Fortunately, there’s a bleeding-edge research that strives to combine the power of neural networks with the Bayesian power to squeeze the “envelope of uncertainty” around the true function (Fig. 6) – in just a few iterations:

**Fig. 6.** Courtesy of: Taking the Human Out of the Loop: A Review of Bayesian Optimization.

To be continued…

# A Quick Note on Getting Better at Near Duplicates

Finding near-duplicates on the web is important for several unrelated reasons that include archiving and versioning, plagiarism and copyright, SEO and optimization of storage capacity. It is maybe instructive to re-review the original groundbreaking research in the space – an eleven-year-old highly cited study of 1.6 billion distinct web pages.

What’s interesting about this work is that it takes two relatively faulty heuristics and combines them both to produce a third one that is superior.

Specifically, let A and B be the two heuristics in question, and (p1, p2) – a random pair of web pages that get evaluated for near-duplication (in one of its domain-specific interpretations). Each method, effectively, computes a *similarity* between the two pages – let’s denote this as sim-A(p1, p2) and sim-B(p1, p2), respectively.

Next, based on their own configured tunables A and B would, separately, determine whether p1 and p2 are near-duplicate:

- duplicate-A(p1, p2) = sim-A(p1, p2) > threshold-A
- duplicate-B(p1, p2) = sim-B(p1, p2) > threshold-B

The combined algorithm – let’s call it C – evaluates as follows from left to right:

- duplicate-C(p1, p2) = duplicate-A(p1, p2) && sim-B(p1, p2) > threshold-C

where threshold-C is a yet another constant that gets carefully tailored to minimize both false negatives and false positives (the latter, when a given pair of different web pages is falsely computed as near-duplicate).

Easy to notice that, given enough time to experiment and large enough datasets, this can work for more than two unrelated heuristics that utilize minhash, simhash and various other popular near-duplication detection techniques.

Maybe not so obvious is another observation: given real-life time and space limitations, any near-duplication algorithm will have to remain information-lossy and irreversible. And since the information that is being lost cannot be compared in terms of duplication (or non-duplication), we cannot eliminate the scenarios of non-identical (and maybe even disjoint) subsets of the respective false positives.

Therefore, given algorithms A and B and their respective false positive sets FP-A and FP-B, we should always be able to come up with a combined heuristics C(A, B) that optimizes its own inevitable FP-C as a function of (FP-A, FP-B).

PS. A blog that (at the time of this writing) comes on top when searching for “near-duplicate detection”. Easy and illustrated.

# Snapshotting Scale-out Storage

Presented at Storage Developer Conference (SNIA SDC 2016):

# Scalable Object Storage with Resource Reservations and Dynamic Load Balancing

# Storage Cluster in Thermal Equilibrium (part I)

*Mathematics is the art of giving the same name to different things. *

*Henri Poincare*

## 1. Life’s most persistent question

The question is: how to compute distribution of states of clustered nodes in a large distributed cluster running a steady workload? The cluster is fully distributed, and:None of this is uncommon. For one thing, we live at a time of ever-exploding problem sizes. Bigger problems lead to bigger clusters with nodes that, when failed, get replaced wholesale.

And small stuff. For instance, eventually consistent storage is fairly common knowledge today. It is spreading fast, with clumps of apps (including venerable big data and NoSQL apps) growing and stacking on top.

Those software stacks enjoy minimal centralized processing (and often none at all), which causes CAP-imposed scalability limits to get crushed. This further leads to even bigger clusters, where nodes run complex stateful software, etcetera.

Why complex and stateful, by the way? The most generic sentiment is that the software must constantly evolve, and as it evolves its complexity and stateful-ness increases. Version (n+1) is always more *complex* and more *stateful* than the (n) – and if there are any exceptions that I’m not aware of, they only confirm the rule. Which fully applies to storage, especially due to the fact that this particular software must keep up with the avalanche of hardware changes. There’s a revolution going on. The SSD revolution is still in full swing and raging, with SCM revolution right around the corner. The result is millions of lines of code, and counting. It’s a mess.