Lifecycle management

There’s a set of topics in system management that can often be found under alternative subtitles:
“graceful termination and cleanup”, “shutting down and restarting”, “adding/removing members”, “joining and leaving cluster”, and similar.

Any discussion along those lines typically involves state transitions, so let’s go ahead and name the states (and transitions).

To put things in perspective, this picture is about a node (not shown) in an aistore cluster (not shown). Tracking it from the top downwards, first notice a state called “maintenance mode”. “Maintenance mode” constitutes maybe the most gentle, if you will, way of removing a node from the operating cluster.

When in maintenance, the node stops keep-alive heartbeats but remains in the cluster map and remains connected. That is, unless you disconnect or shut it down manually (which would be perfectly fine and expected).

Next is “shutdown”. Graceful shutdown can also be achieved in a single shot, as indicated by one of those curly arrows in the picture’s left: “online” => “shutdown”.

When in “shutdown”, a node can easily get back and rejoin the cluster at any later time. It’ll take two steps, not one – see the blue arrows on the picture’s right where RESTART must be understood as a deployment-specific operation (e.g., kubectl run).

Both “maintenance” and “shutdown” involve a certain intra-cluster operation called “global rebalance” (aka “rebalance”).

But before we talk about it in any greater detail, let’s finish with the node’s lifecycle. The third and final special state is “decommission”. Loosely synonymous with cleanup (a very thorough cleanup, as it were), “decommission” entails:

  • migrating all user data the node is storing to other nodes that are currently “online” – the step that’s followed by:
  • partial or complete cleanup of the node in question, whereby the complete cleanup further entails:
  • removing all AIS metadata, all configuration files, and – last but not least – user data in its entirety.

Needless to say, there’s no way back out of “decommission” – the proverbial point of no return. To rejoin the cluster, a decommissioned node will have to be redeployed from scratch, but then it would be a totally different node, of course…


There’s one question that absolutely cannot wait: how to “terminate” or “cleanup” a cluster? Here’s how:

$ ais cluster decommission --rm-user-data --yes

The above command will destroy an existing cluster – completely and utterly, no questions asked. It can be conveniently used in testing/benchmarking situations or in any sort of non-production environment – see --help for details. It also executes very fast – Ctrl-C’s unlikely to help in case of change-of-mind…


Full Disclosure: all lifecycle management commands and all associated APIs require administrative privileges. There are, essentially, three ways:

  • deploy the cluster with authentication disabled:
$ ais config cluster auth --json
    "auth": {
        "secret": "xxxxxxxx",
        "enabled": false
  • use integrated AuthN server that provides OAuth 2.0 compliant JWT and a set of easy commands to manage users and roles (with certain permissions to access certain clusters, etc.);
  • outsource authorization to a separate, centralized (usually, LDAP-integrated) management system to manage existing users, groups, and mappings.


Conceptually, aistore rebalance is similar to what’s often called “RAID rebuild”. The underlying mechanics would be very different but the general idea is the same: user data massively migrating from some nodes in a cluster (or disks in an array) to some other nodes (disks), and vice versa.

In aistore, all the migration (aka “rebalancing”) that’s taking in place is the system response to a lifecycle event that’s already happened or is about to happen. In fact, it is the response to satisfy a singular purpose and a single location-governing rule that simply states: user data must be properly located.

Proper location

For any object in a cluster, its proper location is defined by the current cluster map and locally – on each target node – by the locally configured target’s mountpaths.

In that sense, the “maintenance” state, for instance, has its beginning – when the cluster starts rebalancing, and the post-rebalancing end, whereby the corresponding sub-state get recorded in a new version of the cluster map, which then gets safely distributed across all nodes, etc., etc.

Next section gives an example and further clarifies the notion of maintenance sub-states.

Quick example

Given a 3-node single-gateway cluster, we go ahead and shut down one of the nodes:

$ ais cluster add-remove-nodes shutdown <TAB-TAB>
p[MWIp8080]   t[ikht8083]   t[noXt8082]   t[VmQt8081]

$ ais cluster add-remove-nodes shutdown t[ikht8083] -y

Started rebalance "g47" (to monitor, run 'ais show rebalance').
t[ikht8083] is shutting down, please wait for cluster rebalancing to finish

Note: the node t[ikht8083] is _not_ decommissioned - 
it remains in the cluster map and can be manually
restarted at any later time (and subsequently activated via 'stop-maintenance' operation).

Once the command is executed, notice the following:

$ ais show cluster
t[ikht8083][x]   -   -   -   -   maintenance

At first, maintenance will show up in red indicating a simple fact that data is expeditiously migrating from the node (which is about to leave the cluster).

A visual cue, which is supposed to imply something like: “please don’t disconnect, do not power off”.

But eventually, if you run the command periodically:

$ ais show cluster --refresh 3

or a few times manually – eventually show cluster will report that rebalance (“g47” in the example) has finished and the node t[ikht8083] – gracefully terminated. Simultaneously, maintenance in the show output will become non-red.

The takeaway: global rebalance runs its full way before the node in question is permitted to leave. If interrupted for any reason whatsoever (power-cycle, network disconnect, new node joining, cluster shutdown, etc.) – rebalance will resume and will keep going until the governing condition is fully and globally satisfied.


lifecycle operationCLIbrief description
maintenance modestart-maintenanceThe most lightweight way to remove a node. Stop keep-alive heartbeats, do not insist on metadata updates – ignore the failures. For advanced usage options, see --help.
shutdownshutdownSame as above, plus node shutdown (aisnode exit).
decommissiondecommissionSame as above, plus partial (metadata only) or complete (both data and AIS metadata) cleanup. A decommissioned node is forever “forgotten” – removed from the cluster map.
remove node from cluster mapais advanced remove-from-smapStrictly intended for testing purposes and special use-at-your-own-risk scenarios. Immediately remove the node from the cluster and distribute updated cluster map with no rebalancing.
take node out of maintenancestop-maintenanceUpdate the node with the current cluster-level metadata, re-enable keep-alive, run global rebalance. Finally, when all succeeds, distribute updated cluster map (where the node shows up “online”).
join new node (ie., grow cluster)joinEssentially, same as above: update the node, run global rebalance, etc.

Assorted notes

Normally, a starting-up AIS node (aisnode) will use its local configuration to communicate with any other node in the cluster and perform what’s called self-join. The latter does not require a join command or any other explicit administration.

Still, the join command can be useful when, for instance, node is misconfigured or started as a standby.

When rebalancing, the cluster remains fully operational and can be used to read and write data, list, create, and destroy buckets, run jobs, and more. In other words, none of the listed lifecycle operations requires downtime. The persistent idea is that users never notice…


Transforming non-existing datasets

There’s an old trick that never quite gets old: you run a high-velocity exercise that generates a massive amount of traffic through some sort of a multi-part system, whereby some of those parts are (spectacularly) getting killed and periodically recovered.

TL;DR a simple demonstration that does exactly that (and see detailed comments inside):

cp-rmnode-rebalancetaking a random node to maintenance when there’s no data redundancy
cp-rmnode-ec(erasure coded content) + (immediate loss of a node)
cp-rmdisk(3-way replication) + (immediate loss of a random drive)

The scripts are self-contained and will run with any aistore instance that has at least 5 nodes, each with 3+ disks.

But when the traffic is running and the parts are getting periodically killed and recovered in a variety of realistic ways – then you would maybe want to watch it via Prometheus or Graphite/Grafana. Or, at the very least, via ais show performance – the poor man’s choice that’s always available.

for details, run: ais show performance --help

Observability notwithstanding, the idea is always the same – to see whether the combined throughput dips at any point (it does). And by how much, how long (it depends).

There’s one (and only one) problem though: vanilla copying may sound dull and mundane. Frankly, it is totally unexciting, even when coincided with all the rebalancing/rebuilding runtime drama behind the scenes.


And so, to make it marginally more interesting – but also to increase usability – we go ahead and copy a non-existing dataset. Something like:

$ ais ls s3
No "s3://" matching buckets in the cluster. Use '--all' option to list _all_ buckets.

$ ais storage summary s3://src --all
NAME             OBJECTS (cached, remote)
s3://src                  0       1430

$ ais ls gs
No "gs://" matching buckets in the cluster. Use '--all' option to list _all_ buckets.

$ ais cp s3://src gs://dst --progress --refresh 3 --all

Copied objects:              277/1430 [===========>--------------------------------------------------] 19 %
Copied size:    277.00 KiB / 1.40 MiB [===========>--------------------------------------------------] 19 %

The first three commands briefly establish non-existence – the fact that there are no Amazon and Google buckets in the cluster right now.

ais storage summary command (and its close relative ais ls --summary) will also report whether the source is visible/accessible and will conveniently compute numbers and sizes (not shown).

But because “existence” may come with all sorts of connotations the term is: presence. We say “present” or “not present” in reference to remote buckets and/or data in those buckets, whereby the latter may or may not be currently present in part or in whole.

In this case, both the source and the destination (s3://src and gs://dst, respectively) were ostensibly not present, and we just went ahead to run the copy with a progress bar and a variety of not shown list/range/prefix selections and options (see --help for details).


From here on, the immediate and fully expected question is: transformation. Namely – whether it’d be possible to transform datasets – not just copy but also apply a user-defined transformation to the source that may be (currently) stored in the AIS cluster, or maybe not or not entirely.

Something like:

$ ais etl init spec --name=my-custom-transform --from-file=my-custom-transform.yaml

followed by:

$ ais etl bucket my-custom-transform s3://src gs://dst --progress --refresh 3 --all

The first step deploys user containers on each clustered node. More precisely, the init-spec API call is broadcast to all target nodes; in response, each node calls K8s API to pull the corresponding image and run it locally and in parallel – but only if the container in question is not already previously deployed.

(And yes, ETL is the only aistore feature that does require Kubernetes.)

Another flavor of ais etl init command is ais etl init code – see --help for details.

That was the first step – the second is virtually identical to copying (see previous section). It’ll read remote dataset from Amazon S3, transform it, and place the result into another (e.g., Google) cloud.

As a quick aside, anything that aistore reads or writes remotely it also stores. Storing is always done in full accordance with the configured redundancy and other applicable bucket policies and – secondly – all subsequent access to the same content (that previously was remote) gets terminated inside the cluster.

Despite node and drive failures

The scripts above periodically fail and recover nodes and disks. But we could also go ahead and replace ais cp command with its ais etl counterpart – that is, replace dataset replication with dataset (offline) transformation, while leaving everything else intact.

We could do even more – select any startable job:

$ ais start <TAB-TAB>
prefetch           dsort              etl                cleanup            mirror             warm-up-metadata   move-bck
download           lru                rebalance          resilver           ec-encode          copy-bck

and run it while simultaneously taking out nodes and disks. It’ll run and, given enough redundancy in the system, it’ll recover and will keep going.


The ability to recover is much more fundamental than any specific job kind that’s already supported today or will be added in the future.

Not every job is startable. In fact, majority of the supported jobs have their own dedicated API and CLI, and there are still other jobs that run only on demand.

The Upshot

The beauty of copying is in the eye of the beholder. But personally, big part of it is that there’s no need to have a client. Not that clients are bad, I’m not saying that (in fact, the opposite may be true). But there’s a certain elegance and power in running self-contained jobs that are autonomously driven by the cluster and execute at (N * disk-bandwidth) aggregated throughput, where N is the total number of clustered disks.

At the core of it, there’s the (core) process whereby all nodes, in parallel, run reading and writing threads on a per (local) disk basis, each reading thread traversing local, or soon-to-be local, part of the source dataset. Whether it’d be vanilla copying or user-defined offline transformation on steroids, the underlying iterative picture is always the same:

  1. read the next object using a built-in (local or remote) or etl container-provided reader
  2. write it using built-in (local or remote), or container-provided writer
  3. repeat

Parallelism and autonomy always go hand in hand. In aistore, location rules are cluster-wide universal. Given identical (versioned, protected, and replicated) cluster map and its own disposition of local disks, each node independently decides what to read and where to write it. There’s no stepping-over, no duplication, and no conflicts.

Question to maybe take offline: how to do the “nexting” when the source is remote (i.e., not present)? How to iterate a remote source without loss of parallelism?

And so, even though it ultimately boils down to iteratively calling read and write primitives, the core process appears to be infinitely flexible in its applications.

And that’s the upshot.


AIStore: an open system for petascale deep learning

So tell me, what do you truly desire?

                  Lucifer Morningstar


AIStore (or AIS) has been in development for more than three years so far and has accumulated a fairly long list of capabilities, all duly noted via release notes on the corresponding GitHub pages. At this stage, AIS meets common expectations to a storage solution – its usability, manageability, data protection, scalability and performance.

AIStore is a highly available partition-tolerant distributed system with n-way mirroring, erasure coding, and read-after-write consistency(*). But it is not purely – or not only – a storage system: it’ll shuffle user datasets and run custom extract-transform-load workloads. From its very inception, the idea was to provide an open-source, open-format software stack for AI apps – an ambitious undertaking that required incremental evolution via multiple internal releases and continuous refactoring…

AIS is lightweight, fully reliable storage that can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size. Prerequisites boil down to having a Linux and a disk. Getting started with AIS will take only a few minutes and can be done either by running a prebuilt all-in-one docker image or directly from the source.

AIS provides S3 and native APIs and can be deployed as fast storage (or a fast on-demand cache) in front of the 5 (five) supported backends (whereby AIS itself would be the number 6, respectively):


The focus on training apps and associated workloads results in a different set of optimization priorities and a different set of inevitable tradeoffs. Unlike most distributed storage systems, AIS does not break objects into uniform pieces – blocks, chunks, or fragments – with the corresponding metadata manifests stored separately (and with an elaborate datapath to manage all of the above and reassemble distributed pieces within each read request, which is also why the resulting latency is usually bounded by the slowest link in the corresponding flow chart).

Instead, AIS supports direct (compute <=> disk) I/O flows with the further focus on (re)sharding datasets prior to any workload that actually must perform – model training in the first place. The idea was to make a system that can easily run massive-parallel jobs converting small original files and/or existing shards to preferred sharded formats that the apps in question can readily and optimally consume. The result – linear scalability under random-read workloads, as in:

total-throughput = N * single-disk-throughput

where N is the total number of clustered disks.

It is difficult to talk about a storage solution and not say a few words about actual I/O flows. Here’s a high-level READ diagram, which however does not show checksumming, self-healing, versioning, and scenarios (that also include operation in the presence of nodes and disks being added or removed):


You can always start small. As long as there’s HTTP connectivity, AIS clusters can see and operate on each other’s datasets. One can, therefore, start with a single all-in-one container or a single (virtual) machine – one cluster, one deployment at a time. The resulting global namespace is easily extensible via Cloud buckets and other supported backends. Paraphrasing the epigraph, the true desire is to run on commodity Linux servers, perform close-to-data user-defined transforms, and, of course, radically simplify training models at any scale.

Integrated Storage Stack for Training, Inference, and Transformations

The Problem

In the end, the choice, like the majority of important choices, comes down to a binary: either this or that. Either you go to storage, or you don’t. Either you cache a dataset in question (and then try to operate on the cache), or make the storage itself do the “operating.”

That’s binary, and that’s the bottom line.

Of course, I’m talking about ETL workloads. Machine learning has three, and only three, distinct workloads that are known at the time of this writing. And ETL is the number one.

[ Full disclosure: the other two include model training and hyperparameter optimization ]

ETL – or you can simply say “data preprocessing” because that’s what it is (my advice, though, if I may, would be to say “ETL” as it may help institute a sense of shared values, etc.) – in short, ETL is something that is usually done prior to training.

Examples? Well, ask a random person to name a fruit, and you’ll promptly hear back “an apple.” Similarly, ask anyone to name an ETL workload, and many, maybe most, will immediately respond with “augmentation”. Which in and of itself is a shortcut for a bunch of concrete sprightly verbs: flip, rotate, scale, crop, and more.

My point? My point is, and always will be, that any model – and any deep-learning neural network, in particular – is only as good as the data you feed into it. That’s why they flip and rotate and what-not. And that’s precisely why they augment or, more specifically, extract-transform-load, raw datasets commonly used to train deep learning classifiers. Preprocess, train, and repeat. Reprocess, retrain, and compare the resulting mAP (for instance). And so on.

Moreover, deep-learning over large datasets features the proverbial 3 V’s, 4 V’s, and some will even say 5 V’s, of the Big Data. That’s a lot of V’s, by the way! Popular examples include YouTube-8M, YouTube-100M, and HowTo100M. Many more examples are also named here and here.

Very few companies can routinely compute over those (yes, extremely popular) datasets. In US, you can count them all on the fingers of one hand. They all use proprietary wherewithal. In other words, there’s a problem that exists, is singularly challenging and, for all intents and purposes, unresolved.

After all, a 100 million YouTube videos is a 100 million YouTube videos – you cannot bring them all over to your machine. You cannot easily replicate 100 million YouTube videos.

And finally – before you ask – about caching. The usual, well-respected and time-honored, approach to cache the most frequently (recently) used subset of a dataset won’t work. There’s no such thing as “the most” – every single image and every second of every single video is equally and randomly accessed by a model-in-training.

Which circles me all the way back to where I’d started: the choice. The answer at this point appears to be intuitive: storage system must operate in place and in parallel. In particular, it must run user-defined ETLs on (and by) the cluster itself.


AIStore (or AIS) is a reliable distributed storage solution that can be deployed on any commodity hardware, can run user containers and functions to transform datasets inline (on the fly) and offline, scales linearly with no limitations.

AIStore is not gen-purpose storage. Rather, it is a fully-reliable extremely-lightweight object store designed from the ground up to serve as a foundation for an integrated hyper-converged stack with a focus on deep learning.

In the 3+ years the system has been in development, it has accumulated a long list of features and capabilities, all duly noted via release notes on the corresponding GitHub pages. At this stage AIS meets most common expectations in re usability, manageability, and data protection.

AIS is an elastic cluster that grows and shrinks with no downtime and can be easily-and-quickly deployed, with or without Kubernetes, anywhere from a single machine to a bare-metal cluster of any size. For Kubernetes-based deployments, there’s a whole separate repository that contains AIS deployment playbooks, Helm charts, and Kubernetes Operator.

The system features data protection and self-healing capabilities that users come to normally expect nowadays. But it can also be used as fast ad-hoc cache in front of the five (so far) supported backends, with AIS itself being the number six.

The picture below illustrates inline transformation, whereby each shard from a given distributed dataset gets transformed in-place by a user-provided function. It goes as follows:

  1. A user initiates custom transformation by executing documented REST APIs and providing either a docker image (that we can pull) or a transforming function that we further run using one of the pre-built runtimes;
  2. The API call triggers simultaneous deployment of multiple ETL containers (i.e., K8s pods) across the entire cluster: one container alongside each AIS target;
  3. Client-side application (e.g., PyTorch or TensorFlow-based training model) starts randomly reading sharded samples from a given dataset;
  4. Each read request:
    • quickly bounces off via HTTP redirect – first, of an AIS proxy (gateway) and, second, of AIS target – reaching its designated destination – the ETL container that happens to be “local” to the requested shard, after which:
    • the container performs local reading of the shard, applies user-provided transforming function to the latter, and, finally, responds inline to the original read request with the transformed bytes.


The sequence above is one of the many supported permutations that also include:

  • User-defined transformation via:
    • ETL container that runs HTTP server and implements one of the supported APIs, or
    • user function that we run ourselves given one of the supported runtimes;
  • AIS target <=> ETL container communication via:
    • HTTP redirect – as shown in the picture, or
    • AIS target performing the read and “pushing” read bytes into locally deployed ETL to further get back transformed bytes and respond to the original request.

And more. Offline – input dataset => output dataset – transformation is also available. A reverse-proxy option is supported as well, although not recommended.


In the end, the choice, like so many important choices we make, is binary. But it is good to know what can be done and what’s already actually working.


Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs

Data sets are growing bigger every day and GPUs are getting faster. This means there are more data sets for deep learning researchers and engineers to train and validate their models.

  • Many datasets for research in still image recognition are becoming available with 10 million or more images, including OpenImages and Places.
  • million YouTube videos (YouTube 8M) consume about 300 TB in 720p, used for research in object recognition, video analytics, and action recognition.
  • The Tobacco Corpus consists of about 20 million scanned HD pages, useful for OCR and text analytics research.

Here’s a full text:

High performance I/O for large scale deep learning

Abstract – Training deep learning (DL) models on petascale datasets is essential for achieving competitive and state-of-the-art performance in applications such as speech, video analytics, and object recognition. However, existing distributed filesystems were not developed for the access patterns and usability requirements of DL jobs. In this paper, we describe AIStore, a highly scalable, easy-to-deploy […]

Finding the right storage system

What are the odds that you find the right storage system to run all the workloads that need to be running and to forever sustain all the apps that consume storage? Given the current state of the art, what would be your chances of doing so today?

Apart from being existential, the question also appears to be formalizable. Meaning that it can be approached with a certain degree of scientific rigor and scrupulousness. Of course, I’m talking about the Drake equation.

The Drake equation is a probabilistic argument used to estimate the number of active, communicative extraterrestrial civilizations in the Milky Way galaxy. Since its 1961 publication, and notwithstanding constructive criticism (see, for instance,The Drake Equation Is Broken; Here’s How To Fix It“) the equation proved to be usable across wide range of application domains.

Perhaps the most popular application of Drake is estimating one’s chances to find true love (variations: girlfriend, boyfriend, soulmate, BFF).

Given a massive body of work on the subject I’ll have to limit citations to maybe just one: the 2008 paper “Why I don’t have a girlfriend” where Peter Backus clearly and succinctly explains a) how to apply Drake, and b) why at the time of the writing he didn’t have a girlfriend.

Back to storage systems though – equipped with Drake’s methodology, the question of finding the one and only storage system can be factorized (step 1), with the probabilities of each factor estimated (step 2), and fed into the equation (step 3) to generate an estimate. Consider a real-life example, whereby the storage system in question features:

  1. SDS: software stack on commodity x86 servers
  2. Price $0.10/GB or below that covers hardware + software + 5 years of support
  3. Petascale clustering with 1PB usable today and growing to 100PB later in 2019
  4. Storage media: hard drives
  5. Linearity: must scale close to linear with each added node and each added spindle
  6. HA: must be highly available
  7. Tiering: unlimited
  8. Data redundancy: both local (RAID1/5/6/10) and replicated
  9. AI: built-in accelerations for AI workloads

A quick look at the above reveals the truth: the chance of locating a match is probably low.

Indeed, at the time of this writing, the Wikipedia’s computer storage category includes a total of about 150 companies. Even if we super-generously give each of the 9 factors an average 0.5 probability, the resulting (150 * 1/512) can only be interpreted as: oops, doesn’t exist.

There are numerous ways to confirm the first suspicion. For instance, given the limited dataset of 150, we could run Monte Carlo simulations by randomly selecting storage companies and filling-in samples, e.g.:

Storage company X.Y.Z. SDS: No Petascale: No Price: Not cheap Commodity x86: No HA: Yes

Having accumulated enough samples to generate confidence intervals, we then compute the odds, find them depressingly low, and resort to finding some peace in the pure logic of this exercise, a consolation similar to the one Peter Backus found years ago.

We could also take a more conventional route which (conventionally) entails: reviewing datasheets, meeting with vendors, gathering references, reading Top-N IDC and Gartner reports, and then again reviewing the product datasheets. Here, for instance, are the top 13 companies from the IDC Worldwide Object Storage 2018 – we could take this report and painstakingly fill-in a 13-by-9 comparison table with those companies in the rows and the distinguishing characteristics in the columns (I will leave this as an exercise for the reader).


Source: The Register

Most likely, though, if nothing even remotely plausible comes up on the first day or two of screening, referencing, and filling out comparison tables, the chance (of finding the right storage system) must be considered extremely slim. The storage system that you need, seek, and depend upon may not exist.

Or, stated differently, the storage system that you really need may really not exist.

It’ll be hard to come to terms with this statement – the process will usually take anywhere between three to nine months. Eventually, though, the dilemma will crystallize as a simple binary choice: to build or not to build?

This is a tough choice to make. Here’s one possible answer. Why does it work for the 9 (nine) factors listed above and where does it go from there – that will be another story and a different time.

Translational Symmetries in Storage Systems

What is data storage? Dictionaries define data as information, and storage as “act of storing” or “space for storing.” A data storage system, then, is something that stores, keeps, protects, and ultimately conserves information. When we talk about data storage, we must therefore consider conservation and the laws of thereof.

The conservation laws are the most fundamental, the most universal, the most principle of all the laws of nature. Physical systems conserve momentum and energy. Quantum mechanical systems additionally conserve information. A quantum mechanical system manages to conserve its information even in the most extreme conditions, such as the immediate proximity of a gigantic black hole (see Black hole information paradox and ER = EPR, in that sequence).

Moreover, all physical systems abide by their system-specific laws of motion. Electromagnetic systems abide by the electromagnetic laws of motion, thermodynamic systems – by thermodynamic, and so on.  Newton’s second law of motion (circa 1687), the Second law of thermodynamics, the Maxwell-Faraday equation, the Schrödinger wavefunction, the Einstein field equations – these are various laws of motion expressed as differential equations.

At this point, I’m going to go on a limb and posit that a storage system is a physical system. Further, a good storage system is a quantum mechanical system that preserves information with respect to its “motions” that include:

  • relocation, replication, rebalancing
  • snapshotting, cloning
  • compression, encryption, deduplication
  • decompression, decryption, un-snapshotting

and more.

In layman terms, the goodness of a storage system is henceforth defined as its compliance with the universal conservation laws and its ability to behave like any other well-behaved physical system. But what does it mean “to behave”? Well, it means to move, transform and evolve according to the system-specific laws of motion. The laws of motion, in turn, abide by the conservation laws, while those, in their most general presentation, state that some quantity Z in the mathematical description of a system’s evolution must remain constant. In effect, Z must remain invariant. Let’s talk therefore about:

Invariance and symmetry

The conservation laws are, what many scientists call, “principle laws” – universal laws about other universal laws (see The Unity Of The Universe by David Deutsch, at 10:35). It’s the conservation laws that define universal laws of motion, and not the other way around. In the hierarchy of all laws, the conservation laws are strategically positioned at the base.

A word on the established terminology. In the scientific description of conservation laws, the terms “symmetry” and “invariance” are interchangeable, with “symmetry” implying a specific class, a family, or better yet, a group of transformations. The following two statements are equivalent and should make this last point very clear:

(1) A conservation law states that a particular measurable property of a system does not change as the system evolves. (2) A conservation law states that a particular measurable property of a system remains invariant with respect to certain transformations that in combination constitute a symmetry (group).

Similarly, a good storage system (that conserves information in presence of lossless stateful transformations) is symmetric with respect to those transformations (aka translations).

But which transformations constitute a good-storage-system’s symmetry group? The answer will be based on yet another universal law that states:

What’s not forbidden is allowed. Moreover, everything that is not forbidden is compulsory.

For a good storage system, the direct and unavoidable implication of the above is that the system must support all stateful transitions resulting from the not forbidden (or rather, not yet forbidden) compressions, encryptions, replications, deduplications and so on.


So far, I’ve laid the groundwork to introduce the main topic: a perfect storage system. This is a good system of unimpeachable integrity abiding by the universal conservation laws, the most fundamental of which is that information is never lost. The storage system that evolves without ever taxing the user for its internal evolution.

That’s the perfection we’ll be looking for in the subsequent sections.

The Plan

One useful approach to build something new is to first destroy the current version that is getting old (or deemed to be old by a recent college grad). And then to think about the next step. This process is called creative destruction (with an emphasis on “creative”) and implies a certain reuse rather than total uncompromising annihilation. The typical roadmap is:

  • Blow the old version to smithereens, while setting aside a couple of original building blocks;
  • Refashion those creatively, and then immediately proceed to build the new.

But what about the building blocks? Which ones shall we select, retain and, subsequently, refashion? That is the question, and the sole purpose of this text, which also entails:

  • Introductory insight into the crucial value of intuition in storage engineering – in the Section “Trust your intuition”;
  • Historical and literary allusions – in Sections “1982” and “Down the rabbit hole”;
  • Creative destruction (with an emphasis on “destruction”) of contemporary art – in “Genericity” and “Reductionism”;
  • Forewarning of unexpected dangers that are to be expected – in “Against corruption”;
  • A note on natural selection in storage – in “Darwinism.”

You can also TLDR directly to the namesake Section – therein lies an important clue on how to start building a perfect storage system.

Trust your intuition

You are a software architect. One day you start early, work for a while, and devise a data structure. It is now sitting in the corner of your screen – a bunch of pointers sticking out of it, a few carefully crafted content fields. A question arises in the back of your mind: is this it?

Lifted from the depths of the unconscious, the question is what will happen once everything is said and done, and after the bits get GA-ed, shipped, deployed as a live storage system brimming with user data:That’s the human condition – we have no ways of finding out. Visualize a shopping mall: in an infinitely long row of data structures, each one will be neatly packaged and labeled with its pros and cons. The pros will include a faster disk access for sequential I/O, for instance. (How much faster? The label won’t tell.) The cons – fragmentation that would require compaction. (Was this information helpful? Give us your feedback by checking one of the boxes.)


To search for real answers, we need to go all the way back – to early frontiers.

The year is 1982. Pan Am is still operating across the Atlantic, but the 830 bombing will soon usher a new era. Recession grips the US. Brezhnev dies after 17 years of ruling the communist party and the USSR. The first CD player. Survivor’s Eye Of The Tiger. Times Man of the Year is the personal computer. A graduate student at UC Berkeley follows in the footsteps of other Unix pioneers (Bill Joy et al.), to build the very first filesystem that was later called Unix File System, or UFS.

Figure 1. UFS inode

UFS introduced the inode data structure (Fig. 1). UFS used (and keeps using) the inode to store UFS files in UFS directories.

Enough of the trivia though, and so much for a glorious history. A lot has changed in the intervening years except one little thing – the file being a byte array over a bunch of  512n, 512e, or 4Kn physical segments stitched together by the filesystem. (Nobody knows that last bit except maybe a few storage-systems engineers, and even they often forget.)

Meaning that, in all past, current and future filesystems, there’s a core metadata structure that represents and implements a file as a “sequence of bytes.” All existing and future files can be traced back to the very first inode in the Fig. 1.

That’s pretty damn remarkable.

Down the rabbit hole

You are a storage architect. One day you start early and work till dusk. You keep working into the night when the voices retreat, street sounds die out, and the darkness falls. And then – you see it: your storage system.

From a distance, it looks like a giant web splattered across non-volatile storage media that has been pooled together for greater capacity. It’s a giant mess, a jumble of trunks, branches, and leaves – growing, forking, intertwining, and multiplying.

When you get closer – in the process that must be quaintly familiar to anyone who ever experienced vertiginous Mandelbrot zooms or who used to fall with Alice down the rabbit hole – when you get closer, the cluttered tendrils become separable, and you see a picture that is eerily familiar. Oh yes, it’s a gigantic directed graph, not necessarily acyclic albeit traversable from the top down (at the top, it resembles a forest with a few discernable roots).

(Keep falling, accelerating, zooming-in.)

As the resolution magnifies, distinct nodes shift into focus, with their directed edges of different kinds. The most abundant will be the parent-contains-child type of edge, but there are others, connecting each other forward, backward, and sideways.

Gradually, a pattern emerges. You don’t notice it for a while because at first, all nodes appear to be infinitely different and categorically unique. It is only when you step back up a little bit, you start recognizing the node’s types aka metadata structures.

Some of those structures seem to be repetitive. As you stare at it some more, you realize that the word “repetitive” does not adequately describe the situation. Two, or maybe five of those metadata types cover most of the graph. They dominate the rest by a wide margin.

(And it is with this uneasy thought you wake up.)


There are only two viable perspectives of the world: user and system. The user’s point of view is that the data is deeply structured, hierarchically layered, intricately cross-referenced, and – on top of all of the above – multimedia-rich. On-disk though, it’s a graph that is so generic that its apparent genericity gives a new meaning to being generic.

At the same time, there’s the amazing, demonstrated, and million times confirmed efficiency of storing/retrieving data in all its incredible variety by utilizing just a handful metadata types. Reducing, in effect, the entire world to a few elementary “particles” of its as-yet undiscovered standard model.

Figure 2. The Standard Model of particle physics


Building a storage system is excruciatingly hard. There’s one thing that’s harder, though: seeing it through ­– living long enough to see it through.

Can we somehow reduce the mind-blowing complexity of a storage system? My answer is Yes. But first, before reducing anything anywhere, let’s come to terms with Reductionism.

Reductionism refers to several related philosophical ideas, says Wikipedia, and goes on to touch upon each of those in the spheres of natural sciences, mathematics, religion, and philosophy.

Sadly, storage engineering is never mentioned – not by the Wikipedia, or by any other reputable source.

Let’s close that knowledge gap. First, notice that Reductionism is a meta-philosophical abstraction that, by all appearances, belongs to the highest order of the purest Platonic Forms. At the same time, Reductionism is deceptively easy to grasp intuitively. What’s not easy at all is to define it.

Let’s, therefore, go ahead and define Reductionism in a way that can be dear and near to the hearts of software engineers.

The essence of Reductionism is to impose a can-be-derived-from ordering on a given theory or body of knowledge (broadly, on any phenomenon). Accordingly, some entities, properties or substances arise or emerge out of more fundamental entities, properties, and substances, respectively. And yet, those new ones (that have arisen or emerged) are in some sense irreducible.

There are many notable examples. The entire periodic table can be step-by-step built from the Standard Model’s particles (Fig. 2 above).

Only 3 of the 17 known particles (aka quantum fields) are required to build the entirety of ordinary matter (Fig. 2). Plus, the neutrino that takes part in the process of changing the atomic nucleus into a different kind.

In Mathematics, it’ll take a certain effort to find an area or a field that’s irreducible to natural numbers. Hence, the lower part of the corresponding “reductionist” pyramid of all of Math looks as follows:

Figure 3. The pyramid of numbers

In the picture above integers can be derived from the natural numbers via subtraction, rational numbers from integers – via division, real numbers from rational numbers – via a Dedekind cut, and complex numbers from real numbers – via the pairing of real numbers and defining the multiplication rule as follows:

(a, b) * (c, d) = (ac – bd, ad + bc)

where a, b, c, d are real numbers.

The multiplication rule may seem non-obvious and not very intuitive. It becomes abundantly clear, however, once the complex numbers are expressed in polar or, better yet, exponential forms.

Maybe it’s a cliché to say that mathematics builds on itself. The same, however, applies across the board to any branch of science or technology. Very often there’ll be a tall and growing taller pyramid, with a narrow foundation and layers upon layers of the derived complexity. Something that looks as follows, for instance:

Reductionism must be equivalent to consistent build-up as it is fully based on the ability to incontrovertibly derive the complexity of the upper layers from the building blocks of the lower layers, and ultimately, from the most basic and most irreducible axioms. Without this internally-consistent continuous build-up, there’d be nothing to reduce in the first place. Instead, there would be a confounding chaos, a primordial timeless void.

Against corruption

Here’s what we know beyond any shred of doubt: on-disk layout is self-reproducing, and the amount of its self-reproduction is driven exclusively by the sheer size and granularity of the user data.

On the other hand, a storage system must retain its on-disk consistency at all times and under any circumstances. A diverse and extended set of those circumstances comprises the entire disk lore that is crammed with tales of rapidly degrading disks, disks turning read-only, disks turning themselves into bricks, and disks conspiring with other disks to turn themselves into bricks.

There’s also silent data corruption which, as the name implies, corrupts rather silently albeit indiscriminately.

In layman terms, data corruption is called silent when the storage system does report it while the underlying solid-state or hard drive doesn’t (or wouldn’t).

Some studies attribute up to 10% of catastrophic storage failures to silent corruption. Other studies observe that “SLC drives, which are targeted at the enterprise market <snip>, are not more reliable than the lower end MLC drives,” thus providing a great educational value. There is, however, not a single study of what should’ve been appropriately called a silent undetected corruption. The related statistical data is probably just too painful to document.

Finally, the software/firmware malfunction that comes on top of all of the above and, to be honest, tops it all. It manifests itself in a rich variety of assorted colorfully named conditions: phantom writes and misdirected reads, to name a few.

About Darwinism

Nature repeats itself, especially if the repetition pays off. And if not, then not – end of story, thank you very much. The same is true for storage systems – some of them, frankly, keep capturing user data as we speak. But not all, not by a long stretch.

There are also those storage systems that linger. In fact, some of them linger for so much time that you’d take them for granted, as a steady fixture of the environment. But that’s beside the point. The point is, never-ever will those systems thrive and multiply in the strict Darwinian sense that defies corruption and defines success as the total deployed capacity.


Here’s the upshot of my so far carefully constructed argument:

  1. Storage systems rely on a handful of their internal (core) data structures to store just about everything;
  2. Generations of engineers, while firmly standing on the shoulders of the giants, augment those core data structures incrementally and anecdotally;
  3. Instead, when building a good storage system of the future, our focus should be directed solely at the core;
  4. We shall carve the perfection out of the existing raw material by applying conservation laws.

That’ll be the plan in sum and substance. To begin with, let’s consider a conventional core data structure in an existing storage system. It could be a file or a directory (metadata type) in a filesystem. It can be an object or a chunk of an object in an object store, a key or a value in a key/value store, a row, a column, or a table in a database. In short, it could be any and all of those things.

Now, visualize one such structure. If you absolutely have to have something in front of your eyes, take the inode from the Section “1982” – as good a starting point as any. Distilled to its pure essence, it will look as follows:

Figure 4. The core metadata, refined and purified

That’d be our initial building block: arbitrary content and maybe a few pointers to connect adjacent nodes: children, parents and/or siblings. Incidentally, the content is not relevant at this point (it may be later) – what’s relevant is the surrounding environment, the stage, and the World. Paraphrasing Shakespeare, all the World is a lattice, and nodes in it are… no, not actors – travelers. The nodes move around:

Figure 5. Basic transformations: a) move, b) replicate

The lattice is an n-dimensional grid (only two dimensions are shown in the Fig. 5 and elsewhere) that reflects addressing within the storage graph. Concrete storage-addressing implementations may include: keys (to address values), qualified names and offsets, volume IDs and LBAs, URIs and shard sequence numbers, and much more. To that end, our lattice/grid World is a tool to conduct thought experiments (gedankenexperiments) while abstracting out implementation specifics.

The leftmost part of the Fig. 5 depicts a snippet of the World with two nodes A and B placed in strict accordance with their respective cartesian coordinates. Simultaneously, the A would be pointing at the B via the A’s own internal Fig. 4 structure. That’s what we have – the setup at Time Zero.

What happens next? Well, what happens is that something happens, and node B moves (Fig. 5a) or replicates (Fig. 5b). The move (aka migration) could have resulted from the B getting compressed or encrypted in place. Or perhaps, getting tiered away from a dying disk. Either way, the result is a new World grid location denoted as B’. And since, excepting the change in coordinates, B’ is the same B, the picture illustrates the distinction between identity and location. The distinction that in and of itself is informative.

Many storage systems today bundle, pack, or otherwise merge identities and locations, the practice that became especially popular with wide-spread acceptance of the consistent hashing. In effect, those systems may be violating the symmetry with respect to spatial translations.

In a good storage system, the corresponding symmetry must be preserved even though it may (appear to) not directly connect to user requirements or auxiliary pragmatic considerations. In effect, the requirement of complying with the principle conservation laws changes our perspective on what is crucially important and what can maybe wait.

The long and short of this is that the most basic spatial symmetry implies a certain upgrade in the initial building block shown in the Fig. 4 above. The concrete implementation of this “upgrade” will depend on specifics and is therefore omitted.

Next, compare the following two scenarios where node B is referenced in the World grid by two other nodes: A and C.

Figure 6. Counting the references

Skipping for brevity the otherwise timely discussion of reference counting, let’s just point out that the Fig. 6a and Fig. 6b – are different. In a good storage system, the corresponding informative difference (which typically gets discarded without so much as a blink) must be preserved. This will most likely have implications for our future perfect building block.

To give one final example, consider a split and a merge – two lossless transformations that are also mutually inverse:

Figure 7. The logical boundaries must be flexible, subject to evolution of the system

Here again, from the user perspective, and assuming the names B and C are not externally visible, nothing really changes between the Figures 7a and 7b. On the other hand, there’s a material change that is distinctly different from the previous two gedankenexperiments. The change that must be a) permitted by the storage system, and b) preserved, information-content wise.

Further, conducting split and merge transformations en masse will likely require a larger-than-building-block abstraction: an object and its version, or maybe a dataset and its snapshot. Or even, a dataset and its immutable mutable snapshot – the snapshot that remains totally invincible in the presence of splits, merges, and other lossless transforms, members of the corresponding storage symmetry groups.

In conclusion

Storage systems that continuously evolve and self-optimize via natively supported lossless transformations? That’s maybe one idea to think about.

There’s also the crucial issue of controlling the complexity which tends to go through the roof in storage projects. Pick your own “inode” and build it up, one step at a time and block by perfect block? You decide.

Other questions and topics will have to remain out of scope. This includes immutable mutable snapshotting (IMS), ways to define it, and means to realize it. The potential role of the IMS in providing for the translational storage symmetries cannot be underestimated.

Ten Billion Objects and the Pursuit of Balance

Suppose, you’ve come across a set of 10 billion unique objects that are stored far, far away. You can now read them all, although that may at times be painfully slow, if not downright sluggish. And it may be costly as well – a couple cents per gigabyte quickly runs up to real numbers.

You also have a Data Center, or, to be precise, a colocation facility – a COLO. With just a handful of servers, and a promise to get more if things go well. Say, a total of N > 1 servers at your full disposal. The combined capacity of those N servers will not exceed one million objects no matter what – 0.01% of grand total.

Last but not the least, you’ve got yourself a machine-learning project. The project requires each compute job to read anywhere from 100K to 1M random objects out of the 10 billion stored far, far away. The greediest batch job may consume, and then try to digest, close to a million but no more.

Therein lies the question: how to utilize your N-server resource for optimal (i.e., minimal) read latency and optimal (i.e., minimal) outbound transfer cost?

Researching and Developing. Coding.

The reality of R&D is that it’s not like in the movies. It is mundane, repetitive. You tweak and run and un-tweak and rerun. This process can go on and on, for hours and days in a row. Most of the time, though, the job that undergoes the “tweaking” would want to access the same objects.

Which is good. Great, actually, because of the resulting, uneven and unbalanced, distribution of the hot/cold temperatures across the 10-billion dataset. This is good news.

In real life, there are multiple non-deterministic factors that independently collaborate. Running 27 tunable flavors of the same job can quickly help to narrow down the most reusable object set. Impromptu coffee break (factor) presenting a unique opportunity to run all 27 when the rest of the team is out socializing. And so on.

More often, though, it’d be a work-related routine: a leak or a crash, a regression. The one that, down the road, becomes reproducible and gets narrowed to a specific set of objects that causes it to reproduce. The latter then is bound to become very hot very soon.

In pursuit of balance

For whatever reason, the title phrase strongly associates itself with the California pinot noir and chardonnay. There’s the other type of balance though – the one that relates to software engineering. In particular, relates to read caching.

DHT is, maybe, the first thing that comes to mind. The DHT that would be mapping object IDs (OID) to their server locations, helping us to realize the most straightforward implementation of a distributed read cache.

Conceptually, when an object with OID = A shows up in its local server space computing job would execute DHT.insert(A). And then, DHT.delete(A) when a configured LRU/LFU/ARC policy triggers the A’s eviction. The operations would have to be transactional, so that DHT.get() always gets consistent and timely results.

There is a price though. For starters, the task of managing local caches does not appear to be the first concern when all you need is to make your zillion-layer neural network converge (and it wouldn’t).

Thus, the DHT (or whatever is the implementation of the cluster-wide key/value table) would have to be properly offloaded to a separate local daemon and, project-wise, relegated to the remainder of working hours.

Next, there is a flooding concern. A modest 1K objects-per-second processing rate by 10 concurrent jobs translates as 20K/s inserts and deletes on the DHT. That’s not a lot in terms of bandwidth but that is something in terms of TCP overhead and network efficiency. The control-messaging overhead tends to grow as O(number-of-jobs * number-of-servers) to ultimately affect the latency/throughput of critical flows.

In addition, local DHT daemon will have to discover other DHT daemons and engage them in a meaningful, albeit measured, conversation including (for starters) keep-alive and configuration-change handshakes. So that everyone keeps itself in-sync with everyone else, as far as this (meticulous) object tracking and stuff.

Further, there’s the usual range of distributed scenarios: servers joining and leaving (or crashing, or power-cycling, or upgrading), spurious messaging timeouts, intermittent node failures, inconsistent updates. Think distributed consensus. Think false-positives. Think slipping deadlines and headaches.

And then there is persistence across reboots – but I’ll stop. No need to continue in order to see a simple truth: the read-caching feature can easily blow itself out of all proportions. Where’s the balance here? Where is the glorified equilibrium that hovers in the asymptotic vicinity of optimal impact/effort and cost/benefit ratios? And why, after all, when California pinot noir and chardonnay can be perfectly balanced – why can’t the same be done with software engineering of the most innovative machine-learning project that simply wants to access its fair share of objects stored far, far away?

The pause

The idea is as old as time: trade distributed state for distributed communications, or a good heuristic, or both. Better, both.

Hence, it’d be very appropriate at this point to take a pause and observe. Look for patterns, for things that change in predictable ways, or maybe do not change at all…pob-pause

When object sets are static

It appears that many times the compute jobs would know the contents of their respective object sets before their respective runtimes. Moreover, those objsets can be totally preordered: an object B would follow an object A iff the B is processed after the A, or in parallel.

The phenomenon can be attributed to multiple runs and reruns (previous section) when all we do is update the model and see if that works. Or, it can be ascribed to searchable metadata that happens to be stored separately for better/faster observability and analytics. Whatever the reason, the per-job list of object IDs can be compiled and broadcast to the clustered servers, accompanied with the message to please hold-on to those objects if they are local.

More precisely, the steps are:

  1. A job that is about to start executing broadcasts the list of its OIDs as well as per-OID (min, max, location) records of when and where it’ll need the objects preloaded;
  2. Each of the (N-1) control plane daemons responds back with a message consisting of a bunch of (OID, t1, t2) tuples, where OID is an object ID and:
    • t1 = 0, t2 > 0: the object is local at the responding server, and the server commits to keep it until at least time = t2
    • t1 > 0, t2 > t1: the object is not currently present at the responding server; the server, however, makes a tentative suggestion (a bid) to load it by the time t1 and keep it until the time t2
  3. Finally, the job collects all the responses and in turn responds back to all the bids canceling some of them and confirming the others.

So that’s that – the fairly straightforward 3-step procedure that converges in seconds or less. There are, of course, numerous and fine details. Like, for instance, how do we come up with the t1 and t2 estimates to balance the load of already started or starting jobs while not degrading the performance of those that may come up in the future. Or the question of how to optimally calibrate the bidding process at runtime. And so on.

Those are the types of details that are beyond the point of this specific text. The point is, utilize both the prior knowledge and the combined N-server resource. Utilize both to the maximum.

When object sets are dynamic

But what if the object sets are not static, as in:pod-next-oidCould we still optimize when the next object is not known until the very last moment it needs to be processed? Clearly, the problem immediately gets infinitely trickier. But not all hope is lost if we take another pause…pob-pause

– to experiment a bit and observe the picture:


Fig 1. Percentage of virtually snapshotted objects as a function of elapsed time.

No, this is not an eternal inflation of the multiverse with bubble nucleation via quantum tunneling. This is a tail distribution aka CCDF – the percentage of objects that remain local after a given interval of time – a cluster-wide survival function. To obtain that curve, a simple script would visit each server node and append its local object IDs to a timestamped log, in effect creating a virtual snapshot of the clustered objects at a given time.

Later, the same script uses the same log to periodically check on the (virtually snapshotted) objects. In the Fig. 1, the bump towards the end indicates reuploading of already evicted objects – the fact that strongly indicates suboptimal behavior of the system. Subsequently, after factoring-in environmental differences and gaussian noise (both causing the shape of CCDF to falter and flap across job runs and coffee breaks), we obtain this:


Fig 2. Survival function with noise

One observation that stands strong throughout the experiments is that an object almost never gets evicted during the first time talive interval of its existence inside the compute cluster. As time goes by, the chances to locate most of the original virtual snapshot in the cluster slowly but surely approach zero depicted as a fuzzy reddish area at the bottom of the Fig. 2.

The fork

Clearly, for any insight on the dynamic object locations, we need to have some form of virtual snapshotting. Introduce it as a cluster-wide periodic function, run it on every clustered node, broadcast the resulting compressed logs, etc.

But how often do we run it? Given an observed average lifecycle of a virtual snapshot T (Fig. 2), we could run it quite frequently – say, every max(T/10,000, 1min) time, to continuously maintain an almost precise picture of object locations (all 10 billion of those).

Alternatively, we could run it infrequently – say, every min(T/2, talive) – and derive the corresponding P(is-local) probabilities out of the approximated survival function.

Hence, the dilemma. The fork in the road.pob-pause


Since the “frequent” option is a no-brainer (and a costly one), let’s consider the “infrequent” option first, with the intention to make it slightly more frequent in the future, thus approximating the ultimate optimally-balanced compromise. Given an object’s OID and bunch of outdated virtual snapshots, we should be able to track the object to the last known time it was uploaded from far, far away. Let’s call this time toid:


Fig 3. Bernoulli trial

Next, notice that for any two objects A and B, and a sufficiently small interval dt the following would hold:pob-a-and-b-localThat is, because the objects that have similar local lifetimes within the compute cluster must also have similar chances of survival.

Further, select a sufficiently small dt (e.g., dt = (T – talive – tdead)/100) and, when a job starts executing, gradually gather an evidence denoted in the Fig. 3 as yes and no for: “object is local” and “not local”, respectively. Each small interval then automatically becomes an independent Bernoulli trial with its associated (and yet unknown) probability p(dt) and the Binomial PMF:pob-binomial-pmfThe question is how we can estimate the p for each or some of the 100 (or, configured number of) small intervals. One way to do it is via Bayesian posterior update that takes in two things:

  1. the prior probability p, and
  2. the Bernoulli trial evidence expressed as two integers: n (number of objects) and k (number of local objects),

and transforms all of the above into the posterior probability distribution function:pob-beta-distributionHere’s how this formula is working out. Recall that n and k are two counters that are being measured on a per dt basis. On the right side of the equation, we have the Bayes’ theorem that divides the prior probability (of k successes in the n Bernoulli trials) by the total probability (of k successes in the n Bernoulli trials). The prior in this case corresponds to our belief that the real probability equals p.

On the left side, therefore, is a posterior probability distribution function parametrized by p – a formula of the most basic type y = f(x) which can be further used to estimate the expected value and its variance, integrate across intervals (to approximate parts of the CDF), and so on.

Wrapping up

The  formula above happens to be the standard Beta distribution. Which means that from gathering per-interval statistics to estimating per-interval probabilities, there is a single step that sounds as follows: go ahead and look up the corresponding Wikipedia page. Arguably, that step can be performed just once, maybe two or (max) three times.

That helps. What helps even more is the general downtrend of the survival functions, from which one can deduce that “giving up” on finding the objects locally for a certain specific interval (ti, ti + dt) can be followed by abandoning ongoing efforts (if any) to compute the probabilities to the right of this ti.

Further, the notion of “giving up” vs “not giving up” will have to be defined as a threshold probability that is itself a function of the cumulative overhead of finding the objects locally, normalized by the number of successful finds.

However. It is incumbent upon us to wrap it up. Other fascinating details pertaining to the business of handling treasure troves of billions, trillions, and Graham numbers of objects will have to wait until they can be properly disclosed.

All in due time.