AIS on NFS

It has long been said that aistore (AIS) scales linearly with no built-in limitation: combined throughput grows as (N+1)/N, where N and N+1 are, respectively, the total number of clustered disks before and after adding one more disk. (Going, say, from 100 disks to 101 ideally buys another 1% of aggregate throughput.)

In other words, the system is a universal aggregator of Linux machines (of any kind) that have disks (of any kind). That’s been the definition and state of the art for a long time.

Tempora mutantur, however: the times change, and so do we. AIS will now aggregate not only disks (or, not just disks) but also directories. Any file directories, assuming, of course, they are not nested one inside another. Any number of regular file directories that, ostensibly, may not even have underlying block devices.
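To make the “not nested” constraint concrete, here is a minimal check one could run over candidate directories; it is an illustrative sketch (hypothetical function names), not aistore’s actual validation logic:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// noNesting reports an error if any candidate mountpath directory is nested
// inside (or equal to) another one. Illustrative only; not aistore code.
func noNesting(dirs []string) error {
	for i, a := range dirs {
		for j, b := range dirs {
			if i == j {
				continue
			}
			ca, cb := filepath.Clean(a), filepath.Clean(b)
			if ca == cb || strings.HasPrefix(cb, ca+string(filepath.Separator)) {
				return fmt.Errorf("%q is nested under (or equal to) %q", b, a)
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(noNesting([]string{"/mnt/lustre1/ais", "/mnt/lustre2/ais"})) // <nil>
	fmt.Println(noNesting([]string{"/mnt/lustre1", "/mnt/lustre1/ais"}))     // error
}
```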

Here’s what has happened:

This is an AIS cluster sandwiched between two interfaces often referred to as frontend and backend. The cluster provides the former (which inevitably entails S3, but not only S3) and utilizes the latter to perform its fast-tiering function across storage vendors and solutions.

TL;DR

Long story short, the upcoming v3.23 removes the requirement that every aistore mountpath is a separate filesystem (“FS” in the diagram above) that owns a block device. In the past, this block device had to be present at a node’s startup and could not be used by (or shared with) any other mountpath.

Not anymore, though. Planned for late April or May 2024, the release will introduce an additional level of indirection called the “mountpath label.”
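To give a feel for the new indirection, here is a purely hypothetical sketch of a labeled mountpath; the type and field names below are illustrative and do not reproduce the actual aistore API:

```go
package main

import "fmt"

// Mountpath is an illustrative sketch, not the actual aistore type.
// An empty Label implies the pre-v3.23 contract: the mountpath is a separate
// filesystem that owns its own block device, which aistore resolves and
// monitors. A non-empty label, e.g. "lustre1", marks a "diskless" mountpath:
// any regular directory, such as an NFS or Lustre mount, taken as-is.
type Mountpath struct {
	Path  string
	Label string
}

func main() {
	mp := Mountpath{Path: "/mnt/lustre1/ais", Label: "lustre1"} // hypothetical values
	fmt.Printf("%+v\n", mp)
}
```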

Speaking of which: the AIS mountpath, admittedly not yet a household name, goes way back. In fact, it goes all the way back; the abstraction shows up in one of the very first commits, circa 2017. Tempora mutantur…

To have more control

The original motivation was the usual one that drives most software abstractions – to have more control. Control, as it were, over the two most basic aspects of storage clustering:

  • content distribution, which must remain balanced across storage devices (see the sketch right after this list), and
  • execution parallelism, which must scale proportionally with each added device.
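For the first point, aistore relies on consistent hashing of the HRW (highest random weight, a.k.a. rendezvous) variety to map object names onto mountpaths. The following is a minimal sketch of the technique, not the actual aistore implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// mountpath is an illustrative stand-in for an aistore mountpath,
// not the actual aistore type.
type mountpath struct {
	Path string
}

// hrwSelect picks the mountpath with the highest hash of (objectName, mountpath).
// Rendezvous hashing keeps the distribution balanced and minimally disrupted
// when mountpaths are added or removed.
func hrwSelect(objName string, mpaths []mountpath) mountpath {
	var (
		winner mountpath
		maxH   uint64
	)
	for _, mp := range mpaths {
		h := fnv.New64a()
		h.Write([]byte(objName))
		h.Write([]byte(mp.Path))
		if v := h.Sum64(); v >= maxH {
			maxH, winner = v, mp
		}
	}
	return winner
}

func main() {
	mpaths := []mountpath{{"/ais/mp1"}, {"/ais/mp2"}, {"/ais/mp3"}}
	fmt.Println(hrwSelect("bucket/object-0001", mpaths))
}
```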

And so, that’s how it was done and then remained. For a very long time, actually.

  • 1 mountpath root
  • 1 filesystem
  • 1 undivided block device – ideally, a local HDD or SSD. Or, at the very least, a hardware RAID (the sketch below spells out the resulting “no sharing” invariant).
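Under the 1-1-1 schema, the invariant essentially boils down to “no two mountpath roots resolve to the same underlying device.” A minimal Linux-only sketch of such a check (illustrative, not aistore’s actual validation):

```go
package main

import (
	"fmt"
	"syscall"
)

// uniqueDevices returns an error if any two mountpath roots resolve to the
// same underlying device (st_dev), i.e., if they share a filesystem.
func uniqueDevices(mpaths []string) error {
	seen := make(map[uint64]string, len(mpaths))
	for _, mp := range mpaths {
		var st syscall.Stat_t
		if err := syscall.Stat(mp, &st); err != nil {
			return fmt.Errorf("stat %s: %w", mp, err)
		}
		if prev, ok := seen[uint64(st.Dev)]; ok {
			return fmt.Errorf("%s and %s share the same device", prev, mp)
		}
		seen[uint64(st.Dev)] = mp
	}
	return nil
}

func main() {
	fmt.Println(uniqueDevices([]string{"/ais/mp1", "/ais/mp2"}))
}
```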

Needless to say, this 1-1-1 schema had a lot going for it. It still does. Simplicity, symmetry, elegance – you name it. But there’s one little problem: it’s limiting, deployment-wise. How do you deploy an aistore when all you have is a remote file share?

Yes, I know: there’s always loopback. But that’s not a good answer. It’s a last-resort answer, a last-ditch-effort-when-all-hope-is-gone type of compromise, and that’s unfortunate.

Hence, v3.23. When it gets released, aistore will deploy on pretty much anything. On Lustre, for instance.

The simplified diagram depicts Lustre-based and labeled mountpaths that appear to be diskless. Indeed, lsblk would show nothing (relevant, that is); df, on the other hand, will show something:
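Something along these lines, for illustration only; the server addresses, sizes, and mount points below are hypothetical and will differ per deployment:

```console
$ df -h /mnt/lustre1/ais /mnt/lustre2/ais
Filesystem          Size  Used Avail Use% Mounted on
10.0.0.1@tcp:/lfs1  512T  128T  384T  25% /mnt/lustre1
10.0.0.2@tcp:/lfs2  512T  130T  382T  26% /mnt/lustre2
```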

There’s also the stuff this particular diagram doesn’t show:

  • multiple AIS gateways providing independent access points;
  • storage nodes, a.k.a. targets, possibly with multiple network interfaces (multihoming) to increase throughput;
  • Lustre servers on the other side of the Top-of-Rack switch…

And more. Because the point is different: there are no disks. Instead, there are labels, presumably indicating different NFS endpoints: “lustre1”, “lustre2”, and so on.

What’s in a label

So, what’s the motivation behind “mountpath labels”? In the end, it’s the ability to deliver functionality that can be configured per individual bucket while remaining, if you will, orthogonal to the buckets themselves.

More precisely:

Full disclosure: out of the five-item list above, item 2 (storage class) and item 3 (parallelism multiplier) are not done yet, and whether either will be done, and how soon, is not clear at this point.

However. An empty or missing label is perfectly fine; in fact, it remains the recommended option for production deployments. No label simply means that aistore itself will resolve the underlying device, ensure it is not shared with any other mountpath, keep reporting the device’s statistics (via Prometheus/Grafana and logs), and use those statistics for read load-balancing at runtime.
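As a rough illustration of that last point, read load-balancing across n-way mirrored copies amounts to picking the copy that resides on the least-utilized mountpath. A toy sketch, with made-up types and utilization numbers, not aistore’s internal implementation:

```go
package main

import "fmt"

// replica is an illustrative pairing of an object copy with the current
// utilization of the mountpath (disk) it resides on.
type replica struct {
	mountpath   string
	utilization int // percent busy, as sampled from device statistics
}

// pickLeastBusy returns the copy residing on the least-utilized mountpath;
// assumes at least one copy exists.
func pickLeastBusy(copies []replica) replica {
	best := copies[0]
	for _, c := range copies[1:] {
		if c.utilization < best.utilization {
			best = c
		}
	}
	return best
}

func main() {
	copies := []replica{
		{"/ais/mp1", 72}, // hypothetical utilization numbers
		{"/ais/mp2", 18},
	}
	fmt.Println("read from:", pickLeastBusy(copies).mountpath)
}
```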

What’s next

Maybe a class of storage? It’s a cool feature. After all, Amazon has 11 of those and counting.

But for a product like aistore, the immediate question would be: why? The software deploys ad hoc on arbitrary Linux machines with no limitations. Further, it populates itself on demand (upon the first training epoch) and/or upon request, in any of the multiple supported ways.

Why would you then have one heterogeneous cluster instead of, say, two separate clusters, one dedicated cluster per “storage class”, each self-populating from remote data sources and seeing the other’s datasets in one combined global namespace of all datasets?

That is the question.