Hyperconvergence: the Litmus Test

What’s in the name

It is difficult to say when exactly the “hyper” prefix was first attached to “converged” infrastructure. Hyper-convergence, also known as hyperconvergence, ultimately means a fairly simple thing: running user VMs on virtualized and clustered storage appliances. The latter in turn collaborate behind the scenes to a) pool the combined local storage resources in order to then b) share the resulting storage pool between all (or some) of the VMs and c) scale out the workload those VMs generate.

That’s the definition, with an emphasis on storage. A bit of a mouthful, but ultimately simple if one just pays attention to a), b) and c) above. Players in the space include Nutanix, SimpliVity and a few other companies on one hand, and VMware on the other.

The virtualized storage appliance that does the storage pooling, sharing and scaling is a piece of software, often dubbed a “storage controller,” that is often (but not always) packaged as a separate VM on each clustered hypervisor. The inquisitive reader will note that, unlike Nutanix et al., VMware’s VSAN is part and parcel of ESXi’s own operating system, an observation that only reinforces the point, or rather the question, which is:

True Scale-Out

True scale-out has always been, and will likely remain, the top challenge. What is “true” scale-out, as opposed to limited scale-out or apparent scale-out? True scale-out is the capability to scale close to linearly under all workloads. In other words, it is the unconditional capability to scale: add a new node to an existing cluster of N nodes, and you are supposed to run approximately (N + 1)/N times faster. Workloads must not even be mentioned with respect to scale-out capabilities, and if they are, there must be no fine print.
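To make that arithmetic concrete, here is a quick back-of-the-envelope sketch (with made-up throughput numbers) of how measured scaling could be checked against the ideal (N + 1)/N expectation:

```python
def ideal_speedup(n_nodes: int, added: int = 1) -> float:
    """Expected throughput ratio when a cluster of n_nodes grows by `added` nodes,
    assuming true (linear) scale-out."""
    return (n_nodes + added) / n_nodes

def scaling_efficiency(tput_before: float, tput_after: float,
                       n_nodes: int, added: int = 1) -> float:
    """Measured speedup divided by the ideal speedup; 1.0 means perfectly linear."""
    return (tput_after / tput_before) / ideal_speedup(n_nodes, added)

# Hypothetical numbers: a 3-node cluster sustains 120K IOPS; with a 4th node, 150K IOPS.
print(ideal_speedup(3))                          # 1.33... (the (N + 1)/N expectation)
print(scaling_efficiency(120_000, 150_000, 3))   # ~0.94: close to, but short of, linear
```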

Needless to say, developers, myself included, will use every trick in the book and beyond, first and foremost to optimize the workloads that represent (at the time of development) one, two, or a few major use cases. Many of those tricks and techniques have been known for decades and generally relate to the art (or rather, the craft) of building distributed storage systems.

(Note: speaking of distributed storage systems, a big and out-of-scope topic in and of itself, I will cover some of its aspects in a presentation at the upcoming SNIA SDC 2015 and will post the link here, maybe with some excerpts as well.)

But there is one point, one difference, that sets hyperconverged clusters apart from the distributed storage legacy: locality of the clients. With the advent of hyperconverged solutions and products, the term “local” acquired a whole new dimension, simply because each and every one of the VMs that generate user traffic runs local to (almost) exactly 1/Nth of the clustered storage resources.

Conceptually, in that sense, clustered volatile and persistent memory is stacked as follows, with the top tier represented by the local hypervisor’s RAM and the lowest tier by remote hard drives.

[Figure: hyperconverged tiering, from the local hypervisor’s RAM at the top down to remote hard drives at the bottom]

(Note: intra-rack vs. inter-rack tiering, utilizing remote RAM to store copies or stripes of hot data, and a few other possible tiering designations are intentionally simplified out of the picture, as sophisticated and not very typical today.)
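For illustration, here is a minimal sketch of that tiering; the ordering of the middle tiers is my own assumption (only the endpoints, local RAM on top and remote hard drives at the bottom, come from the description above), and the names are shorthand rather than any vendor’s terminology:

```python
# Conceptual memory/storage tiers of a hyperconverged node, fastest to slowest.
# Middle-tier ordering is an assumption; only the top (local RAM) and the bottom
# (remote HDD) are fixed by the description above.
TIERS = (
    "local RAM",    # the hypervisor's own memory: caches, write-back buffers
    "local SSD",    # flash on the node that runs the VM
    "remote SSD",   # flash on neighboring nodes, reached over the cluster network
    "local HDD",    # spinning disks on the same node
    "remote HDD",   # spinning disks on neighboring nodes: the lowest tier
)

def tiers_below(tier: str):
    """All tiers at and below the given one, in top-down (de-staging) order."""
    return TIERS[TIERS.index(tier):]

print(tiers_below("local SSD"))
```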

Adding a new node to a hyperconverged cluster and placing a new bunch of VMs on it may appear to scale if much of the new I/O traffic remains local to this new node, utilizing its new RAM, its new drives and its new CPUs. De-staging (the picture above) would typically, and optimally, translate into the gradual propagation of copies of aggregated, compressed, and fairly large logical blocks across the conceptual layers, with the neighboring nodes’ SSDs utilized first.

But then again, we would be talking about an apparent and limited scale-out that performs marginally better but inevitably costs much more, CapEx- and OpEx-wise, than any run-of-the-mill virtual cluster.

Local Squared

A great number, likely a substantial majority, of existing apps still use legacy non-clustered filesystems and databases that utilize non-shared LUNs. When deploying those apps inside VMs, system administrators simply provision one, two, or several vdisks/vmdks for exclusive (read: local, non-shared) use by the VM itself.

Here again the aforementioned aspect of hyperconverged locality comes out in force: a vendor will make doubly sure to keep at least the caching and write-logging “parts” of those vdisks very local to the hypervisor that runs the corresponding VM. Rest assured, the best of the vendors will apply the same “local-first” policy to the clustered metadata as well.

Scenarios: probabilistic, administrative and cumulative

The idea, therefore, is to come up with workloads that break the (indeed, highly motivated) locality of generated I/Os by forcing the local hypervisor to write to and read from remote nodes. And not in a deferred way, but inline, as part of the actual application-to-persistent-storage datapath, with the client that issued the I/Os waiting for completions on the other side of the I/O pipeline.

An immediate methodological question would be: Why? Why do we need to run some abstracted workload that may in no way be similar to the one produced by the intended apps? There are at least 3 answers to this one, which I will denote as, say, probabilistic, administrative and cumulative.

Real-life workloads are probabilistic in nature: the resulting throughput, latencies and IOPS are distributed around their respective averages in complex ways that, outside pristine lab environments, are very difficult to extrapolate with bell curves. Most of the time, a given workload may actively be utilizing a rather small percentage of the total clustered capacity (the term for this is the “working set”: the portion of the total data accessed by the app at any given time). Most of the time, but not all the time: one typical issue for users and administrators is that a variety of out-of-bounds situations require many hours, days or even weeks of continuous operation to present themselves. Other examples include bursts of writes or reads (or both) that are two or more times longer than a typical burst, and so on.

Administrative maintenance operations exacerbate matters further. Take, for example, the deletion of older VM images, which must happen from time to time on any virtualized infrastructure: when this seemingly innocuous procedure is carried out concurrently with user traffic, long-cold data and metadata sprawled across the entire cluster suddenly come into play, in force.

Cumulative complications further include fragmentation of the logical and physical storage spaces, which is inevitable after multiple delete and rewrite operations (of user data, snapshots, and user VMs), as well as the gradual accumulation of sector errors. This in turn gradually increases the ratio of one, or maybe both, of the otherwise unlikely scenarios: random access when reading and writing sequential data, and read-modify-writes of the metadata responsible for tracking the fragmented “holes”.

Related to the above is the compression+rewrite scenario: compression of aggregated, sequentially written logical blocks, whereby the application updates, possibly sequentially as well, its (application-level) smaller blocks, thus forcing the storage subsystem to read-modify-write (or, more exactly, read-decompress-modify-compress and finally write) yet again.
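For illustration only, here is a minimal sketch of that read-decompress-modify-compress-write cycle, using zlib and made-up block sizes; a real storage controller does this in its own on-disk format, entirely outside application code:

```python
import zlib

AGG_BLOCK = 64 * 1024   # made-up size of an aggregated, compressed logical block
APP_BLOCK = 4 * 1024    # made-up size of the application's own block

# A compressed 64K logical block already sitting on persistent storage (all zeros here).
stored = zlib.compress(bytes(AGG_BLOCK))

def rewrite_app_block(compressed_block: bytes, offset: int, payload: bytes) -> bytes:
    """Overwrite one application-level block inside a compressed aggregated block:
    the small write turns into read + decompress + modify + compress (+ write-back)."""
    logical = bytearray(zlib.decompress(compressed_block))   # read + decompress
    logical[offset:offset + len(payload)] = payload          # modify 4K in place
    return zlib.compress(bytes(logical))                     # re-compress; caller writes it back

stored = rewrite_app_block(stored, offset=8 * APP_BLOCK, payload=b"x" * APP_BLOCK)
print(len(stored), "bytes rewritten to persist a", APP_BLOCK, "byte application update")
```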

Critical Litmus Test: 3 simple guidelines

The question, therefore, is how to quickly exercise the most critical parts of the storage subsystem and test for true scale-out capability. The long and short of it boils down to the following simple rules:

1. Block size

The block size must be small, preferably 4K or 8K. A cluster that scales under application-generated small blocks will likely scale under medium blocks (32K to 64K, some say 128K) and large blocks (1M and beyond) as well. On the other hand, a failure to scale under small blocks will likely manifest as future problems in handling some or all of the scenarios mentioned above (and some that are not mentioned as well).

On the third hand, a demonstrated scale-out capability on medium and large blocks says nothing about the chances of handling the tougher scenarios; in fact, there would be no correlation whatsoever.

2. Working set

The working set must be set to at least 60% of the total clustered capacity, and to at least 5 times the RAM of any given node. That is, MAX(60% of total capacity, 5x node RAM).

The reason for this is that it is important to exercise the cluster under workloads that explicitly prevent “localization” of their respective working sets.
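A back-of-the-envelope check of this sizing rule, with made-up cluster numbers:

```python
def working_set_gb(total_capacity_gb: float, node_ram_gb: float) -> float:
    """Working set per the rule above: MAX(60% of total clustered capacity, 5x node RAM)."""
    return max(0.6 * total_capacity_gb, 5 * node_ram_gb)

# Hypothetical 4-node cluster: 40 TB of usable capacity, 256 GB of RAM per node.
print(working_set_gb(40_000, 256))   # 24000.0 GB: the 60%-of-capacity term dominates
```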

3. I/O profile

Finally, random asynchronous I/Os with 100/0, 80/20 and 50/50 write/read ratios respectively, carrying a pseudo-random payload that is either not compressible at all or compressible at a ratio of 1.6 or below.
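As a side note, here is one way (a rough sketch, not any benchmark’s built-in mechanism) to produce a payload whose compressibility stays at or below roughly 1.6, by mixing incompressible random bytes with a compressible filler and sanity-checking the result with zlib:

```python
import os
import zlib

def make_payload(size: int, target_ratio: float = 1.6) -> bytes:
    """Rough sketch: fill ~1/target_ratio of the buffer with random (incompressible)
    bytes and the rest with zeros, so that size / len(compressed) lands near target_ratio."""
    random_part = int(size / target_ratio)
    return os.urandom(random_part) + bytes(size - random_part)

block = make_payload(8 * 1024)
print(f"compressibility ~ {len(block) / len(zlib.compress(block)):.2f}")  # roughly 1.6
```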

That will do it, in combination with vdbench or any other popular micro-benchmark capable of generating pedal-to-the-metal load that follows these 3 simple rules. Make sure not to bottleneck on CPUs and/or network links, of course, and run it for at least a couple of hours on at least two cluster sizes, one representing a scaled-out version of the other.
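If vdbench is the tool of choice, a parameter-file sketch along these lines is roughly what I have in mind; the LUN path, thread count and run length are placeholders, the parameter names should be double-checked against the vdbench documentation for your version, and the run would be repeated with rdpct=20 and rdpct=50 for the other two mixes:

```
# sketch only: 4K random writes, payload compressible at ~1.6, pedal-to-the-metal rate
compratio=1.6
sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=64
wd=wd1,sd=sd*,xfersize=4k,seekpct=100,rdpct=0
rd=run1,wd=wd1,iorate=max,elapsed=7200,interval=10
```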

Final Remarks

Stating maybe the obvious now: when benchmarking, it is important to compare apples to apples: same vendor, same hardware, same software, and the same workload scaled out proportionally across two different cluster sizes, e.g. 3 nodes and 6 nodes, for starters.

The simple fact that a given hyperconverged platform utilizes the latest and greatest SSDs may contribute to performance that looks very good in isolation. In comparison, and under the stress described above, maybe not so much.

One way to view the current crop of hyperconverged products: a fusion of distributed storage on one hand and hybrid/tiered storage on the other. This “fusion” is always custom, often ingenious, with tiering hidden inside to help deliver performance and assist (let’s put it this way) with scaling out.

HyperConverged Storage Directions — How to get users down the path

I wanted to title this “Hypo Diverged Storage,” as there are seemingly many ideas in the industry of what “hyper converged” storage and architectures are. Targeted at developers and solution providers alike, this piece strives to combine a string of ideas about where I think we want to go with our solutions, and I hope to dispel some arguments that are assumptive or techno-religious in nature. Namely:
  • What is really necessary for a strong foundational technology/product
  • Emulate the service architectures
  • It just needs to be good enough
  • Containers: Just build out the minimum and rely on VMs/hypervisors to handle the rest
  • You must build fresh cloud native, immutable applications only
In my role at Stanford Electrical Engineering I have numerous interactions with researchers and vendors, including research projects that have since become formative infrastructure companies in the industry. My hand in many of these has been slight, but we have had more effect in guiding the build-out of usable solutions, taking many into production use first at our facilities. Among the many projects over time, we have dealt with the beginnings and maturation of virtualization in compute, storage, and networking. Today’s emerging on-prem and cloud infrastructure takes as a given that we need all three. Let’s target the assumptions.
1) To build out a production-quality solution that works well in the ecosystem, you just need to make a killer (storage/compute/network) appliance or product.
In many discussions, people may notice a mantra: most companies rightfully focus on building technology from a position of strength, such as storage or networking, but miss out when they only partially integrate with the other legs of the virtualization stool. They rush to market but fail to execute well or deliver what end users actually want, because the integration is half-baked at best. I’ve worked with great storage products that focus on network access alone but can’t deal with bare-metal container hypervisors; SDN products that can help make OpenStack work well for tenants but ignore the storage component that itself sits on the network; storage products that require dedicated network interfaces for I/O, with no awareness of virtualized interfaces; and hypervisors that use networks for storage but can’t themselves hide transient network failures from the guests they host. These are just examples, but they tell us that it’s hard to get it all right.
Although it’s a lot for any going concern to address, it is necessary for products and solutions to have well-thought-out scale-up and scale-out approaches that address all three legs of the stool. One could argue that orchestration is the fourth leg, and that it might need to be taken into account as well. Regardless, a focus on providing an integrated storage/networking/virtualization solution in the simpler cases, and building from there, is more advantageous to customers looking to grow with these solutions long term.
2) To build a successful hyper converged product, you need to emulate how Google/AWS/etc. build out their cloud infrastructure.
I’d argue that most potential customers won’t necessarily want to pick a random vendor to bring AWS-like infrastructure on site. The economics and operational complexity would likely sway them to just host their products on AWS or similar, regardless of how much performance is lost in the many layered services Amazon offers. In most cases, mid-enterprise and larger customers have already adopted various solutions, thus necessitating a brownfield approach to adopting hyper converged solutions.
This gets to one important belief I have: regardless of your solution, a site should be able to bootstrap its use with a minimal number of initial nodes (1 or more), truly enable migration of workloads onto it, and, with time and confidence, allow the customer to grow out their installation. This pushes out another assumption, namely that the hardware is uniform. With gradual rollouts, you should expect variation in the hardware as a site goes from proof of concept, to initial rollout, and finally to full deployment. All the while, management of the solution should allow the use of existing networks where possible and the phase-out or upgrade of nodes over time. There is no “end of deployment”; it is more of a continuous journey between the site and the solutions vendor.
You’ll notice a theme: bootstrapping the initial storage and networking, and doing so simply, is more important than emulating the full stack from the get-go. Don’t try to be Amazon; just make sure you have a strong foundation of storage/compute/networking so that you can span multiple installations and even pre-existing solutions like Amazon.
3) Just get the product in the door with good initial bootstrapping.
Extending from the above, simply bootstrapping is not enough. Supporting a product tends to kill many startups and even seasoned companies, but it really bedevils startups as they strive to win initial market acceptance for their investors. In the end, a hyper converged solution is a strong relationship between vendor(s) and customer. Deployment should be considered continuous from the start, so the ability to add, maintain, and upgrade the product is not only central to the relationship, it informs the product design. Consider what hyper converged truly means: each node is self-sufficient, but any workload should be able to migrate to another node, and the loss of any given node should not overly strain the availability of the infrastructure.
I don’t want to take too much away from the focus on ease of initial deployment; that can win customers over. However, keeping users of converged infrastructure happy with the relationship requires happiness with the overall architecture, not just the initial technology.
4) Containers can run well enough in virtual machines. Just build on top of VMs and then layer our storage and network integrations on top.
VMs alone are no longer the target deployment; applications and/or app containers are. In the rush to enable container deployment, vendors take the shortcut and build on top of existing VM infrastructure. It’s worse with existing hyperconverged solutions that have heavy infrastructure requirements to support VMs. If one now sees lightweight containers as the target, one’s assumptions about what makes hyperconverged infrastructure change. Merely adopting existing, mature VM infrastructure becomes its own issue.
I love the work that Joyent has done in this space. They have argued their case well, using mature container technology not found in Linux, and have brought Linux and Docker into their infrastructure. What makes them better than the present competition is that they run their containers on bare metal, right on top of virtualized storage (ZFS-backed) and maturing network virtualization. The solution works wonders for those operating the tech as an IaaS. Market acceptance of a non-native Linux solution, or of some of the present non-clustered FS components, may limit the appeal of excellent technology. At the same time, their primary market being the public cloud places them in the shadow of the larger players there.
A Joyent-like technological approach, married with full SDN integration, could be the dream product if actually productized. Already, the tech shows the assumption of having to build on VMs to be a weak proposition. What VM infrastructure has, and containers need for wider adoption, is SDN, namely overlay networks. If we enable not only pooled storage but pooled networking, we gain better utilization of the network stack, inherent path redundancy, and, via an SDN overlay, a safe way to build tenant storage networks arbitrarily over multiple physical networks, perhaps across multiple vendors’ data centers. Think how this serves customers’ potential container mobility.
5) Docker is containers / Containers need to be immutable by design for a scale-out architecture
All of the above leads to this point. Drawing on past experience and intuition, I think that what holds back adoption of next-generation infrastructure throughout the industry is, in fact, the religious argument around immutability. The workloads that could make use of the better solutions coming to market require some level of persistence per node/container. Though live migration may not be warranted in many cases, the ability to upgrade a container, vertically scale it, or run “legacy” applications under these new operational approaches will be required of a mature container solution. We just haven’t gotten there yet. A scale-out infrastructure can allow for persistent-type containers, as long as end users architect for the failure modes expected of this application profile.
What matters to customers is the ability to use the build-out and scale-out nature of the cloud, with known data and operational redundancy and recovery at its core. In many cases, the workloads the market presents may not yet fit into the Docker world, but they can definitely be containerized and virtualized on top of hyper converged storage solutions.
If you believe that these are the target users, who don’t necessarily need to develop the whole solution with a ground-up “cloud” architecture, don’t wish to run on the public cloud in all cases, and need a way to onboard cloud technology into the present organization in a progressive way, you will see where we all collectively need to go. It’s not so much a hyper converged world as an inviting, enabling solution space that starts simple (perhaps one compute+storage node) and scales out non-uniformly, using SDNs to tie in the nodes as they come and go. Customers can start small and grow with you as you build out better, more complete infrastructure.
Forward thoughts:
My experience shows that the IT workplace is changing, and those managing data centers need to adjust and adopt the skill sets of cloud operators. There is pressure, which I’ve seen in many organizations, to convert both the business and the workforce to cloud technology. The opposing forces are institutional resistance (or requirements) against adopting the public cloud, or critical workloads that can’t trivially be converted to Docker-style immutable containers. That is a present market condition needing solutions. A standalone product won’t work, but a long-term approach that grows with an organization through this IT landscape change will serve both customers and solution providers well.