Lifecycle management

May 5, 2023May 5, 2023 Alex Aizman

There’s a set of topics in system management that can often be found under alternative subtitles:
“graceful termination and cleanup”, “shutting down and restarting”, “adding/removing members”, “joining and leaving cluster”, and similar.

Any discussion along those lines typically involves state transitions, so let’s go ahead and name the states (and transitions).

To put things in perspective, this picture is about a node (not shown) in an aistore cluster (not shown). Tracking it from the top downwards, first notice a state called “maintenance mode”. “Maintenance mode” constitutes maybe the most gentle, if you will, way of removing a node from the operating cluster.

When in maintenance, the node stops keep-alive heartbeats but remains in the cluster map and remains connected. That is, unless you disconnect or shut it down manually (which would be perfectly fine and expected).

Next is “shutdown”. Graceful shutdown can also be achieved in a single shot, as indicated by one of those curly arrows in the picture’s left: “online” => “shutdown”.

When in “shutdown”, a node can easily get back and rejoin the cluster at any later time. It’ll take two steps, not one – see the blue arrows on the picture’s right where RESTART must be understood as a deployment-specific operation (e.g., kubectl run).

Both “maintenance” and “shutdown” involve a certain intra-cluster operation called “global rebalance” (aka “rebalance”).

But before we talk about it in any greater detail, let’s finish with the node’s lifecycle. The third and final special state is “decommission”. Loosely synonymous with cleanup (a very thorough cleanup, as it were), “decommission” entails:

migrating all user data the node is storing to other nodes that are currently “online” – the step that’s followed by:
partial or complete cleanup of the node in question, whereby the complete cleanup further entails:
removing all AIS metadata, all configuration files, and – last but not least – user data in its entirety.

Needless to say, there’s no way back out of “decommission” – the proverbial point of no return. To rejoin the cluster, a decommissioned node will have to be redeployed from scratch, but then it would be a totally different node, of course…

Cluster

There’s one question that absolutely cannot wait: how to “terminate” or “cleanup” a cluster? Here’s how:

$ ais cluster decommission --rm-user-data --yes

The above command will destroy an existing cluster – completely and utterly, no questions asked. It can be conveniently used in testing/benchmarking situations or in any sort of non-production environment – see --help for details. It also executes very fast – Ctrl-C’s unlikely to help in case of change-of-mind…

Privileges

Full Disclosure: all lifecycle management commands and all associated APIs require administrative privileges. There are, essentially, three ways:

deploy the cluster with authentication disabled:

$ ais config cluster auth --json
    "auth": {
        "secret": "xxxxxxxx",
        "enabled": false
    }

use integrated AuthN server that provides OAuth 2.0 compliant JWT and a set of easy commands to manage users and roles (with certain permissions to access certain clusters, etc.);
outsource authorization to a separate, centralized (usually, LDAP-integrated) management system to manage existing users, groups, and mappings.

Rebalance

Conceptually, aistore rebalance is similar to what’s often called “RAID rebuild”. The underlying mechanics would be very different but the general idea is the same: user data massively migrating from some nodes in a cluster (or disks in an array) to some other nodes (disks), and vice versa.

In aistore, all the migration (aka “rebalancing”) that’s taking in place is the system response to a lifecycle event that’s already happened or is about to happen. In fact, it is the response to satisfy a singular purpose and a single location-governing rule that simply states: user data must be properly located.

Proper location

For any object in a cluster, its proper location is defined by the current cluster map and locally – on each target node – by the locally configured target’s mountpaths.

In that sense, the “maintenance” state, for instance, has its beginning – when the cluster starts rebalancing, and the post-rebalancing end, whereby the corresponding sub-state get recorded in a new version of the cluster map, which then gets safely distributed across all nodes, etc., etc.

Next section gives an example and further clarifies the notion of maintenance sub-states.

Quick example

Given a 3-node single-gateway cluster, we go ahead and shut down one of the nodes:

$ ais cluster add-remove-nodes shutdown <TAB-TAB>
p[MWIp8080]   t[ikht8083]   t[noXt8082]   t[VmQt8081]

$ ais cluster add-remove-nodes shutdown t[ikht8083] -y

Started rebalance "g47" (to monitor, run 'ais show rebalance').
t[ikht8083] is shutting down, please wait for cluster rebalancing to finish

Note: the node t[ikht8083] is _not_ decommissioned - 
it remains in the cluster map and can be manually
restarted at any later time (and subsequently activated via 'stop-maintenance' operation).

Once the command is executed, notice the following:

$ ais show cluster
...
t[ikht8083][x]   -   -   -   -   maintenance

At first, maintenance will show up in red indicating a simple fact that data is expeditiously migrating from the node (which is about to leave the cluster).

A visual cue, which is supposed to imply something like: “please don’t disconnect, do not power off”.

But eventually, if you run the command periodically:

$ ais show cluster --refresh 3

or a few times manually – eventually show cluster will report that rebalance (“g47” in the example) has finished and the node t[ikht8083] – gracefully terminated. Simultaneously, maintenance in the show output will become non-red.

The takeaway: global rebalance runs its full way before the node in question is permitted to leave. If interrupted for any reason whatsoever (power-cycle, network disconnect, new node joining, cluster shutdown, etc.) – rebalance will resume and will keep going until the governing condition is fully and globally satisfied.

Summary

lifecycle operation	CLI	brief description
maintenance mode	`start-maintenance`	The most lightweight way to remove a node. Stop keep-alive heartbeats, do not insist on metadata updates – ignore the failures. For advanced usage options, see `--help`.
shutdown	`shutdown`	Same as above, plus node shutdown (`aisnode` exit).
decommission	`decommission`	Same as above, plus partial (metadata only) or complete (both data and AIS metadata) cleanup. A decommissioned node is forever “forgotten” – removed from the cluster map.
remove node from cluster map	`ais advanced remove-from-smap`	Strictly intended for testing purposes and special use-at-your-own-risk scenarios. Immediately remove the node from the cluster and distribute updated cluster map with no rebalancing.
take node out of maintenance	`stop-maintenance`	Update the node with the current cluster-level metadata, re-enable keep-alive, run global rebalance. Finally, when all succeeds, distribute updated cluster map (where the node shows up “online”).
join new node (ie., grow cluster)	`join`	Essentially, same as above: update the node, run global rebalance, etc.

Assorted notes

Normally, a starting-up AIS node (aisnode) will use its local configuration to communicate with any other node in the cluster and perform what’s called self-join. The latter does not require a join command or any other explicit administration.

Still, the join command can be useful when, for instance, node is misconfigured or started as a standby.

When rebalancing, the cluster remains fully operational and can be used to read and write data, list, create, and destroy buckets, run jobs, and more. In other words, none of the listed lifecycle operations requires downtime. The persistent idea is that users never notice…

References

https://github.com/NVIDIA/aistore

A series of 3 (so far) blog posts on the topic of large-scale ETL and tools that are available

October 29, 2021 Alex Aizman

AIStore: an open system for petascale deep learning

April 14, 2021April 14, 2021 Alex Aizman

So tell me, what do you truly desire?

Lucifer Morningstar

Introduction

AIStore (or AIS) has been in development for more than three years so far and has accumulated a fairly long list of capabilities, all duly noted via release notes on the corresponding GitHub pages. At this stage, AIS meets common expectations to a storage solution – its usability, manageability, data protection, scalability and performance.

AIStore is a highly available partition-tolerant distributed system with n-way mirroring, erasure coding, and read-after-write consistency(*). But it is not purely – or not only – a storage system: it’ll shuffle user datasets and run custom extract-transform-load workloads. From its very inception, the idea was to provide an open-source, open-format software stack for AI apps – an ambitious undertaking that required incremental evolution via multiple internal releases and continuous refactoring…

AIS is lightweight, fully reliable storage that can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size. Prerequisites boil down to having a Linux and a disk. Getting started with AIS will take only a few minutes and can be done either by running a prebuilt all-in-one docker image or directly from the source.

AIS provides S3 and native APIs and can be deployed as fast storage (or a fast on-demand cache) in front of the 5 (five) supported backends (whereby AIS itself would be the number 6, respectively):

Review

The focus on training apps and associated workloads results in a different set of optimization priorities and a different set of inevitable tradeoffs. Unlike most distributed storage systems, AIS does not break objects into uniform pieces – blocks, chunks, or fragments – with the corresponding metadata manifests stored separately (and with an elaborate datapath to manage all of the above and reassemble distributed pieces within each read request, which is also why the resulting latency is usually bounded by the slowest link in the corresponding flow chart).

Instead, AIS supports direct (compute <=> disk) I/O flows with the further focus on (re)sharding datasets prior to any workload that actually must perform – model training in the first place. The idea was to make a system that can easily run massive-parallel jobs converting small original files and/or existing shards to preferred sharded formats that the apps in question can readily and optimally consume. The result – linear scalability under random-read workloads, as in:

total-throughput = N * single-disk-throughput

where N is the total number of clustered disks.

It is difficult to talk about a storage solution and not say a few words about actual I/O flows. Here’s a high-level READ diagram, which however does not show checksumming, self-healing, versioning, and scenarios (that also include operation in the presence of nodes and disks being added or removed):

Further

You can always start small. As long as there’s HTTP connectivity, AIS clusters can see and operate on each other’s datasets. One can, therefore, start with a single all-in-one container or a single (virtual) machine – one cluster, one deployment at a time. The resulting global namespace is easily extensible via Cloud buckets and other supported backends. Paraphrasing the epigraph, the true desire is to run on commodity Linux servers, perform close-to-data user-defined transforms, and, of course, radically simplify training models at any scale.

High performance I/O for large scale deep learning

ImageDecember 17, 2019January 8, 2020 Alex Aizman

Abstract – Training deep learning (DL) models on petascale datasets is essential for achieving competitive and state-of-the-art performance in applications such as speech, video analytics, and object recognition. However, existing distributed filesystems were not developed for the access patterns and usability requirements of DL jobs. In this paper, we describe AIStore, a highly scalable, easy-to-deploy […]

Finding the right storage system

August 3, 2018August 3, 2018 Alex Aizman1 Comment

What are the odds that you find the right storage system to run all the workloads that need to be running and to forever sustain all the apps that consume storage? Given the current state of the art, what would be your chances of doing so today?

Apart from being existential, the question also appears to be formalizable. Meaning that it can be approached with a certain degree of scientific rigor and scrupulousness. Of course, I’m talking about the Drake equation.

The Drake equation is a probabilistic argument used to estimate the number of active, communicative extraterrestrial civilizations in the Milky Way galaxy. Since its 1961 publication, and notwithstanding constructive criticism (see, for instance, “The Drake Equation Is Broken; Here’s How To Fix It“) the equation proved to be usable across wide range of application domains.

Perhaps the most popular application of Drake is estimating one’s chances to find true love (variations: girlfriend, boyfriend, soulmate, BFF).

Given a massive body of work on the subject I’ll have to limit citations to maybe just one: the 2008 paper “Why I don’t have a girlfriend” where Peter Backus clearly and succinctly explains a) how to apply Drake, and b) why at the time of the writing he didn’t have a girlfriend.

Back to storage systems though – equipped with Drake’s methodology, the question of finding the one and only storage system can be factorized (step 1), with the probabilities of each factor estimated (step 2), and fed into the equation (step 3) to generate an estimate. Consider a real-life example, whereby the storage system in question features:

SDS: software stack on commodity x86 servers
Price $0.10/GB or below that covers hardware + software + 5 years of support
Petascale clustering with 1PB usable today and growing to 100PB later in 2019
Storage media: hard drives
Linearity: must scale close to linear with each added node and each added spindle
HA: must be highly available
Tiering: unlimited
Data redundancy: both local (RAID1/5/6/10) and replicated
AI: built-in accelerations for AI workloads

A quick look at the above reveals the truth: the chance of locating a match is probably low.

Indeed, at the time of this writing, the Wikipedia’s computer storage category includes a total of about 150 companies. Even if we super-generously give each of the 9 factors an average 0.5 probability, the resulting (150 * 1/512) can only be interpreted as: oops, doesn’t exist.

There are numerous ways to confirm the first suspicion. For instance, given the limited dataset of 150, we could run Monte Carlo simulations by randomly selecting storage companies and filling-in samples, e.g.:

Storage company X.Y.Z.

SDS: No

Petascale: No

Price: Not cheap

Commodity x86: No

HA: Yes

Having accumulated enough samples to generate confidence intervals, we then compute the odds, find them depressingly low, and resort to finding some peace in the pure logic of this exercise, a consolation similar to the one Peter Backus found years ago.

We could also take a more conventional route which (conventionally) entails: reviewing datasheets, meeting with vendors, gathering references, reading Top-N IDC and Gartner reports, and then again reviewing the product datasheets. Here, for instance, are the top 13 companies from the IDC Worldwide Object Storage 2018 – we could take this report and painstakingly fill-in a 13-by-9 comparison table with those companies in the rows and the distinguishing characteristics in the columns (I will leave this as an exercise for the reader).

Source: The Register

Most likely, though, if nothing even remotely plausible comes up on the first day or two of screening, referencing, and filling out comparison tables, the chance (of finding the right storage system) must be considered extremely slim. The storage system that you need, seek, and depend upon may not exist.

Or, stated differently, the storage system that you really need may really not exist.

It’ll be hard to come to terms with this statement – the process will usually take anywhere between three to nine months. Eventually, though, the dilemma will crystallize as a simple binary choice: to build or not to build?

This is a tough choice to make. Here’s one possible answer. Why does it work for the 9 (nine) factors listed above and where does it go from there – that will be another story and a different time.

Neural Networks for Storage Games

June 21, 2017June 22, 2017 Alex Aizman

Even though machine learning is used extensively and successfully in numerous distinct areas, the idea to apply it to hardcore (fast) datapaths may seem farfetched. Which is also why I’m currently looking at the distributed storage use case where storage nodes employ different I/O load balancing strategies. Some of these nodes run neural networks, others – maybe a more conventional logic. Some of these strategies perform better than others, but only when they compete against their deterministic counterparts. The text below is only scratching the surface, of course:

Neural Networks for Storage Games (pdf)

Bayesian Optimization for Clusters

May 11, 2017May 30, 2017 Alex Aizman

What is the difference between storage cluster and multi-armed bandit (MAB)? This is not a trick question – to help answer it, think about this one: what’s common between MAB and Cloud configurations for big data analytics?

Henceforth, I’ll refer to these use cases as cluster, MAB, and Cloud configurations, respectively. The answer’s down below.

In storage clusters, there are storage initiators and storage targets that continuously engage in routine interactions called I/O processing. In this “game of storage” an initiator is typically the one that makes multi-choice load-balancing decisions (e.g., which of the 3 available copies do I read?), with the clear objective to optimize its own performance. Which can be pictured as, often rather choppy, diagram where the Y axis reflects a commonly measured and timed property: I/O latency, I/O throughput, IOPS, utilization (CPU, memory, network, disk), read and write (wait, active) queue sizes, etc.

Fig. 1. Choosy Initiator: 1MB chunk latencies (us) in the cluster of 90 targets and 90 initiators.

A couple observations, and I’ll get to the point. The diagram above is (an example of) a real-time time-series, where the next measured value depends on both the previous history and the current action. In the world of distributed storages the action consists in choosing a given target at a given time for a given (type, size) I/O transaction. In the world of multi-armed bandits, acting means selecting an arm, and so on. All of this may sound quite obvious, on one hand, and disconnected, on another. But only at the first glance.

In all cases the observed response carries a noise, simply because the underlying system is a bunch of physical machines executing complex logic where a degree of nondeterminism results, in part, from myriad micro-changes on all levels of the respective processing stacks including hardware. No man ever steps in the same river twice (Heraclitus).

Secondly, there always exists a true (latent, objective, black-box) function Y = f(history, action) that can potentially be inferred given a good model, enough time, and a bit of luck. Knowing this function would translate as the ability to optimize across the board: I/O performance – for storage cluster, gambling returns – for multi-armed bandits, $/IOPS ratio – for big data clouds. Either way, this would be a good thing, and one compelling reason to look deeper.

Thirdly and finally, a brute-force search for the best possible solution is either too expensive (for MAB, and Cloud configurations), or severely limited in time (which is to say – too expensive) for the cluster. In other words, the solution must converge fast – or be unfeasible.

The Noise

A wide range of dynamic systems operate in an environment, respond to actions by an agent, and generate serialized or timed output:

Fig. 2. Agent ⇔ Environment diagram.
This output is a function f(history, action) of the previous system states – the history – and the current action. If we could discover, approximate, or at least regress to this function, we could then, presumably, optimize – via a judicious choice of an optimal action at each next iteration…

And so, my first idea was: neural networks! Because, when you’ve got for yourself a new power, you better use it. Go ahead and dump all the cluster-generated time-series into neural networks of various shapes, sizes, and hyperparameters. Sit back and see if any precious latent f() shows up on the other end.

Which is exactly what I did, and well – it didn’t.

There are obstacles that are somewhat special and specific to distributed storage. But there’s also a common random noise that slows down the learning process, undermining its converge-ability and obfuscating the underlying system properties. Which is why the very second idea: get rid of the noise, and see if it helps!

To filter the noise out, we typically compute some sort of a moving average and say that this one is a real signal. Well, not so fast…

First off, averaging across time-series is a lossy transformation. The information that gets lost includes, for instance, the dispersion of the original series (which may be worse than Poissonian or, on the contrary, better than Binomial). In the world of storage clusters the index of dispersion may roughly correspond to “burstiness” or “spikiness” – an important characteristic that cannot (should not) be simplified out.

Secondly, there’s the technical question of whether we compute the average on an entire stream or just on a sliding window, and also – what would be the window size, and what about other hyperparameters that include moving weights, etc.

Putting all this together, here’s a very basic noise-filtering logic:Fig. 3. Noise filtering pseudocode.

This code, in effect, is saying that as long as the time-series fits inside the 3-sigma wide envelope around its moving average, we consider it devoid of noise and therefore we keep it as is. But, if the next data point ventures outside this “envelope”, we do a weighted average (line 3.b.i), giving it a slightly more credence and weight (0.6 vs 0.4). Which will then, at line 3.e, contribute to changing the first and the second moments based either on the entire timed history or on its more recent and separately configurable “window” (as in Fig. 1, which averages over a window of 10).

As a side: it is the code like this (and its numerous variations – all based on pure common sense and intuition) – that makes one wonder: why? Why are there so many new knobs (aka hyperparameters) per line of code? And how do we go about finding optimal values for those hyperparameters?..

Be as it may, some of the questions that emerge from Fig. 3 include: do we actually remove the noise? what is the underlying assumption behind the 3 (or whatever is configured) number of sigmas and other tunables? how do we know that an essential information is not getting lost in addition to, or maybe even instead of, “de-jittering” the stream?

All things being Bayesian

It is always beneficial to see a big picture, even at the risk of totally blowing the locally-observable one out of proportion ©. The proverbial big picture may look as follows:

Fig. 4. Bayesian inference.
Going from top to bottom, left to right:

A noise filtering snippet (Fig. 3) implicitly relies on an underlying assumption that the noise is normally distributed – that it is Gaussian (white) noise. This is likely okay to assume in the cluster and Cloud configurations above, primarily due to size/complexity of the underlying systems and the working force of the Central Limit Theorem (CLT).
On the other hand, the conventional way to filter out the noise is called the Kalman Filter, which, in fact, does not make any assumptions in re Gaussian noise (which is a plus).
The Kalman filter, in turn, is based on Bayesian inference, and ultimately, on the Bayes’ rule for conditional probabilities.
Bayesian inference (Fig. 4) is at the core of very-many things, including Bayesian linear regression (that scales!) and iterative learning in Bayesian networks (not shown); together with Gaussian (or normal) distribution Bayesian inference forms a founding generalization for both hidden Markov models (not shown) and Kalman filters.
On the third hand, Bayesian inference happens to be a sub-discipline of statistical inference (not shown) which includes several non-Bayesian paradigms with their own respective taxonomies of methods and applications (not shown in Fig. 4).
Back to the Gaussian distribution (top right in Fig. 4): what if not only the (CLT-infused) noise but the function f(history, action) itself can be assumed to be probabilistic? What if we could model this function as having a deterministic mean(Y) = m(history, action), with the Y=f() values at each data point normally distributed around each of its m() means?
In other words, what if the Y = f(history, action) is (sampled from) a Gaussian Process (GP) – the generalization of the Gaussian distribution to a distribution over functions:

Fig. 5. Gaussian Process.

There’s a fast-growing body of research (including already cited MAB and Cloud configurations) that answers Yes to this most critical question.
But there is more…
Since the sum of independent Gaussian processes is a (joint) GP as well, and
- since a Gaussian distribution is itself a special case of a GP (with a constant K=σ²I covariance, where I is the identity matrix), and finally,
  - since white noise is an independent Gaussian

– the noise filtering/smoothing issue (previous section) sort of “disappears” altogether – instead, we simply go through well-documented steps to find the joint GP, and the f() as part of the above.

When a Gaussian process is used in Bayesian inference, the combination yields a prior probability distribution over functions, an iterative step to compute the (better-fitting) posterior, and ultimately – a distinct machine learning technique called Bayesian optimization (BO) – the method widely (if not exclusively) used today to optimize machine learning hyperparameters.
One of the coolest things about Bayesian optimization is a so-called acquisition function – the method of computing the next most promising sampling point in the process that is referred to as exploration-versus-exploitation.
The types of acquisition functions include (in publication order): probability of improvement (PI), expected improvement (EI), upper confidence bound (UCB), minimum regret search (MRS), and more. In each case, the acquisition function computes the next optimal (confidence-level wise) sampling point – in the process that is often referred to as exploration-versus-exploitation.
The use of acquisition functions is an integral part of a broader topic called optimal design of experiments (not shown). The optimal design tracks at least 100 years back, to Kirstine Smith and her 1918 dissertation, and as such does not explain the modern Bayesian renaissance.
In the machine learning setting, though, the technique comes to the rescue each and every time the number of training examples is severely limited, while the need (or the temptation) to produce at least some interpretable results is very strong.
In that sense, a minor criticism of the Cloud configurations study is that, generally, the performance of real systems under-load is spiky, bursty and heavily-tailed. And it is, therefore, problematic to assume anything-Gaussian about the associated (true) cost function, especially when this assumption is then tested on a small sample out of the entire “population” of possible {configuration, benchmark} permutations.
Still, must be better than random. How much better? Or rather, how much better versus other possible Bayesian/GP (i.e., having other possible kernel/covariance functions), Bayesian/non-GP, and non-Bayesian machine learning techniques?
That is the question.

Only a few remaining obstacles

When the objective is to optimize or (same) approximate the next data point in a time-series, both Bayesian and RNN optimizers perform the following 3 (three) generic meta-steps:

Given the current state h(t-1) of the model, propose the next query point x(t)
Process the response y(t)
Update the internal state h(t+1)

Everything else about these two techniques – is different. The speed, the scale, the amount of engineering. The applicability, after all. Bayesian optimization (BO) has proven to work extremely well for a (constantly growing) number of applications – the list includes the already cited MAB and Cloud configurations, hyperparameter optimization, and more.

However. GP-based BO does not scale: its computational complexity grows as O(n³), where n is the number of sampled data points. This is clearly an impediment – for many apps. As far as storage clustering, the other obstacles include:

a certain adversarial aspect in the relationships between the clustered “players”;
a lack of covariance-induced similarity between the data points that – temporally, at least – appear to be very close to each other;
the need to rerun BO from scratch every time there is a change in the environment (Fig. 2) – examples including changes in the workload, nodes dropping out and coming in, configuration and software updates and upgrades;
the microsecond latencies that separate load balancing decisions/actions, leaving little time for machine learning.

Fortunately, there’s a bleeding-edge research that strives to combine the power of neural networks with the Bayesian power to squeeze the “envelope of uncertainty” around the true function (Fig. 6) – in just a few iterations:

Fig. 6. Courtesy of: Taking the Human Out of the Loop: A Review of Bayesian Optimization.

To be continued…

A Quick Note on Getting Better at Near Duplicates

May 1, 2017May 1, 2017 Alex Aizman

Finding near-duplicates on the web is important for several unrelated reasons that include archiving and versioning, plagiarism and copyright, SEO and optimization of storage capacity. It is maybe instructive to re-review the original groundbreaking research in the space – an eleven-year-old highly cited study of 1.6 billion distinct web pages.

What’s interesting about this work is that it takes two relatively faulty heuristics and combines them both to produce a third one that is superior.

Specifically, let A and B be the two heuristics in question, and (p1, p2) – a random pair of web pages that get evaluated for near-duplication (in one of its domain-specific interpretations). Each method, effectively, computes a similarity between the two pages – let’s denote this as sim-A(p1, p2) and sim-B(p1, p2), respectively.

Next, based on their own configured tunables A and B would, separately, determine whether p1 and p2 are near-duplicate:

duplicate-A(p1, p2) = sim-A(p1, p2) > threshold-A
duplicate-B(p1, p2) = sim-B(p1, p2) > threshold-B

The combined algorithm – let’s call it C – evaluates as follows from left to right:

duplicate-C(p1, p2) = duplicate-A(p1, p2) && sim-B(p1, p2) > threshold-C

where threshold-C is a yet another constant that gets carefully tailored to minimize both false negatives and false positives (the latter, when a given pair of different web pages is falsely computed as near-duplicate).

Easy to notice that, given enough time to experiment and large enough datasets, this can work for more than two unrelated heuristics that utilize minhash, simhash and various other popular near-duplication detection techniques.

Maybe not so obvious is another observation: given real-life time and space limitations, any near-duplication algorithm will have to remain information-lossy and irreversible. And since the information that is being lost cannot be compared in terms of duplication (or non-duplication), we cannot eliminate the scenarios of non-identical (and maybe even disjoint) subsets of the respective false positives.

Therefore, given algorithms A and B and their respective false positive sets FP-A and FP-B, we should always be able to come up with a combined heuristics C(A, B) that optimizes its own inevitable FP-C as a function of (FP-A, FP-B).

PS. A blog that (at the time of this writing) comes on top when searching for “near-duplicate detection”. Easy and illustrated.