A Quick Note on Getting Better at Near Duplicates

Finding near-duplicates on the web is important for several unrelated reasons, including archiving and versioning, plagiarism and copyright, SEO, and optimization of storage capacity. It is perhaps instructive to re-review the original groundbreaking research in the space – an eleven-year-old, highly cited study of 1.6 billion distinct web pages.

What’s interesting about this work is that it takes two relatively faulty heuristics and combines them to produce a third one that is superior.

Specifically, let A and B be the two heuristics in question, and (p1, p2) – a random pair of web pages being evaluated for near-duplication (in one of its domain-specific interpretations). Each method effectively computes a similarity between the two pages – let’s denote these sim-A(p1, p2) and sim-B(p1, p2), respectively.

Next, based on their own configured tunables, A and B would separately determine whether p1 and p2 are near-duplicates:

  • duplicate-A(p1, p2) = sim-A(p1, p2) > threshold-A
  • duplicate-B(p1, p2) = sim-B(p1, p2) > threshold-B

The combined algorithm – let’s call it C – evaluates as follows from left to right:

  • duplicate-C(p1, p2) = duplicate-A(p1, p2) && sim-B(p1, p2) > threshold-C

where threshold-C is yet another constant, carefully tailored to minimize both false negatives and false positives (the latter being the case when a pair of genuinely different web pages is falsely reported as a near-duplicate).
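For concreteness, here is a minimal sketch of the combined predicate. The two similarity functions below are stand-in Jaccard measures (the study’s actual heuristics – and the threshold values – were different; everything here is illustrative):

    def shingles(text: str, k: int = 4) -> set:
        """k-word shingles of a page's text."""
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(s1: set, s2: set) -> float:
        return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0

    # Two stand-in heuristics; the real A and B could be, e.g., minhash- and
    # simhash-based estimators. Both return a similarity in [0, 1].
    def sim_a(p1: str, p2: str) -> float:
        return jaccard(shingles(p1, 4), shingles(p2, 4))

    def sim_b(p1: str, p2: str) -> float:
        return jaccard(set(p1.split()), set(p2.split()))

    THRESHOLD_A = 0.9    # illustrative value
    THRESHOLD_C = 0.75   # illustrative value

    def duplicate_a(p1: str, p2: str) -> bool:
        return sim_a(p1, p2) > THRESHOLD_A

    def duplicate_c(p1: str, p2: str) -> bool:
        # left to right: sim-B is only computed for the pairs that A accepts
        return duplicate_a(p1, p2) and sim_b(p1, p2) > THRESHOLD_C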

It is easy to notice that, given enough time to experiment and large enough datasets, this can work for more than two unrelated heuristics that utilize minhash, simhash, and various other popular near-duplicate detection techniques.

Maybe not so obvious is another observation: given real-life time and space limitations, any near-duplicate detection algorithm will have to remain information-lossy and irreversible. And since the information that gets lost cannot itself be compared for duplication (or non-duplication), we cannot rule out scenarios where the respective sets of false positives are non-identical, and maybe even disjoint.

Therefore, given algorithms A and B and their respective false-positive sets FP-A and FP-B, we should always be able to come up with a combined heuristic C(A, B) that optimizes its own inevitable FP-C as a function of (FP-A, FP-B).

PS. A blog post that (at the time of this writing) comes up on top when searching for “near-duplicate detection”. Easy and illustrated.

Learning to Learn by Gradient Descent with Rebalancing

Neural networks, as the name implies, comprise many little neurons. Often, multiple layers of neurons. How many? Quick googling on the “number of layers” or “number of neurons in a layer” leaves one with a strong impression that there are no good answers.

The first impression is right. There are a ton of recipes on the web, with the most popular and often-repeated rules of thumb boiling down to “keep adding layers until you start to overfit” (Hinton) or “until the test error does not improve anymore” (Bengio).

Part I of this post stipulates that selecting the optimal neural network architecture is – or rather, can be – a search problem. There are techniques to do massive searches. Training a neural network (NN) can be counted as one such technique, where the search target belongs to the function space defined by both this environment and this NN architecture. The latter includes a certain (and fixed) number of layers and a number of neurons per layer. The question then is: would it be possible to use a neural network to search for the optimal NN architecture? To search for it in the entire NN domain, defined only and exclusively by the given environment?

Ignoratio elenchi

Aggregating multiple neural networks into one (super) architecture comes with a number of tradeoffs and risks, including the one called ignoratio elenchi – missing the point. Indeed, a super net (Figure 1) would likely have its own neurons and layers (including hidden ones), and its own activation functions. (Even a very cursory acquaintance with neural networks would allow one to draw this conclusion.)

Which means that training this super net would inexorably have an effect of training its own internal “wiring” – instead of, or maybe at the expense of, helping to select the best NN – for instance, one of the 9 shown in Figure 1. And that would be missing the point, big time.

Fig. 1. Super network that combines 9 neural nets to generate 4 (green) outputs

The primary goal remains: not to train the super-network per se but rather to use it to search the vast NN domain for an optimal architecture. This text describes one solution that circumvents the aforementioned ignoratio elenchi.

I call it a Weighted Mixer (Figure 2):

Fig. 2. Weighted sum of NN outputs

Essentially, a weighted mixer, or WM, is a weighted sum of the contained neural nets, with a couple important distinctions…
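To make the idea concrete, here is a bare-bones sketch, assuming PyTorch; the candidate pool, the output width, and the per-output softmax weighting are my own simplifications and stand in for the actual WM and its distinctions:

    import torch
    import torch.nn as nn

    class WeightedMixer(nn.Module):
        """Weighted sum of the outputs of several candidate networks (sketch).

        Each candidate keeps training as usual, while a trainable weight matrix
        (one weight per candidate per output, softmax-normalized across the
        candidates) learns how much each net contributes to the combined output.
        """
        def __init__(self, candidates: list[nn.Module], n_out: int):
            super().__init__()
            self.candidates = nn.ModuleList(candidates)
            self.logits = nn.Parameter(torch.zeros(len(candidates), n_out))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            outputs = torch.stack([net(x) for net in self.candidates])  # (N, batch, n_out)
            weights = torch.softmax(self.logits, dim=0)                 # (N, n_out)
            return torch.einsum("no,nbo->bo", weights, outputs)

        def grades(self) -> torch.Tensor:
            """Average weight per candidate – a running grade of each architecture."""
            return torch.softmax(self.logits, dim=0).mean(dim=1).detach()

    # Illustrative candidate pool: MLPs with 16..56 hidden neurons and 1..3 hidden layers.
    def make_mlp(width: int, depth: int, n_in: int = 16, n_out: int = 4) -> nn.Module:
        layers, prev = [], n_in
        for _ in range(depth):
            layers += [nn.Linear(prev, width), nn.ReLU()]
            prev = width
        layers.append(nn.Linear(prev, n_out))
        return nn.Sequential(*layers)

    mixer = WeightedMixer([make_mlp(w, d) for w in range(16, 57, 8) for d in (1, 2, 3)], n_out=4)

Training the mixer end to end updates both the candidate nets and the mixing weights; watching the normalized weights evolve over the course of training is, roughly, what the animated charts below show.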

TL;DR – WM can automatically grade multiple network architectures (a link to the white paper and the supporting code follows below):

Animated Weights

One picture is worth a thousand words. This one shows a bunch of NN architectures, with the sizes and numbers of hidden layers ranging from 16 to 56 and from 1 to 3, respectively. The column charts (Figure 8) depict the running weights that the WM assigns to each of the 16 outputs of each of the constituent nets that it (the WM) drives through the training:

Fig. 8. Weight updates in-progress

The winner here happens to be the (56, 2) network, with the (48, 2) NN coming in a close second. This result – and the fact that the 56-neurons-2-layers architecture indeed converges faster and produces a better MSE – can be separately confirmed by running each of the 18 NNs in isolation…

Searching Stateful Spaces

Optimizing a nonlinear, multidimensional, stateful system is equivalent to performing a search in the space of the (performance affecting) actions and system states.

Recurrent neural networks (RNN) have proved to be extremely efficient at searching function spaces. But they come with baggage.

For a given stateful transformation Y(t) = F(X(t), S(t)), there’s an RNN space – a function space in its own right, inversely defined by the original system function F.
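As a toy illustration (the Elman-style cell and its dimensions are my own placeholders, not the paper’s setup), each such cell is one point in that RNN space, mirroring the stateful transformation above:

    import numpy as np

    class RNNCell:
        """An Elman-style cell mirroring Y(t) = F(X(t), S(t))."""

        def __init__(self, n_in: int, n_state: int, n_out: int, seed: int = 0):
            rng = np.random.default_rng(seed)
            self.Wx = rng.normal(scale=0.1, size=(n_state, n_in))     # input -> state
            self.Ws = rng.normal(scale=0.1, size=(n_state, n_state))  # state -> state
            self.Wy = rng.normal(scale=0.1, size=(n_out, n_state))    # state -> output

        def step(self, x: np.ndarray, s: np.ndarray):
            """One time step: returns (Y(t), S(t+1))."""
            s_next = np.tanh(self.Wx @ x + self.Ws @ s)
            return self.Wy @ s_next, s_next

    # searching the RNN space then amounts to varying n_state and the wiring
    cell = RNNCell(n_in=4, n_state=8, n_out=2)
    y, s = cell.step(np.zeros(4), np.zeros(8))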

The question then is: how to search for the optimal RNN? The one that would feature the fastest convergence and, simultaneously, minimal cost/loss?

The ability to quickly search the RNN domain becomes totally crucial in the presence of real-time requirements (e.g., when optimizing the performance of a running storage system), where the difference between “the best” and “the rest” is the difference between usable and unusable…

  • PDF (full paper)
  • Keywords: RNN, reinforcement learning, hybrid storage, meta-learning, NFL

FTL, translated

The seismic shift

Fact is, NAND media is inherently incapable of executing in-place updates. Instead, the NAND device (conventionally, an SSD) emulates updates via write + remap or copy + write + remap operations, and the like. The presence of such emulation (performed by the SSD’s flash translation layer, or FTL) remains somewhat concealed from an external observer.
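A toy, page-level rendition of that emulation (a sketch only; real FTLs add garbage collection, wear leveling, block management, and much else):

    class ToyFTL:
        """Page-level flash translation layer: no in-place updates, ever."""

        def __init__(self, num_pages: int):
            self.flash = [None] * num_pages      # physical NAND pages
            self.l2p = {}                        # logical page -> physical page
            self.free = list(range(num_pages))   # free physical pages
            self.stale = set()                   # remapped-away pages awaiting GC

        def write(self, lpn: int, data: bytes) -> None:
            ppn = self.free.pop()                # always program a fresh page
            self.flash[ppn] = data
            old = self.l2p.get(lpn)
            if old is not None:
                self.stale.add(old)              # the old copy becomes garbage
            self.l2p[lpn] = ppn                  # remap the logical page

        def read(self, lpn: int) -> bytes:
            return self.flash[self.l2p[lpn]]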

Other hardware and software modules, components, and layers of storage stacks (with the notable exception of SMR) are generally not restricted by a similar intervention of their inherent nature. They could do in-place updates if they wanted to.

Nonetheless, if there’s any universal trend in storage of the last decade and a half, it would be avoiding in-place updates altogether: circumventing them in the majority of recent solutions, treating them as a necessary evil at best.


Docker, NFS, ZFS, and extended attributes

It may be difficult to develop an emotional connection to all of the features of filesystems and filers. Take deduplication for instance. Dedup is cool. Rabin-Karp rolling hash, sliding-window Content Defined Chunking (CDC) – those were cool 15 years ago and remain cool today. Improvements and products (and startups) keep pouring in.
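For a quick refresher, a bare-bones content-defined chunker built on a Rabin-Karp-style rolling hash; the window size, modulus, and cut mask below are arbitrary illustrative choices, and min/max chunk bounds are omitted:

    WINDOW = 48            # bytes in the rolling-hash window
    BASE = 257             # polynomial base
    PRIME = (1 << 61) - 1  # hash modulus
    MASK = (1 << 13) - 1   # ~8 KiB average chunk: cut when the low 13 bits are all ones

    def cdc_chunks(data: bytes):
        """Yield content-defined chunks of data (minimal sketch)."""
        pow_w = pow(BASE, WINDOW, PRIME)  # BASE**WINDOW, for sliding the oldest byte out
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = (h * BASE + byte) % PRIME                       # slide the new byte in
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * pow_w) % PRIME      # slide the oldest byte out
            if i + 1 - start >= WINDOW and (h & MASK) == MASK:  # content-defined cut point
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]                                  # tail chunk

Because cut points depend only on the local window content, an insertion or deletion re-chunks just its own neighborhood – which is what makes the resulting chunks dedup-friendly.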

But when it comes to extended file attributes (xattrs), emotions range from a blank stare to dismay. As in: wouldn’t touch with a ten-foot pole.

Come to think of it, part of the problem is – NFS. And part of the NFS problem is that neither v3 nor v4 supports xattrs. There is no support whatsoever: none, nada, zilch. And how could there be, with no interoperable standard?
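Locally, of course, xattrs are trivial to use – below is the Linux flavor via Python’s os module, on a filesystem that supports them (ext4, XFS, ZFS, …); it is carrying the attribute across an NFS mount that there is no standard way to do:

    import os

    path = "example.txt"   # hypothetical file on a local, xattr-capable filesystem
    open(path, "w").close()

    # user.* is the unprivileged namespace; system.*, security.*, trusted.* are restricted
    os.setxattr(path, "user.origin", b"dedup-fingerprint")
    print(os.getxattr(path, "user.origin"))   # b'dedup-fingerprint'
    print(os.listxattr(path))                 # ['user.origin']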


Storage Cluster in Thermal Equilibrium (part II)

In essence, Part I of this post stipulates that the distribution of states in large clusters can be approximated without making any assumptions about what kind of distribution it is in the first place.

The claim, hypothetical at this point, is that storage clusters under certain conditions must conform to the laws of statistical mechanics (StMch). The narrow version of the same claim relates strictly to the mathematics used in StMch.

Since the remaining part II came out pretty lengthy, heavy on math, and densely populated with equations, I’m including it here as a separate PDF.


The results, I’d say, are inconclusive but promising. Part of being “inconclusive” is simply not enough data – it is too early to say. Testing the theory on larger clusters, 10K nodes and beyond, seems like a no-brainer. However, out of all possible what’s-next ideas and steps, my first preference would be to check out this theory not directly on the clustered nodes but instead on the load-balancing groups and their group-wide aggregated states…

Docker Detour

Docker keeps fascinating me, purely as a use case. From the image-hosting perspective, there are a couple of things missing at its current stage of development. The biggest and most obvious one is a shared, distributed, and deduplicated store for both image manifests (image metadata) and layer content (the data).

Due to the immutable, sha256-protected nature of both, the related complexity is about three orders of magnitude lower than it would be for anything less specialized.
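A minimal sketch of what that immutability buys – names and structure here are my own, not Docker’s: blobs are stored and referenced strictly by their sha256 digest, so dedup and replication reduce to comparing digests:

    import hashlib

    class ContentAddressedStore:
        """Toy content-addressed blob store: immutable, deduplicated by construction."""

        def __init__(self):
            self.blobs = {}                      # digest -> bytes

        def put(self, blob: bytes) -> str:
            digest = "sha256:" + hashlib.sha256(blob).hexdigest()
            self.blobs.setdefault(digest, blob)  # identical content is stored once
            return digest

        def get(self, digest: str) -> bytes:
            return self.blobs[digest]

    store = ContentAddressedStore()
    layer_digest = store.put(b"...layer tarball bytes...")
    manifest = {"layers": [layer_digest]}        # manifests reference layers by digest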

Distributing the content-hashed and stacked stuff…



Storage Cluster in Thermal Equilibrium (part I)

Mathematics is the art of giving the same name to different things.

Henri Poincaré

1. Life’s most persistent question

The question is: how to compute the distribution of states of clustered nodes in a large distributed cluster running a steady workload? The cluster is fully distributed, and:

None of this is uncommon. For one thing, we live at a time of ever-exploding problem sizes. Bigger problems lead to bigger clusters with nodes that, when failed, get replaced wholesale.

And small stuff. For instance, eventually consistent storage is fairly common knowledge today.  It is spreading fast, with clumps of apps (including venerable big data and NoSQL apps) growing and stacking on top.

Those software stacks enjoy minimal centralized processing (and often none at all), which causes CAP-imposed scalability limits to get crushed. This further leads to even bigger clusters, where nodes run complex stateful software, etcetera.

Why complex and stateful, by the way? The most generic sentiment is that the software must constantly evolve, and as it evolves, its complexity and statefulness increase. Version (n+1) is always more complex and more stateful than version (n) – and if there are any exceptions that I’m not aware of, they only confirm the rule. This fully applies to storage, especially since this particular software must keep up with the avalanche of hardware changes. There’s a revolution going on: the SSD revolution is still in full swing and raging, with the SCM revolution right around the corner. The result is millions of lines of code, and counting. It’s a mess.
