CrowdSharing Part 2

When it comes to ad-hoc wireless (mesh) networks, we probably visualize something like this:

SPAN

The picture is a YouTube screen grab; the video itself introduces the project SPAN (github: https://github.com/ProjectSPAN) to “enable communications in challenged environments”. Wikipedia has a somewhat better picture though, covering all kinds of wireless scenarios, including satellite.

Introduction

Wireless ad hoc networks go back 40-plus years, to the early 1970s DARPA projects, with the 3rd generation arriving in the mid-1990s with the advent of the 802.11 standards and 802.11 cards – as always, Wikipedia compiles a good overview 1 2 3.

Clearly there’s tremendous and as yet untapped value in using the cumulative, growing power of smartphones to solve complex tasks. There are daunting challenges as well: some common (security, battery life and limited capacity, to name a few), others application-specific. A separate grand puzzle is monetizing it: making it work for users and future investors.

There’s one not very obvious thing that has happened fairly recently, though, as far as this 40-plus-year track goes. Today we know exactly how to build a scalable distributed object storage system: a system that can be deployed on pretty much anything, is devoid of the infamous single-MDS bottleneck, load-balances on the backend, and always retains the 3 golden copies of each chunk of each object. The abstractions and underlying principles of this development can in fact be used for ad hoc wireless applications.

I’ll repeat. The abstractions and principles at the foundation of an infinitely scalable object storage system can be used for selected ad hoc wireless applications.

Three Basic Requirements

CrowdSharing (as in: crowdsourcing + file-sharing) is one distinct application (the “app”), one of several that can be figured out today.

The idea was discussed in our earlier blog and is a very narrow case of an ad hoc network. CrowdSharing boils down to uploading and downloading files and objects (henceforth “fobjects” 4, for brevity’s sake) whereby a neighbor’s smartphone may serve as an intermediate hop.

(In the picture above, Bob at the bottom left may be sending his fobject to, or receiving it from, Alice at the top left 5.)

CrowdSharing by definition is performed on a best-effort basis, securely and optimally, while at the same time preventing exponential flooding (step 1: Bob (fobject a) => Alice; step 2: Alice (fobject a, fobject b) => Bob; step 3: repeat), and where delays and jitter can be counted in minutes and even hours.

Generally, 3 basic things are required for any future ad hoc wireless app:

  1. must be OK with the ultimately connectionless nature of the transport protocol
  2. must manipulate fobjects 4
  3. must be supported by a (TBD) business model and infrastructure capable of providing enough incentives to both smartphone users and investors

The next 3 sections briefly expand on these points.

Connectionless

There’s a spectrum. On one side of it there are transport protocols that, prior to sending the first byte of data, provision resources and state across every L2/L3 hop between the two endpoints; that state is then maintained throughout the lifetime of the corresponding flow aka connection. On the opposite side, messages just start flying without so much as a handshake.

Apps that will deploy (and sell) over ad hoc wireless must be totally happy with this other side: connectionless transports. Messages may take arbitrary routes, which also implies that each message must be self-sufficient as far as its destination endpoint and the part of the resulting fobject it encodes. Out-of-order or duplicated delivery must be treated not as an exception but as a typical sequence of events, a normal occurrence.

Eventually Consistent Fobjects

Fobjects 4 are everywhere. Photos and videos, .doc and .pdf files, Docker images and Ubuntu ISOs, log-structured transaction logs and pieces of the log files that are fed into MapReduce analytics engines – these are all immutable fobjects.

What makes them look very different is application-level and, in part, transactional semantics: a ZFS snapshot would be an immediately consistent fobject, while, for instance, a video that you store on Amazon S3 is only eventually consistent.

Ad hoc wireless requires eventual consistency 6 for the application-specific fobjects.

Business Model

On one hand, there are people who will pay for the best shot at delivering their data, especially in situations when/where the service of a given provider is patchy or erratic at best, if present at all.

On the other, there is growing smartphone power in the hands of people who might like to make an extra buck without necessarily becoming an Uber driver.

Hence, supply and demand forces can start interacting via the CrowdSharing application “channel”, giving rise to a new and not yet developed market space.

How that can work

Messages

On the left side of the diagram below there’s a fobject that Bob is about to upload. The fobject gets divided into chunks, with each chunk possibly sliced as well via Reed-Solomon or its derivatives. An RSA key (at least 1024-bit) is used to encrypt the parts that are shown below in red and brown.

Bob-and-Alice

Each chunk or slice forms a self-sufficient message that can find its way into the destination object through an unpredictable and not very reliable set of relays, including, in this case, Alice’s smartphone. The content of this message carries metadata and data; a simplified version is depicted above. The destination (network) address is the only part that is in plain text. Bob’s address, however, is encrypted with a key known to the service provider at the destination, while the content itself may be encrypted with a different key that the provider does not have.

The entire message is then associated with its sha512 content digest, which simultaneously serves as a checksum and a global (and globally unique) de-duplicator.
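To make this concrete, here’s a minimal Go sketch of a self-sufficient message and its sha512-based dedup check. The type and field names (Message, Digest, accept) are made up for illustration and simplify away the encryption details:

package main

import (
    "crypto/sha512"
    "encoding/hex"
    "fmt"
)

// Message is a hypothetical self-sufficient CrowdSharing message:
// everything a relay needs travels with the chunk or slice itself.
type Message struct {
    DstAddr    string // destination address, the only plain-text field
    EncSrcAddr []byte // Bob's address, encrypted for the destination provider
    EncPayload []byte // chunk or slice payload, encrypted with a key the relays don't have
    FobjectID  string // which fobject this piece belongs to
    Offset     int64  // where the piece fits within the fobject
}

// Digest returns the SHA-512 digest of the serialized message; it doubles
// as a checksum and as a globally unique de-duplication key.
func (m *Message) Digest() string {
    h := sha512.New()
    h.Write([]byte(m.DstAddr))
    h.Write(m.EncSrcAddr)
    h.Write(m.EncPayload)
    h.Write([]byte(fmt.Sprintf("%s:%d", m.FobjectID, m.Offset)))
    return hex.EncodeToString(h.Sum(nil))
}

// seen is the relay-side dedup table: a relay that already has a digest
// can refuse the transfer outright, which also caps the flooding.
var seen = map[string]bool{}

func accept(m *Message) bool {
    d := m.Digest()
    if seen[d] {
        return false // duplicate: drop
    }
    seen[d] = true
    return true
}

func main() {
    m := &Message{DstAddr: "provider.example.com", FobjectID: "photo-42", Offset: 0}
    fmt.Println(accept(m)) // true
    fmt.Println(accept(m)) // false, deduplicated
}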

Actors

The diagram shows 3 “actors”: Bob, who uploads his fobject, and Alice, who provides her smartphone’s resources and maybe her own home (or other) network to forward Bob’s fobject to the specified destination.

And finally, Application Provider aka Control Center.

The latter can be made optional; it can and must, though, perform a number of key functions:

  • maintain user profiles and their (user) ratings based on the typical like/dislike and trust/distrust feedback provided by the users themselves
  • record the start/end, properties and GPS coordinates of each transaction – the latter can become yet another powerful tool to establish end-to-end security and authenticity (and maybe I will talk about it in the next blog)
  • take actual active part during the transaction itself, as indicated on the diagram and explained below, and in particular:
  • act as a trusted intermediary to charge user accounts, on one hand, and deposit service payments to the users who volunteer their smartphones for the CrowdSharing service, on the other.

Steps

Prior to sending the first message, Bob’s smartphone, or rather the CrowdSharing agent that runs on it, would ask Alice’s counterpart whether it is OK to send. Parameters are provided, including the sha512 key-digest of the message and the message’s size.

At this point Alice generally has 3 choices, depicted above atop the light-green background. She can flat-out say No, with possible reasons including administrative policy (“don’t trust Bob”), a duplicated message (“already has, or has already seen, this sha512”), and others.

Alice can also simply say Yes, and then expect the transfer.

Or, she can try to charge Bob for the transaction, based on criteria that take into account:

  • the current CrowdSharing market rate at a given GPS location (here again centralized Application Provider could be instrumental)
  • remaining capacity and battery life (the latter – unless the phone is plugged-in)
  • Alice’s data plan and its per-month limits set with the wireless service provider, and more.

Bob, or rather the CrowdSharing agent on his smartphone, will then review Alice’s bid and make the corresponding decision, either automatically or after interacting with Bob himself.
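Here’s one possible way to model this exchange in Go – a sketch only, with made-up types (Offer, Reply) and a deliberately simplistic pricing rule, not a specification of the actual protocol:

package main

import "fmt"

// Offer is what Bob's agent sends before transferring the first message:
// the SHA-512 key-digest and the size of the message.
type Offer struct {
    SHA512 string
    Size   int64
}

// Verdict is Alice's reply: refuse, accept for free, or quote a price in points.
type Verdict int

const (
    No Verdict = iota
    Yes
    Bid
)

type Reply struct {
    Verdict Verdict
    Points  int // non-zero only when Verdict == Bid
}

// aliceDecide sketches Alice's side: dedup check, policy check, then an
// optional bid that a real agent would derive from capacity, battery and data plan.
func aliceDecide(o Offer, seen map[string]bool, trustBob bool, pricePerMB int) Reply {
    if !trustBob || seen[o.SHA512] {
        return Reply{Verdict: No}
    }
    if pricePerMB == 0 {
        return Reply{Verdict: Yes}
    }
    mb := int(o.Size / (1 << 20))
    if mb == 0 {
        mb = 1
    }
    return Reply{Verdict: Bid, Points: mb * pricePerMB}
}

// bobReview sketches Bob's side: accept automatically when the bid is within
// a user-configured budget, otherwise fall back to asking Bob himself.
func bobReview(r Reply, maxPoints int) bool {
    switch r.Verdict {
    case Yes:
        return true
    case Bid:
        return r.Points <= maxPoints
    default:
        return false
    }
}

func main() {
    seen := map[string]bool{}
    offer := Offer{SHA512: "ab12...", Size: 4 << 20}
    reply := aliceDecide(offer, seen, true, 2) // 2 points per MB
    fmt.Println(reply, bobReview(reply, 10))   // {2 8} true
}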

Payments

Payments must be postponed until the time when Alice (in the scenario above) actually pushes the message onto a wired network.

Payments must likely be carried out in points (I suggest that each registered user deposits, say, $20 for 2,000 CrowdSharing points) rather than dollars, to be computed and balanced later based on an automated review of the transaction logs by the Application Provider. Accumulated points, of course, must convert back into dollars on the user’s request.

The challenges are impressive

The challenges, again, are impressive and numerous; there’s nothing, though, that I think is technically above and beyond today’s (as they say) state of the art.

To name a few: downloading (getting) fobjects represents a different scenario due to the inherent asymmetry of the (Wi-Fi user, wired network) situation. BitTorrent (the first transport that comes to mind for possible reuse and adaptation) will dynamically throttle based on the amount of storage and bandwidth that each “peer” provides, but is of course unaware of the power consumption and bandwidth of mobile devices running on battery.

Extensive control-path logic will likely have to be done totally from scratch – logic that must take into account user-configurable administrative policies (e.g., do not CrowdShare if the device is currently running on battery), the limits set by wireless data plans 7, the ratings (see above), and more.

Aspects of the business model are in this case even more curious than technical questions.


  1. Wireless ad hoc network
  2. Stochastic geometry models of wireless networks
  3. Wireless mesh network
  4. fobjects are files and/or objects that get created only once, retrieved multiple times, and never updated. Not to be confused with fobject as a combo of “fun and object” (urban dictionary)
  5. Alice and Bob
  6. Eventual consistency
  7. In one “overage” scenario, Alice could “charge” Bob two times her own overage charges if the latter are being reached or approached (diagram above)

Map-free Multi-site Replication

A distributed storage system has to do a great job of replicating content. Building a highly available storage service out of mildly reliable storage devices requires replicating after each failure. Replication is not an exception, it is a continuous process. So it’s important.

The design of the CCOW (Cloud Copy-on-Write) object storage system deals with replication within a local network very well. What it does not deal with anywhere near as well is spreading content over very long round-trip times, as in different sites. I needed a feature that would allow customers to federate multiple sites without requiring 40 GbE links connecting the sites.

This is not specific to CCOW object storage. The need to provide efficient storage locally will force any design to partition between “local” updating, which is fairly fast, and “remote” updating which will be considerably slower. If you don’t recognize this difference you’ll end up with a design that does all updates slowly.

But I needed a solution for CCOW. My first thought on how to replicate content between federated sites was to just layer on top of software bridges.

The problem actually maps fairly well:

  • Each site can be thought of as a switch.
  • The “switch” can deliver frames (chunks) to local destinations, or to links to other switches (sites).
  • A frame (chunk) is never echoed to the same link that it arrived upon.
  • There is no need to provision a site topology: the Spanning Tree will discover it. The later spanning tree algorithms will do it better.

If you are not familiar with the Spanning Tree algorithms I suggest you go read up on them. Spanning Tree made Ethernet networking possible. See https://en.wikipedia.org/wiki/Spanning_Tree_Protocol.

Basically, we start with a set of linked Sites:

MultiSite1

Spanning Tree algorithms effectively prevent loops by logically de-activating some of the links.

MultiSite2

So re-using existing software (software switches and existing long-haul tunnels) was a very tempting solution, especially since it directly re-used networking layer code to solve storage layer problems. Pontificating that networking solutions are applicable to storage is one thing, actually leveraging network layer code would prove it.

This leads to the following distribution pattern, in this example for a frame originating from Site V.

MultiSite3

But it turns out that the storage layer problem is fundamentally simpler.

Radia Perlman’s spanning tree prevents loops by discovering the links between the switches and, without central decision making, de-activating the subset of links that would have enabled forwarding loops. It does this no matter how the switches are linked. This is needed because Ethernet frames do not have a time-to-live marker. Without spanning tree it would be very easy for Switch A to forward a frame to Switch B, which would forward it to Switch C, which would forward it back to Switch A. While rare, such loops would crash any Ethernet network. Spanning tree was a simple algorithm that prevented that totally.

The fundamental assumption behind spanning tree was that links had to be de-activated because switches could not possibly remember what frames they had previously forwarded. Recognizing that a frame was being looped around would require remembering frames that have been forwarded eons previously, possibly even milliseconds.

If you understand how switches work, you know that the last thing any switch wants to do is remember anything. Remembering things takes RAM, but more critically it takes time to update that RAM. RAM may get cheaper, but remembering things always takes longer than not remembering them, and switches will kill to trim a microsecond from their relay time.

But remembering things is exactly what a storage cluster is designed to do.

It turns out that the multi-site forwarding algorithm for multi-site storage of fingerprinted chunks is stunningly simple:

On receipt of a new Chunk X on one inter-site link:

If Chunk X is already known to this cluster:

Do nothing.

Otherwise:

Replicate Chunk X on all other inter-site links.

Remember Chunk X locally.

That’s it. Having a cryptographic hash fingerprint of each chunk makes it easy to identify already seen chunks. This allows multiple sites to forward chunks over ad hoc links without having to build a site-wide map, and all links can be utilized at all times.
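For illustration, here’s roughly what that forwarding rule looks like in Go. The Site and Link types are placeholders and Send stands in for whatever long-haul transport connects the sites – a sketch of the algorithm, not NexentaEdge code:

package main

import "fmt"

// Link abstracts an inter-site tunnel; Send is whatever long-haul transport
// connects the sites.
type Link struct{ Name string }

func (l *Link) Send(fingerprint string, chunk []byte) {
    fmt.Printf("replicating chunk %s over link %s\n", fingerprint, l.Name)
}

// Site remembers every chunk it has stored, keyed by cryptographic fingerprint.
type Site struct {
    known map[string][]byte
    links []*Link
}

// OnReceive implements the forwarding rule: a chunk already known locally is
// dropped; a new chunk is remembered and relayed on every other inter-site link.
func (s *Site) OnReceive(fingerprint string, chunk []byte, from *Link) {
    if _, ok := s.known[fingerprint]; ok {
        return // already seen: the storage cluster itself is the loop detector
    }
    s.known[fingerprint] = chunk
    for _, l := range s.links {
        if l != from {
            l.Send(fingerprint, chunk)
        }
    }
}

func main() {
    a, b, c := &Link{"A"}, &Link{"B"}, &Link{"C"}
    site := &Site{known: map[string][]byte{}, links: []*Link{a, b, c}}
    site.OnReceive("sha256:1f3a...", []byte("payload"), a) // relays on B and C
    site.OnReceive("sha256:1f3a...", []byte("payload"), b) // duplicate: dropped
}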

Now the forwarding pattern is enhanced and there is no delay for building a tree:

Multisite4

Go Anagram

The history is well known. Go started as a pragmatic effort by Google to answer its own software needs: managing hundreds of thousands (some say, tens of millions) of servers in Google’s data centers. If there’s scale anywhere, Google has it. Quoting the early introduction (which I strongly suggest reading), Go at Google: Language Design in the Service of Software Engineering:

The Go programming language was conceived in late 2007 as an answer to some of the problems we were seeing developing software infrastructure at Google. The computing landscape today is almost unrelated to the environment in which the languages being used, mostly C++, Java, and Python, had been created. The problems introduced by multicore processors, networked systems, massive computation clusters, and the web programming model were being worked around rather than addressed head-on.
<snip>
Go was designed and developed to make working in this environment more productive. Besides its better-known aspects such as built-in concurrency and garbage collection, Go’s design considerations include rigorous dependency management, the adaptability of software architecture as systems grow, and robustness across the boundaries between components.

When I first started looking at Go aka golang 1 it was strictly in connection with Docker, coreos/rkt, LXD and various other container technologies – all of which happen to be coded in Go.

Your First Go

“Like many programmers I like to try out new languages” – a quote from Adam Leventhal’s blog on his first Rust program: an anagrammer. My sentiment as well 2. I’m not sure about Rust, but my own anagrammer in Go follows below, likely far from the most elegant:

package main

import (
        "bufio"
        "fmt"
        "log"
        "os"
        "sort"
        "strings"
)

func normalize(word string) string {
        lcword := strings.ToLower(word)
        letters := []string{}

        for _, cp := range lcword {
                letters = append(letters, string(cp))
        }
        sort.Strings(letters)
        return strings.Join(letters, "")
}

func do_wordmap() *map[string][]string {
        file, err := os.Open("/usr/share/dict/words")
        if err != nil {
                log.Fatal(err)
        }
        defer file.Close()

        var allwords []string
        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
                t := scanner.Text()
                allwords = append(allwords, t)
        }

        wordmap := make(map[string][]string)
        for _, w := range allwords {
                nw := normalize(w)
                wordmap[nw] = append(wordmap[nw], w)
        }
        return &wordmap
}

func main() {
        wordmap := do_wordmap()

        syn := bufio.NewReader(os.Stdin)
        for {
                fmt.Print("Enter a word: ")
                myword, _ := syn.ReadString('\n')
                myword = strings.TrimSuffix(myword, "\n")
                normal_w := normalize(myword)

                fmt.Println((*wordmap)[normal_w])
        }
}

It runs like this:

Enter a word: spare
[pares parse pears rapes reaps spare spear]

Distributed, Concurrent and Parallel

To facilitate distributed, concurrent, parallel processing on a massive scale, the language must include developer-friendly primitives. To that end, Go includes, for instance:

  • goroutine, for multitasking
  • channel, to communicate between the tasks

The latter were adopted from communicating sequential processes (CSP) first described in a 1978 paper by C. A. R. Hoare.

Here’s a snippet of code that puts a chunk to the storage servers concurrently and blocks until at least 3 replicas are stored:

replicas_count := make(chan bool, len(servers))

for _, s := range servers {
    go func(s *StorageServer) {
        s.Put(chunk)
        replicas_count <- true
    }(s)
}

for n := 0; n < 3; n++ {
    <-replicas_count
}

That’s the power of Go.


  1. Use “golang” to disambiguate your google searches 
  2. Writing code focuses your mind and untroubles your soul (c) 

GitTorrent – An Oxymoron?

The idea of combining peer-to-peer distribution (torrents) with source control (git) has recently been promoted (GitTorrent-2015-Announcement), but it actually has been around for a while (GitTorrent-GoogleCode).

The appeal of this amalgamation is combining the distribution of software products (packages, ISOs, etc.) via torrents with all the benefits of source control systems such as Git.

The conflict between the ideas is fairly obvious. Torrents are not viewed as providing reliable service for the fundamental reason that they are not reliable. They optimize crowd distribution of popular material. Git needs to retain boring historic archives of old versions which will probably never be fetched.

There are several inherent contradictions between the “Git” and “Torrent” worlds:

  • Source Control takes responsibility for retention of every version of an object, forever.
  • Peer-to-peer networking is dependent on a social control which assumes that there is demand to access the content being shared. It is easy for a peer-to-peer network to identify and shame “leeches” who only download material but never donate bandwidth to the community. Measuring who is actually retaining seldom-retrieved versions would be far harder.
  • A primary feature of source control systems is to attribute every byte of the current version of a file/object to the specific update when it was introduced, and who to “blame” for that update. Source control systems frequently require each edit to be password and/or certificate authenticated. Torrents that required traceable attribution for every contribution would see a major portion of their seeders unwilling to “contribute” material. The MPAA would probably claim that this is most of the content being torrented.

However, there are some major architectural similarities between Object Storage solutions designed for Enterprise storage (whether in-house and/or cloud-based) and peer-to-peer networking:

  • Both rely on a large number of storage targets rather than upon high availability of specific storage targets. Many low-cost modestly-available targets end up being cheaper and more reliable than a few highly available targets. But peer-to-peer targets don’t just “fail”, they are withdrawn from availability by their owners. That is a problem that Enterprise storage does not have to address.
  • Both benefit from the use of cryptographic hashes to identify chunks. Software distribution has been publishing checksums of packages and ISOs for some time, but they are typically maintained separately from the storage. This is unfortunate because it requires the end user to remember to apply the fingerprint to the downloaded content, something that is probably actually done several times a day.

Actually peer-to-peer networks are more in need of cryptographic hashes than Enterprise storage clusters, but so far they tend to use cheaper hash algorithms. Not every contributing server can be vetted by the peer-to-peer network. Volunteer help seldom sticks around for a long validation process. Therefore peer-to-peer networking must be designed to prevent malevolent and/or misbehaving servers from corrupting the distributed storage cluster.

Directories?

One of the interesting statements that Chris Ball made in his blog announcement of the 2015 vintage GitTorrent was that FTP was unsuitable as a distribution method because there’s no master index of which hosts have which files in FTP, so we wouldn’t know where to look for anything.

But BitTorrent style directories are not the real solution either. They are the equivalent of HDFS or pNFS – centralized metadata referencing distributed chunks/blocks.

When you fully embrace identifying chunks with Cryptographic Hashes you just have to go all the way (which is what we did with NexentaEdge). Everything needs to be a chunk identified by a cryptographic hash of its payload. That includes the metadata. You just have to name some of the metadata chunks to anchor the references to the other chunks.

The root metadata chunks need to have an additional lookup driven by the cryptographic hash of their fully qualified name. To make a globally unique name just scope it by a domain name.

The Name Hash of a fully qualified, domain name scoped, object identifier needs to be able to find a list of the versions of the named object.

Each chunk is considered to be either “named” or “unnamed”. Named chunks are located by their Name Hash, unnamed chunks by their Content Hash. Whichever hash is selected yields some form of partitioning of the global namespace. For example, the configuration might specify that there are 1024 partitions. You then query for the desired hash on the servers registered to support that partition.
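As a quick illustration, here’s a Go sketch of the hash-to-partition mapping under the 1024-partition example above; the helper name partitionOf is, of course, made up:

package main

import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
)

const numPartitions = 1024 // as in the example configuration above

// partitionOf maps a cryptographic hash (of a chunk's content, or of an
// object's fully qualified name) to one of the configured partitions.
func partitionOf(hash [sha256.Size]byte) uint32 {
    return uint32(binary.BigEndian.Uint64(hash[:8]) % numPartitions)
}

func main() {
    nameHash := sha256.Sum256([]byte("example.com/photos/2015/conference.jpg"))
    contentHash := sha256.Sum256([]byte("...chunk payload..."))

    fmt.Println("named chunk -> partition", partitionOf(nameHash))
    fmt.Println("unnamed chunk -> partition", partitionOf(contentHash))
}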

To find the Metadata for an object:

  • Hash the fully qualified name of the object.
  • Map the hash value to a partition.
  • Ask the servers supporting that partition to return the most recent version of the Manifest for this object, or refer you to a later version that they know about but do not have a copy of.
  • Note that the returned Manifest still has a cryptographic hash of its content, allowing end-to-end validation of the manifest.
  • Pick the most recent valid version returned.

Then use the saved content hashes in the Manifest to find the payload chunks (a sketch of the full lookup follows the list below).

  • Map the content hash to its partition.
  • Ask the servers supporting that partition to return this chunk, or to refer you to another server that has it.
  • Only the first replica actually found needs to be returned.
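Putting the two lists together, here’s a Go sketch of the whole lookup. The PartitionClient interface and its method names are hypothetical stand-ins for however a real peer-to-peer client would talk to the servers of a partition:

package main

import (
    "crypto/sha256"
    "fmt"
)

// Manifest lists, in order, the content hashes of an object version's payload.
type Manifest struct {
    ContentHashes [][sha256.Size]byte
}

// PartitionClient is a hypothetical interface to the servers registered for
// one slice of the hash space; the method names are illustrative only.
type PartitionClient interface {
    LatestManifest(nameHash [sha256.Size]byte) (Manifest, error)
    GetChunk(contentHash [sha256.Size]byte) ([]byte, error)
}

// fetchObject sketches the two-step lookup described above: the name hash
// locates the most recent manifest, and the manifest's content hashes locate
// (and end-to-end validate) the payload chunks.
func fetchObject(fqName string, partition func([sha256.Size]byte) PartitionClient) ([]byte, error) {
    nameHash := sha256.Sum256([]byte(fqName))
    manifest, err := partition(nameHash).LatestManifest(nameHash)
    if err != nil {
        return nil, err
    }
    var payload []byte
    for _, ch := range manifest.ContentHashes {
        chunk, err := partition(ch).GetChunk(ch)
        if err != nil {
            return nil, err
        }
        if sha256.Sum256(chunk) != ch {
            return nil, fmt.Errorf("chunk failed validation")
        }
        payload = append(payload, chunk...)
    }
    return payload, nil
}

// fakePartition is a one-node stand-in so the sketch runs end to end.
type fakePartition struct{ chunks map[[sha256.Size]byte][]byte }

func (f *fakePartition) LatestManifest(_ [sha256.Size]byte) (Manifest, error) {
    m := Manifest{}
    for h := range f.chunks {
        m.ContentHashes = append(m.ContentHashes, h)
    }
    return m, nil
}

func (f *fakePartition) GetChunk(h [sha256.Size]byte) ([]byte, error) { return f.chunks[h], nil }

func main() {
    chunk := []byte("hello, peer-to-peer storage")
    fp := &fakePartition{chunks: map[[sha256.Size]byte][]byte{sha256.Sum256(chunk): chunk}}
    payload, _ := fetchObject("example.com/notes/readme", func([sha256.Size]byte) PartitionClient { return fp })
    fmt.Println(string(payload))
}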

Note: this is slightly different from what NexentaEdge does, but NexentaEdge has the benefit of having all of the storage servers for a given cluster at a single location. This enables the use of multicast messaging to query all members of a “partition” in a single operation. There is no way to use multicast reliably over wide area networks.

In the following diagram various payload chunks are put to various partitions. Their content hashes are saved, and then referenced in version manifests which are stored (and indexed) on the partition derived from the name hash of the object name.

NameContentHashing

Object Storage Features Applied to Peer-to-Peer Networking

Despite the inherent conflicts there are several features of enterprise object storage that are applicable to the problem of peer-to-peer storage.

In both Enterprise and peer-to-peer storage, objects can be broken down into chunks identified by a cryptographic hash. Each chunk represents a portion of an Object Version’s payload or metadata. Payload chunks can be identified by a cryptographic hash of their payload. Metadata chunks can also be identified by a cryptographic hash of their payload, but also need to be found using only the object name. This can be done using the cryptographic hash of the object’s fully qualified name.

Using cryptographic hashes to identify chunks has several benefits.

  • Cryptographic Hashes Self-Validate the Chunk. As each chunk is retrieved the client can validate that its payload matches its cryptographic hash.
  • Cryptographic Hashes enable distributed deduplication.
  • Cryptographic Hashes can drive the distribution of Chunks. The cryptographic hash of the content and/or name can determine which storage targets hold each chunk.

Breaking up Storage

Distributing chunks by cryptographic hashes allows each peer-to-peer server to only deal with a subset of the objects in the system.

A client looking for a peer-to-peer replica of a chunk only needs to contact the set of servers dealing with the desired slice of the “hash space”. This applies whether seeking metadata or payload.

Authenticating Users

In enterprise object storage systems, users typically authenticate using a tenant-specific authorization server, such as Active Directory or LDAP.

For peer-to-peer storage a given domain name would publish a public key. The holder of the matching private key could either retain the exclusive right to make new versions or just the right to delete posts made by others.

The system should probably also require that each version be signed with a key associated with the author’s email address.
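As a sketch of what signing a version could look like – using the standard Go ed25519 package, with a made-up record format:

package main

import (
    "crypto/ed25519"
    "crypto/rand"
    "fmt"
)

func main() {
    // The domain publishes pub; the matching private key stays with whoever
    // is allowed to create new versions (a sketch, not a key-management scheme).
    pub, priv, err := ed25519.GenerateKey(rand.Reader)
    if err != nil {
        panic(err)
    }

    // Sign the version manifest's content hash together with the author's email
    // (the record format here is purely illustrative).
    versionRecord := []byte("sha256:9c1e... signed-by alice@example.com")
    sig := ed25519.Sign(priv, versionRecord)

    // Any peer holding the domain's public key can verify the version
    // before serving or replicating it.
    fmt.Println("version accepted:", ed25519.Verify(pub, versionRecord, sig))
}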

Remaining Issues

One of the toughest challenges for anyone building a peer-to-peer storage network is determining how many replicas (or erasure-encoded slices) are needed to make the chance of losing content effectively zero (or below some other TBD threshold).

This is related to the other challenge of how to encourage users to provide storage.

The costs of long-haul communications are such that a peer-to-peer storage network is unlikely to be cost-effective on a $$$-per-IOPS or per-TB basis. But there are other benefits of a peer-to-peer storage community that may be appealing:

  • There is no need to tell anyone other than your end users what you are storing. Metadata can be encrypted. If you know the key you can decrypt the metadata and confirm the name and version of the version metadata retrieved. If you do not know the key, you have a blob with a name that hashes to X.
  • Content Distribution (albeit a distributed mirroring system may be more valuable here). Those fetching from a mirror do not have to independently validate the content; corrupted or altered data cannot be retrieved from the system.
  • There is another benefit, related to dynamic ingest of data, which I’ll cover in the next blog.

Disclaimer

This post touches upon features of Nexenta’s object storage product NexentaEdge. Portions of NexentaEdge are covered by patent applications and/or patents. Nexenta reserves all rights under those patents.

CrowdSharing

It is conventional wisdom: peer-to-peer storage works for distributing popular items. A torrent swarm is never more efficient than when delivering the latest episode of Game of Thrones.

But there is another usage of peer-to-peer networking that is almost the opposite – inbound swarming. This idea was originally suggested by Robert Novak while he was at Nexenta, but many details have been fleshed out since then. An inbound swarm is quite unlike a Game of Thrones swarm: it carries a horde of distinct files that are bound to only one destination, and most will never be accessed.

Turning Peer-to-Peer Inside-Out

The use-case is simple: imagine that you have a very large number of people generating content at a location with limited upstream capacity. The “limited” capacity can be quite high; it just has to be limited relative to the demand.

If the use case is not clear, consider a technical conference at the Moscone Center (or any venue). Or consider a sporting event that wasn’t built with a Wi-Fi router for each row of seats. In either case you have a lot of content being generated by smartphones and tablets, a lot more content than can be transmitted live from the venue.

This content has some special characteristics:

  • It is undoubtedly targeted to some form of object storage: a photo archive site, a cloud storage folder or perhaps even a home server.
  • 99.9% of it will reach its destination without any need for peer-to-peer networking, eventually. Specifically, when the smartphone or tablet returns to the office or home, it will upload everything via Wi-Fi without any real problems.
  • It needs to be backed up. This content is at risk of failure, either of the smartphone/tablet or of the SD card that the content is stored on.

Peer-to-peer networking can provide a backup service to these mobile devices tailored exactly to these scenarios.

Basically, the mobile devices back each other up. But this needs to be automatic, between collaborating users without relying on negotiating trust on an individual basis.

To establish this trust the application would have to:

  • Throttle the amount of storage and bandwidth each volunteer provided. This is a feature of torrenting software already.
  • Promise the source of the material that nobody would be able to view or alter the content before it reached its final destination.
  • Promise the intermediate holders of the data that they could in no way be held responsible for any of the content because there was absolutely no way for them to review it.

Once the trust issues are solved there is more than enough bandwidth within a Conference center, and more than enough storage for all the content being generated. Content just needs to be held on multiple devices until the bandwidth is available to deliver the blob to its ultimate destination.

Each file/object is:

  • Encrypted. The payload is completely undecipherable until it reaches its destination.
  • Fingerprinted. There is a unique id for each file/object calculated by a cryptographic hash of the file/object content.

This serves several purposes:

  • It prevents unauthorized alteration of the content by any of the volunteer devices holding a backup of it.
  • It allows easy detection of when the original device or a different backup device has already delivered the file/object. It is likely that if such a service were deployed, the typical file/object would never be accessed. Most of the time the original device will make it safely back to the office or home and deliver the file itself. Like any backup image, the preferred scenario is that it is never accessed.
  • The designated destination could even be an intermediate destination to provide greater anonymity. The first destination would decrypt the object just enough to know where it should route it. This doesn’t have to be NSA-proof obfuscation, but most people would rather not broadcast their Google Photos account identification to everyone at the venue with them.
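Below is a minimal Go sketch of the encrypt-then-fingerprint step, assuming AES-GCM for the payload encryption and SHA-256 for the fingerprint; a real app could of course pick different primitives:

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
)

// encryptAndFingerprint sketches the two properties listed above: the payload
// is sealed with a key only the owner (and final destination) knows, and the
// resulting blob gets a unique id that any relay can use to detect delivery
// or tampering without being able to read the content.
func encryptAndFingerprint(key, payload []byte) (blob []byte, id [sha256.Size]byte, err error) {
    block, err := aes.NewCipher(key) // key must be 16, 24 or 32 bytes
    if err != nil {
        return nil, id, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, id, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, id, err
    }
    blob = gcm.Seal(nonce, nonce, payload, nil) // nonce prepended to ciphertext
    id = sha256.Sum256(blob)                    // fingerprint of the encrypted blob
    return blob, id, nil
}

func main() {
    key := make([]byte, 32)
    rand.Read(key)
    blob, id, err := encryptAndFingerprint(key, []byte("photo taken at the conference"))
    if err != nil {
        panic(err)
    }
    fmt.Printf("blob %d bytes, fingerprint %x\n", len(blob), id[:8])
}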

Pieces of this solution have been around for a long time. Routing schemes exist for ad hoc networks that are dynamically discovered. Store and forward relay is as old as email. But this solution is actually far simpler than either of those. Each replica is stored on at most one intermediate device, and is forwarded at most once. The only ad hoc network used is at the source venue. Forwarding would be limited to known “home” Wi-Fi networks that are connected to the Internet.

CrowdBackup

Crowd-Sourced Backup

So this is essentially a mobile “crowd-sourced” backup system that is specifically tailored to the needs of files generated by mobile devices, such as photos of presentations at a conference, or photos of friends together at a concert, etc.

The goal is to provide backup replicas that are retained for perhaps half a day. There is no need for long term storage. Rather than seeking to distribute replicas to many destinations, it instead seeks many paths to a single destination per object.

The biggest gotcha with this idea is that it really takes a Mobile OS vendor to properly enable the on-site file transfers. There isn’t any incentive for users to download a “CrowdSharing” app unless they are likely to encounter other devices with the same App installed.

Without critical mass this doesn’t work. Do you remember the Zune’s wireless sharing feature? It would have been a big plus, if you ever ran across other people with Zunes to share with.

So this is a great potential application, but realistically it probably will not go anywhere unless embraced by Google and/or Apple.

Hybrid storage arrays versus all-flash

Instead of an introduction: a fair summary at The Register.

This, and many other writings on the topic, reiterates the pros and cons, restates the importance of considering your use case and your workload, and mentions the cost per IOPS and cost per TB. The latter, by the way, ranges from $70/TB (low-end SATA) to thousands and tens of thousands of dollars per TB (high-end all-flash arrays).

To be fair, the per-IOPS price of those high-end arrays still beats 7.2K RPM SATA by about 20 to 100 times.

In the meantime, tiering seems to have gone out of vogue in the last couple of years. In part and specifically, this applies to automated within-LUN (sub-LUN or intra-LUN) and within-target hot <=> cold tiering done by the storage array transparently to users and applications. Cases in point: Compellent’s Data Progression and 3PAR’s Dynamic Optimization of the previous decade.

And many others. We don’t hear much anymore about this kind of tiering, the “automated data migration”.

The reason, one might say, is that SSDs keep getting better and cheaper, denser and more reliable, at the rate set by Moore’s law back in 1965. The real reason, I say, is different: hybrid is tough. Designing, building and delivering storage software that optimizes the IO datapath across a hybrid array is rocket science.

Let’s run a thought experiment though (a leap of faith, really) and assume there is such a piece of storage software, something like this when looking from 30,000 ft:

hybrid-pic

Wouldn’t we then want to combine excellent read performance (whereby active data migrates to SSDs in a timely, on-demand fashion) with total capacity, while at the same time using the same flash tier for user writes, completing them at flash latencies?

Of course we would.

The fact that an optimal and stable hybrid storage array(*) is not done yet does not mean that it cannot be done. Yes – the corresponding software is racy and hard to debug (**). More importantly, it is tough to test across an exploding number of use cases. But it is doable. And once it is done and demonstrated, it might as well change the perspective and the common story line:

Hybrid storage arrays or all-flash?


(*) The term “hybrid” is used in many different contexts. At the drive level, there are hybrid SSDs, HDDs and even SSHDs. At the opposite extreme, storage administrators build hybrid solutions combining all-flash arrays (the proverbial “primary tier”) and archival cloud services. A hybrid storage array, or a hybrid volume (LUN), is somewhere in the middle.

(**) Disclaimer: some of the related work is being done at Nexenta, and will be part of the next-gen release of the storage appliance..

Global Namespace for Docker

This text has been posted to the Docker development community at https://github.com/docker/docker/issues/14049 – it might be a bit too technical at times.

1. Terms

Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while at the same time supporting unmodified clients. Not to be confused with LXC pid and other namespaces.

2. sha256

Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.

Side note: sha256 covers a space of more than 10 ** 77 unique random digests, which is about as much as the number of atoms in the observable universe. Apart from this unimaginable number sha256 has all the good crypto-qualities including collision resistance, avalanche effect for small changes, pre-image resistance and second pre-image resistance.

The same applies to sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2 to name a few.

Those are the distinct properties that allow us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable “asset”.

This simple fact can be used not only to self-validate images and index them locally via Graph’s in-memory index; it can further be used to support a global container/image namespace and global deduplication (a minimal sketch follows the list below). That is:

  • Global Namespace
  • Global Deduplication for image layers

Hence, this Proposal.
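As a minimal sketch of what the digest-based global index could look like – the register function and the in-memory map are toy stand-ins, not part of the actual patch:

package main

import (
    "crypto/sha256"
    "fmt"
)

// layerDigest returns the content-addressable name of an image layer in the
// Registry V2 style, i.e. "sha256:<hex>".
func layerDigest(layer []byte) string {
    return fmt.Sprintf("sha256:%x", sha256.Sum256(layer))
}

// index is a toy stand-in for a cluster-wide layer index: two nodes that
// compute the same digest are, by construction, holding bitwise-identical
// layers, so only one physical copy per cluster is strictly needed.
var index = map[string][]string{} // digest -> nodes that already store it

func register(node string, layer []byte) (digest string, deduplicated bool) {
    digest = layerDigest(layer)
    deduplicated = len(index[digest]) > 0
    index[digest] = append(index[digest], node)
    return digest, deduplicated
}

func main() {
    layer := []byte("...ubuntu base layer tarball...")
    fmt.Println(register("node-1", layer)) // first copy
    fmt.Println(register("node-2", layer)) // same digest: dedup candidate
}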

3. Docker Cluster

The rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:

The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs the docker daemon. Each node performs a dual role: it is an NFS server to all other nodes, with the NFS share sitting directly on the node’s local rootfs. Simultaneously, each node is an NFS client, as per the diagram below:

docker-namespace-federated

Blue arrows reflect actual NFS mounts.

There are no separate NAS servers: each node, on one hand, shares its docker (layers, images) metadata and, separately, driver-specific data. And vice versa, each node mounts all clustered shares locally, under respective hostnames as shown above.

Note: hyper-convergence

Oftentimes this type of clustered symmetry, combined with the lack of a physically separate storage backend, is referred to as storage/compute “hyper-convergence”. But that’s another big story outside this scope.

Note: runtime mounting

As far as this initial implementation (link above) goes, all the NFS shares are mounted statically, prior to the daemon’s startup. This can be changed to on-demand mounting and more.

Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). This patch touches them both – the latter, currently, for aufs only.

4. Benefits

  • An orchestrator can run container on an image-less node, without waiting for the image to get pulled
  • Scale-out: by adding a new node to the cluster, we incrementally add CPU, memory and storage capacity for more docker images and containers that, in turn, can use the aggregated resource
  • Deduplication: any image or layer that exists in two or more instances can be, effectively, deduplicated. This may require pause/commit and restart of associated containers; this will require reference-counting (next)

5. Comments

It’s been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea – it makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.

6. What’s Next

The patch works as it is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident, others less so.

The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would in turn be configurable to use the underlying OS aufs or overlayfs mount/unmount.

This is easy but this, as well as the other points below, requires positive feedback and consensus.

Other immediate steps include:

  • graph.TagStore to tag all layers including remote
  • rootNFS setting via .conf for Graph
  • fix migrate.go accordingly

Once done, next steps could be:

  • on demand mounting and remounting via distributed daemon (likely etcd)
  • node add/delete runtime support – same
  • local cache invalidation upon new-image-pulled, image-deleted, etc. events (“cache” here implies Graph.idIndex, etc.)
  • image/layer reference counting, to correctly handle remote usage vs. ‘docker rmi’ for instance
  • and more

And later:

  • shadow copying of read-only layers, to trade local space for performance
  • and vice versa, removal of duplicated layers (the “dedup”)
  • container inter-node migration
  • container HA failover
  • object storage as the alternative backend for docker images and layers (which are in fact immutable versioned objects, believe it or not).

Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.

7. Instead of Conclusion

In the end the one thing that makes it – all of the above – doable and feasible is the immutable nature of image layers and their unique and global naming via crypto-content-hashes.

Interoperable Image Storage

cubes

The idea is probably not new; bits and pieces of it have existed in multiple forms for many years and likely constitute some Platonic truth that is above and beyond any concrete implementation. This applies to image storage systems(*), object storage systems, file archival storage systems and even block storage. There’s often a content checksum that accompanies the (image, object, file, block) as part of the corresponding metadata – a checksum of the corresponding content that is used to validate the latter AND, at least sometimes, to address it – that is, to find its logical and/or physical location when reading/searching.

Historically those content checksums were 16-bit and 32-bit wide CRCs and similar; nowadays, however, more often than not storage systems use cryptographically secure, collision- and pre-image-attack-resistant SHA-2 and SHA-3, possibly Edon-R and Blake2 as well.

Now, an example. Let’s say Vendor A and Vendor B both store a given popular image – say, one of those Ubuntu LTS releases as per https://wiki.ubuntu.com/LTS. For instance, Vendor A would store this content as a file over blocks, while Vendor B would use an object backend to store the same image as a bunch of distributed chunks.

When storing the image, in addition to (and preferably in parallel with) checksumming blocks (A) or chunks (B), both vendors would compute the whole-image checksum: SHA-2 and/or SHA-3 for starters (NIST’s final report provides other viable alternatives). The latter must then become part of the stored-image metadata – an alias of the image’s name or, more exactly, of a specific version of the latter. Further, users may request this crypto-alias, or aliases, to be returned as part of the operation of storing the image, or later, in “exchange” for the stored version name.
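Here’s a Go sketch of computing the whole-image alias in parallel with per-chunk checksums; storeImage and its naive chunking are illustrative only, not any particular vendor’s implementation:

package main

import (
    "bytes"
    "crypto/sha256"
    "fmt"
    "io"
)

// storeImage sketches computing the whole-image checksum alongside per-chunk
// checksums: every chunk is hashed on its own (for the backend), while the
// same bytes are streamed through a running whole-image hash that becomes
// the vendor-neutral alias of this image version.
func storeImage(r io.Reader, chunkSize int) (chunkDigests []string, imageAlias string, err error) {
    whole := sha256.New()
    buf := make([]byte, chunkSize)
    for {
        n, err := r.Read(buf)
        if n > 0 {
            chunk := buf[:n]
            whole.Write(chunk) // feed the whole-image digest
            chunkDigests = append(chunkDigests, fmt.Sprintf("%x", sha256.Sum256(chunk)))
        }
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, "", err
        }
    }
    return chunkDigests, fmt.Sprintf("sha256:%x", whole.Sum(nil)), nil
}

func main() {
    image := bytes.Repeat([]byte("ubuntu-lts-image-bytes "), 1000)
    chunks, alias, _ := storeImage(bytes.NewReader(image), 4096)
    fmt.Println(len(chunks), "chunks, image alias", alias)
}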

Let’s see what this gives us.

  1. User can now perform
    Get(hash-type, crypto-hash alias)

    – on any vendor that supports the above, to uniquely retrieve and validate a given version of the image

  2. The like/dislike and trust/distrust feedback can be metadata-attached to the image’s cryptohash and thus becomes globally cumulative and available to all users
  3. Vendors can begin partnering to deduplicate common content, independently of their respective vendor-specific and often-proprietary backend mechanisms
  4. The content can be uniquely identified and end-to-end validated with 100% certainty

The long and short of it is that a secure crypto-hash of a given content can be used for vendor-neutral access to trusted images, which can further enable cross-vendor interoperability, global internet-wide deduplication, peer-to-peer networking with torrent-like services to distribute the same trusted self-validating content, and beyond.

See related: Global Namespace for Docker