Distributed/P2P computing

This page describes the basic theory behind various techniques for building distributed peer-to-peer software. What some of the main building blocks are, how they work as nontechnically & briefly as I can, what they’re useful for, and namedrop some implementations you might have heard of.

These techniques all solve different issues and in many cases can/should be used in conjunction.

Explicitly out-of-scope is cryptography, that’ll be discussed elsewhere. Despite how relevant hashfunctions are here.

Direct Connections

First off the most obvious solution is to send messages directly to the recipient without routing through a middleman who we don’t trust. Because why have a fully-capable routing infrastructure when you’re only sending messages to FAANG?

Unfortunately ISPs have refused to upgrade to IPv6 (which allows direct routing between up to 3.4*10^38, rather than 4.3billion, devices), leaving walls (“NAT”) in place for peer-to-peer (P2P) software to hack around. Which we’ve gotten pretty good at. Arguably in many cases making a system centralized is a hack around NATs! Also we need to lookup the address to mail messages to…

E.g. 1on1 videochats, Syncthing, Bittorrent

Gossip Networks (suggested by Izzy Swart)

Early attempts at addressing P2P’s need for looking up information involved shouting your query to the crowd for them to forward it to all their contacts until they’ve found the answer to send back. And if they data is invalid for some definition they can refuse to forward the message, allowing them to enforce some “invariants”.

Unfortunately gossip networks might fall over as they grow too large, though they do still have their uses. Included ones fundamental to the continued operation of the internet!

E.g. Named Data Networking

Distributed Hash Tables

In the early 2000’s we figured out a way to build distributed key-value stores using a technique called DHTs (e.g. memcached).

DHTs maintain an invariant that every peer knows the values (e.g. IP address) for the keys who’s hash is closest to it’s own identifier, so we can simply ask around to get rapidly closer & closer to the peer who actually knows.

This invariant is established by having each node query it’s own ID upon coming online.


DHTs are the techniqued used to store data P2P, though usually there’s layers of indirection to balance network costs & often fail to protect privacy. You may have heard of Bittorrent, GNUNet, IPFS, or Dat?

If a DHT’s keys are hash of their value (as in many of those examples) it’s called a Content Addressed Store.

DHTs were retrospectively added to earlier successful P2P like Bittorrent & Tor networks (creating Tor Hidden Services, which is usually what the media’s villifying when they say “dark web”) to replace centralized components. Though it took too long for GNU Jami to add it to SIP.

To address some false marketting: DHTs do lose data. The difference is that it’s more like a human forgetting something due to it becoming irrelevant to them than like invoking your Right To Be Forgotten on Google. Cryptographers have even proposed using this to share time-limited secrets!

Merkle Trees

Merkle trees, e.g. Git-era version control, are immutable datastructures (like you might write in e.g. Haskell) where pointers are represented as a hash of the addressed value. Which allows you to check you’ve got the correct dereferenced value from another peer.

In Git this serves to cut down on syncing costs. IPFS, & I think Dat, is characterized by combining a Content Addressed Store with Merkle trees to store larger structured data. And they’re the titular chain in blockchains.

CRDTs

The technique that really excites me, as well as Ink & Switch, are Conflict-free Replicated Data Types (CRDTs, e.g. Cryptpad)!! They are inversely-linked immutable datastructures which blur the lines between the data & the operations which constructed it, making trivial to merge divirging documents.

CRDTs are how you do P2P collaborative document editing! (Google Docs is built on a precursor called Operational Transforms)

Maybe I need practical example to explain them… Plain text editing?

E.g. Trellis, couldn’t find a widely-known implementation.


You can implement a collaborative plaintext editor by representing the text as an inversely-linked list which you flatten for display by sorting by timestamp from e.g. a vector clock. Each character (letter, etc) points to the one before it. You delete a letter by appending a new “tombstone” pointing to it (yes, CRDTs store the whole undo history, which can become wasteful).

To collaborate with others send these insert/delete operations directly to be merged into their copy at whatever rate you like.

CRDT History (written by Yaaps)

CRDT was Commutative Replicated Data Types in the original paper. That CRDT only covered situations where operations were agnostic to order of application. So… <I| is your identity operator (an empty document or change), <A> represents a shared reference state for 2 or more commutative operations |+b> and |+c>. You don’t have a fan out of state because <A> |+b> |+c> and <A> |+c> |+b> converge. At the point of convergence, there’s a common reference <D> that replaces <A> as the base reference

When you introduce a synchronization point, then you can use time information in additional strategies, like “last write wins” to allow non-commutative operations. At some point CRDT got backronymed to Conflict-free Replicated Data Type.

Proof of Work

P2P networks may need to discourage certain (spammy?) activities, and an effective way to do that is to purposefully increase the computational cost of those activities. To require some ammount of extra work (often involving computing new hashes until you find a requested pattern) before the message is accepted.

Counterintuitively, making it more energy-intensive to send an email could theoretically help reduce our energy usage.

Blockchains

Finally by now everyone has heard of blockchains (e.g. Bitcoin-era cryptocurrencies), many even have strong feelings for or against it. But what are they for?

They get everyone to agree on a order of events so large networks can agree on who called dibs on a name or whether you’ve already spent your last *coin. Allowing peers forwarding messages amongst themselves to enforce more invariants by choosing a random “block”, via a semi-random peer (traditionally using Proof of Work, leading to the energy inefficiency criticisms), to be added to the front of the merkle tree.

They don’t actually store data for retrieval beyond what is necessary for validating future blocks.

Acknowledgements

This page incorporates contributions/feedback from: