Dat Protocol

DEP-0008: Multi-Writer

Title: DEP-0008: Multi-Writer

Short Name: 0008-multiwriter

Type: Standard

Status: Draft (as of 2018-07-06)

Github PR: Draft

Authors: Bryan Newbold, Stephen Whitmore, Mathias Buus

Summary

Multi-Writer is a set of schema, API, and feature extentions to allow multiple agents (users, devices, or software) to write to the same hyperdb database. By building on top of this abstraction layer, future versions of hyperdrive and Dat will gain these features.

Mechanisms for distributed consistency and granting trust are specified here; the need for merge conflict algorithms and secure key distribution are mentioned but specific solutions are not specified.

This DEP forms the second half of the hyperdb specification; the first half covered only the key/value database aspects of hyperdb.

Motivation

The current hypercore/Dat ecosystem currently lacks solutions for two fundamental use cases:

Access to a single secret key is currently required to make any change to a hypercore feed, and it is broadly considered best practice not to distribute secret keys between multiple users or multiple devices. In fact, the current hypercore implementation has no mechanism to resolve disputes or recover if multiple agents used the same secret key to append to the same feed.

Solutions to these two use cases are seen as essential for many current and future Dat ecosystem applications.

Concepts, Behavior, and Usage

The multi-writer features of hyperdb are implemented by storing and replicating the contributions of each writer in a separate hypercore feed. This specification concerns itself with the details of how changes from multiple feeds (which may be written and replicated concurrently or asynchronously) are securely combined to present a unified key/value interface.

The following related concerns are explicitly left to application developers to design and implement:

Before we go any further, a few definitions:

Feed: A hypercore feed: an append-only log of Entries, which can be arbitrary data blobs.

Database: in this context, a Hyperdb key/value database. Built from several Feeds (two Feeds per Writer).

Writer: a user (or user controlled device or software agent) that has a distinct feed with a public/private key pair, and thus the ability to append hypercore entries and "write" changes to their version of the database.

Original Writer: the writer who created a given hyperdb database in the form of the Original Feed. The public key of the original feed is the one used to reference the database as a collection of feeds (eg, for the purpose of discovery).

At a high level, multi-writer hyperdb works by having existing authorized writers (starting with the original writer) include authorization of new writers by appending metadata to their own feed which points to the new feeds (by public key). Each entry in each writer's feed contains "clock" metadata that records the known state of the entire database (all writers) seen from the perspective of that writer at the time they created the entry, in the form of "clock" version pointers. This metadata (a "vector clock") can be used by other writers to resolve (or at least identify) conflicting content in the database. The technical term for this type of system is a "Conflict-free replicated data type" (CRDT), and specifically an "Operation-based" (as opposed to "State-based") CRDT.

Core API

A "node" is a data structure representing a database entry, including the key, value, and feed that the entry is commited to.

db.get(key) (as described in the hyperdb DEP) returns an array of nodes. If it is unambiguous what the consistent state of a key is, the array will have only that one value. If there is a conflict (ambiguity), multiple nodes will be returned. If a key has unambiguously been removed from the database, a "null" or empty datatype is returned. If one branch of a conflict has a deletion (but at least one of the others does not), a node with the deleted flag will be returned; note that such "tombstone" nodes can still have a value field, which may contain application-specific metadata (such as a self-reported timestamp), which may help resolve the conflict.

If multiple nodes are returned from a get, the writer should attempt to merge the values (or chose one over the other) and write the result to the database with db.put(key, value). Client libraries can make this process more ergonomic by accepting a helper function to select between multiple nodes. Libraries can also offer an option to either directly return the value of a single node (instead of the node itself), or raise an error; this is likely to be more ergonomic for applications which do not intend to support multiple writers per database.

db.authorize(key) will write metadata to the local feed authorizing the new feed (corresponding to key) to be included in the database. Once authorized, a feed may further authorize additional feeds (recursively).

db.authorized(key) (returning a boolean) indicates whether the given key is an authorized writer to the hyperdb database.

At the time of this DEP there is no mechanism for revoking authorization.

Scaling

There is some overhead associated with each "writer" added to the feed, impacting the number of files on disk, memory use, and the computational cost of some lookup oprations. The design should easily accomodate dozens of writers, and should scale to 1,000 writers without too much additional overhead. Note that a large number of writers also implies a larger number and rate of append operations, and additional network connections, which may cause scaling issues on their own. More real-world experience and benchmarking is needed in this area.

Implementation Details

The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" message types (as specified in the hyperdb DEP) are:

message Entry {
  required string key = 1;
  optional bytes value = 2;
  optional bool deleted = 3;
  required bytes trie = 4;
  repeated uint64 clock = 5;
  optional uint64 inflate = 6;
}

message InflatedEntry {
  message Feed {
    required bytes key = 1;
  }

  required string key = 1;
  optional bytes value = 2;
  optional bool deleted = 3;
  required bytes trie = 4;
  repeated uint64 clock = 5;
  optional uint64 inflate = 6;
  repeated Feed feeds = 7;
  optional bytes contentFeed = 8;
}

The fields of interest for multi-writer are:

When serialized on disk in a SLEEP directory:

Feeds and Vector Clocks

At any point in time, each writer has a potentially unique view of the "current" state of the database as a whole; this is the nature of real-world distributed systems. For example, a given write might have the most recent appends from one peer (eg, only seconds old), but be missing much older appends from another (eg, days or weeks out of date). By having each writer include metadata about their percieved state of the system as a whole in operations to their Feed, all writers are able to collectively converge on an "eventually consistent" picture of the database as whole (this process will be described in the next section).

A writer's "current known state" of the database consists of the set of active Feeds, and for each the most recent entry sequence number ("version"). This state can be serialized as an array of integers, refered to as a vector clock.

Each put() operation on the database appends a node to the writer's local feed, and contains the writer's vector clock as observed at that time. InflatedEntry nodes also contain a list of all known authorized Feeds; inflated nodes only need to be written when the Feed list changes. Every non-inflated entry contains a pointer back to the most recent inflated entry; inflated entries themselves contain a pointer back to the previous inflated entry (the first inflated entry has a null pointer). Elements of a vector clock are ordered by the Feed list from the corresponding Inflated entry.

By convention, the order of Feed lists is to start with the writer's local feed first, then proceed by the order in which Feeds were discovered. Note that this ordering is not consistent across writers, only within the same feed.

As an example, if a node (non-inflated entry) had a vector clock of [0, 2, 5], that would mean:

Multi-Feed Aware hyperdb

The hyperdb DEP specifies the procedures for lookup (get()) and append (put()) operations to the database, as well as binary encoding schemes for entry messages.

Note that the trie encoding specifies pointers in a (feed, entry) pair format. The feed integer is an index into the most recent Feed list (found in the most recent inflated entry; see the last section). When working with a multi-writer hyperdb database, simply look up entries in the appropriate feed, instead of only looking in the current feed. The next section ("Consistent History") describes which entry (or entries) to start with instead of simply assuming the most recent entry from the local feed.

Consistent History

The set of all appended nodes in all feeds of a hyperdb, and all the vector clock pointers between them, forms a "directed acyclic graph" (DAG). Any node which does not have another node pointing to it is called a "head" (this terminology is similar to that used in git). At any point in time, an observed copy of a database has one or more heads, each representing the top of a tree-like graph. In the trivial case of a non-multi-writer hyperdb, there is always only a single head: the most recent entry in the local feed. Just after appending to the local feed, there is also always a single head, because that node's vector clock will reference all know most recent entries from other feeds. It is only when nodes are appended by separate writers who did not know of the others' concurrent action (and then these changes are replicated) that there are multiple heads.

When operating on the database (eg, executing a get() operation), all heads must be considered. The lookup proceedure documented in the hyperdb DEP must be repeated for each head, and nodes returned representing the set of all unique values.

The situation where a get() operation multiple heads returns different values for the same key is called a "conflict" and requires a "merge" to resolve. Some writer (either a human being or application-layer code) must decide on the correct value for the key and write that value as a new entry (with a vector clock that includes the previous heads). The procedure for chosing the best value to use in a given conflict is sometimes easy to determine, but is impossible to determine algorithmically in the general case. See the "Usage" section for more details.

Examples

Let's say Alice starts a new hyperdb and writes two key/value entries to it:

// Operations
Alice: db.put('/foo/bar', 'baz')
Alice: db.put('/foo/2',   '{"json":3}')

// Alice's Feed
0 (key='/foo/bar', value='baz',
   vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry)
1 (key='/foo/2', value='{"json":3}',
   vector_clock=[0], inflated=0)

// Graph
Alice:  0  <---  1

The vector clock at seq=1 points back to seq=0.

Next Alice authorizes Bob to write to the database. Internally, this means Alice writes an Inflated entry to her feed that contains Bob's Feed (identified by his public key) in her feed list.

// Operations
Alice: db.authorize('b0b123...')

// Alice's Feed
0 (key='/foo/bar', value='baz',
   vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry)
1 (key='/foo/2', value='{"json":3}',
   vector_clock=[0], inflated=0)
2 (key=null, value=null,
   vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry)

// Graph
Alice: 0  <---  1  <---  2

Bob writes a value to his feed, and then Alice and Bob sync. The result is:

// Operations
Bob: db.put('/a/b', '12)

// Alice's Feed
0 (key='/foo/bar', value='baz',
   vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry)
1 (key='/foo/2', value='{"json":3}',
   vector_clock=[0], inflated=0)
2 (key=null, value=null,
   vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry)

// Bob's Feed
0 (key='/a/b', value='12',
   vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry))

// Graph
Alice: 0  <---  1  <---  2
Bob  : 0

Notice that none of Alice's entries refer to Bob's, and vice versa. Neither has written any entries to their feeds since the two became aware of each other. Right now there are two "heads" of the graph: Alice's feed at seq 2, and Bob's feed at seq 0. Any get() operations would need to descend from both heads, though in this situation there would be no conflicts as the keys in the two feeds are disjoint.

Next, Alice writes a new value, and her latest entry will refer to Bob's:

// Operations
Alice: db.put('/foo/hup', 'beep')

// Alice's Feed
0 (key='/foo/bar', value='baz',
   vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry)
1 (key='/foo/2', value='{"json":3}',
   vector_clock=[0], inflated=0)
2 (key=null, value=null,
   vector_clock=[1, null], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry)
3 (key='/foo/hup', value='beep',
   vector_clock=[2,0], inflated=2)

// Bob's Feed
0 (key='/a/b', value='12',
   vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry))


// Graph
Alice: 0  <---  1  <---  2  <--/  3
Bob  : 0  <-------------------/

Alice's latest feed entry now points to Bob's latest feed entry, and there is only one "head" in the database. This means that any get() operations only need to run once, starting at seq=3 in Alice's feed.

Security and Privacy Concerns

As noted above, there is no existing mechanism for removing authorization for a feed once added, and an authorized feed may recursively authorize additional feeds. There is also no mechanism to restrict the scope of an authorized feed's actions (eg, limit to only a specific path prefix). This leaves application designers and users with few tools to control trust or access ("all or nothing"). Care must be taken in particular if self-mutating software is being distributed via hyperdb, or when action may be taken automatically based on the most recent content of a database (eg, bots or even third-party tools may publish publicly, or even take real-world action like controlling an electrical relay).

There is no mechanism to remove malicious history (or any history for that matter); if an authorized (but hostile) writer appends a huge number of key operations (bloating hyperdb metadata size), or posts offensive or illegal content to a database, there is no way to permanently remove the data without creating an new database.

The read semantics of hyperdb are unchanged from hypercore: an actor does not need to be "authorized" (for writing) to read the full history of a database, they only need the public key.

As noted in other DEPs, a malicious writer can potentially execute a denial of service (DoS) attack by appending hyperdb entries that for a cyclic loop of references.

Drawbacks

Mutli-writer capability incurs a non-trivial increase in library, application, and user experience complexity. For many applications, collaboration is an essential feature, and the complexity is easily justified. To minimize complexity for applications which do not need multi-writer features, implementation authors should consider configuration modes which hide the complexity of unused features. For example, by having an option to returning a single node for a get() (and throw an error if there is a conflict), or a flag to throw an error if a database unexpectedly contains more than a single feed.

Two tasks (conflict merges and secure key distribution) are left to application developers. Both of these are Hard Problems. The current design mitigates the former by reducing the number of merge conflicts that need to be handled by an application (aka, only the non-trivial ones need to be handled), and implementation authors are encouraged to provide an ergonomic API for writing conflict resolvers. The second problem (secure key distribution) is out of scope for this DEP. It is hoped that at least one pattern or library will emerge from the Dat ecosystem such that each application author doesn't need to invent a solution from scratch.

Rationale and alternatives

Design goals for hyperdb (including the multi-writer feature) included:

If a solution for core use cases like collaboration and multi-device synchronization is not provided at a low level (as this DEP provides), each application will need to invent a solution at a higher level, incuring duplicated effort and a higher risk of bugs.

As an alternative to CRDTs, Operational Transformation (OT) has a reputation for being more difficult to understand and implement.

Unresolved questions

What is the actual on-disk layout (folder structure), if not what is documented here?

The local feed's sequence number could skipped from vector clocks, because it's implied by the sequence number of the hypercore entry itself. Same with the key in the feed list (for inflated entries). In both cases, the redundant data is retained for simplicity.

If there are multiple heads, but they all return the same value for a get() operation, how is it decided which node will be returned? AKA, values are the same, but node metadata might not be (order of vector clock, etc).

Suspect that some details are off in the example: shouldn't the InflatedEntry authorizing a new feed include a vector clock reference to a specific seq in that feed? Should new local (not yet authorized) feeds reference their source in an initial InflatedEntry (eg, Bob at seq=0)? Should the first InflatedEntry in a feed point to itself in it's vector clock?

Changelog

As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb nodejs module on github, which includes multi-writer features and is the basis for this DEP.

Jim Pick (@jimpick) has been an active contributor working out multi-writer details.