A chain is an image we all know. It represents strength and reliability. Bitcoin has made use of such imagery in the implementation of a blockchain. A blockchain is a chain of links where each link couples transactions (blocks of transactions) in a reliable and cryptographically secured manner.
Data chains represent, instead, links that couple together data identifiers. A data identifier is not the data itself. Just as a blockchain does not hold bitcoins, a data chain does not hold data.
This post gives an overview of data chains. There is also an RFC covering some basic uses of DataChains, as well as a codebase implementing DataChains, data handling (recognition) and long-term storage.
What use is that then? If these things hold no data then what use are they?
This is a critical question, and one worth answering before writing the idea off as irrelevant. If we can be assured a transaction happened in a blockchain, then we know where a bitcoin should exist, as if it were a real thing. It is the same with a data chain: we can be sure a piece of data should exist, and where it should exist. With data chains, however, the identified data is real (documents, videos, etc.). That is, if we know these files should exist and can identify them, then we just need to fetch the data and validate it. With a network that both stores and validates data and their identifiers, we gain a lot in efficiency and simplicity compared with blockchains, which cannot store significant amounts of data such as files. Data chains would additionally allow cross-network/blockchain patterns, but one must ask why do that; duplication is not efficient.
Therefore, if a block identifies indisputable information about a file, such as (naively) a hash of its content, then we can be presented with data and compare it to a valid block. If the block says the data is recognised, then it was created by whatever mechanism the network agreed to, and the network assumes responsibility for the data.
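To make that concrete, here is a minimal sketch of the naive case described above: a data block that holds only a content hash, and a validation step that compares presented data against it. The block layout (`hash`, `size`) is purely illustrative, not the actual DataChains format.

```python
import hashlib

def make_data_block(content: bytes) -> dict:
    """Create a (naive) data identifier block: a content hash plus metadata.
    The real network would agree this block via group consensus."""
    return {"hash": hashlib.sha256(content).hexdigest(), "size": len(content)}

def validate(content: bytes, block: dict) -> bool:
    """Check that presented data matches a previously agreed block."""
    return hashlib.sha256(content).hexdigest() == block["hash"]

block = make_data_block(b"hello world")
assert validate(b"hello world", block)       # recognised: this data belongs here
assert not validate(b"tampered data", block) # unrecognised content is rejected
```

The point is that the chain never stores `content`; it stores only the identifier, and anyone holding the identifier can later verify the real data against it.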
Now we can historically validate that data belongs on the network and that it was paid for or otherwise accepted, i.e. it is valid network data. This is a great step. Some will think, "oh, I can do that with XYZ", but hang on for a few minutes.
Here, we separate from a standard single-truth chain and look at chain parts: a decentralised chain on a decentralised network. A chain where only a very small number of nodes know the validity of some small sets of data, but the network as a whole knows everything and all data. OK, we can see now this is a bit different.
We now need to look at the chain a little closer; here is another picture to help. We can see here that there seem to be links and blocks. This is an important distinction to cement in our thinking. In a blockchain, for instance, the link is simple (simple is good; in security, the simpler the stronger, in many cases).
Here, though, a link is another agreement block. These links are actually a collection of names and signatures of the group who agreed on the chain at this time. Each link must contain at least a majority of the previous link's members. With every change to the group, a new link is created and added to the chain. Please see the consensus overview blog for an overview of these groups and their importance.
Between these links lie the data blocks; these represent data identifiers (hash, name, type, etc.) of real data. Again, these are signed by a majority of the previous link (in actual fact they can be slightly out of order, as required on an out-of-order network, or planet 🙂 ).
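The two block types and the majority rule just described can be sketched as follows. This is a simplification under stated assumptions: real links carry cryptographic signatures, whereas this sketch stands in plain member names for them, and the field names are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    """A link block: the group membership at a point in time, with signatures."""
    members: set
    signatures: set = field(default_factory=set)

@dataclass
class DataBlock:
    """A data identifier (hash/name/type), signed by the preceding link's group."""
    identifier: str
    signatures: set = field(default_factory=set)

def has_majority(block_sigs: set, prev_link_members: set) -> bool:
    """A block is valid once a majority of the previous link's members signed it."""
    return len(block_sigs & prev_link_members) * 2 > len(prev_link_members)

prev = Link(members={"A", "B", "C"})
data = DataBlock("sha256:abc123", signatures={"A", "C"})
assert has_majority(data.signatures, prev.members)      # 2 of 3 is a majority
assert not has_majority({"A"}, prev.members)            # 1 of 3 is not
```

Note how validity of a data block depends only on the link before it, which is what lets the chain be checked piecewise rather than as one global structure.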
Now the picture of a DataChain should be building. But how does it split into a decentralised chain?
Remember also that we have a notion of groups of nodes. So let's take a group size of 2 as an example.
The process would be:
- Node 1 starts the network; there is a link with one member.
- Node 2 starts and gets the chain so far (it's small, with only node 1 listed). Node 1 then sends a link to node 2, signed by node 1.
- Node 2 then sends a link to node 1, signed by node 2.
So now the chain has two links: one with node 1 alone, and the next with nodes 1 and 2, both signed by each other. This just continues, so no need for me to bore you here.
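The bootstrap steps above can be sketched as a toy chain. Real links exchange cryptographic signatures; this sketch again uses plain node names in their place, and the dictionary layout is an assumption for illustration only.

```python
# A toy bootstrap of the chain described above (group size 2).
chain = []

# Node 1 starts the network: a link with one member, signed by itself.
chain.append({"members": {"node1"}, "signed_by": {"node1"}})

# Node 2 joins, fetches the chain so far, and the two nodes exchange
# signed links, producing the second link in the chain.
chain.append({"members": {"node1", "node2"}, "signed_by": {"node1", "node2"}})

assert len(chain) == 2
assert chain[0]["members"] == {"node1"}             # the "genesis" link
assert chain[1]["signed_by"] == {"node1", "node2"}  # signed by each other
```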
However, then node 4 joins, and assuming all nodes have a purely even distribution (meaning two nodes' addresses start with 0 and the other two nodes' addresses start with 1), the chain splits! Two nodes go down the 0 branch and two go down the 1 branch. The chain has split, but maintains the same "genesis" block: the link with node 1 only. Between these links the data blocks are inserted as data is added (Put), edited (Post) or deleted (Delete) from the network; each data block, again, is signed by a majority of the previous link block (to be valid).
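The split itself is just a partition on address prefixes. A minimal sketch, assuming the evenly distributed four-node example above (the bit-string addresses here are illustrative, not real network identifiers):

```python
# Four nodes with evenly distributed addresses: the group divides on the
# first bit of each address, one branch per prefix.
nodes = {"node1": "00", "node2": "01", "node3": "10", "node4": "11"}

branch0 = {n for n, addr in nodes.items() if addr.startswith("0")}
branch1 = {n for n, addr in nodes.items() if addr.startswith("1")}

assert branch0 == {"node1", "node2"}
assert branch1 == {"node3", "node4"}
# Both branches keep the same genesis link (node 1 alone) at their root.
```

As the network grows, the same rule applies recursively on longer prefixes, which is what makes the chain decentralised rather than a single global ledger.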
So as more nodes join, this process continues, with the chain splitting as the network grows (for detail, see the RFC and code). This allows some very powerful features, which we will not dive into too deeply here, but to name a few as examples:
- Nodes can be shown to exist on the network (i.e. not fake nodes).
- Nodes can prove group membership.
- Nodes can have a provable history on the network.
- Nodes can restart and securely republish data that the network can agree should exist.
- Network can react to massive churn and still recover.
- Network can recover from complete outage.
- Network can handle open ledgers for publicly verifiable data.
- Network can individually remember a transaction (so for instance a single safecoin transaction can be remembered as a receipt, at a cost though).
- Network can handle versioning of any data type (each version is paid for, though).
- Fake nodes cannot forge themselves onto a chain without actually taking over a group (meaning they cannot be fake nodes 🙂 ).
- As data blocks are held between verifiable links, data validity is guaranteed.
- As data is guaranteed, some nodes can hold only identifiers and not the data, or a small subset of the data.
- As not all nodes are required to store data, churn events will produce significantly less traffic, since not all nodes need all data all the time.
- The network can handle nodes of varying capabilities and use the strongest node in a group to handle the action it is strongest with (i.e. nodes with large storage, store lots, nodes with high bandwidth transfer a lot etc.).
- Archive nodes become a simple state of a capable node that will be rewarded well in safecoin, but it has to fight for its place in the group; too many reboots or missed calls and others will take its place.
- Nodes can be measured and ranked easily by the network (an important security capability).
There is a lot more this pattern allows, an awful lot more, such as handling massive growth or an immediate contraction in the number of nodes. It is doubtful the full capability of such a system can be quantified easily, and certainly not in a single blog post like this, but now it's time to imagine what we can do.
Moving forward, we need many things to happen. These include, but are not limited to:
1. Design in detail must be complete for phase 1 (data security and persistence).
2. Open debates, presentations and discussions must take place.
3. Code must be written.
4. Code must be tested.
5. Integration into the existing system.
6. End-to-end testing.
7. Move to public use.
Of these, points 1, 2, 3 & 4 are ongoing right now. Point 5 requires changes to the existing SAFE routing table, starting with an RFC. Point 4 will be enhanced as point 2 gets more attention and input. Point 6 is covered by community testing, and point 7 is the wider community tests (i.e. the Alpha stages).
Data chains appear to be a natural progression for decentralised systems. They allow data of any type, size or format to be looked after and maintained in a secure and decentralised manner: not only the physical data itself (very important), but also the validity of such data on the network.
Working on the data chains component has been pretty tough with all the other commitments a typical bleeding-edge project has to face, but it looks like it has been very much worth the effort and determination. The pattern is so simple (remember, simple is good) that when I tried to tweak it for different problems I thought might happen, the chain fixed itself. So it is a very simple bit of code really, but it provides an incredible amount of security to a system, as well as a historical view of the network, republish capability, etc., and all with ease. The single biggest mistake in this library/design would be to attempt specialisation for any special situation; almost every time, it takes care of itself, at least so far.
This needs a large-scale simulation, which we will certainly do in MaidSafe prior to user testing, but it seems to provide a significant missing link in many of today's networks and decentralised efforts. I hope others find this as useful, intriguing and interesting as I do.
We think there will be many posts depicting various aspects of this design pattern over the next while, as we gain the ability to explain what it offers in many areas (i.e. when we understand it well enough to explain it easily).