Data chains: what? why? how?

What?


A chain is an image we all know. It represents strength and reliability. Bitcoin has made use of such imagery in the implementation of a blockchain. A blockchain is a chain of links where each link couples transactions (blocks of transactions) in a reliable and cryptographically secured manner.

[Image: chain]

Data chains instead represent links that couple together data identifiers. A data identifier is not the data. Just as the blockchain does not hold bitcoins, a data chain does not hold data.

This post gives an overview of data chains. There is also an RFC covering some basic use of DataChains, and a codebase to implement DataChains as well as data handling (recognition) and long term storage.

Why?


What use is that then? If these things hold no data then what use are they?

This is a critical question to answer, and one to resist writing off as irrelevant just yet. If we can be assured a transaction happened in a blockchain, then we know where a bitcoin should exist, as if it were a real thing. It is the same with a data chain: we can be sure a piece of data should exist and where it should exist. With data chains, however, the identified data is real (documents, videos, etc.). That is, if we know these files should exist and can identify them, then we just need to get the data and validate it. With a network that both stores and validates data and their IDs, we gain a lot in efficiency and simplicity compared to blockchains, which cannot store significant amounts of data such as files. Data chains would additionally allow cross network/blockchain patterns, but one must ask why do that, when duplication is not efficient?

Data handling 

Therefore, if a block identifies indisputable information about a file, such as (naively) a hash of the content, then we can be presented with data and compare it to a valid block. If the block says the data is recognised, then the block was created by whatever mechanism the network agreed on to assume responsibility for the data.
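As a rough illustration of that comparison, here is a minimal Python sketch. The DataBlock type, its field names and the choice of SHA-256 are assumptions made for this example only; they are not taken from the RFC or the codebase.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBlock:
    """Hypothetical data block: an identifier for the data, never the data."""
    name: str
    content_hash: str  # hex SHA-256 of the content (assumed hash choice)


def make_block(name: str, content: bytes) -> DataBlock:
    """Build the identifier the chain would hold for this content."""
    return DataBlock(name=name, content_hash=hashlib.sha256(content).hexdigest())


def matches_block(block: DataBlock, presented: bytes) -> bool:
    """True if the presented bytes are the data this block identifies."""
    return hashlib.sha256(presented).hexdigest() == block.content_hash


block = make_block("holiday.mp4", b"original video bytes")
print(matches_block(block, b"original video bytes"))  # True
print(matches_block(block, b"tampered video bytes"))  # False
```

Note that the block stays the same size however large the file is, which is why holding identifiers is so much cheaper than holding data.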

Now we can historically validate that data belongs on the network and that it was paid for or otherwise accepted, i.e. it is valid network data. This is a great step. Some will think, "oh, I can do that with XYZ", but hang on for a few minutes.

Network handling

Here, we separate from a standard one-truth chain and look at chain parts, or a decentralised chain on a decentralised network. This is a chain where only a very small number of nodes know the validity of some small sets of data, but the network as a whole knows everything and all data. OK, we can see now this is a bit different.

We now need to look at the chain a little closer; here is another picture to help. We can see here that there seem to be links and blocks. This is an important issue to cement in our thinking. In a blockchain, for instance, the link is simple (simple is good; in security, the simpler the stronger in many cases).

[Figure: data chain diagram showing links and blocks]

Here though, a link is another agreement block. These links are actually a collection of names and signatures of the group who agreed on the chain at this time. Each link must contain at least a majority of the previous link's members. With every change to the group, a new link is created and added to the chain. Please see the consensus overview blog for an overview of these groups and their importance.
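The majority rule between consecutive links can be sketched in a few lines of Python. This is only an illustration: the Link type is hypothetical, signatures are left out, and "majority" is read here as "more than half", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical link block: the names of the group that agreed at this point.

    Real links also carry signatures; they are omitted from this sketch.
    """
    members: frozenset


def is_valid_successor(previous: Link, candidate: Link) -> bool:
    """A new link must contain more than half of the previous link's members."""
    carried_over = len(previous.members & candidate.members)
    return carried_over * 2 > len(previous.members)


def links_are_valid(links: list) -> bool:
    """Check every consecutive pair of links in a chain."""
    return all(is_valid_successor(a, b) for a, b in zip(links, links[1:]))


first = Link(frozenset({"A", "B", "C"}))
joined = Link(frozenset({"A", "B", "C", "D"}))   # D joins, everyone else stays
takeover = Link(frozenset({"A", "X", "Y"}))      # only A is carried over

print(is_valid_successor(first, joined))    # True
print(is_valid_successor(first, takeover))  # False
print(links_are_valid([first, joined]))     # True
```

The point of the rule is that a group can only change gradually: a new set of names that shares too few members with the previous link cannot be attached to the chain.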

Between these links lie the data blocks, which represent data identifiers (hash, name, type etc.) of real data. Again, these are signed by a majority of the previous link (in actual fact they can be slightly out of order, as required in an out of order network, or planet 🙂 ).
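Here is a minimal Python sketch of counting signatures on a data block against the previous link. To stay within the standard library it uses HMAC with made-up per-node secrets as a stand-in for real public-key signatures, so NODE_KEYS, sign and block_is_valid are all assumptions for illustration, not the network's actual scheme.

```python
import hashlib
import hmac

# Hypothetical per-node secrets. A real network would use public-key
# signatures; HMAC is only a stand-in so the sketch runs with the stdlib.
NODE_KEYS = {"A": b"key-a", "B": b"key-b", "C": b"key-c"}


def sign(node: str, identifier: bytes) -> bytes:
    """Stand-in 'signature' of a data identifier by one node."""
    return hmac.new(NODE_KEYS[node], identifier, hashlib.sha256).digest()


def block_is_valid(identifier: bytes, signatures: dict, link_members: set) -> bool:
    """Valid when more than half of the previous link's members signed it."""
    good = sum(
        1
        for node, sig in signatures.items()
        if node in link_members
        and node in NODE_KEYS
        and hmac.compare_digest(sig, sign(node, identifier))
    )
    return good * 2 > len(link_members)


identifier = hashlib.sha256(b"some file").digest()
members = {"A", "B", "C"}

two_signers = {n: sign(n, identifier) for n in ("A", "B")}
one_signer = {"A": sign("A", identifier)}
forged = {"A": sign("A", identifier), "B": b"not a signature"}

print(block_is_valid(identifier, two_signers, members))  # True
print(block_is_valid(identifier, one_signer, members))   # False
print(block_is_valid(identifier, forged, members))       # False
```

Signatures from names outside the previous link are simply ignored, so a node that is not in the group cannot help a block reach a majority.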

Now the picture of a DataChain should be building. But how does it split into a decentralised chain?

Splitting up

The easy way here is to consider just the above chain picture, but also remember back to an XOR network, or plain binary tree; here is one as a reminder.

[Image: binary tree]

Remember also that we have a notion of groups of nodes. So let's take a group size of 2 as an example.

The process would be:

  • Node 1 starts the network; there is a link with one member.
  • Node 2 starts and gets the chain so far (it’s small, with only node 1 listed). Node 1 then sends a link to node 2 signed by node 1.
  • Node 2 then sends a link to node 1 signed by node 2.

So now the chain has two links, one with node 1 alone and the next with nodes 1 and 2, both signed by each other. This just continues, so no need for me to bore you here.

However, then node 4 joins and, assuming all nodes have a purely even distribution (meaning two nodes' addresses start with 0 and the other two nodes' addresses start with 1), the chain splits! Two nodes go down the 0 branch and two go down the 1 branch. The chain has split, but maintains the same “genesis” block: the link with node 1 only. Between these links the data blocks are inserted as data is added (Put), edited (Post) or deleted from the network; each data block is again signed by the majority of the previous link block (to be valid).
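To make the split concrete, here is a minimal Python sketch. The 4-bit addresses and the split_by_prefix helper are made up for illustration, and the intermediate links between genesis and the split are left out; the RFC and code remain the reference for how the real chain splits.

```python
from collections import defaultdict


def split_by_prefix(addresses: dict, prefix_len: int) -> dict:
    """Group node names by the first prefix_len bits of their binary address."""
    sections = defaultdict(list)
    for node, bits in addresses.items():
        sections[bits[:prefix_len]].append(node)
    return dict(sections)


# Hypothetical 4-bit addresses with a perfectly even distribution.
addresses = {
    "node 1": "0010",
    "node 2": "1100",
    "node 3": "0111",
    "node 4": "1001",
}

sections = split_by_prefix(addresses, prefix_len=1)
print(sections)
# {'0': ['node 1', 'node 3'], '1': ['node 2', 'node 4']}

# Both branches keep the same genesis link (node 1 alone); only the
# links added after the split differ. Intermediate links are omitted.
genesis = ("node 1",)
branches = {prefix: [genesis, tuple(nodes)] for prefix, nodes in sections.items()}
print(branches["0"][0] == branches["1"][0])  # True
```

As the network grows, the same grouping can be repeated with a longer prefix, which is how the chain keeps dividing along the branches of the binary tree.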

So as more nodes join, this process continues, with the chain splitting as the network grows (for detail see the RFC and code). This allows some very powerful features, which we will not dive into too deeply, but to name a few as examples:

  • Nodes can be shown to exist on the network (i.e. not fake nodes).
  • Nodes can prove group membership.
  • Nodes can have a provable history on the network.
  • Nodes can restart and republish data securely that the network can agree should exist.
  • Network can react to massive churn and still recover.
  • Network can recover from complete outage.
  • Network can handle Open ledgers for publicly verifiable data.
  • Network can individually remember a transaction (so for instance a single safecoin transaction can be remembered as a receipt, at a cost though).
  • Network can handle versioning of any data type (each version is paid for though).
  • Fake nodes cannot forge themselves onto a chain without actually taking over a group (meaning they cannot be fake nodes 🙂 ).
  • As data blocks are held between verifiable links then data validity is guaranteed.
  • As data is guaranteed some nodes can hold only identifiers and not the data, or a small subset of data.
  • As not all nodes are required to store data, churn events will produce significantly less traffic as not all nodes need all data all the time.
  • The network can handle nodes of varying capabilities and use the strongest node in a group to handle the action it is strongest with (i.e. nodes with large storage, store lots, nodes with high bandwidth transfer a lot etc.).
  • Archive nodes become a simple state of a capable node that will be rewarded well in safecoin, but has to fight for its place in the group; too many reboots or missed calls and others will take its place.
  • Nodes can be measured and ranked easily by the network (an important security capability).

There is a lot more this pattern allows, an awful lot more, such as massive growth or an immediate contraction in the number of nodes. It is doubtful the full capability of such a system can be quantified easily, and certainly not in a single blog post like this, but now it’s time to imagine what we can do.

How? 

Moving forward, we need many things to happen. These include, but are not limited to:

  1. Design in detail must be complete for phase 1 (data security and persistence).
  2. Open debates, presentations and discussion must take place.
  3. Code must be written.
  4. Code must be tested.
  5. Integration into existing system.
  6. End-to-end testing.
  7. Move to public use.

Of these, points 1, 2, 3 and 4 are ongoing right now. Point 5 requires changes to the existing SAFE routing table, starting with an RFC. Point 4 will be enhanced as point 2 gets more attention and input. Point 6 is covered by community testing and point 7 is the wider community tests (i.e. Alpha stages).

Conclusion

Data chains would appear to be a natural progression for decentralised systems. They allow data of any type, size or format to be looked after and maintained in a secure and decentralised manner. This covers not only the physical data itself (very important), but also the validity of such data on the network.

Working on the data chains component has been pretty tough with all the other commitments a typical bleeding edge project has to face, but it looks like it has been very much worth the effort and determination. The pattern is so simple (remember, simple is good) and when I tried to tweak it for different problems I thought might happen, the chain fixed itself. So it is a very simple bit of code really, but it provides an incredible amount of security to a system, as well as a historical view of a network, republish capability etc., and all with ease. The single biggest mistake in this library/design would be to try any specialisation for any special situation; almost every time, it takes care of itself, at least so far.

This needs a large scale simulation, which we will do in MaidSafe for sure prior to user testing, but it seems to provide a significant missing link from many of today’s networks and decentralised efforts. I hope others find this as useful, intriguing and interesting as I do.

We think there will be many posts depicting various aspects of this design pattern over the next while as we get the ability to explain what it offers in many areas. (i.e. when we understand it well enough to explain it easily)

 

 

21 comments on “Data chains: what? why? how?”
  1. dyamanaka says:

    “the chain fixed itself” sounds like self preservation. This ability is common in living organisms. The Network may not be self aware, but one could describe it as … wanting to live!

    I find that amazing.

  2. […] By @dirvine David Irvine – July 20, 2016. This post was originally published on this site metaquestions.me […]

  3. Great explanation. How do computations fit into the datachain concept? A link to the CPU container that proves it must have occurred, or just record the computation output as new data in the datachain?

    Also, the word validation. Been exploring that around the concept of ‘review chains’ as in scientific review. Just because there is a datachain record it says nothing about the validity of the origin of the data or the claims it asserts? But you could see how a datachain could be put through a review process and those reviews recorded and captured in the datachain.

  4. […] David Irvine’s blog post on Data Chains […]

  12. Chris says:

    Does the network have any internal mechanism to determine whether the referenced data has been altered since being added to the chain?

    • David Irvine says:

      Yes, the hash of the data item is held in the chain structure. There is an additional RFC for locking the data chains; this compresses them substantially and disallows key reuse or removal of any data item.

  13. baasrijschool says:

    So, what computational power would be required to manage a chain that describes where my file is?
    Given that the Internet is designed to survive link outages, and BGP-4 updates require ever larger router memories and faster asics, chain management also has to work with cryptographic hashes. Fluctuation in available nodes paired with ever increasing amounts of data will quickly render the network saturated. But then.. that’s based on a 10 minute read of the “impossible Network essay” and referenced RFC’s and stuff..

    • David Irvine says:

      Your data is in XOR space and found via an XOR lookup (Kademlia-like), basically O(log(2) N) hops (max), where N is the number of nodes. It can be much less with known prefixes (as we have with disjoint sections): let's call them x, then the lookup max is O(log(2 x) N), and there is also caching. The chain, though, allows any node that can provide the chain part you need to validate the data to send it back to you, even out of band. This assists with partition management.

    • John Galt says:

      [[ So, what computational power would be required to manage a chain that describes where my file is? ]]

      The same amount of computational power built into any common RAID controller; or, in the alternative, the same CPU load requirements of any software-based RAID system. Hint hint.

  14. […] stores data identifiers (in data chains) and also stores and protects the data itself, this is what farmers get paid to do. That is how we […]

  16. […] two charts are clear enough to make us understand that in the category of on-chain data Ethereum is leading in terms of macro-level adoption although there is still some kind of […]
