Proof Of Storage (maidsafe part II)

Sorry folks, another too fast an furious post, forgive mistakes and I will again update with any comments and improvements.

Proof of Storage != Proof Of Resource 

Proof of storage is an important part of proof of resource, but only a part of it. I will explain the whole proof of resource as this series continues. This part of the MaidSafe network should in fact be rather simple and make people wonder, why did we not see this before? So lets get to the problem quickly. 

“I think you hold some data, but I need to be sure you have it” (are you telling lies?). 

There are many grand schemes about for trying to achieve this very thing. It is considered a very hard problem and it is, well until you can see a solution then its a nothing problem.First we need to understand what a hash is, it sounds hard, but it is pretty straightforward. 

What is a hash?

In simple terms a hash is a fingerprint of some information. It is (like a human fingerprint) a certain size, or fixed length. so a SHA256 hash is 256bits long, a SHA512 hash is 512 bits long. Thats all that fixed length means. No matter how large or small the data we want the hash of is, the answer is always 256 or 512 in the above example. We use 512 bit hashes, so lets say no matter what we have, the answer is a 512 bit long fingerprint. That’s it really. Very simple.  N,B. As a human fingerprint does not contain the human, a digital fingerprint does not contain the data (weird maths arguments can start here for data less then hash length etc. but lets ignore that). 

So a hash is the digital version of a human fingerprint, but for data! It does not contain data, you cannot reverse a hash to get data, you cannot reverse a fingerprint to get a human. They are always the same length for any given hash type. 

An irreversible data fingerprint.

What is a secure hash?

This is a statement that throws people, they think, oh secure, it encrypts my data. It does not (read previous paragraph).  All that the security means is:

The more secure the algorithm, the less chance two pieces of data will share the same fingerprint (hash). [This is called a hash collision] 

It also means (and we use this) that for given random inputs the hash result will be evenly distributed across the whole address range. This means even a tiny change to data, will produce a wildly different hash result

“I think you hold some data, but I need to be sure you have it”

Right OK, back to the question.

Now we see, hash is a pretty simple to understand thing (its actually pretty simple code, which shows its beauty). In MaidSafe we already know that data is stored in an unusual manner (again simplicity). Each data element uses the hash (fingerprint) of its contents as the name. This means 

When you construct that piece of data in code, you hash the content and check the name is the same as that result.

This helps a little, but is not proof you have the data (we can prove that many ways, the simplest is to ask you for it 🙂 ). What MaidSafe does for many reasons is NOT to ask for the data, but instead we want you to prove you have it, more than that we want to prove you have the data and not a virus (a good reason not to ask for it). To achieve this we use a very simple process, called Integrity Check (code is here).

Integrity Check

feh_008831_000001_UnversionedData

Do not panic, this is just a reminder that there are nodes on the network (Data Managers) who knows some nodes that should hold data (PmidNodes) and these PmidNodes are managed by the nodes closest to them. You can sum this all up, by just stating, the network knows what data you should be holding

This proof in MaidSafe uses a mechanism similar to a zero knowledge proof. In this case the check should not require to know the content of any data to be checked, but must know the data is in fact held and held in a manner that is accurate. This means zero corruption or viruses…etc… can have affected the data. This is achieved with the following steps:

  1. A checking group (Data Managers) creates a random string
  2. This random string is sent, encrypted to all holders of the data
  3. The data holder takes this string and appends to the original data and hashes the result
  4. The result is collected at the checking group and compared
  5. If any node returns a different result then it is believed compromised and de-ranked

This mechanism triggers on Get requests and during account transfers etc. It is non deterministic and randomised by use by users. It is considered to be secure and uses zero knowledge, not to conceal content (as anyone can ask for any data), but to ensure any data with a contamination is not required to be transferred.

It is really this simple. 

An observation

There are two really important points to make here, and I cannot emphasise them enough

1: Hard things are only hard till you solve them, difficult is not always smart maths or amazing crypto babble, difficult means we cannot imagine the solution, we are trained not to and defeating all those years of conditioning is very hard. 

2: If a solution is complex then its very likely to not be an optimal solution in itself. If there are lots of conditions and checks in the solution, then I tend to not like it, at best it is a sticking plaster solution and will not last. The optimum solution will be simple in its own right, anyone should be able to understand it without too much effort (or its not optimal enough, arguably). 

This is actually what makes MaidSafe hard and complex looking.  Most of the solutions are very simple and start by requiring us to go down the sticking plaster approach, this is the part where folks say, thats amazing and complex, you must be smart, actually the opposite is likely the case. Then we remove that for a more simple algorithm that improves the solution, this is very hard, but the result should look as though it was easy all along (quandary? you bet).

This boils down to a situation where, you can explain a little part (as I am supposed to be doing here) and people say, oh yea thats rubbish cause look at this other problem, bet you have not thought of that (bet highly we have, not 100% but bet highly).  So explain the little parts and making them simple is OK, the complexity in MaidSafe is that it is a network of many of these little parts and they compliment each other. So to understand it you need to understand the little parts in isolation, then stitch them together. This is a difficult thing for people who want the answer now!

Further considerations 

This short text has touched on the issues that require solving and hopefully presented a simple, easy to understand solution to proof of storage. It is the part folk talk and pontificate about with elaborate schemes to solve. This shows the solution is in fact simple. It masks the actual issue though (as do all the expert opinions on this). What about mutating data, i.e. directories change when a file changes, thats not held as a hash, safecoin and transactions need a key that cannot be a hash of the data, so … there is a taster of the actual issue, when you solve this one. This is all for later in the series though, I hope you will see it’s just as simple, but all this is required info to get to that position.

Part 1 of this series is here.

Enthusiastic human :-)

Posted in complex systems, MaidSafe, Uncategorized
2 comments on “Proof Of Storage (maidsafe part II)
  1. […] we consider it only for obfuscation and creation of non repeating data). This is relevant to the previous post, where we describe hashes and their use for proof of retrievability and corruption free […]

Leave a comment

Member of The Internet Defense League

Categories
Follow Metaquestions on WordPress.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,678 other subscribers