When blockchains begin to scale, they often run into the data availability problem — we break down exactly what that is here.
“Data availability” and the “data availability problem” refer to a specific challenge faced by various blockchain scaling strategies. The problem asks: how can nodes be sure that when a new block is produced, all of the data in that block was actually published to the network? The dilemma is that if a block producer doesn’t release all of the data in a block, no one can detect whether a malicious transaction is hidden within it.
In this article, we’ll take a deep dive into the data availability problem: why it’s important, and what solutions exist for it.
How Blockchain Nodes Function
In a blockchain, each block consists of two pieces:
- A block header. This is the metadata for the block, consisting of some basic information about the block, including the Merkle root of its transactions.
- The transaction data. This makes up the majority of the block, and consists of the actual transactions.
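To make this structure concrete, here’s a minimal sketch in Python. The SHA-256 hashing, the duplicate-last-leaf rule for odd tree levels, and the field names are all illustrative assumptions (loosely modeled on Bitcoin), not any specific chain’s format:

```python
import hashlib

def merkle_root(transactions):
    """Compute a binary Merkle root over a list of transaction byte-strings."""
    level = [hashlib.sha256(tx).digest() for tx in transactions]
    while len(level) > 1:
        if len(level) % 2:                      # odd level: duplicate the last node
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

transactions = [b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"]
header = {
    "prev_block_hash": b"\x00" * 32,            # placeholder for the parent block's hash
    "merkle_root": merkle_root(transactions),   # commits the header to the transaction data
}
block = {"header": header, "transactions": transactions}
```

Note that the header commits to the transaction data via the Merkle root: the header is tiny, but it uniquely identifies the full contents of the block.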
There are also generally two types of nodes in a blockchain network:
- Full nodes (also known as fully validating nodes). These are nodes that download and check that every transaction in the blockchain is valid. This requires a lot of resources and hundreds of gigabytes of disk space, but these are the most secure nodes as they can’t be tricked into accepting blocks that have invalid transactions.
- Light clients. If your computer doesn’t have the resources to run a full node, you can run a light client instead. A light client doesn’t download or validate any transactions. It only downloads block headers, and assumes that blocks contain only valid transactions, which makes light clients less secure than full nodes.
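One thing the header does let a light client do is check that a specific transaction is included in a block: given the transaction and an inclusion proof supplied by a full node, the light client can verify membership against the header’s Merkle root alone. A sketch, continuing the hypothetical structures above:

```python
import hashlib

def verify_merkle_proof(tx, proof, root):
    """Check that `tx` is included in a block, using only the header's Merkle root.

    `proof` is a list of (sibling_hash, sibling_is_left) pairs supplied by a
    full node; the light client never downloads the other transactions.
    """
    node = hashlib.sha256(tx).digest()
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node == root
```

But an inclusion proof only shows that a transaction is in a block, not that the transaction is valid.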
One proposed way to make light clients more secure is fraud proofs: if a block contains an invalid transaction, a full node can generate a small proof of that fraud and broadcast it to light clients, who can then reject the block without validating it themselves. There’s just one problem: in order to generate a fraud proof for a block, a full node needs to know the transaction data for that block. If a block producer publishes only the block header but not the transaction data, full nodes can’t check whether the transactions are valid, and so can’t generate fraud proofs when they’re not. Block producers are required to publish all the data for their blocks, but we need a way to enforce this.
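A sketch of the mechanism, reusing verify_merkle_proof from above; is_valid_tx and merkle_proof_for are hypothetical stand-ins for real protocol rules, not any particular implementation:

```python
def make_fraud_proof(block, is_valid_tx, merkle_proof_for):
    """A full node scans the FULL transaction data; if any of the data is
    withheld, this scan is impossible and no fraud proof can be produced."""
    for i, tx in enumerate(block["transactions"]):
        if not is_valid_tx(tx):
            # Point at the offending transaction plus its inclusion proof.
            return {"tx": tx, "proof": merkle_proof_for(block, i)}
    return None  # every transaction checks out; nothing to prove

def light_client_verdict(header, fraud_proof, is_valid_tx):
    """A light client accepts the fraud proof (and rejects the block) only if
    the transaction really is in the block and really is invalid."""
    in_block = verify_merkle_proof(fraud_proof["tx"], fraud_proof["proof"],
                                   header["merkle_root"])
    return in_block and not is_valid_tx(fraud_proof["tx"])
```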
To solve this problem, there needs to be some way for light clients to check that the transaction data for a block was actually published to the network, so that full nodes can check it. However, we want to avoid requiring light clients to download the entire block to check that it’s been published, because that would defeat the point of a light client.
How do we solve this? First, let’s discuss where else the data availability problem is relevant, and then we’ll dive into the solutions.
Where Is the Data Availability Problem Relevant?
In the first section, we introduced the data availability problem. Let’s discuss which scalability solutions it’s important for.
Increasing the Size of Blocks
The simplest way to scale a blockchain’s throughput is to increase its block size limit, so that each block can hold more transactions. But if we do that, fewer people will be able to afford to run full nodes and independently verify the chain, and more people will run less secure light clients. This is bad for decentralization, because it becomes easier for block producers to change the protocol rules and insert invalid transactions that light clients will accept as valid. Fraud proof support for light clients therefore becomes very important, but as discussed, light clients need a way to check that all the data in a block has been published for fraud proofs to work.
Sharding
Sharding splits a blockchain into multiple chains, called shards, so that the work of processing transactions can be divided among nodes. Typically, a node in a sharded blockchain runs a full node for only one or a few shards, and a light client for every other shard. After all, running a full node for every shard would defeat the purpose of sharding, which is to split the network’s resources across different nodes.
However, this method has its problems. What if the block producers in a shard turn malicious and start accepting invalid transactions? This is more likely in a sharded system than in a non-sharded one: because the block producers are split across shards, each individual shard is secured by only a few of them, making it easier to attack.
In order to solve the problem of detecting if any shard accepted an invalid transaction, you need to be able to guarantee that all the data in that shard was published and made available, so that any invalid transaction can be proven with a fraud proof.
Rollups
Rollups execute transactions outside the base chain but post their block data to it. Optimistic rollups assume posted blocks are valid and rely on fraud proofs to catch invalid ones, which, as we’ve seen, requires the block data to be available. Zero-knowledge (ZK) rollups work similarly, but instead of using fraud proofs to detect invalid blocks, they use a cryptographic proof called a validity proof to prove that a block is valid. Validity proofs themselves don’t require data availability. However, ZK rollups as a whole still do: if a block producer makes a valid block and proves it with a validity proof but doesn’t release the block’s data, users won’t know what the state of the blockchain is or what their balances are, and so won’t be able to interact with the chain.
Going Further
Rollups are designs that use a blockchain only as a data availability layer on which to dump transactions, while all the actual transaction processing and computation happens on the rollup itself. This leads to an interesting insight: a blockchain doesn’t actually need to do any computation, but at minimum it needs to order transactions into blocks and guarantee the data availability of those transactions.
What Solutions Are Available for the Data Availability Problem?
Downloading All the Data
The most obvious way to solve the data availability problem, as discussed, is to require everyone (including light clients) to download all the data. This is what most blockchains, such as Bitcoin and Ethereum, currently do, and clearly it doesn’t scale well.
Data Availability Proofs
Data availability proofs are a new technology that allows clients to check with very high probability that all the data for a block has been published, by only downloading a very small piece of that block.
Data availability proofs rely on a technique called erasure coding: the block is expanded to double its original size, with the extra half encoded so that the original block can be recovered from any 50% of the expanded data. This means that for 100% of a block to be available, only 50% of it needs to be published to the network by the block producer. If a malicious block producer wants to withhold even 1% of the block, they must withhold a full 50% of it, because otherwise that 1% can be recovered from the rest.
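To make the “recover from any 50%” property concrete, here is a toy Reed-Solomon-style erasure code over a small prime field. This is a sketch for illustration only: the field size, symbol encoding, and function names are all assumptions, and real deployments use optimized codes plus additional commitments so that clients can also detect incorrectly encoded blocks.

```python
P = 65537  # prime modulus; every data symbol must be an integer below P

def lagrange_eval(points, x):
    """Evaluate the unique polynomial passing through `points` at `x`, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def extend(data):
    """Erasure-code k symbols into 2k symbols; the first k are the data itself."""
    k = len(data)
    pts = list(enumerate(data))          # treat data as evaluations at x = 0..k-1
    return data + [lagrange_eval(pts, x) for x in range(k, 2 * k)]

def recover(shares, k):
    """Recover the original k symbols from ANY k of the 2k (x, y) shares."""
    pts = shares[:k]
    return [lagrange_eval(pts, x) for x in range(k)]

data = [42, 7, 1999, 314]                          # k = 4 original symbols
coded = extend(data)                               # 2k = 8 erasure-coded symbols
surviving = list(enumerate(coded))[4:]             # producer withholds the entire first half...
assert recover(surviving, len(data)) == data       # ...yet the data is fully recoverable
```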
Armed with this knowledge, clients can do something clever to make sure that no parts of the block have been withheld. They can try to download some random chunks of the erasure-coded block, and if they are unsuccessful in downloading any of those chunks (i.e. the chunk is among the 50% that a malicious block producer didn’t publish), they reject the block as unavailable. After trying to download one random chunk, there’s a 50% chance of detecting that the block is unavailable; after two chunks, a 75% chance; after three, an 87.5% chance; and so on, until after seven chunks there’s over a 99% chance. This is very convenient, because it means that clients can check with high probability that the entire block was published, while only downloading a small portion of it.
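The numbers above follow from simple probability. A quick sketch, assuming for simplicity that chunks are sampled with replacement (real sampling without replacement detects withholding even faster):

```python
def detection_probability(n_samples: int) -> float:
    """Chance of hitting at least one withheld chunk after n random samples,
    when 50% of the erasure-coded block has been withheld."""
    return 1 - 0.5 ** n_samples

for n in range(1, 8):
    print(f"after {n} chunk(s): {detection_probability(n):.1%} detection chance")
# after 1 chunk(s): 50.0% ... after 7 chunk(s): 99.2%
```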
Conclusion
In this article, we introduced the data availability problem, showed why it’s important for blockchain scalability, and described a solution.
To learn more, check out the following resources:
- John Adler’s whiteboard session about fraud and data availability proofs
- Original fraud and data availability proofs paper
- Coded Merkle Trees paper on an alternative data availability scheme
- Ethereum Research wiki post on the data availability problem