Author: Stanford Blockchain Club Source: W-SOURCE Translation: Shan Ouba, Golden Finance
Introduction
Since its inception, blockchain and cryptocurrency have been committed to changing the financial landscape by providing broader access and removing intermediaries. Over time, the development of web3 has expanded its application scenarios, emphasizing the potential of blockchain technology in creating an Internet where creators prosper and users control their data.
A key part of the infrastructure that empowers end users while ensuring decentralization is to ensure that data is stored in a resilient, censorship-resistant database. Although centralized databases are convenient and familiar, they fail to provide the necessary security guarantees and require the permission of the database owner, limiting global applications.
Distributed data storage systems meet the needs for fault-tolerant and highly resilient storage by building a network of nodes that store, manage and share data.Removing the need for a central authority and distributing data in a P2P manner enhances security and transparency. Distributed storage systems are usually based on blockchain or similar technologies and tend to replicate data through redundancy and availability.
While distributed storage systems rely on unused storage capacity, providing security, data resilience, and potential cost-effectiveness, they face challenges in terms of regulation and interoperability.
For a truly open and accessible web, distributed storage is essential. A key aspect of any storage system is how to prove the way data is stored and maintained. The more important question for users is, how can data be proven to be stored and maintained? This is addressed through data proofs.
In general, we distinguish between two types of proofs:
Deterministic proofs: allow data holders to create a proof for a specific data that proves that it was created and matches the hash of the data. This type of proof only publicly discloses the hash generated using the data as input.
Probabilistic proofs: rely on probability to demonstrate that the underlying data is highly likely to be available. This is a type of proof that rationally demonstrates the degree of certainty of a particular hypothesis, and is applicable to data that has been published and can be retrieved when necessary.
The rest of this article will discuss design choices for proving data storage and integrity in three different systems. First, we will look at Tagion, which focuses on high-volume and scalable data, and then discuss how the decentralized storage network Filecoin ensures data storage at scale. Finally, this article will explore Celestia, which focuses on storing and providing blockchain data.
Tagion
Architecture
Tagion is a decentralized network dedicated to high-volume transactions with the goal of building a unique monetary system based on technology and democratic governance. The project relies on innovative database architecture and cryptography to achieve large scale. It is not a blockchain, but a distributed ledger that uses the DART database to optimize storage. Tagion's proof mechanism is an example of deterministic proof.
The core function of the DART database is as a distributed hash table that stores data according to hash keys. As the storage of information increases, the structure naturally generates more branches, each supporting up to 256 combined archives and sub-branches.
In addition to being similar to a distributed hash table, Tagion's infrastructure can also be understood as a sparse Merkle tree (SMT). SMT is an authenticated data structure based on key-value pairs that supports standard database operations such as lookup, insertion, update, and deletion. Each key-value pair represents a leaf, and the hash of the parent branch is derived by recursively hashing the child nodes to the Merkle root.
SMT significantly improves efficiency by enabling the prover to verify the existence of an element without having to access unrelated data elements or download specific pieces of data. In addition, the independence of the values in the tree allows updates in any order without changing the final structure of the tree.
Tagion's system leverages a root hash that contains all child branch hashes to quickly verify data status with minimal computation. To further enhance processing power, the system can create child DARTs for specific ecosystems, similar to sharded blockchains. These designated nodes manage subsets of data, increase throughput, and enable the network to be customized for different applications, similar to application chains.
Using DARTs creates a stateless system without the need to maintain a complete history of system transitions. This means that data can be deleted, reducing storage requirements overall and potentially increasing the decentralization of the system through lightweighting.
Tagion uses HiBON (Hash Invariant Binary Representation Object) to further facilitate the storage process, ensuring that data remains hashed as it enters, simplifying data retrieval based on the associated hash. Hash immutability means that data will always generate the same hash if it is processed in a different order. This is a proven technique used to speed up data retrieval and writing in databases.
Through these mechanisms, Tagion not only stores data securely, but also efficiently verifies its inclusion and integrity in the network.
Data Integrity
All subsystems of Tagion perform so-called random walks to check that data is stored and available as needed. Nodes that fail the retention verification challenge are excluded from the network.
All archives contain timestamps and require a fee to extend the storage time. During the walk, the system checks whether the payment has been received, and if not, the data is deleted to free up space.
Filecoin
Filecoin is a decentralized storage network that incentivizes miners to provide storage capacity through its native token Filecoin. To earn these rewards, miners must generate proofs to verify their storage capacity.
Filecoin's basic storage unit is called a sector, which has a standard size and a lifespan that can be extended by the provider. The design strikes a careful balance between security and availability. All user data stored on Filecoin is encrypted, with multiple copies distributed across the network, ensuring that miners cannot access file contents.
Miners have influence in the Filecoin network proportional to the amount of storage they provide, which also allows them to participate in the network's consensus mechanism. The Filecoin virtual machine is responsible for executing smart contracts and facilitating market operations, such as pairing storage providers with users.
Filecoin's architecture is modular, allowing nodes to operate specific parts of the system as needed. For example, a node can be a storage node only and not participate in market operations.
To ensure data integrity and availability, Filecoin relies on two algorithms: Proof of Storage and Proof of Replication.
Proof of Storage
Miners in Filecoin generate proofs to verify that they hold a copy of the data at any given time. This proof is achieved through challenges: the system asks miners questions that can only be answered correctly if they have the data.
To ensure that miners do not copy the data only when the challenge is raised, the challenges are designed to randomly target different parts of the data at unpredictable time intervals. The combination of randomness and uncertainty in time intervals makes it impossible, uneconomical, and irrational for miners to retrieve data only when a challenge is presented.
Proof of Spacetime (PoSt)
Filecoin introduces Proof of Spacetime (PoSt) to ensure continuous storage and data availability. Proof of Spacetime verifies storage within a time interval by presenting a cryptographic challenge to miners. Miners can only pass the challenge if the file is stored within the specified time frame.
PoSt includes two types of challenges:
Winning Proof of Spacetime (Winning PoSt): Miners verify that they stored a copy of the data at a specific time, usually when the algorithm selects the miner to mine the next block. Short deadlines ensure that they have the data.
Window Proof of Spacetime (WindowPoSt): A recurring challenge where miners submit proofs that they have maintained the data as required. It is more expensive for miners to seal the data only when it is submitted.
Sealing is part of the Proof of Replication algorithm, which is computationally intensive, so rational miners want to reduce the need for sealing as much as possible.
Proof of Replication
Sealing is part of the Proof of Replication algorithm, which is computationally intensive, and encourages miners to reduce the frequency of sealing. Proof of Replication ensures that users, miners, have created and stored a unique copy on their physical hardware. This proof includes:
The data itself
The miner who sealed the data
The time and date of sealing
The block height when the data was sealed
Requiring miners to generate two types of proofs provides users with a guarantee of secure file storage, and only miners who provide actual storage can receive rewards. Because the proof is too large to be put on the chain, miners generate zero-knowledge succinct non-interactive argument of knowledge (zkSNARK) and then submit it to the chain, making Filecoin the largest user of zkSNARK, generating 6 to 7 million proofs per day.
Overall, Filecoin combines the methods of deterministic proofs of existence (PoRep) and probabilistic proofs of existence (PoSt), taking a hybrid approach.
Celestia
The third example in this article is Celestia, a so-called data availability blockchain that provides execution and data storage for modular blockchains, allowing them to outsource core functions.
With the rise of Ethereum Rollup, data availability solutions like Celestia have gained popularity for providing a cheaper alternative to Rollup transaction data storage than Ethereum archive nodes.
Proving Data Availability
Unlike Filecoin, Celestia does not provide storage solutions for end users, but instead focuses on solving the data availability problem. Data availability ensures that blockchain data has been published correctly. Typically, blockchain nodes must download entire blocks to verify availability, a resource-intensive process that can hinder verification.
To simplify this process, Celestia employs data availability sampling (DAS). This method involves light nodes downloading only a small portion of the data until a predetermined confidence level is reached. If the data in the sample is all available, the data is considered published, as a probabilistic proof of data availability.
It works as follows:
The proposer creates a block of data.
The block data is split into k×k blocks, forming a matrix.
This matrix is extended by adding parity data, creating a 2k×2k matrix using Reed-Solomon encoding. Such encodings allow the entire data set to be recovered from a subset of the data.
Independent Merkle roots for each row and column of the extended matrix are calculated and combined.
Finally, the Merkle root of all these combined roots is added to the block data commitment in the block header, confirming the availability of the data.
To verify availability, the light node randomly extracts unique coordinates in the expansion matrix and then queries the full node for the data block with the Merkle proof corresponding to these coordinates. If the response is correct, it means that the data of the entire block is available with a high probability.
The node then broadcasts the received data block with the correct Merkle proof to the rest of the network. As long as the sampling is sufficient, the node can reconstruct the entire block, allowing Celestia to rely more on nodes with limited resources for verification, thereby helping decentralization.
At the time of writing, Celestia is still very new. However, data availability sampling is a technique that may be adopted beyond Celestia, and Ethereum core developers are discussing adding it to the protocol to help scale.
Conclusion
In summary, various methods for storing and verifying data availability in distributed networks are in operation and actively used.
Tagion uses the DART database, which is sharded to increase throughput and support the development of specialized sub-ecosystems protected by sparse Merkle trees.
Filecoin's architecture utilizes two different algorithms, Proof of Spacetime and Proof of Replication, to enable miners to verify and prove that they have reliably stored data. These proofs are then recorded on-chain in the form of zero-knowledge proofs.
Celestia, as a data availability layer, utilizes Reed-Solomon encoding to expand data blocks into matrices. This structure allows light clients to perform random sampling to confirm data availability, bypassing the need to download the entire dataset.
As the landscape of distributed storage systems continues to evolve, Tagion, Filecoin, and Celestia each propose unique strategies for ensuring data integrity, availability, and accessibility. Together, these platforms make important contributions to building resilient data publishing and storage systems that support decentralized networks.