Looking at the history of computer parallelism: the first level is instruction-level parallelism. Instruction-level parallelism was the main way to improve processor performance in the last 20 years of the 20th century. It can improve performance while preserving the binary compatibility of programs, which makes it especially popular with programmers. There are two kinds of instruction-level parallelism. One is temporal parallelism, that is, the instruction pipeline. The instruction pipeline is like the assembly line of a car factory: the factory does not wait until one car is fully assembled before starting the next, but works on multiple cars at the same time across multiple stages. The other is spatial parallelism, that is, multiple issue, or superscalar execution. Multiple issue is like a multi-lane road, and out-of-order execution allows overtaking across those lanes. Superscalar and out-of-order execution are often combined to improve efficiency. After the emergence of RISC in the 1980s, instruction-level parallelism developed rapidly and peaked over the following 20 years; after 2010, there was not much room left to exploit it further.

The second level of parallelism is data-level parallelism, which mainly refers to single instruction stream, multiple data stream (SIMD) vector architectures. The earliest data-level parallelism appeared on the ILLIAC IV. In the 1970s and 1980s, vector machines represented by Cray were very popular, from the Cray-1 and Cray-2 to the later Cray X-MP and Cray Y-MP. After the Cray-4, SIMD fell silent for a while, but it has since recovered and is used more and more. For example, the AVX multimedia instructions in x86 can use 256-bit registers to perform four 64-bit operations or eight 32-bit operations at once. As an effective complement to instruction-level parallelism, SIMD has played an important role in streaming media. It was mainly used in special-purpose processors in its early days and has now become a standard feature of general-purpose processors.
The third level of parallelism is task-level parallelism, which is abundant in Internet applications. Task-level parallelism is embodied in multi-core and multi-threaded processors and is the main way current computer architectures improve performance. Its parallel granularity is large: a single thread contains hundreds of instructions or more.
From the perspective of the development of parallel computing, blockchain is now in transition from the first level to the second. Mainstream blockchain systems usually adopt single-chain or multi-chain architectures. In a single-chain system, such as Bitcoin or Ethereum, there is only one chain in the entire system; every node executes exactly the same smart contract transactions and maintains a fully consistent on-chain state. Within each node, smart contract transactions are usually executed serially, so the throughput of the whole system is low. Some recent high-performance blockchain systems, although still single-chain, do support parallel execution of smart contract transactions.

Thomas Dickerson and Maurice Herlihy of Brown University and Yale University first proposed a parallel execution model based on software transactional memory (STM) in their PODC'17 paper. This method uses optimistic concurrency to execute multiple transactions in parallel: if a transaction hits a state-access conflict during parallel execution, its state is rolled back through STM and the conflicting transactions are then executed serially. This approach has been adopted by many high-performance blockchain projects, including Aptos, Sei, and Monad. The other parallel execution model is based on pessimistic concurrency: before transactions are executed in parallel, the system checks whether the state they access conflicts, and only conflict-free transactions are executed in parallel. This approach usually relies on pre-computation, using program analysis tools to statically analyze the smart contract code and build state dependencies, for example as a directed acyclic graph (DAG). After concurrent transactions are submitted, the system decides whether they can run in parallel based on the state each transaction accesses and the dependencies between those states; only transactions with no state dependencies on each other are executed in parallel. This approach has been applied in high-performance blockchain projects such as Zilliqa (the CoSplit version) and Sui.

Both parallel execution models can significantly improve system throughput, and they correspond to the instruction-level parallelism described above. However, these works share two problems: 1) scalability and 2) the expression of parallel semantics, which we elaborate on below.
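To make the pessimistic, pre-computed approach concrete, here is a minimal sketch in Go. It assumes each transaction declares hypothetical read and write key sets up front (the `Tx` fields and account names are invented for illustration, not any project's actual API), packs non-conflicting transactions into batches that respect block order, and runs each batch in parallel. Real systems such as Sui or Zilliqa's CoSplit build a full dependency DAG rather than this simple greedy batching.

```go
package main

import (
	"fmt"
	"sync"
)

// Tx is a simplified transaction that declares, ahead of execution,
// which state keys it reads and which it writes (illustrative structure).
type Tx struct {
	ID     int
	Reads  []string
	Writes []string
}

func toSet(keys []string) map[string]bool {
	s := make(map[string]bool, len(keys))
	for _, k := range keys {
		s[k] = true
	}
	return s
}

// conflicts reports whether two transactions touch overlapping state in a way
// that forces a serial order: any write/write or read/write overlap.
func conflicts(a, b Tx) bool {
	aW, bW := toSet(a.Writes), toSet(b.Writes)
	for _, k := range b.Reads {
		if aW[k] {
			return true
		}
	}
	for _, k := range b.Writes {
		if aW[k] {
			return true
		}
	}
	for _, k := range a.Reads {
		if bW[k] {
			return true
		}
	}
	return false
}

func main() {
	txs := []Tx{
		{ID: 1, Reads: []string{"alice"}, Writes: []string{"alice", "bob"}},
		{ID: 2, Reads: []string{"carol"}, Writes: []string{"carol", "dave"}},
		{ID: 3, Reads: []string{"bob"}, Writes: []string{"bob"}}, // depends on tx 1
	}

	// Pack each transaction into the current batch if it conflicts with nothing
	// already there; otherwise start a new batch. Batches run one after another,
	// and the transactions inside a batch run in parallel.
	var batches [][]Tx
	for _, tx := range txs {
		n := len(batches)
		fits := n > 0
		if fits {
			for _, other := range batches[n-1] {
				if conflicts(tx, other) {
					fits = false
					break
				}
			}
		}
		if fits {
			batches[n-1] = append(batches[n-1], tx)
		} else {
			batches = append(batches, []Tx{tx})
		}
	}

	for i, batch := range batches {
		var wg sync.WaitGroup
		for _, tx := range batch {
			wg.Add(1)
			go func(t Tx) {
				defer wg.Done()
				fmt.Printf("batch %d: executing tx %d\n", i, t.ID) // placeholder for real execution
			}(tx)
		}
		wg.Wait() // the next batch starts only after this one has finished
	}
}
```

In this toy run, transactions 1 and 2 touch disjoint accounts and execute in the same parallel batch, while transaction 3 writes an account that transaction 1 also writes and is pushed into a later batch, preserving the declared order.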
Parallel Design
We will use the typical technical solutions of the Solana and Monad projects as examples to break down their parallel architecture designs, including how they classify parallel execution, how they handle data dependencies, and the other key factors that affect parallelism and TPS.
Solana
At a high level, Solana's design philosophy is that blockchain innovation should develop with the advancement of hardware. As hardware continues to improve according to Moore's Law, Solana aims to benefit from higher performance and scalability. Solana co-founder Anatoly Yakovenko originally designed Solana’s parallelized architecture over five years ago, and today, parallelism as a blockchain design principle is spreading rapidly.
Solana uses deterministic parallelization, an idea that comes from Anatoly's past experience with embedded systems, where developers typically declare all state up front. This lets the CPU see all dependencies and prefetch the necessary portions of memory. The result is optimized execution, but it requires developers to do extra work up front. On Solana, every transaction is required to state all of the memory (account) dependencies of a program when the transaction is constructed (i.e., an access list), allowing the runtime to schedule and execute many transactions in parallel efficiently. The next major component of the Solana architecture is the Sealevel VM, which processes multiple contracts and transactions in parallel according to the number of cores a validator has. Validators are the network participants responsible for verifying and confirming transactions, proposing new blocks, and maintaining the integrity and security of the blockchain. Because transactions declare up front which accounts require read or write locks, the Solana scheduler can determine which transactions can be executed concurrently. This is why block producers (leaders) can sort through thousands of pending transactions and schedule all non-overlapping transactions in parallel.
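As a rough illustration of why up-front account declarations help, the Go sketch below gives every account its own read-write lock and lets each transaction acquire read or write locks according to its declared access list, so non-overlapping transactions proceed concurrently while conflicting ones serialize. This is only a toy model under those assumptions (the `AccountAccess` type and account names are invented here, and Solana's actual scheduler batches non-overlapping transactions ahead of time rather than relying on per-account OS locks).

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// AccountAccess mimics the idea of an access-list entry: the transaction names
// the account and whether it needs write access. (Illustrative, not Solana's API.)
type AccountAccess struct {
	Account  string
	Writable bool
}

type Tx struct {
	ID       int
	Accesses []AccountAccess
}

func main() {
	// One read-write lock per account; non-overlapping transactions never
	// contend, so they effectively run in parallel across worker goroutines.
	locks := map[string]*sync.RWMutex{
		"alice": {}, "bob": {}, "carol": {}, "dave": {},
	}

	txs := []Tx{
		{ID: 1, Accesses: []AccountAccess{{"alice", true}, {"bob", true}}},
		{ID: 2, Accesses: []AccountAccess{{"carol", true}, {"dave", true}}},
		{ID: 3, Accesses: []AccountAccess{{"bob", false}}}, // read-only, conflicts with tx 1's write
	}

	var wg sync.WaitGroup
	for _, tx := range txs {
		wg.Add(1)
		go func(t Tx) {
			defer wg.Done()
			// Acquire locks in a fixed (sorted) order to avoid deadlock:
			// write locks for writable accounts, read locks otherwise.
			acc := append([]AccountAccess(nil), t.Accesses...)
			sort.Slice(acc, func(i, j int) bool { return acc[i].Account < acc[j].Account })
			for _, a := range acc {
				if a.Writable {
					locks[a.Account].Lock()
				} else {
					locks[a.Account].RLock()
				}
			}
			fmt.Printf("executing tx %d\n", t.ID) // placeholder for real execution
			for _, a := range acc {
				if a.Writable {
					locks[a.Account].Unlock()
				} else {
					locks[a.Account].RUnlock()
				}
			}
		}(tx)
	}
	wg.Wait()
}
```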
Monad
Monad is building a parallel, Turing-complete EVM layer 1. What makes Monad unique is not only its parallelization engine but also the optimization engine built under the hood. Monad takes a distinctive approach to its overall design, incorporating several key features including pipelining, asynchronous I/O, separation of consensus and execution, and MonadDB.
Similar to Sei, the Monad blockchain uses optimistic concurrency control (OCC) to execute transactions. Concurrent transaction processing means that multiple transactions are in flight in the system at the same time. This execution method has two phases: execution and verification.
During the execution phase, transactions are processed optimistically and all reads and writes are buffered in transaction-specific temporary storage. Each transaction then enters the verification phase, where the buffered operations are checked against any state changes made by earlier transactions. If the transactions are independent, they run in parallel; if one transaction reads data modified by another transaction, a conflict arises.
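The following Go sketch walks through the two OCC phases on toy transfer transactions (the `State` map, `Result` bookkeeping, and account names are assumptions made for illustration, not Monad's actual engine): every transaction first runs optimistically against the same pre-block state while buffering its reads and writes, and then, in block order, any transaction whose reads were invalidated by an earlier transaction's writes is re-executed against the current state.

```go
package main

import (
	"fmt"
	"sync"
)

// State is the committed key/value store (illustrative, not Monad's storage).
type State map[string]int

// Result captures what a transaction read and what it wants to write,
// recorded while it ran optimistically.
type Result struct {
	TxID   int
	Reads  map[string]bool
	Writes map[string]int
}

// transfer is a toy transaction: move 1 unit from `from` to `to`.
func transfer(id int, from, to string, s State) Result {
	r := Result{TxID: id, Reads: map[string]bool{from: true, to: true}, Writes: map[string]int{}}
	r.Writes[from] = s[from] - 1
	r.Writes[to] = s[to] + 1
	return r
}

func main() {
	state := State{"alice": 10, "bob": 10, "carol": 10}

	type job struct {
		id       int
		from, to string
	}
	jobs := []job{{1, "alice", "bob"}, {2, "carol", "alice"}, {3, "bob", "carol"}}

	// Execution phase: run every transaction optimistically against the same
	// pre-block state, in parallel, buffering reads/writes per transaction.
	results := make([]Result, len(jobs))
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		go func(i int, j job) {
			defer wg.Done()
			results[i] = transfer(j.id, j.from, j.to, state)
		}(i, j)
	}
	wg.Wait()

	// Verification phase: walk transactions in block order. If a transaction
	// read a key that an earlier transaction in the block wrote, its view was
	// stale, so re-execute it serially against the now-current state.
	written := map[string]bool{}
	for i, res := range results {
		stale := false
		for k := range res.Reads {
			if written[k] {
				stale = true
				break
			}
		}
		if stale {
			res = transfer(jobs[i].id, jobs[i].from, jobs[i].to, state)
			fmt.Printf("tx %d re-executed after conflict\n", res.TxID)
		}
		for k, v := range res.Writes {
			state[k] = v
			written[k] = true
		}
	}
	fmt.Println("final state:", state)
}
```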
A key innovation of Monad's design is pipelining with a slight offset. The offset allows more work to be parallelized by running multiple stages simultaneously. Pipelining is therefore used to optimize many functions: the state access pipeline, the transaction execution pipeline, the pipeline between consensus and execution, and the pipeline within the consensus mechanism itself, which correspond to washing, drying, folding and putting away clothes in the figure below.
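To make the laundry analogy concrete, here is a generic staged pipeline in Go (a sketch of the pipelining idea, not Monad's implementation): each stage is a goroutine connected to the next by a channel, so while one item is being "dried" the next is already being "washed", which is exactly the slight offset described above.

```go
package main

import "fmt"

// stage runs concurrently and hands each item to the next stage over a
// channel, so different items occupy different stages at the same time.
func stage(name string, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for item := range in {
			fmt.Printf("%-8s item %d\n", name, item)
			out <- item
		}
	}()
	return out
}

func main() {
	// Feed three "loads of laundry" (blocks or transactions in the blockchain case).
	src := make(chan int)
	go func() {
		defer close(src)
		for i := 1; i <= 3; i++ {
			src <- i
		}
	}()

	// wash -> dry -> fold -> put away, mirroring the laundry analogy.
	done := stage("put away", stage("fold", stage("dry", stage("wash", src))))
	for range done {
	}
}
```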
In Monad, transactions are ordered linearly within the block, but the goal is to reach the final state faster by leveraging parallel execution. Monad's execution engine uses an optimistic parallelization algorithm: it processes transactions simultaneously and then analyzes the results to ensure they are identical to what executing the transactions one after another would have produced. If there is a conflict, the affected transaction is re-executed. The parallel execution itself is a relatively simple algorithm, but combining it with Monad's other key innovations is what makes this approach novel. Note that even when re-execution occurs it is usually cheap, because the inputs the invalidated transaction needs are almost always still in cache, so it amounts to a simple cache lookup. Re-execution is also guaranteed to succeed, because the earlier transactions in the block have already been executed.
In addition to deferred execution, Monad also improves performance by decoupling execution from consensus, similar to Solana and Sei. The idea is that if you relax the requirement that execution must finish by the time consensus completes, the two can run in parallel, giving each extra time. Monad, of course, handles this with a deterministic algorithm to ensure that one does not run so far ahead that the other cannot catch up.
Whether optimistic or pessimistic parallel execution is adopted, the systems above use shared memory as the underlying data model abstraction: no matter how many parallel units there are, every unit can access all data (here, all on-chain data), and state can be read and written directly by any parallel execution unit. Blockchain systems built on this shared-memory model usually confine concurrency to a single physical node (as in Solana), and each node's concurrency is bounded by its computing power, that is, by its number of physical threads (assuming each thread hosts one virtual machine).
This intra-node parallelism only requires changes to the smart contract execution layer, not to the consensus layer, which makes it well suited to raising the throughput of single-chain systems. However, because data storage is not sharded, every node in the blockchain network still has to execute all transactions and store all state. And compared with the shared-nothing architecture that is better suited to distributed scaling, systems built on the shared-memory abstraction cannot scale processing power horizontally, that is, they cannot expand state storage and execution capacity by adding physical machines, and therefore cannot fundamentally solve the scalability problem of blockchain.
So is there a ready-made solution?
Parallel Programming Model
Before introducing PREDA, we would like to raise a natural question: why use parallel programming at all? In the 1970s, the 1980s, and even part of the 1990s, we were perfectly satisfied with single-threaded (serial) programming. You wrote a program to complete a task; when it finished executing, it gave you a result; the task was done and everyone was happy. But if you are running a particle simulation that requires millions or even billions of calculations per second, or processing images with tens of thousands of pixels, you want the program to run faster, which means you need a faster CPU. Before 2004, CPU manufacturers such as IBM, Intel, and AMD could keep delivering faster processors, with clock frequencies climbing from 16 MHz, 20 MHz, 66 MHz, 100 MHz to 200 MHz, 333 MHz, 466 MHz... It seemed they could keep increasing CPU speed, and with it CPU performance.

By 2004, however, it was clear that the trend of ever-increasing clock speeds could not continue because of technological limits, so other techniques were needed to keep delivering higher performance. The manufacturers' solution was to put two CPUs into one package, even if each runs slower than a single CPU would. For example, two CPUs (manufacturers call them cores) running at 200 MHz can, in principle, perform more calculations per second than a single CPU running at 300 MHz (intuitively, 2 × 200 > 300). The dream-like story of "a single CPU with multiple cores" became reality, which means programmers must now learn parallel programming to take advantage of the two cores. If a CPU can execute two programs at the same time, then the programmer must write those two programs. But does that translate into running twice as fast? If not, then our 2 × 200 > 300 intuition is flawed. What happens if one core does not have enough work, that is, only one core is really busy while the other does nothing? In that case we would have been better off with a single 300 MHz core. Once multiple cores were introduced, many problems of this kind became very prominent, and only careful parallel programming can use these cores effectively.
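A quick back-of-the-envelope check of the 2 × 200 > 300 intuition, with purely illustrative numbers: wall-clock time is set by the busier core, so the dual-core chip only wins if the work is split evenly enough.

```go
package main

import "fmt"

// Compare one 300 MHz core against two 200 MHz cores for the same total work.
// Wall-clock time on the dual-core chip is governed by the busiest core.
func main() {
	const work = 1200.0 // total work, in millions of cycles (illustrative)

	single := work / 300.0 // one 300 MHz core does everything
	for _, split := range []float64{0.5, 0.7, 1.0} {
		// split = fraction of the work given to core 1; the rest goes to core 2
		core1 := work * split / 200.0
		core2 := work * (1 - split) / 200.0
		dual := core1
		if core2 > dual {
			dual = core2
		}
		fmt.Printf("split %.0f%%/%.0f%%: dual-core %.1f s vs single-core %.1f s\n",
			split*100, (1-split)*100, dual, single)
	}
}
```

With an even 50/50 split the two 200 MHz cores finish in 3 seconds versus 4 seconds for the single 300 MHz core, but at a 70/30 split they already lose (4.2 s versus 4 s), which is exactly the "one core doesn't have enough work" problem described above.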
In the figure below, we imagine Bob and Alice as two gold miners, and mining the gold requires four steps:
Driving to the mine
Mining
Loading ore
Storing and polishing