https://medium.com/@numencyberlabs/analysis-of-the-first-critical-0-day-vulnerability-of-aptos-move-vm-8c1fd6c2b98e
1. Preface
The Move programming language has been rising in popularity thanks to the advantages it offers over Ethereum’s Solidity language, and it is used in well-known projects such as Aptos and Sui. Recently, Numen’s Web3 vulnerability detection product discovered a critical security vulnerability in the virtual machine (VM) of the Aptos public chain: a flaw that can crash Aptos nodes and cause denial of service. Through the explanation of this vulnerability, we hope to give you a better understanding of the Move language and its security. As a leader in Move language security research, we will continue to contribute to the security of its ecosystem.
2. Important Concepts of the Move language
Modules and Scripts
Move has two different types of programs: modules and scripts. Modules are libraries that define struct types and the functions that operate on those types. Struct types define the schema of Move’s global storage, and module functions define the rules for updating that storage. Modules themselves are also stored in the global storage. Scripts are executable entry points, similar to the main function in traditional languages. A script usually calls functions of published modules to update the global storage. Scripts are ephemeral code fragments that are not published in the global storage. A Move source file (or compilation unit) may contain multiple modules and scripts. However, publishing a module and executing a script are separate virtual machine (VM) operations.
For those familiar with operating systems, a Move module is similar to a dynamic library loaded when an executable is run, and a script is similar to the main program. Users can write their own scripts to access the global storage, including code that calls module functions.
Global Storage
The purpose of a Move program is to read from and write to the tree-shaped global storage. A program cannot access the file system, the network, or any data outside this tree.
In pseudo-code, the global storage looks like this:
Structurally, the global storage is a forest consisting of trees rooted at account addresses. Each address can store resource data and module code. As the above pseudo-code shows, each address can store at most one resource value of a given type and at most one module of a given name.
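The two "at most one" rules can be modeled with a pair of maps. The following is a minimal sketch in Rust, not the VM's real data structure: the `Address` alias, the byte-vector values, and the `move_to` and `publish_module` helpers are assumptions introduced only for illustration.

```rust
use std::collections::HashMap;

// Hypothetical stand-in types; the real VM uses richer representations.
type Address = u64;

#[derive(Default)]
struct GlobalStorage {
    // At most one resource value per (address, resource type) pair.
    resources: HashMap<(Address, String), Vec<u8>>,
    // At most one module per (address, module name) pair.
    modules: HashMap<(Address, String), Vec<u8>>,
}

impl GlobalStorage {
    // Stores a resource unless one of the same type already exists at `addr`.
    fn move_to(&mut self, addr: Address, ty: &str, value: Vec<u8>) -> bool {
        let key = (addr, ty.to_string());
        if self.resources.contains_key(&key) {
            return false;
        }
        self.resources.insert(key, value);
        true
    }

    // Stores module bytecode unless a module of that name already exists at `addr`.
    fn publish_module(&mut self, addr: Address, name: &str, code: Vec<u8>) -> bool {
        let key = (addr, name.to_string());
        if self.modules.contains_key(&key) {
            return false;
        }
        self.modules.insert(key, code);
        true
    }
}

fn main() {
    let mut storage = GlobalStorage::default();
    println!("{}", storage.move_to(0x1, "Coin", vec![1])); // first Coin at 0x1
    println!("{}", storage.move_to(0x1, "Coin", vec![2])); // second Coin at 0x1 is rejected
    println!("{}", storage.publish_module(0x1, "coin", vec![0xA1]));
}
```

Keying each map by the pair (address, name) is what makes every account the root of its own tree while keeping names unique within that account.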
Move Virtual Machine Principle
The Move VM works like the EVM in this respect: source code is first compiled into bytecode, and the bytecode is then executed in the virtual machine. The following chart shows the process.
1. The bytecode is passed in through the execute_script function.
2. The load_script function is executed. It deserializes the bytecode and verifies that it is legal; if verification fails, an error is returned.
3. After successful verification, execution of the actual bytecode begins.
4. The bytecode executes and reads or modifies the state of the global storage, including resources and modules.
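The four steps above can be sketched as a toy pipeline. This is only an illustration of the ordering (deserialize, verify, then execute), not the Move VM's API: the function names execute_script and load_script are borrowed from the text, but their signatures, the `VmError` type, the one-byte opcodes, and the `u64` "storage" are all assumptions.

```rust
#[derive(Debug, PartialEq)]
enum VmError {
    Deserialization,
    Verification,
}

// Toy bytecode: every byte is one opcode. 0x01 adds one to storage, 0x02 doubles it.
fn deserialize(bytes: &[u8]) -> Result<Vec<u8>, VmError> {
    if bytes.is_empty() {
        return Err(VmError::Deserialization);
    }
    Ok(bytes.to_vec())
}

// Rejects any opcode the toy VM does not know.
fn verify(code: &[u8]) -> Result<(), VmError> {
    if code.iter().all(|op| *op == 0x01 || *op == 0x02) {
        Ok(())
    } else {
        Err(VmError::Verification)
    }
}

// Step 2: deserialize and verify; any failure stops the pipeline here.
fn load_script(bytes: &[u8]) -> Result<Vec<u8>, VmError> {
    let code = deserialize(bytes)?;
    verify(&code)?;
    Ok(code)
}

// Steps 1, 3 and 4: take the bytes, load them, then run them against storage.
fn execute_script(bytes: &[u8], storage: &mut u64) -> Result<(), VmError> {
    let code = load_script(bytes)?;
    for op in code {
        match op {
            0x01 => *storage = storage.wrapping_add(1),
            _ => *storage = storage.wrapping_mul(2),
        }
    }
    Ok(())
}

fn main() {
    let mut storage = 0u64;
    println!("{:?}", execute_script(&[0x01, 0x02, 0x01], &mut storage));
    println!("{:?}", execute_script(&[0x01, 0xff], &mut storage));
    println!("storage = {}", storage);
}
```

The point of the ordering is that a script which fails verification never touches storage, which is also why a bug inside the verifier itself is reachable by any submitted bytecode.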
Note: Move has many other features that we will not introduce here. We will continue to analyze the features of the Move language from a security perspective.
3. Vulnerability Description
This vulnerability mainly involves the verification module. Before discussing the vulnerability itself, we introduce the role of the verification module and of StackUsageVerifier::verify.
Verification Module
Before bytecode is actually executed, it is verified, and the verification is divided into several sub-passes.
They are:
BoundsChecker, which checks the bounds safety of modules and scripts, including the bounds of signatures, constants, and so on
DuplicationChecker, which verifies that each vector in a CompiledModule contains distinct values
SignatureChecker, which checks that signatures are well-formed wherever they are used for function parameters, local variables, and struct fields
InstructionConsistency, which verifies instruction consistency
constants, which verifies that constants are of primitive types and that constant data is correctly serialized for its type
CodeUnitVerifier, which verifies the correctness of function bodies, via stack_usage_verifier.rs and abstract_interpreter.rs respectively
script_signature, which verifies that a script or entry function has a valid signature
The vulnerability occurs within the call CodeUnitVerifier::verify_script(config, script)?; which itself runs several verification sub-passes: a stack safety check, a type safety check, a local variable safety check, and a reference safety check. The vulnerability arises in the stack safety check.
Stack Security Verification (StackUsageVerifier::verify)
This module verifies that the basic blocks in a function’s bytecode instruction sequence use the stack in a balanced manner. Each basic block, except those ending with the Ret (return to caller) opcode, must leave the stack at the same height as when the block was entered. In addition, at no point within a basic block may the stack height drop below the height at the beginning of the block.
The verifier loops through all basic blocks and checks that each of them meets the conditions above.
Vulnerability Details
As introduced earlier, the Move VM is a stack-based virtual machine. When verifying the legitimacy of instructions, we therefore need to make sure, first, that the instruction bytecode is well-formed and, second, that the stack is in a legal state after a block runs, i.e. that the stack is balanced after the block’s stack operations. The verify_block function accomplishes the second goal.
As the verify_block code shows, it loops through all the instructions in the block and tracks the effect of each instruction on the stack through num_pops and num_pushes. First, the check stack_size_increment < num_pops determines whether the stack is tall enough. If num_pops is larger than stack_size_increment, the instruction would pop more values than the block has placed on the stack, so an error is returned and bytecode verification fails. Then the two statements stack_size_increment -= num_pops; stack_size_increment += num_pushes; update the stack height to reflect the instruction’s effect. Finally, when the loop ends, stack_size_increment must be equal to 0, i.e. the stack must be balanced after all the operations in the block.
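The check described above can be sketched as follows. This is a simplified reconstruction from the prose, not the actual Aptos source: the three-instruction `Instr` enum, the `StackError` names, and the function signatures are assumptions, and the special case for blocks ending in Ret is left out.

```rust
#[derive(Debug, PartialEq)]
enum StackError {
    NegativeStackSizeWithinBlock,
    PositiveStackSizeAtBlockEnd,
}

// Hypothetical, drastically reduced instruction set.
#[allow(dead_code)]
enum Instr {
    LdU64(u64), // pushes one value
    Pop,        // pops one value
    Add,        // pops two values, pushes one
}

// Returns (num_pops, num_pushes) for one instruction.
fn instruction_effect(instr: &Instr) -> (u64, u64) {
    match instr {
        Instr::LdU64(_) => (0, 1),
        Instr::Pop => (1, 0),
        Instr::Add => (2, 1),
    }
}

fn verify_block(block: &[Instr]) -> Result<(), StackError> {
    let mut stack_size_increment: u64 = 0;
    for instr in block {
        let (num_pops, num_pushes) = instruction_effect(instr);
        // The block must not pop more than it has pushed so far.
        if stack_size_increment < num_pops {
            return Err(StackError::NegativeStackSizeWithinBlock);
        }
        stack_size_increment -= num_pops;
        // Plain addition, mirroring the statement quoted in the text.
        stack_size_increment += num_pushes;
    }
    // The block must leave the stack at the height it started with.
    if stack_size_increment == 0 {
        Ok(())
    } else {
        Err(StackError::PositiveStackSizeAtBlockEnd)
    }
}

fn main() {
    let balanced = [Instr::LdU64(1), Instr::LdU64(2), Instr::Add, Instr::Pop];
    println!("{:?}", verify_block(&balanced));
    println!("{:?}", verify_block(&[Instr::Pop]));
    println!("{:?}", verify_block(&[Instr::LdU64(7)]));
}
```

With these tiny push counts the arithmetic is harmless; the subtraction is guarded by the preceding comparison, while the addition has no guard at all.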
Nothing seems wrong here, but the addition stack_size_increment += num_pushes (line 16 of the code) is executed without checking for integer overflow. As a result, an attacker can indirectly control stack_size_increment by constructing an oversized num_pushes, which leads to an integer overflow vulnerability. So how do we construct such a huge num_pushes?
To answer that, we first need to introduce the Move bytecode file format.
Move Bytecode File Format
Like Windows PE files or Linux ELF files, Move bytecode files have a defined format. Their file names end in .mv.
First comes the magic, with the value A11CEB0B, followed by the version information and the number of tables. After that come the table headers, of which there can be many. Each header contains the table kind, i.e. the type of the table (there are 0x10 kinds in total, as shown on the right side of the figure; see the Move language documentation for details), followed by the offset and the length of the table. After the headers come the table contents, and finally the specific data, of which there are two kinds: Module Specific Data for modules and Script Specific Data for scripts.
Constructed Malicious File Format
Here we interact with Aptos through a script, so we construct the file shown below to make stack_size_increment overflow:
First, let’s explain the format of this bytecode file:
+0x00-0x03: the magic word, 0xA11CEB0B
+0x04-0x07: the file format version, which is 4
+0x08-0x08: the table count, which is 1
+0x09-0x09: the table kind, which is SIGNATURES
+0x0a-0x0a: the table offset, which is 0
+0x0b-0x0b: the table length, which is 0x10
+0x0c-0x18: the data of the SIGNATURES token
Starting from 0x22 is the code of the script’s main function.
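The header layout listed above can be read back with a few lines of Rust. This is a sketch for this specific file only, under several assumptions: the version is taken to be a little-endian 32-bit integer, the table count, kind, offset and length are each read as a single byte because that is how they appear in the listing (to my understanding the real format may encode some of them as variable-length integers), and the `Header` struct and `parse_header` function are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    table_count: u8,
    table_kind: u8,
    table_offset: u8,
    table_length: u8,
}

const MAGIC: [u8; 4] = [0xA1, 0x1C, 0xEB, 0x0B];

// Parses the first 12 bytes according to the offsets in the listing.
fn parse_header(bytes: &[u8]) -> Result<Header, String> {
    if bytes.len() < 12 {
        return Err("file too short".to_string());
    }
    if !bytes.starts_with(&MAGIC) {
        return Err("bad magic".to_string());
    }
    // Assumption: the version is stored little-endian.
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(Header {
        version,
        table_count: bytes[8],
        table_kind: bytes[9],
        table_offset: bytes[10],
        table_length: bytes[11],
    })
}

fn main() {
    // 0x05 as the SIGNATURES kind is an assumed value, used only for illustration.
    let bytes = [
        0xA1, 0x1C, 0xEB, 0x0B, // magic
        0x04, 0x00, 0x00, 0x00, // version 4
        0x01, // table count
        0x05, // table kind
        0x00, // table offset
        0x10, // table length
    ];
    println!("{:?}", parse_header(&bytes));
}
```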
Using the move-disassembler tool, we can see that the instructions disassemble as follows:
The encodings of the three instructions 0, 1, and 2 are the data in the red box, the green box, and the yellow box respectively.
LdU64 is unrelated to the vulnerability itself, so we will not go into detail here; you may check the code if you are interested. We focus instead on the VecUnpack instruction, which unpacks a vector object and pushes all of its elements onto the stack.
In this constructed file, VecUnpack appears twice, with vector element counts (num) of 3315214543476364830 and 18394158839224997406 respectively.
When the function instruction_effect is called on these instructions, it is the second line of the code below that actually executes:
The first call to instruction_effect returns (1, 3315214543476364830). At this time, stack_size_increment is 0, num_pops is 1, and num_pushes is 3315214543476364830. The second call returns (1, 18394158839224997406). When stack_size_increment += num_pushes; executes again, stack_size_increment is already 0x2e020210021e161d (3315214543476364829) and num_pushes is 0xff452e02021e161e (18394158839224997406). Their sum, 0x12d473012043c2c3b, needs 65 bits and therefore exceeds the maximum value of u64. The addition overflows, and because Rust panics on integer overflow when overflow checks are enabled, the Aptos node crashes and stops running. Thanks to the safety features of the Rust language, the overflow does not lead to further security impacts of the kind it could have in C/C++.
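The arithmetic can be checked directly in Rust. The two constants are the values quoted above; `checked_stack_update` is a hypothetical helper that shows one possible way to turn the overflow into a recoverable result, and it is not necessarily how the official patch works.

```rust
// Values quoted in the article.
const STACK_SIZE_INCREMENT: u64 = 0x2e02_0210_021e_161d; // 3315214543476364829
const NUM_PUSHES: u64 = 0xff45_2e02_021e_161e; // 18394158839224997406

// Hypothetical guard: returns None instead of panicking when the stack height
// would underflow or overflow.
fn checked_stack_update(current: u64, num_pops: u64, num_pushes: u64) -> Option<u64> {
    current.checked_sub(num_pops)?.checked_add(num_pushes)
}

fn main() {
    // The exact sum needs 65 bits, so it only fits in a wider type.
    let exact = STACK_SIZE_INCREMENT as u128 + NUM_PUSHES as u128;
    println!("exact sum   = {:#x}", exact);
    println!("fits in u64 = {}", exact <= u64::MAX as u128);

    // checked_add reports the overflow as None.
    println!("checked_add = {:?}", STACK_SIZE_INCREMENT.checked_add(NUM_PUSHES));

    // wrapping_add shows the value that silent truncation to 64 bits would leave behind.
    println!("wrapping    = {:#x}", STACK_SIZE_INCREMENT.wrapping_add(NUM_PUSHES));

    // The second VecUnpack, expressed through the guarded helper.
    println!("guarded     = {:?}", checked_stack_update(3315214543476364830, 1, NUM_PUSHES));
}
```

A plain `+` on these operands panics in a build with overflow checks enabled, which matches the crash described above, whereas the checked variant gives the verifier a chance to reject the bytecode with an ordinary error.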
4. Vulnerability Impact
Since this vulnerability occurs in the Move execution module, any on-chain node that processes the malicious bytecode crashes, resulting in denial of service (DoS). In severe cases, the entire Aptos network could be brought to a halt, which would cause incalculable damage and seriously affect the stability of the nodes.
5. Official Fix
When we discovered this vulnerability, we reported it to the Aptos team, who quickly fixed it. The figure below shows a screenshot of the fix.
Relevant code link is below: