https://medium.com/numen-cyber-labs/the-story-of-a-high-vulnerability-in-move-reference-safety-verify-module-2340f3d8c642
0x0 Preface
Some time ago, we found a critical vulnerability in Aptos Movevm. After a bout of deep research, we found yet another critical vulnerability which is also an integer overflow but this time, it is extremely interesting.
We know that the Move language verifies the code unit before executing the bytecode. In the code unit verify it’s split into 4 steps. This bug occurs at reference_safety. We will be going more into detail about it below.
This module defines the transfer functions for verifying the reference safety of a procedure body. The checks include (but are not limited to) verifying that there are no dangling references, accesses to mutable references are safe, and accesses to global storage references are safe.
Here is the verify entry-point, it will call analyze_function.
In analyze_function, the function will be verified with each basic block, so what’s the basic block?
In the compiler construction, a basic block is a straight-line code sequence with no branches in except for the entry and no branches out except at the exit.
In Move language, how do we identify a basic block?
In the Move language, the basic block is determined by traversing the bytecode, finding all branch instructions, and loop instructions. You can see the core code below:
Here is an example of a code block of the Move language. We can see there are 3 basic blocks. The branches are determined by instruction: BrTrue, Branch, Ret.
0x1 Reference Safety in Move
Move supports two types of references: immutable — defined with & (e.g. &T) and mutable — &mut (e.g. &mut T). You may use immutable (&) references to read data from structs and use mutable (&mut) to modify them. By using the proper type of references, you help maintain security and you should know whether this method changes the value or only reads.
Below is an example from the official Move tutorial:
In the example above, we can see the mut_ref_t is the mutable reference of the value t.
So, the Move reference safety module tries to confirm whether the reference is valid or not which by take the function as the unit, scanning the basic blocks in the function, and judging whether all reference operations are legal through bytecode instructions verification.
The figure below shows the routine in which it verifies the safety of the reference.
The state here is AbstractState which contains the borrow graph and locals of which both of them are used to secure references.
borrow_graph is a graph representing the relationship between locals references.
We can see from above figure, there is a pre state which includes locals and borrow graph (L ,BG) and then executing the basic block will generate a post state with the (L’,BG’), and then will merge the state before and after to update the blocks state and propagate postcondition of this block to successor blocks. This is like the Sea of Nodes optimization in V8 turbofan.
The below code is the main loop corresponding to the figure above. First, execute the block code (It will return an AnalysisError if executing the instruction is unsuccessful) and then try to merge the pre state and post state by determining whether the join_result is changed or not. If it is changed and the current block contains an edge point (which means there is a loop ), it will jump back to the beginning of the loop, in the next round will still execute this block until the post state is equal to pre state or abort by some error.
How do we judge whether the joinResult is changed or unchanged?
Through the above code, we can know whether the join result has changed or not by judging whether the relationship between locals and borrow relationship has changed or not. The join_ function here is used for updating the locals and borrow graph state.
With the join_ code below, line 6 is to initialize a new locals Map, line 9 is used to iterate all the index in the locals if the all value is None, before and after executing the block so you don’t insert to the new locals maps. If before the state has value, post state is None, we need to release the brow_graph id, which means to eliminate the borrow relationship of the value. The opposite is also the same. In particular, when both values exist and are the same, insert them into the new map like the lines 30–33 and then merge the borrow_graph at line 38.
From above we can see self.iter_locals() is the nums of the locals. And note this local include not only the function’s real locals but also parameters.
0x2 The Vulnerability
Here we have crossed all the codes related to the vulnerability, have you found it?
If you can’t find the vulnerability, it matters not. I will be going through the vulnerability-triggering process in detail below.
First in the blew code if the parameters length add locals length bigger than 256. There seems to be no problem right?
But this function will return Iterator with the item type u8.
So in the function join_() is function_view.parameters().len() and function_view.locals().len() combined value is bigger than 256.
In the code,for local in self.iter_locals(), the local variable is of type u8. After 256 iterations, it will cause an overflow. After the overflow, the value of local is 8.
Actually, Move has the routine to verify the local's number, but unfortunately only verifies the locals in the check bounds module, not including the length of the parameters.
Seems the developers know here of the need to check the parameters + locals value. However, the code verifies the locals count in the check bounds module ,it does not include the parameter’s length.
0x3 Move Overflow to DoS
We know there is a main loop to scan the code block and after calling the execute_block function and joining the state. If the move code exists a loop will jump to the block beginning to execute again.
So if we make a loop code block and exploit the overflow to change the state of the block, it makes the new locals map in the AbstractState object Differently from before, and then executes the block again with the function execute_block function, we know this function analyzes the code bytecode and gives access the locals, so if the reference value offset does not exist in the new AbstractState local's map, it will lead to a DoS.
After auditing the code I found that by using the MoveLoc/CopyLoc/FreeRef opcode, we can achieve this goal.
Here let us see the copy_loc function which is the code called by execute_block function in the file path:
move/language/move-bytecode-verifier/src/reference_safety/abstract_state.rs
In line 287, the code tries to get the local value by the LocalIndex as parameter and if the LocalIndex does not exist, it will lead to panic, Imagation, when the node executes this evil code, will cause the whole node to crash.
0x4 PoC
Here is the PoC, of which you can reproduce in the git commit:add615b64390ea36e377e2a575f8cb91c9466844
This is the crash log:
thread ‘regression_tests::reference_analysis::PoC’ panicked at ‘called `Option::unwrap()` on a `None` value’, language/move-bytecode-verifier/src/reference_safety/abstract_state.rs:287:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
The DoS trigger Steps:
we can see the code block is a no-condition branch, and every time it executes the last instruction branch(0), it will jump back to the first instruction so it will call the execute_block and join function multiple times.
1. First time In here we set the parameters to SignatureIndex(0), locals to SignatureIndex(0) will lead the num_locals 132*2=264. So after calling the
Leading new locals length to be 264–256=8
2. The Second time when executing the execute_block function and executing the first instruction copy_local(57), the 57 is the offset of the locals that need to push to the stack but this time locals only with the length 8, offset 57 does not exist, so this will cause the get(57).unwrap() function to return nothing and panic.
0x5 Summary
This is the whole story about this vulnerability. From it, we learn that:
First, this vulnerability shows that there is no code that is absolutely safe. The Move language does perform a good static verification before the code is executed, but just like this vulnerability, the previous boundary check can be completely bypassed through the overflow vulnerability.
Secondly,code audit is very important, as programmers can be negligent sometimes. As the leaders of Move language security, we will continue to dig deep into the security issues of Move.
The third point is that for Move language, we recommend that language designers add more check codes to prevent unexpected situations happen.
Currently, the Move language mainly performs a series of security checks in the verification stage, but I think this will not suffice. Once the verification is bypassed, there is not too much security reinforcement in the runtime stage, which will lead to further hazards being deepened, causing more serious problems. Finally, we have discovered yet another vulnerability in the Move language, and we will disclose it to you soon.